COMPUTER RESEARCH & TECHNOLOGY
 

ETopics The Wayback Machine: A Web Archives Search Engine

Remember what the ABC looked like in 1996? Or Google, when it graduated from Stanford and went live under its own URL (Web address) in 1999? What about long-lost Infoseek, which has vanished entirely from the search engine scene?

If you can't recall the specifics, a new service from the Internet Archive is available to help. The Wayback Machine is a search engine that holds over 100 terabytes of data and 10 billion web pages archived from 1996 to the present. It's an absolutely phenomenal gift to the web community.

What is the Internet Archive Wayback Machine?

The Internet Archive Wayback Machine is a service that allows people to visit archived versions of stored websites. Visitors can type in a URL, select a date, and then begin surfing an archived version of the web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older copy of your favourite website. The Internet Archive Wayback Machine makes all of this possible.

Just How Big Is 100 Terabytes?

In comparison to some familiar everyday data banks:

A megabyte is... about the size of a mystery novel, or the size of a floppy disk, or about a million bytes. 

A gigabyte is... a thousand megabytes, or about a billion bytes. One copy of the Encyclopaedia Britannica (2,619 pages per copy) is one gigabyte. The ancient Library of Alexandria (400,000 scrolls) was about 8 gigabytes.

A terabyte is... a million megabytes, or about a trillion bytes. A thousand copies of the Encyclopaedia Britannica are one terabyte. A public library (of about 300,000 books) is 3 terabytes. A radio station (10,000 LPs and CDs, or 15,000 hours of music) is about 8 terabytes.
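To put the 100 terabyte figure in the same terms, the short Python sketch below simply re-does the arithmetic from the comparisons above. The constant and function names are made up for this illustration, and the results are only as rough as the quoted figures themselves.

```python
# Rough equivalences quoted in the article (decimal units, all approximate).
MB_PER_GB = 1_000          # a gigabyte is a thousand megabytes
MB_PER_TB = 1_000_000      # a terabyte is a million megabytes

BRITANNICA_GB = 1          # one copy of the Encyclopaedia Britannica
PUBLIC_LIBRARY_TB = 3      # a public library of about 300,000 books
ARCHIVE_TB = 100           # size quoted for the Wayback Machine archive


def terabytes_to_gigabytes(terabytes):
    """Convert terabytes to gigabytes using the article's decimal units."""
    return terabytes * MB_PER_TB // MB_PER_GB


archive_gb = terabytes_to_gigabytes(ARCHIVE_TB)
britannica_copies = archive_gb // BRITANNICA_GB
public_libraries = ARCHIVE_TB / PUBLIC_LIBRARY_TB

print(f"{ARCHIVE_TB} TB = {archive_gb:,} GB")
print(f"about {britannica_copies:,} copies of the Encyclopaedia Britannica")
print(f"about {public_libraries:.0f} public libraries")
```

On these figures the archive works out to roughly a hundred thousand copies of the Encyclopaedia Britannica, or a little over thirty public libraries.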

For a more comprehensive comparison table covering the Internet Archive's collections, which contain material dating from 1996 to the present, see http://www.archive.org/xterabytes.html

How Can We Find Specific Historical Information?

The Wayback Machine also has an advanced search form designed specifically to help the adventurous web archaeologist. You can limit your search to a particular date range, or even a particular date. Advanced search also offers some subtle options that reveal a very thoughtful approach to the design of the search interface.

For example, you can match a Web address exactly to see a specific page, or you can request every page associated with a Web address to see all archived pages from a site.

Advanced search also provides controls for displaying redirected pages, file types, and duplicates. A helpful list of hints and tips for advanced search shows how to refine your query for a number of common types of searches.

The Internet Archive provides a number of "special collections" that are absolutely fascinating glimpses of historical web sites. These include snapshots from September 11th, the year 2000 election, U.S. government sites, and one that'll really get your nostalgic juices flowing, Web Pioneers.

Is this Service Expected to be Popular with the Public?

The Wayback Machine was unveiled just last week, but has already been overwhelmed by users. The service is "intermittent" for the time being, meaning you won't always see a complete list of results for a particular Web address. The Internet Archive is working to add servers, but expects the process to take "weeks." In the meantime, there's still plenty available for viewing from the 100-terabyte archive of the web.

Can I link from my own site to old pages on the Internet Archive Wayback Machine?

Yes, the system has been constructed in such a way that it can be used and referenced by anybody. If there is an archived page that you would like to reference on your own web page, you can copy the Web address and create a link to it.
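For anyone who wants to generate such links rather than copy them by hand, here is a minimal Python sketch. It assumes that an archived page's address is made up of the web.archive.org prefix, a 14-digit date and time stamp, and the original Web address; the article does not spell this layout out, so treat it as an assumption and check it against a link copied from the service. The function names are hypothetical.

```python
from datetime import datetime
from html import escape

# Assumed address layout (not stated in the article):
#   http://web.archive.org/web/<YYYYMMDDhhmmss>/<original address>
WAYBACK_PREFIX = "http://web.archive.org/web"


def archived_link(page_url, when):
    """Build the assumed address for the copy of page_url dated `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


def html_anchor(page_url, when, label):
    """Wrap the archived address in an HTML link for use on your own page."""
    href = escape(archived_link(page_url, when), quote=True)
    return f'<a href="{href}">{escape(label)}</a>'


link = archived_link("http://www.google.com/", datetime(1999, 10, 1))
print(link)
print(html_anchor("http://www.google.com/", datetime(1999, 10, 1), "Google in 1999"))
```

The date used here is only an example; no claim is made that a copy exists for that exact day.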

Are all sites available in the Internet Archive Wayback Machine?

The Internet Archive is attempting to archive the entire publicly available web. Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected or otherwise inaccessible to the Archive's automated systems.

Who was involved in creating the Internet Archive Wayback Machine?

The idea for the Internet Archive Wayback Machine dates from 1996, when the Internet Archive first began archiving the web. Now, five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive has made the Wayback Machine available to the public. The Internet Archive has relied on donations of web crawls, technology and expertise from Alexa Internet and others. The Internet Archive Wayback Machine is owned and operated by the Internet Archive.

Some sites are not available because of exclusions. What does that mean?

This means web site owners can instruct automated systems not to crawl their sites, and there is a universal standard for robot exclusion that lets them do so. If a web site owner decides they would prefer not to have a web crawler visiting their site, the crawlers that feed the Archive will stop visiting those files and all files previously gathered will be marked as unavailable. Sometimes a web site owner will also ask directly that a site not be crawled or archived.

My old site is not listed, how can I get it included?

The web has been crawled for the Archive since 1996, which has resulted in a massive collection. If you have a web site and would like to ensure that it is saved for posterity in the Archive, chances are that it's already there. If it is not, you can visit the "Archive Your Site" page and get it crawled.

I don't want my site's pages in the archive. How do I remove them?

By installing a robots.txt file on your web server, you can exclude your site's pages from being archived, as well as block access to any copies already in the archive. For information, see the webmasters page.
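As a rough illustration of how robot exclusion works, the Python sketch below uses the standard urllib.robotparser module to read a two-line robots.txt file and check what a given crawler may fetch. The crawler name ia_archiver and the example.com address are assumptions made for this example; the article does not name the crawler, so confirm the name to use on the webmasters page.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt that turns one named crawler away from the whole site.
# "ia_archiver" is an assumed crawler name, used here for illustration only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.com/index.html"
archive_crawler_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", page)
other_crawler_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)

print("ia_archiver allowed: ", archive_crawler_allowed)
print("SomeOtherBot allowed:", other_crawler_allowed)
```

With this file in place, only the named crawler is turned away; crawlers not listed in the file are still allowed in.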

How can I help to get extra material on the Archive?

The Internet Archive actively seeks donations of digital materials for preservation. Alexa Internet provides access to a web-wide crawl that contains copies of the publicly accessible web. If you have digital materials that may be of interest to future generations, you can submit them to the Internet Archive.

The Internet Archive Wayback Machine

http://www.archive.org/index.php

Wayback Machine Advanced Search

http://web.archive.org/collections/web/advanced.html


Arthur Hissey
Computer Research & Technology
www.crt.net.au



ETOPICS
what are they?

Keep up to date with the latest in the IT/Communications industry by listening to ABC Local Radio on FM107.1, every Tuesday morning at 9.15AM.

Computer Research & Technology Managing Director Arthur Hissey and Morning Host Janice McGilchrist will be discussing current matters of interest and future directions in the IT industry.

Transcripts of these discussions and other topics are available online.

