COMPUTER RESEARCH & TECHNOLOGY

Searching on the Internet today can be compared to dragging a net across the surface of the ocean. There is, however, a vast wealth of information that is "deep" and therefore missed by normal search engines. The reason is quite simple: the fundamental search methodologies and technologies have hardly evolved since the inception of the Internet.

Traditional search engines generate their sorted catalogues of information by trawling across and through the "surface" Web pages. To be discovered, a page must be static and linked to other pages. These same search engines cannot "see" or retrieve content in the "deep" Web, which is defined as content located in searchable databases that appears only dynamically, in response to a direct query. Because traditional search engine crawlers cannot probe under the surface, the "deep" Web has always been hidden from view.

The surface Web might be likened to a bookstore or a library in which the search engine knows all the books on the shelves. But if you want to find a book that isn't in stock, or to mine data that's in the books, you need a more powerful tool. Big search companies like AltaVista and Google don't see a future in trawling the "deep" Web. "There is a lot of information in databases that's not that useful, like really large databases of data from radio telescopes," says Google CEO Larry Page.

What is the "deep" or Invisible Web?
The "deep" Web, sometimes called the "invisible" Web, is information that is stored primarily in searchable databases. However, the information in these databases can only be discovered and returned by querying them directly. Without a direct query, the database does not give up any of its buried information for publication on the Internet.
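
The difference between following links and asking a database a direct question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three "surface" pages, the book catalogue and the function names are all hypothetical, and a real search engine crawler is far more elaborate. The crawler below can only reach what is linked, so it indexes the search page itself but never the records held behind it.

```python
import sqlite3
from collections import deque

# Hypothetical "surface" Web: static pages that link to one another.
SURFACE_PAGES = {
    "/home": ("welcome to the bookstore", ["/about", "/search"]),
    "/about": ("we sell new and used books", ["/home"]),
    "/search": ("enter a title to search our catalogue", []),
}

def crawl(start):
    """Follow links breadth-first and build a word -> pages index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = SURFACE_PAGES[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def build_catalogue():
    """Hypothetical "deep" content: rows that exist only inside a database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE books (title TEXT, price REAL)")
    db.executemany(
        "INSERT INTO books VALUES (?, ?)",
        [("moby dick", 12.5), ("walden", 8.0)],
    )
    return db

def search_catalogue(db, title):
    """Answer a direct query by generating a temporary result listing."""
    rows = db.execute(
        "SELECT title, price FROM books WHERE title = ?", (title,)
    ).fetchall()
    return [f"{t}: ${p:.2f}" for t, p in rows]

index = crawl("/home")
db = build_catalogue()
print("walden" in index)               # the crawler never saw the catalogue rows
print(search_catalogue(db, "walden"))  # a direct query returns them
```

The crawler's index contains the word "catalogue" (from the static search page) but not "walden", which lives only in the database and appears only when someone asks for it.
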
When they are queried, "deep" Web sites produce their information as dynamic or temporary Web pages in real time, and then disappear from sight again. Even though each of these dynamic pages has a unique Web address that allows it to be retrieved in the future, it does not present a permanent, observable presence to people or software trawling the Web.

How is the "deep" Web different from the "surface" Web?
As we have seen, search engines are traditionally the primary means of finding information on the "surface" of the Web. They obtain their listings either from document authors who submit their Web pages manually (the less common route) or by "crawling" or "spidering" from one Web page to another and building an indexed list of the words. When indexing a document or page, the crawler also keeps a list of the hypertext links it comes across so that it can later crawl those sites as well. Like ripples spreading across a pond, search engine crawlers extend their indexes further and further from their starting points. Consequently, to be discovered, "surface" Web pages have to be static and linked to other pages.

But the majority of the information, also known as "content", of the invisible or "deep" Web is kept in databases. When an indexing spider comes across a database, the spider is effectively locked out, since there is no way to link the content to any of the search engines. Traditional search engines cannot "see" or retrieve content in the "deep" Web, which by definition is dynamic information served up in real time from a database in response to a direct query. Therefore the "deep" Web has always been hidden from view.

For example, in America it is no longer necessary to call the airline to see if flights are on time before heading to the airport. Instead you can visit TheTrip.com's online Flight Tracker, which lets you search a constantly updated Federal Aviation Administration database.
This database maintains real-time information about aircraft in flight in the U.S., direct from the cockpit. While the TheTrip.com and Flight Tracker pages are indexed by major search engines, those engines cannot index the information that is inside the tracker database. That information is inaccessible unless you know how to find it. Major search engines, no matter how good they are, don't have a prayer of accessing real-time information like this. Other "deep" Web databases include stock and share trading records, yellow pages, patent databases, the Merriam-Webster Dictionary and trade valuation handbooks on vehicles.

I occasionally see "deep" Web content using search engines. Why is that?
Any "deep" Web content listed on a static Web page is discoverable by crawlers and can therefore be indexed by search engines. This most often occurs when a Web page author discovers some useful "deep" Web content and puts its dynamic Web address on a static Web page.

How come I haven't heard about the "deep" Web before now?
In the early days of the Web, there were relatively few documents and sites to visit. It was easy to manage "posting" all the documents as "static" pages. Because the results were relatively permanent and constantly available, they were effortlessly crawled by conventional search engines. What is far less well known is that information is now being published on the Web by different methods, especially on the larger sites. The sheer volume of these sites requires that the information be contained and managed in a database, the results of which are "hidden from plain sight" as far as common search engines are concerned. The evolution of the Web towards a database system has been gradual and largely unnoticed. However, many Internet information professionals have noted the importance of searchable databases to Web content throughout this period.

Is the "deep" Web the same thing as the "invisible" Web?
As early as 1994 the phrase "invisible Web" was being used to refer to information content that was "invisible" to conventional search engines. We would prefer not to use the term "invisible Web" because it is misleading. What is "invisible" about searchable databases is only that they cannot be indexed or queried by conventional search engines. However, newer types of "deep" search engines are making this information visible once again. The actual problem is not the "visibility" or "invisibility" of the Web, but the information gathering technologies used by conventional search engines in collecting their content. For these reasons, many have chosen to refer to the information in searchable databases as the "deep" Web. Whilst it may be somewhat hidden, it can be made clearly available if different know-how is used to access it.

Just how big is the "deep" Web?
It is believed that public information on the "deep" Web is currently about 550 times larger than what is commonly referred to as the World Wide Web. The "deep" Web is said to contain over 7,500 terabytes of information, compared to 19 terabytes of information on the surface of the Web. To understand these sizes we must first understand the units of storage capacity used in the computer industry. A gigabyte is a measure of computer data storage capacity and is roughly a billion bytes. A terabyte is approximately a thousand billion bytes (that is, a thousand gigabytes). The "deep" Web contains nearly 550 billion individual documents compared to approximately one billion on the surface of the Web. It is further estimated that more than 200,000 "deep" Web sites currently exist. Sixty of the largest "deep" Web sites collectively contain around 750 terabytes of information. This alone exceeds the size of the surface Web by about 40 times.

How does the information quality of the "deep" Web differ from the "surface" Web?
Deep Web sites tend to be narrower, with deeper content, than normal surface sites. Total quality content of the "deep" Web is at least 2,000 times greater than that of the surface Web. Deep Web content is almost always highly relevant to the information being sought, its market and its domain. Around half of the "deep" Web information content resides in databases devoted to specific topics. Surprisingly, a full 95% of the "deep" Web is publicly accessible information, i.e. it is not subject to fees or subscriptions.

Is the "deep" Web growing faster or slower than the "surface" Web?
Without doubt the "deep" Web is the fastest growing category of new information on the Internet today. All the indicators are that the "deep" Web will be the dominant concept for the next-generation Internet. This next generation is sometimes known as the X-Internet.

What other factors may make Internet information "deep"?
The World Wide Web (HTTP protocol) is only one subset of the Internet's information content. Other Internet protocols besides the Web include FTP (file transfer protocol), email, news, Telnet and Gopher. There is also a large store of private intranet information hidden behind firewalls. Often companies will have internal document storage that exceeds terabytes of information. On average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript). Finally, multimedia (images, music) is another growing category of Internet content. All of these sources contribute significantly to the "deep" Internet content.

Arthur Hissey

ETOPICS

Keep up to date with the latest in the IT/Communications industry by listening to ABC Local Radio on FM107.1, every Tuesday morning at 9.15AM. Computer Research & Technology Managing Director Arthur Hissey and Morning Host Janice McGilchrist will be discussing current matters of interest and future directions in the IT industry. Transcripts of these discussions and other topics are available.