COMPUTER RESEARCH & TECHNOLOGY
 

ETopics Search Engine Seminar

Though the Internet is a relative newcomer to our lives, not since the invention of the printing press have we had such a powerful communication tool. Almost overnight we have been given virtually unlimited global access to an information store never before dreamt of.

The World Wide Web has been heralded as a boon for those involved in business, administration, education and research. It provides affordable, efficient and reliable access to information sources throughout the world; furthermore, most of that information is free.

The sheer volume of information on the Web alone is staggering. The mind-bending growth of the Internet’s information resources has the ability to cripple this medium before it even reaches maturity.

In their early years, Search Engines were little more than lists of sites that someone found interesting and decided to mount on a Web site for others to access. You might liken this to making a list of interesting books you have read and then posting it in a public place so that it is available to your friends and other like-minded people.

As more people started using these lists of interesting sites, their authors began presenting the information in a more structured way. For instance, the sites of interest were listed alphabetically or in categories. The very first Search Engines would search these lists by matching, for instance, the Web address or the title of a Web site document.

The popularity of Search Engines grew rapidly as more and more people started using the Internet. The overwhelming number of Web sites and the sheer amount of information on the Internet made the Search Engine the tool of choice for most people trying to navigate their way to the desired location or information.

Later it would become obvious that no single Search Engine could be all things to all people, and a proliferation of different Search Engines began to appear. As a result, the Meta-Search Engine was born. It was a single Search Engine that went out and searched multiple other Search Engines on the user’s behalf. The results of these multiple searches were then formatted and returned as a single response.

Popularity inevitably attracts entrepreneurs, and so it was for Search Engines. Instead of providing the single-point functionality of searching for information, the designers of these systems attempted to make them an Internet Experience. They started by including games, news, horoscopes, chat sites, sports etc. These new sites were even given a new name: Portals. Portals not only competed for your attention by offering ever-increasing functions, they also bombarded your senses with advertising. As a consequence, Portals became more and more useless as a reference point from which to begin a search, and their users soon revolted.

Disgruntled users quickly began seeking alternative methods of searching. They joined special interest groups of their own and were instrumental in the emergence of a new generation of "clutter-free" Search Engines that specialised in ease of use and high-speed performance. Fortunately, the success of this new breed is now forcing traditional Search Engine designers to rethink their strategies.

The Information Challenge

Information on the Internet has not been collected, stored or distributed in a structured manner. It is also presented in many different formats. Locating the right information has as much potential to be frustrating and time-consuming as it has to be satisfying. So how do we do it? We begin by accessing the Web.

Accessing the Web.

Information on the Web is stored on Web Servers. You use a piece of software called a Browser to access it. Examples of browsers are Internet Explorer and Netscape Navigator.

Hypertext

Hypertext is the process of displaying or arranging information on a computer screen in a manner that emulates human thought processes.

In a printed book, users must read sequentially through the text or manually scan pages and paragraphs (usually with the aid of an index or table of contents) in an attempt to locate information that is of interest to them.

The human brain, however, isn’t always at its most efficient when it receives information in a sequential or linear manner.

Hypertext documents are associative rather than linear. This is achieved by incorporating links with the text that allow users to jump to another part of the document (or another document) containing related information and then return to where they left off.

On the Web, keywords are linked to other passages or documents. Linked keywords are called hyperlinks, and are generally displayed on screen in a different colour, italicized or underlined, thus allowing them to be readily identified. By moving the mouse pointer to a keyword and clicking, readers are taken to the related information.

As an example, you might be logged into an Australian library via the Internet, reading a document about the Endeavour. A reference to Captain James Cook is highlighted, so you can click on this keyword. The link might refer to a document stored on a machine at Cambridge University in the United Kingdom. Your browser, in accordance with instructions contained in the link, automatically connects to the Cambridge University computer and retrieves the related document.

The Virtual Library.

Regardless of your topic of interest, you’ll almost certainly find an information resource dedicated to it online.

The Internet offers access to magazines, books, articles, research, newspapers, newsletters and discussion papers covering the broad spectrum of human knowledge.

Formerly paper-bound government reports, court decisions, parliamentary records and national archives have been given a new lease on life, and a much larger audience, by being added to the information online.

There is now no need to wait for the next imprint of your favourite encyclopaedia to find in-depth, expert discussions of current events and discoveries. Similarly, you no longer need to rely on the daily paper for coverage of community, national and international events.

….There are now entire catalogues of information, archival material and public records online to which public access was previously inconceivable. This was not because governments and libraries wanted to lock this information away; there was simply no cost-efficient way to provide decentralized access to it.

Behind the Hype

The Internet’s wealth of information is both a blessing and a burden.

Unlike resources in a library, information on the Internet is not clearly classified or categorised. ….There is no online equivalent to the reference section, the newspaper reading area, or even the Dewey decimal system.

….Almost any piece of information you might want is online, but it could be spread across tens of thousands of computers, which act as hosts for the millions of online pages. Without a central index or cross-referencing system, how can you track down a particular piece of information?

Chasing Information

As children, we were taught basic information finding skills.

….Search Engines can be seen as cyberspace "librarians".

Users submit a query, which the Search Engine uses to locate relevant Web Sites and other resources. It then displays details of matched sites and resources for users to browse through and explore. But here the similarity ends.

While Search Engines are pivotal in helping users find their way around the Internet, they do have certain limitations.

Their massive databases of Web sites, which often contain details of hundreds of millions of Web documents, coupled with high-end computers, allow users to quickly find sites matching their queries. But speed isn’t everything. Search Engines are just as capable (and some would say just as likely) of providing false directions with blinding speed. There are two major causes of difficulty, each of which aggravates the other.

The first is that few people are trained in the "art" of locating information. Research requires meticulous planning and sorting.

The second is that online search tools often lack "intelligence". Search results are often a hotchpotch of relevant and irrelevant links.

The solution to this problem is twofold. Firstly, Internet users need to learn research skills that will enable them to narrow their field of enquiry. Secondly, they should learn how Search Engines work, especially the advanced search options, which can be useful in filtering out irrelevant information.

Search Strategies

Simple & Advanced Searches

Connect to your favourite Search Engine, type your query in the text box and click on the search icon. ….(e.g. chicken soup recipe). It’s fast, simple and requires little preparation. But it is also the type of search most likely to waste your time, since it will turn up hundreds (perhaps thousands) of irrelevant links and matches. Avoid simple searches whenever possible.

Plan Your Search Terms

Wherever possible, choose specific terms over general terms. For instance, the keyword yacht is preferable to boat, and chlorine will produce more useful results than chemical.

Spelling

Some words can be correctly spelt in more than one way. For example, British/Australian English and American English will often spell the same word differently, e.g. grey/gray, favourite/favorite, colour/color.

Synonyms

Use synonyms in your search query. Different countries will use different words to describe the same object. For instance, where Australians will refer to a car or motor vehicle, Americans are more likely to use the term automobile.

Operators

These are special words or symbols that give the Search Engine precise instructions on how to match Web sites to your query. This makes the response more accurate or relevant and reduces the likelihood of unwanted results. Unfortunately, there is no standardised or universal language when it comes to Search Engines. You will need to find out which operators your favourite Search Engine(s) support before you can use them efficiently.

Incremental Searches

Many Search Engines allow you to progressively narrow down the results of a query by refining your search terms. For example, if you started a search using the term Medicare, you would get a massive number of irrelevant responses, as the Search Engine identified not only Australian Medicare but also the hundreds of thousands of Medicare sites relating to the American health care system. Clearly this is not what you are after.

If we were to incrementally refine our search by looking for the keywords Australia + insurance + benefits, we would get significantly fewer responses, but still probably far too many to be useful. Refining our search even further to, say, Australia + Medicare + insurance + benefits will bring us far fewer but much more relevant responses.
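The narrowing effect described above can be sketched in a few lines of Python. This is only an illustration: the four "pages" and the search function are invented for the example, and the + sign is treated as a plain AND (every term must be present).

```python
# Hypothetical mini-collection: page title -> page text. All data is invented.
PAGES = {
    "US Medicare enrolment": "medicare insurance benefits for seniors in america",
    "Australian Medicare guide": "medicare australia insurance benefits and rebates",
    "Travel cover": "travel insurance benefits for visitors to australia",
    "Medicare history": "how medicare began in america",
}


def search(required_terms):
    """Return the titles of pages that contain every required term."""
    wanted = {term.lower() for term in required_terms}
    return sorted(
        title for title, text in PAGES.items() if wanted <= set(text.split())
    )


# Each refinement adds or changes terms, and the result list shrinks.
broad = search(["Medicare"])
narrower = search(["Australia", "insurance", "benefits"])
narrowest = search(["Australia", "Medicare", "insurance", "benefits"])
```

With this invented data, the first query matches three pages, the second two, and the last only the Australian Medicare page.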

Partial Searches

If you are really lost on the spelling or know only part of a keyword and you are feeling lucky, you could try a partial search. The Search Engine will look for parts of words that match your search term. For instance, entering the term medic* will return matches for medical, medicate, Medicare and so on. From this point you may be able to refine your search by discovering and using the full keyword in a later search.
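As a rough sketch of how a partial search might match words, the Python standard library module fnmatch understands the same * wildcard. The word list here is invented, and a real Search Engine would match against its own index rather than a short list.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words; a real index would be far larger.
VOCABULARY = ["medical", "medicate", "medicare", "median", "media", "remedy"]


def partial_search(pattern, words=VOCABULARY):
    """Return the words that match a wildcard pattern such as 'medic*'."""
    pattern = pattern.lower()
    return [word for word in words if fnmatchcase(word.lower(), pattern)]


matches = partial_search("medic*")
```

Here matches holds medical, medicate and medicare, while median, media and remedy are left out because they do not begin with "medic".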

Boolean Searches

Almost all Search Engines accept search queries that use Boolean operators. These operators include words like AND, OR, NOT and NEAR, as well as symbols such as quotation marks and parentheses.

AND

Instructs a Search Engine to display only those responses that contain all terms joined by the AND statement. For example, the search Jurassic AND park AND movie will ignore those sites or documents that contain only one or two of the search terms.

OR

Instructs the Search Engine to display responses that contain at least one of the search terms joined by the OR statement. For example, the search Jurassic OR park OR movie would return all responses that contain at least one of those words.

NOT

Instructs the Search Engine to ignore any responses that contain the word appearing after the NOT statement, even though they may contain the other search terms. For example, the search pet AND care NOT cats would find responses about pet care for all pets except cats.

Parentheses ( )

Parentheses are used in advanced searches to group parts of Boolean queries together. For example, to find cake recipes that use either bananas or apples, use a search query like cake AND recipes AND (banana OR apple).
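One way to picture these operators is as operations on sets of matching documents: AND is an intersection, OR is a union, NOT is a difference, and parentheses simply control the grouping. The sketch below uses six invented documents and a hypothetical helper called containing; it is not how any particular Search Engine is built.

```python
# Hypothetical documents, numbered 1 to 6. All text is invented.
DOCS = {
    1: "banana cake recipes",
    2: "apple cake recipes",
    3: "carrot cake recipes",
    4: "apple pie recipes",
    5: "pet care for dogs",
    6: "pet care for cats",
}


def containing(word):
    """Return the set of document numbers whose text contains the word."""
    return {number for number, text in DOCS.items() if word in text.split()}


# cake AND recipes AND (banana OR apple)
cake_query = containing("cake") & containing("recipes") & (
    containing("banana") | containing("apple")
)

# pet AND care NOT cats
pet_query = (containing("pet") & containing("care")) - containing("cats")
```

With this data, cake_query contains documents 1 and 2, and pet_query contains only document 5.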

Quotation Marks " "

Quotation marks tell Search Engines to find only those documents or sites that contain the search terms or phrases in the exact order in which they are entered. So to find sites that have information pertaining to The Great Barrier Reef enter the keywords in quotation marks i.e. "the great barrier reef".
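A phrase search can be sketched as a check that the words appear next to each other and in the same order. The function name below is made up for the example, and the sketch ignores punctuation, which a real Search Engine would need to handle.

```python
def contains_phrase(text, phrase):
    """Return True if the words of phrase appear consecutively in text."""
    words = text.lower().split()
    target = phrase.lower().split()
    size = len(target)
    # Slide a window of the phrase's length across the text.
    return any(
        words[start:start + size] == target
        for start in range(len(words) - size + 1)
    )


exact = contains_phrase("Diving on the Great Barrier Reef", "the great barrier reef")
scattered = contains_phrase("The reef is a great barrier to ships", "the great barrier reef")
```

The first text contains the exact phrase, so exact is True. The second contains all four words but not in that order, so scattered is False.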

Natural Language Processing

Search Engine engineers have borrowed from the Artificial Intelligence development community to bring about Search Engines that allow untrained users to obtain search results by typing in search queries in a natural English form. For example, a user may type in "What is the average temperature of Australia" and get a relevant response.

English, however, is one of the most complex languages in existence, so this approach is a lot harder than it at first seems. A single word may have many meanings depending on its context, verbal inflection and non-verbal cues, to name but a few causes of change.

This is not to say there has been no progress, however; one such system exists. To experiment with it, try www.ask.com

The Anatomy of a Search Engine

With literally millions of sites online and countless more being added each day, keeping track of what is available on the Internet is a superhuman task.

Some people see Search Engines as massive indexed lists of web sites and information resources, while others see them as huge databases containing information on almost everything imaginable. In truth, Search Engines are a little of both: a web site directory and a searchable database of sites.

To gather together information contained on the Internet, Search Engines generally employ special programs called spiders or web robots. These programs crawl all over the Internet day and night looking for new sites. When they find one, that site is downloaded so that its keywords and phrases can be indexed and categorised in the Search Engine’s database.
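The indexing step can be sketched as follows. The example skips the crawling itself and assumes two pages have already been downloaded; the addresses and text are invented, and build_index is a hypothetical name. Real Search Engines keep far richer records than this simple word-to-address table.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for address, text in pages.items():
        for word in text.lower().split():
            index[word].add(address)
    return index


# Pages a spider is assumed to have downloaded already (invented data).
downloaded = {
    "example.org/endeavour": "the endeavour was captain cook's ship",
    "example.org/cook": "captain james cook charted the east coast",
}

index = build_index(downloaded)
hits = sorted(index["captain"])
```

Answering a query is then a matter of looking words up in the index rather than visiting the Web sites again, which is why results can come back so quickly.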

Types of Search Engines.

Passive Search Engines

Passive Search Engines do not use spiders or Web robots. Instead they rely on users submitting their own favourite sites, which are then added to the Search Engine’s database.

Active Search Engines

Active Search Engines rely on spiders or web robots to collect, maintain and update their listings. Individuals may also submit their own sites for recording.

Meta-Search Engines

Meta-Search Engines are not Search Engines in their own right. Instead they are used as an interface to search multiple Search Engines at the one time, gather the results, reformat those results and return the responses to the user.
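A minimal sketch of the gather-and-reformat step might look like this. The two "engines" are stand-in functions returning invented results; a real Meta-Search Engine would send the query to actual Search Engines over the Internet and would usually re-rank the merged list as well.

```python
def engine_a(query):
    """Stand-in for one Search Engine; returns invented results."""
    return ["site1.example", "site2.example", "site3.example"]


def engine_b(query):
    """Stand-in for a second Search Engine; returns invented results."""
    return ["site2.example", "site4.example"]


def meta_search(query, engines):
    """Query every engine and merge the results, dropping duplicates."""
    merged = []
    seen = set()
    for engine in engines:
        for result in engine(query):
            if result not in seen:
                seen.add(result)
                merged.append(result)
    return merged


results = meta_search("great barrier reef", [engine_a, engine_b])
```

The user sees a single list of four sites, even though site2.example was returned by both engines.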

The last type of Search Engine is maintained by human editors. These editors thoroughly read each new site before categorising it. Search Engines of this type are usually considered the most authoritative.

Word Weighting

Search Engines also use database technologies in an attempt to automatically refine searches by using word weighting. Each word in a Search Engine’s database is weighted according to its frequency in the database. Words that appear frequently, like "computer" or "internet", will have a low weighting, while rarely used words will have a much higher weighting. In determining a response to a query, a Search Engine will look at how many of the search words a document or site contains, and how heavily those words are weighted, and give the document a higher or lower priority accordingly.
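The idea can be sketched with a simple formula. The one used here, the logarithm of the total number of documents divided by the number of documents containing the word, is an assumption chosen for illustration; the article does not say which formula any Search Engine actually uses, and the three documents are invented.

```python
import math

# Hypothetical database of three documents. All text is invented.
DOCS = {
    "a": "computer internet search",
    "b": "computer internet yacht",
    "c": "computer chlorine",
}


def word_weights(docs):
    """Give common words a low weight and rare words a high weight."""
    total = len(docs)
    counts = {}
    for text in docs.values():
        for word in set(text.split()):
            counts[word] = counts.get(word, 0) + 1
    # Assumed formula: log(total documents / documents containing the word).
    return {word: math.log(total / count) for word, count in counts.items()}


def score(query, text, weights):
    """Add up the weights of the query words that the document contains."""
    present = set(text.split())
    return sum(weights.get(word, 0.0) for word in query.split() if word in present)


weights = word_weights(DOCS)
ranked = sorted(
    DOCS, key=lambda name: score("computer yacht", DOCS[name], weights), reverse=True
)
```

Because "computer" appears in every document it carries no weight at all, so document "b", the only one containing the rare word "yacht", is ranked first.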

Language Translation

While not strictly a search mechanism, some Search Engines will translate text into a language that you specify. At the time of writing, the Alta Vista Search Engine could translate directly from English to French, German, Italian or Spanish. In the other direction, it could translate Italian, German and Portuguese to English. Many other language lookups for single words or text phrases are also available. To try this facility visit http://world.altavista.com/

Family Filter

Most good Search Engines will also offer filtering services to remove unsavoury content from the search results. The content that most people find unsuitable and ask to have blocked includes pornography, violence, hate speech and information about drugs. Because these facilities are most often used in conjunction with children’s access to the Internet, they will often also provide a password function so that the filter cannot be disabled without authorisation.

Intelligence ??

Some Search Engines claim to use intelligent systems to give better search results. The Search Engine Excite claims that its ICE (Intelligent Concept Extraction) system not only scans its databases for exact matches of the search keywords but also attempts to find linkages "conceptually". As an example, if the search keyword confectionery were entered, this Search Engine would also find matches for chocolate (even if the word confectionery was not listed on the sites searched).

Meaning Based Systems, on the other hand, use technology that attempts to understand the specific meaning of the search terms and the context in which they are used. For example, searching for Darwin could return Charles Darwin, Darwin in the NT, other cities called Darwin, a piece of software called Darwin etc. When search results are returned, a meaning can be attached to each of the search terms so that further searches are more refined.

Tutorials

Search Engines will often provide online tutorials for their users so that they get the best from their searches. To try one visit:
http://help.altavista.com/help/search/help_adv

FAQ Systems

FAQ stands for Frequently Asked Questions. FAQ files offer an excellent source of information. Each is a document that contains a list of frequently asked questions on a subject. There are hundreds (if not thousands) of FAQ files available on the Internet covering a very broad selection of subjects. Even if you don’t find the immediate answer to your query, there is a very high probability that a close link will be displayed for further investigation.

To try one visit: www.faqs.org

Virtual Libraries

Virtual libraries, as their name suggests, are Web sites designed using similar principles to those of traditional libraries. As you might expect, they are usually extensively cross-referenced and carefully organised, and the materials are chosen on the basis of information quality and authority. To sample a couple try:

http://www.ipl.org/ - or - www.libraryspot.com

A New Generation

Today’s Search Engines, which offer information on news, sports, weather, horoscopes, auctions, classifieds, maps, telephone directories, free email and shopping services, to name but a few, are becoming far too unwieldy to be of much use.

New users are easily confused by the number of options that confront them on the home page, and experienced users are disgusted by the increasing amount of screen space dedicated to advertising.

A new breed of streamlined search services is starting to appear. Instead of matching web documents based on the number of times your search term(s) appear in them, they use PageRank technology to determine a document’s popularity. PageRank assumes that the more Web sites or Web pages that link to a particular Web document, the more likely it is that the document contains relevant, authoritative information. In essence, each link to that document is treated as a "vote" for it. A document with more votes is treated as more "important" than one with fewer votes.

The Search Engine then analyses the "importance" of the web pages or sites containing links to the document. If these are important, their votes are given added weight. If a web page that is frequently linked to by other Web pages in turn contains links to more pages, those pages are given even greater weight.
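The voting idea can be sketched with a simplified version of the calculation as it is commonly described. This is an illustration only, not the method any Search Engine actually runs: the four-page "web" is invented, and the damping value of 0.85 and the fixed number of rounds are assumptions that do not come from this article.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Estimate importance; links maps each page to the pages it links to."""
    pages = list(links)
    count = len(pages)
    rank = {page: 1.0 / count for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base share.
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for page, targets in links.items():
            if not targets:
                # A page with no outgoing links shares its vote among all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / count
            else:
                # A page splits its vote equally among the pages it links to,
                # so a vote from an important page is worth more.
                for target in targets:
                    new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical web of four pages (every linked page must also be a key).
web = {
    "home": ["news", "sport"],
    "news": ["home"],
    "sport": ["home"],
    "blog": ["home", "news"],
}

ranks = page_rank(web)
best = max(ranks, key=ranks.get)
```

In this invented web, "home" receives links from the other three pages and ends up with the highest rank, while "blog", which nothing links to, ends up with the lowest.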

Caching

Search Engines can cache copies of many of the actual Web pages. This allows them to return a search result even if the Web Server is temporarily unavailable or the Web page has been deleted or removed.

Similar Pages

Good Search Engines will also offer to find similar pages as links if you find a particularly good match to what you are looking for.

Intelligent Agents

Intelligent agents are perhaps the last word in future Search Engines, although these systems are still very much in their infancy. Intelligent agents are software systems designed to "learn" as much as possible about your search habits by understanding your style of expressing yourself and your information requirements.

An Intelligent Agent pays particular attention to the types of Web sites you like to visit, the News Groups that you read, and the sort of information you most often access. It then analyses this information and becomes pro-active in suggesting points of interest to you as it independently sources information on your behalf.

To test one of these sites try visiting:

www.rusure.com

Collaborative Filtering

If you are a book or movie enthusiast, you will generally ask your friends for their recommendations or create a list of your own to get feedback from others. Collaborative filters work in a very similar manner. Having identified your preferences, a collaborative filter compares your tastes with those of other like-minded people and makes recommendations based on their preferences. For example, Amazon.com, an online bookshop, will track your purchases and, each time you return to the site, recommend similar items.
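A very small sketch of the idea follows. The customers, purchases and scoring rule are all invented for illustration, and this is not a description of how Amazon.com or any other site actually works. Here, two people count as like-minded in proportion to the number of purchases they share.

```python
# Hypothetical purchase histories. All names and titles are invented.
PURCHASES = {
    "you": {"Dune", "Neuromancer"},
    "ann": {"Dune", "Neuromancer", "Foundation"},
    "bob": {"Dune", "Cooking Basics"},
    "cy": {"Gardening", "Cooking Basics"},
}


def recommend(user, purchases):
    """Suggest items bought by people whose purchases overlap with the user's."""
    mine = purchases[user]
    scores = {}
    for other, theirs in purchases.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # shared purchases measure similarity
        for item in theirs - mine:
            scores[item] = scores.get(item, 0) + overlap
    # Keep only items backed by at least one like-minded person, best first.
    liked = [item for item in scores if scores[item] > 0]
    return sorted(liked, key=lambda item: (-scores[item], item))


suggestions = recommend("you", PURCHASES)
```

With this data the filter suggests "Foundation" first, because ann shares two purchases with you, followed by "Cooking Basics". "Gardening" is never suggested, since cy has nothing in common with you.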

Hints

There are many excellent on-line tutorials that provide information on using Search Engines effectively. Try www.imagescape.com/helpweb/www/seek.html

Need to know the best place to start a search? Try www.connectedteacher.com/tips/searchhints.asp

Experience shows that starting with specific searches and then moving to more general search terms if unsuccessful is the best strategy.

Different Search Engines handle the default search format differently. Some default to AND searches (which match only documents containing ALL of your keywords). Others default to OR searches (which match documents containing ANY of your keywords).

Most people think Search Engines actually go out and search web sites for them when they submit a search query. They do not; in fact they search their databases of previously indexed sites. This way you get results in seconds rather than weeks.

The order in which you specify your search terms can be important to some Search Engines. Listing the most important terms first is usually the best policy.

By and large Search Engines ignore commonly occurring words like "a", "and", "the". It is not necessary to include these in your keywords when searching.

Need to know how one Search Engine compares with another? Try visiting www.searchenginewatch.com

Librarians know a lot about finding information, so it makes sense to listen to what they have to say. They have a very good document on locating information at http://www.dpi.state.wi.us/dpi/dltcl/lbstat/search2.html


Arthur Hissey
Computer Research & Technology
www.crt.net.au


ETOPICS
what are they?

Keep up to date with the latest in the IT/Communications industry by listening to ABC Local Radio on FM107.1, every Tuesday morning at 9.15AM.

Computer Research & Technology Managing Director Arthur Hissey and Morning Host Janice McGilchrist will be discussing current matters of interest and future directions in the IT industry.

Transcripts of these discussions and other topics are available, just click on the links.

