Tutorial – Guide to Effective Searching of the Internet
Despite their intimidating name, Boolean search techniques are quite simple to learn and can make your searching far more effective. While working through this part, most of you will recognize constructs that you were taught in high school math.
"Boolean" searching draws its name from George Boole, a mathematician and logician from the 19th century. He developed Boolean algebra, which is the basis for this form of structured search technique. Boolean algebra is also of prime importance to the design of modern computers.
Most information on the Web is highly unstructured. Boolean search techniques were first applied by information professionals to traditional search services like Dialog or Lexis-Nexis. Boolean techniques, while not supported by all Internet search services, provide a way for you to bring structure to this unstructured environment.
Without Boolean techniques, you are stuck doing a lot of free-text searching, that is, looking for documents that contain words you think will appear in the document you are seeking. Sheer document volume makes free-text searching difficult and prone to failure. Boolean techniques give you the power to narrow your search to a reasonable number of potentially useful documents, thereby increasing your likelihood of success.
Topic 13: Boolean Overview
Topic 14: AND Operator
Topic 15: OR Operator
Topic 13: Boolean Overview
Boolean logic is used to construct search statements using logical operators and specified syntax. These are combined into Boolean expressions, which always evaluate to either true or false.
The shopping list of operators and syntax available to Boolean searching (though not supported by all Boolean search services) is:
- AND — terms on both sides of this operator must be present somewhere in the document in order to be scored as a result
- OR — terms on EITHER side of this operator are sufficient to be scored as a result
- AND NOT — documents containing the term AFTER this operator are rejected from the results set
- NEAR — similar to AND, only both terms have to be within a specified word distance from one another in order to be scored as a result
- BEFORE — similar to NEAR, only the first (left-hand) term before this operator has to occur within a specified word distance BEFORE the term on the right side of this operator in order for the source document to be scored as a result
- AFTER — similar to NEAR, only the first (left-hand) term before this operator has to occur within a specified word distance AFTER the term on the right side of this operator in order for the source document to be scored as a result
- Phrases — combined words or terms that must appear directly ADJACENT to one another and in the phrase order for the source document to be scored as a result
- Wildcards (stemming) — beginning characters that must match the same beginning characters in a document's words in order for it to be scored
- Parentheses — nested operators that are evaluated in an inside-out order of precedence.
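The proximity operators in the list above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names (tokenize, positions, near, before, after), the word-splitting rule and the sample sentence are all invented here, and real search services may count word distance differently.

```python
import re

def tokenize(text):
    """Split text into lowercase words (assumed rule: letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(term, words):
    """Return every index at which `term` occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]

def near(left, right, text, distance):
    """True if the two terms occur within `distance` words, in either order."""
    words = tokenize(text)
    return any(abs(i - j) <= distance
               for i in positions(left, words)
               for j in positions(right, words))

def before(left, right, text, distance):
    """True if `left` occurs at most `distance` words ahead of `right`."""
    words = tokenize(text)
    return any(0 < j - i <= distance
               for i in positions(left, words)
               for j in positions(right, words))

def after(left, right, text, distance):
    """True if `left` occurs at most `distance` words behind `right`."""
    return before(right, left, text, distance)

# Hypothetical sample sentence, used only for this sketch.
doc = "The peregrine falcon was once listed as an endangered species."

print(near("falcon", "endangered", doc, 10))    # the terms are 6 words apart
print(near("falcon", "endangered", doc, 3))     # too far apart for this distance
print(before("falcon", "endangered", doc, 10))  # falcon comes first
print(after("falcon", "endangered", doc, 10))   # falcon does not come second
```

Note that NEAR ignores word order, while BEFORE and AFTER are the same test with the two terms swapped.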
Example uses of these operators are based on the sample tutorial problem of finding information on the peregrine falcon discussed in Topics 5 through 12.
The underlying premise of Boolean logic is set theory. The AND operator is equivalent to the set INTERSECTION operation; the OR operator is equivalent to the UNION set operation. To help explain these concepts, specific topics below use so-called Venn diagrams. Don't worry about the fancy name. The diagrams are color-coded to indicate the result of an operation. The universe of possible results is shown in yellow on these diagrams; the accepted results in blue.
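If you prefer code to diagrams, the same set operations can be tried directly with Python's built-in sets. The four tiny "documents" and the matching helper below are made up for illustration, and simple substring matching is an assumption; no real search engine is being modeled.

```python
# A made-up universe of four documents, keyed by document id.
docs = {
    1: "peregrine falcon nesting sites",
    2: "endangered species list",
    3: "the peregrine falcon as an endangered species",
    4: "ford falcon repair manual",
}

def matching(phrase):
    """Return the set of document ids whose text contains `phrase`."""
    return {doc_id for doc_id, text in docs.items() if phrase in text}

falcon = matching("peregrine falcon")
endangered = matching("endangered species")

and_result = falcon & endangered      # AND: set intersection
or_result = falcon | endangered       # OR: set union
and_not_result = falcon - endangered  # AND NOT: set difference

print(sorted(and_result))      # [3]
print(sorted(or_result))       # [1, 2, 3]
print(sorted(and_not_result))  # [1]
```

Document 4 is in the universe but in none of the result sets, which corresponds to the area outside both circles in a Venn diagram.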
One way to decide when to use the AND or OR operators is to test whether your keywords are different concepts, or just different ways (synonyms) to say the same thing. For different concepts, use AND; for synonyms, use OR.
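This rule of thumb can be turned into a small query builder. The sketch below is only one way to do it: the build_query function and the synonym lists are hypothetical, and the exact syntax a given search service accepts may differ.

```python
def build_query(*concepts):
    """Join synonyms with OR inside parentheses, then join concepts with AND.

    Each argument is a list of synonyms for one concept.
    Multi-word terms are wrapped in double quotes so they are treated as phrases.
    """
    groups = []
    for synonyms in concepts:
        quoted = [f'"{s}"' if " " in s else s for s in synonyms]
        group = " OR ".join(quoted)
        # Parentheses are only needed when a concept has more than one synonym.
        groups.append(f"({group})" if len(quoted) > 1 else group)
    return " AND ".join(groups)

# Illustrative synonym lists, chosen for this example only.
query = build_query(
    ["peregrine falcon", "duck hawk"],
    ["endangered", "threatened"],
)
print(query)
# ("peregrine falcon" OR "duck hawk") AND (endangered OR threatened)
```

Keeping each set of synonyms inside its own parentheses avoids the ambiguity that arises when AND and OR are mixed freely in one statement.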
Boolean search syntax needs to follow a precise structure. Queries constructed using Boolean syntax do not look like real sentences. The AND and OR Boolean operators, in particular, sometimes seem to mean the opposite to what they do in natural language. Searching based on simple sentences and phrases is a different construct known as natural text searching.
Topic 14: AND Operator
AND means "I want only documents that contain both words." AND logic focuses, coordinates and narrows a search, retrieving only those records that contain at least one term or phrase from each concept. AND is a binary operator; that is, it operates on the terms or phrases on both sides of it. It is the same concept as INTERSECTION in set theory.
Using AltaVista document counts, the results of the query "endangered species" AND "peregrine falcon*" are:
Note the AND operator says nothing about where the terms or phrases are located in the document with respect to one another, nor whether their linkage makes sense or not. This operator only requires that the terms or phrases immediately on both sides of the AND must both appear in the document.
The AND operator can be used to chain a number of required terms or phrases together, all of which must be present in order for the outcome to be a successful result. For example, the query London AND "Big Ben" AND "Buckingham Palace" AND Trafalgar would only return documents that contained all four terms or phrases.
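A chained AND behaves like Python's built-in all(): every term must be found. The sketch below assumes case-insensitive substring matching, and the function name matches_all and the two sample pages are invented for illustration.

```python
def matches_all(text, terms):
    """True only if every term appears somewhere in the text (case-insensitive)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

terms = ["London", "Big Ben", "Buckingham Palace", "Trafalgar"]

# Two hypothetical pages.
page_a = "A London walk: Big Ben, Buckingham Palace and Trafalgar Square."
page_b = "Visiting London? Start at Big Ben."

print(matches_all(page_a, terms))  # all four terms are present
print(matches_all(page_b, terms))  # two of the four terms are missing
```

Each additional AND term can only keep the results set the same size or shrink it, which is why chaining narrows a search.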
AND should be your most frequently used Boolean operator.
The AND operator is also a very useful qualifier. For example, AltaVista counts for falcon* total 340,707. Some of these references are to cars, others to various companies, to falconry, or to sundry products that use the name falcon. To zero in on the falcon bird, a search phrase of falcon AND bird* removes these extraneous references. The AltaVista document count now becomes 36,939.
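To see how a qualifier combined with a wildcard filters results, here is a minimal sketch using Python's re module. It assumes that a trailing * means "any word beginning with these characters"; the helper names and the three sample pages are made up, and AltaVista's actual matching rules are not being reproduced.

```python
import re

def wildcard_to_regex(term):
    """Turn a term such as 'falcon*' into a whole-word, case-insensitive pattern."""
    if term.endswith("*"):
        body = re.escape(term[:-1]) + r"\w*"  # prefix followed by any word characters
    else:
        body = re.escape(term)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def matches_and(text, *terms):
    """True if every term (with optional trailing wildcard) is found in the text."""
    return all(wildcard_to_regex(t).search(text) for t in terms)

# Hypothetical pages that all match falcon*.
pages = [
    "Falcons are birds of prey.",
    "The Ford Falcon was a popular car.",
    "Falconry is the sport of hunting with a trained bird.",
]

hits = [p for p in pages if matches_and(p, "falcon*", "bird*")]
for p in hits:
    print(p)
```

The car page drops out, but the falconry page survives because falcon* also matches "Falconry". A qualifier narrows the results without guaranteeing that every remaining document is relevant.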
False "results" can be common using the AND operator. For example, let's apply Jan's query of "endangered species" AND "peregrine falcon*" to a large document discussing unusual birds. In one section it could discuss the 200 mph diving speed of peregrine falcons; in another the extinction of the dodo bird. A positive result would be scored for this document, even though there is no discussion about the endangered status of peregrine falcons. One of the reasons these false positives occur on the Internet is the occurrence of large Web documents that simply list links or references to other documents and contain HUGE numbers of terms. They often produce false results.
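The false positive described above is easy to reproduce. In this sketch the two-paragraph document is invented, and checking paragraph by paragraph is just one assumed way to ask whether the terms are actually discussed together; it is not a feature the tutorial attributes to any search service.

```python
def contains_all(text, phrases):
    """True if every phrase appears somewhere in the text (case-insensitive)."""
    lowered = text.lower()
    return all(p in lowered for p in phrases)

def paragraphs_with_all(document, phrases):
    """Return the paragraphs (split on blank lines) that contain every phrase."""
    return [p for p in document.split("\n\n") if contains_all(p, phrases)]

# A made-up document with two unrelated sections.
document = (
    "Peregrine falcons can reach 200 mph in a dive.\n\n"
    "The dodo is a famous extinct bird, and many endangered species "
    "face the same risk."
)
phrases = ["endangered species", "peregrine falcon"]

print(contains_all(document, phrases))         # the whole document satisfies AND
print(paragraphs_with_all(document, phrases))  # but no single paragraph does
```

The document as a whole satisfies the AND, yet no paragraph mentions both phrases, which is exactly the situation in which the result is misleading.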
Topic 15: OR Operator
OR means "I want documents that contain either word; I don't care which word." OR broadens a search and makes it less focused. It is equivalent to the UNION operator in set theory. Again, using our peregrine falcon example, the results set for this operator looks like:
The document counts from AltaVista using this OR operator are:
These results illustrate some interesting facts. First, the OR operator is NOT equivalent to a sum. Documents which contain both phrases still get counted as a single document. Second, we would expect at minimum the OR operator to result in a total number of documents no smaller than the count of documents for the largest term or phrase in the operation (i.e., "endangered species" with 143,786 counts). Yet our result set is smaller than this. Why?
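Both expectations follow from set arithmetic and can be checked with a toy example. The document ids below are invented; only the relationships between the counts matter.

```python
# Made-up sets of document ids matching each phrase.
endangered = {1, 2, 3, 4}
falcon = {3, 4, 5}

union = endangered | falcon    # what OR should return
overlap = endangered & falcon  # documents containing both phrases

print(len(endangered) + len(falcon))  # 7: the naive sum double-counts the overlap
print(len(union))                     # 5: each document is counted once

# OR count = count(A) + count(B) - count(A AND B)
assert len(union) == len(endangered) + len(falcon) - len(overlap)

# The OR count can never be smaller than the larger of the two individual counts.
assert len(union) >= max(len(endangered), len(falcon))
```

A reported OR count that falls below the larger individual count therefore cannot come from exact set arithmetic.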
Strictly speaking, the results shown should not happen. They occur because of internal decisions that search engines make when evaluating queries in order to keep search performance snappy. See Topic 41 for some perplexing behavior of the AltaVista search engine.
Use OR to string together synonyms; be careful about mixing it in with AND!
The OR operator can be used to chain a number of terms or phrases together, at least one of which must be present in order for the outcome to be a successful result. For example, the query London OR "Big Ben" OR "Buckingham Palace" OR Trafalgar would return all documents that contained one or more of these four terms or phrases. As with the AND operator, there is no assurance that any of these terms or phrases are logically or conceptually linked in any of the results documents.
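A chained OR behaves like Python's built-in any(): a single matching term is enough. As before, this is a sketch that assumes case-insensitive substring matching, with an invented function name and invented sample pages.

```python
def matches_any(text, terms):
    """True if at least one term appears somewhere in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)

terms = ["London", "Big Ben", "Buckingham Palace", "Trafalgar"]

# Two hypothetical pages.
page_c = "Admiral Nelson won the Battle of Trafalgar in 1805."
page_d = "A weekend guide to Paris and the Eiffel Tower."

print(matches_any(page_c, terms))  # one term is enough
print(matches_any(page_d, terms))  # none of the terms appear
```

The first page is accepted on the strength of one term, even though it has nothing to do with London sightseeing, which shows how quickly a chained OR can pull in unrelated documents.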
Unless used in parenthetical clauses (most useful for synonyms) or as a fishing expedition as part of preliminaries to a search, we do not recommend the use of the OR operator. Overuse of the OR operator can cause results sets to grow too large to be useful.
Nonetheless, the OR operator is one of the two main operators within Boolean syntax. It should be used in a controlled way to expand your results set, most often as part of a parenthetical argument.