A BRIGHTPLANET CASE STUDY
Lawrence Livermore National Laboratory (LLNL) is a U.S. Department of Energy national laboratory operated by the University of California. The laboratory was founded in September 1952 as a second nuclear weapons design laboratory to promote innovation in the design of the nation's nuclear stockpile through creative science and engineering. LLNL is also one of the world's premier scientific centers, where cutting-edge science and engineering research for national security is used to break new ground in other areas of national importance, including energy, biomedicine, and environmental science.
One significant application area at LLNL is end-user analysis of export license requests. The U.S. Government places export restrictions on specific items that could contribute to the proliferation of weapons of mass destruction and requires exporters of such technologies to obtain export licenses. LLNL participates in the analysis of export license requests. Specifically, LLNL uses BrightPlanet's Deep Query Manager (DQM), along with other tools, to conduct open source analysis of the end users and consignees party to export license requests.
The Problem
Recently, the laboratory has been conducting increased information acquisition and analysis as part of its support for the homeland security effort. This also requires sweeps of publicly available information sources. By far, the largest of those sources and one of the most difficult to survey is the Internet.
It is particularly difficult to find targeted information on the Internet because there is no central index. While there are hundreds of general search sources, such as All-The-Web, Google, and Yahoo, none of them indexes the entire Internet. Each can search only a fraction of the surface Web (the several billion or so Web pages most people think of as the Internet). To make matters worse, many sites are not indexed by any search source at all. Finally, the available sources index little or none of the Deep Web, which is 300-500 times larger than the entire surface Web. As a result, most information is completely missed by using just one or a few search sources.
While a search engine creates its index by following links on standard (surface) Web pages, the Deep Web is made up of hundreds of thousands of publicly accessible databases. At each Deep Web database, the user enters a query and gets back a Web page created dynamically ("on-the-spot") specific to the search. These dynamic Web pages are not linked, since they did not exist before the user's query and cease to exist after being sent to the user, so the search engines can neither see nor find them. That creates a huge information gap because the Deep Web is hundreds of times bigger than the surface Web. Its distributed and non-centralized nature makes it much harder to survey, index, and harvest. Collectively, the Deep Web represents prodigious amounts of data and is by far the largest source of information in the world.
BrightPlanet Solution
To sweep and scour the Internet, LLNL turned to BrightPlanet's Deep Query Manager™ (DQM). DQM allows LLNL to search using hundreds or thousands of search sources for each individual harvest. All the major sources are included, plus many more. Beyond those, more than 70,000 Deep Web sources are available, and that number keeps increasing as BrightPlanet finds, analyzes, and adds thousands more sources.
As for so many other researchers, eliminating undesired documents is very important to LLNL: the user can easily drown in the clutter of returned results. LLNL sometimes uses domain name filtering to eliminate documents from undesired countries or to target specific countries.
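The case study does not describe how DQM implements domain name filtering. As a minimal sketch of the general idea, assuming results arrive as plain URLs and that a country is identified by the last label of the host name, the filter below keeps or drops URLs by top-level domain. The function names `top_level_domain` and `filter_by_domain` are hypothetical and are not part of DQM.

```python
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's host name, e.g. 'de' or 'com'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def filter_by_domain(urls, include=None, exclude=None):
    """Keep URLs whose top-level domain is in `include` (when given)
    and is not in `exclude`. Both arguments are hypothetical knobs."""
    include = {d.lower() for d in include} if include else None
    exclude = {d.lower() for d in (exclude or ())}
    kept = []
    for url in urls:
        tld = top_level_domain(url)
        if include is not None and tld not in include:
            continue  # targeting specific domains: skip everything else
        if tld in exclude:
            continue  # eliminating undesired domains
        kept.append(url)
    return kept


results = [
    "http://example.de/a",
    "https://example.com/b",
    "http://example.fr/c",
]
only_de = filter_by_domain(results, include=["de"])
without_fr = filter_by_domain(results, exclude=["fr"])
```

A top-level domain is only a rough proxy for a country, since many organizations publish under generic domains such as `.com`, so a filter like this narrows results rather than guaranteeing their origin.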
The Difference Analysis feature is very important to LLNL. It allows them to immediately see what has changed from harvest to harvest. The more sites they monitor, the more valuable this feature is and the more time they save. LLNL need only review the new and changed documents rather than going repeatedly through the thousands of documents that have not changed.
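How DQM computes these differences is not stated in this study. One common way to approximate harvest-to-harvest comparison is to fingerprint each document and compare the fingerprints between two runs. The sketch below assumes each harvest is available as a mapping from URL to document text; `fingerprint` and `diff_harvests` are hypothetical names used only for illustration.

```python
import hashlib


def fingerprint(text):
    """Return a stable SHA-256 digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_harvests(previous, current):
    """Classify documents by comparing two harvests (dict: URL -> text)."""
    prev = {url: fingerprint(text) for url, text in previous.items()}
    curr = {url: fingerprint(text) for url, text in current.items()}
    return {
        "new": sorted(u for u in curr if u not in prev),
        "changed": sorted(u for u in curr if u in prev and curr[u] != prev[u]),
        "unchanged": sorted(u for u in curr if u in prev and curr[u] == prev[u]),
        "removed": sorted(u for u in prev if u not in curr),
    }


last_week = {"http://a.example/1": "old text", "http://a.example/2": "same"}
this_week = {
    "http://a.example/1": "new text",
    "http://a.example/2": "same",
    "http://a.example/3": "fresh page",
}
report = diff_harvests(last_week, this_week)
# Only report["new"] and report["changed"] need a reviewer's attention.
```

Comparing digests rather than full texts keeps the stored state small, at the cost of flagging a document as changed even when the edit is trivial.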
Laboratory Benefits
LLNL has multiple DQM users tasked with different intelligence-gathering responsibilities. They report the following features as most valuable for these tasks:
Strict Boolean Adherence and DQM's extended Boolean operator set provide LLNL with complete control over their queries. Regardless of whether the sources accessed use strict Boolean, a loose Boolean interpretation, or are not Boolean sources at all, DQM only accepts those documents that truly meet the constraints of the query. The effect is as if all sources supported true Boolean operators.
LLNL is able to share source groups. A knowledgeable researcher can build targeted and vetted source groups focused primarily on LLNL's targeted search areas and then share those source groups with all other researchers using DQM in the organization. There is no need for others to reproduce and test source groups for the same purpose. The work need only be done once and maintained by one individual, regardless of how many users benefit from it throughout the laboratory.
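The effect described above can be illustrated by re-checking every returned document against the query on the client side, so that loosely matched results are discarded. This is a sketch of that general technique, not DQM's implementation or query syntax: the nested-tuple query form and the names `tokens`, `matches`, and `strict_filter` are assumptions made for this example.

```python
import re


def tokens(text):
    """Return the set of lower-cased alphanumeric words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, words):
    """Evaluate a query against a word set.

    A query is either a term (str) or a tuple such as
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), or ("NOT", q).
    """
    if isinstance(query, str):
        return query.lower() in words
    op, *args = query
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op!r}")


def strict_filter(query, documents):
    """Keep only the documents that truly satisfy the Boolean query."""
    return [doc for doc in documents if matches(query, tokens(doc))]


query = ("AND", "centrifuge", ("OR", "export", "license"), ("NOT", "medical"))
returned = [
    "Centrifuge export rules",
    "Medical centrifuge export",
    "Export license form",
]
accepted = strict_filter(query, returned)
```

Because the check runs on the retrieved text itself, it behaves the same whether the originating source applied strict Boolean logic, a loose interpretation, or none at all.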
The incremental search capability of the Difference Reporting feature allows LLNL researchers to monitor specific sites and to instantly recognize new information. This is particularly important because of the sheer volume of data requiring analysis.
Deep Harvesting is at the heart of much research. DQM’s ability to use hundreds or thousands of sources for each harvest allows massively wide sweeps of the Internet.
Site Harvesting is also very important because it "fills in the gaps" inaccessible to Deep Harvesting. Recall that earlier in this study, it was reported that no search source indexes the entire surface Web, and in fact, there are many sites that are not indexed at all. The Site Harvest feature allows LLNL to harvest sites that are not indexed by any search engine or other source. That’s because Site Harvest does not depend on any search sources. Rather, LLNL enters the URL of the target site and DQM harvests it and indexes it for the user. Then the site can be searched directly through DQM — essentially, DQM becomes the search engine for any site identified by the user.
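The indexing step behind this kind of site search can be sketched with a small inverted index. The example assumes the pages of a site have already been downloaded into a mapping from URL to text; it does not show fetching, and it is not a description of DQM's internals. `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, *terms):
    """Return the sorted URLs of pages that contain every given term."""
    if not terms:
        return []
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits))


site_pages = {
    "http://site.example/": "Company overview and products",
    "http://site.example/products": "Vacuum pumps and valve products",
    "http://site.example/contact": "Contact the sales office",
}
site_index = build_index(site_pages)
product_pages = search(site_index, "products")
pump_pages = search(site_index, "vacuum", "products")
```

Once the index exists, queries run locally against it, which is why a site that no external search source covers can still be searched.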
LLNL reports: "With DQM, we receive a better quality of hits than from Google." DQM provides many different search and analysis features to address the needs of the laboratory's individual researchers and their unique tasks. It allows them to filter out undesired documents, target desired information, and pre-process returned information for further analysis later. As LLNL succinctly describes it: "DQM helps us cut through all the clutter!"
Additional Features:
- Archiving full document results for permanent record
- Local searching for new information within archived results
- Extensive inclusion and exclusion term filtering
- Extensive inclusion and exclusion domain filtering
- Modified document comparisons highlighting added and deleted text
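The last feature in the list, comparing two versions of a document to surface added and deleted text, can be approximated with Python's standard `difflib` module. This is an illustrative sketch under the assumption that both versions are available as plain text; it is not how DQM is known to work, and `word_changes` is a hypothetical helper.

```python
import difflib


def word_changes(old, new):
    """Return (added, deleted) word lists between two versions of a text."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_words[i1:i2])  # words present only in the old version
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])  # words present only in the new version
    return added, deleted


before = "the plant ships two units"
after = "the plant ships three units monthly"
added, deleted = word_changes(before, after)
```

A display layer could then highlight the `added` words in one color and the `deleted` words in another when showing the modified document.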