Deep Web: A Primer

What is the Deep Web?

The Deep Web is a complex concept that essentially covers two categories of data.
The first is any information that is not easy to obtain through standard searching. Examples include Twitter or Facebook posts, links buried many layers down in a dynamic page, and results that sit so far down the standard search results that typical users will never find them.

The second category is the larger of the two and represents a vast repository of information that is not accessible to standard search engines. It is made up of content found in websites, databases, and other sources. Often it is accessible only through a custom query directed at individual websites, which cannot be accomplished by a simple “surface web” search.

The Deep Web isn’t found in a single location. It consists of both structured and unstructured content – a huge amount of which is found in databases. This content has often been compiled by experts, researchers, and analysts, or through automated processing systems, at an array of institutions throughout the world. All of the content is housed in different systems, with different structures, at physical locations that can be as far apart as New York and Hong Kong.

BrightPlanet has patented the technology to automate custom queries that target thousands of Deep Web sources simultaneously. Our solutions find topic-specific content and provide highly qualified, relevant results for research, analysis, tracking and monitoring – all in real time – completely automating the process of retrieving Big Data from the Deep Web regardless of how it is stored.

How is the Deep Web different from the Surface Web?

The Surface Web contains only a fraction of the content available online. Standard search engines simply cannot find or retrieve content in the Deep Web. Why? Because many Deep Web sources require a direct query to access a database, and standard search engines aren’t built to do that.

Standard search engines are the primary means for finding information on the Surface Web. These tools (think Google, Yahoo!, and Bing) obtain their results in one of two ways. First, authors may submit their own Web pages directly to the search engine company for listing. Direct listing accounts for only a small fraction of Surface Web results, which means search engines must find most of their content on their own.

Search engines do this by performing a “crawl” or “spider”, following one hypertext link to another. This process takes the pages and puts them into an index that the engine can refer to during future searches.

Simply stated, the crawler starts searching for hyperlinks on a page. If that crawler finds one that leads to another document, it records the link and schedules that new page for later crawling.

Search engine crawlers extend their indexes further and further from their starting points, like ripples flowing across a pond, in an effort to find everything available. But due to the limitations inherent in crawler searches, they will never find all the content that exists.
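
The crawl described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the tiny in-memory "web" (LINKS) and its page names are hypothetical, and a real crawler would fetch pages over the network instead of reading a dictionary. The sketch shows the ripple effect (pages are reached in order of distance from the starting point) and the key limitation: a page that nothing links to is never found.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the links found on it.
LINKS = {
    "home": ["news", "about"],
    "news": ["story-1", "home"],
    "about": [],
    "story-1": [],
    "archive-2019": [],  # exists, but no other page links to it
}


def crawl(start, links):
    """Breadth-first crawl: index a page, then schedule each new link on it."""
    index = []                  # pages in the order they were crawled
    seen = {start}              # pages already scheduled, to avoid repeats
    frontier = deque([start])   # pages waiting to be crawled
    while frontier:
        page = frontier.popleft()
        index.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return index


crawled = crawl("home", LINKS)
print(crawled)                      # pages reachable by following links
print("archive-2019" in crawled)    # the unlinked page is never indexed
```

Starting from "home", the crawl reaches every linked page but never sees "archive-2019", even though that page exists.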

Thus, to be discovered, “surface” Web pages must be static and linked to other pages. Traditional search engines often cannot “see” or retrieve content in the Deep Web, which includes dynamic content retrieved from a database.

How large is the Deep Web?

It’s almost impossible to measure the size of the Deep Web. While some early estimates put the size of the Deep Web at 4,000-5,000 times larger than the Surface Web, the changing dynamic of how information is accessed and presented means that the Deep Web is growing exponentially, at a rate that defies quantification.

Why haven’t I heard about the Deep Web before?

In the earliest days of the Web there were relatively few documents and sites. It was a manageable task to post all documents as static pages; since results were persistent and constantly available they could easily be crawled by conventional search engines.

Now, information is published on the Web in a different way. This is especially true for dynamic content, larger sites or traditional information providers moving their content to the Internet. The sheer volume of these sites requires the information to be managed through automated systems with databases.

The contents of these databases are hidden in plain sight from standard search engines, since they often require a query to produce results. Some of these sites may have hundreds of pages to navigate through, but thousands of pages that can be searched. Think of a major news site, like CNN.com. You would not be able to follow links from its homepage to find a page from two years ago, but you would be able to search for that page because it is stored and available in the site’s database.
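
The news-site example can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the "articles" table, its columns, the sample stories, and the two helper functions are hypothetical and do not describe how CNN.com or any real site is built. The sketch shows a site whose homepage links only to current stories, while a direct query against the same database still returns an older one.

```python
import sqlite3

# Hypothetical news-site database held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, year INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO articles (year, title) VALUES (?, ?)",
    [(2022, "Election results"), (2024, "Budget vote"), (2024, "Storm warning")],
)


def homepage_links(conn, current_year):
    """Titles linked from the homepage: only the current year's stories."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE year = ? ORDER BY id", (current_year,)
    )
    return [row[0] for row in rows]


def search(conn, term):
    """Titles returned by the site's own search box: any year, matched by text."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE title LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


linked = homepage_links(conn, 2024)   # what a link-following crawler can see
found = search(conn, "Election")      # what a direct query can retrieve
print(linked)
print(found)
```

A crawler that only follows the homepage links never encounters the 2022 story, but the query retrieves it directly from the database.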

The evolution of the Web to a database-centric design has been gradual and largely unnoticed. Many Internet information professionals have noted the importance of searchable databases. But BrightPlanet’s Deep Web white paper is the first to comprehensively define and quantify this category of Web content.

Is the Deep Web the same thing as the “invisible” Web?

In a word, yes. But “invisible” implies that you’ll never see it. That’s why we prefer “Deep Web” – because the information is there, if you have the right technology to find it.

In 1994, Dr. Jill Ellsworth coined the phrase “invisible Web” to refer to information that was publicly available but not being returned by conventional search engines. But that is just a semantic difference that doesn’t address the core issue.

The real problem is the spidering and crawling technology used by conventional search engines, which returns links based on popularity, not content. But this same Big Data content is clearly and readily available if different technology, such as the suite of BrightPlanet solutions, is used to access it.