FAQs
FREQUENTLY ASKED QUESTIONS
  1. How does DQM differ from search engines like Google or Yahoo?
  2. Is DQM a server-based application or do I need to download it onto my machine?
  3. Why do I have to select sources in DQM?
  4. Will sites being harvested know who I am?
  5. Am I limited to a specific number of queries per month?
  6. What is the purpose of filtering?
  7. If a site is removed from the internet, can I still access information I had harvested previously?
  8. How and why can I search Web search result sets?
  9. What is the first step in setting up a portal using the DQM/P?
  10. What else must be done in creating the portal?
  11. Can any of the above steps be performed automatically?
  12. Can DQM harvest from my internal Intranet sites?
  13. Can DQM harvest PDF files or just HTML pages? And what other formats can be processed?
  14. What search options are available to search DQM?
  15. From what kinds of sources can DQM harvest?
  16. What is differential weighting of "authoritative" documents?
  17. What is the Deep Web?
  18. How does the Deep Web differ from the "surface" Web?
  19. Why haven't I heard before about the Deep Web?
  20. Is the Deep Web the same thing as the "invisible" Web?
  21. How large is the Deep Web?
  22. How does the content and quality of the Deep Web differ from the "surface" Web?
  23. Is the Deep Web growing faster or slower than the "surface" Web?
  24. Why can't I search the Deep Web using standard search engines?
  25. I occasionally see Deep Web content using search engines. Why is that?
  26. I often miss "surface" Web content using search engines. Why is that?
  27. How much of the Deep Web is captured by CompletePlanet?
  28. What other factors may make Internet information "deep"?
  29. Where can I learn more about the Deep Web?


How does DQM differ from search engines like Google or Yahoo?

DQM is not a search engine itself. Rather, it submits your queries to the search engines and Deep Web databases of your choice. As a result, you can conduct a search that covers far more ground than a search with a single engine. With DQM, you can search hundreds or thousands of engines and Deep Web databases in one harvest. There is no other way to search such a large portion of the surface and Deep Web at one time in one easy step.
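As a rough illustration of the idea of sending one query to many sources, here is a minimal Python sketch. It is not DQM's implementation: the `fan_out` function, the source names, and the example URLs are all hypothetical, and each "source" is just a local function standing in for a real search engine or database.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real source: maps a query to a list of result URLs.
Source = Callable[[str], List[str]]


def fan_out(query: str, sources: Dict[str, Source]) -> List[str]:
    """Send one query to every source and merge the results, dropping duplicates."""
    seen = set()
    merged = []
    for search in sources.values():
        for url in search(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Two made-up sources that share one result.
sources = {
    "engine_a": lambda q: [f"http://a.example/{q}", "http://shared.example/1"],
    "database_b": lambda q: ["http://shared.example/1", f"http://b.example/{q}"],
}

results = fan_out("solar", sources)
print(results)
```

The shared result appears only once in the merged list, which is the main benefit of collecting everything in one place before reading it.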

Back to Top

Is DQM a server-based application or do I need to download it onto my machine?

DQM is server-based, so you don't have to download any software onto your computer now or later. Just use your browser to access our harvest servers. Once you are a customer, added features become automatically available to you. At each release, we send you an email announcement describing the feature and its benefit. You can then log in to your account and begin using the feature. Each feature is easily found and fully described in the online Help system, complete with instructions on how to use it.

Back to Top

Why do I have to select sources in DQM?

Actually, you don't have to. DQM can do all the work or let you make selections, assisting you in the process. Here are all your choices when conducting a Deep Harvest:

  1. You can have DQM analyze your query and select the sources for you.
  2. You can select a pre-existing source group provided with DQM.
  3. You can create a custom source group that you can use over and over again for harvests in that topic area. When you create a group, you have three ways to work — you can use any one or combination of methods to create a source group or augment an existing group:
    • You can draw from existing source groups to create your custom groups.
    • You can search the DQM source database by entering the search topic area and letting DQM identify relevant sources for you.
    • You can add your own favorite search sources to DQM if DQM doesn't already offer them.
The easiest way to start is by letting DQM select the sources for you just by analyzing your query. The DQM Help system provides more information and explicit instructions for using each of these methods.

Back to Top

Will sites being harvested know who I am?

When you use DQM, you maintain anonymity. That's because a Deep Harvest uses BrightPlanet's servers to harvest from each of the selected sources. No information is provided about you or your organization, and it is BrightPlanet's IP address that is captured, not yours.

Back to Top

Am I limited to a specific number of queries per month?

This answer requires some qualification. For normal users, there is no limit to the number of queries you can submit per month. But there may be an additional charge for those users submitting thousands of queries each month.

Back to Top

What is the purpose of filtering?

Filtering allows you to either select or reject Web documents based upon the terms contained in those documents. For example, you can create an inclusion filter that will accept only documents that contain the terms in the filter. You can also create an exclusion filter that rejects documents containing its terms, and you can apply several inclusion and exclusion filters in one harvest. Once a filter is created, you can use it for either inclusion or exclusion, but not both in the same harvest.
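The following Python sketch shows one way such filtering could behave. It rests on assumptions the FAQ does not spell out: an inclusion filter here requires all of its terms to be present, an exclusion filter rejects a document if any of its terms is present, and `apply_filters` is a hypothetical name, not a DQM function.

```python
import re
from typing import Iterable, List, Set


def terms_in(text: str) -> Set[str]:
    """Lowercased set of words found in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def apply_filters(
    docs: Iterable[str],
    include: Iterable[Iterable[str]] = (),
    exclude: Iterable[Iterable[str]] = (),
) -> List[str]:
    """Keep documents that pass every inclusion filter and no exclusion filter."""
    inc = [frozenset(t.lower() for t in f) for f in include]
    exc = [frozenset(t.lower() for t in f) for f in exclude]
    # One filter may serve as inclusion or exclusion in a harvest, not both.
    if set(inc) & set(exc):
        raise ValueError("a filter cannot be used for both inclusion and exclusion")
    kept = []
    for doc in docs:
        words = terms_in(doc)
        if all(f <= words for f in inc) and not any(f & words for f in exc):
            kept.append(doc)
    return kept


docs = [
    "Solar energy storage report",
    "Solar panel advertisement",
    "Wind energy report",
]
kept = apply_filters(docs, include=[{"solar"}], exclude=[{"advertisement"}])
print(kept)
```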

Back to Top

If a site is removed from the internet, can I still access information I had harvested previously?

Yes! Once you have conducted harvests, the textual content of all the returned documents remains on our servers for access by your account until you delete them. In addition, you can select one or more results and share them with other DQM users in your organization.

Back to Top

How and why can I search Web search result sets?

Besides searching multiple engines and Deep Web databases on the Internet, you can also search one or more of your own result sets that you previously harvested from the net. This allows you to keep vetted results as you acquire them and derive more value by searching them with more specificity in the future.

Back to Top

What is the first step in setting up a portal using the DQM/P?

You must have the Publisher module and a subject taxonomy (a subject tree by topic) into which you want the documents placed. If you don't have one, BrightPlanet can work with you to create one.

Back to Top

What else must be done in creating the portal?

  1. All documents from each of the target Web sites must be harvested.
  2. All documents must be fully analyzed for content.
  3. A centralized, comprehensive, searchable index must be built.
  4. Each document must then be placed within the proper subject node in the taxonomical structure (the subject tree).
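To make the indexing and placement steps above concrete, here is a toy Python sketch. It is only an illustration under stated assumptions: the documents are a small in-memory dictionary standing in for harvested pages, the taxonomy is a hypothetical mapping from subject node to keywords, and placement by keyword overlap is a simplification chosen for the example, not the method the Publisher module uses.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Set


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the ids of the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def place(text: str, taxonomy: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the subject node whose keywords overlap the document the most."""
    words = set(tokenize(text))
    best = max(taxonomy, key=lambda node: len(words & taxonomy[node]))
    return best if words & taxonomy[best] else None


# Stand-ins for harvested documents and a subject tree (both hypothetical).
docs = {
    "d1": "Photovoltaic solar cells",
    "d2": "Offshore wind turbine design",
}
taxonomy = {
    "Energy/Solar": {"solar", "photovoltaic"},
    "Energy/Wind": {"wind", "turbine"},
}

index = build_index(docs)
placements = {doc_id: place(text, taxonomy) for doc_id, text in docs.items()}
print(placements)
```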

Back to Top

Can any of the above steps be performed automatically?

With the DQM/P, THEY ALL CAN! And that's a very important part of the power brought by the Publisher module of DQM to solve your portal setup and maintenance problems. Once you've created the subject taxonomy and entered the list of URLs (hundreds or thousands) of the sites from which you want to harvest, the rest of the process is automatic and requires no manual input. You're free to perform other work or go home while it runs overnight or during the weekend.

What once required a staff of librarians and many months or even years to process and place hundreds of thousands or millions of documents can now be created and maintained within a day or two by one person.

Back to Top

Can DQM harvest from my internal Intranet sites?

Yes, internal Web sites are just another harvest URL to DQM. Therefore, DQM provides "one-stop shopping" for your staff to search and access documents on any number of internal and external Web sites. No longer must they visit and search them one by one.

Back to Top

Can DQM harvest PDF files or just HTML pages? And what other formats can be processed?

DQM is ready to process HTML, PDF, and straight text documents, and if you purchase the File Translation module, you can additionally harvest 370 other file types, such as Word, WordPerfect, WordStar, RTF, Excel, Lotus, PowerPoint, etc.

Back to Top

What search options are available to search DQM?

You can enter free text or use Boolean operators. The list of available Boolean operators is extensive and includes the standard fare, such as AND, OR, NOT, and AND NOT, as well as more advanced operators including NEAR, BEFORE, and AFTER. You can also use phrases and Boolean operators in any combination in your query to specifically target your quarry.
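For readers unfamiliar with proximity operators, the Python sketch below shows one plausible reading of AND, NOT, NEAR, and BEFORE applied to a single document. The FAQ does not define DQM's exact semantics, so the word-window interpretation of NEAR, the default window of 5, and the function names `has`, `near`, and `before` are all assumptions made for illustration.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(text: str, term: str) -> List[int]:
    """Word positions at which `term` occurs in `text`."""
    return [i for i, w in enumerate(tokens(text)) if w == term.lower()]


def has(text: str, term: str) -> bool:
    return bool(positions(text, term))


def near(text: str, a: str, b: str, window: int = 5) -> bool:
    """True if a and b occur within `window` words of each other, in either order."""
    return any(
        abs(i - j) <= window
        for i in positions(text, a)
        for j in positions(text, b)
    )


def before(text: str, a: str, b: str) -> bool:
    """True if some occurrence of a comes earlier than some occurrence of b."""
    pos_a, pos_b = positions(text, a), positions(text, b)
    return bool(pos_a) and bool(pos_b) and min(pos_a) < max(pos_b)


doc = "Deep Web databases are missed by most search engines"

# deep AND NOT surface
print(has(doc, "deep") and not has(doc, "surface"))
# web NEAR databases (within 2 words)
print(near(doc, "web", "databases", window=2))
# databases BEFORE engines
print(before(doc, "databases", "engines"))
```

AFTER can be expressed by swapping the arguments to `before`.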

Back to Top

From what kinds of sources can DQM harvest?

DQM can harvest from any and all of the following in any combination:

  • Standard Web-wide search engines
  • More than Deep Web searchable databases
  • Proprietary (subscription or password protected) Web sources - assuming release agreements from the source to the customer
  • All existing digital documents maintained by the customer (option)

Back to Top

What is differential weighting of "authoritative" documents?

Authoritative documents or sites are those that have been identified as providing very high quality content. They are the standard by which others are judged and are viewed as the "best of the best".

When authoritative documents are found in a search, differential weighting causes them to be listed at the top of the returned results — the best results are always available for a researcher to find first.
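One simple way to picture this ordering is a sort that puts authoritative sources first and then orders by relevance score within each group. The Python sketch below is an assumption about how such weighting could work, not a description of DQM's internals; the `rank` function, the scores, and the host names are invented for the example.

```python
from typing import List, Set, Tuple
from urllib.parse import urlparse

Result = Tuple[str, float]  # (url, relevance score)


def rank(results: List[Result], authoritative: Set[str]) -> List[Result]:
    """List results from authoritative hosts first, then sort by score."""
    def key(item: Result) -> Tuple[bool, float]:
        url, score = item
        return (urlparse(url).netloc in authoritative, score)

    return sorted(results, key=key, reverse=True)


results = [
    ("http://blog.example/post", 0.9),
    ("http://nih.example/study", 0.6),
    ("http://forum.example/thread", 0.7),
]
authoritative = {"nih.example"}  # hypothetical list of vetted hosts

ranked = rank(results, authoritative)
print(ranked)
```

Here the authoritative result leads the list even though its raw score is the lowest of the three.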

Back to Top

What is the Deep Web?

The Deep Web is content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, Deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent.

Back to Top

How does the Deep Web differ from the "surface" Web?

Search engines — the primary means for finding information on the "surface" Web — obtain their listings in two ways. Authors may submit their own Web pages for listing, generally a minor contributor to total listings. Or, search engines "crawl" or "spider" documents by following one hypertext link to another. Simply stated, when indexing a given document or page, if the crawler encounters a hypertext link on that page to another document, it records that incidence and schedules that new page for later crawling. Like ripples propagating across a pond, in this manner search engine crawlers are able to extend their indexes further and further from their starting points.

Thus, to be discovered, "surface" Web pages must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the Deep Web, which by definition is dynamic content served up in real time from a database in response to a direct query.
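The link-following behavior described above can be sketched in a few lines of Python. This is a simplified model, assuming the Web is a small in-memory dictionary of page names and their outgoing links rather than real HTTP fetches; the page names and the `crawl` function are hypothetical.

```python
from collections import deque
from typing import Dict, List


def crawl(seed: str, links: Dict[str, List[str]]) -> List[str]:
    """Breadth-first crawl: index a page, then schedule every page it links to."""
    indexed = []
    queue = deque([seed])
    scheduled = {seed}
    while queue:
        page = queue.popleft()
        indexed.append(page)
        for target in links.get(page, []):
            if target not in scheduled:
                scheduled.add(target)
                queue.append(target)
    return indexed


links = {
    "home": ["about", "catalog"],
    "about": ["home"],
    "catalog": ["item-1"],
    # A dynamic result page that exists only in response to a query:
    # no static page links to it, so the crawler never schedules it.
    "search?q=solar": [],
}

indexed = crawl("home", links)
print(indexed)
print("search?q=solar" in indexed)
```

The query result page is never reached because discovery depends entirely on links from pages already crawled.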

Back to Top

Why haven't I heard before about the Deep Web?

In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to "post" all documents as "static" pages. Because all results were persistent and constantly available, they could easily be crawled by conventional search engines.

What has not been broadly recognized is that information is now being published by different means on the Web, especially for larger sites or for traditional information providers now moving their content online. The sheer volume of these sites requires the information to be managed from a database, the results of which are "hidden in plain sight" from search engines.

The evolution of the Web to a database-centric design has been gradual and largely unnoticed. Many Internet information professionals have noted the importance of searchable databases to Web content. But BrightPlanet’s Deep Web white paper was the first to comprehensively define, quantify and characterize this entirely different category of Web content.

Back to Top

Is the Deep Web the same thing as the "invisible" Web?

As early as 1994, Dr. Jill Ellsworth coined the phrase "invisible Web" to refer to information content that was "invisible" to conventional search engines. We avoid the term "invisible Web" because it is inaccurate. The only thing "invisible" about searchable databases is that they are not indexable or queryable by conventional search engines. Using our technology, they are totally "visible" to those who need to access them.

The real problem is not the "visibility" or "invisibility" of the Web, but the spidering technologies used by conventional search engines to collect their content. For these reasons, we have chosen to call information in searchable databases the Deep Web. Yes, it is somewhat hidden, but clearly available if technology such as ours is used to access it.

Back to Top

How large is the Deep Web?

It's very big: the 60 largest Deep Web sources contain 84 billion pages of content. That's about 750 terabytes of information, enough by itself to exceed the size of the surface Web by 40 times. For comparison, Yahoo! (the largest crawler-based search engine) indexes about 18 billion pages. Currently, BrightPlanet software is configured to query 70,000+ Deep Web sources. This number continues to grow.

Back to Top

How does the content and quality of the Deep Web differ from the "surface" Web?

Deep Web sites tend to be narrower, with deeper content, than conventional surface sites. Total quality content of the Deep Web is much greater than that of the surface Web. Deep Web content is highly relevant to every information need, market and domain. More than half of the Deep Web content resides in topic-specific databases. A full 95% of the Deep Web is publicly accessible information that is not subject to fees or subscriptions.

Back to Top

Is the Deep Web growing faster or slower than the "surface" Web?

The Deep Web is the fastest growing category of new information on the Internet. All signs point to the Deep Web as the dominant paradigm for the next-generation Internet.

Back to Top

Why can't I search the Deep Web using standard search engines?

Searching on the Internet today can be compared to dragging a net across the surface of the ocean. There is a wealth of information that is deep, and therefore missed. The reason is simple: basic search methodology and technology have not evolved significantly since the inception of the Internet.

Traditional search engines create their card catalogs by spidering or crawling "surface" Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the Deep Web, which is defined as content in searchable databases that only appears dynamically in response to a direct query. Because traditional search engine crawlers cannot probe beneath the surface, the Deep Web has heretofore been hidden in plain sight.

Back to Top

I occasionally see Deep Web content using search engines. Why is that?

Any Deep Web content listed on a static Web page is discoverable by crawlers and therefore indexable by search engines. This can occur when a Web page author discovers some useful Deep Web content and posts its dynamic URL address on a static Web page.

Back to Top

I often miss "surface" Web content using search engines. Why is that?

Search engines themselves impose decision rules with respect to either depth or breadth of surface pages indexed for a given site. There is also broad variability in the timeliness of results from these engines. Specialized surface sources or engines should therefore be considered when truly deep searching is desired. Again, the "bright line" between deep and surface Web shows shades of gray.

Back to Top

How much of the Deep Web is captured by CompletePlanet?

Over Deep Web sites are presently listed on CompletePlanet. CompletePlanet will continue to extend its deep Web coverage. Ultimately, CompletePlanet’s goal is to list all Deep Web sites and to keep current with new ones as they arise.

Back to Top

What other factors may make Internet information "deep"?

The World Wide Web (HTTP) is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), email, news, Telnet and Gopher (most prominent among pre-Web protocols). There is also a large storehouse of private, intranet information hidden behind firewalls; many large companies have internal document stores that exceed terabytes of information. Also, on average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript). Finally, multimedia (images, music) is another growing category of Internet content.

All of these sources can contribute to "deep" Internet content. However, CompletePlanet is currently focused on only public, text-based content, whether surface or deep.

Back to Top

Where can I learn more about the Deep Web?

See BrightPlanet’s comprehensive white paper, The Deep Web: Surfacing Hidden Value.

Back to Top

 