Harvest Engine

The BrightPlanet Harvest Engine is designed to find and retrieve documents, regardless of their format, language or storage location/technology. In developing the Harvest Engine, it was necessary to overcome four key challenges:

[Figure: four dimensions]

The Harvest Engine is able to overcome all of these challenges.

  • Formats – BrightPlanet handles HTML, PDF, text, and XML documents natively. Through our partnership with Stellent, Inc., we support over 370 additional file formats from Microsoft (e.g., Word, PowerPoint, Project, Outlook), Lotus, Adobe, and many other vendors.
  • Storage Technology – Many products are able to crawl static (surface) Web sites, using crawler (sometimes called spider) technology. BrightPlanet's crawler is extremely efficient and is able to target its crawling activities to the most relevant documents.
  • Through patented technology, BrightPlanet has developed algorithms for automatically configuring access to tens of thousands of Deep Web sites that contain dynamic content. Deep Web sites include most major news archives as well as thousands of specialized sources. These sources are typically the best, most definitive content sources for their subject area. For example, in the health sciences field, the Centers for Disease Control, National Institutes of Health, PubMed, Mayo Clinic, and American Medical Association are all Deep Web sites, inaccessible to conventional Web crawlers.
  • Additionally, the BrightPlanet Harvest Engine is able to access internal file systems, extending its reach to all of your internal documents.
  • Locations – We have found that for many real-world applications, it is important to access content from both your intranet and the open Internet. The BrightPlanet Harvest Engine can be configured to work inside your firewall to harvest proprietary content, while simultaneously reaching out to Internet and extranet sites. The Harvest Engine is designed to work with most popular firewall and proxy server products.
  • Language – While much content of interest is available in the English language, there is a vast body of knowledge available in other languages, as well. The Harvest Engine, in conjunction with the Rosette products of our partner, Basis Technology, is able to search and harvest in many other languages. These include most European languages, Arabic, Chinese, Japanese, Korean, and Russian.

Many Inputs, One Output

One of the Harvest Engine's key strengths is its ability to harvest from these many types of sources. Equally powerful, however, is the fact that, regardless of source, the output of the Harvest Engine is always a single, canonical, Unicode-based XML format. This well-documented, simple output stream allows easy implementation of the Harvest Engine in a variety of applications.
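
As a rough illustration of why a single output format simplifies integration, the sketch below reads a small XML stream with the standard Java XML parser. The element and attribute names (harvest, document, source, lang, title) and the HarvestOutputReader class are assumptions made up for this example; they are not the Harvest Engine's documented schema or API, which is available from BrightPlanet.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for an assumed output layout; not part of any BrightPlanet API.
public class HarvestOutputReader {

    /** One harvested document. The fields are illustrative assumptions. */
    public static final class HarvestedDoc {
        public final String source;
        public final String language;
        public final String title;

        HarvestedDoc(String source, String language, String title) {
            this.source = source;
            this.language = language;
            this.title = title;
        }
    }

    /** Parses the assumed layout; every <document> must contain a <title>. */
    public static List<HarvestedDoc> parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = dom.getElementsByTagName("document");
            List<HarvestedDoc> docs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String title =
                    e.getElementsByTagName("title").item(0).getTextContent();
                docs.add(new HarvestedDoc(
                    e.getAttribute("source"), e.getAttribute("lang"), title));
            }
            return docs;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse harvest output", ex);
        }
    }

    public static void main(String[] args) {
        // Two records from different kinds of sources, same shape.
        String xml = "<harvest>"
            + "<document source=\"deep-web\" lang=\"en\"><title>Influenza report</title></document>"
            + "<document source=\"file-system\" lang=\"fr\"><title>Rapport annuel</title></document>"
            + "</harvest>";
        for (HarvestedDoc d : parse(xml)) {
            System.out.println(d.source + " [" + d.language + "] " + d.title);
        }
    }
}
```

Because every record arrives in the same shape, the consuming code needs only one parsing path, whatever the original format, location, or language of the document.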

Application Program Interface (API)

The Harvest Engine API is a comprehensive set of calls written in Java. The Harvest Engine can be distributed across as many servers as needed to accommodate the harvest volumes you anticipate for your application. Full documentation is provided; contact BrightPlanet to see a copy.

 