As a data exchange format, eXtensible Markup Language (XML) does provide extensible data representation. But using XML is not, by itself, enough to provide full extensibility. Fully extensible systems need at least these capabilities:
- Extensible data representation, so that data of any type and form can be transmitted between two disparate systems. XML and its structured cousins such as RDF and OWL perform this role. Note, however, that standard data exchange formats have been an active topic of research and adoption for at least 20 years; other notable formats such as ASN.1, CDF and EDI also perform this task, which is now largely being taken over by XML.
- Extensible semantics. Each additional source of data brought into an extended environment is likely to introduce new semantics and heterogeneities. These mismatches fall into the classic challenge areas of data federation. The key point, however, is that simply being able to ingest extended data accomplishes nothing if the meaning of that data is not also captured. Semantic extensibility requires more structured data representations (RDF-S or OWL, for example), reference vocabularies and ontologies, and utilities and means to map meanings between different schemas.
- Extensible data management. Though native XML databases and other extensions to conventional data systems have been attempted, no truly extensible data management system has yet been developed that: 1) performs at scale; 2) can be extended without re-architecting the schema; 3) can be extended without re-processing the original source data; and 4) performs efficiently. Until extensible infrastructures with these capabilities are available, extensibility will not become viable at the enterprise level and will remain an academic or startup curiosity.
- Extensible capabilities through extendable and interoperable applications or tools. Though we are now moving up the stack into the application layer, real extensibility comes from true interoperability. Service-oriented architectures (SOAs), business process modeling (BPM) and other approaches provide registries and message brokering among extended applications and services. But the choice between centralized and decentralized systems, whether or not business process interoperability is included, and even the accommodation of the other extensibility imperatives above make this last layer potentially fiendishly difficult.
These challenges are especially daunting in a completely decentralized, anarchic, distributed environment such as the broader Internet, where scales are measured in the millions to billions of documents. Nonetheless, eliminating stovepipes and making all information available for effective use requires that these extensibility hurdles be cleared.
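The gap between extensible representation and extensible semantics can be shown with a minimal sketch. The two XML snippets, the `ref:` vocabulary terms and the `TAG_MAPPINGS` table below are hypothetical and exist only for illustration; a real system would rely on reference vocabularies and ontologies rather than a hand-written dictionary. The sketch shows that parsing both sources succeeds, yet their fields only line up once a mapping to shared terms is applied.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources describing the same kind of record with different tags.
SOURCE_A = "<book><title>Moby-Dick</title><author>Herman Melville</author></book>"
SOURCE_B = "<publication><name>Walden</name><creator>Henry David Thoreau</creator></publication>"

# Hypothetical mapping from each source's local tags to a shared reference vocabulary.
TAG_MAPPINGS = {
    "book": {"title": "ref:title", "author": "ref:creator"},
    "publication": {"name": "ref:title", "creator": "ref:creator"},
}


def parse_raw(xml_text):
    """Syntactic ingest only: return the root tag and a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    return root.tag, {child.tag: child.text for child in root}


def normalize(xml_text, mappings=TAG_MAPPINGS):
    """Semantic step: rename local tags to shared terms; unmapped tags are kept as-is."""
    root_tag, fields = parse_raw(xml_text)
    mapping = mappings.get(root_tag, {})
    return {mapping.get(tag, tag): value for tag, value in fields.items()}


if __name__ == "__main__":
    print(parse_raw(SOURCE_A)[1])  # local tags from source A
    print(parse_raw(SOURCE_B)[1])  # local tags from source B, no overlap with A
    print(normalize(SOURCE_A))     # both sources now share the same keys
    print(normalize(SOURCE_B))
```

Both documents are well-formed XML and ingest without error, but only the normalized records can be compared or federated, which is the sense in which representation alone does not capture meaning.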
A key failing in most attempts to capture extensible attributes is doing so solely within the ingest pipeline. This approach has two drawbacks: 1) ingest bottlenecks, since every document must be processed; and 2) the need to re-process the entire repository whenever new attributes or tools are added to the system (as is inevitable). The result is poor performance, long lag times and poor scalability. These drawbacks are in many ways related to the unique nature of semi-structured data and the semantics it may contain.
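The re-processing drawback can be made concrete with a small sketch. The `IngestTimeRepository` class and its extractors are hypothetical, not a description of any actual product; they assume a toy system in which every attribute is computed as a document is ingested. Adding one new attribute then forces a second pass over every stored document.

```python
class IngestTimeRepository:
    """Hypothetical repository that computes every attribute while ingesting."""

    def __init__(self, extractors):
        self.extractors = dict(extractors)  # attribute name -> function(text)
        self.sources = []                   # original documents, kept for re-processing
        self.records = []                   # attributes computed at ingest time
        self.documents_processed = 0        # counts every pass over a document

    def _extract(self, text):
        self.documents_processed += 1
        return {name: fn(text) for name, fn in self.extractors.items()}

    def ingest(self, text):
        self.sources.append(text)
        self.records.append(self._extract(text))

    def add_extractor(self, name, fn):
        """A new attribute forces a second pass over every stored document."""
        self.extractors[name] = fn
        self.records = [self._extract(text) for text in self.sources]


repo = IngestTimeRepository({"word_count": lambda t: len(t.split())})
for doc in ["alpha beta", "gamma", "delta epsilon zeta"]:
    repo.ingest(doc)

# Adding a single new attribute re-processes the whole repository.
repo.add_extractor("char_count", len)
print(repo.documents_processed)
```

With three documents the counter ends at six passes instead of three; at the scale of millions to billions of documents, the same pattern is what produces the lag and scalability problems described above.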
Unlike structured or unstructured data, there is no accepted database engine for the knowledge base specific to semi-structured data.
Other attempts to manage the middle ground of semi-structured data have involved modifying RDBMSs to be XML-enabled, adding some structure to existing IR systems, or developing new, native XML data systems from scratch (which are unproven and don't scale).
BrightPlanet’s corporate research emphasis is geared toward overcoming the demonstrable weaknesses of these three approaches. We term this research effort the eXtensible Semantic Data Model, together with its associated extensions to our standard XSD (Text) Engine.