Internet Languages, Character Sets & Encodings

A BRIGHTPLANET TUTORIAL 


BrightPlanet Corporation
Released: March 2006

Introduction

Broad-scale, international open source harvesting from the Internet poses many challenges in the use and translation of legacy encodings, challenges that have vexed academics and researchers for many years. Addressing them successfully will only grow in importance as the share of international sites grows relative to conventional English ones.

A major challenge in internationalization and foreign source support is "encoding." An encoding specifies the arbitrary assignment of numbers to the symbols (characters or ideograms) of the world's written languages, as needed for electronic transfer and manipulation. One of the first encodings, developed in the 1960s, was ASCII (numerals, the letters a-z and A-Z, basic punctuation and control codes); others were developed over time to handle additional characters and the many symbols of (particularly) the Asian languages.
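To make the idea of "assigning numbers to symbols" concrete, the following minimal Python sketch lists the numbers ASCII assigns to a few characters and checks whether a string fits within ASCII at all. The helper names code_points and is_ascii_encodable are hypothetical, chosen only for this illustration.

```python
def code_points(text: str) -> list[int]:
    """Return the number assigned to each character in the text."""
    return [ord(ch) for ch in text]


def is_ascii_encodable(text: str) -> bool:
    """Report whether every character in the text has an ASCII number (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(code_points("Az0"))            # [65, 122, 48]
    print(is_ascii_encodable("plain"))   # True
    print(is_ascii_encodable("café"))    # False: 'é' has no ASCII number
```

Characters outside the ASCII range, such as accented letters or ideograms, are the reason additional encodings had to be developed.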

Some languages have many character encodings, and some languages, for example Chinese and Japanese, require very complex encoding systems to handle their large number of unique characters. Two encodings can be incompatible because they assign the same number to two distinct symbols, or assign two different numbers to the same symbol. Unicode set out to consolidate the many different encodings, each with its own code assignments, into a single system that could represent all written languages within the same character set. Unicode has several encoding formats, the most common being UTF-8.
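The incompatibility described above can be sketched in a few lines of Python. The legacy encodings used here (Latin-1, Windows-1251 and the IBM PC code page 437) are chosen only as illustrations, and the helper names decode_with, encode_with and utf8_bytes are hypothetical.

```python
# One byte value, interpreted under two legacy single-byte encodings.
RAW = b"\xe9"


def decode_with(raw: bytes, encoding: str) -> str:
    """Interpret raw bytes as text under the named encoding."""
    return raw.decode(encoding)


def encode_with(text: str, encoding: str) -> bytes:
    """Convert text to bytes under the named encoding."""
    return text.encode(encoding)


def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8, which covers all Unicode characters."""
    return text.encode("utf-8")


if __name__ == "__main__":
    # Same number, two distinct symbols.
    print(decode_with(RAW, "latin-1"))   # é (Western European)
    print(decode_with(RAW, "cp1251"))    # й (Cyrillic)

    # Same symbol, two different numbers.
    print(encode_with("é", "latin-1"))   # b'\xe9'
    print(encode_with("é", "cp437"))     # b'\x82'

    # Under UTF-8 each symbol gets its own unambiguous byte sequence.
    for symbol in ("é", "й", "中"):
        print(symbol, utf8_bytes(symbol))
```

A harvester that guesses the wrong legacy encoding therefore silently produces the wrong characters, which is the practical reason for converting incoming content to a single Unicode representation.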

The Internet was originally developed via efforts in the United States funded by ARPA (later DARPA) and NSF, extending back to the 1960s. At the time of its commercial adoption in the early 1990s via the World Wide Web protocols, it was almost entirely dominated by English by virtue of this U.S. heritage and the emergence of English as the lingua franca of the technical and research community.

However, as the Internet has matured into a global information repository and a medium for instantaneous e-commerce, the online community has grown to nearly 1 billion users drawn from every country. The Internet has become increasingly multilingual.

Efficient and automated means to discover, search, query, retrieve and harvest content from across the Internet thus require an understanding of the source human languages in use and the means to encode them for electronic transfer and manipulation. This Tutorial provides a brief introduction to these topics.

Copyright© 2006. BrightPlanet Corporation – All rights reserved.

 