An annotated reference to the Thai implementations


WARNING: This paper is a snapshot of a work-in-progress. It is put online for a closed group review. It is by no means intended for wide circulation.


Please refer to Requirements for Thai Localization by James Clark while this document is going through a major overhaul.


This paper is an effort to collect and annotate the relevant standards and the technical and implementation aspects of Thai language implementation on various computer platforms.

This paper uses blue text to convey general ideas, while it uses green to distinguish annotations (personal impressions of the author) from the rest.

Table of Contents





I. Thai Character Set Standards

There is a Thai paper which, although written some time ago in 1993, very nicely describes the importance of standards and standardisation work, along with the general background on these activities. Readers who are not familiar with these standards activities are advised to read the paper IT and Open Systems Standards at least once in order to get more out of this writing.

Work done by the Thai national standards body

1) TIS 620-2533:1990

In Thailand, the one and only Thai Character Set standard is TIS 620-2533, defined by the Thai Industrial Standards Institute (TISI), Ministry of Industry, Royal Thai Government. It was the work of the 536th Technical Committee, TC536, which was, and still is, in charge of Thai Information Technology Standards. TIS 620-2533 is a revision of the earlier standard TIS 620-2529. The code point assignments in these two versions of TIS 620 remain the same.

The -2529 and -2533 tags are years in the Buddhist Era (B.E.), commonly used in Thailand, and designate the year in which each standard was issued.
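As a small illustration of the year tags, the sketch below converts a Buddhist Era year to the Common Era by subtracting 543, which matches the year pairs quoted in this paper (for example B.E.2533 and 1990). The function name is hypothetical, and the sketch ignores historical changes to the Thai calendar.

```python
# Offset between the Buddhist Era (B.E.) and the Common Era (C.E.),
# consistent with the year pairs quoted in this paper.
BE_OFFSET = 543


def be_to_ce(be_year: int) -> int:
    """Convert a Buddhist Era year to a Common Era year (hypothetical helper)."""
    return be_year - BE_OFFSET


for tag in (2529, 2533):
    print(f"TIS 620-{tag} was issued in {be_to_ce(tag)}")
```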

TIS 620 defines an eight-bit character environment. Assigned character values are given in a character table published by NECTEC.

TIS 620-2533 is a Coded Character Set standard which is used for Information Interchange. Many implementations made the big assumption that "character" and "glyph" are synonymous. TIS 620 was never meant to be a set of code point assignments for display; its sole purpose is information interchange. Each character is an atomic entity of the Thai language.

Due to limitations in their rendering engines, some platforms require separate code points for the same character if it is rendered differently. Brief examples of these can be viewed here under the topic Displaying.

2) ISO-IR-166

In the early '90s, the second subcommittee of TC536 (TISI/TC536/SC2) submitted TIS 620-2533 to the European Computer Manufacturers Association, ECMA, which acted as the registrar of the ISO 2375 Character Set Repertoire. ECMA registered TIS 620-2533 and assigned it International Registration #166, ISO-IR-166. This registration is now maintained by ISO/IEC JTC1/SC2/WG3.

ISO-IR-166 is a superset of TIS 620-2533, i.e. the two are not exactly the same. The reason for this discrepancy lies in the fact that many countries had the same desire to use an 8-bit character environment, which is limited to a total of 256 code positions. To prevent ambiguity when a character is sent to a computer that assumes a different 8-bit character environment, ISO/IEC 2022:1994 and ISO/IEC 4873:1991 mandate that character sets be declared by character set designation functions, and they define several ways to juggle multiple character sets within the limited 8-bit character environment.

In order to construct the TIS 620-2533 8-bit character set, one must map the "English" portion of ISO/IEC 646 (ISO-IR-6, also known as US-ASCII) into G0 and the "Thai" portion of TIS 620-2533 into G1 by using the following escape sequences:

GZD4 04/02   ESC 02/08 04/02
G1D6 05/04   ESC 02/13 05/04
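The escape sequences above are written in the column/row notation of the code table, where cc/rr denotes the byte value 16 x cc + rr. As a minimal sketch, assuming that reading of the notation, the two designation sequences can be turned into raw bytes as follows; the helper names are hypothetical.

```python
ESC = 0x1B  # the ESCAPE control character, column 01 row 11


def colrow_to_byte(token: str) -> int:
    """Convert a 'cc/rr' code-table position into a byte value."""
    column, row = token.split("/")
    return int(column) * 16 + int(row)


def escape_sequence(*positions: str) -> bytes:
    """Build an escape sequence from the positions that follow ESC."""
    return bytes([ESC] + [colrow_to_byte(p) for p in positions])


# GZD4 with final byte 04/02: designate ISO-IR-6 to G0.
gzd4 = escape_sequence("02/08", "04/02")
# G1D6 with final byte 05/04: designate ISO-IR-166 to G1.
g1d6 = escape_sequence("02/13", "05/04")

print(gzd4.hex(" "))  # 1b 28 42
print(g1d6.hex(" "))  # 1b 2d 54
```

Read as ASCII, the two sequences are ESC ( B and ESC - T respectively.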

Apart from the fact that ISO-IR-166 was written in English while TIS 620 was in Thai, the only two differences between these two standards are:

[To add explanation on G0/G1 and GL/GR]


Proprietary work accomplished by vendors

3) IBM Thailand Extended Single-Byte Character Set (CP838)

IBM Corp has registered its Codepage 838 with the Internet Assigned Numbers Authority, IANA. This character set is also listed as a supported encoding by Sun Microsystems' Java Development Kit version 1.1; refer to the JDK 1.1 supported encoding documentation.

No code table is available online.

It is believed that CP838 is what many Thai implementors called "KU Code", developed at Kasetsart University Bangkhen Campus by Prof Yuen Poovarawan, which IBM (Thailand) Co Ltd later adopted around the early '80s. The "KU Code" predates TIS 620-2529, and the code assignments for the two character sets are not compatible.

At least in Thailand, the use of this character set over the Internet is found to be unpopular.

4) IBM Thai (CP874)

This character set is listed as a supported encoding by the JDK 1.1.

IBM provides a summary of how CP874 differs from Windows-874 (below). This summary is available from the Unicode Consortium.

5) Microsoft Thai (MS874 or Windows-874)

This character set is what Microsoft Corp uses for its recent versions of the MS-DOS and the Windows families of operating systems. It is an extension of TIS 620-2533. This character set is listed as a supported encoding by the JDK 1.1.
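The statement that Windows-874 extends TIS 620-2533 can be spot-checked with the codecs that ship with Python. This is only a sketch: it assumes that Python's `tis_620` and `cp874` codecs faithfully reflect the two character sets, and the helper name is hypothetical.

```python
def thai_bytes() -> list:
    """Byte values to which TIS 620-2533 assigns Thai characters."""
    return list(range(0xA1, 0xDB)) + list(range(0xDF, 0xFC))


# Every Thai byte of TIS 620 decodes to the same character under Windows-874.
thai_range_agrees = all(
    bytes([b]).decode("tis_620") == bytes([b]).decode("cp874")
    for b in thai_bytes()
)
print(thai_range_agrees)

# Windows-874 additionally assigns some positions below 0xA1,
# for example a horizontal ellipsis at 0x85.
extra = bytes([0x85]).decode("cp874")
print(f"0x85 in cp874 is U+{ord(extra):04X}")
```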

6) Mac Thai

This character set is defined and used by Apple Computer, Inc for its Thai implementation in the MacOS operating system and, probably, in its successor, Rhapsody. Likewise, this character set is also a supported encoding in JDK 1.1.


II. Other Relevant Standards

1) ISO/IEC 10646-1:1993

ISO/IEC 10646-1 is an attempt to establish a well-defined set of coded character elements for all scripts. Each code point in the canonical form is 4 bytes. The plane in which the two most significant bytes are zero, and in which most scripts are located, is called the Basic Multilingual Plane (BMP); its code points can be written in a two-byte short form.

Thai characters are assigned to the "lower" half of page "0E" of the BMP, meaning that the data elements are located in the range 00000E00 through 00000E7F, inclusive.

While one may access an online code assignment table for the BMP here, a 10646/Unicode font is needed to display it. Unicode fonts, such as the Bitstream Cyberbit, can be downloaded free of charge over the Internet. However, the use of such a font may have undesirable effects on machines with little memory.

A cross reference table between ISO/IEC 10646-1 and TIS 620-2533 is available here.
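Because the Thai characters keep their relative order in both standards, the cross reference reduces to a constant offset: a TIS 620 byte b in the Thai range corresponds to the code point 0E00 + (b - A0), in hexadecimal. The sketch below implements that rule and checks it against Python's built-in `tis_620` codec; the function name is hypothetical, and treating bytes below 0x80 as ISO/IEC 646 characters is an assumption about the surrounding 8-bit environment.

```python
def tis620_to_ucs(byte: int) -> int:
    """Map one TIS 620-2533 byte to its ISO/IEC 10646 code point."""
    if byte < 0x80:
        # Assumed to be the ISO/IEC 646 (US-ASCII) half of the environment.
        return byte
    if 0xA1 <= byte <= 0xDA or 0xDF <= byte <= 0xFB:
        return 0x0E00 + (byte - 0xA0)
    raise ValueError(f"0x{byte:02X} is not assigned in TIS 620-2533")


# Cross-check the offset rule against the codec shipped with Python.
for b in list(range(0xA1, 0xDB)) + list(range(0xDF, 0xFC)):
    assert chr(tis620_to_ucs(b)) == bytes([b]).decode("tis_620")

print(f"0xB7 -> U+{tis620_to_ucs(0xB7):04X}")
print(f"0xD3 -> U+{tis620_to_ucs(0xD3):04X}")
```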

2) Unicode 2.1

The Unicode Consortium is an independent group of organizational and individual members who have a common interest in solving the classic problem of internationalization (i18n) of computer software. The first piece of the i18n puzzle that the consortium chose to address is the Unicode standard, which is currently at version 2.1.

Unicode is a two-octet character table. The code assignments of Unicode 2.0 at its time of release were exactly the same as those of the Two-Octet Form (UCS-2), i.e. the code assignments of the BMP in ISO/IEC 10646-1:1993.

The "Thai portion" of Unicode is assigned to the "lower" half of page "0E", meaning the data elements ranging from 0E00 through 0E7F, inclusive.

3) ISO/IEC 8859-11:2001 - Latin/Thai

Part 11 of the ISO/IEC 8859 series of 8-bit single-byte coded graphic character sets, Latin/Thai, was published in late 2001.

4) TIS 820-2538:1995 Standard Thai Keyboard Layout

In B.E.2538 (1995 A.D.), the Thai Industrial Standards Institute issued a revision of the standard Thai keyboard layout, dubbed TIS 820-2538, which supersedes the prior version of the standard released seven years earlier, TIS 820-2531.

Three less commonly used Thai characters and three commonly used symbols were added to the layout of TIS 820-2531. These additions appear on shaded keycaps in the layout here.

5) Thai API Consortium's "WTT" Input/Output Methods

Between the late 1980s and the early 1990s, the Thai software market was in full swing. The Thai PC market was then still small, and no operating system vendor except Apple Computer Inc had made an effort to address the needs of the Thai language.

These needs were so strong that a niche hardware market -- the so-called "Thai Card", which was a modified video adaptor together with appropriate drivers -- came into existence. However, since every vendor used a proprietary driver and APIs, application software had to be adapted to suit the different driver of each Thai Card.

In this confusing situation, consumers lost the most. One application might be tied to one Thai card/driver while another required a separate computer with a different configuration.

In 1991, a group of computer professionals, headed by Dr. Thaweesak Koanantakool of the Information Processing Institute for Education and Development, Thammasat University, joined hands and formed the Thai API Consortium (TAPIC) to address these serious incompatibility problems. The work of TAPIC, facilitated by the National Electronics and Computer Technology Center (NECTEC), was called the "WTT 2.0" specifications.

WTT 2.0 was a major overhaul of the WTT 1.0 system developed by Koanantakool and his team a year earlier. While it addressed the then-current problems in the Thai personal computer market, many of its features have since been adopted by major systems software vendors. Digital Equipment Corporation (now Compaq) adopted the entire WTT I/O behavior in its Thai versions of the OpenVMS and Digital UNIX operating systems. Microsoft adopted the WTT input method and cursor movement behaviors in its Thai versions of Windows 95/98 and Windows NT. The WTT Keyboard Layout was a strong push for TISI to revise its Thai Keyboard Layout standard (TIS 820-2538:1995). The Thai locale for the X Window System, which uses WTT's TACSIS charset, has been in the source pool since X11R6.3.

The social implications of WTT for the Thai computer society were published in the Bangkok Post Annual Information Technology Directory 1992, in the article The Fate of Thai Localisation.


III. Distinct Characteristics of Thai

1) Basic Problems

The ZzzThai Group of the University of Electro-communications, Japan, has made a very nice illustration of the basic problems of implementing Thai on computers.

2) Sanitising Keystroke Order versus Normalisation

When the first mechanical Thai typewriter was invented in 1891 (B.E.2434), the order in which a typist struck the keys did not matter as long as all characters were typed correctly.

Typing makes
Typing also makes

This is no longer true with computers. Different orders make different strings. No commercial operating system will report equality from a string compare function given the strings {0xB7 0xD5 0xE8} and {0xB7 0xE8 0xD5}, even though the users who typed them probably thought both were the same. No commercial database would produce the same hash value for both strings, hence a lookup fails if the data was entered in one sequence but queried in the other.

This subtle problem has been with Thai computer users from the very beginning, starting with the first implementation of a Thai computer I/O device circa 1968 (B.E.2511).

The Typewriting textbook by the Vocational Education Department, Thai Ministry of Education, mandates the correct order for typing combining characters: a vowel is always typed before a tone mark or any other special mark. In the example above, the first sequence is the correct one.

Since most Thais teach themselves to type, a large percentage of Thai text is entered either in consistently incorrect sequences or in mixed order. When computerisation became more popular and most businesses turned away from typewriters to word processors, the problem was amplified by a large factor.
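The effect can be reproduced in a few lines. The sketch below builds the two byte strings quoted above and shows that they contain the same bytes yet compare unequal and hash differently; SHA-256 is used here only as a stand-in for whatever hash function a database might apply.

```python
import hashlib

# The same three TIS 620 characters, typed in two different orders.
vowel_first = bytes([0xB7, 0xD5, 0xE8])  # consonant, vowel, tone mark
tone_first = bytes([0xB7, 0xE8, 0xD5])   # consonant, tone mark, vowel

# A plain comparison sees two different strings ...
print(vowel_first == tone_first)

# ... although both are made of exactly the same bytes.
print(sorted(vowel_first) == sorted(tone_first))

# A hash-based lookup keyed on one order will therefore miss the other.
digest_a = hashlib.sha256(vowel_first).hexdigest()
digest_b = hashlib.sha256(tone_first).hexdigest()
print(digest_a == digest_b)
```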

To fix this problem, there are two schools of thought. One approach is to prevent a wrong sequence from being entered into the system. The champion of this approach is the WTT Input/Output Method Specification, which was developed by the Thai API Consortium (TAPIC) around 1990-91. Microsoft Corp's keyboard driver for the Thai language also follows this specification.

The other approach is to normalise Thai strings into the correct order before using them for processing. Normalisation takes place in every input subsystem (keyboard read, file read, network read, etc.), but this approach cannot guarantee a well-behaved string if the input is made character by character -- without a complex callback capability. This limitation was the primary reason that TAPIC did not adopt this approach.
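The first approach can be sketched as a filter that inspects each keystroke before it reaches the buffer. The rule below is a deliberately simplified assumption (an above or below vowel may not directly follow a tone mark); it is not the WTT character classification, and all names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Simplified character classes over TIS 620 byte values (an assumption,
# not the WTT classification).
ABOVE_BELOW_VOWELS = frozenset([0xD1, *range(0xD4, 0xDA)])
TONE_MARKS = frozenset(range(0xE8, 0xEC))


def is_allowed(previous: Optional[int], current: int) -> bool:
    """Reject an above/below vowel typed directly after a tone mark."""
    return not (current in ABOVE_BELOW_VOWELS and previous in TONE_MARKS)


def type_keys(keys: Iterable[int]) -> Tuple[bytes, List[int]]:
    """Feed keystrokes one at a time; return the buffer and rejected keys."""
    buffer = bytearray()
    rejected = []
    for key in keys:
        previous = buffer[-1] if buffer else None
        if is_allowed(previous, key):
            buffer.append(key)
        else:
            rejected.append(key)
    return bytes(buffer), rejected


print(type_keys([0xB7, 0xD5, 0xE8]))  # correct order: everything is accepted
print(type_keys([0xB7, 0xE8, 0xD5]))  # wrong order: the late vowel is refused
```

Because the check needs only the previous character, it can run inside a keyboard driver that sees one keystroke at a time.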
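A minimal sketch of the normalisation approach, operating on a complete TIS 620 byte string, is given below. It handles only one case, a tone mark immediately followed by an above or below vowel, and the character classes are a simplified assumption rather than the WTT classification; the function name is hypothetical.

```python
# Simplified character classes over TIS 620 byte values (an assumption).
ABOVE_BELOW_VOWELS = frozenset([0xD1, *range(0xD4, 0xDA)])
TONE_MARKS = frozenset(range(0xE8, 0xEC))


def normalise(data: bytes) -> bytes:
    """Reorder 'tone mark, vowel' pairs into 'vowel, tone mark'."""
    out = bytearray(data)
    i = 0
    while i + 1 < len(out):
        if out[i] in TONE_MARKS and out[i + 1] in ABOVE_BELOW_VOWELS:
            # Swap the pair so that the vowel comes first.
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return bytes(out)


a = normalise(bytes([0xB7, 0xD5, 0xE8]))
b = normalise(bytes([0xB7, 0xE8, 0xD5]))
print(a == b)
```

The function needs to see the tone mark and the vowel together, which illustrates why input arriving one character at a time is hard to normalise without buffering or callbacks.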

There is also an archive of selected discussions closely related to this issue on the ISO10646 mailing list back in 1992. These messages were bundled and distributed as document ISO/IEC JTC1/SC2 N2589.

[More on normalisation will be added after details on WTT character classification is added.]

With regard to TIS 620 characters, special attention should be paid to codepoint 0xD3 (Sara Am). This codepoint is an atomic unit: it is a vowel in its own right. It exists because Sara Am exists on a typical Thai typewriter, and the computer keyboard, which is modeled after the Thai typewriter, adopts it as is. Sara Am, when looked at from another angle, comprises two atomic glyphs: 0xED (Nikkahit) and 0xD2 (Sara Aa) -- both are also vowels. When rendered, Nikkahit is rendered above the preceding consonant while Sara Aa follows that consonant.
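One way to make the composed and decomposed spellings comparable is to fold Nikkahit plus Sara Aa back into Sara Am before comparison. The sketch below is an assumption about how such folding might be done, not a rule taken from TIS 620 or WTT; it also accepts a tone mark between the two parts, and the example byte strings are illustrative only.

```python
NIKKAHIT, SARA_AA, SARA_AM = 0xED, 0xD2, 0xD3
TONE_MARKS = range(0xE8, 0xEC)  # simplified tone mark class (an assumption)


def compose_sara_am(data: bytes) -> bytes:
    """Fold 'Nikkahit [tone mark] Sara Aa' into '[tone mark] Sara Am'."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        if data[i] == NIKKAHIT and i + 1 < n and data[i + 1] == SARA_AA:
            # Nikkahit directly followed by Sara Aa.
            out.append(SARA_AM)
            i += 2
        elif (data[i] == NIKKAHIT and i + 2 < n
              and data[i + 1] in TONE_MARKS and data[i + 2] == SARA_AA):
            # Nikkahit, tone mark, Sara Aa: keep the tone mark, then Sara Am.
            out += bytes([data[i + 1], SARA_AM])
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


composed = bytes([0xB9, 0xE9, 0xD3])          # consonant, tone mark, Sara Am
decomposed = bytes([0xB9, 0xED, 0xE9, 0xD2])  # consonant, Nikkahit, tone mark, Sara Aa
print(compose_sara_am(composed) == compose_sara_am(decomposed))
```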

To illustrate this, let's take a look at different keystrokes and encoded strings:

Users normally presume that all three sequences produce the same word (water). Unless the operating system and/or the application software is specifically made Thai-aware, sequences 1 and 2 are unequal, and sequence 3 may not, under certain Thai implementations, even be accepted by the system.

In order to make sure that inconsistencies would cease to exist, an implementation might have to adopt both approaches.

3) Rendering Thai script


IV. Thai Language Implementations on the Internet

1) Current Practices

2) Standard-based Approaches


V. Thai Script and Language Processing

1) How is Thai processing different from others?

2) Minimum Requirements


Acknowledgements

This paper is a concerted effort of many people who wish to see fully interoperable Thai implementations. The author would like to thank in particular the following contributors: