WARNING: This paper is a snapshot of a work in progress. It has been put online for review by a closed group and is by no means intended for wide circulation.
This paper is an effort to collect and annotate the relevant standards and the technical and implementation aspects of Thai language support on various computer platforms.
This paper uses blue text to convey general ideas, while it uses green text to distinguish annotations (the author's personal impressions) from the rest.


There is a Thai paper which, although it was written some time ago in 1993, describes very nicely the importance of standards and standardisation work, along with the general background of these activities. Readers who are not familiar with these standards activities are advised to read the paper IT and Open Systems Standards at least once in order to get more out of this writing.
In Thailand, the one and only Thai character set standard is TIS 620-2533, defined by the Thai Industrial Standards Institute (TISI), Ministry of Industry, Royal Thai Government. It was the work of the 536th Technical Committee, TC536, which was, and still is, in charge of Thai information technology standards. TIS 620-2533 is a revision of the earlier standard TIS 620-2529. The code point assignments in the two versions of TIS 620 are the same.
The -2529 and -2533 suffixes are years in the Buddhist Era (B.E.), the calendar commonly used in Thailand, and designate the years in which the standards were issued (1986 and 1990 A.D. respectively).
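The year suffix can be converted to the Common Era by subtracting 543, which agrees with the B.E./A.D. pairs quoted elsewhere in this paper (for example B.E.2538 and 1995 A.D.). A minimal sketch follows; the function name is hypothetical and only serves the illustration.

```python
BE_OFFSET = 543  # Buddhist Era year minus Common Era year


def tis_year_to_ce(designation: str) -> int:
    """Return the Common Era year encoded in a name such as 'TIS 620-2533'."""
    # The Buddhist Era year is the part after the last hyphen.
    be_year = int(designation.rsplit("-", 1)[1])
    return be_year - BE_OFFSET


print(tis_year_to_ce("TIS 620-2533"))  # 1990
print(tis_year_to_ce("TIS 620-2529"))  # 1986
```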
TIS 620 defines an eight-bit character environment. Assigned character values are given in a character table published by NECTEC.
TIS 620-2533 is a coded character set standard for information interchange. Many implementations made the big assumption that "character" and "glyph" are synonymous. TIS 620 was never meant to be a set of codepoint assignments for display; its sole purpose is information interchange. Each character is an atomic entity of the Thai language.
Due to limitations in their rendering engines, some platforms require separate codepoints for the same character when it is rendered differently. Brief examples of these can be viewed here under the topic Displaying.
In the early 1990s, the second subcommittee of TC536 (TISI/TC536/SC2) submitted TIS 620-2533 to the European Computer Manufacturers Association (ECMA), which acted as the registrar of the ISO 2375 Character Set Repertoire. ECMA registered TIS 620-2533 and assigned it International Registration number 166, ISO-IR-166. This registration is now maintained by ISO/IEC JTC1/SC2/WG3.
ISO-IR-166 is a superset of TIS 620-2533, i.e. the two are not exactly the same. The reason for this discrepancy lies in the fact that many countries had the same desire to use an 8-bit character environment, which is limited to a total of 256 code positions. To prevent ambiguity when a character is sent between computers that assume different 8-bit character environments, ISO/IEC 2022:1994 and ISO/IEC 4873:1991 mandate that character sets be declared by character set designation functions, and they define several ways to fit multiple character sets into the limited 8-bit character environment.
In order to construct the TIS 620-2533 8-bit character set, one must map the "English" portion of ISO/IEC 646 (ISO-IR-6, also known as US-ASCII) into G0 and the "Thai" portion of TIS 620-2533 into G1 by using the following escape sequences:
GZD4 04/02: ESC 02/08 04/02
G1D6 05/04: ESC 02/13 05/04
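The escape sequences are written in the column/row notation of ISO/IEC 2022, where position cc/rr stands for the byte value 16 * cc + rr. The sketch below turns the two sequences into raw bytes; the helper names are hypothetical and are not part of any standard.

```python
ESC = 0x1B  # position 01/11 in column/row notation


def col_row(position: str) -> int:
    """Convert a 'cc/rr' position into its byte value (16 * column + row)."""
    column, row = position.split("/")
    return int(column) * 16 + int(row)


def designation(*positions: str) -> bytes:
    """Build an escape sequence from the positions that follow ESC."""
    return bytes([ESC] + [col_row(p) for p in positions])


g0_iso646 = designation("02/08", "04/02")  # designate ISO-IR-6 to G0
g1_thai = designation("02/13", "05/04")    # designate the Thai set to G1

print(g0_iso646)  # b'\x1b(B'
print(g1_thai)    # b'\x1b-T'
```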
Apart from the fact that ISO-IR-166 was written in English while TIS 620 was in Thai, the only two differences between these two standards are:
- the definition of these two escape sequences, and,
- codepoint A0 which is defined in ISO-IR-166 as a no-break space character but reserved in TIS 620.
[To add explanation on G0/G1 and GL/GR]
IBM Corp has registered its Codepage 838 with the Internet Assigned Numbers Authority (IANA). This character set is also listed as a supported encoding by Sun Microsystems' Java Development Kit version 1.1; refer to the JDK 1.1 supported encoding documentation.
No code table is available online.
It is believed that CP838 is what many Thai implementors called the "KU Code", developed at Kasetsart University, Bangkhen Campus, by Prof Yuen Poovarawan, which IBM (Thailand) Co Ltd later adopted around the early 1980s. The "KU Code" predates TIS 620-2529, and the code assignments of the two character sets are not compatible.
At least in Thailand, the use of this character set over the Internet is found to be unpopular.
This character set is listed as a supported encoding by the JDK 1.1.
IBM provides a summary of how CP874 differs from Windows-874 (below). This summary is available from the Unicode Consortium.
This character set is what Microsoft Corp uses for its recent versions of the MS-DOS and the Windows families of operating systems. It is an extension of TIS 620-2533. This character set is listed as a supported encoding by the JDK 1.1.
This character set is defined and used by Apple Computer, Inc for its Thai implementation in the MacOS operating system and, probably, in its successor, Rhapsody. Likewise, this character set is a supported encoding in JDK 1.1.
ISO/IEC 10646-1 is an attempt to establish a well-defined set of coded character elements for all scripts. Each code point in the canonical form is 4 bytes. The two-byte short form, in which the two most significant bytes are zero and where most scripts are located, is called the Basic Multilingual Plane (BMP).
Thai characters are assigned to the "lower" half of page "0E" of the BMP, meaning the data elements are located in the range 00000E00 through 00000E7F, inclusive.
While one may access an online code assignment table for the BMP here, you will need a 10646/Unicode font to display it. Unicode fonts, such as Bitstream Cyberbit, can be downloaded free of charge over the Internet. However, the use of such a font may have undesirable effects on machines with little memory.
A cross reference table between ISO/IEC 10646-1 and TIS 620-2533 is available here.
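Because the Thai block of the BMP keeps the characters in the same relative order as TIS 620, the cross reference can be approximated by a constant offset. The sketch below assumes that the assigned TIS 620 positions are 0xA1 through 0xDA and 0xDF through 0xFB; the function names are hypothetical, and the published table remains the authoritative reference.

```python
TIS_TO_UCS_OFFSET = 0x0E00 - 0xA0  # 0xA1 in TIS 620 lands on 0x0E01


def tis620_byte_to_ucs(byte: int) -> int:
    """Map one TIS 620 byte to its ISO/IEC 10646 code point."""
    if byte < 0x80:
        return byte  # ISO 646 half is unchanged
    if 0xA1 <= byte <= 0xDA or 0xDF <= byte <= 0xFB:
        return byte + TIS_TO_UCS_OFFSET
    raise ValueError(f"0x{byte:02X} is not assigned in TIS 620")


def tis620_to_text(data: bytes) -> str:
    """Decode a TIS 620 byte string into a Python (Unicode) string."""
    return "".join(chr(tis620_byte_to_ucs(b)) for b in data)


sample = bytes([0xB7, 0xD5, 0xE8])
print([hex(tis620_byte_to_ucs(b)) for b in sample])  # ['0xe17', '0xe35', '0xe48']
```

All resulting code points fall inside the 0E00 through 0E7F range described above.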
The Unicode Consortium is an independent group of organizational and individual members who have a common interest in solving the classic problem of internationalization (i18n) of computer software. The first piece of the i18n puzzle that the consortium chose to address was the Unicode standard, which is currently at version 2.1.
Unicode is a two-octet character table. The code assignments of Unicode 2.0 at its time of release were exactly the same as those of the Two-Octet Form (UCS-2), that is, the code assignments of the BMP in ISO/IEC 10646-1:1993.
The "Thai portion" of Unicode is assigned to the "lower" half of page "0E", meaning the data elements range from 0E00 through 0E7F, inclusive.
Part 11 of the ISO/IEC 8859 series (8-bit single-byte coded graphic character sets), Latin/Thai, was published in late 2001.
In B.E.2538 (1995 A.D.), the Thai Industrial Standards Institute issued a revision of the standard Thai keyboard layout, dubbed TIS 820-2538, which supersedes the prior version of the standard released seven years earlier, TIS 820-2531.
Three less commonly used Thai characters and three commonly used symbols were added to the layout of TIS 820-2531. These additions appear in shaded keycaps in the layout here.
Between the late 1980s and the early 1990s, the Thai software market was in full swing. The Thai PC market was then still small, and no operating system vendor other than Apple Computer Inc had made an effort to address the needs of the Thai language.
Such needs were so strong that a niche hardware market came into existence: the so-called "Thai Card", a modified video adaptor together with appropriate drivers. However, since every vendor used proprietary drivers and APIs, application software had to be adapted to the different driver of each Thai Card.
In this confusing situation, consumers lost the most. One application might be tied to one Thai card and driver, while another required a separate computer with a different configuration.
In 1991, a group of computer professionals, headed by Dr. Thaweesak Koanantakool of the Information Processing Institute for Education and Development, Thammasat University, joined hands and formed the Thai API Consortium (TAPIC) to address these serious incompatibility problems. The work of TAPIC, facilitated by the National Electronics and Computer Technology Center (NECTEC), was called the "WTT 2.0" specifications.
WTT 2.0 was a major overhaul of the WTT 1.0 system developed by Koanantakool and his team a year earlier. While it addressed the then current problems in the Thai personal computer market, many of its features have since been adopted by major systems software vendors. Digital Equipment Corporation (now Compaq) adopted the entire WTT I/O behavior in its Thai versions of the OpenVMS and Digital UNIX operating systems. Microsoft adopted the WTT input method and cursor movement behaviors in its Thai versions of Windows95/98 and WindowsNT. The WTT Keyboard Layout was a strong push for TISI to revise its Thai Keyboard Layout standard (TIS 820-2538:1995). The Thai Locale for the X Window System has been part of the source pool since X11R6.3, using WTT's TACSIS charset.
The social implications of WTT for the Thai computer society were published in the Bangkok Post Annual Information Technology Directory 1992, in the article the Fate of Thai Localisation.
The ZzzThai Group of the University of Electro-communications, Japan, has made a very nice illustration of the basic problems of implementing Thai on computers.
When the first mechanical Thai typewriter was invented in 1891 (B.E.2434), it didn't matter which order of keystrokes a typist typed as long as all characters were typed in correctly.
Typing a consonant, then a vowel, then a tone mark makes a given word on paper.
Typing the same consonant, then the tone mark, then the vowel also makes the same word on paper.
This is no longer true with computers. Different orders make different strings. No commercial operating system will report equality when a string compare function is called with the strings {0xB7 0xD5 0xE8} and {0xB7 0xE8 0xD5}, even though the users who typed them probably thought both were the same. No commercial database will produce the same hash value for both strings; hence a query returns no match if the data was entered in one sequence but queried in the other.
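The mismatch is easy to reproduce with the two byte strings quoted above. The sketch below uses SHA-256 only as a stand-in for whatever hash function a database might use.

```python
import hashlib

vowel_first = bytes([0xB7, 0xD5, 0xE8])  # consonant, vowel, tone mark
tone_first = bytes([0xB7, 0xE8, 0xD5])   # consonant, tone mark, vowel

# A plain comparison sees two different strings.
print(vowel_first == tone_first)  # False

# Both strings nevertheless contain exactly the same characters.
print(sorted(vowel_first) == sorted(tone_first))  # True

# A hash-based lookup would miss as well.
same_hash = (hashlib.sha256(vowel_first).digest()
             == hashlib.sha256(tone_first).digest())
print(same_hash)  # False
```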
Since the first implementation of a Thai computer I/O device circa 1968 (B.E.2511), this subtle problem has been with Thai computer users.
The Typewriting textbook by the Vocational Education Department, Thai Ministry of Education, mandates the correct order for typing combining characters: a vowel is always typed before a tone mark or any other special mark. In the example above, the first sequence is the correct one.
Since most Thais teach themselves to type, a large percentage of Thai text is entered either in consistently incorrect sequences or in mixed order. As computerisation became more popular and most businesses turned away from typewriters to word processors, the problem was amplified by a large factor.
To fix this problem, there are two schools of thought. One approach is to disallow a wrong sequence from being entered into the system. The champion of this approach is the WTT Input/Output Method Specification, which was developed by the Thai API Consortium (TAPIC) around 1990-91. Microsoft Corp's keyboard driver for the Thai language also follows this specification.
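The first approach can be pictured as a filter between the keyboard and the text buffer. The sketch below is not the WTT specification, whose character classification is not reproduced in this paper; it applies a single simplified rule (an above or below vowel may not directly follow a tone mark), and the character sets and function names are assumptions made for the illustration.

```python
from typing import Iterable, Optional

# Simplified classes for TIS 620 bytes (assumed, not the WTT tables).
ABOVE_BELOW_VOWELS = {0xD1, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9}
TONE_MARKS = {0xE8, 0xE9, 0xEA, 0xEB}


def accept_key(previous: Optional[int], incoming: int) -> bool:
    """Reject a vowel that is typed directly after a tone mark."""
    return not (previous in TONE_MARKS and incoming in ABOVE_BELOW_VOWELS)


def type_keys(keys: Iterable[int]) -> bytes:
    """Feed keystrokes through the filter and return what was accepted."""
    accepted = bytearray()
    for key in keys:
        previous = accepted[-1] if accepted else None
        if accept_key(previous, key):
            accepted.append(key)
    return bytes(accepted)


print(type_keys([0xB7, 0xD5, 0xE8]))  # b'\xb7\xd5\xe8' (all keys accepted)
print(type_keys([0xB7, 0xE8, 0xD5]))  # b'\xb7\xe8' (late vowel rejected)
```

Because the wrong order never reaches storage, no later comparison has to deal with it; the cost is that the user must retype the rejected key in the right place.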
The other approach is to normalise Thai strings into the correct order before using them for processing. Normalisation takes place in every input subsystem (keyboard read, file read, network read, etc.), but this approach cannot guarantee a well-behaved string when input is made one character at a time, unless a complex callback capability is available. This limitation was the primary reason that TAPIC did not adopt this approach.
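A minimal sketch of the second approach follows. It assumes the same simplified rule as the typing textbook (a vowel precedes a tone mark) and a hand-picked set of TIS 620 bytes; it is an illustration, not a complete normaliser.

```python
# Simplified classes for TIS 620 bytes (assumed for this illustration).
ABOVE_BELOW_VOWELS = {0xD1, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9}
TONE_MARKS = {0xE8, 0xE9, 0xEA, 0xEB}


def normalise(data: bytes) -> bytes:
    """Swap every (tone mark, vowel) pair into (vowel, tone mark)."""
    out = bytearray(data)
    for i in range(len(out) - 1):
        if out[i] in TONE_MARKS and out[i + 1] in ABOVE_BELOW_VOWELS:
            out[i], out[i + 1] = out[i + 1], out[i]
    return bytes(out)


wrong_order = bytes([0xB7, 0xE8, 0xD5])
print(normalise(wrong_order))  # b'\xb7\xd5\xe8'

# Normalising one character at a time has nothing to reorder.
per_character = b"".join(normalise(bytes([b])) for b in wrong_order)
print(per_character == wrong_order)  # True
```

The last two lines show the limitation described above: when the input arrives one character at a time, each call sees a single byte, so the wrong order passes through unchanged.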
There is also an archive of selected discussions closely related to this issue on the ISO10646 mailing list back in 1992. These messages were bundled and distributed as document ISO/IEC JTC1/SC2 N2589.
[More on normalisation will be added after details on WTT character classification is added.]
With regard to TIS 620 characters, special attention should be paid to codepoint 0xD3 (Sara Am). This codepoint is an atomic unit: it is a vowel in its own right. It exists because Sara Am exists on a typical Thai typewriter, and the computer keyboard, which is modelled after the Thai typewriter, adopts it as is. Looked at from another angle, Sara Am comprises two atomic glyphs: 0xED (Nikkahit) and 0xD2 (Sara Aa), both of which are also vowels. When rendered, Nikkahit is placed above the preceding consonant while Sara Aa follows that consonant.
To illustrate this, let's take a look at different keystrokes and encoded strings:
- Three-keystroke sequence: 0xB9 (No Nu) 0xE9 (Mai Tho) 0xD3 (Sara Am)
- Four-keystroke sequence: 0xB9 (No Nu) 0xED (Nikkahit) 0xE9 (Mai Tho) 0xD2 (Sara Aa)
- Incorrect (tonemark precedes the nikkahit vowel) four-keystroke sequence: 0xB9 (No Nu) 0xE9 (Mai Tho) 0xED (Nikkahit) 0xD2 (Sara Aa)
Users normally presume that all three sequences produce the same word (water). Unless the operating system and/or the application software is specifically made Thai aware, sequences 1 and 2 compare unequal, and sequence 3 may not, under certain Thai implementations, even be accepted by the system.
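One way to make the three sequences compare equal is to reduce them to a single form before comparison. The sketch below assumes, purely for illustration, that the four-byte form of sequence 2 is the target: Sara Am is split into Nikkahit plus Sara Aa, and a Nikkahit found directly after a tone mark is moved in front of it. The function name and the choice of target form are assumptions, not part of TIS 620 or WTT.

```python
SARA_AA, SARA_AM, NIKKAHIT = 0xD2, 0xD3, 0xED
TONE_MARKS = {0xE8, 0xE9, 0xEA, 0xEB}


def canonical_sara_am(data: bytes) -> bytes:
    """Decompose Sara Am and place Nikkahit before an adjacent tone mark."""
    out = bytearray()
    for byte in data:
        # Step 1: treat Sara Am as the pair Nikkahit + Sara Aa.
        pieces = [NIKKAHIT, SARA_AA] if byte == SARA_AM else [byte]
        for piece in pieces:
            # Step 2: a Nikkahit right after a tone mark swaps places with it.
            if piece == NIKKAHIT and out and out[-1] in TONE_MARKS:
                tone = out.pop()
                out += bytes([NIKKAHIT, tone])
            else:
                out.append(piece)
    return bytes(out)


sequence_1 = bytes([0xB9, 0xE9, 0xD3])
sequence_2 = bytes([0xB9, 0xED, 0xE9, 0xD2])
sequence_3 = bytes([0xB9, 0xE9, 0xED, 0xD2])

forms = {canonical_sara_am(s) for s in (sequence_1, sequence_2, sequence_3)}
print(len(forms))  # 1
```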
In order to make sure that inconsistencies would cease to exist, an implementation might have to adopt both approaches.
This paper is a concerted effort of many people who wish to see fully interoperable Thai implementations. The author would like to thank in particular the following contributors:
- Surayuth Boonmatat, Thai Industrial Standards Institute, Ministry of Industry, Thailand
- Joris Goetschalckx, the European Commission, Belgium
- Thai Project, National Center for Science Information Systems, Japan
- ZzzThai Group, University of Electro-communications, Japan
- Andreas Prilop, Universitaet Hannover, Germany