Javascript & Unicode: The Unconsummated Marriage

Why We Need Better Unicode Support for Javascript, And What Can Be Done About It

One of the smartest things ECMA did was to require Javascript engines to use Unicode strings. This has important implications beyond the simple ability to represent any character in any language. It actually enables important new functionality, which I will explain shortly.

But first, let us meet the participants in this marriage.

The Groom: Unicode

The Unicode Standard comes with ample dowry: 246,877 assigned code points, covering 90 world scripts (as of version 5.2). This is an impressive database, and the Unicode Consortium has even supplied it with several indexing options. One of these indices is the General Category of each character, which tells us what type of character it is: Uppercase Letter (Lu), Lowercase Letter (Ll), Decimal Number (Nd), Punctuation, or any of the additional 26 categories.

Unicode also has a hidden treasure — the Numeric Value Property:

Codepoint  Character  Name                           Numeric Value
U+0031     1          DIGIT ONE                      1
U+0032     2          DIGIT TWO                      2
U+0033     3          DIGIT THREE                    3
U+0661     ١          ARABIC-INDIC DIGIT ONE         1
U+0662     ٢          ARABIC-INDIC DIGIT TWO         2
U+0663     ٣          ARABIC-INDIC DIGIT THREE       3
U+2155     ⅕          VULGAR FRACTION ONE FIFTH      0.2
U+2156     ⅖          VULGAR FRACTION TWO FIFTHS     0.4
U+2157     ⅗          VULGAR FRACTION THREE FIFTHS   0.6
U+2158     ⅘          VULGAR FRACTION FOUR FIFTHS    0.8

Although not all characters have a numeric value property, all numerals, letter numbers (e.g., Roman numerals, such as VII or IX), ideographic numbers and others are associated with a numeric value.

The implication is that any program wishing to support numerals in non-Latin scripts can easily do so by simply using the numeric value property from the Unicode table. This can be applied to:

  • numerically sorting arrays
  • identifying digits that are to be dialed by a mobile application
  • validating and interpreting user input
  • a variety of additional practical uses
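
To make the idea concrete, here is a minimal sketch of such a lookup. Javascript gives us no access to the Unicode database, so the table below is copied by hand from the rows shown above; the names NUMERIC_VALUE and numericValue() are made up for this illustration and are not part of any standard API.

```javascript
// Hand-copied excerpt of the Numeric Value property (only the rows from the table above).
// NUMERIC_VALUE and numericValue() are hypothetical names used for illustration.
var NUMERIC_VALUE = {
  "\u0031": 1,   "\u0032": 2,   "\u0033": 3,    // DIGIT ONE..THREE
  "\u0661": 1,   "\u0662": 2,   "\u0663": 3,    // ARABIC-INDIC DIGIT ONE..THREE
  "\u2155": 0.2, "\u2156": 0.4, "\u2157": 0.6,  // VULGAR FRACTION ONE..THREE FIFTHS
  "\u2158": 0.8                                 // VULGAR FRACTION FOUR FIFTHS
};

// Returns the numeric value of a single character, or NaN if the table has no entry for it.
function numericValue(ch) {
  return Object.prototype.hasOwnProperty.call(NUMERIC_VALUE, ch)
    ? NUMERIC_VALUE[ch]
    : NaN;
}
```

A real implementation would generate the table from the Unicode data files rather than type it in by hand.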

Kudos to the guys at the Unicode Consortium! This is really excellent work.

The Bride: Javascript

Having endured my fair share of suffering as a result of handling international languages in C++ and other golden oldie programming languages and OSs (does wchar_t ring a bell, anyone?), I truly appreciate the fact that Javascript uses Unicode for its internal string representation.

However, this is as far as this bride is willing to go. If you actually try to use Javascript’s Unicode capabilities in real applications, you will find yourself banging your head against the wall with every new step you take.

For example, have a look at the following input validation function for an “Age” field:

function validateAge(value) {
   return /^\d+$/.test(value);
}

The regular expression used by this function ensures that all characters of the supplied value are digits. This is achieved by ensuring that the entire string matches \d+, which in turn matches one or more digits. (The ^ sign and the $ sign anchor the match to the start and end of the tested string respectively.) This is fine and dandy when dealing with ASCII digits. But what happens when the user enters Indic or Arabic digits? Will it still work?

The answer is no: if you test this validation function with Arabic-Indic digits on any major browser, it rejects the input, even though the input is perfectly valid. The ECMAScript Standard explains why this happens (section 15.10.2.12):
“The production CharacterClassEscape :: d evaluates by returning the ten-element set of characters containing the characters 0 through 9 inclusive.”

In plain English, \d only matches ASCII digits, which means that \d is equivalent to [0-9]. We could have just written:

function validateAge(value) {
   return /^[0-9]+$/.test(value);
}

To include the Arabic-Indic numerals, we also need to add the U+0660 – U+0669 range, as follows:

function validateAge(value) {
   return /^[0-9\u0660-\u0669]+$/.test(value);
}
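
A quick side-by-side check shows the difference between the two versions. The function names validateAgeAscii() and validateAgeArabic() are renamed copies of the functions above, introduced here only so both can live in one snippet.

```javascript
// The original \d-based check and the extended check, under distinct (hypothetical) names.
function validateAgeAscii(value) {
   return /^\d+$/.test(value);
}
function validateAgeArabic(value) {
   return /^[0-9\u0660-\u0669]+$/.test(value);
}

var arabicAge = "\u0662\u0665"; // "25" written with Arabic-Indic digits
var thaiAge = "\u0E52\u0E55";   // "25" written with Thai digits

var results = [
   validateAgeAscii("25"),       // true: ASCII digits
   validateAgeAscii(arabicAge),  // false: \d only covers 0 through 9
   validateAgeArabic(arabicAge), // true: the extra range covers U+0660..U+0669
   validateAgeArabic(thaiAge)    // false: Thai digits are still left out
];
```

The last result is the catch: every additional script needs its own range.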

Javascript’s (Lack of) Support of Unicode Regex

Although this might be a good solution for Arabic, what if we want to support other languages and scripts that have their own numerals, such as Bengali, Thai, Lao, Tibetan and Myanmar? It would make sense to use the Unicode General Category “Number, Decimal Digit” (“Nd” for short) mentioned above. This category includes 411 characters that are all different world script numerals. And being Unicode compliant, one would expect the Javascript regular expression \d to actually match all of the characters in the Nd category.

Unfortunately, this is not the case. It turns out that Javascript’s regular expressions are only halfway Level 1 compliant with the Unicode Regular Expression Standard. Level 1 simply means that the regular expression engine can deal with Unicode characters and match these characters based on their hexadecimal values (implemented in Javascript via the \u escape sequence).

So let me spell it out: Level 1 Support does not actually provide a great deal of support for handling international scripts. In the Standard’s language it goes like this:
“Level 1 is the minimally useful level of support for Unicode. All regex implementations dealing with Unicode should be at least Level 1.”
But Javascript is not even Level 1 conformant, as the Standard also explicitly requires the handling of character classes based on the character’s General Category. Had it met this requirement, Javascript would probably allow something like [:Nd:] or \p{Nd} to match decimal numerals. Then we could write our validation function as follows:

function validateAge(value) {
   return /^[:Nd:]+$/.test(value);
}
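
For readers on newer engines: the ECMAScript 2018 edition added Unicode property escapes, so wherever the u flag and \p{...} are available, the wish above can be written for real. The sketch below assumes such an engine (older ones reject the pattern), and validateAgeUnicode() is a name made up for this example.

```javascript
// Assumes an engine with ES2018 Unicode property escapes (the `u` flag plus \p{...}).
// validateAgeUnicode() is a hypothetical name, not part of any standard API.
function validateAgeUnicode(value) {
   // \p{Nd} matches any character whose General Category is "Number, Decimal Digit".
   return /^\p{Nd}+$/u.test(value);
}

var checks = [
   validateAgeUnicode("25"),           // true: ASCII digits
   validateAgeUnicode("\u0662\u0665"), // true: Arabic-Indic digits
   validateAgeUnicode("\u0E52\u0E55"), // true: Thai digits
   validateAgeUnicode("2a")            // false: a letter is not a decimal digit
];
```

Note that this only validates the input. Turning the matched digits into a number is a separate problem.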

Unicode-Style parseInt() and parseFloat()

What about parseInt()? This Javascript function basically takes a string argument and converts it into an integer. For example, if we want to calculate the year of birth based on the Age field, we could implement something like this:

function getBirthYear(age) {
   // getFullYear() returns the four-digit year; getYear() returns the year minus 1900
   return new Date().getFullYear() - parseInt(age, 10);
}

We can call this while supplying the Age field content:

getBirthYear(ageField.value)

Assuming ageField is an INPUT field, we would obtain its value as a string, and getBirthYear() would convert it into an integer, using the parseInt() global function. However, this does not work when the user enters the age using non-Latin numerals.

Theoretically, by using the numeric value property supplied by the Unicode Standard, it should be possible to create a parseUniInt() function – a sibling of the standard parseInt() that also handles non-Latin numerals. The same goes for parseFloat(). It would be extremely convenient to obtain the numeric value of a vulgar fraction that happens to be represented by a Unicode character:

Codepoint  Character  Name                           Numeric Value
U+2155     ⅕          VULGAR FRACTION ONE FIFTH      0.2
U+2156     ⅖          VULGAR FRACTION TWO FIFTHS     0.4
U+2157     ⅗          VULGAR FRACTION THREE FIFTHS   0.6
U+2158     ⅘          VULGAR FRACTION FOUR FIFTHS    0.8

Unfortunately, Javascript does not actually support this.
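
Nothing stops us from approximating it by hand, though. Below is a rough sketch of what a parseUniInt() could look like. It relies on the fact that each script's decimal digits are encoded as a run of ten consecutive code points starting at zero. The list of scripts is hand-picked and incomplete, the helper digitValue() is made up for this example, and signs, leading whitespace and radixes are ignored.

```javascript
// Code points of DIGIT ZERO in a few scripts (a hand-picked, incomplete list):
// ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari, Bengali, Thai, Lao, Tibetan, Myanmar.
var ZERO_CODE_POINTS = [0x0030, 0x0660, 0x06F0, 0x0966, 0x09E6, 0x0E50, 0x0ED0, 0x0F20, 0x1040];

// Returns 0..9 for a known decimal digit, or -1 for anything else. (Hypothetical helper.)
function digitValue(ch) {
   var code = ch.charCodeAt(0);
   for (var i = 0; i < ZERO_CODE_POINTS.length; i++) {
      var offset = code - ZERO_CODE_POINTS[i];
      if (offset >= 0 && offset <= 9) return offset;
   }
   return -1;
}

// Like parseInt(text, 10), this reads leading digits and stops at the first non-digit.
// Unlike parseInt(), it does not skip whitespace or accept a sign.
function parseUniInt(text) {
   var str = String(text), result = 0, seenDigit = false;
   for (var i = 0; i < str.length; i++) {
      var d = digitValue(str.charAt(i));
      if (d < 0) break;
      result = result * 10 + d;
      seenDigit = true;
   }
   return seenDigit ? result : NaN;
}
```

With this in place, parseUniInt("\u0662\u0665") gives 25, just as parseInt("25", 10) does. The point stands, however: every application has to carry its own table.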

Sorting The Unicode Way

Another useful feature of the Numeric Value Unicode property is the ability to sort arrays in numerical order, instead of textual order. To illustrate the problem, let us consider the following example:

var melting = ["2300°F Manganese", "1946°F Gold", "786°F Zinc", "450°F Tin"];
melting.sort();
alert(melting.join(',')); // displays: 1946°F Gold,2300°F Manganese,450°F Tin,786°F Zinc

Note that Javascript performs textual sorting by default. This means that “450°F Tin” is placed after “2300°F Manganese”, and “786°F Zinc” is placed last. In order to sort by numeric value, we need to supply our own comparison function, which should return a positive number, zero, or a negative number, according to the relative order between its arguments a and b:

melting.sort(function(a, b) {
   // parseInt() reads the leading digits and stops at the first non-digit character
   var i = parseInt(a, 10), j = parseInt(b, 10);
   return i > j ? 1 : i === j ? 0 : -1;
});
alert(melting.join(',')); // displays: 450°F Tin,786°F Zinc,1946°F Gold,2300°F Manganese

It is easy to do this with numbers written with ASCII digits, but what about Thai numerals? Again, if the Javascript engine were to supply a parseUniInt() function, this would be a piece of cake:

var melting = ["๒๓๐๐°ฟ แมงกานีส", "๑๙๔๖°ฟ ทอง", "๗๘๖°ฟ สังกะสี", "๔๕๐°ฟ ดีบุก"];
melting.sort(function(a, b) {
   var i = parseUniInt(a), j = parseUniInt(b);
   return i > j ? 1 : i == j ? 0 : -1;
});
alert(melting.join(',')); // displays: ๔๕๐°ฟ ดีบุก,๗๘๖°ฟ สังกะสี,๑๙๔๖°ฟ ทอง,๒๓๐๐°ฟ แมงกานีส
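
Until something like that exists, one workaround is to map the Thai digits to their ASCII counterparts first and then hand the result to the ordinary parseInt(). The sketch below does exactly that. It handles Thai digits only (U+0E50 to U+0E59), and thaiToAsciiDigits() is a helper name invented for this example.

```javascript
// Replaces each Thai digit (U+0E50..U+0E59) with the matching ASCII digit.
// thaiToAsciiDigits() is a hypothetical helper; other scripts would need their own ranges.
function thaiToAsciiDigits(text) {
   return text.replace(/[\u0E50-\u0E59]/g, function(ch) {
      return String(ch.charCodeAt(0) - 0x0E50);
   });
}

var melting = ["๒๓๐๐°ฟ แมงกานีส", "๑๙๔๖°ฟ ทอง", "๗๘๖°ฟ สังกะสี", "๔๕๐°ฟ ดีบุก"];
melting.sort(function(a, b) {
   // Normalize the digits, then let parseInt() read the leading number as usual.
   return parseInt(thaiToAsciiDigits(a), 10) - parseInt(thaiToAsciiDigits(b), 10);
});
// melting is now ordered 450, 786, 1946, 2300
```

It works, but it is one more piece of plumbing that each application has to write for each script it wants to support.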

An Unanswered Call From The Standard Committee

Note that Javascript’s regular expression syntax is governed by the ECMA Standard. It therefore seems that ECMA is the party responsible for not going all the way in Javascript’s marriage to Unicode. I sincerely hope this will be corrected in future versions of the ECMAScript Standard.

That being said, it seems that the Javascript engine implementers are also at fault. Section 2 of the Standard explicitly states the following:

“A conforming implementation of ECMAScript is permitted to support program and regular expression syntax not described in this specification.”

This is an open call from the Standard Committee to the Javascript engine implementers. I wonder why this call was never answered. It might be because the people who implement these engines are not the people who later use them to build international web applications.

As both a Javascript developer and a user, I would like to say that we need full Level 1 support for the Javascript regex engine. We need Unicode-aware parseInt() and parseFloat() in Javascript. These will enable application developers to make their apps useful for international users — users who are rapidly becoming the majority of the web audience. We need to make their experience as local and convenient as that of English-speakers.

What about you? Have you ever developed an international web application and dealt with these challenges? Please share your experiences.
