Unicode Numbers In Javascript

Handling Non-ASCII Numerals In Javascript — The Way It Should Have Been Handled

Source code is available at uninums on GitHub.

A couple of weeks ago, I ranted about the lack of proper Unicode support in Javascript. Granted, Javascript supports Unicode strings, but if you want to parse such strings to numbers (e.g., the user enters a phone number using Chinese numerals), you will have to handle this yourself. So here is a small utility script that implements five methods for handling non-ASCII numerals in Javascript:

Function	Description
normalDigits(s)	Normalizes string s by replacing all non-ASCII digits with ASCII digits.
   normalDigits('٠۴६') == '046'
   normalDigits('123') == '123'
normalSpaces(s)	Normalizes string s by replacing all whitespace characters with either a space ('\x20') or a newline ('\n') as appropriate. As a special case, normalSpaces() also replaces a CRLF pair with a single newline character.
   normalSpaces('Hello\t\rWorld') == 'Hello\x20\nWorld'
   normalSpaces('\xA0\u2003') == '\x20\x20'
   normalSpaces('\u2028') == '\n'
   normalSpaces('\r\n') == '\n'
parseUniInt(s,r)	Returns the integer value at the start of string s, ignoring leading spaces and using radix r. This is equivalent to the behavior of Javascript’s internal parseInt() function, but also handles non-ASCII digits. When the radix is omitted and the string has a leading zero, the result depends on the engine, exactly as with parseInt(): older engines assume radix 8, while engines that follow ECMAScript 5 or later assume radix 10.
   parseUniInt('٠۴६', 10) == parseInt('046', 10) == 46
   parseUniInt('٠۴६') == parseInt('046') // 38 where a leading zero means radix 8, otherwise 46
   parseUniInt('٠۴६hello') == parseInt('046hello') // 38 or 46, as above
   parseUniInt('hello') == parseInt('hello') // both are NaN
parseUniFloat(s)	Returns the float value at the start of string s, ignoring leading spaces. This is equivalent to the behavior of Javascript’s internal parseFloat() function, but also handles non-ASCII digits.
   parseUniFloat('٠۴.६') == parseFloat('04.6') == 4.6
   parseUniFloat('٠۴.६hello') == parseFloat('04.6hello') == 4.6
   parseUniFloat('hello') == parseFloat('hello') // both are NaN
sortNumeric(a)	Sorts array a according to the numeric float values of its items. Note that Javascript’s internal sort() function, called without a comparator, orders '10 cats' before '2 mice' because it compares items as strings rather than as numbers.
   sortNumeric(['3 dogs','10 cats','2 mice']) gives ['2 mice','3 dogs','10 cats']
   sortNumeric(['٣ dogs','١٠ cats','٢ mice']) gives ['٢ mice','٣ dogs','١٠ cats']

All of these functions are available in the uninums.js file. You are welcome to use/modify/redistribute it as you see fit.

Let’s Start With The Space Normalization Function

The Javascript Standard published by ECMA states that all of the following Unicode characters should be treated as whitespace:

Code Unit Value	Name
\u0009	Tab
\u000B	Vertical Tab (‘\v’)
\u000C	Form Feed (‘\f’)
\u0020	Space (‘ ‘)
\u00A0	No-break space
\uFEFF	Byte Order Mark
Other category “Zs”	Any other Unicode “space separator”

In version 5.2 of the Unicode Standard, the “Zs” category adds the following characters:

Code Unit Value	Name
\u1680	Ogham Space Mark
\u180E	Mongolian Vowel Separator
\u2000	En Quad
\u2001	Em Quad
\u2002	En Space
\u2003	Em Space
\u2004	Three-per-em space
\u2005	Four-per-em space
\u2006	Six-per-em space
\u2007	Figure space
\u2008	Punctuation space
\u2009	Thin space
\u200A	Hair space
\u202F	Narrow no-break space
\u205F	Medium mathematical space
\u3000	Ideographic space

So the normalSpaces() function should basically replace every occurrence of any of these characters with a simple space ('\x20'):

function normalSpaces(s) {
   var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
      '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
      '\u202F\u205F\u3000\uFEFF]', 'g');

   return s ? s.toString().replace(Zs_and_friends, ' ') : s;
}

We would also like to replace line terminators with newline characters. The Javascript Standard says that all of the following should be treated as line terminators:

Code Unit Value	Name
\u000A	Line Feed (‘\n’)
\u000D	Carriage Return (‘\r’)
\u2028	Line separator
\u2029	Paragraph separator

It also says that a CRLF sequence should be treated as a single line terminator.

We want to normalize line terminators as well:

function normalSpaces(s) {
   var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
      '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
      '\u202F\u205F\u3000\uFEFF]', 'g');

   var line_terminators = new RegExp('\u000D\u000A|[\u000D\u2028\u2029]', 'g');

   return s ? s.toString().replace(Zs_and_friends,' ').replace(line_terminators,'\n') : s;
}
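
The order of the alternatives in line_terminators matters: the CRLF pair has to be tried before the lone carriage return, otherwise the two halves of the pair would be handled separately. Here is a small sketch of the difference. The names crlfFirst and charsOnly are made up for this illustration and do not appear in uninums.js.

```javascript
// Pair first: '\r\n' is consumed as one match, so it yields a single '\n'.
var crlfFirst = /\r\n|[\r\u2028\u2029]/g;
// Hypothetical variant without the pair: only lone characters are matched.
var charsOnly = /[\r\u2028\u2029]/g;

var text = 'one\r\ntwo\rthree\u2028four';

var good = text.replace(crlfFirst, '\n');
var bad = text.replace(charsOnly, '\n');

console.log(JSON.stringify(good)); // "one\ntwo\nthree\nfour"
console.log(JSON.stringify(bad));  // "one\n\ntwo\nthree\nfour"
```

With the second pattern the '\r' of the pair becomes a newline while the original '\n' stays in place, so a Windows-style line break turns into an empty line.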

Implementing The Digit Normalization Function

The normalDigits() function is implemented in a similar manner to the normalSpaces() function, with one difference — it uses 10 regular expressions, one for each digit:

function normalDigits(s) {
   if (!s) return s;
   s = s.toString();
   for (var i = 0; i <= 9; ++i) s = s.replace(Nd[i], i);
   return s;
}

The Nd variable is an array which contains 10 elements, each of which is a regular expression matching all Unicode characters that represent the same decimal digit. (I will not list them here as they are long, but you can find their definition in uninums.js.) All in all, they amount to 411 characters.
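
To give a feel for the shape of Nd without reproducing the full table, here is a reduced sketch. It covers only three scripts (Arabic-Indic, Extended Arabic-Indic and Devanagari) and relies on the fact that each of these blocks encodes its digits 0 to 9 at consecutive code points. The names ZERO_CODE_POINTS, buildDigitPatterns, SampleNd and normalDigitsSample are made up for this illustration; the real Nd in uninums.js is far larger and may well be built differently.

```javascript
// Code points of the digit zero in three scripts (an illustrative subset only).
var ZERO_CODE_POINTS = [0x0660, 0x06F0, 0x0966];

// Build one character-class regex per decimal digit value 0..9.
function buildDigitPatterns(zeros) {
   var patterns = [];
   for (var d = 0; d <= 9; ++d) {
      var chars = '';
      for (var j = 0; j < zeros.length; ++j) {
         chars += String.fromCharCode(zeros[j] + d);
      }
      patterns.push(new RegExp('[' + chars + ']', 'g'));
   }
   return patterns;
}

var SampleNd = buildDigitPatterns(ZERO_CODE_POINTS);

// Same loop as normalDigits(), but driven by the reduced table.
function normalDigitsSample(s) {
   if (!s) return s;
   s = s.toString();
   for (var i = 0; i <= 9; ++i) s = s.replace(SampleNd[i], String(i));
   return s;
}

console.log(normalDigitsSample('\u0660\u06F4\u096C')); // 046
```

The sample string mixes one digit from each of the three scripts, which is the same '٠۴६' input used in the table at the top of this post.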

Implementing The Parse Functions

Having implemented the normalization functions, most of the hard work has already been done. We can now easily implement the parseUniInt() and parseUniFloat() functions:

function parseUniInt(s, radix) {
   return parseInt(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s, radix);
}

function parseUniFloat(s) {
   return parseFloat(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s);
}

Note that if s is not already a number, we should first convert it to a string using the toString() function, then normalize it with our normalization functions, and finally pass the resulting string to the standard parseInt() or parseFloat().

It is clear why we normalize the digits, but why normalize the spaces as well? Well, according to the Javascript Standard, parseInt() and parseFloat() should strip leading spaces before parsing commences, so we’re normalizing them just in case the Javascript engine does not understand non-ASCII spaces.
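
The guard at the start of both functions decides what gets normalized: only truthy non-number values go through toString() and the normalizers, while numbers and falsy values such as null are handed to the native parser untouched. The sketch below shows the effect with a stand-in normalizer that handles only Arabic-Indic digits. The names toAsciiSample and parseIntSample are hypothetical and not part of uninums.js.

```javascript
// Stand-in for normalDigits(): maps only U+0660..U+0669 to '0'..'9'.
function toAsciiSample(s) {
   return s.replace(/[\u0660-\u0669]/g, function (ch) {
      return String(ch.charCodeAt(0) - 0x0660);
   });
}

// Same guard as parseUniInt(), with the stand-in normalizer.
function parseIntSample(s, radix) {
   return parseInt(s && typeof s != 'number' ? toAsciiSample(s.toString()) : s, radix);
}

console.log(parseInt('\u0664\u0666', 10));       // NaN: the native parser sees no digits
console.log(parseIntSample('\u0664\u0666', 10)); // 46
console.log(parseIntSample(46.9, 10));           // 46: numbers skip normalization
console.log(parseIntSample(null, 10));           // NaN: falsy values are passed through
```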

Implementing The Sort Function

Using the parseUniFloat() function, it is very easy to implement the sortNumeric() function:

function sortNumeric(array) {
   return array.sort(function(a,b) {
      var va = parseUniFloat(a), vb = parseUniFloat(b);
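      // Items without a leading number sort first; two such items compare as equal.
      if (isNaN(va) || isNaN(vb)) return isNaN(va) ? (isNaN(vb) ? 0 : -1) : 1;
      return va < vb ? -1 : va == vb ? 0 : 1;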
   });
}

Javascript’s sort() function accepts an optional comparator function as its argument. The comparator receives two arguments, a and b, and should return a positive number if a should be placed after b, a negative number if a should be placed before b, and 0 if the two are considered equal. So we simply implement such a function, using our parseUniFloat() function to get the float values of the string arguments.
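
To see the difference a comparator makes, compare the default sort with a numeric one. This sketch uses the built-in parseFloat() so that it runs on its own, which means it only understands ASCII digits; byLeadingNumber is a hypothetical name. Placing unparsable items first and treating two of them as equal is one reasonable choice, not the only one. Note that sort() reorders the array in place, hence the slice() calls.

```javascript
var items = ['3 dogs', '10 cats', '2 mice'];

// Comparator on the leading number of each string; NaN values sort first.
function byLeadingNumber(a, b) {
   var va = parseFloat(a), vb = parseFloat(b);
   if (isNaN(va) || isNaN(vb)) return isNaN(va) ? (isNaN(vb) ? 0 : -1) : 1;
   return va < vb ? -1 : va > vb ? 1 : 0;
}

var asStrings = items.slice().sort();                // compares items as strings
var asNumbers = items.slice().sort(byLeadingNumber); // compares leading numbers

console.log(asStrings); // ['10 cats', '2 mice', '3 dogs']
console.log(asNumbers); // ['2 mice', '3 dogs', '10 cats']
```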

Conclusion

The utility functions included in uninums.js are useful for developing internationalized web applications. However, they are not as fast as they would have been had they been implemented inside the Javascript engine. As I have written before, the Javascript Standard is gravely lacking in its required support for Unicode. I do hope that future versions of the Javascript Standard fix this. In the meantime, we have to resort to other means, such as uninums.js. I hope you find this useful for your applications.

Have you ever developed an international web application and dealt with these challenges?
I would love to hear about your experiences.


Roy Sharon
