Unicode Numbers In Javascript

Handling Non-ASCII Numerals In Javascript — The Way It Should Have Been Handled

Source code available at uninums on github.

A couple of weeks ago, I ranted about the lack of proper Unicode support in Javascript. Granted, Javascript supports Unicode strings, but if you want to parse such strings to numbers (e.g., the user enters a phone number using Chinese numerals), you will have to handle this yourself. So here is a small utility script that implements five methods for handling non-ASCII numerals in Javascript:

Function Description
normalDigits(s): Normalizes string s by replacing all non-ASCII digits with ASCII digits.

  • normalDigits('٠۴६') == '046'
  • normalDigits('123') == '123'
normalSpaces(s): Normalizes string s by replacing all whitespace characters with either a space ('\x20') or a newline ('\n') as appropriate:

  • normalSpaces('Hello\t\rWorld') == 'Hello\x20\nWorld'
  • normalSpaces('\xA0\u2003') == '\x20\x20'
  • normalSpaces('\u2028') == '\n'

As a special case, normalSpaces() also replaces a CRLF sequence with a single newline character. So normalSpaces('\r\n') == '\n'.

parseUniInt(s,r): Returns the integer value at the start of string s, ignoring leading spaces and using radix r. This is equivalent to the behavior of Javascript’s internal parseInt() function, but also handles non-ASCII digits:

  • parseUniInt('٠۴६', 10) == parseInt('046', 10) == 46
  • parseUniInt('٠۴६') == parseInt('046') == 38 // legacy engines assume radix=8 due to the leading zero; ES5 and later engines return 46
  • parseUniInt('٠۴६hello') == parseInt('046hello') == 38 // same caveat: 46 in ES5 and later engines
  • parseUniInt('hello') == parseInt('hello') == NaN
parseUniFloat(s): Returns the float value at the start of string s, ignoring leading spaces. This is equivalent to the behavior of Javascript’s internal parseFloat() function, but also handles non-ASCII digits:

  • parseUniFloat('٠۴.६') == parseFloat('04.6') == 4.6
  • parseUniFloat('٠۴.६hello') == parseFloat('04.6hello') == 4.6
  • parseUniFloat('hello') == parseFloat('hello') == NaN
sortNumeric(a): Sorts array a according to the numeric float values of its items:

  • sortNumeric(['3 dogs','10 cats','2 mice']) == ['2 mice','3 dogs','10 cats']
  • sortNumeric(['٣ dogs','١٠ cats','٢ mice']) == ['٢ mice','٣ dogs','١٠ cats']

Note that using Javascript’s internal sort() function will order '10 cats' before '2 mice' because it is string based rather than numeric.

All of these functions are available in the uninums.js file. You are welcome to use/modify/redistribute it as you see fit.

Let’s Start With The Space Normalization Function

The Javascript Standard published by ECMA states that all of the following Unicode characters should be treated as whitespace:

Code Unit Value        Name
\u0009                 Tab
\u000B                 Vertical Tab ('\v')
\u000C                 Form Feed ('\f')
\u0020                 Space (' ')
\u00A0                 No-break space
\uFEFF                 Byte Order Mark
Other category “Zs”    Any other Unicode “space separator”

In version 5.2 of the Unicode Standard, the “Zs” category adds the following characters:

Code Unit Value        Name
\u1680                 Ogham Space Mark
\u180E                 Mongolian Vowel Separator
\u2000                 En Quad
\u2001                 Em Quad
\u2002                 En Space
\u2003                 Em Space
\u2004                 Three-per-em space
\u2005                 Four-per-em space
\u2006                 Six-per-em space
\u2007                 Figure space
\u2008                 Punctuation space
\u2009                 Thin space
\u200A                 Hair space
\u202F                 Narrow no-break space
\u205F                 Medium mathematical space
\u3000                 Ideographic space

So the normalSpaces() function should basically replace every occurrence of any of these characters with a simple space ('\x20'):

function normalSpaces(s) {
   var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
      '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
      '\u202F\u205F\u3000\uFEFF]', 'g');

   return s ? s.toString().replace(Zs_and_friends, ' ') : s;
}

We would also like to replace line terminators with newline characters. The Javascript Standard says that all of the following should be treated as line terminators:

Code Unit Value        Name
\u000A                 Line Feed ('\n')
\u000D                 Carriage Return ('\r')
\u2028                 Line separator
\u2029                 Paragraph separator

It also says that a CRLF sequence should be treated as a single line terminator.

We want to normalize line terminators as well:

function normalSpaces(s) {
   var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
      '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
      '\u202F\u205F\u3000\uFEFF]', 'g');

   var line_terminators = new RegExp('\u000D\u000A|[\u000D\u2028\u2029]', 'g');

   return s ? s.toString().replace(Zs_and_friends,' ').replace(line_terminators,'\n') : s;
}
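
As a quick sanity check, here is a sketch of the same logic written with regex literals instead of the RegExp constructor (an equivalent notation I am assuming for brevity, not the form used in uninums.js), followed by a few of the examples from the table at the top of this post:

```javascript
// Sketch: same logic as normalSpaces() above, written with regex literals.
function normalSpaces(s) {
   // Every whitespace character from the two tables above except the plain space.
   var Zs_and_friends = /[\u0009\u000B\u000C\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000\uFEFF]/g;
   // CRLF comes first in the alternation so that it collapses to ONE newline.
   var line_terminators = /\u000D\u000A|[\u000D\u2028\u2029]/g;

   return s ? s.toString().replace(Zs_and_friends, ' ').replace(line_terminators, '\n') : s;
}

console.log(JSON.stringify(normalSpaces('Hello\t\rWorld'))); // "Hello \nWorld"
console.log(JSON.stringify(normalSpaces('\xA0\u2003')));     // "  "
console.log(JSON.stringify(normalSpaces('a\r\nb\u2028c')));  // "a\nb\nc"
```

The order of the alternatives in line_terminators matters: if the character class were tried first, the CR and the LF of a CRLF pair would be handled separately and produce two newlines.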

Implementing The Digit Normalization Function

The normalDigits() function is implemented in a similar manner to the normalSpaces() function, with one difference — it uses 10 regular expressions, one for each digit:

function normalDigits(s) {
   if (!s) return s;
   s = s.toString();
   for (var i = 0; i <= 9; ++i) s = s.replace(Nd[i], i);
   return s;
}

The Nd variable is an array which contains 10 elements, each of which is a regular expression matching all Unicode characters that represent the same decimal digit. (I will not list them here as they are long, but you can find their definition in uninums.js.) All in all, they amount to 411 characters.
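
To give a feel for how such a table works, here is a minimal sketch with an abbreviated, hypothetical stand-in for Nd that covers only three scripts. The ZEROS list and the way the regular expressions are built are my assumptions for illustration; the real table in uninums.js is much longer. Note the 'g' flag, without which replace() would substitute only the first occurrence of each digit:

```javascript
// Hypothetical, abbreviated stand-in for the Nd table in uninums.js:
// the code point of the digit ZERO in three scripts. Within each of these
// scripts, the digits 0-9 occupy ten consecutive code points.
var ZEROS = [0x0660, 0x06F0, 0x0966]; // Arabic-Indic, Extended Arabic-Indic, Devanagari

// Build one regular expression per digit value. The 'g' flag makes
// replace() substitute every occurrence, not just the first one.
var Nd = [];
for (var d = 0; d <= 9; ++d) {
   var chars = '';
   for (var k = 0; k < ZEROS.length; ++k) chars += String.fromCharCode(ZEROS[k] + d);
   Nd.push(new RegExp('[' + chars + ']', 'g'));
}

function normalDigits(s) {
   if (!s) return s;
   s = s.toString();
   for (var i = 0; i <= 9; ++i) s = s.replace(Nd[i], String(i));
   return s;
}

console.log(normalDigits('٠۴६'));  // 046
console.log(normalDigits('١١٢٢')); // 1122
```

String.fromCharCode() is enough here because all three scripts live in the Basic Multilingual Plane; digits outside it would need surrogate pairs and a different approach.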

Implementing The Parse Functions

Having implemented the normalization functions, most of the hard work has already been done. We can now easily implement the parseUniInt() and parseUniFloat() functions:

function parseUniInt(s, radix) {
   return parseInt(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s, radix);
}

function parseUniFloat(s) {
   return parseFloat(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s);
}

Note that if s is not already a number, we should first convert it to a string using the toString() function, then normalize it with our normalization functions, and finally pass the resulting string to the standard parseInt() or parseFloat().

It is clear why we normalize the digits, but why normalize the spaces as well? Well, according to the Javascript Standard, parseInt() and parseFloat() should strip leading spaces before parsing commences, so we’re normalizing them just in case the Javascript engine does not understand non-ASCII spaces.
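
A small sketch of what the native functions do on their own may make the motivation clearer. The variable names are mine, and the last result assumes an engine that follows ES5 or later:

```javascript
// Native parsers recognise only ASCII digits, so non-ASCII numerals yield NaN.
var nativeInt = parseInt('٠۴६', 10);
var nativeFloat = parseFloat('٠۴.६');

// Leading whitespace is a different matter: ES5-conformant engines skip
// non-ASCII spaces such as U+00A0 and U+2003, but older engines may not,
// which is the case that normalSpaces() guards against.
var spaced = parseInt('\u00A0\u200346', 10);

console.log(nativeInt, nativeFloat, spaced); // NaN NaN 46 (in an ES5+ engine)
```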

Implementing The Sort Function

Using the parseUniFloat() function, it is very easy to implement the sortNumeric() function:

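function sortNumeric(array) {
   return array.sort(function(a,b) {
      var va = parseUniFloat(a), vb = parseUniFloat(b);
      var na = isNaN(va), nb = isNaN(vb);
      // Non-numeric items sort first; two non-numeric items compare as equal
      // so that the comparator stays consistent.
      if (na || nb) return na && nb ? 0 : na ? -1 : 1;
      return va < vb ? -1 : va == vb ? 0 : 1;
   });
}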

Javascript’s sort() function can receive an optional argument, which should be a comparator function. This comparator function receives two arguments, a and b, and should return a positive number if a is bigger than b, a negative number if a is smaller than b, and 0 if a equals b. So we simply implement such a function using our parseUniFloat() function to get the float value of the string arguments. Items that have no numeric value (NaN) are placed first.
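
The difference between the default string ordering and a numeric comparator can be seen in this self-contained sketch. The name compareNumeric is hypothetical, and plain parseFloat() stands in for parseUniFloat() so that the snippet runs on its own:

```javascript
// Default sort() compares items as strings, so '10 cats' precedes '2 mice'.
var items = ['3 dogs', '10 cats', '2 mice'];
var byString = items.slice().sort(); // slice() copies; sort() works in place

// Comparator returning a negative, zero or positive number.
// parseFloat stands in for parseUniFloat to keep the sketch self-contained.
function compareNumeric(a, b) {
   var va = parseFloat(a), vb = parseFloat(b);
   var na = isNaN(va), nb = isNaN(vb);
   if (na || nb) return na === nb ? 0 : na ? -1 : 1; // non-numeric items first
   return va - vb;
}
var byNumber = items.slice().sort(compareNumeric);

console.log(byString.join(', ')); // 10 cats, 2 mice, 3 dogs
console.log(byNumber.join(', ')); // 2 mice, 3 dogs, 10 cats
```

Keep in mind that sort() reorders the array it is called on, which is why the sketch sorts copies made with slice().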

Conclusion

The utility functions included in uninums.js are useful for developing internationalized web applications. However, they are not as fast as they would have been had they been implemented inside the Javascript engine. As I have written before, the Javascript Standard is gravely lacking in its required support for Unicode. I do hope that future versions of the Javascript Standard fix this. In the meantime, we have to resort to other means, such as uninums.js. I hope you find this useful for your applications.

Have you ever developed an international web application and dealt with these challenges?
I would love to hear about your experiences.


7 Responses to Unicode Numbers In Javascript

  1. Sangkhim says:

    Cool, it’s really useful for me in my project.
    But I found a bug with the replace function.
    By default, JavaScript’s replace function replaces only the first occurrence in the string.
    So when the string has several digits with the same value, only one of them is replaced.

    Here is my simple solution; after adding g it works fine for me:
    RegExp("["+e[b]+"]","g")

  2. Very useful function that solved an issue we had with survey respondents entering numbers in their native language. One thing I did notice, however, was that some Chinese encoding values are not supported. We were dealing with Japanese so it’s not an issue, but I figured it’s worth mentioning. Might be fixable by just adding those code page values to the main array.

    Thanks for posting this!

    • Roy Sharon says:

      Thank you for your comment. I intended to include in the script the entire Nd (Number, Decimal Digit) category within the Basic Multilingual Plane (code points 0000-FFFF). Are the missing Chinese numerals in this range? Can you please give me an example of one so I can check why it didn’t work? Thanks!

  3. Kaysar says:

    really great js function…… work fine & save time…….. thanks a lot

  4. WilliamBoop says:

    wow, awesome article post.Really looking forward to read more. Really Great. Cacho

  5. Gintare says:

    Thank you. It was very useful.
