Handling Non-ASCII Numerals In Javascript — The Way It Should Have Been Handled
Source code available at uninums on GitHub.
A couple of weeks ago, I ranted about the lack of proper Unicode support in Javascript. Granted, Javascript supports Unicode strings, but if you want to parse such strings to numbers (e.g., the user enters a phone number using Chinese numerals), you will have to handle this yourself. So here is a small utility script that implements five methods for handling non-ASCII numerals in Javascript:
Function | Description |
---|---|
normalDigits(s) | Normalizes string s by replacing all non-ASCII digits with ASCII digits. |
normalSpaces(s) | Normalizes string s by replacing all whitespace characters with either a space ('\x20') or a newline ('\n') as appropriate. As a special case, normalSpaces() also replaces a CRLF sequence with a single newline character, so normalSpaces('\r\n') == '\n'. |
parseUniInt(s,r) | Returns the integer value at the start of string s, ignoring leading spaces and using radix r. This is equivalent to the behavior of Javascript’s internal parseInt() function, but also handles non-ASCII digits. |
parseUniFloat(s) | Returns the float value at the start of string s, ignoring leading spaces. This is equivalent to the behavior of Javascript’s internal parseFloat() function, but also handles non-ASCII digits. |
sortNumeric(a) | Sorts array a according to the numeric float values of its items. Note that Javascript’s internal sort() function orders '10 cats' before '2 mice' because it compares strings rather than numbers. |
All of these functions are available in the uninums.js file. You are welcome to use/modify/redistribute it as you see fit.
Let’s Start With The Space Normalization Function
The Javascript Standard published by ECMA states that all of the following Unicode characters should be treated as whitespace:
Code Unit Value | Name |
---|---|
\u0009 | Tab |
\u000B | Vertical Tab (‘\v’) |
\u000C | Form Feed (‘\f’) |
\u0020 | Space (‘ ‘) |
\u00A0 | No-break space |
\uFEFF | Byte Order Mark |
Other category “Zs” | Any other Unicode “space separator” |
In version 5.2 of the Unicode Standard, the “Zs” category adds the following characters:
Code Unit Value | Name |
---|---|
\u1680 | Ogham Space Mark |
\u180E | Mongolian Vowel Separator |
\u2000 | En Quad |
\u2001 | Em Quad |
\u2002 | En Space |
\u2003 | Em Space |
\u2004 | Three-per-em space |
\u2005 | Four-per-em space |
\u2006 | Six-per-em space |
\u2007 | Figure space |
\u2008 | Punctuation space |
\u2009 | Thin space |
\u200A | Hair space |
\u202F | Narrow no-break space |
\u205F | Medium mathematical space |
\u3000 | Ideographic space |
So the normalSpaces() function should basically replace every occurrence of any of these characters with a simple space ('\x20'):

    function normalSpaces(s) {
        var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
            '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
            '\u202F\u205F\u3000\uFEFF]', 'g');
        return s ? s.toString().replace(Zs_and_friends, ' ') : s;
    }
We would also like to replace line terminators with newline characters. The Javascript Standard says that all of the following should be treated as line terminators:
Code Unit Value | Name |
---|---|
\u000A | Line Feed (‘\n’) |
\u000D | Carriage Return (‘\r’) |
\u2028 | Line separator |
\u2029 | Paragraph separator |
It also says that a CRLF sequence should be treated as a single line terminator.
Extending normalSpaces() to normalize line terminators as well gives:

    function normalSpaces(s) {
        var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
            '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
            '\u202F\u205F\u3000\uFEFF]', 'g');
        var line_terminators = new RegExp('\u000D\u000A|[\u000D\u2028\u2029]', 'g');
        return s ? s.toString().replace(Zs_and_friends, ' ').replace(line_terminators, '\n') : s;
    }
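As a quick check of the behavior described above, here is a small usage sketch. It repeats the function so that it runs on its own; the sample string is made up for illustration.

```javascript
// Copy of normalSpaces() from this post, repeated so the example is self-contained.
function normalSpaces(s) {
    var Zs_and_friends = new RegExp('[\u0009\u000B\u000C\u00A0\u1680\u180E' +
        '\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A' +
        '\u202F\u205F\u3000\uFEFF]', 'g');
    var line_terminators = new RegExp('\u000D\u000A|[\u000D\u2028\u2029]', 'g');
    return s ? s.toString().replace(Zs_and_friends, ' ').replace(line_terminators, '\n') : s;
}

// Hypothetical input: ideographic space, no-break space, CRLF and a line separator.
var sample = 'one\u3000two\u00A0three\r\nfour\u2028five';
var cleaned = normalSpaces(sample);
// cleaned is 'one two three\nfour\nfive'

// Empty or missing input is returned unchanged.
var empty = normalSpaces('');      // ''
var nothing = normalSpaces(null);  // null
```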
Implementing The Digit Normalization Function
The normalDigits() function is implemented in a similar manner to the normalSpaces() function, with one difference — it uses 10 regular expressions, one for each digit:
    function normalDigits(s) {
        if (!s) return s;
        s = s.toString();
        for (var i = 0; i <= 9; ++i)
            s = s.replace(Nd[i], i);
        return s;
    }
The Nd variable is an array which contains 10 elements, each of which is a regular expression matching all Unicode characters that represent the same decimal digit. (I will not list them here as they are long, but you can find their definition in uninums.js.) All in all, they amount to 411 characters.
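To give a feel for what such an array could look like, here is a minimal sketch that builds Nd for just four digit blocks. This is not the definition from uninums.js: the zeroCodePoints list and the way the expressions are generated are assumptions made for illustration, relying on the fact that each of these four blocks stores its digits 0 to 9 in consecutive code points.

```javascript
// Hypothetical subset: code point of the digit ZERO in four Nd blocks.
var zeroCodePoints = [
    0x0660, // Arabic-Indic digits
    0x06F0, // Extended Arabic-Indic digits
    0x0966, // Devanagari digits
    0xFF10  // Fullwidth digits
];

// Build Nd[i]: a global regular expression matching every listed form of digit i.
var Nd = [];
for (var i = 0; i <= 9; ++i) {
    var chars = '';
    for (var j = 0; j < zeroCodePoints.length; ++j)
        chars += String.fromCharCode(zeroCodePoints[j] + i);
    Nd[i] = new RegExp('[' + chars + ']', 'g');
}

// normalDigits() as given in this post.
function normalDigits(s) {
    if (!s) return s;
    s = s.toString();
    for (var i = 0; i <= 9; ++i)
        s = s.replace(Nd[i], i);
    return s;
}

var result = normalDigits('\u0661\u0662\u0663 and \uFF14\uFF12'); // '123 and 42'
```

The 'g' flag matters here: without it, replace() would only change the first occurrence of each digit.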
Implementing The Parse Functions
Having implemented the normalization functions, most of the hard work has already been done. We can now easily implement the parseUniInt() and parseUniFloat() functions:
    function parseUniInt(s, radix) {
        return parseInt(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s, radix);
    }

    function parseUniFloat(s) {
        return parseFloat(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s);
    }
Note that if s is not already a number, we should first convert it to a string using the toString() function, then normalize it with our normalization functions, and finally pass the resulting string to the standard parseInt() or parseFloat().
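The following sketch shows the two parse functions in use. To keep it short and runnable on its own, normalDigits() and normalSpaces() are replaced by simplified stand-ins that cover only a handful of characters; those stand-ins are assumptions for this example and not the versions from uninums.js.

```javascript
// Simplified stand-ins (assumptions for this sketch): only Arabic-Indic
// (U+0660-0669) and fullwidth (U+FF10-FF19) digits are mapped, and only
// two non-ASCII spaces are replaced.
function normalDigits(s) {
    return s.replace(/[\u0660-\u0669\uFF10-\uFF19]/g, function (ch) {
        var code = ch.charCodeAt(0);
        return String(code - (code >= 0xFF10 ? 0xFF10 : 0x0660));
    });
}
function normalSpaces(s) {
    return s.replace(/[\u00A0\u3000]/g, ' ');
}

// The two parse functions as given in this post.
function parseUniInt(s, radix) {
    return parseInt(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s, radix);
}
function parseUniFloat(s) {
    return parseFloat(s && typeof(s) != 'number' ? normalDigits(normalSpaces(s.toString())) : s);
}

var a = parseUniInt('\u0661\u0662\u0663 items');  // 123
var b = parseUniInt('\u3000\uFF14\uFF12', 10);    // 42
var c = parseUniFloat('\u0663.\u0661\u0664');     // 3.14
var d = parseUniInt(7.9);                         // 7 (numbers skip normalization)
```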
It is clear why we normalize the digits, but why normalize the spaces as well? Well, according to the Javascript Standard, parseInt() and parseFloat() should strip leading spaces before parsing commences, so we’re normalizing them just in case the Javascript engine does not understand non-ASCII spaces.
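Whether a given engine needs this extra step can be probed with a sketch like the one below. The unskippedSpaces() helper is a hypothetical name introduced here; it reports which of the whitespace code points listed in this post the engine's own parseInt() fails to skip. The result depends on the engine and on the Unicode version it follows, so no particular output is claimed.

```javascript
// Code points from the whitespace tables in this post.
var candidates = [0x0009, 0x000B, 0x000C, 0x0020, 0x00A0, 0x1680, 0x180E,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007,
    0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000, 0xFEFF];

// Hypothetical helper: returns the code points that the engine's parseInt()
// does NOT skip when they appear in front of a number.
function unskippedSpaces(codePoints) {
    var missing = [];
    for (var i = 0; i < codePoints.length; ++i) {
        var text = String.fromCharCode(codePoints[i]) + '42';
        if (parseInt(text, 10) !== 42) missing.push(codePoints[i]);
    }
    return missing;
}

var missing = unskippedSpaces(candidates);
for (var i = 0; i < missing.length; ++i)
    console.log('not skipped: U+' + ('0000' + missing[i].toString(16).toUpperCase()).slice(-4));
```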
Implementing The Sort Function
Using the parseUniFloat() function, it is very easy to implement the sortNumeric() function:
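    function sortNumeric(array) {
        return array.sort(function(a, b) {
            var va = parseUniFloat(a), vb = parseUniFloat(b);
            var na = isNaN(va), nb = isNaN(vb);
            // Non-numeric items go first; two non-numeric items compare as equal.
            if (na || nb) return na && nb ? 0 : na ? -1 : 1;
            return va < vb ? -1 : va == vb ? 0 : 1;
        });
    }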
Javascript’s sort() function accepts an optional comparator function as its argument. The comparator receives two items, a and b, and should return a positive number if a should be ordered after b, a negative number if a should be ordered before b, and 0 if the two are equal. So we simply implement such a function, using parseUniFloat() to get the float value of each string argument. Items that do not start with a number (for which parseUniFloat() returns NaN) are placed first.
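The difference between the default string ordering and a numeric comparator can be seen in a small sketch. To stay self-contained it uses the built-in parseFloat() as a stand-in for parseUniFloat(), so it only handles ASCII digits; the sample items are made up.

```javascript
var items = ['10 cats', '2 mice', '1 dog'];

// Default sort(): compares the items as strings, character by character.
var byString = items.slice().sort();
// byString is ['1 dog', '10 cats', '2 mice']

// Comparator sort: compares the leading numbers instead.
var byNumber = items.slice().sort(function (a, b) {
    return parseFloat(a) - parseFloat(b);
});
// byNumber is ['1 dog', '2 mice', '10 cats']
```

The slice() calls copy the array first, because sort() reorders the array it is called on.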
Conclusion
The utility functions included in uninums.js are useful for developing internationalized web applications. However, they are not as fast as they would have been had they been implemented inside the Javascript engine. As I have written before, the Javascript Standard is gravely lacking in its required support for Unicode. I do hope that future versions of the Javascript Standard fix this. In the meantime, we have to resort to other means, such as uninums.js. I hope you find this useful for your applications.
Have you ever developed an international web application and dealt with these challenges?
I would love to hear about your experiences.
Cool, it’s really useful for my project.
But I found a bug with the replace function.
By default, JavaScript’s replace function replaces only the first occurrence in the string.
So when the same digit appears several times, only one of them is replaced.
Here is my simple solution; adding g makes it work fine for me:
RegExp("["+e[b]+"]","g")
…
Of course, how stupid of me. Following your comment I’ve fixed it both here in the post and also in the github repository.
Thanks!
Very useful function that solved an issue we had with survey respondents entering numbers in their native language. One thing I did notice, however, was that some Chinese encoding values are not supported. We were dealing with Japanese, so it’s not an issue, but I figured it’s worth mentioning. Might be fixable by just adding those code page values to the main array.
Thanks for posting this!
Thank you for your comment. I intended to include in the script the entire Nd (Number, Decimal Digit) category within the Basic Multilingual Plane (code points 0000-FFFF). Are the missing Chinese numerals in this range? Can you please give me an example of one so I can check why it didn’t work? Thanks!
Thank you. It was very useful.