Posted on April 29, 2008 by Grant Skinner
During the course of developing the Spelling Plus Library, and more recently while adding multilingual support to it, I discovered two serious bugs in the Regular Expression implementation in ActionScript and how it handles accented characters.
First, RegExp in AS3 does not include accented characters in the word character class. For example, the pattern /\w+/ (match one or more word characters) matches “r” and “sume” in “résume”, when it should match the full string.
UPDATE: Arthur has pointed out in the comments that this is correct according to the ECMAScript and POSIX RegEx specifications. \w is intended to match just the set [a-zA-Z0-9_], which it does in AS3. With that understood, it would be nice to have support for Unicode property sets (which allow you to match word characters in any language, among other things), but I can understand that this may have an unacceptable impact on the size of the Flash Player.
Secondly, there is a somewhat obscure problem with how the Flash Player matches \S against accented characters. Specifically, it appears that it does not count accented characters properly when matching them with \S, and this produces odd matches. This is not the case with the negated whitespace character set [^\s], although the two should exhibit identical behaviour in RegEx. This issue is pretty weird, so I’ll give a few examples:
All of the above work properly if you substitute [^\s] for \S. Hopefully this is helpful for other people working with RegExp, especially with languages other than English. It is quite frustrating to work around; I ended up writing a specialized character lexer instead of using RegExp in SPL. Know of any other RegExp bugs in AS3? Share them in the comments.
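For anyone who wants to see the \w behaviour outside Flash, here is a minimal sketch in TypeScript. It runs on a JavaScript engine rather than AS3's RegExp, so it only illustrates what the ECMAScript definition of \w does and says nothing about the \S bug itself. It assumes a modern engine that supports the u flag with Unicode property escapes, and the variable names are purely illustrative.

```typescript
// "résume" with a precomposed é (U+00E9), written as an escape so the
// example does not depend on how this file is normalized or encoded.
const word = "r\u00e9sume";

// \w is defined as [A-Za-z0-9_], so the accented letter splits the match.
const asciiWords: string[] = word.match(/\w+/g) || [];

// The negated whitespace class treats é like any other non-space character.
const nonSpaceRuns: string[] = word.match(/[^\s]+/g) || [];

// Unicode property escapes (letters, numbers, underscore). Built with the
// RegExp constructor so older compile targets do not reject the literal.
const unicodeWord = new RegExp("[\\p{L}\\p{N}_]+", "gu");
const unicodeWords: string[] = word.match(unicodeWord) || [];

console.log(asciiWords);   // [ 'r', 'sume' ]
console.log(nonSpaceRuns); // [ 'résume' ]
console.log(unicodeWords); // [ 'résume' ]
```

The property-escape pattern is the kind of "Unicode property set" support mentioned in the update above; it is not available in AS3.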
12 Comments
RegExp Bugs With Accented Characters
Bookmarked your post over at Blog Bookmarker.com!
Posted by: with on Apr 30, 2008 1:46am URL: http://www.blogbookmarker.com/tags/with
A wild guess as to the problem with \S matching too many characters: it has a problem with some cases of multi-byte character runs, which wouldn't be very surprising since regexps suck at non-ASCII on all systems I've used them in.
The regexp engine in Firefox seems to handle all the \S+ cases (although it has the same basic problem of \w not matching accented characters).
Posted by: Theo on Apr 30, 2008 2:03am URL: http://blog.iconara.net
Theo,
Yes, this was my thought too. It's not counting the multi-byte character correctly in this case for some reason. Matching the trailing space is a little strange as well, but is likely related to the same problem. My guess would be the counting problem causes it to skip trying to match the space character completely.
Posted by: Grant Skinner on Apr 30, 2008 8:50am URL: http://gskinner.com/blog/
Hi Grant.
Regarding accented letters: while this is a bit counterintuitive, it's actually part of the ECMA-262 spec. The \w character class is just a shortcut for the a-z, A-Z, 0-9 ranges plus "_", which does not include accented letters.
Cheers
Arthur Debert
[1] The spec http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
[2] The POSIX regex spec : http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1500.html
Posted by: Arthur Debert on Apr 30, 2008 8:52am URL: http://www.stimuli.com.br/
Arthur,
Right you are - my bad. I guess the problem then is that AS3's RegExp implementation does not include support for any extended character classes (e.g. Unicode property sets), though I can understand that this may be due to file size implications in the player.
I'll update the article to reflect this.
Posted by: Grant Skinner on Apr 30, 2008 9:15am URL: http://gskinner.com/blog/
I pointed out this bug a year ago, but no one listened to me. I hope they listen to you now!
Posted by: Marcos Neves on Apr 30, 2008 9:55am
I live in México, and I have known about this bug since the first Flex SDK came out. Today it is common practice to use more complicated RegExp patterns to do anything with Spanish text.
Posted by: Quantium on Apr 30, 2008 11:03am URL: http://www.quantium.com.mx
I think I've found another regex bug:
Any idea why
/^(.*)-(.*)$/
doesn't find
aaaa - bbbb
Posted by: Nikos Katsikanis on Oct 9, 2008 2:29am URL: http://www.ecommercetotal.co.uk
Great list, it helps clear up much of the htacess mystery and confusion that comes from creating such files.
Posted by: clearance london on Nov 28, 2008 9:41am URL: http://www.wecleareverything.co.uk
Nikos - I just tested that pattern in RegExr, and it seems to work fine for me.
Posted by: Grant Skinner on Dec 5, 2008 10:13am URL: http://gskinner.com/blog/
String#replace accepts a function as a second argument. The function will have arguments for the match, and the index in the string where the match begins (and another for the entire string). But for unicode characters, the index is wrong (or at least, not what I would expect!)
trace("_x_x".replace(/x/g, function(match, i, str) {
trace(i);
trace(str.charAt(i), str.charAt(i) === "x");
return "_";
}));
trace("__".replace(//g, function(match, i, str) {
trace(i);
trace(str.charAt(i), str.charAt(i) === "");
return "_";
}));
Posted by: Eric Skogen on Sep 3, 2009 10:50am
I'm not sure if it's on the RegEx side or the TextField, but when getting the index position of a match it seems to be off if there are special characters such as em-dash or smart single quotes. As with the em-dash, it looks like it offsets the index position by 3 characters.
Posted by: Eric Decker on Oct 2, 2009 1:36pm URL: http://labs.eric-decker.com
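For reference when checking index reports like the two above, here is a rough TypeScript sketch of what the replace callback offsets look like on a JavaScript engine, next to the UTF-8 byte offsets of the same positions. Comparing against byte counts is only one possible line of investigation, in the spirit of Theo's multi-byte guess; nothing in the thread confirms that this is what the Flash Player does. The sample string with é and the helper utf8Length are my own assumptions for illustration.

```typescript
// Hypothetical sample: underscore, é (U+00E9), underscore, é.
const sample = "_\u00e9_\u00e9";

// Collect the offset argument passed to the replacer for each match.
const offsets: number[] = [];
const replaced = sample.replace(/\u00e9/g, (_match: string, offset: number): string => {
  offsets.push(offset);
  return "_";
});

// Illustrative helper: UTF-8 byte length of a well-formed string.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair: one 4-byte code point
      i++;        // skip the low surrogate
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Byte offset of each match, for comparison with the character offsets.
const byteOffsets = offsets.map((o) => utf8Length(sample.slice(0, o)));

console.log(replaced);           // ____
console.log(offsets);            // [ 1, 3 ]
console.log(byteOffsets);        // [ 1, 4 ]
console.log(utf8Length("\u2014")); // 3 (an em dash is three bytes in UTF-8)
```

If the indexes you see in Flash line up with the byte offsets rather than the character offsets, that would point toward a byte-counting issue; if not, the cause is something else.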