digitalmars.D.learn - std.regex character consumption
- petevik38 yahoo.com.au (26/26) Oct 08 2010 I've been running into a few problems with regular expressions in D. One
- Jonathan M Davis (8/43) Oct 08 2010 Well, without looking at the code, I can't say for certain what's going ...
I've been running into a few problems with regular expressions in D. One of the issues I've had recently is matching strings with non ascii characters. As an example:

    auto re = regex( `(.*)\.txt`, "i" );
    re.printProgram();
    auto m = match( "bà.txt", re );
    writefln( "'%s'", m.captures[1] );

When I run this I get the following error:

    dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116 120] around index 0

    printProgram()
      0: REparen len=1 n=0, pc=>10
      9: REanystar
     10: REistring x4, '.txt'
     19: REend

While investigating the cause, I noticed that during execution of many of the regex instructions (e.g. REanystar), the source is advanced with:

    src++;

However in other cases (REanychar), it is advanced with:

    src += std.utf.stride(input, src);

I found that by replacing the code REanystar with stride, the code worked as expected. Although I can't claim to have a solid understanding of the code, it seems to me that most of the cases of src++ should be using stride instead.

Is this correct, or have I made some silly mistake and got completely the wrong end of the stick?
Oct 08 2010
On Friday, October 08, 2010 14:13:36 petevik38 yahoo.com.au wrote:
> I've been running into a few problems with regular expressions in D. One
> of the issues I've had recently is matching strings with non ascii
> characters. As an example:
>
>     auto re = regex( `(.*)\.txt`, "i" );
>     re.printProgram();
>     auto m = match( "bà.txt", re );
>     writefln( "'%s'", m.captures[1] );
>
> When I run this I get the following error:
>
>     dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116 120] around index 0
>
>     printProgram()
>       0: REparen len=1 n=0, pc=>10
>       9: REanystar
>      10: REistring x4, '.txt'
>      19: REend
>
> While investigating the cause, I noticed that during execution of many
> of the regex instructions (e.g. REanystar), the source is advanced with:
>
>     src++;
>
> However in other cases (REanychar), it is advanced with:
>
>     src += std.utf.stride(input, src);
>
> I found that by replacing the code REanystar with stride, the code
> worked as expected. Although I can't claim to have a solid understanding
> of the code, it seems to me that most of the cases of src++ should be
> using stride instead.
>
> Is this correct, or have I made some silly mistake and got completely
> the wrong end of the stick?

Well, without looking at the code, I can't say for certain what's going on, but using ++ with chars or wchars is definitely wrong in virtually all cases. stride() will actually go to the next code point, while ++ will just go to the next code unit, which could be in the middle of a code point.

- Jonathan M Davis
Oct 08 2010
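To make the code unit vs. code point distinction from the reply concrete, here is a minimal sketch using the same string as the original post. It is not the std.regex code; the helper names countCodePoints and isContinuation are made up for illustration, and only std.utf.stride comes from the thread. It shows that advancing with ++ from the start of 'à' lands on byte 160 (0xA0), the first value in the reported "Invalid UTF-8 sequence [160 46 116 120]", whereas stride() skips the whole two-byte sequence.

```d
import std.stdio : writefln;
import std.utf : stride;

/// Hypothetical helper: counts code points by advancing with stride(),
/// the same way the post describes REanychar advancing src.
size_t countCodePoints(string input)
{
    size_t src = 0;
    size_t n = 0;
    while (src < input.length)
    {
        src += stride(input, src); // jump over the whole UTF-8 sequence
        ++n;
    }
    return n;
}

/// Hypothetical helper: true if the code unit at index i is a UTF-8
/// continuation byte (bit pattern 10xxxxxx), i.e. not a valid start.
bool isContinuation(string input, size_t i)
{
    return (input[i] & 0xC0) == 0x80;
}

void main()
{
    // "bà.txt" written with an escape so the source file encoding does not matter.
    // 'à' (U+00E0) is encoded in UTF-8 as the two code units 0xC3 0xA0.
    string input = "b\u00E0.txt";

    writefln("code units:  %s", input.length);           // 7
    writefln("code points: %s", countCodePoints(input)); // 6

    // Both indices start at 1, the first code unit of 'à'.
    size_t byUnit = 1;
    size_t byPoint = 1;

    byUnit++;                          // now 2: the middle of 'à'
    byPoint += stride(input, byPoint); // now 3: the '.' after 'à'

    writefln("after ++:       index %s, continuation byte: %s",
             byUnit, isContinuation(input, byUnit));
    writefln("after stride(): index %s, continuation byte: %s",
             byPoint, isContinuation(input, byPoint));
}
```

Decoding from the index reached with ++ starts at a continuation byte, which is consistent with the exception in the original post; the index reached with stride() is always the start of the next code point.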