digitalmars.D - Regex and UTF-8
- Andrea Fontana (11/11) Nov 18 2011 I build a data access layer in c++. This layer works with mongo db where
- Dmitry Olshansky (7/19) Nov 18 2011 Which version of std.regex are you using - the one from git master or
- Andrea Fontana (11/36) Nov 18 2011 It seems related to toLower too...
- Dmitry Olshansky (24/55) Nov 18 2011 You mean one of prepackaged zips|debs|etc. from the website? It uses the...
I built a data access layer in C++. This layer works with MongoDB, where strings are always encoded as UTF-8. I've ported this layer to D using SWIG. Strings are written correctly to the console, but when I use std.regex it sometimes throws an exception:
core.exception.UnicodeException@src/rt/util/utf.d(290): invalid UTF-8 sequence
The byte sequence (for better understanding) is: [83, 195, 179, 32], and the string was "Sò " (with an accented o and a space). I'm not a UTF expert, so is this a wrong UTF-8 encoding, or is it a bug in utf.d?
Nov 18 2011
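Editorial sketch, not part of the original thread: one way to answer the "wrong encoding or bug?" question is to validate and decode the quoted bytes directly. The helper name `decodeUtf8` is hypothetical; `std.utf.validate` and `std.conv.to` are the Phobos calls assumed here, as found in current releases (not necessarily identical in dmd 2.056). The bytes [83, 195, 179, 32] are well-formed UTF-8, which suggests the encoding itself is not at fault. They decode to U+00F3 (ó), whereas ò (U+00F2) would be encoded as 195, 178.

```d
import std.conv : to;
import std.stdio : writefln;
import std.utf : validate;

/// Hypothetical helper: reinterprets raw bytes as UTF-8 and decodes them
/// to code points. Throws std.utf.UTFException if the bytes are malformed.
dstring decodeUtf8(const(ubyte)[] bytes)
{
    auto s = cast(const(char)[]) bytes; // same bytes, viewed as UTF-8 code units
    validate(s);                        // throws on an invalid sequence
    return s.to!dstring;                // transcode to UTF-32 code points
}

void main()
{
    // Bytes quoted in the post above.
    foreach (dchar c; decodeUtf8([83, 195, 179, 32]))
        writefln("U+%04X", cast(uint) c);
}
```

If `validate` accepts the data coming out of the C++ layer, the exception is more likely to originate in how the string is processed afterwards than in the stored bytes.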
On 18.11.2011 17:58, Andrea Fontana wrote:
> I build a data access layer in c++. This layer works with mongo db where string are always encoded using UTF-8. I've ported this layer in D using swig. String is written correctly in console but when i use std.regex sometimes it gives an exception:
> core.exception.UnicodeException@src/rt/util/utf.d(290): invalid UTF-8 sequence
> Byte sequence (for better undestanding) is: [83, 195, 179, 32] And the string was "Sò " (with accented o and a space) I'm not a utf expert, so Is it a wrong utf-8 encoding or it is a bug on utf.d?

Which version of std.regex are you using - the one from git master or the one in the latest release? If it's the former, then I'm willing to look into this on the weekend, if you can get hold of a pair, string + pattern, that fails like this.
--
Dmitry Olshansky
Nov 18 2011
It seems related to toLower too... Here is the line that throws the exception:
s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();
where s is a string containing that sequence. Using dmd 2.056.
Nov 18 2011
On 18.11.2011 21:07, Andrea Fontana wrote:
> It seems related to toLower too... Here the line with exception:
> s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();
> Where s is a string with that sequence... Using dmd 2.056

You mean one of the prepackaged zips|debs|etc. from the website? It uses the old regex, which, I have to admit, is not that good with Unicode. Then ... well, you are somewhat out of luck until the next release. That's where the brand new regex engine is coming, provided I figure out a mysterious FreeBSD|OSX issue (sigh). Unfortunately, I was very busy recently, though maybe this weekend I'll finally work something out.

I just tested it with my version on win32 ... well, it hits one of the asserts (it should have been an exception, ouch!), but the fix was easy. It's all about \. in the pattern: '.' works as a simple '.' char in [], so it's just wrong to escape it inside a character class (some engines do allow this, though it's confusing like hell). After that it outputs stuff like this:
std.regex.RegexException@std\regex.d(1939): invalid escape sequence
Pattern with error: `[^"a-zA-Z0-9àòèéìù\.` <--HERE-- `]`
After changing \. --> . it does work for me with s = "Sò ", no exceptions.

Bottom line: thanks, as I uncovered a serious issue, i.e. a misjudged assert on wrong escapes in character classes. Second, if you are on win32/linux you might want to try the fresh github version. And stay tuned for the next release, which should fix most regex issues once and for all.
--
Dmitry Olshansky
Nov 18 2011
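Editorial sketch, not part of the original thread: the fix described in the reply above (leave '.' unescaped inside the character class) might look like the following with a current Phobos. It assumes `std.regex.replaceAll` and `std.uni.toLower`, which are the present-day calls rather than the `replace(..., "g")` form available in dmd 2.056; the function name `normalize` is hypothetical.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;
import std.uni : toLower;

/// Hypothetical helper: replaces every character outside the allowed set
/// with a space, then lowercases the result.
string normalize(string s)
{
    // '.' is already a literal inside [], so it is left unescaped.
    auto disallowed = regex(`[^"a-zA-Z0-9àòèéìù.]`);
    return replaceAll(s, disallowed, " ").toLower();
}

void main()
{
    // The input that triggered the exception in the thread.
    writeln(normalize("Sò "));
}
```

Note that a space is itself outside the allowed set, so it is matched and replaced by a space, leaving the string length in characters unchanged.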