digitalmars.D.learn - unicode confusion
- Charles Hixson via Digitalmars-d-learn (53/53) Apr 04 2016
Well, at least I think that it's Unicode confusion. When I store values into a string (in an array of structs) and then compare it against itself, it compares fine, and if I write it out at that point it writes out fine. And validate says it's good Unicode. But later...

valid = true, len = 17, wrd = true , cnt = 2, txt = Gesammtabentheuer
valid = true, len = 27, wrd = true , cnt = 1, txt = νεÏεληγεÏá½³ÏÎ·Ï
valid = true, len = 17, wrd = true , cnt = 1, txt = ζηÏοῦÏιν
valid = true, len = 36, wrd = true , cnt = 1, txt = αἱμοÏÏοÏδοκαύÏÏÎ·Ï
valid = true, len = 18, wrd = true , cnt = 2, txt = Î´Ï Î½Î·Î¸Ïμεν
valid = true, len = 20, wrd = true , cnt = 1, txt = ÏÏοÏκÏοÏÏÏ
valid = true, len = 20, wrd = true , cnt = 1, txt = ÏκοÏÏμÎνην
valid = true, len = 18, wrd = true , cnt = 1, txt = á¼Î³Î±ÏηÏοί
valid = true, len = 28, wrd = true , cnt = 1, txt = ×Ö½Ö·×Ö°×Ö´×ָּתָ×Ö¼
valid = true, len = 19, wrd = true , cnt = 1, txt = Î¤Ï ÏÏηνικά
valid = true, len = 17, wrd = true , cnt = 2, txt = IODOHYDRARGYRATIS
valid = true, len = 21, wrd = true , cnt = 1, txt = ÏοινικίÏιν
valid = true, len = 17, wrd = true , cnt = 1, txt = Spectrophotometer
valid = true, len = 26, wrd = true , cnt = 1, txt = αἰνιÏÏόμενοι
valid = true, len = 70, wrd = true , cnt = 1, txt = ÎÎΣΠÎÎÎΡÎÎÎΣΧÎÎÎÎÎΡÎÎΣΠÎÎÎ¥ÎÎ ÎΤÎΣÎÎÎ
valid = true, len = 18, wrd = true , cnt = 1, txt = μικÏÏÏαÏα
valid = true, len = 23, wrd = true , cnt = 1, txt = á¼ÏοÏá½±ÏηÏίν
valid = true, len = 18, wrd = true , cnt = 1, txt = ××ֹקְש×Öµ×
valid = true, len = 17, wrd = true , cnt = 1, txt = διαμένÏν
. . . (etc. for 39599 lines)

(And it looks worse than that, actually, because control characters aren't coming through.) Judging from an earlier test, I think the originals were mostly Greek letters (why there should be so many Greek words I don't know, but if they're there I want them to be handled properly). The corrupted text is such a small part of the original file, though, that I can't be certain.
valid = true means that the string passed validation right before being printed. wrd = true means that every character in it should be alphabetic (isAlpha), a hyphen, an apostrophe, or an underscore. cnt = n means that it was detected n times in the dataset (of 8013 text files). And the string in each struct is only written once in the execution of the program.

I was scanning the dataset to see which long words were valid. I didn't expect THIS at all. And as you can see from, e.g., "Spectrophotometer", ASCII values don't seem to be damaged at all. FWIW, I was expecting to encounter an occasional Greek, French, or Chinese word, but nothing like this.

I'd think the problem was the conversion from string to dchar[] and back, but when I test immediately after writing to the string, everything looks right. So I'm guessing it's something about how Unicode is handled.
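For anyone who wants to reproduce the kind of check described above, here is a minimal sketch. It is not my actual program: the Report struct, the check function, and the sample word are hypothetical and only there for illustration. It validates a string, round-trips it through dchar[], and reports its length both in UTF-8 code units (which is what .length gives for a string) and in code points. The hex dump at the end shows the stored bytes regardless of how the terminal or mail client decodes them.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writefln;
import std.string : representation;
import std.utf : validate;

// Hypothetical helper type, not taken from the original program.
struct Report
{
    size_t codeUnits;   // txt.length: number of UTF-8 code units (bytes)
    size_t codePoints;  // number of decoded code points
    bool roundTrips;    // string -> dchar[] -> string gives back the same string
}

// Hypothetical helper: runs the checks on one string.
Report check(string txt)
{
    // Throws std.utf.UTFException if txt is not well-formed UTF-8.
    validate(txt);

    // Convert to UTF-32 and back again.
    dchar[] wide = txt.to!(dchar[]);
    string back = wide.to!string;

    return Report(txt.length, txt.walkLength, back == txt);
}

void main()
{
    // Sample Greek word written with escapes, so the encoding of the
    // source file cannot affect the test.
    string txt = "\u03B4\u03B9\u03B1\u03BC\u03AD\u03BD\u03C9\u03BD";

    auto r = check(txt);
    writefln("len = %s, code points = %s, round trip ok = %s",
             r.codeUnits, r.codePoints, r.roundTrips);

    // Raw bytes of the string in hex.
    writefln("%(%02x %)", txt.representation);
}
```

If the hex dump shows the expected UTF-8 bytes while the printed text looks wrong, that would suggest the string itself is intact and the damage happens somewhere on the output or display side, though this sketch alone cannot confirm that.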
Apr 04 2016