digitalmars.D.learn - unicode confusion

Well, at least I think that it's Unicode confusion.  When I store a value 
into a string (in an array of structs) and then compare it against 
itself, it compares fine, and if I write it out at that point it writes 
out fine.  And validate says it's good Unicode.

But later...
valid = true, len = 17, wrd = true , cnt =     2, txt = Gesammtabentheuer
valid = true, len = 27, wrd = true , cnt =     1, txt = 
νεφεληγερέτης
valid = true, len = 17, wrd = true , cnt =     1, txt =
ζητοῦσιν
valid = true, len = 36, wrd = true , cnt =     1, txt = 
αἱμορροϊδοκαύστης
valid = true, len = 18, wrd = true , cnt =     2, txt =
δυνηθώμεν
valid = true, len = 20, wrd = true , cnt =     1, txt =
προσκρούσω
valid = true, len = 20, wrd = true , cnt =     1, txt =
σκοτωμένην
valid = true, len = 18, wrd = true , cnt =     1, txt =
ἀγαπητοί
valid = true, len = 28, wrd = true , cnt =     1, txt = 
הַֽמְזִמָּתָהּ
valid = true, len = 19, wrd = true , cnt =     1, txt =
Τυρρηνικά
valid = true, len = 17, wrd = true , cnt =     2, txt = IODOHYDRARGYRATIS
valid = true, len = 21, wrd = true , cnt =     1, txt = 
χοινικίσιν
valid = true, len = 17, wrd = true , cnt =     1, txt = Spectrophotometer
valid = true, len = 26, wrd = true , cnt =     1, txt = 
αἰνιττόμενοι
valid = true, len = 70, wrd = true , cnt =     1, txt = 
ΓΗΣΠΛΕΘΡΑΔΙΣΧΙΛΙΑΕΡΓΑΣΙΜΟΥΑΠΟΤΗΣΟΜΟ
valid = true, len = 18, wrd = true , cnt =     1, txt =
μικρότατα
valid = true, len = 23, wrd = true , cnt =     1, txt = 
ἀποπάτησίν
valid = true, len = 18, wrd = true , cnt =     1, txt =
מוֹקְשֵׁי
valid = true, len = 17, wrd = true , cnt =     1, txt =
διαμένων
      . . . (etc. for 39599 lines)
(And it looks worse than that, actually, because control characters 
aren't coming through.)
I think the originals were usually Greek letters, based on an earlier test 
(why there should be so many Greek words I don't know...but if they're 
there I want them to be handled properly), but the corrupted text is 
such a small part of the original file that I can't be certain.  valid = 
true means that the string passed validate right before being printed.  
wrd = true means that the only characters in it should be isAlpha, 
hyphen, apostrophe, or underscore.  cnt = n means that it was detected n 
times in the dataset (of 8013 text files).  And the string in each struct 
is only written once in the execution of the program.
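
A minimal sketch of the two checks described above, in D. It assumes that "validate" is std.utf.validate and that "isAlpha" is std.uni.isAlpha; the helper names isValidUtf and isWord are hypothetical and are not taken from the original program. The sample word is ζητοῦσιν from the log, written with escapes so its bytes are unambiguous.

```d
import std.algorithm.searching : all;
import std.range.primitives : walkLength;
import std.stdio : writefln;
import std.uni : isAlpha;
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
/// std.utf.validate throws a UTFException on malformed input.
bool isValidUtf(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

/// Hypothetical helper: true if every code point is alphabetic,
/// a hyphen, an apostrophe, or an underscore.
bool isWord(string s)
{
    // Range algorithms over a string decode it to dchar code points.
    return s.all!(c => isAlpha(c) || c == '-' || c == '\'' || c == '_');
}

void main()
{
    // "ζητοῦσιν": seven 2-byte letters plus one 3-byte letter (U+1FE6).
    string txt = "\u03B6\u03B7\u03C4\u03BF\u1FE6\u03C3\u03B9\u03BD";

    // .length counts UTF-8 code units; walkLength counts code points.
    writefln("valid = %s, len = %s, points = %s, wrd = %s, txt = %s",
             isValidUtf(txt), txt.length, txt.walkLength, isWord(txt), txt);
}
```

For this literal, .length is 17 while walkLength is 8, which agrees with the "len = 17" logged for the same word if len is taken from the string's .length.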

I was scanning the dataset looking to see what long words were valid...I 
didn't expect THIS at all.  And as you can see from, e.g., 
"Spectrophotometer", ASCII values don't seem to be damaged at all.

FWIW, I was expecting to encounter an occasional Greek, French, or 
Chinese word...but nothing like this.  I'd think it was the conversion 
from string to dchar[] and back that was the problem, but when I test 
immediately after writing to the string, everything looks right.  So 
I'm guessing it's something about how Unicode is handled.
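
A small sketch of the string to dchar[] and back conversion mentioned above, assuming it is done with std.conv.to (the original program may do it differently). The function name roundTrip is hypothetical. The sample word is ἀγαπητοί from the log, again written with escapes.

```d
import std.conv : to;
import std.stdio : writefln;
import std.utf : validate;

/// Hypothetical helper: decode UTF-8 to UTF-32 and re-encode as UTF-8.
string roundTrip(string s)
{
    dchar[] wide = s.to!(dchar[]);  // one dchar per code point
    return wide.to!string;          // back to UTF-8 code units
}

void main()
{
    // "ἀγαπητοί": U+1F00 and U+1F77 take 3 bytes each, the rest 2 bytes.
    string original = "\u1F00\u03B3\u03B1\u03C0\u03B7\u03C4\u03BF\u1F77";
    string copy = roundTrip(original);

    validate(copy);                 // throws if the result is malformed
    assert(copy == original);       // the round trip preserves every byte

    writefln("bytes = %s, code points = %s, txt = %s",
             copy.length, original.to!(dchar[]).length, copy);
}
```

If a round trip like this leaves the bytes unchanged, comparing the raw bytes of the stored string at write time and again at print time is one way to narrow down where the text changes.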
Apr 04 2016