digitalmars.D.bugs - [Issue 10203] New: std.string.toUpperInPlace is... not in place
- d-bugmail puremagic.com (19/19) May 29 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (7/7) May 29 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (36/39) May 30 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (21/28) May 30 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (20/32) Sep 21 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (12/17) Sep 21 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (14/24) Sep 21 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
- d-bugmail puremagic.com (16/16) Sep 21 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10203
http://d.puremagic.com/issues/show_bug.cgi?id=10203 Summary: std.string.toUpperInPlace is... not in place Product: D Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Phobos AssignedTo: nobody puremagic.com ReportedBy: monarchdodra gmail.com As reported by manu in his talk. All in the title. toUpperInPlace will actually allocated a new string. As a matter of fact, it will allocate, a LOT: a new string per char. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 29 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203 Fixed here https://github.com/D-Programming-Language/phobos/pull/1322 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 29 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203Fixed here https://github.com/D-Programming-Language/phobos/pull/1322Turns out, upon investigation, that my fix is incorrect. What is more problematic though, is that the review revealed that there are cases where toUpperInPlace simply *cannot* be inplace: there characters out there, whose UTF representation is not the same length for upper case and lower case. What this means is that the lower case representation could be longer than the original length. How could this be done in place? It also reveals more problems: If the resulting string is actually smaller, than slices that aliased the same data will now reference garbage values. For example: 'İ', which is u0130, and a two byte encoding will become 'i'. So when I take the string "İa": auto a = "\xC4\xB0\x61"; auto b = a; toLowerInPlace(a); //Now: //a == "\x69\xB0" //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Or: say, I have "aİa", and want to reduce in place "İ" auto a = "\x61\xC4\xB0\x61"; auto b = a[1 .. 3]; toLowerInPlace(b); //Now: //b == "\x69" //OK //a == "\x61\x69\xB0\x61" //Wait, what is that B0 doinh here? -------- It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. I think it should either be tagged with a massive red unsafe, or deprecated. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 30 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203 monarchdodra gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- See Also| |http://d.puremagic.com/issu | |es/show_bug.cgi?id=9629 AssignedTo|monarchdodra gmail.com |nobody puremagic.comauto a = "\xC4\xB0\x61"; auto b = a; toLowerInPlace(a); //Now: //a == "\x69\xB0" //b == "\x69\xB0\x61" Oops: Trailing code unit :/Wait, this example is wrong, corrected as: take the string "aİ" auto a = "\x61\xC4\xB0"; auto b = a; toLowerInPlace(a); //Now: //a == "\x61\x69" //b == "\x61\x69\xB0" Oops: Trailing code unit :/ Sorry. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 30 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203 Dmitry Olshansky <dmitry.olsh gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dmitry.olsh gmail.com 13:58:19 PDT ---Wait, this example is wrong, corrected as: take the string "aİ" auto a = "\x61\xC4\xB0";auto a = "\x61\xC4\xB0".dup;auto b = a; toLowerInPlace(a); //Now: //a == "\x61\x69" //b == "\x61\x69\xB0" Oops: Trailing code unit :/ Sorry.I belive we now should have solid treatment of toUpper/toLower (pending a bugfix in the works). For me this gives now: //a == 61 69 CC 87 //b == 61 C4 B0 i.e. toLowerInPlace fails to do this in place because resulting length increases. Should it try to extend if a.capacity allows it? (I bet it has about 15 bytes of storage) -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 21 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203 bearophile_hugs eml.cc changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |bearophile_hugs eml.ccIt seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. I think it should either be tagged with a massive red unsafe, or deprecated.Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 21 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203 14:26:34 PDT ---It also works quite well for UTF-16. And it does now. And I would have kept it if only because of backwards compatibility perspective. There is no other primitive yet, but a version(s) with output range for all string transformations is something to look forward to. The only question is whether or not should this function try to use extra capacity beyond the length if it's present (that would make InPlace suffix look saner). -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. I think it should either be tagged with a massive red unsafe, or deprecated.Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case.
Sep 21 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10203 w0rp <devw0rp gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |devw0rp gmail.com It may be worth considering presenting only what is possible, so offer toUpperInPlace or toLowerInPlace only on ASCII data. As you say, Unicode strings change in byte length dramatically, but the ASCII range stays the same. So I think it would be worth offering functions with different names, or perhaps with locale template arguments. (Locale.ascii, or whatever.) Just so there can be a guarantee of something never allocating when you want to avoid that, but still let you get at a "almost there" function when you want that. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 21 2013