digitalmars.D - numericValue for (unicode) characters
- monarch_dodra (51/51) Jan 02 2013 There is an ER that would allow to convert characters to numbers:
- bearophile (13/16) Jan 02 2013 For the ASCII version I have two use cases:
- Dmitry Olshansky (14/28) Jan 02 2013 Then we can maybe just drop this function? What's wrong with
- Andrei Alexandrescu (5/25) Jan 02 2013 Unnecessary flow :o).
- Dmitry Olshansky (6/31) Jan 02 2013 Yup, and it's 2 lines then. And if one really wants to chain it:
- monarch_dodra (25/34) Jan 02 2013 Well, just because it's almost trivial to us doesn't mean it hurts
- H. S. Teoh (8/37) Jan 02 2013 +1. Code intent is important.
- Dmitry Olshansky (15/40) Jan 03 2013 I don't mind adding because of completeness and/or symmetry stand point
- monarch_dodra (11/15) Jan 03 2013 Hum... We could always "camp" the std.uni's numericValue function?
- Dmitry Olshansky (4/17) Jan 03 2013 We'd pretty much have to.
- monarch_dodra (10/23) Jan 03 2013 Or, you know... I could just implement both at the same time.
- Dmitry Olshansky (6/27) Jan 03 2013 It's just an idea that I have exceptionally fast version for Unicode
- monarch_dodra (6/9) Jan 04 2013 Well, I already mentioned to you how I was planning to do it:
- H. S. Teoh (30/37) Jan 03 2013 [...]
- monarch_dodra (11/29) Jan 04 2013 ... almost! 1e12 will have a negative value when cast to int. To
- Jonathan M Davis (6/9) Jan 04 2013 I'm not a fan of the ASCII version returning -1, but I don't really have...
- Dmitry Olshansky (10/19) Jan 04 2013 I find low-level stuff that throws to be overly awkward to deal with
- monarch_dodra (29/51) Jan 04 2013 I finished an implementation:
- monarch_dodra (7/14) Jan 04 2013 Wait: I figured it out: They are just non-numbers that happen to
- Dmitry Olshansky (15/68) Jan 04 2013 Well, for start it features tons of code duplication. But I'm replacing
- monarch_dodra (29/67) Jan 04 2013 Well, I wrote that with duplication, keeping in mind you would
- Dmitry Olshansky (21/87) Jan 04 2013 Basically check the bottom of that page:
- monarch_dodra (15/35) Jan 04 2013 Sounds like the root of the problem is that isNumber !=
- H. S. Teoh (14/30) Jan 04 2013 Yikes. That's pretty ... nasty. :-(
- monarch_dodra (20/23) Jan 07 2013 I guess it's just bad wording from the standard.
- H. S. Teoh (20/48) Jan 09 2013 Hmph. I guess we need to differentiate between the unicode category
- Dmitry Olshansky (33/73) Jan 10 2013 isNumber - _Number_ General category (as defined by Unicode 1:1)
- monarch_dodra (16/63) Jan 10 2013 Are you sure about that? The four values of Numeric_Type are:
- monarch_dodra (6/7) Jan 07 2013 Thank you for all your feed back.
- bearophile (9/12) Jan 02 2013 I think you meant to write:
There is an ER that would allow converting characters to numbers:
http://d.puremagic.com/issues/show_bug.cgi?id=5543

For example:
'1' => 1
Or, unicode considered:
'Ⅶ' => 7

Long story short, it was decided that it wasn't std.conv.to's job to do this conversion, but rather that there should be a function called "numericValue" inside std.uni and std.ascii that would do this job.

What remains is defining how these functions should work. Things to keep in mind:
- ASCII to int should be fast.
- unicode numeric values span from -0.5 to 1.0e12.
- unicode numeric values can be fractional.
- ALL unicode numeric values can be EXACTLY represented in a double.

Given these observations, I'd like to propose these:

//------------------------------
//std.ascii.numericValue
/**
Given an ascii character, returns that character's numeric value
if it is numeric ($(D isNumeric)), and -1 otherwise
*/
pure @safe nothrow
int numericValue(dchar c);
//------------------------------
//std.uni.numericValue
/**
Given a unicode character, returns that character's numeric value
if it is numeric ($(D isNumeric)), and throws an exception otherwise
*/
pure @safe
double numericValue(dchar c);
//------------------------------

The rationale for this:

std.ascii: I think returning -1 as a magic number should help keep the code faster and with less clutter than with exceptions. Returning an int is the obvious choice for numbers that span -1 to 9.

std.uni: double is the only type that can hold the whole range of unicode's numeric values. This time, uni throws exceptions. This is for two reasons:
1. Choosing a magic number is difficult, and error prone. Correct code would have to look like: "if (std.uni.numericValue(c) > -0.7) {...}"
2. When dealing with unicode, an exception is probably cleaner, and its overhead is not as critical as with ascii.

***********************************************

Thoughts? I wanted to get this ER moved forward.

I don't think uni.numericValue will be finished soon, but I would like to have std.ascii's done sooner rather than later.
Jan 02 2013
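To make the proposed ASCII semantics concrete, here is a minimal sketch of the "-1 as sentinel" behaviour. The name asciiNumericValue is hypothetical (it is not a Phobos function), and the digit check is written out by hand so the block does not depend on any library call.

```d
/// Hypothetical stand-in for the proposed std.ascii.numericValue:
/// returns the digit's value for '0'..'9', and the sentinel -1 otherwise.
int asciiNumericValue(dchar c) pure nothrow @safe
{
    // dchar - char yields an unsigned result, hence the explicit cast.
    return ('0' <= c && c <= '9') ? cast(int)(c - '0') : -1;
}
```

A caller would test the result against -1 (or simply "< 0") before using it, which is the cost of the sentinel approach compared to an exception.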
monarch_dodra:

> The rationale for this: std.ascii: I think returning -1 as a magic number should help keep the code faster and with less clutter than with exceptions.

For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(), that raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.

There is also std.typecons.nullable, it's a possibility for std.uni.numericValue. Generally Phobos should eat more of its dog food :-)

Bye,
bearophile
Jan 02 2013
1/2/2013 7:24 PM, bearophile wrote:
> monarch_dodra:
>> The rationale for this: std.ascii: I think returning -1 as a magic number should help keep the code faster and with less clutter than with exceptions.
> For the ASCII version I have two use cases:
> - Where I want to go fast&unsafe I just use "c - '0'".
> - When I want more safety I'd like to use something like to!(), that raises exceptions in case of errors.
> A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.

Then we can maybe just drop this function? What's wrong with

if(std.ascii.isNumeric(a))
    a -= '0';
else
    enforce(false);

I mean that the time to look it up in the std library is much bigger than the time to roll your own with either of the 2 semantics. Unlike the unicode version, of course.

Then IMO having the std.ascii one is mostly just for symmetry and thus I think that both should just use some sentinel value.

> There is also std.typecons.nullable, it's a possibility for std.uni.numericValue. Generally Phobos should eat more of its dog food :-)

double.nan sounds more like it.

> Bye, bearophile

--
Dmitry Olshansky
Jan 02 2013
On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
> 1/2/2013 7:24 PM, bearophile wrote:
>> monarch_dodra:
>>> The rationale for this: std.ascii: I think returning -1 as a magic number should help keep the code faster and with less clutter than with exceptions.
>> For the ASCII version I have two use cases:
>> - Where I want to go fast&unsafe I just use "c - '0'".
>> - When I want more safety I'd like to use something like to!(), that raises exceptions in case of errors.
>> A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.
> Then we can maybe just drop this function? What's wrong with
> if(std.ascii.isNumeric(a)) a -= '0'; else enforce(false);

Unnecessary flow :o).

enforce(std.ascii.isNumeric(a));
a -= '0';

Andrei
Jan 02 2013
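For comparison with the sentinel approach, the throwing two-liner above can be wrapped in a function. This is only a sketch: checkedDigitValue is a hypothetical name, and it assumes that std.ascii.isDigit is the check the posts refer to as "isNumeric".

```d
import std.ascii : isDigit;
import std.exception : enforce;

/// Hypothetical helper: the value of an ASCII digit, throwing on anything else.
int checkedDigitValue(dchar c)
{
    // enforce throws an Exception when the condition is false.
    enforce(isDigit(c), "not an ASCII digit");
    return cast(int)(c - '0');
}
```

The body really is two lines, which is the point being argued in the thread: the function adds a name and a documented contract rather than any real logic.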
1/3/2013 12:21 AM, Andrei Alexandrescu wrote:
> On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
>> 1/2/2013 7:24 PM, bearophile wrote:
>>> monarch_dodra:
>>>> The rationale for this: std.ascii: I think returning -1 as a magic number should help keep the code faster and with less clutter than with exceptions.
>>> For the ASCII version I have two use cases:
>>> - Where I want to go fast&unsafe I just use "c - '0'".
>>> - When I want more safety I'd like to use something like to!(), that raises exceptions in case of errors.
>>> A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.
>> Then we can maybe just drop this function? What's wrong with
>> if(std.ascii.isNumeric(a)) a -= '0'; else enforce(false);
> Unnecessary flow :o).
> enforce(std.ascii.isNumeric(a));
> a -= '0';

Yup, and it's 2 lines then. And if one really wants to chain it:

map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it a Phobos candidate then ;)

--
Dmitry Olshansky
Jan 02 2013
On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky wrote:
> Yup, and it's 2 lines then. And if one really wants to chain it:
> map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);
> Hardly makes it Phobos candidate then ;)

Well, just because it's almost trivial to us doesn't mean it hurts to have it. The fact that you can even operate on chars in such a fashion (c - '0') is not obvious to everyone: I've seen time and time again code such as:

//----
if (97 <= c && c <= 122)
    c -= 97;
//----

numericValue helps keep things clean and self-documented. What's more, it helps keep ascii complete. Code originally written for ascii is easily upgradable to support uni (and vice-versa).

Furthermore, *writing* "std.ascii.numericValue" self-documents ascii-only support, which is less obvious in code using "c - '0'": In the original pull request to "improve" conv.to, the fact that it did not support unicode didn't even cross our minds. Seeing "std.ascii.numericValue" raises the eyebrow. It *forces* unicode consideration (regardless of which is right, it can't be ignored).

Really, by the rationale of "it's 2 lines", we shouldn't even have "std.ascii.isNumeric" at all...

On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky wrote:
> 1/2/2013 7:24 PM, bearophile wrote:
>> There is also std.typecons.nullable, it's a possibility for std.uni.numericValue. Generally Phobos should eat more of its dog food :-)
> double.nan sounds more like it.

Hum... nan. I like it.
Jan 02 2013
On Wed, Jan 02, 2013 at 11:15:31PM +0100, monarch_dodra wrote:
> On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky wrote:
>> Yup, and it's 2 lines then. And if one really wants to chain it:
>> map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);
>> Hardly makes it Phobos candidate then ;)
> Well, just because it's almost trivial to us doesn't mean it hurts to have it. The fact that you can even operate on chars in such a fashion (c - '0') is not obvious to everyone: I've seen time and time again code such as:
> //----
> if (97 <= c && c <= 122) c -= 97;
> //----
> numericValue helps keep things clean and self-documented.

+1. Code intent is important.

[...]
> On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky wrote:
>> 1/2/2013 7:24 PM, bearophile wrote:
>>> There is also std.typecons.nullable, it's a possibility for std.uni.numericValue. Generally Phobos should eat more of its dog food :-)
>> double.nan sounds more like it.
> Hum... nan. I like it.

+1 for nan. It's about time we used nan for something useful beyond just an annoying default value for floating-point variables. :)

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
Jan 02 2013
1/3/2013 2:15 AM, monarch_dodra wrote:
> On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky wrote:
>> Yup, and it's 2 lines then. And if one really wants to chain it:
>> map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);
>> Hardly makes it Phobos candidate then ;)
> Well, just because it's almost trivial to us doesn't mean it hurts to have it. The fact that you can even operate on chars in such a fashion (c - '0') is not obvious to everyone: I've seen time and time again code such as:
> //----
> if (97 <= c && c <= 122) c -= 97;
> //----
> numericValue helps keep things clean and self-documented. What's more, it helps keep ascii complete. Code originally written for ascii is easily upgradable to support uni (and vice-versa).
> Furthermore, *writing* "std.ascii.numericValue" self-documents ascii-only support, which is less obvious in code using "c - '0'": In the original pull request to "improve" conv.to, the fact that it did not support unicode didn't even cross our minds. Seeing "std.ascii.numericValue" raises the eyebrow. It *forces* unicode consideration (regardless of which is right, it can't be ignored).
> Really, by the rationale of "it's 2 lines", we shouldn't even have "std.ascii.isNumeric" at all...

I don't mind adding it from a completeness and/or symmetry standpoint, as I said.

I do see another cool issue popping up though. It's a problem of how the anti-hijacking works.

Say we add numericValue right now to std.ascii but not std.uni. A release later we have numericValue in std.uni (well hopefully they are both in the same 2.062 ;) ).

Now take this code:

map!numericValue(...)

If the code also happens to import std.uni it's going to stop compiling. That's one of the reasons I think our hopes on stability (as in: compiles in 5 years from now) are ill placed, as we can't have it until the library is essentially dead in stone.

--
Dmitry Olshansky
Jan 03 2013
On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
> Now take this code:
> map!numericValue(...)
> If the code also happens to import std.uni it's going to stop compiling.

Hum... We could always "camp" the std.uni's numericValue function?

//----
double numericValue()(dchar c) const nothrow @safe
{
    static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
}
//----

This would avoid the breakage you mentioned.
Jan 03 2013
03-Jan-2013 21:13, monarch_dodra wrote:
> On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
>> Now take this code:
>> map!numericValue(...)
>> If the code also happens to import std.uni it's going to stop compiling.
> Hum... We could always "camp" the std.uni's numericValue function?
> //----
> double numericValue()(dchar c) const nothrow @safe
> {
>     static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
> }
> //----

We'd pretty much have to.

--
Dmitry Olshansky
Jan 03 2013
On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky wrote:
> 03-Jan-2013 21:13, monarch_dodra wrote:
>> On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
>>> Now take this code:
>>> map!numericValue(...)
>>> If the code also happens to import std.uni it's going to stop compiling.
>> Hum... We could always "camp" the std.uni's numericValue function?
>> [SNIP]
> We'd pretty much have to.

Or, you know... I could just implement both at the same time. It's not like there's an *urgency* for the ascii version or anything. I think I'll just do that.

So... do we agree on
ascii: int - not found => -1
uni: double - not found => nan
?

I can still get started anyways, even if it isn't definite.
Jan 03 2013
03-Jan-2013 23:40, monarch_dodra wrote:
> On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky wrote:
>> 03-Jan-2013 21:13, monarch_dodra wrote:
>>> On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
>>>> Now take this code:
>>>> map!numericValue(...)
>>>> If the code also happens to import std.uni it's going to stop compiling.
>>> Hum... We could always "camp" the std.uni's numericValue function?
>>> [SNIP]
>> We'd pretty much have to.
> Or, you know... I could just implement both at the same time. It's not like there's an *urgency* for the ascii version or anything. I think I'll just do that.
> So... do we agree on
> ascii: int - not found => -1
> uni: double - not found => nan
> ?

Fine by me.

> I can still get started anyways, even if it isn't definite.

Just keep in mind that I have an exceptionally fast version for Unicode just around the corner, but I wouldn't mind some competition ;)

--
Dmitry Olshansky
Jan 03 2013
On Thursday, 3 January 2013 at 20:14:43 UTC, Dmitry Olshansky wrote:
> It's just an idea that I have exceptionally fast version for Unicode just around the corner, but I wouldn't mind some competition ;)

Well, I already mentioned to you how I was planning to do it: Just a stupid binary search over ranges of numbers indexed on 0.

The "big" chunk of work, actually (IMO), is just creating the raw data...
Jan 04 2013
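One possible reading of "binary search over ranges of numbers" is sketched below. Everything here is an assumption made for illustration: the NumericRange layout, the three-entry table, and the name lookupNumericValue are hypothetical and are not taken from the pull request. Each table entry describes a run of consecutive code points whose values increase by 1, and the table is sorted by code point so that it can be bisected.

```d
/// Hypothetical table entry: a run of consecutive code points whose
/// numeric values increase by 1, starting at `start` for `first`.
struct NumericRange
{
    dchar first;   // first code point of the run
    dchar last;    // last code point of the run (inclusive)
    double start;  // numeric value of `first`
}

// Tiny illustrative table, sorted by code point. A real table would be
// generated from UnicodeData.txt and be far larger.
immutable NumericRange[] numericRanges = [
    NumericRange('0', '9', 0),            // DIGIT ZERO..NINE
    NumericRange('\u0660', '\u0669', 0),  // ARABIC-INDIC DIGIT ZERO..NINE
    NumericRange('\u2160', '\u216B', 1),  // ROMAN NUMERAL ONE..TWELVE
];

/// Returns the numeric value of `c`, or nan if `c` is in no range.
double lookupNumericValue(dchar c) pure nothrow @safe
{
    size_t lo = 0;
    size_t hi = numericRanges.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < numericRanges[mid].first)
            hi = mid;
        else if (c > numericRanges[mid].last)
            lo = mid + 1;
        else
            return numericRanges[mid].start + (c - numericRanges[mid].first);
    }
    return double.nan;
}
```

Fractional or irregular values would need their own single-element entries in such a table, which is where most of the data preparation effort mentioned above would go.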
On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
[...]
> Or, you know... I could just implement both at the same time. It's not like there's an *urgency* for the ascii version or anything. I think I'll just do that.
> So... do we agree on
> ascii: int - not found => -1
> uni: double - not found => nan
[...]

LGTM. :)

I did think of what might happen if somebody wrote an int cast for std.uni.numericValue:

void sloppyProgrammersFunction(dchar ch) {
    // First attempt: compiler error: can't implicitly
    // convert double -> int ...
    //int val = std.uni.numericValue(ch);

    // ... so sloppy programmer inserts a cast
    int val = cast(int)std.uni.numericValue(ch);

    // On Linux/64, if numericValue returns nan, this prints
    // -int.max.
    writeln(val);

    // So this should work:
    if (val < 0) {
        // (In fact, it will still work if
        // std.ascii.numericValue were used instead.)
        writeln("Sloppy code caught the problem correctly!");
    }
}

So it seems that everything should be alright.

This particular example occurred to me, 'cos I'm thinking of how often one wishes to extract an integral value from a string, and usually one doesn't think that floating point is necessary(!), so the cast from double is a rather big temptation (even though it's wrong!).

T

--
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
Jan 03 2013
On Thursday, 3 January 2013 at 21:51:14 UTC, H. S. Teoh wrote:
> On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
> [...]
>> Or, you know... I could just implement both at the same time. It's not like there's an *urgency* for the ascii version or anything. I think I'll just do that.
>> So... do we agree on
>> ascii: int - not found => -1
>> uni: double - not found => nan
> [...]
> LGTM. :)
> I did think of what might happen if somebody wrote an int cast for std.uni.numericValue
> [SNIP]
> writeln("Sloppy code caught the problem correctly!");

... almost! 1e12 will have a negative value when cast to int. To be 100% correct in regards to converting, the end user would have to use long. But that'd be a *really exceptional* case...

Even with long, the only problem with the code is that the user would not know the difference between an exact integral value and an inexact one.

Well, that's what the user gets for being sloppy I guess.

In any case, I think we'd have to provide an example section with a "recommended" way of casting to integral.
Jan 04 2013
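As a rough idea of what such a "recommended" conversion could look like, here is a sketch that only succeeds for values that are exactly integral. The helper name toExactIntegral is hypothetical, and the input is assumed to be a double as returned by the proposed std.uni.numericValue (nan meaning "no numeric value").

```d
import std.math : isNaN;

/// Hypothetical helper: sets `result` and returns true only when `value`
/// is a whole number; nan and fractions such as 0.5 are rejected.
bool toExactIntegral(double value, out long result) pure nothrow @safe
{
    // Reject nan and anything far outside the range Unicode uses,
    // so that the cast below is always well defined.
    if (isNaN(value) || value < -1.0e15 || value > 1.0e15)
        return false;

    immutable truncated = cast(long) value;
    if (truncated != value)
        return false; // fractional value: a cast would silently lose data

    result = truncated;
    return true;
}
```

Using long rather than int keeps the largest value mentioned in the thread (1.0e12) representable, and the boolean result lets the caller tell "not a number" and "not a whole number" apart from a genuine value.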
On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
> So... do we agree on
> ascii: int - not found => -1
> uni: double - not found => nan

I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however.

- Jonathan M Davis
Jan 04 2013
04-Jan-2013 15:58, Jonathan M Davis wrote:
> On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
>> So... do we agree on
>> ascii: int - not found => -1
>> uni: double - not found => nan
> I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however.
> - Jonathan M Davis

I find low-level stuff that throws to be overly awkward to deal with (not to mention performance problems).

Hm... I've found a brilliant primitive, Expected!T, that could be of great help in the error code vs exceptions problem. See Andrei's recent talk that went live not long ago:
http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

Time to put the analogous stuff into Phobos?

--
Dmitry Olshansky
Jan 04 2013
On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
> 04-Jan-2013 15:58, Jonathan M Davis wrote:
>> On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
>>> So... do we agree on
>>> ascii: int - not found => -1
>>> uni: double - not found => nan
>> I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however.
>> - Jonathan M Davis
> I find low-level stuff that throws to be overly awkward to deal with (not to mention performance problems).
> Hm... I've found a brilliant primitive, Expected!T, that could be of great help in the error code vs exceptions problem. See Andrei's recent talk that went live not long ago:
> http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C
> Time to put the analogous stuff into Phobos?

I finished an implementation:
https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

I raised a couple of issues in the pull, which I'll copy here:

//----
I did run into a couple of issues, namely that I'm not getting 100% equivalence between chars that are numeric, and chars with numeric value... Is this normal...?

* There's a fair bit of chars that have numeric value, but aren't isNumber. I think they might be new in 6.1.0. But I'm not sure. I decided it was best to have them return nan, instead of having inconsistent behavior.

* There's a couple characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.

* There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK), -1 is just wild for invalid :/

Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?
//----

Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it if we have numericValue.
Jan 04 2013
On Friday, 4 January 2013 at 17:48:28 UTC, monarch_dodra wrote:
> //----
> Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?

Wait: I figured it out: They are just non-numbers that happen to be inside Nl (Number Letter):
http://unicode.org/cldr/utility/character.jsp?a=12433

Documentation on this is not very clear, nor consistent, so sorry for any confusion.

Well, I guess there is a bug in std.uni.isNumber then...
Jan 04 2013
04-Jan-2013 21:48, monarch_dodra wrote:
> On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
>> 04-Jan-2013 15:58, Jonathan M Davis wrote:
>>> On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
>>>> So... do we agree on
>>>> ascii: int - not found => -1
>>>> uni: double - not found => nan
>>> I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however.
>>> - Jonathan M Davis
>> I find low-level stuff that throws to be overly awkward to deal with (not to mention performance problems).
>> Hm... I've found a brilliant primitive, Expected!T, that could be of great help in the error code vs exceptions problem. See Andrei's recent talk that went live not long ago:
>> http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C
>> Time to put the analogous stuff into Phobos?
> I finished an implementation:
> https://github.com/D-Programming-Language/phobos/pull/1052
> It is not "pull ready", so we can still discuss it.

Well, for a start it features tons of code duplication. But I'm replacing the whole std.uni anyway...

> I raised a couple of issues in the pull, which I'll copy here:
> //----
> I did run into a couple of issues, namely that I'm not getting 100% equivalence between chars that are numeric, and chars with numeric value... Is this normal...?

Yes, it's called Unicode ;)

> * There's a fair bit of chars that have numeric value, but aren't isNumber. I think they might be new in 6.1.0. But I'm not sure. I decided it was best to have them return nan, instead of having inconsistent behavior.

You also might be using 6.2. It was released in the fall of 2012.

> * There's a couple characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.
> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK), -1 is just wild for invalid :/

Some have a numeric value of '-1' I think. The truth of the matter is that, as usual with Unicode, things are rather complicated. 'Numeric character' is a (general) category, and 'has numeric value' is some other property of a codepoint that may or may not correlate directly with the category. Thus I think (looking ahead into your other post) that isNumber is correct, as it follows its documented behavior.

> Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?

> //----
> Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it if we have numericValue.

--
Dmitry Olshansky
Jan 04 2013
On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
> 04-Jan-2013 21:48, monarch_dodra wrote:
>> I finished an implementation:
>> https://github.com/D-Programming-Language/phobos/pull/1052
>> It is not "pull ready", so we can still discuss it.
> Well, for start it features tons of code duplication. But I'm replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would probably replace both. I thought it'd be cleaner to have some duplication than a warped single implementation. I could also make the extra effort.

I was really concerned with first having an implementation that is unicode correct. I also thought that, at worst, you could use my parsed data ;) to submit your own (superior?) pull.

>> * There's a couple characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.
>> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK), -1 is just wild for invalid :/
> Some have numeric value of '-1' I think. The truth of the matter is as usual with Unicode things are rather complicated. So 'numeric character' is a category (general) and 'has numeric value' is some other property of codepoint that may or may not correlate directly with category. Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.

Well, the thing is that I'm getting contradictory info from the consortium itself:

Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"

According to "UnicodeData.txt", its numeric value is -1.

According to the "Unicode utilities", it is not a numeric type, and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt: They really seem like 4 entries in Nl that aren't numbers. I've found a couple of people on the internet discussing this, but no hard conclusion :/

****

Anyways, those 4 CUNEIFORM aside, what do you make of the entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?

>> Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
>> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>> It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?
> Nope. Does letter 'A' return a wild number?
Jan 04 2013
05-Jan-2013 00:51, monarch_dodra wrote:
> On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
>> 04-Jan-2013 21:48, monarch_dodra wrote:
>>> I finished an implementation:
>>> https://github.com/D-Programming-Language/phobos/pull/1052
>>> It is not "pull ready", so we can still discuss it.
>> Well, for start it features tons of code duplication. But I'm replacing the whole std.uni anyway...
> Well, I wrote that with duplication, keeping in mind you would probably replace both. I thought it'd be cleaner to have some duplication than a warped single implementation. I could also make the extra effort.
> I was really concerned with first having an implementation that is unicode correct. I also thought that, at worst, you could use my parsed data ;) to submit your module that is well due for peer review.

Fixed ;)

>>> * There's a couple characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.
>>> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK), -1 is just wild for invalid :/
>> Some have numeric value of '-1' I think. The truth of the matter is as usual with Unicode things are rather complicated. So 'numeric character' is a category (general) and 'has numeric value' is some other property of codepoint that may or may not correlate directly with category. Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.
> Well, the thing is that I'm getting contradictory info from the consortium itself:
> Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
> According to "UnicodeData.txt", its numeric value is -1.
> According to the "Unicode utilities", it is not a numeric type, and its value is null:
> http://unicode.org/cldr/utility/character.jsp?a=12456
> Also according to the consortium: "-1" is an illegal numeric value.
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]
> Really, all the info seems to indicate a bug in UnicodeData.txt: They really seem like 4 entries in Nl that aren't numbers. I've found a couple of people on the internet discussing this, but no hard conclusion :/

Basically check the bottom of that page:

....
See also: Unicode Display Problems.
Version 3.6; ICU version: 50.0.1.0; Unicode version: 6.1.0.0

So it's not up to date. The file is. I can test with ICU 51 to see what it reports.

>>> Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
>>> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>> It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?
>> Nope. Does letter 'A' return a wild number?
> ****
> Anyways, those 4 CUNEIFORM aside, what do you make of the entries in Lo:
> http://unicode.org/cldr/utility/character.jsp?a=F96B
> These appear to be numeric, but aren't inside Nd/No/Nl. They should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties, and then there is a General section. The General section has Category, which in turn has the 'Number' category.

Bottom line is that I believe that the std.uni isXXX functions query the category of a symbol and not some other property. Let any mishaps between the properties and the general category be the consortium's headache.

> Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some other behavior. Maybe this one:
[^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

--
Dmitry Olshansky
Jan 04 2013
On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
> 05-Jan-2013 00:51, monarch_dodra wrote:
>> Anyways, those 4 CUNEIFORM aside, what do you make of the entries in Lo:
>> http://unicode.org/cldr/utility/character.jsp?a=F96B
>> These appear to be numeric, but aren't inside Nd/No/Nl. They should return true to isNumber, no?
>
> Hmmm. Take a look here:
> http://unicode.org/cldr/utility/properties.jsp
>
> There is a section called Numeric that has 3 properties, and then there is a General section. The General section has Category, which in turn has the 'Number' category.
>
> Bottom line is that I believe that std.uni isXXX queries the category of a symbol and not some other property. Let any mishaps between properties and general category be the consortium's headache.
>
>> Maybe isNumber's "documented behavior" is wrong?
>
> Problem is I can't come up with a good description of some other behavior. Maybe this one: [^[:Numeric_Type=None:]]
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

Sounds like the root of the problem is that
isNumber != Numeric_Type[Decimal, Digit, Numeric]
Ergo, there is no correlation between isNumber and numericValue.

Feels like there is a lot missing from std.uni, but at the same time, Unicode is really huge.

At the very least, I think we should have a Category enum, along with a (get) "category" function.

I was just saying to jmdavis in the pull that std.ascii had "isDigit", but that uni didn't. In truth, both also lack isDecimal and isNumeric.

There would just be a bit of ambiguity now between the broad "isNumeric", and "all the chars that have a numeric value"... :/

Damn. Unicode is complicated.

Anyways, taking my weekend break.
Jan 04 2013
On Fri, Jan 04, 2013 at 11:48:39PM +0100, monarch_dodra wrote:
[...]
> Sounds like the root of the problem is that
> isNumber != Numeric_Type[Decimal, Digit, Numeric]
> Ergo, there is no correlation between isNumber and numericValue.

Yikes. That's pretty ... nasty. :-(

> Feels like there is a lot missing from std.uni, but at the same time, unicode is really huge.

Yeah, Unicode is a lot more complex than most people realize. Recently I read through TR14 (proper line-breaking in Unicode), and I was gaping in awe at the insane complexity of such a seemingly-simple task.

> At the very least, I think we should have Category enum, along with a (get) "category" function.

Yes! We need that!!

> I was just saying to jmdavis in the pull that std.ascii had "isDigit", but that uni didn't. In truth, both also lack isDecimal and isNumeric.
>
> There would just be a bit of ambiguity now between the broad "isNumeric", and "all the chars that have a numeric value"... :/
>
> Damn. Unicode is complicated.
[...]

I, for one, would love to know why isNumeric != hasNumericValue.

T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
Jan 04 2013
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
> [...]
> I, for one, would love to know why isNumeric != hasNumericValue.
>
> T

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] Number, Decimal Digit
[Nl] Number, Letter
[No] Number, Other

However, there are a couple of characters that *are* numbers, but aren't in those groups.

The "good" news is that the standard *does* define number_types to classify the kind of number a char is:
* None: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as wild, and "Number" as their general category.

This leaves us with ambiguity when choosing our word: technically '5' does not classify as "numeric", although you could consider it "has a numeric value".

I hope that makes sense.
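
The gap between the Number category and "has a numeric value" can be checked directly. A minimal sketch, assuming a current Phobos in which std.uni.isNumber tests the general category (Nd, Nl, No) and nothing else; the code points are chosen for illustration, with U+F96B being the Lo entry linked earlier in this thread:

```d
import std.uni : isNumber;

void main()
{
    // One member of each of the three Number subcategories.
    assert(isNumber('5'));       // Nd: DIGIT FIVE
    assert(isNumber('\u2167'));  // Nl: ROMAN NUMERAL EIGHT
    assert(isNumber('\u00BD'));  // No: VULGAR FRACTION ONE HALF

    // General category Lo, although UnicodeData.txt lists a numeric
    // value for it: the category query says "not a number".
    assert(!isNumber('\uF96B')); // CJK COMPATIBILITY IDEOGRAPH-F96B
}
```

So a character can carry a numeric value without isNumber reporting it, which is the ambiguity in naming discussed here.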
Jan 07 2013
On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
>> [...] I, for one, would love to know why isNumeric != hasNumericValue.
>
> I guess it's just bad wording from the standard.
>
> The standard defined 3 groups that make up Number:
> [Nd] Number, Decimal Digit
> [Nl] Number, Letter
> [No] Number, Other
>
> However, there are a couple of characters that *are* numbers, but aren't in those groups.
>
> The "good" news is that the standard *does* define number_types to classify the kind of number a char is:
> * None: Not a number
> * Digit: Obvious
> * Decimal: Any decimal number that is NOT a digit
> * Numeric: Everything else.
>
> So they used "Numeric" as wild, and "Number" as their general category.
>
> This leaves us with ambiguity when choosing our word: technically '5' does not classify as "numeric", although you could consider it "has a numeric value".
>
> I hope that makes sense.

Hmph. I guess we need to differentiate between the Unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.

Anyway, I'd love to see std.uni cover all Unicode categories.

Offhanded note: should we unify the various isX() functions into:

    bool inCategory(string category)(dchar ch)

where category is the Unicode designation, say "Nl", "Nd", etc.? That way, it's more future-proof in case the Unicode guys add more categories. Also makes it easier to remember which function to call; else you'd always have to remember "N" -> isNumeric, "L" -> isAlpha, etc.. The current names of course can be left as aliases.

T

-- 
The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
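
For illustration, the proposed signature could be sketched on top of the named sets that later Phobos releases expose through std.uni.unicode. This is only a sketch: inCategory is hypothetical and not part of Phobos, and it covers only the sets std.uni ships, not the full property database.

```d
import std.uni : unicode;

/// Hypothetical helper, not part of Phobos: true if `ch` belongs to the
/// Unicode set named by `category` (e.g. "Nd", "Nl", "No").
bool inCategory(string category)(dchar ch)
{
    // The name is resolved at compile time, so an unknown category is a
    // compile error. The set is rebuilt on every call, which is fine for
    // a sketch but too slow for a hot loop.
    mixin("auto set = unicode." ~ category ~ ";");
    return set[ch];
}

void main()
{
    assert(inCategory!"Nd"('7'));
    assert(!inCategory!"Nd"('x'));
    assert(inCategory!"Nl"('\u2167')); // ROMAN NUMERAL EIGHT
}
```

A real implementation would want to cache each set, or a trie built from it, rather than reconstruct it per call.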
Jan 09 2013
10-Jan-2013 03:21, H. S. Teoh wrote:
> On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
>> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
> [...]
>>> [...] I, for one, would love to know why isNumeric != hasNumericValue.
>>
>> I guess it's just bad wording from the standard.
>>
>> The standard defined 3 groups that make up Number:
>> [Nd] Number, Decimal Digit
>> [Nl] Number, Letter
>> [No] Number, Other
>>
>> However, there are a couple of characters that *are* numbers, but aren't in those groups.
>>
>> The "good" news is that the standard *does* define number_types to classify the kind of number a char is:
>> * None: Not a number
>> * Digit: Obvious
>> * Decimal: Any decimal number that is NOT a digit
>> * Numeric: Everything else.
>>
>> So they used "Numeric" as wild, and "Number" as their general category.
>>
>> This leaves us with ambiguity when choosing our word: technically '5' does not classify as "numeric", although you could consider it "has a numeric value".
>>
>> I hope that makes sense.
>
> Hmph. I guess we need to differentiate between the Unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)
isNumeric - as having NumericType != None (again going by the definition of Unicode properties)

And that's all, correct and to the letter.

> Anyway, I'd love to see std.uni cover all Unicode categories.
>
> Offhanded note: should we unify the various isX() functions into:
>
>     bool inCategory(string category)(dchar ch)

No, no, no! It's a horrible idea.

The main problem with it is that a huge catalog of data has to be stored in Phobos (object code) that is of no (even niche) use. Also, to be practical for use cases other than casual observation it has to be fast... and it can't be for any of the useful cases. Just count the number of bits to store per codepoint, and consider the fairly irregular structure of the whole set of properties (unlike individual combinations that do have a nice distribution, e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading through the TR-xx algorithms, and *none* of them requires queries of the sort that test all (more than 1-2?) of the properties. In all cases the algorithm itself defines a set (or sets) of codepoints with different meanings/values for its use case.

These (useful) sets could be compressed to a fast multi-stage table. The whole catalog of properties: no, as it packs enormous heaps of unused junk (Unicode_Age anyone??). This junk is not fit for the std library, but the goal is to provide a tool for the user to work with sets/data beyond the commonly useful ones in std.

> where category is the Unicode designation, say "Nl", "Nd", etc.? That way, it's more future-proof in case the Unicode guys add more categories.

I'm posting my work on std.uni as ready for review today or tomorrow. It includes a type for a set of codepoints and a ton of predefined sets for Nl, Nd and almost everything sensible (blocks, scripts, properties). The user can then conjure whatever combination is required. And it is still way smaller than having the full 'query the database' thing.

To check the full madness of all of the properties just use the web interface of unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are, after all, also part of the Unicode standard (and character database).

-- 
Dmitry Olshansky
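
The "predefined sets plus user-side combination" approach can be sketched with the std.uni that later shipped in Phobos. The variable names are illustrative, and the snippet assumes a current compiler and standard library:

```d
import std.uni : unicode;

void main()
{
    // Predefined sets, looked up by name at compile time.
    auto decimalDigits = unicode.Nd;
    auto letterNumbers = unicode.Nl;

    // The user builds the combination needed: here, a union of two categories.
    auto digitsOrLetterNumbers = decimalDigits | letterNumbers;
    assert(digitsOrLetterNumbers['3']);
    assert(digitsOrLetterNumbers['\u2167']);  // ROMAN NUMERAL EIGHT (Nl)
    assert(!digitsOrLetterNumbers['\u00BD']); // VULGAR FRACTION ONE HALF (No)

    // Scripts combine the same way as categories.
    auto cyrillicOrDigits = unicode.Cyrillic | decimalDigits;
    assert(cyrillicOrDigits['\u0416']);       // CYRILLIC CAPITAL LETTER ZHE
    assert(cyrillicOrDigits['9']);
}
```

Only the sets a program actually names end up being used, which is the size argument made above against a single query-everything function.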
Jan 10 2013
On Thursday, 10 January 2013 at 18:09:31 UTC, Dmitry Olshansky wrote:
> 10-Jan-2013 03:21, H. S. Teoh wrote:
>> On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
>>> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
>> [...]
>>>> [...] I, for one, would love to know why isNumeric != hasNumericValue.
>>>
>>> I guess it's just bad wording from the standard.
>>>
>>> The standard defined 3 groups that make up Number:
>>> [Nd] Number, Decimal Digit
>>> [Nl] Number, Letter
>>> [No] Number, Other
>>>
>>> However, there are a couple of characters that *are* numbers, but aren't in those groups.
>>>
>>> The "good" news is that the standard *does* define number_types to classify the kind of number a char is:
>>> * None: Not a number
>>> * Digit: Obvious
>>> * Decimal: Any decimal number that is NOT a digit
>>> * Numeric: Everything else.
>>>
>>> So they used "Numeric" as wild, and "Number" as their general category.
>>>
>>> This leaves us with ambiguity when choosing our word: technically '5' does not classify as "numeric", although you could consider it "has a numeric value".
>>>
>>> I hope that makes sense.
>>
>> Hmph. I guess we need to differentiate between the Unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.
>
> isNumber - _Number_ General category (as defined by Unicode 1:1)
> isNumeric - as having NumericType != None (again going by the definition of Unicode properties)
>
> And that's all, correct and to the letter.

Are you sure about that? The four values of Numeric_Type are:
* Decimal
* Digit
* None
* Numeric <= !!!
http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Type#Numeric_Type

Hopefully, we'll have "isDecimal", "isDigit", and eventually "isNumeric", which according to the definition would simply be "Numeric_Type == Numeric_Type.Numeric".

The problem is that by the definitions of Unicode properties, there is no name for "not in Numeric_Type.None".

"hasNumericValue" is the best name I could come up with to differentiate between "Not Numeric_Type.None" and "Numeric_Type.Numeric".
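
The naming distinction can be made concrete with a small sketch. Everything here is hypothetical: the NumericType enum, the hand-written numericType lookup (which covers only a handful of code points) and both predicates are illustrations, not Phobos API.

```d
/// Hypothetical mirror of the four Numeric_Type values.
enum NumericType { none, decimal, digit, numeric }

/// Hand-written stand-in for a real table lookup; it knows only a few
/// code points, for illustration.
NumericType numericType(dchar c)
{
    switch (c)
    {
        case '0': .. case '9': return NumericType.decimal;
        case '\u00B2':         return NumericType.digit;   // SUPERSCRIPT TWO
        case '\u00BD':         return NumericType.numeric; // VULGAR FRACTION ONE HALF
        default:               return NumericType.none;
    }
}

/// Narrow reading: Numeric_Type == Numeric.
bool isNumeric(dchar c)       { return numericType(c) == NumericType.numeric; }

/// Broad reading: Numeric_Type != None.
bool hasNumericValue(dchar c) { return numericType(c) != NumericType.none; }

void main()
{
    assert(hasNumericValue('5') && !isNumeric('5'));
    assert(hasNumericValue('\u00B2') && !isNumeric('\u00B2'));
    assert(hasNumericValue('\u00BD') && isNumeric('\u00BD'));
    assert(!hasNumericValue('A'));
}
```

With both predicates side by side, '5' has a numeric value without being "Numeric" in the property's sense, which is why a separate name is needed.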
Jan 10 2013
On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
> [SNIP]

Thank you for all your feedback. *Everything* makes sense now. However, the conclusion I'm coming to is that some groundwork is needed before doing numeric value, which I am currently doing.
Jan 07 2013
Dmitry Olshansky:

> Yup, and it's 2 lines then. And if one really wants to chain it:
>
> map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);
>
> Hardly makes it Phobos candidate then ;)

I think you meant to write:

map(a => enforce(std.ascii.isNumeric(a)), a - '0')(...);

To avoid some bugs I try to not use the comma expression like that.

Compare that code with:

map!numericValue(...);

Bye,
bearophile
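
A compilable version of the comparison, as a sketch only: numericValue below is a hypothetical ASCII-only helper that returns -1 for non-digits (the behavior discussed elsewhere in this thread), and it uses std.ascii.isDigit, the digit test std.ascii is said to provide.

```d
import std.algorithm : equal, map;
import std.ascii : isDigit;

/// Hypothetical ASCII-only helper: the value of a decimal digit,
/// or -1 if `c` is not an ASCII digit.
int numericValue(dchar c)
{
    return isDigit(c) ? cast(int)(c - '0') : -1;
}

void main()
{
    // The named function chains cleanly, with no comma expression.
    assert("2013".map!numericValue.equal([2, 0, 1, 3]));

    // Non-digits are reported as -1 instead of throwing.
    assert("4x".map!numericValue.equal([4, -1]));
}
```

The intent reads directly from the call site, which is the point of the comparison.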
Jan 02 2013