digitalmars.D.learn - Should I Use std.ascii.isWhite or std.uni.isWhite?

"Meta" <jared771 gmail.com> writes:

I'm confused about which isWhite function I should use. Aren't 
all chars in D (char, wchar, dchar) unicode characters? Should I 
always use std.uni.isWhite, unless I'm working with bytes and 
byte arrays? The documentation doesn't give me much to go on, 
beside "All of the functions in std.ascii accept unicode 
characters but effectively ignore them. All isX functions return 
false for unicode characters, and all toX functions do nothing to 
unicode characters."

Jul 25 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, July 26, 2013 06:09:39 Meta wrote:
 I'm confused about which isWhite function I should use. Aren't
 all chars in D (char, wchar, dchar) unicode characters? Should I
 always use std.uni.isWhite, unless I'm working with bytes and
 byte arrays? The documentation doesn't give me much to go on,
 beside "All of the functions in std.ascii accept unicode
 characters but effectively ignore them. All isX functions return
 false for unicode characters, and all toX functions do nothing to
 unicode characters."

Unicode contains ASCII, but very few Unicode characters are ASCII, because 
there just aren't very many ASCII characters and there are a _ton_ of Unicode 
characters. The std.ascii functions return true for certain sets of ASCII 
characters and false for everything else. The std.uni functions return true 
for many Unicode characters as well. You wouldn't normally use std.ascii if 
you're operating on non-ASCII Unicode characters, but it ignores them if it 
does run into them.

std.ascii.isWhite only cares about ASCII whitespace, which the documentation 
explicitly lists as the space, tab, vertical tab, form feed, carriage return, 
and linefeed characters. Those characters will return true. All other 
characters will return false.

std.uni.isWhite returns true for all of the characters that std.ascii.isWhite 
does plus a whole bunch of other non-ASCII characters that the Unicode 
standard considers to be whitespace.
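
The difference described above can be checked with a small program. This is only a sketch: the sample characters are chosen for illustration, and the names isAsciiWhite and isUniWhite are local import aliases, not Phobos names.

```d
import std.ascii : isAsciiWhite = isWhite; // alias chosen for this example
import std.uni : isUniWhite = isWhite;     // alias chosen for this example
import std.stdio : writefln;

void main()
{
    // Space and tab are ASCII whitespace. U+00A0 (no-break space) and
    // U+2028 (line separator) are whitespace only by the Unicode
    // definition. 'a' is not whitespace under either definition.
    dchar[] samples = [' ', '\t', '\u00A0', '\u2028', 'a'];

    foreach (c; samples)
    {
        writefln("U+%04X  std.ascii.isWhite: %s  std.uni.isWhite: %s",
                 cast(uint) c, isAsciiWhite(c), isUniWhite(c));
    }
}
```

Both functions agree on the ASCII characters; they differ only on the two non-ASCII ones.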

Which function you use depends on what you're trying to do.

- Jonathan M Davis

Jul 25 2013

"Meta" <jared771 gmail.com> writes:

On Friday, 26 July 2013 at 05:06:45 UTC, Jonathan M Davis wrote:
 Unicode contains ASCII, but very few Unicode characters are ASCII, because
 there just aren't very many ASCII characters and there are a _ton_ of Unicode
 characters. The std.ascii functions return true for certain sets of ASCII
 characters and false for everything else. The std.uni functions return true
 for many Unicode characters as well. You wouldn't normally use std.ascii if
 you're operating on non-ASCII Unicode characters, but it ignores them if it
 does run into them.

 std.ascii.isWhite only cares about ASCII whitespace, which the documentation
 explicitly lists as the space, tab, vertical tab, form feed, carriage return,
 and linefeed characters. Those characters will return true. All other
 characters will return false.

 std.uni.isWhite returns true for all of the characters that std.ascii.isWhite
 does plus a whole bunch of other non-ASCII characters that the Unicode
 standard considers to be whitespace.

 Which function you use depends on what you're trying to do.

 - Jonathan M Davis

That makes sense. I know that the first 128 unicode characters 
are equivalent to the 7-bit ASCII charset, but it confused me 
that the module is named std.ascii when it actually operates on 
unicode characters, I guess.

Another question, I'm not all that familiar with unicode, so what 
is the difference between std.uni.isNumber and 
std.ascii.isNumber? Am I right in thinking that std.uni.isNumber 
will match things outside of the basic 0..9?

Jul 25 2013

"anonymous" <anonymous example.com> writes:

On Friday, 26 July 2013 at 05:54:50 UTC, Meta wrote:
 Another question, I'm not all that familiar with unicode, so 
 what is the difference between std.uni.isNumber and 
 std.ascii.isNumber? Am I right in thinking that 
 std.uni.isNumber will match things outside of the basic 0..9?


 general Unicode category: Nd, Nl, No

Via 
<http://www.google.com/search?q=general+Unicode+category:+Nd,+Nl,+No> 
to 
<http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#General_Category>:
 Number (N)
     Decimal digit (Nd)
     Letter (Nl) — Numerals composed of letters or letterlike 
 symbols (e.g., Roman numerals)
     Other (No) — Includes vulgar fractions and superscript and 
 subscript digits.
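
A quick way to see the three categories in action is sketched below. The sample characters are picked for illustration only. Note that, as far as I can tell from the Phobos documentation, the ASCII-side counterpart is named std.ascii.isDigit; std.ascii does not appear to have an isNumber.

```d
import std.ascii : isDigit;
import std.uni : isNumber;
import std.stdio : writefln;

void main()
{
    // '7'    : ASCII digit (Nd)
    // U+0663 : ARABIC-INDIC DIGIT THREE (Nd)
    // U+2167 : ROMAN NUMERAL EIGHT (Nl)
    // U+00BD : VULGAR FRACTION ONE HALF (No)
    // 'x'    : a letter, not a number in any category
    dchar[] samples = ['7', '\u0663', '\u2167', '\u00BD', 'x'];

    foreach (c; samples)
    {
        writefln("U+%04X  std.ascii.isDigit: %s  std.uni.isNumber: %s",
                 cast(uint) c, isDigit(c), isNumber(c));
    }
}
```

Only '7' satisfies std.ascii.isDigit, while std.uni.isNumber accepts every sample except 'x'.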

Jul 25 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, July 26, 2013 07:54:42 Meta wrote:
 Am I right in thinking that std.uni.isNumber
 will match things outside of the basic 0..9?

Yes. Expect all of the isX functions in std.uni to return true for characters 
outside of ASCII.

- Jonathan M Davis

Jul 26 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26-Jul-2013 09:54, Meta wrote:
 On Friday, 26 July 2013 at 05:06:45 UTC, Jonathan M Davis wrote:

[snip]
 That makes sense. I know that the first 128 unicode characters are
 equivalent to the 7-bit ASCII charset, but it confused me that the
 module is named std.ascii when it actually operates on unicode
 characters, I guess.

 Another question, I'm not all that familiar with unicode, so what is the
 difference between std.uni.isNumber and std.ascii.isNumber? Am I right
 in thinking that std.uni.isNumber will match things outside of the basic
 0..9?

You are spot on. In case you want to further dig into Unicode characters 
and properties, there is this nice tool:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AN%3A%5D&g=
(e.g. this link shows all of 'N' = Number characters)

-- 
Dmitry Olshansky

Jul 26 2013

"Meta" <jared771 gmail.com> writes:

On Friday, 26 July 2013 at 17:58:21 UTC, Dmitry Olshansky wrote:
 You are spot on. In case you want to further dig into Unicode 
 characters and properties, there is this nice tool:
 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AN%3A%5D&g=
 (e.g. this link shows all of 'N' = Number characters)

That is indeed a helpful link. Thanks.

Jul 26 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 Which function you use depends on what you're trying to do.

Right. I have just added this:
http://d.puremagic.com/issues/show_bug.cgi?id=10717

Bye,
bearophile

Jul 26 2013

"anonymous" <anonymous example.com> writes:

On Friday, 26 July 2013 at 04:09:46 UTC, Meta wrote:
 I'm confused about which isWhite function I should use. Aren't 
 all chars in D (char, wchar, dchar) unicode characters?

They are.

 Should I always use std.uni.isWhite, unless I'm working with 
 bytes and byte arrays?

No, char vs byte isn't necessarily a thing here.

 The documentation doesn't give me much to go on, beside "All of 
 the functions in std.ascii accept unicode characters but 
 effectively ignore them. All isX functions return false for 
 unicode characters, and all toX functions do nothing to unicode 
 characters."

You should use std.uni.isWhite unless you want to match only 
ASCII white space.

That could be the case when ...
* You have data that is not in Unicode, but some other superset 
of ASCII. Then you shouldn't use std.uni.isWhite, of course. 
std.ascii.isWhite might be fine. In this case, you'd actually use 
u{byte,short,int} instead of {,w,d}char.
* You're dealing with a grammar where ASCII white space is a 
thing, while Unicode white space is not.
* There's really only ASCII white space in your data, and you 
want every bit of speed, and you've verified that 
std.ascii.isWhite is indeed faster than std.uni.isWhite.
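
The first case in the list above can be sketched as follows. The encoding (Windows-1252) and the helper name countAsciiWhite are assumptions made for this example, not something from the thread. The bytes are kept as ubyte, and each one is cast to dchar only to call std.ascii.isWhite, which never matches values above 0x7F.

```d
import std.ascii : isAsciiWhite = isWhite; // alias chosen for this example
import std.uni : isUniWhite = isWhite;     // alias chosen for this example
import std.stdio : writeln;

// Hypothetical helper: counts ASCII whitespace bytes in data whose
// encoding is an ASCII superset other than Unicode (Windows-1252 is
// assumed here).
size_t countAsciiWhite(const(ubyte)[] data)
{
    size_t n = 0;
    foreach (b; data)
    {
        if (isAsciiWhite(cast(dchar) b))
            ++n;
    }
    return n;
}

void main()
{
    // "a b" followed by byte 0x85, which is an ellipsis in Windows-1252.
    immutable ubyte[] data = [0x61, 0x20, 0x62, 0x85];
    writeln(countAsciiWhite(data));

    // Treating that byte as a code point yields U+0085 (NEL), which
    // std.uni.isWhite classifies as whitespace. For this data that
    // would be a misclassification.
    writeln(isUniWhite(cast(dchar) 0x85));
}
```

The speed claim in the last bullet is not tested here; it would need a benchmark on the actual data.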

Jul 25 2013

"Meta" <jared771 gmail.com> writes:

On Friday, 26 July 2013 at 05:26:33 UTC, anonymous wrote:
 Should I always use std.uni.isWhite, unless I'm working with 
 bytes and byte arrays?

 No, char vs byte isn't necessarily a thing here.

I realized after I posted this that I was being stupid in even 
suggesting that, seeing as all the functions in std.ascii take 
dchars.

 The documentation doesn't give me much to go on, beside "All 
 of the functions in std.ascii accept unicode characters but 
 effectively ignore them. All isX functions return false for 
 unicode characters, and all toX functions do nothing to 
 unicode characters."

 You should use std.uni.isWhite unless you want to match only 
 ASCII white space.

 That could be the case when ...
 * You have data that is not in Unicode, but some other superset 
 of ASCII. Then you shouldn't use std.uni.isWhite, of course. 
 std.ascii.isWhite might be fine. In this case, you'd actually 
 use u{byte,short,int} instead of {,w,d}char.
 * You're dealing with a grammar where ASCII white space is a 
 thing, while Unicode white space is not.
 * There's really only ASCII white space in your data, and you 
 want every bit of speed, and you've verified that 
 std.ascii.isWhite is indeed faster than std.uni.isWhite.

Thank you for the informative answer.

Jul 25 2013
