digitalmars.D.learn - Finding chars in strings

Per Nordlöw <per.nordlow gmail.com> writes:

If a character literal has type char, always below 128, can we 
always search for its first byte offset in a string without 
decoding the string to a range of dchars?

Sep 05 2017

Per Nordlöw <per.nordlow gmail.com> writes:

On Tuesday, 5 September 2017 at 15:43:02 UTC, Per Nordlöw wrote:
 If a character literal has type char, always below 128, can we 
 always search for its first byte offset in a string without 
 decoding the string to a range of dchars?

Follow up question: If a character literal has type char, can we 
always assume it's an ASCII character?

Sep 05 2017

ag0aep6g <anonymous example.com> writes:

On 09/05/2017 05:54 PM, Per Nordlöw wrote:
 Follow up question: If a character literal has type char, can we always 
 assume it's an ASCII character?

Strictly speaking, this is a character literal of type char: '\xC3'. 
It's clearly above 0x7F, and not an ASCII character. So, no.

But if it's an actual character, not an escape sequence, then yes (I 
think). A wrong encoding setting in your text editor could mess with 
that, though.
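
A minimal sketch of the point above, assuming a reasonably recent D compiler. It only checks the static type of two literals and compares their values against 0x7F; the choice of 'a' as the plain literal is illustrative.

```d
void main()
{
    // A plain ASCII literal and a \x escape both have type char.
    static assert(is(typeof('a') == char));
    static assert(is(typeof('\xC3') == char));

    // Only the first one is in the ASCII range (<= 0x7F).
    static assert('a' <= 0x7F);
    static assert('\xC3' > 0x7F);
}
```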

Sep 05 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Tuesday, September 05, 2017 18:04:16 ag0aep6g via Digitalmars-d-learn 
wrote:
 On 09/05/2017 05:54 PM, Per Nordlöw wrote:
 Follow up question: If a character literal has type char, can we always
 assume it's an ASCII character?

 Strictly speaking, this is a character literal of type char: '\xC3'.
 It's clearly above 0x7F, and not an ASCII character. So, no.

 But if it's an actual character, not an escape sequence, then yes (I
 think). A wrong encoding setting in your text editor could mess with
 that, though.

Aside from escape sequences, a literal should not result in a non-ASCII
value for a char, but in general, it's a bad idea to assume that a char is
an ASCII character unless you've verified that already or somehow know based
on where the input came from that the char or chars that you're dealing with
are all ASCII. And you have to remember that VRP is in play as well, so if
it gets involved, you could end up with a char that's not an ASCII
character. And IIRC, character literals are almost always treated as dchar
unless a cast or VRP gets involved. So, I wouldn't be in a hurry to assume
that using character literals would guarantee that you're dealing with only
ASCII. Ultimately, std.ascii.isASCII is your friend if there's any risk of
something not being ASCII when you need it to be ASCII.

- Jonathan M Davis
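
As a small illustration of the VRP remark and of std.ascii.isASCII, here is a hedged sketch. The variable names are made up for the example, and the integer initializer is just one way a non-ASCII char can arise without a character literal being involved.

```d
import std.ascii : isASCII;

void main()
{
    // Accepted without a cast: value range propagation sees that
    // 0xC3 fits in a char, even though it is not an ASCII value.
    char suspicious = 0xC3;
    char plain = 'a';

    // When in doubt, verify instead of assuming.
    assert(!isASCII(suspicious));
    assert(isASCII(plain));
}
```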

Sep 05 2017

ag0aep6g <anonymous example.com> writes:

On 09/05/2017 05:43 PM, Per Nordlöw wrote:
 If a character literal has type char, always below 128, can we always 
 search for its first byte offset in a string without decoding the 
 string to a range of dchars?

Yes. You can search for ASCII characters (< 128) without decoding. The 
values in multibyte sequences are always above 127.
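
The claim about multibyte sequences can be checked with a short sketch. The sample string is an assumption chosen for the example: it holds one 2-byte, one 3-byte and one 4-byte UTF-8 sequence, written with escapes so the result does not depend on the editor's encoding.

```d
import std.algorithm.searching : all;
import std.string : representation;

void main()
{
    // U+00F6, U+20AC and U+1F600 encode to 2, 3 and 4 code units.
    string s = "\u00F6\u20AC\U0001F600";
    assert(s.length == 9);

    // Every code unit of a multibyte sequence has its high bit set,
    // so none of them can be mistaken for an ASCII character.
    assert(s.representation.all!(b => b >= 0x80));
}
```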

Sep 05 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Tuesday, September 05, 2017 17:55:20 ag0aep6g via Digitalmars-d-learn 
wrote:
 On 09/05/2017 05:43 PM, Per Nordlöw wrote:
 If a character literal has type char, always below 128, can we always
 search for its first byte offset in a string without decoding the
 string to a range of dchars?

 Yes. You can search for ASCII characters (< 128) without decoding. The
 values in multibyte sequences are always above 127.

Unfortunately, you'll have to use something like std.utf.byCodeUnit or
std.string.representation to do it; otherwise, you get hit with the
autodecoding. But yeah, UTF-8 is designed to be compatible with ASCII, so
all ASCII characters are valid UTF-8 code units and don't require decoding.
The decoding is just required if you're dealing with non-ASCII characters,
which is another reason why the autodecoding is annoying.

- Jonathan M Davis
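
A sketch of what such a search could look like, using the two Phobos helpers named above together with std.algorithm.searching.countUntil. The sample string and variable names are assumptions made for the example.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    // "h\u00F6he:42": the second character takes two UTF-8 code units.
    string s = "h\u00F6he:42";

    // byCodeUnit yields a range of char, so nothing is decoded.
    auto viaCodeUnits = s.byCodeUnit.countUntil(':');

    // representation views the same data as immutable(ubyte)[].
    auto viaBytes = s.representation.countUntil(cast(ubyte) ':');

    // Both report the byte offset, not the character index (which is 4).
    assert(viaCodeUnits == 5);
    assert(viaBytes == 5);
    assert(s[viaCodeUnits] == ':');
}
```

Both results are code-unit offsets, so they can be used directly to index or slice the original string.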

Sep 05 2017
