digitalmars.D.learn - Finding chars in strings

Per Nordlöw <per.nordlow gmail.com> writes:
If a character literal has type char and its value is always below 128, 
can we always search for its first byte offset in a string without 
decoding the string to a range of dchars?
Sep 05 2017
Per Nordlöw <per.nordlow gmail.com> writes:
On Tuesday, 5 September 2017 at 15:43:02 UTC, Per Nordlöw wrote:
 If a character literal has type char and its value is always below 128, 
 can we always search for its first byte offset in a string without 
 decoding the string to a range of dchars?
Follow up question: If a character literal has type char, can we always assume it's an ASCII character?
Sep 05 2017
ag0aep6g <anonymous example.com> writes:
On 09/05/2017 05:54 PM, Per Nordlöw wrote:
 Follow up question: If a character literal has type char, can we always 
 assume it's an ASCII character?
Strictly speaking, this is a character literal of type char: '\xC3'. It's clearly above 0x7F, and not an ASCII character. So, no. But if it's an actual character, not an escape sequence, then yes (I think). A wrong encoding setting in your text editor could mess with that, though.
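The distinction above can be checked at compile time. This is only a minimal sketch, assuming the source file is saved as UTF-8; the helper name `fitsAscii` is made up for illustration and is not part of Phobos:

```d
// Hypothetical helper (not a library function): true if the code unit is ASCII.
bool fitsAscii(char c)
{
    return c < 0x80;
}

// An ordinary ASCII literal has type char and a value below 128.
static assert(is(typeof('a') == char));
static assert('a' < 0x80);

// A \x escape sequence also has type char, but its value can exceed 0x7F.
static assert(is(typeof('\xC3') == char));
static assert('\xC3' > 0x7F);

// A non-ASCII character typed directly is not given type char at all,
// because it does not fit into a single UTF-8 code unit.
static assert(!is(typeof('ö') == char));

void main()
{
    assert(fitsAscii('a'));
    assert(!fitsAscii('\xC3'));
}
```

So the type alone does not tell you the literal is ASCII; the escape-sequence case is the exception that breaks the assumption.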
Sep 05 2017
Jonathan M Davis via Digitalmars-d-learn writes:
On Tuesday, September 05, 2017 18:04:16 ag0aep6g via Digitalmars-d-learn 
wrote:
 On 09/05/2017 05:54 PM, Per Nordlöw wrote:
 Follow up question: If a character literal has type char, can we always
 assume it's an ASCII character?
 Strictly speaking, this is a character literal of type char: '\xC3'. It's clearly above 0x7F, and not an ASCII character. So, no. But if it's an actual character, not an escape sequence, then yes (I think). A wrong encoding setting in your text editor could mess with that, though.
Aside from escape sequences, a literal should not result in a non-ASCII value for a char. In general, though, it's a bad idea to assume that a char is an ASCII character unless you've already verified it, or you know from where the input came from that the chars you're dealing with are all ASCII. You also have to remember that value range propagation (VRP) is in play, so if it gets involved, you could end up with a char that's not an ASCII character. And IIRC, character literals are almost always treated as dchar unless a cast or VRP gets involved. So, I wouldn't be in a hurry to assume that using character literals guarantees that you're dealing with only ASCII. Ultimately, std.ascii.isASCII is your friend if there's any risk of something not being ASCII when you need it to be ASCII.

- Jonathan M Davis
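As a rough illustration of the advice above, here is a sketch that checks code units with std.ascii.isASCII instead of assuming they are ASCII. The helper name `isAllAscii` is hypothetical, and the VRP line is just one example of how a non-ASCII char can arise without any escape sequence:

```d
import std.algorithm.searching : all;
import std.ascii : isASCII;
import std.utf : byCodeUnit;

// Hypothetical helper: true if every UTF-8 code unit in `s` is ASCII.
// byCodeUnit avoids autodecoding, so each element is a plain char.
bool isAllAscii(const(char)[] s)
{
    return s.byCodeUnit.all!isASCII;
}

void main()
{
    // The constant 'a' + 100 is 197, which fits in a char, so value
    // range propagation accepts the implicit conversion from int.
    char viaVrp = 'a' + 100;
    assert(!isASCII(viaVrp));

    assert(isAllAscii("hello"));
    assert(!isAllAscii("h\u00E9llo")); // contains U+00E9, two code units
}
```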
Sep 05 2017
ag0aep6g <anonymous example.com> writes:
On 09/05/2017 05:43 PM, Per Nordlöw wrote:
 If a character literal has type char and its value is always below 128, 
 can we always search for its first byte offset in a string without 
 decoding the string to a range of dchars?
Yes. You can search for ASCII characters (< 128) without decoding. The values in multibyte sequences are always above 127.
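To make the reasoning concrete, the sketch below looks at the raw UTF-8 bytes of a few non-ASCII characters and confirms that none of them falls in the ASCII range, which is why an ASCII byte can never match in the middle of a multibyte sequence. The helper name `allUnitsNonAscii` is invented for this example:

```d
import std.algorithm.searching : all;
import std.string : representation;

// Hypothetical helper: true if every byte of `s` is above 127.
// representation() reinterprets the string as immutable(ubyte)[].
bool allUnitsNonAscii(string s)
{
    return s.representation.all!(b => b > 127);
}

void main()
{
    // U+00F6 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes):
    // lead bytes and continuation bytes are all 0x80 or higher.
    assert(allUnitsNonAscii("\u00F6"));
    assert(allUnitsNonAscii("\u20AC"));
    assert(allUnitsNonAscii("\U0001F600"));

    // As soon as an ASCII character is present, one byte is below 128.
    assert(!allUnitsNonAscii("a\u00F6"));
}
```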
Sep 05 2017
Jonathan M Davis via Digitalmars-d-learn writes:
On Tuesday, September 05, 2017 17:55:20 ag0aep6g via Digitalmars-d-learn 
wrote:
 On 09/05/2017 05:43 PM, Per Nordlöw wrote:
 If a character literal has type char and its value is always below 128, 
 can we always search for its first byte offset in a string without 
 decoding the string to a range of dchars?
 Yes. You can search for ASCII characters (< 128) without decoding. The values in multibyte sequences are always above 127.
Unfortunately, you'll have to use something like std.utf.byCodeUnit or std.string.representation to do it; otherwise, the range-based functions autodecode the string to dchars. But yeah, UTF-8 is designed to be compatible with ASCII, so all ASCII characters are valid UTF-8 code units and don't require decoding. Decoding is only required if you're dealing with non-ASCII characters, which is another reason why the autodecoding is annoying.

- Jonathan M Davis
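A minimal sketch of both approaches mentioned above, using std.algorithm.searching.countUntil as the search function (that choice is an assumption; the thread does not name one). The two wrapper function names are hypothetical. Both return the byte offset of the first match, or -1 if there is none:

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.utf : byCodeUnit;

// Hypothetical wrapper: search the raw bytes of the string.
ptrdiff_t byteOffsetViaRepresentation(string haystack, char needle)
{
    assert(needle < 0x80, "needle must be an ASCII character");
    return haystack.representation.countUntil(cast(ubyte) needle);
}

// Hypothetical wrapper: search a non-decoding range of code units.
ptrdiff_t byteOffsetViaCodeUnits(string haystack, char needle)
{
    assert(needle < 0x80, "needle must be an ASCII character");
    return haystack.byCodeUnit.countUntil(needle);
}

void main()
{
    // U+00FC and U+00DF each take two bytes in UTF-8, so 'w' sits at
    // byte offset 9 even though only 7 characters precede it.
    string s = "gr\u00FC\u00DFe, world";

    assert(byteOffsetViaRepresentation(s, 'w') == 9);
    assert(byteOffsetViaCodeUnits(s, 'w') == 9);
    assert(s[9] == 'w'); // the offset indexes the string directly

    assert(byteOffsetViaCodeUnits(s, 'z') == -1); // not found
}
```

Because the result is a code-unit offset, it can be used directly for indexing or slicing the original string.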
Sep 05 2017