digitalmars.D.learn - How to check i
- Uranuz (50/50) Oct 16 2014 I have some string *str* of unicode characters. The question is
- spir via Digitalmars-d-learn (10/13) Oct 16 2014 You cannot do that without decoding. Checking whether utf-x is valid and ...
- Ali Çehreli (46/49) Oct 16 2014 spir is back! :)
- Uranuz (1/1) Oct 17 2014 This is
- Uranuz (6/6) Oct 17 2014 I haven't touched any key on a keyboard and haven't pressed
- eles (2/4) Oct 17 2014 Scan for rootkits...
- ketmar via Digitalmars-d-learn (3/8) Oct 17 2014 or touchpad. ;-)
I have some string *str* of Unicode characters. The question is how to check whether a valid Unicode code point starts at code unit *index*.

I need this because I am trying to write a parser that operates on a string by *code unit*. More precisely, I am trying to write a function *matchWord* that should extract whole words (which may consist of more than just English letters) from text. The extracted word is then compared with the word passed as a parameter. I don't want to decode unless it is necessary. But it looks like I can't do it without decoding, because I need to know whether the current character is a letter of an alphabet and not, for example, punctuation or whitespace.

Here is how I think it could look. In the real code I have a template algorithm that operates on different types of strings: string, wstring, dstring.

    struct Lexer
    {
        string str;
        size_t index;

        bool matchWord(string word)
        {
            size_t i = index;
            while( !str[i..$].empty )
            {
                if( !str.isValidChar(i) )
                {
                    i++;
                    continue;
                }

                uint len = str.graphemeStride(i);

                if( !isAlpha(str[i..i+len]) )
                {
                    break;
                }
                i++;
            }

            return word == str[index..i];
        }
    }

It is just a draft of the idea. Maybe it is too complicated. What I want as a result is a logical flag (matched or not), and the position should be set after the word if it matched. And it should match whole words, of course.

How do I implement this correctly, without overhead and without additional UTF decoding if possible? And how could I validate a single character of the string starting at a given code unit index? Also, I don't like that graphemeStride can throw an Exception if I point to the wrong position. Is there a nothrow version? I don't want extra allocations for exceptions.
Oct 16 2014
On 16/10/14 20:46, Uranuz via Digitalmars-d-learn wrote:
> I have some string *str* of unicode characters. The question is how to check if I have valid unicode code point starting at code unit *index*? [...]

You cannot do that without decoding. Checking whether utf-x is valid and decoding are the very same process. IIRC, D has a validation func which is more or less just an alias for the decoding func ;-).

Moreover, you also need to distinguish "word-character" code points from others (punctuation, spacing, etc.), which requires Unicode code points (the Unicode Consortium provides tables for such tasks).

Thus, I would recommend that you just abandon the illusion of working at the level of code units for such tasks, and simply operate on strings of code points. (Why do you think D has them built in?)

denis
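A minimal sketch of the decode-based approach described above, for the UTF-8 `string` case only. The names `wordEnd` and the free-function form of `matchWord` are hypothetical and not from the thread; the sketch assumes `std.utf.decode` (which advances the index past one code point and throws `UTFException` on an invalid sequence) and `std.uni.isAlpha`. It treats an invalid sequence as a word boundary, which is only one possible policy, and it still relies on the throwing `decode`, so it does not address the nothrow concern.

```d
import std.uni : isAlpha;
import std.utf : decode, UTFException;

/// Hypothetical helper: returns the code-unit index just past the run of
/// alphabetic code points that starts at `start`.
size_t wordEnd(string str, size_t start)
{
    size_t i = start;
    while (i < str.length)
    {
        size_t next = i;
        dchar c;
        try
        {
            c = decode(str, next);  // advances `next` past one code point
        }
        catch (UTFException e)
        {
            break;                  // assumption: invalid UTF ends the word
        }
        if (!isAlpha(c))
        {
            break;
        }
        i = next;
    }
    return i;
}

/// Returns true only on a whole-word match; on success `index` is moved
/// to the code unit after the word, otherwise it is left unchanged.
bool matchWord(string str, ref size_t index, string word)
{
    immutable end = wordEnd(str, index);
    if (end == index || str[index .. end] != word)
    {
        return false;
    }
    index = end;
    return true;
}

unittest
{
    size_t i = 0;
    assert(matchWord("çde fgh", i, "çde"));
    assert(i == 4);  // 'ç' occupies two UTF-8 code units
}
```

Each code point is decoded exactly once, and the final comparison is a plain code-unit comparison of slices, so no second decoding pass is needed.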
Oct 16 2014
On 10/16/2014 12:43 PM, spir via Digitalmars-d-learn wrote:
> denis

spir is back! :)

On 10/16/2014 11:46 AM, Uranuz wrote:
> I have some string *str* of unicode characters. The question is how to check if I have valid unicode code point starting at code unit *index*?

It is easy if I understand the question as skipping over invalid UTF-8 sequences:

    import std.stdio;

    ubyte upperTwoBits(ubyte b)
    {
        return b & 0b1100_0000;
    }

    bool isUtf8ContinuationByte(char c)
    {
        enum utf8ContinuationPrefix = 0b1000_0000;
        return upperTwoBits(c) == utf8ContinuationPrefix;
    }

    void moveToValid(ref inout(char)[] s)
    {
        /* Skip over UTF-8 continuation bytes. */
        while (s.length && isUtf8ContinuationByte(s[0])) {
            s = s[1..$];
        }

        /*
         * The wchar[] overload is too complicated for Ali at this time. :)
         *
         * Please see the following function template in phobos/std/utf.d:
         *
         *     private dchar decodeImpl(bool canIndex, S)(...)
         *         if (is(S : const wchar[]) ...
         */
    }

    unittest
    {
        auto s = "çde";
        moveToValid(s);
        assert(s == "çde");

        s = s[1 .. $];
        moveToValid(s);
        assert(s == "de", s);
    }

    void moveToValid(ref const(dchar)[] s)
    {
        /* Every code unit is valid; nothing to do. */
    }

    void main()
    {}

Ali
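For comparison, a rough sketch of what a UTF-16 counterpart of the continuation-byte skip might look like. This is not from the thread: the helper name `isUtf16LowSurrogate` and the `wchar` overload are hypothetical. It assumes only the UTF-16 encoding rule that the second code unit of a surrogate pair lies in the range 0xDC00 to 0xDFFF. Like the UTF-8 version, it only resynchronizes to a plausible code point start and does not validate the sequence that follows.

```d
/// True for the trailing code unit of a surrogate pair (0xDC00 .. 0xDFFF).
bool isUtf16LowSurrogate(wchar c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/// Hypothetical wchar overload: skip code units that cannot start a code point.
void moveToValid(ref inout(wchar)[] s)
{
    while (s.length && isUtf16LowSurrogate(s[0]))
    {
        s = s[1 .. $];
    }
}

unittest
{
    wstring s = "\U0001F600de"w;  // U+1F600 takes two UTF-16 code units
    s = s[1 .. $];                // now starts in the middle of the pair
    moveToValid(s);
    assert(s == "de"w);
}
```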
Oct 16 2014
I haven't touched any key on the keyboard and haven't pressed *Send*, but the message was posted somehow.

Thanks. Checking for UTF-8 continuation bytes is a good idea. I also agree that UTF-16 is more difficult. I will leave it for a future release, once the implementation works properly on UTF-8 and UTF-32.
Oct 17 2014
On Friday, 17 October 2014 at 16:39:38 UTC, Uranuz wrote:
> I haven't touched any key on a keyboard and haven't pressed *Send* but message was posted somehow.

Scan for rootkits...
Oct 17 2014
On Fri, 17 Oct 2014 19:13:51 +0000 eles via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:
> On Friday, 17 October 2014 at 16:39:38 UTC, Uranuz wrote:
>> I haven't touched any key on a keyboard and haven't pressed *Send* but message was posted somehow.
>
> Scan for rootkits...

or touchpad. ;-)
Oct 17 2014