digitalmars.D.learn - How to check i

reply "Uranuz" <neuranuz gmail.com> writes:
I have some string *str* of Unicode characters. The question is 
how to check whether I have a valid Unicode code point starting at 
code unit *index*?

I need it because I am trying to write a parser that operates on a 
string by *code unit*. More precisely, I am trying to write a 
function *matchWord* that should extract whole words (which may 
consist of more than just English letters) from text. The word is 
then compared with the word passed as a parameter. I do not want to 
decode unless it is necessary. But it looks like I can't do it 
without decoding, because I need to know whether the current 
character is a letter and not punctuation or whitespace, for example.

Here is how I think it could look. In the real code I have a 
template algorithm that operates on different types of strings: 
string, wstring, dstring.

struct Lexer
{
	string str;
	size_t index;

	bool matchWord(string word)
	{
		size_t i = index;
		while( !str[i..$].empty )
		{
			if( !str.isValidChar(i) )
			{
				i++;
				continue;
			}
			
			uint len = str.graphemeStride(i);

			if( !isAlpha(str[i..i+len]) )
			{
				break;
			}
			i++;
		}
		
		return word == str[index..i];
	}
}

It is just a draft of the idea. Maybe it is too complicated. What I 
want to get as a result is a logical flag (matched or not), and the 
position should be set after the word if it is matched. And it 
should match whole words, of course.

How do I implement it correctly without overhead and additional 
UTF decodings if possible?

Also, how could I validate a single character of a string starting 
at code unit index? I also don't like that graphemeStride can throw 
an Exception if I point to the wrong position. Is there a nothrow 
version? I don't want extra allocations for exceptions.
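
On the nothrow question, one option is to check the UTF-8 structure by hand instead of calling a throwing function. The following is only a minimal sketch, assuming UTF-8 input (string, not wstring or dstring). `utf8StrideAt` is a hypothetical helper name, not a Phobos function, and it checks structure only: overlong encodings and encoded surrogates are not rejected.

```d
/// Returns the length in code units (1 .. 4) of the UTF-8 sequence starting
/// at s[i], or 0 if s[i] does not start a structurally well-formed sequence.
/// Hypothetical helper; structural check only (no overlong/surrogate checks).
uint utf8StrideAt(const(char)[] s, size_t i) pure nothrow @nogc @safe
{
    if (i >= s.length)
        return 0;

    immutable uint lead = s[i];
    uint len;
    if (lead < 0x80)                len = 1; // 0xxxxxxx: ASCII
    else if ((lead & 0xE0) == 0xC0) len = 2; // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) len = 3; // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) len = 4; // 11110xxx
    else
        return 0; // continuation byte (10xxxxxx) or invalid lead byte

    if (i + len > s.length)
        return 0; // sequence is cut off by the end of the string

    foreach (k; 1 .. len)
        if ((s[i + k] & 0xC0) != 0x80)
            return 0; // expected a continuation byte here

    return len;
}

void main()
{
    string s = "\u00E7de";           // 'ç' takes two code units in UTF-8
    assert(utf8StrideAt(s, 0) == 2); // start of 'ç'
    assert(utf8StrideAt(s, 1) == 0); // middle of 'ç': not a valid start
    assert(utf8StrideAt(s, 2) == 1); // 'd'
}
```

Returning 0 rather than throwing keeps the function `nothrow @nogc`, so the caller can decide whether to skip the unit or stop.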
Oct 16 2014
parent reply spir via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On 16/10/14 20:46, Uranuz via Digitalmars-d-learn wrote:
 I have some string *str* of unicode characters. The question is how to check if
 I have valid unicode code point starting at code unit *index*?
 [...]
You cannot do that without decoding. Checking whether UTF-x is valid and decoding are the very same process. IIRC, D has a validation func which is more or less just an alias for the decoding func ;-).

Moreover, you also need to distinguish "word-character" code points from others (punctuation, spacing, etc.), which requires Unicode code points (the Unicode Consortium provides tables for such tasks).

Thus, I would recommend that you just abandon the illusion of working at the level of code units for such tasks, and simply operate on strings of code points. (Why do you think D has them built in?)

denis
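
To illustrate the decode-then-classify approach, here is a rough sketch of *matchWord* for the UTF-8 case only. It assumes `std.utf.decode` for decoding and `std.uni.isAlpha` for classification. It is not the original poster's final code, and it classifies single code points, not graphemes. Note that `decode` throws a UTFException on invalid input.

```d
import std.uni : isAlpha;
import std.utf : decode;

struct Lexer
{
    string str;
    size_t index;

    /// Returns true and moves `index` past the word if the maximal run of
    /// alphabetic code points starting at `index` equals `word`.
    bool matchWord(string word)
    {
        size_t i = index;
        while (i < str.length)
        {
            size_t next = i;
            // decode advances `next` past one code point;
            // it throws UTFException on invalid UTF-8.
            immutable dchar c = decode(str, next);
            if (!isAlpha(c))
                break;
            i = next;
        }

        if (str[index .. i] != word)
            return false;

        index = i; // position is set after the matched word
        return true;
    }
}

void main()
{
    auto lex = Lexer("привет, мир", 0);
    assert(!lex.matchWord("при"));    // only a prefix of the whole word
    assert(lex.index == 0);           // position unchanged on failure
    assert(lex.matchWord("привет"));
    assert(lex.index == "привет".length);
}
```

The comparison itself is still done on code units (`str[index .. i] != word`); decoding is only needed to find where the word ends.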
Oct 16 2014
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 10/16/2014 12:43 PM, spir via Digitalmars-d-learn wrote:

 denis

spir is back! :)

On 10/16/2014 11:46 AM, Uranuz wrote:
 I have some string *str* of unicode characters. The question is how to
 check if I have valid unicode code point starting at code unit *index*?
It is easy if I understand the question as skipping over invalid UTF-8 sequences:

import std.stdio;

ubyte upperTwoBits(ubyte b)
{
    return b & 0b1100_0000;
}

bool isUtf8ContinuationByte(char c)
{
    enum utf8ContinuationPrefix = 0b1000_0000;
    return upperTwoBits(c) == utf8ContinuationPrefix;
}

void moveToValid(ref inout(char)[] s)
{
    /* Skip over UTF-8 continuation bytes. */
    while (s.length && isUtf8ContinuationByte(s[0])) {
        s = s[1..$];
    }

    /*
     * The wchar[] overload is too complicated for Ali at this time. :)
     *
     * Please see the following function template in phobos/std/utf.d:
     *
     * private dchar decodeImpl(bool canIndex, S)(...)
     *     if (is(S : const wchar[]) ...
     */
}

unittest
{
    auto s = "çde";
    moveToValid(s);
    assert(s == "çde");

    s = s[1 .. $];
    moveToValid(s);
    assert(s == "de", s);
}

void moveToValid(ref const(dchar)[] s)
{
    /* Every code unit is valid; nothing to do. */
}

void main()
{}

Ali
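
For the UTF-16 case that the code above leaves open, a comparable resynchronization step can be sketched as skipping low (trailing) surrogates, which occupy the range 0xDC00 to 0xDFFF. This is only an assumption-laden sketch: `isUtf16LowSurrogate` and `skipToCodePointStart` are hypothetical names, the function returns the adjusted slice instead of taking it by ref, and it does not verify that a high surrogate is actually followed by a low one.

```d
/// True if `c` is a UTF-16 low (trailing) surrogate, 0xDC00 .. 0xDFFF.
/// Hypothetical helper, not part of Phobos.
bool isUtf16LowSurrogate(wchar c) pure nothrow @nogc @safe
{
    return (c & 0xFC00) == 0xDC00;
}

/// Returns `s` with any leading low surrogates dropped, so that the result
/// does not begin in the middle of a surrogate pair.
inout(wchar)[] skipToCodePointStart(inout(wchar)[] s) pure nothrow @nogc @safe
{
    while (s.length && isUtf16LowSurrogate(s[0]))
        s = s[1 .. $];
    return s;
}

void main()
{
    // 'a', then U+1F600 as a surrogate pair (0xD83D 0xDE00), then 'b'
    wstring s = "a\U0001F600b"w;
    assert(s.length == 4);

    // Starting on the low surrogate: it is skipped.
    assert(skipToCodePointStart(s[2 .. $]) == "b"w);

    // Starting on the high surrogate: nothing is skipped.
    assert(skipToCodePointStart(s[1 .. $]).length == 3);
}
```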
Oct 16 2014
parent reply "Uranuz" <neuranuz gmail.com> writes:
This is
Oct 17 2014
parent reply "Uranuz" <neuranuz gmail.com> writes:
I haven't touched any key on a keyboard and haven't pressed 
*Send* but message was posted somehow.

Thanks. Checking for UTF-8 continuation bytes is a good idea. I 
also agree that UTF-16 is more difficult. I will leave it for a 
future release, once the implementation works properly on UTF-8 
and UTF-32.
Oct 17 2014
parent reply "eles" <eles215 gzk.dot> writes:
On Friday, 17 October 2014 at 16:39:38 UTC, Uranuz wrote:
 I haven't touched any key on a keyboard and haven't pressed 
 *Send* but message was posted somehow.
Scan for rootkits...
Oct 17 2014
parent ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Fri, 17 Oct 2014 19:13:51 +0000
eles via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 On Friday, 17 October 2014 at 16:39:38 UTC, Uranuz wrote:
 I haven't touched any key on a keyboard and haven't pressed 
 *Send* but message was posted somehow.
 Scan for rootkits...
or touchpad. ;-)
Oct 17 2014