digitalmars.D.learn - How to check i

"Uranuz" <neuranuz gmail.com> writes:

I have some string *str* of unicode characters. The question is 
how to check if I have valid unicode code point starting at code 
unit *index*?

I need this because I am trying to write a parser that operates on a string 
by *code unit*. More precisely, I am trying to write a function 
*matchWord* that should extract whole words (which may contain 
more than just English letters) from text. The extracted word is then compared 
with the word passed as a parameter. I don't want to decode unless it is 
necessary, but it looks like I can't avoid decoding, because I need to 
know whether the current character is a letter of the alphabet and not, 
for example, punctuation or whitespace.

Here is how I think it could look. In the real code I have a template 
algorithm that operates on different types of strings: string, 
wstring, dstring.

struct Lexer
{
	string str;
	size_t index;

	bool matchWord(string word)
	{
		size_t i = index;
		while( !str[i..$].empty )
		{
			if( !str.isValidChar(i) )
			{
				i++;
				continue;
			}
			
			uint len = str.graphemeStride(i);

			if( !isAlpha(str[i..i+len]) )
			{
				break;
			}
			i++;
		}
		
		return word == str[index..i];
	}
}

It is just a draft of the idea. Maybe it is overcomplicated. What I want 
to get as a result is a boolean flag (matched or not), and the position 
should be set after the word if it matched. And it should match 
whole words, of course.

How do I implement it correctly, without overhead and without additional 
UTF decoding if possible?

Also, how can I validate a single character of the string starting at a 
given code unit index? I also don't like that graphemeStride can throw 
an exception if I point to the wrong position. Is there a nothrow 
version? I don't want extra allocations for exceptions.

Oct 16 2014

spir via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On 16/10/14 20:46, Uranuz via Digitalmars-d-learn wrote:
 I have some string *str* of unicode characters. The question is how to check if
 I have valid unicode code point starting at code unit *index*?
 [...]

You cannot do that without decoding. Checking whether UTF-x is valid and 
decoding are the very same process. IIRC, D has a validation func which is 
more or less just an alias for the decoding func ;-). Moreover, you also need 
to distinguish "word-character" code points from others (punctuation, spacing, 
etc.), which requires Unicode code points (the Unicode Consortium provides 
tables for such tasks).

Thus, I would recommend that you just abandon the illusion of working at the 
level of code units for such tasks, and simply operate on strings of code 
points. (Why do you think D has them built in?)

denis
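
A minimal sketch of the code-point approach recommended above, assuming 
std.utf.decode and std.uni.isAlpha from Phobos. The Lexer struct and the 
matchWord name come from the draft in the question; everything else is 
illustrative rather than a definitive implementation. Note that decode 
throws a UTFException on malformed input, so this sketch does not address 
the wish for a nothrow version.

```d
import std.uni : isAlpha;
import std.utf : decode;

struct Lexer
{
    string str;
    size_t index;

    /// Returns true and moves `index` past the word if the run of
    /// alphabetic code points starting at `index` equals `word`.
    bool matchWord(string word)
    {
        size_t i = index;
        while (i < str.length)
        {
            size_t next = i;
            // decode advances `next` past exactly one code point.
            immutable dchar c = decode(str, next);
            if (!isAlpha(c))
                break;
            i = next;
        }

        if (str[index .. i] != word)
            return false;
        index = i;
        return true;
    }
}

unittest
{
    auto lexer = Lexer("çde fgh", 0);
    assert(!lexer.matchWord("çd"));  // "çd" is only part of the word
    assert(lexer.matchWord("çde"));
    assert(lexer.index == 4);        // 'ç' occupies two code units
}

void main()
{}
```

Each character is decoded exactly once, and the comparison with the 
parameter is done on the undecoded slice, so no extra string is allocated.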

Oct 16 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 10/16/2014 12:43 PM, spir via Digitalmars-d-learn wrote:

 denis

spir is back! :)

On 10/16/2014 11:46 AM, Uranuz wrote:

 I have some string *str* of unicode characters. The question is how to
 check if I have valid unicode code point starting at code unit *index*?

It is easy if I understand the question as skipping over invalid UTF-8 
sequences:

import std.stdio;

ubyte upperTwoBits(ubyte b)
{
     return b & 0b1100_0000;
}

bool isUtf8ContinuationByte(char c)
{
     enum utf8ContinuationPrefix = 0b1000_0000;
     return upperTwoBits(c) == utf8ContinuationPrefix;
}

void moveToValid(ref inout(char)[] s)
{
     /* Skip over UTF-8 continuation bytes. */
     while (s.length && isUtf8ContinuationByte(s[0])) {
         s = s[1..$];
     }

     /*
      * The wchar[] overload is too complicated for Ali at this time. :)
      *
      * Please see the following function template in phobos/std/utf.d:
      *
      * private dchar decodeImpl(bool canIndex, S)(...)
      *     if (is(S : const wchar[]) ...
      */
}

unittest
{
     auto s = "çde";
     moveToValid(s);
     assert(s == "çde");

     s = s[1 .. $];
     moveToValid(s);
     assert(s == "de", s);
}

void moveToValid(ref const(dchar)[] s)
{
     /* Every code unit is valid; nothing to do. */
}

void main()
{}

Ali
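
The same bit tests can be extended into a nothrow check of the single 
sequence starting at a given code unit, which is what the question asks 
for. This is only a sketch: isValidChar is the hypothetical name used in the 
question's draft and validStride is a name made up here; neither is a Phobos 
function. As noted earlier in the thread, a full check ends up doing the 
same work as decoding.

```d
/// Hypothetical helper: length in code units of the well-formed UTF-8
/// sequence starting at s[i], or 0 if there is none.
size_t validStride(const(char)[] s, size_t i) pure nothrow @nogc @safe
{
    if (i >= s.length)
        return 0;

    immutable uint b0 = s[i];
    size_t len;
    uint cp;

    if (b0 < 0x80)
        return 1;                                // ASCII
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else
        return 0;            // continuation byte or invalid lead byte

    if (s.length - i < len)
        return 0;            // sequence is cut off by the end of s

    foreach (k; 1 .. len)
    {
        immutable uint b = s[i + k];
        if ((b & 0xC0) != 0x80)
            return 0;        // expected a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, surrogates and values above U+10FFFF.
    static immutable uint[5] minCp = [0, 0, 0x80, 0x800, 0x1_0000];
    if (cp < minCp[len] || cp > 0x10_FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    return len;
}

/// Hypothetical name taken from the draft in the question.
bool isValidChar(const(char)[] s, size_t i) pure nothrow @nogc @safe
{
    return validStride(s, i) != 0;
}

unittest
{
    auto s = "çde";
    assert(isValidChar(s, 0));            // lead byte of 'ç'
    assert(!isValidChar(s, 1));           // continuation byte of 'ç'
    assert(isValidChar(s, 2));            // 'd'
    assert(validStride(s, 0) == 2);
    assert(!isValidChar(s[0 .. 1], 0));   // lead byte without its trail
}

void main()
{}
```

Returning 0 instead of throwing keeps the function nothrow and @nogc, so 
nothing is allocated for exceptions.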

Oct 16 2014

"Uranuz" <neuranuz gmail.com> writes:

This is

Oct 17 2014

"Uranuz" <neuranuz gmail.com> writes:

I haven't touched any key on a keyboard and haven't pressed 
*Send* but message was posted somehow.

Thanks. Checking for UTF-8 continuation bytes is a good idea. I also 
agree that UTF-16 is more difficult. I will leave it for a future 
release, once the implementation works properly on UTF-8 
and UTF-32.
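
For that future UTF-16 version, the skip-ahead step alone might look like 
the sketch below. It assumes only the standard surrogate ranges; 
isUtf16LowSurrogate is a name made up here, and moveToValid follows the 
naming of the UTF-8 example earlier in the thread. It moves a slice off a 
stray low surrogate but does not detect an unpaired high surrogate, so it 
is not a full validator.

```d
/// Hypothetical helper: true for the second half of a surrogate pair.
bool isUtf16LowSurrogate(wchar c) pure nothrow @nogc @safe
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

void moveToValid(ref const(wchar)[] s) pure nothrow @nogc @safe
{
    /* A low surrogate can only follow a high surrogate, so a slice
     * must not start on one. */
    while (s.length && isUtf16LowSurrogate(s[0]))
        s = s[1 .. $];
}

unittest
{
    // U+1F600 needs a surrogate pair: two code units in UTF-16.
    const(wchar)[] s = "\U0001F600x"w;
    assert(s.length == 3);

    moveToValid(s);
    assert(s.length == 3);   // already at the start of a sequence

    s = s[1 .. $];           // now starts on the low surrogate
    moveToValid(s);
    assert(s == "x"w);
}

void main()
{}
```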

Oct 17 2014

"eles" <eles215 gzk.dot> writes:

On Friday, 17 October 2014 at 16:39:38 UTC, Uranuz wrote:
 I haven't touched any key on a keyboard and haven't pressed 
 *Send* but message was posted somehow.

Scan for rootkits...

Oct 17 2014

ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Fri, 17 Oct 2014 19:13:51 +0000
eles via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 On Friday, 17 October 2014 at 16:39:38 UTC, Uranuz wrote:
 I haven't touched any key on a keyboard and haven't pressed 
 *Send* but message was posted somehow.

 Scan for rootkits...

or touchpad. ;-)

Oct 17 2014
