digitalmars.D.learn - Best way to count character spaces.

"Taylor Hillegeist" <taylorh140 gmail.com> writes:

So I am aware that Unicode is not simple... I have been working 
on a boxes-like project (http://boxes.thomasjensen.com/).

It basically puts a pretty border around stdin characters, like 
so:
 ________________________
/\                       \
\_|Different all twisty a|
  |of in maze are you,   |
  |passages little.      |
  |   ___________________|_
   \_/_____________________/

But I find that I need to know a bit more than the length of the 
string, because of encoding differences.

I had a thought at one point to do this:

MyString.splitLines.map!(a => a.toUTF32.length).reduce!max();

That should get me the longest line.

But this has a problem too, because control characters might not 
take up any space (backspace?), leaving an unwanted nasty space :( 
or might take up a weird amount of space (\t). And perhaps the 
first isn't really something to worry about.

https://en.wikipedia.org/wiki/Unicode_control_characters

Or should I do something like:

MyString.splitLines
		.map!(a => a
			  .map!(a => a
					.isGraphical)
			  .map!(a => cast(int) a?1:0)
			  .array
			  .reduce!((a,b) => a+b))
		.reduce!max

Mostly I am just curious about best practice in this situation.

Both of the above fail with the input:
"hello \n People \nP\u0008ofEARTH"
on my command prompt at least.
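
One way to see why both attempts miscount that input is to track the cursor column instead of counting characters. The following is only a minimal sketch under stated assumptions: `columnsUsed` and `tabStop` are hypothetical names, backspace is assumed to move the cursor one column left without erasing, tab is assumed to jump to the next multiple of `tabStop`, and combining marks or double-width characters are not handled.

```d
import std.algorithm : max;
import std.uni : isGraphical;

/// Widest column reached when `line` is printed on a simple terminal.
/// Assumption: backspace moves the cursor one column left (no erase),
/// and tab advances to the next multiple of `tabStop`.
size_t columnsUsed(string line, size_t tabStop = 8)
{
    size_t col = 0;     // current cursor column
    size_t widest = 0;  // rightmost column reached so far

    foreach (dchar c; line)
    {
        if (c == '\b')
        {
            if (col > 0)
                --col;
        }
        else if (c == '\t')
            col += tabStop - col % tabStop;
        else if (isGraphical(c))
            ++col;  // any other control character is treated as zero-width

        widest = max(widest, col);
    }
    return widest;
}

void main()
{
    import std.stdio : writeln;

    writeln(columnsUsed("P\u0008ofEARTH")); // 7
    writeln(columnsUsed("a\tb"));           // 9
}
```

For the third line of the failing input this gives 7, whereas the code point count is 9 and the isGraphical count is 8.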

Jun 30 2015

Rikki Cattermole <alphaglosined gmail.com> writes:

On 1/07/2015 6:33 a.m., Taylor Hillegeist wrote:
 So I am aware that Unicode is not simple... I have been working on a
 boxes like project http://boxes.thomasjensen.com/

 it basically puts a pretty border around stdin characters. like so:
   ________________________
 /\                       \
 \_|Different all twisty a|
    |of in maze are you,   |
    |passages little.      |
    |   ___________________|_
     \_/_____________________/

 but I find that I need to know a bit more than the length of the string
 because of encoding differences

 I had a thought at one point to do this:

 MyString.splitlines.map!(a => a.toUTF32.length).reduce!max();

 Should get me the longest line.

 but this has a problem too because control characters might not take up
 space (backspace?).

 https://en.wikipedia.org/wiki/Unicode_control_characters

 leaving an unwanted nasty space :( or take weird amount of space \t. And
 perhaps the first isn't really something to worry about.

 Or should i do something like:

 MyString.splitLines
          .map!(a => a
                .map!(a => a
                      .isGraphical)
                .map!(a => cast(int) a?1:0)
                .array
                .reduce!((a,b) => a+b))
          .reduce!max

 Mostly I am just curious of best practice in this situation.

 Both of the above fail with the input:
 "hello \n People \nP\u0008ofEARTH"
 on my command prompt at least.


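Well, I would personally use isWhite.
I would also use filter and count along with it.

So something like this:
import std.algorithm : count, filter, map;
import std.array : array;
import std.string : splitLines;
import std.uni : isWhite;
// One entry per line: how many of its characters satisfy isWhite.
size_t[] lengths = MyString.splitLines
    .map!(line => line.filter!isWhite.count)
    .array;

Untested of course, but may give you ideas :)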

Jun 30 2015

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Tue, Jun 30, 2015 at 06:33:32PM +0000, Taylor Hillegeist via
Digitalmars-d-learn wrote:
 So I am aware that Unicode is not simple... I have been working on a boxes
 like project http://boxes.thomasjensen.com/
 
 it basically puts a pretty border around stdin characters. like so:
  ________________________
 /\                       \
 \_|Different all twisty a|
   |of in maze are you,   |
   |passages little.      |
   |   ___________________|_
    \_/_____________________/
 
 but I find that I need to know a bit more than the length of the string
 because of encoding differences

[...]

Use std.uni.byGrapheme. That's the only reliable way to count anything
remotely resembling the display length of the string, which is not to be
confused with the number of code points, which is also different from
the length of the string in bytes or the number of code units.
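
To make the distinction concrete, here is a small sketch (not from the original post) that prints the three counts for the same string and then takes the longest line by grapheme count. The names `graphemeCount` and `widestLine` are hypothetical helpers made up for this illustration.

```d
import std.algorithm : fold, map, max;
import std.range : walkLength;
import std.string : splitLines;
import std.uni : byGrapheme;

/// Number of grapheme clusters (user-perceived characters) in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// Largest grapheme count over all lines of `text` (0 for empty input).
size_t widestLine(string text)
{
    return text.splitLines
               .map!(l => l.byGrapheme.walkLength)
               .fold!max(size_t(0));
}

void main()
{
    import std.stdio : writeln;

    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    writeln(s.length);         // UTF-8 code units (bytes): 3
    writeln(s.walkLength);     // code points: 2
    writeln(graphemeCount(s)); // graphemes: 1
}
```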

Note that even with byGrapheme, you may still need some post-processing,
because certain terminals may output Asian block characters in double
width, meaning that 1 grapheme takes up two columns on the screen. But
byGrapheme should get you started on the right footing.
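
The post-processing step could look roughly like the sketch below. Everything here is an assumption made for illustration: `isWide` and `displayWidth` are hypothetical helpers, the ranges in `isWide` cover only a few common East Asian blocks rather than the full Unicode East Asian Width data, and whether a given terminal really renders them in two columns still has to be checked.

```d
import std.uni : byGrapheme;

/// Rough guess at whether `c` is drawn two columns wide.
/// Assumption: only a handful of blocks are listed; this is not
/// the complete East Asian Width table.
bool isWide(dchar c)
{
    return (c >= 0x3040 && c <= 0x30FF)   // Hiragana, Katakana
        || (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
        || (c >= 0xAC00 && c <= 0xD7A3)   // Hangul Syllables
        || (c >= 0xFF01 && c <= 0xFF60);  // Fullwidth forms
}

/// Estimated number of terminal columns used by `line`:
/// each grapheme counts 1, or 2 if its first code point is wide.
size_t displayWidth(string line)
{
    size_t width = 0;
    foreach (g; line.byGrapheme)
        width += isWide(g[0]) ? 2 : 1;
    return width;
}

void main()
{
    import std.stdio : writeln;

    writeln(displayWidth("abc"));          // 3
    writeln(displayWidth("\u65E5\u672C")); // 4 (two CJK ideographs)
}
```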


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. --
Christopher

Jun 30 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 7/1/15 1:25 AM, H. S. Teoh via Digitalmars-d-learn wrote:
 On Tue, Jun 30, 2015 at 06:33:32PM +0000, Taylor Hillegeist via Digitalmars-d-learn wrote:
 So I am aware that Unicode is not simple... I have been working on a boxes
 like project http://boxes.thomasjensen.com/

 it basically puts a pretty border around stdin characters. like so:
   ________________________
 /\                       \
 \_|Different all twisty a|
    |of in maze are you,   |
    |passages little.      |
    |   ___________________|_
     \_/_____________________/

 but I find that I need to know a bit more than the length of the string
 because of encoding differences

 [...]

 Use std.uni.byGrapheme. That's the only reliable way to count anything
 remotely resembling the display length of the string, which is not to be
 confused with the number of code points, which is also different from
 the length of the string in bytes or the number of code units.

 Note that even with byGrapheme, you may still need some post-processing,
 because certain terminals may output Asian block characters in double
 width, meaning that 1 grapheme takes up two columns on the screen. But
 byGrapheme should get you started on the right footing.

BTW, this exercise would make an EXCELLENT blog post highlighting both 
the power of D's Unicode support and the hairy issues of Unicode.

I like the ASCII er... Unicode art concept :)

-Steve

Jul 01 2015
