digitalmars.D.learn - Reading unicode string with readf ("%s")

"Ivan Kazmenko" <gassa mail.ru> writes:

Hi!

The following code does not correctly handle Unicode strings.
-----
import std.stdio;
void main () {
	string s;
	readf ("%s", &s);
	write (s);
}
-----

Example input ("Test." in Cyrillic):
-----
Тест.
-----
(hex: D0 A2 D0 B5 D1 81 D1 82 2E 0D 0A)

Example output:
-----
Ð¢ÐµÑÑ.
-----
(hex: C3 90 C2 A2 C3 90 C2 B5 C3 91 C2 81 C3 91 C2 82 2E 0D 0A)

Here, each input byte is handled separately, as if it were a code 
point of its own, and is then re-encoded as UTF-8: D0 -> C3 90, 
A2 -> C2 A2, etc.
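
The byte mapping above can be reproduced with a small sketch. This is only an illustration of the symptom, not the actual readf code path; the helper name reencodeBytewise is made up for this example.

```d
import std.stdio;
import std.utf : encode;

// Hypothetical helper: re-encodes every input byte as if it were a complete code point.
ubyte[] reencodeBytewise (const(ubyte)[] input)
{
	ubyte[] output;
	foreach (b; input)
	{
		char[4] buf;
		// Treat the single byte as a whole code point and encode it as UTF-8.
		immutable len = encode (buf, cast(dchar) b);
		output ~= cast(const(ubyte)[]) buf[0 .. len];
	}
	return output;
}

void main ()
{
	// UTF-8 bytes of "Тест." from the hex dump above, without the trailing CR LF.
	ubyte[] input = [0xD0, 0xA2, 0xD0, 0xB5, 0xD1, 0x81, 0xD1, 0x82, 0x2E];
	writefln ("%(%02X %)", reencodeBytewise (input));
}
```

Apart from the trailing 0D 0A, the printed bytes should agree with the hex dump of the example output.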

On the bright side, reading the file with readln works properly.

Is this an expected shortcoming of "%s"-reading a string?
Could it be made to work somehow?
Is it worth a bug report?

Ivan Kazmenko.

Nov 03 2014

"Ivan Kazmenko" <gassa mail.ru> writes:

On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
 	readf ("%s", &s);

Worth noting: this reads to end-of-file (not end-of-line or 
whitespace), and reading the whole file into a string was what I 
indeed expected it to do.

So, if there is an idiomatic way to read the whole file into a 
string which is Unicode-compatible, it would be great to learn 
that, too.

Nov 03 2014

"Ali Çehreli" <acehreli yahoo.com> writes:

On 11/03/2014 11:47 AM, Ivan Kazmenko wrote:
 On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
     readf ("%s", &s);

 Worth noting: this reads to end-of-file (not end-of-line or whitespace),
 and reading the whole file into a string was what I indeed expected it
 to do.

 So, if there is an idiomatic way to read the whole file into a string
 which is Unicode-compatible, it would be great to learn that, too.

I don't know the answer to the Unicode issue with readf but you can read 
the file by chunks:

import std.stdio;
import std.array;
import std.exception;

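string readAll(File file)
{
     char[666] buffer;   // fixed-size chunk buffer, reused on every iteration
     char[] contents;
     char[] piece;

     do {
         // rawRead returns the slice of buffer that was actually filled
         piece = file.rawRead(buffer);
         contents ~= piece;

     } while (!piece.empty);   // an empty slice means end of file

     // contents is not referenced anywhere else, so it can be treated as immutable
     return assumeUnique(contents);
}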

void main () {
     string s = stdin.readAll();

     write (s);
}

Ali

Nov 03 2014

"Ivan Kazmenko" <gassa mail.ru> writes:

On Monday, 3 November 2014 at 20:03:03 UTC, Ali Çehreli wrote:
 On 11/03/2014 11:47 AM, Ivan Kazmenko wrote:
 On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko 
 wrote:
    readf ("%s", &s);

 Worth noting: this reads to end-of-file (not end-of-line or 
 whitespace),
 and reading the whole file into a string was what I indeed 
 expected it
 to do.

 So, if there is an idiomatic way to read the whole file into a 
 string
 which is Unicode-compatible, it would be great to learn that, 
 too.

 I don't know the answer to the Unicode issue with readf but you 
 can read the file by chunks:

 import std.stdio;
 import std.array;
 import std.exception;

 string readAll(File file)
 {
     char[666] buffer;
     char[] contents;
     char[] piece;

     do {
         piece = file.rawRead(buffer);
         contents ~= piece;

     } while (!piece.empty);

     return assumeUnique(contents);
 }

 void main () {
     string s = stdin.readAll();

     write (s);
 }

 Ali

Thank you for suggesting an alternative!
Looks like it would be an efficient one, too.
I believe it can be made a bit more efficient if using an 
appender, right?

Still, that's a lot of code for a minute scripting task, albeit 
one has to write the readAll function only once.
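
A possible appender-based variant of readAll, as a minimal sketch: the 4096-byte chunk size is an arbitrary choice for this example, and no measurement backs the claim that it is faster.

```d
import std.array : appender;
import std.stdio;

string readAll (File file)
{
	auto contents = appender!string ();
	char[4096] buffer; // chunk size picked arbitrarily for this sketch

	for (;;)
	{
		// rawRead returns the filled part of the buffer; an empty slice means end of file.
		auto piece = file.rawRead (buffer[]);
		if (piece.length == 0)
			break;
		// The appender copies the bytes, so the buffer can be reused.
		contents.put (piece);
	}
	return contents.data;
}

void main ()
{
	string s = stdin.readAll ();
	write (s);
}
```

Like the original, this copies raw bytes without validating them; std.utf.validate could be applied to the result if the input is not trusted to be UTF-8.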

Nov 04 2014

"Gary Willoughby" <dev nomad.so> writes:

On Monday, 3 November 2014 at 19:47:17 UTC, Ivan Kazmenko wrote:
 So, if there is an idiomatic way to read the whole file into a 
 string which is Unicode-compatible, it would be great to learn 
 that, too.

Maybe something like this:

import std.stdio;
import std.array;
import std.conv;

string text = stdin
	.byLine(KeepTerminator.yes)
	.join()
	.to!(string);

Nov 03 2014

"Ivan Kazmenko" <gassa mail.ru> writes:

On Monday, 3 November 2014 at 20:10:02 UTC, Gary Willoughby wrote:
 On Monday, 3 November 2014 at 19:47:17 UTC, Ivan Kazmenko wrote:
 So, if there is an idiomatic way to read the whole file into a 
 string which is Unicode-compatible, it would be great to learn 
 that, too.

 Maybe something like this:

 import std.stdio;
 import std.array;
 import std.conv;

 string text = stdin
 	.byLine(KeepTerminator.yes)
 	.join()
 	.to!(string);

And thanks for a short alternative!

At first glance, it looks like it sacrifices a bit of efficiency 
on the way: the "remove-line-breaks, then add-line-breaks" path 
looks redundant.
Still, it does not store an intermediate split representation, so 
the inefficiency is in fact not catastrophic, right?

Nov 04 2014

"Ivan Kazmenko" <gassa mail.ru> writes:

On Tuesday, 4 November 2014 at 18:09:48 UTC, Ivan Kazmenko wrote:
 On Monday, 3 November 2014 at 20:10:02 UTC, Gary Willoughby 
 wrote:
 On Monday, 3 November 2014 at 19:47:17 UTC, Ivan Kazmenko 
 wrote:
 So, if there is an idiomatic way to read the whole file into 
 a string which is Unicode-compatible, it would be great to 
 learn that, too.

 Maybe something like this:

 import std.stdio;
 import std.array;
 import std.conv;

 string text = stdin
 	.byLine(KeepTerminator.yes)
 	.join()
 	.to!(string);

 And thanks for a short alternative!

 At first glance, looks like it sacrifices a bit of efficiency 
 on the way: the "remove-line-breaks, then add-line-breaks" path 
 looks redundant.
 Still, it does not store intermediate splitted representation, 
 so the inefficiency is in fact not catastrophic, right?

And sorry, I didn't read that correctly.
Using byLine with KeepTerminator.yes and join with nothing, it 
does not alter line breaks at all.
So, the efficiency of this is entirely up to whether the optimizer 
is able to detect that the split/join sequence is a no-op.
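
The "keeps line breaks intact" part can be checked in memory. The sketch below is an analogue only: it uses std.string.splitLines on a string instead of byLine on a File, and the helper name splitAndJoin is made up for this example.

```d
import std.array : join;
import std.stdio : writeln;
import std.string : splitLines;
import std.typecons : Yes;

// Hypothetical helper: splits into lines while keeping terminators,
// then joins the pieces with no separator.
string splitAndJoin (string text)
{
	return text.splitLines (Yes.keepTerminator).join ();
}

void main ()
{
	string original = "Тест.\r\nsecond line\nlast line without terminator";
	// Compares the round-tripped text with the input.
	writeln (splitAndJoin (original) == original);
}
```

Whether the compiler can remove the intermediate work is a separate question that this sketch does not address.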

Nov 04 2014

"Kagamin" <spam here.lot> writes:

https://issues.dlang.org/show_bug.cgi?id=12990 this?

Nov 04 2014

"Ivan Kazmenko" <gassa mail.ru> writes:

On Tuesday, 4 November 2014 at 11:46:24 UTC, Kagamin wrote:
 https://issues.dlang.org/show_bug.cgi?id=12990 this?

Similar, but not quite that.  Bugs 12990 and 1448 (linked from 
there) seem to have the Windows console as an important part of 
the process.  For me, the example does not work even with files, 
either redirected via "test.exe <one.txt >two.txt" or using File 
structs inside a D program.

Still, thank you for the link!

Nov 04 2014

"anonymous" <anonymous example.com> writes:

On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
 Hi!

 The following code does not correctly handle Unicode strings.
 -----
 import std.stdio;
 void main () {
 	string s;
 	readf ("%s", &s);
 	write (s);
 }
 -----

 Example input ("Test." in cyrillic):
 -----
 Тест.
 -----
 (hex: D0 A2 D0 B5 D1 81 D1 82 2E 0D 0A)

 Example output:
 -----
 Ð¢ÐµÑÑ.
 -----
 (hex: C3 90 C2 A2 C3 90 C2 B5 C3 91 C2 81 C3 91 C2 82 2E 0D 0A)

 Here, the input bytes are handled separately: D0 -> C3 90, A2 
 -> C2 A2, etc.

 On the bright side, reading the file with readln works properly.

 Is this an expected shortcoming of "%s"-reading a string?

No.

 Could it be made to work somehow?

Yes. std.stdio.LockingTextReader is to blame:

void main()
{
      import std.stdio;
      auto ltr = LockingTextReader(std.stdio.stdin);
      write(ltr);
}
----
$ echo Тест | rdmd test.d
Ð¢ÐµÑÑ

LockingTextReader has a dchar front. But it doesn't do any 
decoding. The dchar front is really a char front.
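
To illustrate the difference between a "dchar front that is really a char front" and a decoding one, here is a minimal sketch. It does not use LockingTextReader itself; the two helper functions are made up for this example.

```d
import std.stdio;

// Mimics the faulty behavior: every UTF-8 code unit is passed off as a code point.
dchar[] withoutDecoding (string s)
{
	dchar[] result;
	foreach (char c; s)
		result ~= cast(dchar) c;
	return result;
}

// What a dchar front is expected to deliver: one element per decoded code point.
dchar[] withDecoding (string s)
{
	dchar[] result;
	// foreach over a string with a dchar loop variable decodes UTF-8.
	foreach (dchar c; s)
		result ~= c;
	return result;
}

void main ()
{
	string s = "Тест";
	writeln (withoutDecoding (s).length); // 8: one element per code unit
	writeln (withDecoding (s).length);    // 4: one element per code point
}
```

Writing out the first result re-encodes each of the eight elements separately, which gives the doubled bytes seen in the output above.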

 Is it worth a bug report?

Yes.

 Ivan Kazmenko.

Nov 04 2014

"Ivan Kazmenko" <gassa mail.ru> writes:

On Tuesday, 4 November 2014 at 13:01:48 UTC, anonymous wrote:
 On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
 Hi!

 The following code does not correctly handle Unicode strings.
 -----
 import std.stdio;
 void main () {
 	string s;
 	readf ("%s", &s);
 	write (s);
 }
 -----

 Example input ("Test." in cyrillic):
 -----
 Тест.
 -----
 (hex: D0 A2 D0 B5 D1 81 D1 82 2E 0D 0A)

 Example output:
 -----
 Ð¢ÐµÑÑ.
 -----
 (hex: C3 90 C2 A2 C3 90 C2 B5 C3 91 C2 81 C3 91 C2 82 2E 0D 0A)

 Here, the input bytes are handled separately: D0 -> C3 90, A2 
 -> C2 A2, etc.

 On the bright side, reading the file with readln works 
 properly.

 Is this an expected shortcoming of "%s"-reading a string?

 No.

 Could it be made to work somehow?

 Yes. std.stdio.LockingTextReader is to blame:

 void main()
 {
      import std.stdio;
      auto ltr = LockingTextReader(std.stdio.stdin);
      write(ltr);
 }
 ----
 $ echo Тест | rdmd test.d
 Ð¢ÐµÑÑ

 LockingTextReader has a dchar front. But it doesn't do any 
 decoding. The dchar front is really a char front.

 Is it worth a bug report?

 Yes.

 Ivan Kazmenko.


You nailed it!
Reported the bug in original form: 
https://issues.dlang.org/show_bug.cgi?id=13686
Perhaps your reduction would be useful.

Nov 04 2014
