digitalmars.D.learn - Reading binary streams with decoding to Unicode

Vinay Sajip <vinay_sajip yahoo.co.uk> writes:

Is there a standardised way of reading over buffered binary 
streams (at least strings, files, and sockets) where you can 
layer a decoder on top, so you get a character stream you can 
read one Unicode char at a time? Initially UTF-8, but later also 
other encodings. I see that std.stream was deprecated, but can't 
see what other options there are. Can anyone point me in the 
right direction?

Oct 15 2018

Dukc <ajieskola gmail.com> writes:

On Monday, 15 October 2018 at 10:49:49 UTC, Vinay Sajip wrote:
 Is there a standardised way of reading over buffered binary 
 streams (at least strings, files, and sockets) where you can 
 layer a decoder on top, so you get a character stream you can 
 read one Unicode char at a time? Initially UTF-8, but later 
 also other encodings. I see that std.stream was deprecated, but 
 can't see what other options there are. Can anyone point me in 
 the right direction?

This is done automatically for character arrays, which includes 
strings. wchar arrays will iterate by UTF-16, and dchar arrays by 
UTF-32. If you have a byte/ubyte array you know to be 
Unicode-encoded, convert it to char[] to iterate by code points.

Vice versa, if you want to iterate a character array by code 
unit, convert it to ubyte[]/ushort[] (depending on code unit 
length) or use std.utf.byCodeUnit.
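
A minimal sketch of the code unit vs. code point distinction described above. The helper names are hypothetical, and the byte array is simply assumed to hold valid UTF-8. One caveat: a plain foreach over a char[] yields code units, so the sketch relies on range functions (which auto-decode narrow strings) and on an explicit dchar loop variable.

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

// Hypothetical helper: counts code points, relying on the
// auto-decoding that range functions apply to narrow strings.
size_t countCodePoints(const(char)[] s)
{
    return s.walkLength;
}

// Hypothetical helper: counts UTF-8 code units without decoding.
size_t countCodeUnits(const(char)[] s)
{
    return s.byCodeUnit.walkLength;
}

void main()
{
    // Raw bytes assumed to be UTF-8: 'h', U+00E9 (two bytes), 'l', 'l', 'o'.
    immutable(ubyte)[] raw = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F];
    auto text = cast(string) raw; // reinterprets the bytes; nothing is validated

    assert(countCodeUnits(text) == 6);
    assert(countCodePoints(text) == 5);

    size_t n;
    foreach (dchar c; text) // asking for dchar makes foreach decode
        ++n;
    assert(n == 5);
}
```

The cast only reinterprets memory; if the bytes might not be valid UTF-8, std.utf.validate could be called first.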

Oct 15 2018

Vinay Sajip <vinay_sajip yahoo.co.uk> writes:

On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
 This is done automatically for character arrays, which includes 
 strings. wchar arrays will iterate by UTF-16, and dchar arrays 
 by UTF-32. If you have a byte/ubyte array you know to be 
 Unicode-encoded, convert it to char[] to iterate by code points.

Thanks for the response. I was looking for something where I 
don't have to manage buffers myself (e.g. when handling buffered 
file or socket I/O). It's really easy to find this functionality 
in other languages, so I'm surprised there doesn't seem to be a 
ready-to-go equivalent in D. For example, I can find D examples 
of opening files and reading a line at a time, but no examples of 
opening a file and reading Unicode chars one at a time. Perhaps 
I've just missed them?

Oct 15 2018

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Monday, 15 October 2018 at 18:57:19 UTC, Vinay Sajip wrote:
 On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
 This is done automatically for character arrays, which 
 includes strings. wchar arrays will iterate by UTF-16, and 
 dchar arrays by UTF-32. If you have a byte/ubyte array you 
 know to be Unicode-encoded, convert it to char[] to iterate by 
 code points.

 Thanks for the response. I was looking for something where I 
 don't have to manage buffers myself (e.g. when handling 
 buffered file or socket I/O). It's really easy to find this 
 functionality in other languages, so I'm surprised there 
 doesn't seem to be a ready-to-go equivalent in D. For example, 
 I can find D examples of opening files and reading a line at a 
 time, but no examples of opening a file and reading Unicode 
 chars one at a time. Perhaps I've just missed them?

import std.file : readText;
import std.uni : byCodePoint, byGrapheme;
// or: import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/,
//     byDchar /*utf32*/, byUTF /*target char type is a template argument*/;
string a = readText("foo");

foreach(cp; a.byCodePoint)
{
     // do stuff with code point 'cp'
}

foreach(g; a.byGrapheme)
{
     // do stuff with grapheme 'g'
}
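
A small sketch of why the two loops above can differ: one user-perceived character may span several code points. The helper name is hypothetical, and the string is held fully in memory, as in the snippet above.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of grapheme clusters
// (user-perceived characters) in a string.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // two code points that render as a single character.
    string s = "e\u0301";
    assert(s.walkLength == 2);      // code points (auto-decoded)
    assert(countGraphemes(s) == 1); // grapheme clusters
}
```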

Oct 15 2018

Vinay Sajip <vinay_sajip yahoo.co.uk> writes:

On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
 import std.file : readText;
 import std.uni : byCodePoint, byGrapheme;
 // or: import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/,
 //     byDchar /*utf32*/, byUTF /*target char type is a template argument*/;
 string a = readText("foo");

 foreach(cp; a.byCodePoint)
 {
     // do stuff with code point 'cp'
 }

Your example shows reading an entire file into memory (string a = 
readText("foo")), then iterating over that. I know you can 
iterate over a string; I'm interested in iterating over a stream, 
which is perhaps read over a network or from another I/O source, 
where you can't assume you have access to all of it at once - 
just one Unicode character at a time.

Oct 15 2018

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Monday, 15 October 2018 at 21:48:05 UTC, Vinay Sajip wrote:
 On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson 
 wrote:
 import std.file : readText;
 import std.uni : byCodePoint, byGrapheme;
 // or: import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/,
 //     byDchar /*utf32*/, byUTF /*target char type is a template argument*/;
 string a = readText("foo");

 foreach(cp; a.byCodePoint)
 {
     // do stuff with code point 'cp'
 }

 Your example shows reading an entire file into memory (string a 
 = readText("foo")), then iterating over that. I know you can 
 iterate over a string; I'm interested in iterating over a 
 stream, which is perhaps read over a network or from another 
 I/O source, where you can't assume you have access to all of it 
 at once - just one Unicode character at a time.

Oh, sorry I missed that. Take a look at 
https://github.com/schveiguy/iopipe
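
For comparison, a Phobos-only sketch of the streaming idea. This is not iopipe, whose API is not shown in this thread. It assumes UTF-8 input read from a file, and the helper name and scratch file name are hypothetical. The chunk size is kept tiny on purpose so that a multi-byte sequence straddles two reads.

```d
import std.algorithm.iteration : joiner, map;
import std.file : remove, write;
import std.stdio : File;
import std.utf : byDchar;

// Hypothetical helper: lazily decodes a UTF-8 file into code points,
// reading chunkSize bytes at a time rather than the whole file.
auto decodedChars(string path, size_t chunkSize)
{
    return File(path)
        .byChunk(chunkSize)      // range of ubyte[] buffers
        .joiner                  // flattened into a range of ubyte
        .map!(b => cast(char) b) // reinterpreted as UTF-8 code units
        .byDchar;                // decoded lazily into dchar
}

void main()
{
    enum path = "utf8_demo.txt";      // hypothetical scratch file
    write(path, "h\u00E9llo \u20AC"); // 7 code points, 10 bytes
    scope (exit) remove(path);

    // With 4-byte chunks the 3-byte euro sign is split across two reads,
    // yet it still arrives as a single dchar.
    dchar[] seen;
    foreach (dchar c; decodedChars(path, 4))
        seen ~= c;
    assert(seen == "h\u00E9llo \u20AC"d);
}
```

byDchar substitutes U+FFFD for invalid sequences by default. Other encodings and socket input would need a different byte source and decoder, which this sketch does not attempt.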

Oct 15 2018

Vinay Sajip <vinay_sajip yahoo.co.uk> writes:

On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
 Oh, sorry I missed that. Take a look at 
 https://github.com/schveiguy/iopipe

Great, thanks.

Oct 16 2018

Steven Schveighoffer <schveiguy gmail.com> writes:

On 10/16/18 11:42 AM, Vinay Sajip wrote:
 On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
 Oh, sorry I missed that. Take a look at 
 https://github.com/schveiguy/iopipe

 
 Great, thanks.

Let me know if anything doesn't work there. The text processing is 
pretty robust, but I haven't done a lot of work with sockets.

-Steve

Oct 16 2018
