digitalmars.D.learn - Reading binary streams with decoding to Unicode

reply Vinay Sajip <vinay_sajip yahoo.co.uk> writes:
Is there a standardised way of reading over buffered binary 
streams (at least strings, files, and sockets) where you can 
layer a decoder on top, so you get a character stream you can 
read one Unicode char at a time? Initially UTF-8, but later also 
other encodings. I see that std.stream was deprecated, but can't 
see what other options there are. Can anyone point me in the 
right direction?
Oct 15 2018
parent reply Dukc <ajieskola gmail.com> writes:
On Monday, 15 October 2018 at 10:49:49 UTC, Vinay Sajip wrote:
 Is there a standardised way of reading over buffered binary 
 streams (at least strings, files, and sockets) where you can 
 layer a decoder on top, so you get a character stream you can 
 read one Unicode char at a time? Initially UTF-8, but later 
 also other encodings. I see that std.stream was deprecated, but 
 can't see what other options there are. Can anyone point me in 
 the right direction?
This is done automatically for character arrays, which include strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points. Vice versa, if you want to iterate a character array by code unit, convert it to ubyte[]/ushort[] (depending on code unit length) or use std.utf.byCodeUnit.
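A minimal sketch of the conversion described above, assuming the bytes really are UTF-8. The helper name asUtf8 is made up for illustration; cast, std.utf.validate and std.utf.byCodeUnit are standard. Note that the loop variable is declared as dchar, which is what asks foreach to decode whole code points.

```d
import std.stdio : writefln;
import std.utf : byCodeUnit, validate;

/// Hypothetical helper: reinterprets raw bytes as UTF-8 text.
/// validate throws a UTFException if the bytes are not well-formed UTF-8.
string asUtf8(immutable(ubyte)[] raw)
{
    auto text = cast(string) raw; // same memory, no copy
    validate(text);
    return text;
}

void main()
{
    // "hé": 'h' is one byte, 'é' (U+00E9) is the two bytes 0xC3 0xA9.
    immutable(ubyte)[] raw = [0x68, 0xC3, 0xA9];
    string text = asUtf8(raw);

    // Declaring the loop variable as dchar decodes one code point per step.
    foreach (dchar c; text)
        writefln("code point U+%04X", cast(uint) c);

    // byCodeUnit walks the undecoded UTF-8 code units instead.
    foreach (char u; text.byCodeUnit)
        writefln("code unit 0x%02X", cast(ubyte) u);
}
```

The first loop runs twice and the second three times, since 'é' occupies two code units.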
Oct 15 2018
parent reply Vinay Sajip <vinay_sajip yahoo.co.uk> writes:
On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
 This is done automatically for character arrays, which includes 
 strings. wchar arrays will iterate by UTF-16, and dchar arrays 
 by UTF-32. If you have a byte/ubyte array you know to be 
 unicode-encoded, convert it to char[] to iterate by code points.
Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in other languages, so I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?
Oct 15 2018
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Monday, 15 October 2018 at 18:57:19 UTC, Vinay Sajip wrote:
 On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
 This is done automatically for character arrays, which 
 includes strings. wchar arrays will iterate by UTF-16, and 
 dchar arrays by UTF-32. If you have a byte/ubyte array you 
 know to be unicode-encoded, convert it to char[] to iterate by 
 code points.
Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in other languages, so I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?
import std.file : readText;
import std.uni : byCodePoint, byGrapheme;
// or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*utf8(?)*/;
string a = readText("foo");

foreach(cp; a.byCodePoint)
{
    // do stuff with code point 'cp'
}

foreach(g; a.byGrapheme)
{
    // do stuff with grapheme 'g'
}
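To make the difference between the two loops concrete, here is a small sketch using a string literal instead of a file. The sample string is an assumption chosen for illustration: an 'e' followed by a combining accent, which is two code points but a single user-perceived character.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byDchar;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): renders as one glyph.
    string s = "e\u0301";

    writeln(s.byDchar.walkLength);    // prints 2 (code points)
    writeln(s.byGrapheme.walkLength); // prints 1 (grapheme cluster)
}
```

Which of the two is the right unit depends on whether the consumer cares about Unicode scalar values or about what a reader would call a character.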
Oct 15 2018
parent reply Vinay Sajip <vinay_sajip yahoo.co.uk> writes:
On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
 import std.file : readText;
 import std.uni : byCodePoint, byGrapheme;
 // or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar 
 /*utf16*/, byDchar /*utf32*/, byUTF  /*utf8(?)*/;
 string a = readText("foo");

 foreach(cp; a.byCodePoint)
 {
     // do stuff with code point 'cp'
 }
Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.
Oct 15 2018
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Monday, 15 October 2018 at 21:48:05 UTC, Vinay Sajip wrote:
 On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson 
 wrote:
 import std.file : readText;
 import std.uni : byCodePoint, byGrapheme;
 // or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar 
 /*utf16*/, byDchar /*utf32*/, byUTF  /*utf8(?)*/;
 string a = readText("foo");

 foreach(cp; a.byCodePoint)
 {
     // do stuff with code point 'cp'
 }
Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.
Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe
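For files specifically, something similar can be roughly sketched with Phobos alone by layering ranges: File.byChunk supplies the buffering, joiner flattens the chunks into a byte stream, and std.utf.byDchar decodes code points lazily, including ones that straddle a chunk boundary. This is only a sketch, not a replacement for iopipe: the function name decodedChars and the demo file name are made up, the input is assumed to be UTF-8, and as far as I know malformed sequences come out as U+FFFD rather than raising an error. It also does not cover sockets.

```d
import std.algorithm : joiner, map;
import std.stdio : File, writeln;
import std.utf : byDchar;

/// Hypothetical helper: lazily decodes the UTF-8 contents of `f` into
/// code points, reading `chunkSize` bytes from the file at a time.
auto decodedChars(File f, size_t chunkSize = 4096)
{
    return f.byChunk(chunkSize)          // range of ubyte[] buffers
            .joiner                      // flattened into a range of ubyte
            .map!(b => cast(char) b)     // viewed as UTF-8 code units
            .byDchar;                    // decoded one code point at a time
}

void main()
{
    import std.file : remove, write;

    enum path = "decoded_chars_demo.txt"; // hypothetical scratch file
    write(path, "héllo");
    scope (exit) remove(path);

    // A chunk size of 2 splits the two bytes of 'é' across two reads,
    // so this exercises decoding across a buffer boundary.
    foreach (dchar c; decodedChars(File(path, "rb"), 2))
        writeln(c);
}
```

Only one chunk is held in memory at a time, so the whole file never needs to be loaded.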
Oct 15 2018
parent reply Vinay Sajip <vinay_sajip yahoo.co.uk> writes:
On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
 Oh, sorry I missed that. Take a look at 
 https://github.com/schveiguy/iopipe
Great, thanks.
Oct 16 2018
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 10/16/18 11:42 AM, Vinay Sajip wrote:
 On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
 Oh, sorry I missed that. Take a look at 
 https://github.com/schveiguy/iopipe
Great, thanks.
Let me know if anything doesn't work there. The text processing is pretty robust, but I haven't done a lot of work with sockets. -Steve
Oct 16 2018