digitalmars.D.learn - Reading binary streams with decoding to Unicode
- Vinay Sajip (7/7) Oct 15 2018 Is there a standardised way of reading over buffered binary
- Dukc (8/15) Oct 15 2018 This is done automatically for character arrays, which includes
- Vinay Sajip (9/13) Oct 15 2018 Thanks for the response. I was looking for something where I
- Nicholas Wilson (14/29) Oct 15 2018 import std.file : readText;
- Vinay Sajip (7/16) Oct 15 2018 Your example shows reading an entire file into memory (string a =
- Nicholas Wilson (3/22) Oct 15 2018 Oh, sorry I missed that. Take a look at
- Vinay Sajip (2/4) Oct 16 2018 Great, thanks.
- Steven Schveighoffer (4/9) Oct 16 2018 Let me know if anything doesn't work there. The text processing is
Is there a standardised way of reading over buffered binary streams (at least strings, files, and sockets) where you can layer a decoder on top, so you get a character stream you can read one Unicode char at a time? Initially UTF-8, but later also other encodings. I see that std.stream was deprecated, but can't see what other options there are. Can anyone point me in the right direction?
Oct 15 2018
On Monday, 15 October 2018 at 10:49:49 UTC, Vinay Sajip wrote:
> Is there a standardised way of reading over buffered binary streams (at least strings, files, and sockets) where you can layer a decoder on top, so you get a character stream you can read one Unicode char at a time? Initially UTF-8, but later also other encodings. I see that std.stream was deprecated, but can't see what other options there are. Can anyone point me in the right direction?

This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points. Vice versa, if you want to iterate a character array by code unit, convert it to ubyte[]/ushort[] (depending on code unit length) or use std.utf.byCodeUnit.
Oct 15 2018
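A minimal sketch of the code point versus code unit distinction described in the reply above. It is not taken from the thread: the helper names `countCodePoints` and `countCodeUnits` and the sample bytes are made up for illustration, and the input is assumed to be UTF-8. One detail worth knowing is that a plain `foreach (c; text)` infers `char` and visits code units, so the loop variable is declared as `dchar` to get decoding.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.utf : byCodeUnit, validate;

/// Number of code points in a UTF-8 string (range primitives decode it).
size_t countCodePoints(string text)
{
    return text.walkLength;
}

/// Number of UTF-8 code units, i.e. bytes, without any decoding.
size_t countCodeUnits(string text)
{
    return text.byCodeUnit.walkLength;
}

void main()
{
    // Raw bytes as they might arrive from a file or socket: "café" in UTF-8.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xC3, 0xA9];

    // Reinterpret the bytes as text. The cast does not check anything,
    // so validate() is called to throw a UTFException on malformed input.
    auto text = cast(string) raw;
    validate(text);

    // Declaring the loop variable as dchar makes foreach decode code points.
    foreach (dchar cp; text)
        writefln("U+%04X", cast(uint) cp);

    assert(countCodePoints(text) == 4); // c, a, f, é
    assert(countCodeUnits(text) == 5);  // é occupies two bytes
}
```

This only covers data that is already in memory, which is the limitation raised in the follow-up posts.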
On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
> This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points.

Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in other languages, so I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?
Oct 15 2018
On Monday, 15 October 2018 at 18:57:19 UTC, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
>> This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points.
>
> Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in other languages, so I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?

    import std.file : readText;
    import std.uni : byCodePoint, byGrapheme;
    // or
    import std.utf : byCodeUnit, byChar /* UTF-8 */, byWchar /* UTF-16 */, byDchar /* UTF-32 */,
        byUTF /* template: byUTF!char, byUTF!wchar or byUTF!dchar */;

    string a = readText("foo");

    foreach (cp; a.byCodePoint)
    {
        // do stuff with code point 'cp'
    }

    foreach (g; a.byGrapheme)
    {
        // do stuff with grapheme 'g'
    }
Oct 15 2018
On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
>     import std.file : readText;
>     import std.uni : byCodePoint, byGrapheme;
>     // or
>     import std.utf : byCodeUnit, byChar /* UTF-8 */, byWchar /* UTF-16 */, byDchar /* UTF-32 */,
>         byUTF /* template: byUTF!char, byUTF!wchar or byUTF!dchar */;
>
>     string a = readText("foo");
>
>     foreach (cp; a.byCodePoint)
>     {
>         // do stuff with code point 'cp'
>     }

Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.
Oct 15 2018
On Monday, 15 October 2018 at 21:48:05 UTC, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
>>     import std.file : readText;
>>     import std.uni : byCodePoint, byGrapheme;
>>     // or
>>     import std.utf : byCodeUnit, byChar /* UTF-8 */, byWchar /* UTF-16 */, byDchar /* UTF-32 */,
>>         byUTF /* template: byUTF!char, byUTF!wchar or byUTF!dchar */;
>>
>>     string a = readText("foo");
>>
>>     foreach (cp; a.byCodePoint)
>>     {
>>         // do stuff with code point 'cp'
>>     }
>
> Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.

Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe
Oct 15 2018
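For files specifically, the streaming requirement can also be approximated with Phobos alone. The following is a hedged sketch, not iopipe's API and not something posted in the thread: the helper name `decodedChars`, the demo file name, and the chunk size are all hypothetical, and the input is assumed to be UTF-8. It reads fixed-size chunks, flattens them into one lazy byte range, and decodes that range one code point at a time, so a multi-byte character that straddles two chunks is still decoded correctly.

```d
import std.algorithm.iteration : joiner, map;
import std.stdio : File;
import std.utf : byDchar;

/// Lazily decodes a UTF-8 file into code points, holding only one
/// chunk of `chunkSize` bytes in memory at a time.
auto decodedChars(File file, size_t chunkSize = 4096)
{
    return file.byChunk(chunkSize)      // range of ubyte[] buffers
               .joiner                  // flattened into a range of ubyte
               .map!(b => cast(char) b) // reinterpreted as UTF-8 code units
               .byDchar;                // decoded lazily into dchar
}

void main()
{
    import std.file : remove, write;

    enum path = "decoded_chars_demo.txt"; // hypothetical scratch file
    write(path, "héllo");                 // 6 bytes, 5 code points
    scope (exit) remove(path);

    // A chunk size of 2 deliberately splits the two-byte 'é' across chunks.
    size_t count;
    foreach (dchar c; decodedChars(File(path), 2))
        ++count;

    assert(count == 5);
}
```

`byDchar` substitutes U+FFFD for malformed sequences rather than throwing, which may or may not be the behaviour wanted. Sockets and encodings other than UTF-8 are not covered by this sketch; a socket would first need its own adapter that exposes received data as a range of byte buffers.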
On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
> Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe

Great, thanks.
Oct 16 2018
On 10/16/18 11:42 AM, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
>> Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe
>
> Great, thanks.

Let me know if anything doesn't work there. The text processing is pretty robust, but I haven't done a lot of work with sockets.

-Steve
Oct 16 2018