digitalmars.D - Transcoding - who's doing what?
- Arcane Jill (90/90) Aug 15 2004 There have been loads and loads of discussions in recent weeks about Uni...
- Ben Hinkle (3/5) Aug 15 2004 std.stream.InputStream and OutputStream interfaces already exist (since
- Arcane Jill (10/17) Aug 15 2004 Ah. That would be why I didn't know it. I've only read the HTML, not the...
- antiAlias (85/175) Aug 15 2004 I'm not doing anything specific for transcoding (yet) Jill; but will as ...
- Arcane Jill (15/23) Aug 15 2004 I'm afraid it doesn't have anything relevant to encoding or decoding, so...
- antiAlias (8/13) Aug 15 2004 someone
- teqDruid (6/41) Aug 15 2004 I, for one, would prefer that the core functionality NOT be phobos-strea...
- Arcane Jill (9/14) Aug 15 2004 Right, but this "set of functions" (or classes, which I'd prefer) would ...
- antiAlias (13/27) Aug 15 2004 Might I suggest something along the following lines:
- Nick (11/18) Aug 15 2004 Ok, here's my shot at it:
- Arcane Jill (8/9) Aug 16 2004 I think we should establish what we need, who needs what and why, etc., ...
- Nick (8/15) Aug 16 2004 That is ok. You raise some interesting points in your other post, and I ...
- teqDruid (5/9) Aug 15 2004 That's what I was getting at... I don't know much about Unicode
- Arcane Jill (24/33) Aug 16 2004 Suppose you want to decode a dchar from a stream, and then immediately r...
- Martin M. Pedersen (9/14) Aug 16 2004 One?
- Arcane Jill (105/115) Aug 16 2004 That would be bad. I think it's possible you haven't understood the issu...
- teqDruid (17/119) Aug 16 2004 Understood. This code looks reasonably agnostic, and even simple enough
- Arcane Jill (25/40) Aug 17 2004 Yes. That's because a string can always be viewed as a stream, but a str...
- antiAlias (120/185) Aug 16 2004 Confusion abounds! I follow you Jill, but please don't underestimate the
- antiAlias (2/14) Aug 16 2004 Whoops! Those twin while loops should, of course, be a single while() wi...
- Arcane Jill (14/63) Aug 17 2004 I would say that here the time spent getting the web page from the serve...
- Ben Hinkle (5/14) Aug 17 2004 std.stream supports ungetc, which pushes a character back by maintaining...
- Sean Kelly (9/23) Aug 17 2004 For the record, this is exactly what my mods to std.utf are for. In fac...
- Walter (7/11) Aug 15 2004 Unicode,
- Sean Kelly (27/41) Aug 16 2004 Hard to answer as I don't really know what will happen with Phobos in th...
- Arcane Jill (28/54) Aug 17 2004 Apparently, so does Phobos, although I didn't know that at the time I po...
- antiAlias (17/23) Aug 17 2004 that
- stonecobra (4/18) Aug 17 2004 Or, since it is open source, you can just compile it in ala std.* and
- Sean Kelly (9/29) Aug 17 2004 It's only a prototype in the sense that I haven't really finished it yet...
There have been loads and loads of discussions in recent weeks about Unicode, streams, and transcodings. There seems to be a general belief that "things are happening", but I'm not quite clear on the specifics - hence this post, which is basically a question. To clarify my own plans on the Unicode front, the purpose of the etc.unicode library is to implement all of the algorithms defined by the Unicode standard on the Unicode website. ("All" is quite ambitious, actually, and it will take a long time to achieve that, but obviously the core ones will come first, and most of the property-getting functions are already there). But I'm /not/ planning on writing any transcoding functions, simply because they're not part of the Unicode standard. Transcoding, in fact, is all about converting /to/ Unicode from something else (and vice versa). Transcoding functions are easy to write - for most encodings a simple 256-entry lookup table will suffice, at least in one direction. But transcoding in strings is not necessarily the best architecture, and it would probably be better to do it at a lower level, using streams (aka filters/readers/writers) - basically just classes which implement a read() function and/or a write() function. I don't know who, if anyone, is currently working on this. In post http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/5925, Hauke said: "I'm currently working on ... a string interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...).", but it's possible I may have read too much into that. I also know that Sean is doing some stream stuff, and that in post http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/8236, he said 'Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." 
For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.' and 'I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc.' I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what? Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too. So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't been done yet, basically because we haven't agreed on an architecture, and I for one am not really sure who's doing it anyway. Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders yet, or is it still up in the air? And, (2), if the answer to (1) is no, I'd like to suggest that a couple of simple classes be written which, I believe, will slot nicely into whatever architecture we eventually come up with. This is what I suspect will do the job. Two classes: Now these will probably need some adapting to fit into our final architecture. (Should they derive from Stream? Or from some yet-to-be-defined transcoding Reader/Writer base classes? Should they implement some interface? Should they be merged into a single class? etc.) BUT - they won't need /much/ adaptation, and once we've got Latin-1 working, we'll have an example on which to model all the others. So feel free to take the above code and adapt it as necessary. But I do think we should nail down the architecture soon, as we're getting a lot of questions and discussion on this. But one thing at a time. Someone tell me where streams are going (with regard to above questions) and then I'll have more suggestions. Arcane Jill
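For concreteness, a minimal sketch of what such a Latin-1 pair might look like (a hypothetical illustration only - the class and method names are assumptions, not settled API):

```d
// Hypothetical sketch of a Latin-1 transcoder pair.
// Latin-1 is the easy case: its 256 codepoints coincide with U+0000..U+00FF.
class Latin1Decoder
{
    // Decode one Latin-1 byte into a Unicode character.
    dchar decode(ubyte b)
    {
        return cast(dchar) b;   // identity mapping
    }
}

class Latin1Encoder
{
    // Encode one Unicode character as a Latin-1 byte.
    ubyte encode(dchar c)
    {
        if (c > 0xFF)
            throw new Error("character not representable in Latin-1");
        return cast(ubyte) c;
    }
}
```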
Aug 15 2004
> I also need a bit of educating on the future of D's streams. Are we going
> to get separate InputStream and OutputStream interfaces, or what?

std.stream.InputStream and OutputStream interfaces already exist (since 0.89). All the "new" stuff in std.stream isn't in the phobos.html doc. Are you thinking of a different InputStream and OutputStream?
Aug 15 2004
In article <cfogis$12on$1 digitaldaemon.com>, Ben Hinkle says...

> > I also need a bit of educating on the future of D's streams. Are we going
> > to get separate InputStream and OutputStream interfaces, or what?
>
> std.stream.InputStream and OutputStream interfaces already exist (since 0.89).

I didn't know that. Thanks.

> All the "new" stuff in std.stream isn't in the phobos.html doc.

Ah. That would be why I didn't know it. I've only read the HTML, not the D source. I know a lot of folk have suggested that I should read the source, but I guess it's an ideological thing - using the specifics of the source smacks of relying on undocumented features to me, something not guaranteed to work in future incarnations. How hard would it be to update the documentation?

> Are you thinking of a different InputStream and OutputStream?

I wasn't thinking of anything. I just didn't know there was such a beast. Thanks for educating me.

Jill
Aug 15 2004
I'm not doing anything specific for transcoding (yet) Jill; but will as soon as the appropriate knowledge is made available in the shape of some low-level libraries. If etc.unicode already has those, well, I'll get on the job pronto.

As for architecture, this is how mango.io approaches it:

One might consider mango.io to have three separate, but related and bindable, entities. These are Conduit, Buffer, and Reader/Writer.

Conduits represent things like files, sockets, and other 'physical', block oriented devices. You can talk to a Conduit directly (via read/write methods) with an instance of a Buffer.

The next stage up in the pecking order is the Buffer, which acts as a bi-directional queue for Conduit data (or can be used independently, like OutBuffer, for that matter). You can read and write to a buffer using void[], or map it directly to a local array if desired. Buffers are intended as an abstraction over the more physical Conduit. You can use a common Buffer for both read and write purposes, or you can have a separate instance for each.

On top of the Buffer, one can map either a set of Tokenizers (for scanf-like processing), or a set of Readers/Writers. The latter convert between representations: usually programmer-idioms to Conduit-idioms and back again. For example, a Reader might convert Buffer content into ints, longs, char[] arrays and so on. Writer does the opposite. You can make a Reader/Writer pair do whatever you wish in terms of conversion: a classic example is endian conversion, but others might include various other transcoding tasks, including Unicode. In addition, you can map multiple Readers/Writers onto a common Buffer, and they will all behave sequentially as one might imagine. The latter is handy for when you need to see what the content is before reading it in some other manner (think HTTP headers, followed by content that's been zip-compressed).
You might think of the Reader/Writer layer as "piecemeal" IO: they usually work with small amounts of data at a time.

Finally, the Conduit actually has an optional filter "intercept" layer: you can build a filter to modify either the input or output in void[] style. That is, an output filter is given a void[], and does whatever it wants with it (usually calls the next filter in the chain, which will ultimately cause the modulated content to be written somewhere).

This sounds somewhat complex, but the APIs make it really easy (certainly as easy as phobos.io) to get things hooked up. For example, when reading a file you typically do the following:

FileConduit fc = new FileConduit ("file.name");
Reader r = new Reader (fc);
r.get(x).get(y).get(z);    (or r >> x >> y >> z;)

etc. So, whenever the appropriate unicode converters are available, I (or someone else) can hook them up either at the Buffer layer, or at the Conduit-filter layer. If you'd be interested in doing that, I'd be very, very, grateful!

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfog97$12n2$1 digitaldaemon.com...
<snip>
Aug 15 2004
In article <cfoiln$13re$1 digitaldaemon.com>, antiAlias says...

> I'm not doing anything specific for transcoding (yet) Jill; but will as
> soon as the appropriate knowledge is made available in the shape of some
> low-level libraries. If etc.unicode already has those, well, I'll get on
> the job pronto.

I'm afraid it doesn't have anything relevant to encoding or decoding, sorry - just character properties, like isWhitespace(dchar) and so on. Transcoding is a different issue, basically just a mapping to/from a sequence of bytes from/to a Unicode character, and the actual mapping will be different for each encoding. Latin-1 is easy, because the codepoints are identical to those of Unicode.

> As for architecture, this is how mango.io approaches it:

<snip> Cool.

> So, whenever the appropriate unicode converters are available, I (or someone
> else) can hook them up either at the Buffer layer, or at the Conduit-filter
> layer. If you'd be interested in doing that, I'd be very, very, grateful!

I think I follow that. But presumably, if people don't want it to be std-specific, then it shouldn't be mango-specific either. I can write a converter for Latin-1, once we're all happy with the architecture. (Actually, I think any of us could). But I certainly wouldn't be able to do (for example) SHIFT-JIS. I imagine once we have the architecture nailed down, lots of transcoder classes will get written (one for each encoding).

Jill
Aug 15 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cforp2$18vc$1 digitaldaemon.com...

> > So, whenever the appropriate unicode converters are available, I (or someone
> > else) can hook them up either at the Buffer layer, or at the Conduit-filter
> > layer. If you'd be interested in doing that, I'd be very, very, grateful!

Oops. Should have written "either at the Reader/Writer layer, or at the Conduit-filter layer" instead.

> I think I follow that. But presumably, if people don't want it to be
> std-specific, then it shouldn't be mango-specific either.

Yep; I think it's feasible to avoid all dependencies by limiting the API to arrays.
Aug 15 2004
On Sun, 15 Aug 2004 20:15:35 +0000, Arcane Jill wrote:
<snip>

I, for one, would prefer that the core functionality NOT be phobos-streams specific. IE, make a set of functions to do the transcoding, then use those to create the readers and writers. This way, it'll be easier to put the transcoding stuff into mango, which I prefer over std.streams.

John
Aug 15 2004
In article <pan.2004.08.15.21.19.34.123236 teqdruid.com>, teqDruid says...

> I, for one, would prefer that the core functionality NOT be phobos-streams
> specific.

Fair enough.

> IE, make a set of functions to do the transcoding, then use those to
> create the readers and writers. This way, it'll be easier to put the
> transcoding stuff into mango, which I prefer over std.streams.

Right, but this "set of functions" (or classes, which I'd prefer) would still have to have a common format, or you wouldn't be able to call them polymorphically at runtime. Would you have a problem if they just implemented (or relied upon) the InputStream and OutputStream interfaces which I only just learned about a few posts ago?

Jill
Aug 15 2004
Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that). I think it's perhaps best to make these kind of things completely independent of any other layer, if at all possible. These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly ...

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cforuu$191p$1 digitaldaemon.com...
<snip>
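For illustration, the first of these signatures could be built on std.utf along roughly these lines. This is a hypothetical sketch, not settled API: it assumes the inout-index form of std.utf.decode and simply stops at the first incomplete trailing sequence.

```d
import std.utf;

// Hypothetical sketch of the proposed signature, built on std.utf.
// Returns the number of input bytes consumed.
int utf8ToDChar (char[] input, dchar[] output)
{
    uint i = 0;     // input position, in bytes
    uint o = 0;     // output position, in dchars
    while (i < input.length && o < output.length)
        output[o++] = std.utf.decode(input, i);   // decode() advances i
    return i;
}
```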
Aug 15 2004
In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

> Might I suggest something along the following lines:
>
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);
>
> where both return the number of bytes converted (or something like that).
> I think it's perhaps best to make these kind of things completely
> independent of any other layer, if at all possible. These also happen to be
> the kind of functions that might be worth optimizing with a smattering of
> assembly ...

Ok, here's my shot at it: http://folk.uio.no/mortennk/encoding/ (released under LGPL)

I'm not a professional programmer, so please excuse bad programming style, naming conventions or other crimes against humanity. Like mentioned earlier, I use iconv() from libiconv, which can convert between a large set of encodings with little hassle. Only tested on Linux. I'll leave the Windows porting/testing to someone else. A Win32 port of libiconv can be found here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

Nick
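For reference, the C API that such a wrapper sits on is small. A minimal extern (C) binding sketch, following libiconv's documented signatures (declarations only; how Nick's own code declares them may differ):

```d
// Minimal binding sketch for the libiconv C API (documented signatures).
extern (C)
{
    alias void* iconv_t;

    // Open a conversion descriptor converting from 'fromcode' to 'tocode'.
    iconv_t iconv_open (char* tocode, char* fromcode);

    // Convert as much of *inbuf as possible into *outbuf; the pointers
    // and remaining byte counts are advanced in place.
    size_t iconv (iconv_t cd, char** inbuf, size_t* inbytesleft,
                  char** outbuf, size_t* outbytesleft);

    // Release the descriptor.
    int iconv_close (iconv_t cd);
}
```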
Aug 15 2004
In article <cfp7v5$1h84$1 digitaldaemon.com>, Nick says...

> Ok, here's my shot at it:

I think we should establish what we need, who needs what and why, etc., before committing any code to a public library. Although the transcoding issue is "urgent" in the sense that lots of people want it, I'd say it was more important to get it right, than to write it fast. There's nothing wrong with your code. I just think that it addresses a different problem than the ones faced by stream developers.

Jill
Aug 16 2004
That is ok. You raise some interesting points in your other post, and I might rewrite my code later based on what you said, if I have the time. My code is more a proof of concept, and the point was that encoding can be done easily through libiconv and you don't have to reinvent the wheel. The library already supports all the features you want, and rewriting my code for use with streams shouldn't be very hard.

Nick

In article <cfpvc5$2297$1 digitaldaemon.com>, Arcane Jill says...

> I think we should establish what we need, who needs what and why, etc.,
> before committing any code to a public library. Although the transcoding
> issue is "urgent" in the sense that lots of people want it, I'd say it was
> more important to get it right, than to write it fast. There's nothing wrong
> with your code. I just think that it addresses a different problem than the
> ones faced by stream developers.
>
> Jill
Aug 16 2004
On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:

> Might I suggest something along the following lines:
>
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);

That's what I was getting at... I don't know much about Unicode transcoding, but I don't see a reason for the core functionality to be any more complicated than that.

John
Aug 15 2004
In article <pan.2004.08.16.06.29.47.206851 teqdruid.com>, teqDruid says...

> On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:
>
> > Might I suggest something along the following lines:
> >
> > int utf8ToDChar (char[] input, dchar[] output);
> > int dCharToUtf8 (dchar[] input, char[] output);
>
> That's what I was getting at... I don't know much about Unicode
> transcoding, but I don't see a reason for the core functionality to be
> any more complicated than that.
>
> John

Suppose you want to decode a dchar from a stream, and then immediately read a ubyte from the same stream. The above functions won't let you do that. To decode a dchar from a stream you must first read /some/ bytes from that stream, in order to pass those bytes to the above function. But how many? One? Two? Four? In UTF-7, some Unicode characters require no less than /eight/ bytes. (One can invent or imagine encodings that require even more). If you've read too few bytes from the stream, your conversion function will throw an exception. If you've read too many, the stream's seek position will be incorrect for the next read.

You could argue that streams themselves could be rewritten to call functions like the above internally, but now you're adding complexity to something that doesn't need it. You said: "I don't see a reason for the core functionality to be any more complicated than that". But those functions are not "core" - they are constructable from yet lower level functionality. The lowest level of abstraction about which it makes sense to talk is "get one Unicode character from somewhere" and "write one Unicode character somewhere". The minute you start talking about /strings/ instead of merely /characters/, you've made an implementation assumption.

Anyway, it's not the function/class/interface/whatever that needs to be simple, it's the code which calls it. We make classes do complicated things so that callers don't have to.

Arcane Jill
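To make the "how many bytes?" problem concrete in UTF-8 terms: the sequence length is only knowable after inspecting the lead byte, so a string-in/string-out API forces the caller to guess how much to read from the stream beforehand. An illustrative sketch (not from the original post):

```d
// How many bytes does the next UTF-8 sequence occupy?
// Only the lead byte can tell you.
int utf8SequenceLength(ubyte lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx - plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return -1;                            // invalid lead byte
}
```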
Aug 16 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfqcrh$2cs2$1 digitaldaemon.com...

> In article <pan.2004.08.16.06.29.47.206851 teqdruid.com>, teqDruid says...
>
> To decode a dchar from a stream you must first read /some/ bytes from that
> stream, in order to pass those bytes to the above function. But how many?
> One? Two? Four? In UTF-7, some Unicode characters require no less than
> /eight/ bytes. (One can invent or imagine encodings that require even more).

Another verbose, yet useful representation is the character entities used in HTML: http://www.w3.org/TR/REC-html40/sgml/entities.html

Regards, Martin M. Pedersen
Aug 16 2004
In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

> Might I suggest something along the following lines:
>
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);
>
> where both return the number of bytes converted (or something like that).

That would be bad. I think it's possible you haven't understood the issues, so I'll try to explain in this post what some of them are, and why you would want to do certain things in certain ways.

> I think it's perhaps best to make these kind of things completely
> independent of any other layer, if at all possible.

I don't have any problem with that.

> These also happen to be the kind of functions that might be worth
> optimizing with a smattering of assembly ...

I disagree. Transcoding almost never happens in performance-critical code. It happens during input and output. A typical scenario is to get input from a console and then decode it, or to encode a string and then write it to a file. The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor. Of course it still makes sense to do this efficiently, but assembler - given that it's not portable, decreases maintainability, etc. - is probably going a bit too far.

Okay, back to these function signatures:

> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);

(1) The encoding is not necessarily known at compile time. This problem would also exist had you used classes/interfaces, of course, but at least with classes or interfaces instead of plain functions, you can rely on polymorphism and factory methods to do the dispatching, giving you a single point of decision. Functions like the above would lead to switch statements all over the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only with a single point of decision can you enforce the IANA encoding names, case conventions, etc..
I see that in "charset.d" you made the encoding name a runtime parameter - but that too is bad, partly because you don't have a single point of decision, but partly also because you're now having to make that runtime check with /every/ fragment of text - not merely at construction time.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

In fact, the minimal functionality that a decoder requires, is this: (next() could be called get(), or read(), or whatever). The minimal functionality upon which a decoder would rely, is this:

For comparison, look at the way Walter's format() function uses an underlying put() function to write a single character. He /could/ have used strings throughout, but he recognised (correctly) that the one-byte-at-a-time approach was conceptually at a lower level. Strings can then be handled /in terms of/ those lower-level functions.

With these two interfaces, you can put together the concept of a decoder. Thus: And a /specific/ decoder could derive from this, thus:

This could be implemented more efficiently, but I wrote it that way to illustrate the point that the decoder - not the caller - is the only entity capable of knowing the length of the byte sequence corresponding to the next (dchar) character.
So, NOW, if you want to plug this into a std.Stream, you could make one of these: And then simply make the magic decoder like so: And similarly for mango streams, InputStreams, strings, and so on. Strings are just not sufficiently low-level. We can rely on the compiler to inline these very simple functions. Encoding - the reverse process - would follow a similar pattern. You wouldn't need hasMore(), but something like done() or close() might be appropriate to indicate that you've finished. Arcane Jill
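The code blocks referred to above ("is this:", "Thus:", "like so:") can be given a concrete shape as follows. This is a hypothetical sketch of the design the post describes; all interface and class names here are assumptions, not the original code.

```d
import std.stream;

// What a decoder provides:
interface UnicodeDecoder
{
    dchar next();      // decode and return the next character
    bool hasMore();    // false once the underlying source is exhausted
}

// What a decoder relies upon:
interface ByteSource
{
    ubyte nextByte();
    bool hasMore();
}

// A specific decoder: Latin-1 bytes map directly to Unicode codepoints.
class Latin1Decoder : UnicodeDecoder
{
    private ByteSource source;

    this(ByteSource source) { this.source = source; }

    dchar next()   { return cast(dchar) source.nextByte(); }
    bool hasMore() { return source.hasMore(); }
}

// Adapting a std.stream.Stream to the ByteSource interface:
class StreamByteSource : ByteSource
{
    private Stream stream;

    this(Stream stream) { this.stream = stream; }

    ubyte nextByte()
    {
        ubyte b;
        stream.read(b);   // read a single ubyte from the stream
        return b;
    }

    bool hasMore() { return !stream.eof(); }
}
```

With that in place, the "magic decoder" for a stream would just be `UnicodeDecoder d = new Latin1Decoder(new StreamByteSource(myStream));` - and only the decoder ever needs to know how many bytes each character consumes.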
Aug 16 2004
On Mon, 16 Aug 2004 09:34:36 +0000, Arcane Jill wrote:
<snip>

Understood. This code looks reasonably agnostic, and even simple enough to use. The only difference is in thinking - streams vs strings. I might note, however, that you use:

dchar[] toUTF32(char[] s);

Which could also be written as:

int toUTF32(char[] s, out dchar[]);

Which looks very similar to:

int utf8ToDChar (char[] input, dchar[] output);

This is the function that I would define as implementing the "core" functionality. You then (to quote myself) "use those to create the readers and writers." The stream implementation is a bit more complex than I imagined, but I can blame that on a total lack of experience with variable-width character encodings. (And hey, I'm a first-year undergrad... what'dya expect?)

John
Aug 16 2004
In article <pan.2004.08.16.18.58.22.898270 teqdruid.com>, teqDruid says...

Understood. This code looks reasonably agnostic, and even simple enough to use. The only difference is in thinking - streams vs strings.

Yes. That's because a string can always be viewed as a stream, but a stream cannot always be viewed as a string.

I might note, however, that you use: dchar[] toUTF32(char[] s); which could also be written as: int toUTF32(char[] s, out dchar[]);

Actually I was just calling the function in std.utf. For any other encoding, I probably would have inlined the code right there, rather than written a function, but I figured, why re-invent stuff? std.utf.toUTF32() throws exceptions if the input is wrong, so it's just what you'd need in this circumstance. (The tests I made to determine the length didn't weed out illegal sequences - I was relying on std.utf to do that for me).

Which looks very similar to: int utf8ToDChar (char[] input, dchar[] output); This is the function that I would define as implementing the "core" functionality.

Fair enough. Guess it just depends what you call "core". The main thing is the dispatch mechanism.

The stream implementation is a bit more complex than I imagined, but I can chalk that up to a total lack of experience with variable-width character encodings.

There's more. Some encodings are not merely variable-width, but are also /stateful/. Consider UTF-7. A UTF-7 stream is always in one of two states: "ASCII" or "Radix 64". A '+' character in the stream changes the state to "Radix 64", and a '-' character changes the state back to "ASCII". A UTF-7 decoder needs to be aware at all times of the state of the stream. Incoming bytes are interpreted differently (as though they were two entirely different encodings) depending on the stream state. A function such as:

    int utf7ToDchar (char[] input, dchar[] output);

just wouldn't do the job, because it doesn't preserve/know the state of the stream.
You'd need a class, with a member variable to contain the current state of the stream (unless you wanted to use a global variable to store the state - yuk!). So, in general, basing your architecture on a set of functions with similar signatures just wouldn't be adequate to do the job.

Arcane Jill
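To make the statefulness concrete, such a class might carry its mode in a member variable, along the following lines - an illustrative skeleton only, not a complete implementation (real UTF-7 uses a modified base64 alphabet, "+-" encodes a literal '+', and surrogate pairs need handling):

```d
// Sketch of a stateful decoder: UTF-7 switches between an ASCII mode
// and a "Radix 64" mode, so the decoder must remember which mode the
// stream is in between calls - a plain function cannot do this.
class Utf7Decoder
{
    private enum Mode { Ascii, Radix64 }
    private Mode mode = Mode.Ascii;
    private uint bits;   // pending base64 bits carried across calls
    private int  nbits;

    // Feed one byte; returns true when a complete dchar was produced.
    bool decode(ubyte b, out dchar c)
    {
        if (mode == Mode.Ascii)
        {
            if (b == '+') { mode = Mode.Radix64; nbits = 0; return false; }
            c = cast(dchar) b;
            return true;
        }
        if (b == '-') { mode = Mode.Ascii; return false; }
        bits = (bits << 6) | base64Value(b);
        nbits += 6;
        if (nbits >= 16)   // enough bits for one UTF-16 code unit
        {
            nbits -= 16;
            c = cast(dchar)((bits >> nbits) & 0xFFFF); // surrogates omitted
            return true;
        }
        return false;
    }

    private static uint base64Value(ubyte b)
    {
        // illustrative; real UTF-7 uses a modified base64 alphabet
        if (b >= 'A' && b <= 'Z') return b - 'A';
        if (b >= 'a' && b <= 'z') return b - 'a' + 26;
        if (b >= '0' && b <= '9') return b - '0' + 52;
        return (b == '+') ? 62 : 63;
    }
}
```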
Aug 17 2004
Confusion abounds! I follow you Jill, but please don't underestimate the usefulness of D arrays. I'll try to explain as we go along ...

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfpv3c$2253$1 digitaldaemon.com...

In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

Might I suggest something along the following lines:

    int utf8ToDChar (char[] input, dchar[] output);
    int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that).

That would be bad. I think it's possible you haven't understood the issues, so I'll try to explain in this post what some of them are, and why you would want to do certain things in certain ways.

I think it's perhaps best to make these kind of things completely independent of any other layer, if at all possible.

I don't have any problem with that.

These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly

I disagree. Transcoding almost never happens in performance-critical code. It happens during input and output. A typical scenario is to get input from a console and then decode it, or to encode a string and then write it to a file. The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor. Of course it still makes sense to do this efficiently, but assembler - given that it's not portable, decreases maintainability, etc. - is probably going a bit too far.

What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently. The latter still matters, and perhaps always will. Still, it was just a suggestion.

Okay, back to these function signatures:

    int utf8ToDChar (char[] input, dchar[] output);
    int dCharToUtf8 (dchar[] input, char[] output);

(1) The encoding is not necessarily known at compile time. This problem would also exist had you used classes/interfaces, of course, but at least with classes or interfaces instead of plain functions, you can rely on polymorphism and factory methods to do the dispatching, giving you a single point of decision. Functions like the above would lead to switch statements all over the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only by a single point of decision can you enforce the IANA encoding names, case conventions, etc..

Agreed. I wouldn't presume to fashion a "complete" solution on /this/ NG <g>. Thus, encoding was deliberately omitted to clarify the means of getting data into and out of these converters. As far as encoding-names go, I would have expected such converters to be implemented as methods in a class; the constructor would be given the encoding identifier.

I see that in "charset.d" you made the encoding name a runtime parameter - but that too is bad, partly because you don't have a single point of decision, but partly also because you're now having to make that runtime check with /every/ fragment of text - not merely at construction time.

Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output.
Convert what you can within the constraints of input & output limits, and tell me the resultant outcome". If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted.

Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.

(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

Wholly agreed: pushback is a big "no no". But it's not an issue when using a pair of arrays in the suggested manner.

In fact, the minimal functionality that a decoder requires, is this: (next() could be called get(), or read(), or whatever). The minimal functionality upon which a decoder would rely, is this:

For comparison, look at the way Walter's format() function uses an underlying put() function to write a single character. He /could/ have used strings throughout, but he recognised (correctly) that the one-byte-at-a-time approach was conceptually at a lower level.
Strings can then be handled /in terms of/ those lower-level functions.

There are several valid ways to skin that particular cat <g>

<snip>

Here's a fuller implementation of the array approach (in pseudo-code):

    class Transcoder
    {
        this (char[] encoding) {...}

        dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
        {
            while (room_for_more_output)
                while (enough_input_for_another_dchar)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_dchars;
        }

        char[] toUtf8 (dchar[] input, char[] output, out int consumed)
        {
            while (room_for_more_output)
                while (enough_input_for_another_char)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_chars;
        }
    }

This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level.

A variation on the theme might initially provide a buffer to house the conversion output instead. There's pros and cons to both approaches.
In this case, you'd probably want to split the transcoding into separate encoding and decoding:

    class Decoder
    {
        private dchar[] unicode;

        this (char[] encoding, dchar[] output)
        {
            do_something_with_encoding;
            unicode = output;
        }

        this (char[] encoding, int outputSize)
        {
            this (encoding, new dchar[outputSize]);
        }

        dchar[] convert (char[] input, out int consumed)
        {
            while (room_for_more_output_in_output_buffer)
                while (enough_input_for_another_dchar)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_dchars;
        }
    }

    class Encoder
    {
        // similar approach to Decoder
    }

These are just suggestions, to take or leave at one's discretion.
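For concreteness, the array-in/array-out idea, fixed to report both counts (consumed input via an out parameter, produced output via the length of the returned slice), might be fleshed out like this - a hedged sketch for the UTF-8 case only; the function name is illustrative and malformed-input validation is omitted:

```d
// Sketch: UTF-8 -> UTF-32 over caller-supplied arrays. Converts as
// much as fits in either array, leaving any trailing partial sequence
// in the input for the next call.
dchar[] utf8ToUtf32(char[] input, dchar[] output, out int consumed)
{
    int i, o;   // indices into input and output
    while (o < output.length && i < input.length)
    {
        ubyte b = input[i];
        int len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        if (i + len > input.length)
            break;                  // incomplete trailing sequence: stop here
        uint c = (len == 1) ? b : (b & (0x3F >> (len - 1)));
        for (int k = 1; k < len; k++)
            c = (c << 6) | (input[i + k] & 0x3F);   // continuation bytes
        output[o++] = cast(dchar) c;
        i += len;
    }
    consumed = i;
    return output[0 .. o];   // slice covering only the converted dchars
}
```

Calling this with room for a single dchar in the output gives exactly the "convert just one" behaviour described above.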
Aug 16 2004
    class Transcoder
    {
        this (char[] encoding) {...}

        dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
        {
            while (room_for_more_output)
                while (enough_input_for_another_dchar)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_dchars;
        }
    }

Whoops! Those twin while loops should, of course, be a single while() with an && between the two conditions.
Aug 16 2004
In article <cfr0tf$2pgm$1 digitaldaemon.com>, antiAlias says...

These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly

The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor.

What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently.

I would say that here the time spent getting the web page from the server to client across the internet will outweigh the time spent encoding by many orders of magnitude. But I'm not /against/ efficiency. If people want to recode this stuff in assembler then obviously I'm not going to object.

Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.

I think I was confusing you with Nick. My bad.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output. Convert what you can within the constraints of input & output limits, and tell me the resultant outcome". If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted. Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.

Gotcha. Sorry - I misinterpreted the intent of the function signatures.

Wholly agreed: pushback is a big "no no". But it's not an issue when using a pair of arrays in the suggested manner.

Wholly agreed.

There are several valid ways to skin that particular cat <g> Here's a fuller implementation of the array approach (in pseudo-code) <snip> This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level. A variation on the theme might initially provide a buffer to house the conversion output instead. There's pros and cons to both approaches. In this case, you'd probably want to split the transcoding into separate encoding and decoding: <snip> These are just suggestions, to take or leave at one's discretion.

They are good suggestions. They have the benefit of efficiency without losing generality. They have the disadvantage of having a slightly confusing signature, but good documentation should solve that. Nice one.

Arcane Jill
Aug 17 2004
(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together, the text is stored with writeString, which puts a length byte followed by the text.
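The unget mechanism described can be sketched as a buffer carried alongside the stream, consulted before the underlying source - an illustrative skeleton; std.stream's actual implementation differs in detail:

```d
// Sketch of pushback via an unget buffer: next() drains pushed-back
// bytes (most recent first) before touching the underlying source.
abstract class PushbackSource
{
    private ubyte[] unget;

    // Subclasses supply the real byte source.
    protected abstract ubyte readByte();

    ubyte next()
    {
        if (unget.length)
        {
            ubyte b = unget[unget.length - 1];
            unget.length = unget.length - 1;
            return b;
        }
        return readByte();
    }

    // Push a byte back; next() will return it before reading further.
    void ungetc(ubyte b) { unget ~= b; }
}
```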
Aug 17 2004
In article <cfsu27$122d$1 digitaldaemon.com>, Ben Hinkle says...

std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together that the text are stored with writeString which puts a length byte followed by the text.

For the record, this is exactly what my mods to std.utf are for. In fact, unFormat and my stream mods already use them. Most stream routines allow for at least one byte to be put back. Obviously this isn't possible in all cases, but it *is* always possible to carry an unget buffer around with the stream, as std.stream already does. Only the formatted routines check this area for content, and I consider that correct behavior, as some translation may have been done between the stream and the buffer.

Sean
Aug 17 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfog97$12n2$1 digitaldaemon.com...

There have been loads and loads of discussions in recent weeks about Unicode, streams, and transcodings. There seems to be a general belief that "things are happening", but I'm not quite clear on the specifics - hence this post, which is basically a question.

What I am excited about is that D is becoming the premier language to do Unicode in, by a wide margin. And that's thanks to you guys!
Aug 15 2004
In article <cfog97$12n2$1 digitaldaemon.com>, Arcane Jill says...

I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

I'd like to have them. In fact my partial rewrite of stream.d already has this.

Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.

Hard to answer as I don't really know what will happen with Phobos in the long term, however... I would like to see unFormat/readf get into Phobos, though that may have to wait for TypeInfo to be working for pointers, since the current calling convention is still a bit inconsistent with writef (i.e. it requires a format string as the first argument, just like scanf). I could work around this with a big if/else block to get the underlying type of pointer arguments, but I'd prefer to just work off classinfo.name like Walter does for doFormat. Perhaps I'll just drop that into a separate function and replace it later when TypeInfo gets fixed.

As for my std.stream rewrite... I like it better than what's in std.stream now, but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea? Perhaps streams should be dropped from Phobos completely? I consider my version of stream.d to be more of a prototype than a full-featured replacement.

So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't been done yet, basically because we haven't agreed on an architecture, and I for one am not really sure who's doing it anyway.

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion.
Is this enough to start with?

Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders yet, or is it still up in the air?

I guess that depends on what still needs to be done.

But I do think we should nail down the architecture soon, as we're getting a lot of questions and discussion on this. But one thing at a time. Someone tell me where streams are going (with regard to above questions) and then I'll have more suggestions.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode. I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure.

Sean
Aug 16 2004
In article <cfrc06$31g8$1 digitaldaemon.com>, Sean Kelly says...

I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

I'd like to have them. In fact my partial rewrite of stream.d already has this.

Apparently, so does Phobos, although I didn't know that at the time I posted the question. Now isn't that cute - an interface with an undocumented interface!

Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.

Hard to answer as I don't really know what will happen with Phobos in the long term, however...

Okay - I just wasn't sure if you were working for Walter in some capacity. Forgive the dumb question.

As for my std.stream rewrite... I like it better than what's in std.stream now

It's hard to know what's in std.stream now without reading the source. I /really/ wish someone would document it.

but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea?

Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).

Perhaps streams should be dropped from Phobos completely?

Perhaps, but I find it unlikely that that will happen. Only Walter is empowered to do that.

I consider my version of stream.d to be more of a prototype than a full-featured replacement.

Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"?
If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion. Is this enough to start with?

It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.

Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.

I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure. Sean

Well, thanks. I think I've got a picture now of what's going on. I'll post a summary shortly, then we can start calling for volunteers for the missing bits.

Jill
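A character-aware formatter would measure field widths in dchars rather than bytes. A minimal sketch of the idea, using std.utf.toUTF32 to count characters (the function name padLeft is illustrative):

```d
import std.utf;

// Pad on the left to a width measured in characters, not bytes:
// a 5-character string that occupies 7 UTF-8 bytes still gets padding
// when the requested width is 6.
char[] padLeft(char[] s, int width)
{
    int chars = cast(int) toUTF32(s).length;   // count characters
    while (chars < width)
    {
        s = " " ~ s;
        chars++;
    }
    return s;
}
```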
Aug 17 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfscq6$s8u$1 digitaldaemon.com...

std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).

Please allow me to clarify: from recollection, the position has always been that both mango.io and std.streams should be excluded from Phobos. I think that was Matthew's position also. If mango.io turns out to be a better solution, then it can certainly move from its current home if that's what people want; but I'm not holding my breath waiting for a consensus on that one <g>

BTW; mango.io is completely independent from the rest of the Mango Tree (has no dependencies), so it can be easily cut away. In fact, it's almost totally independent of Phobos too ...

Right -- that thread regarding placing the .lib dependencies inside the D source-code might help out with this (to the extent that it can).
Aug 17 2004
Arcane Jill wrote:

but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea?

Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).

Or, since it is open source, you can just compile it in a la std.* and not have a library dependency

Scott
Aug 17 2004
In article <cfscq6$s8u$1 digitaldaemon.com>, Arcane Jill says...

In article <cfrc06$31g8$1 digitaldaemon.com>, Sean Kelly says...

I consider my version of stream.d to be more of a prototype than a full-featured replacement.

Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"? If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?

It's only a prototype in the sense that I haven't really finished it yet. There are some notable functions missing (like ignore), etc. If people are interested then I'll flesh it out a bit. I don't have a ton of free time so I figured I'd see what the response was before I worked any more on it.

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion. Is this enough to start with?

It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.

Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.

unFormat operates entirely in terms of dchars. So the width modifiers are in terms of UTF-32 characters, etc. But I agree. If doFormat doesn't work this way then it probably should. The results are unpredictable otherwise.

Sean
Aug 17 2004