digitalmars.D - Transcoding - who's doing what?
- Arcane Jill (90/90) Aug 15 2004 There have been loads and loads of discussions in recent weeks about Uni...
- Ben Hinkle (3/5) Aug 15 2004 std.stream.InputStream and OutputStream interfaces already exist (since
- Arcane Jill (10/17) Aug 15 2004 Ah. That would be why I didn't know it. I've only read the HTML, not the...
- antiAlias (85/175) Aug 15 2004 I'm not doing anything specific for transcoding (yet) Jill; but will as ...
- Arcane Jill (15/23) Aug 15 2004 I'm afraid it doesn't have anything relevant to encoding or decoding, so...
- antiAlias (8/13) Aug 15 2004 someone
- teqDruid (6/41) Aug 15 2004 I, for one, would prefer that the core functionality NOT be phobos-strea...
- Arcane Jill (9/14) Aug 15 2004 Right, but this "set of functions" (or classes, which I'd prefer) would ...
- antiAlias (13/27) Aug 15 2004 Might I suggest something along the following lines:
- Nick (11/18) Aug 15 2004 Ok, here's my shot at it:
- Arcane Jill (8/9) Aug 16 2004 I think we should establish what we need, who needs what and why, etc., ...
- Nick (8/15) Aug 16 2004 That is ok. You raise some interesting points in your other post, and I ...
- teqDruid (5/9) Aug 15 2004 That's what I was getting at... I don't know much about Unicode
- Arcane Jill (24/33) Aug 16 2004 Suppose you want to decode a dchar from a stream, and then immediately r...
- Martin M. Pedersen (9/14) Aug 16 2004 One?
- Arcane Jill (105/115) Aug 16 2004 That would be bad. I think it's possible you haven't understood the issu...
- teqDruid (17/119) Aug 16 2004 Understood. This code looks reasonably agnostic, and even simple enough
- Arcane Jill (25/40) Aug 17 2004 Yes. That's because a string can always be viewed as a stream, but a str...
- antiAlias (120/185) Aug 16 2004 Confusion abounds! I follow you Jill, but please don't underestimate the
- antiAlias (2/14) Aug 16 2004 Whoops! Those twin while loops should, of course, be a single while() wi...
- Arcane Jill (14/63) Aug 17 2004 I would say that here the time spent getting the web page from the serve...
- Ben Hinkle (5/14) Aug 17 2004 std.stream supports ungetc, which pushes a character back by maintaining...
- Sean Kelly (9/23) Aug 17 2004 For the record, this is exactly what my mods to std.utf are for. In fac...
- Walter (7/11) Aug 15 2004 Unicode,
- Sean Kelly (27/41) Aug 16 2004 Hard to answer as I don't really know what will happen with Phobos in th...
- Arcane Jill (28/54) Aug 17 2004 Apparently, so does Phobos, although I didn't know that at the time I po...
- antiAlias (17/23) Aug 17 2004 that
- stonecobra (4/18) Aug 17 2004 Or, since it is open source, you can just compile it in ala std.* and
- Sean Kelly (9/29) Aug 17 2004 It's only a prototype in the sense that I haven't really finished it yet...
There have been loads and loads of discussions in recent weeks about Unicode, streams, and transcodings. There seems to be a general belief that "things are happening", but I'm not quite clear on the specifics - hence this post, which is basically a question. To clarify my own plans on the Unicode front, the purpose of the etc.unicode library is to implement all of the algorithms defined by the Unicode standard on the Unicode website. ("All" is quite ambitious, actually, and it will take a long time to achieve that, but obviously the core ones will come first, and most of the property-getting functions are already there). But I'm /not/ planning on writing any transcoding functions, simply because they're not part of the Unicode standard. Transcoding, in fact, is all about converting /to/ Unicode from something else (and vice versa). Transcoding functions are easy to write - for most encodings a simple 256-entry lookup table will suffice, at least in one direction. But transcoding in strings is not necessarily the best architecture, and it would probably be better to do it at a lower level, using streams (aka filters/readers/writers) - basically just classes which implement a read() function and/or a write() function. I don't know who, if anyone, is currently working on this. In post http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/5925, Hauke said: "I'm currently working on ... a string interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...).", but it's possible I may have read too much into that. I also know that Sean is doing some stream stuff, and that in post http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/8236, he said 'Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." 
For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.' and 'I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc.' I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what? Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too. So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't been done yet, basically because we haven't agreed on an architecture, and I for one am not really sure who's doing it anyway. Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders yet, or is it still up in the air? And, (2), if the answer to (1) is no, I'd like to suggest that a couple of simple classes be written which, I believe, will slot nicely into whatever architecture we eventually come up with. This is what I suspect will do the job. Two classes: Now these will probably need some adapting to fit into our final architecture. (Should they derive from Stream? Or from some yet-to-be-defined transcoding Reader/Writer base classes? Should they implement some interface? Should they be merged into a single class? etc.) BUT - they won't need /much/ adaptation, and once we've got Latin-1 working, we'll have an example on which to model all the others. So feel free to take the above code and adapt it as necessary. But I do think we should nail down the architecture soon, as we're getting a lot of questions and discussion on this. But one thing at a time. Someone tell me where streams are going (with regard to above questions) and then I'll have more suggestions. Arcane Jill
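For concreteness, a minimal sketch of what such a Latin-1 pair might look like (a hypothetical illustration only - the class and method names are assumptions, not settled API):

```d
// Hypothetical sketch of a Latin-1 transcoder pair.
// Latin-1 is the easy case: its 256 codepoints coincide with U+0000..U+00FF.
class Latin1Decoder
{
    // Decode one Latin-1 byte into a Unicode character.
    dchar decode(ubyte b)
    {
        return cast(dchar) b;   // identity mapping
    }
}

class Latin1Encoder
{
    // Encode one Unicode character as a Latin-1 byte.
    ubyte encode(dchar c)
    {
        if (c > 0xFF)
            throw new Error("character not representable in Latin-1");
        return cast(ubyte) c;
    }
}
```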
Aug 15 2004
> I also need a bit of educating on the future of D's streams. Are we going
> to get separate InputStream and OutputStream interfaces, or what?

std.stream.InputStream and OutputStream interfaces already exist (since 0.89). All the "new" stuff in std.stream isn't in the phobos.html doc. Are you thinking of a different InputStream and OutputStream?
Aug 15 2004
In article <cfogis$12on$1 digitaldaemon.com>, Ben Hinkle says...

> > I also need a bit of educating on the future of D's streams. Are we going
> > to get separate InputStream and OutputStream interfaces, or what?
>
> std.stream.InputStream and OutputStream interfaces already exist (since 0.89).

I didn't know that. Thanks.

> All the "new" stuff in std.stream isn't in the phobos.html doc.

Ah. That would be why I didn't know it. I've only read the HTML, not the D source. I know a lot of folk have suggested that I should read the source, but I guess it's an ideological thing - using the specifics of the source smacks of relying on undocumented features to me, something not guaranteed to work in future incarnations. How hard would it be to update the documentation?

> Are you thinking of a different InputStream and OutputStream?

I wasn't thinking of anything. I just didn't know there was such a beast. Thanks for educating me.

Jill
Aug 15 2004
I'm not doing anything specific for transcoding (yet) Jill; but will as soon as the appropriate knowledge is made available in the shape of some low-level libraries. If etc.unicode already has those, well, I'll get on the job pronto.

As for architecture, this is how mango.io approaches it:

One might consider mango.io to have three separate, but related and bindable, entities. These are Conduit, Buffer, and Reader/Writer.

Conduits represent things like files, sockets, and other 'physical', block oriented devices. You can talk to a Conduit directly (via read/write methods) with an instance of a Buffer.

The next stage up in the pecking order is the Buffer, which acts as a bi-directional queue for Conduit data (or can be used independently, like OutBuffer, for that matter). You can read and write to a buffer using void[], or map it directly to a local array if desired. Buffers are intended as an abstraction over the more physical Conduit. You can use a common Buffer for both read and write purposes, or you can have a separate instance for each.

On top of the Buffer, one can map either a set of Tokenizers (for scanf-like processing), or a set of Readers/Writers. The latter convert between representations: usually programmer-idioms to Conduit-idioms and back again. For example, a Reader might convert Buffer content into ints, longs, char[] arrays and so on. Writer does the opposite. You can make a Reader/Writer pair do whatever you wish in terms of conversion: a classic example is endian conversion, but others might include various other transcoding tasks, including Unicode. In addition, you can map multiple Readers/Writers onto a common Buffer, and they will all behave sequentially as one might imagine. The latter is handy for when you need to see what the content is before reading it in some other manner (think HTTP headers, followed by content that's been zip-compressed).
You might think of the Reader/Writer layer as "piecemeal" IO: they usually work with small amounts of data at a time.

Finally, the Conduit actually has an optional filter "intercept" layer: you can build a filter to modify either the input or output in void[] style. That is, an output filter is given a void[], and does whatever it wants with it (usually calls the next filter in the chain, which will ultimately cause the modulated content to be written somewhere).

This sounds somewhat complex, but the APIs make it really easy (certainly as easy as phobos.io) to get things hooked up. For example, when reading a file you typically do the following:

FileConduit fc = new FileConduit ("file.name");
Reader r = new Reader (fc);
r.get(x).get(y).get(z);    (or r >> x >> y >> z;)

etc. So, whenever the appropriate unicode converters are available, I (or someone else) can hook them up either at the Buffer layer, or at the Conduit-filter layer. If you'd be interested in doing that, I'd be very, very, grateful!

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfog97$12n2$1 digitaldaemon.com...
<snip>
Aug 15 2004
In article <cfoiln$13re$1 digitaldaemon.com>, antiAlias says...

> I'm not doing anything specific for transcoding (yet) Jill; but will as
> soon as the appropriate knowledge is made available in the shape of some
> low-level libraries. If etc.unicode already has those, well, I'll get on
> the job pronto.

I'm afraid it doesn't have anything relevant to encoding or decoding, sorry - just character properties, like isWhitespace(dchar) and so on. Transcoding is a different issue, basically just a mapping to/from a sequence of bytes from/to a Unicode character, and the actual mapping will be different for each encoding. Latin-1 is easy, because the codepoints are identical to those of Unicode.

> As for architecture, this is how mango.io approaches it:

<snip> Cool.

> So, whenever the appropriate unicode converters are available, I (or someone
> else) can hook them up either at the Buffer layer, or at the Conduit-filter
> layer. If you'd be interested in doing that, I'd be very, very, grateful!

I think I follow that. But presumably, if people don't want it to be std-specific, then it shouldn't be mango-specific either. I can write a converter for Latin-1, once we're all happy with the architecture. (Actually, I think any of us could). But I certainly wouldn't be able to do (for example) SHIFT-JIS. I imagine once we have the architecture nailed down, lots of transcoder classes will get written (one for each encoding).

Jill
Aug 15 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cforp2$18vc$1 digitaldaemon.com...

> > So, whenever the appropriate unicode converters are available, I (or someone
> > else) can hook them up either at the Buffer layer, or at the Conduit-filter
> > layer. If you'd be interested in doing that, I'd be very, very, grateful!

Oops. Should have written "either at the Reader/Writer layer, or at the Conduit-filter layer" instead.

> I think I follow that. But presumably, if people don't want it to be
> std-specific, then it shouldn't be mango-specific either.

Yep; I think it's feasible to avoid all dependencies by limiting the API to arrays.
Aug 15 2004
On Sun, 15 Aug 2004 20:15:35 +0000, Arcane Jill wrote:
<snip>

I, for one, would prefer that the core functionality NOT be phobos-streams specific. IE, make a set of functions to do the transcoding, then use those to create the readers and writers. This way, it'll be easier to put the transcoding stuff into mango, which I prefer over std.streams.

John
Aug 15 2004
In article <pan.2004.08.15.21.19.34.123236 teqdruid.com>, teqDruid says...

> I, for one, would prefer that the core functionality NOT be phobos-streams
> specific.

Fair enough.

> IE, make a set of functions to do the transcoding, then use those to
> create the readers and writers. This way, it'll be easier to put the
> transcoding stuff into mango, which I prefer over std.streams.

Right, but this "set of functions" (or classes, which I'd prefer) would still have to have a common format, or you wouldn't be able to call them polymorphically at runtime. Would you have a problem if they just implemented (or relied upon) the InputStream and OutputStream interfaces which I only just learned about a few posts ago?

Jill
Aug 15 2004
Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that). I think it's perhaps best to make these kind of things completely independent of any other layer, if at all possible. These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly ...

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cforuu$191p$1 digitaldaemon.com...
<snip>
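For illustration, the first of these signatures could be built on std.utf along roughly these lines. This is a hypothetical sketch, not settled API: it assumes the inout-index form of std.utf.decode and simply stops at the first incomplete trailing sequence.

```d
import std.utf;

// Hypothetical sketch of the proposed signature, built on std.utf.
// Returns the number of input bytes consumed.
int utf8ToDChar (char[] input, dchar[] output)
{
    uint i = 0;     // input position, in bytes
    uint o = 0;     // output position, in dchars
    while (i < input.length && o < output.length)
        output[o++] = std.utf.decode(input, i);   // decode() advances i
    return i;
}
```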
Aug 15 2004
In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

> Might I suggest something along the following lines:
>
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);
>
> where both return the number of bytes converted (or something like that).
> I think it's perhaps best to make these kind of things completely
> independent of any other layer, if at all possible. These also happen to be
> the kind of functions that might be worth optimizing with a smattering of
> assembly ...

Ok, here's my shot at it: http://folk.uio.no/mortennk/encoding/ (released under LGPL)

I'm not a professional programmer, so please excuse bad programming style, naming conventions or other crimes against humanity. Like mentioned earlier, I use iconv() from libiconv, which can convert between a large set of encodings with little hassle. Only tested on Linux. I'll leave the Windows porting/testing to someone else. A Win32 port of libiconv can be found here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

Nick
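For reference, the C API that such a wrapper sits on is small. A minimal extern (C) binding sketch, following libiconv's documented signatures (declarations only; how Nick's own code declares them may differ):

```d
// Minimal binding sketch for the libiconv C API (documented signatures).
extern (C)
{
    alias void* iconv_t;

    // Open a conversion descriptor converting from 'fromcode' to 'tocode'.
    iconv_t iconv_open (char* tocode, char* fromcode);

    // Convert as much of *inbuf as possible into *outbuf; the pointers
    // and remaining byte counts are advanced in place.
    size_t iconv (iconv_t cd, char** inbuf, size_t* inbytesleft,
                  char** outbuf, size_t* outbytesleft);

    // Release the descriptor.
    int iconv_close (iconv_t cd);
}
```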
Aug 15 2004
In article <cfp7v5$1h84$1 digitaldaemon.com>, Nick says...

> Ok, here's my shot at it:

I think we should establish what we need, who needs what and why, etc., before committing any code to a public library. Although the transcoding issue is "urgent" in the sense that lots of people want it, I'd say it was more important to get it right, than to write it fast. There's nothing wrong with your code. I just think that it addresses a different problem than the ones faced by stream developers.

Jill
Aug 16 2004
That is ok. You raise some interesting points in your other post, and I might rewrite my code later based on what you said, if I have the time. My code is more a proof of concept, and the point was that encoding can be done easily through libiconv and you don't have to reinvent the wheel. The library already supports all the features you want, and rewriting my code for use with streams shouldn't be very hard.

Nick

In article <cfpvc5$2297$1 digitaldaemon.com>, Arcane Jill says...

> I think we should establish what we need, who needs what and why, etc.,
> before committing any code to a public library. Although the transcoding
> issue is "urgent" in the sense that lots of people want it, I'd say it was
> more important to get it right, than to write it fast. There's nothing wrong
> with your code. I just think that it addresses a different problem than the
> ones faced by stream developers.
>
> Jill
Aug 16 2004
On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:

> Might I suggest something along the following lines:
>
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);

That's what I was getting at... I don't know much about Unicode transcoding, but I don't see a reason for the core functionality to be any more complicated than that.

John
Aug 15 2004
In article <pan.2004.08.16.06.29.47.206851 teqdruid.com>, teqDruid says...

> On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:
>
> > Might I suggest something along the following lines:
> >
> > int utf8ToDChar (char[] input, dchar[] output);
> > int dCharToUtf8 (dchar[] input, char[] output);
>
> That's what I was getting at... I don't know much about Unicode
> transcoding, but I don't see a reason for the core functionality to be
> any more complicated than that.
>
> John

Suppose you want to decode a dchar from a stream, and then immediately read a ubyte from the same stream. The above functions won't let you do that. To decode a dchar from a stream you must first read /some/ bytes from that stream, in order to pass those bytes to the above function. But how many? One? Two? Four? In UTF-7, some Unicode characters require no less than /eight/ bytes. (One can invent or imagine encodings that require even more). If you've read too few bytes from the stream, your conversion function will throw an exception. If you've read too many, the stream's seek position will be incorrect for the next read.

You could argue that streams themselves could be rewritten to call functions like the above internally, but now you're adding complexity to something that doesn't need it. You said: "I don't see a reason for the core functionality to be any more complicated than that". But those functions are not "core" - they are constructable from yet lower level functionality. The lowest level of abstraction about which it makes sense to talk is "get one Unicode character from somewhere" and "write one Unicode character somewhere". The minute you start talking about /strings/ instead of merely /characters/, you've made an implementation assumption.

Anyway, it's not the function/class/interface/whatever that needs to be simple, it's the code which calls it. We make classes do complicated things so that callers don't have to.

Arcane Jill
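To make the "how many bytes?" problem concrete in UTF-8 terms: the sequence length is only knowable after inspecting the lead byte, so a string-in/string-out API forces the caller to guess how much to read from the stream beforehand. An illustrative sketch (not from the original post):

```d
// How many bytes does the next UTF-8 sequence occupy?
// Only the lead byte can tell you.
int utf8SequenceLength(ubyte lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx - plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return -1;                            // invalid lead byte
}
```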
Aug 16 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfqcrh$2cs2$1 digitaldaemon.com...

> In article <pan.2004.08.16.06.29.47.206851 teqdruid.com>, teqDruid says...
>
> To decode a dchar from a stream you must first read /some/ bytes from that
> stream, in order to pass those bytes to the above function. But how many?
> One? Two? Four? In UTF-7, some Unicode characters require no less than
> /eight/ bytes. (One can invent or imagine encodings that require even more).

Another verbose, yet useful representation is the character entities used in HTML: http://www.w3.org/TR/REC-html40/sgml/entities.html

Regards, Martin M. Pedersen
Aug 16 2004
In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

> Might I suggest something along the following lines:
>
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);
>
> where both return the number of bytes converted (or something like that).

That would be bad. I think it's possible you haven't understood the issues, so I'll try to explain in this post what some of them are, and why you would want to do certain things in certain ways.

> I think it's perhaps best to make these kind of things completely
> independent of any other layer, if at all possible.

I don't have any problem with that.

> These also happen to be the kind of functions that might be worth
> optimizing with a smattering of assembly ...

I disagree. Transcoding almost never happens in performance-critical code. It happens during input and output. A typical scenario is to get input from a console and then decode it, or to encode a string and then write it to a file. The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor. Of course it still makes sense to do this efficiently, but assembler - given that it's not portable, decreases maintainability, etc. - is probably going a bit too far.

Okay, back to these function signatures:

> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);

(1) The encoding is not necessarily known at compile time. This problem would also exist had you used classes/interfaces, of course, but at least with classes or interfaces instead of plain functions, you can rely on polymorphism and factory methods to do the dispatching, giving you a single point of decision. Functions like the above would lead to switch statements all over the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only with a single point of decision can you enforce the IANA encoding names, case conventions, etc..
I see that in "charset.d" you made the encoding name a runtime parameter - but that too is bad, partly because you don't have a single point of decision, but partly also because you're now having to make that runtime check with /every/ fragment of text - not merely at construction time.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

In fact, the minimal functionality that a decoder requires, is this: (next() could be called get(), or read(), or whatever). The minimal functionality upon which a decoder would rely, is this:

For comparison, look at the way Walter's format() function uses an underlying put() function to write a single character. He /could/ have used strings throughout, but he recognised (correctly) that the one-byte-at-a-time approach was conceptually at a lower level. Strings can then be handled /in terms of/ those lower-level functions.

With these two interfaces, you can put together the concept of a decoder. Thus: And a /specific/ decoder could derive from this, thus:

This could be implemented more efficiently, but I wrote it that way to illustrate the point that the decoder - not the caller - is the only entity capable of knowing the length of the byte sequence corresponding to the next (dchar) character.
So, NOW, if you want to plug this into a std.Stream, you could make one of these: And then simply make the magic decoder like so: And similarly for mango streams, InputStreams, strings, and so on. Strings are just not sufficiently low-level. We can rely on the compiler to inline these very simple functions. Encoding - the reverse process - would follow a similar pattern. You wouldn't need hasMore(), but something like done() or close() might be appropriate to indicate that you've finished. Arcane Jill
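The code blocks referred to above ("is this:", "Thus:", "like so:") can be given a concrete shape as follows. This is a hypothetical sketch of the design the post describes; all interface and class names here are assumptions, not the original code.

```d
import std.stream;

// What a decoder provides:
interface UnicodeDecoder
{
    dchar next();      // decode and return the next character
    bool hasMore();    // false once the underlying source is exhausted
}

// What a decoder relies upon:
interface ByteSource
{
    ubyte nextByte();
    bool hasMore();
}

// A specific decoder: Latin-1 bytes map directly to Unicode codepoints.
class Latin1Decoder : UnicodeDecoder
{
    private ByteSource source;

    this(ByteSource source) { this.source = source; }

    dchar next()   { return cast(dchar) source.nextByte(); }
    bool hasMore() { return source.hasMore(); }
}

// Adapting a std.stream.Stream to the ByteSource interface:
class StreamByteSource : ByteSource
{
    private Stream stream;

    this(Stream stream) { this.stream = stream; }

    ubyte nextByte()
    {
        ubyte b;
        stream.read(b);   // read a single ubyte from the stream
        return b;
    }

    bool hasMore() { return !stream.eof(); }
}
```

With that in place, the "magic decoder" for a stream would just be `UnicodeDecoder d = new Latin1Decoder(new StreamByteSource(myStream));` - and only the decoder ever needs to know how many bytes each character consumes.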
Aug 16 2004
On Mon, 16 Aug 2004 09:34:36 +0000, Arcane Jill wrote:
<snip>

Understood. This code looks reasonably agnostic, and even simple enough to use. The only difference is in thinking - streams vs strings. I might note, however, that you use:

dchar[] toUTF32(char[] s);

Which could also be written as:

int toUTF32(char[] s, out dchar[]);

Which looks very similar to:

int utf8ToDChar (char[] input, dchar[] output);

This is the function that I would define as implementing the "core" functionality. You then (to quote myself) "use those to create the readers and writers." The stream implementation is a bit more complex than I imagined, but I can blame that on a total lack of experience with variable-width character encodings. (And hey, I'm a first-year undergrad... what'dya expect?)

John
Aug 16 2004
In article <pan.2004.08.16.18.58.22.898270 teqdruid.com>, teqDruid says...

Understood. This code looks reasonably agnostic, and even simple enough to use. The only difference is in thinking - streams vs strings.

Yes. That's because a string can always be viewed as a stream, but a stream cannot always be viewed as a string.

I might note, however, that you use: dchar[] toUTF32(char[] s); which could also be written as: int toUTF32(char[] s, out dchar[]);

Actually I was just calling the function in std.utf. For any other encoding, I probably would have inlined the code right there, rather than written a function, but I figured, why re-invent stuff? std.utf.toUTF32() throws exceptions if the input is wrong, so it's just what you'd need in this circumstance. (The tests I made to determine the length didn't weed out illegal sequences - I was relying on std.utf to do that for me).

Which looks very similar to: int utf8ToDChar (char[] input, dchar[] output); This is the function that I would define as implementing the "core" functionality.

Fair enough. Guess it just depends what you call "core". The main thing is the dispatch mechanism.

The stream implementation is a bit more complex than I imagined, but I can chalk that up to a total lack of experience with variable-width character encodings.

There's more. Some encodings are not merely variable-width, but are also /stateful/. Consider UTF-7. A UTF-7 stream is always in one of two states: "ASCII" or "Radix 64". A '+' character in the stream changes the state to "Radix 64", and a '-' character changes the state back to "ASCII". A UTF-7 decoder needs to be aware at all times of the state of the stream. Incoming bytes are interpreted differently (as though they were two entirely different encodings) depending on the stream state. A function such as:

    int utf7ToDchar (char[] input, dchar[] output);

just wouldn't do the job, because it doesn't preserve/know the state of the stream.
You'd need a class, with a member variable to contain the current state of the stream (unless you wanted to use a global variable to store the state - yuk!). So, in general, basing your architecture on a set of functions with similar signatures just wouldn't be adequate to do the job.

Arcane Jill
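To make the statefulness concrete, such a class might carry its mode in a member variable, along the following lines - an illustrative skeleton only, not a complete implementation (real UTF-7 uses a modified base64 alphabet, "+-" encodes a literal '+', and surrogate pairs need handling):

```d
// Sketch of a stateful decoder: UTF-7 switches between an ASCII mode
// and a "Radix 64" mode, so the decoder must remember which mode the
// stream is in between calls - a plain function cannot do this.
class Utf7Decoder
{
    private enum Mode { Ascii, Radix64 }
    private Mode mode = Mode.Ascii;
    private uint bits;   // pending base64 bits carried across calls
    private int  nbits;

    // Feed one byte; returns true when a complete dchar was produced.
    bool decode(ubyte b, out dchar c)
    {
        if (mode == Mode.Ascii)
        {
            if (b == '+') { mode = Mode.Radix64; nbits = 0; return false; }
            c = cast(dchar) b;
            return true;
        }
        if (b == '-') { mode = Mode.Ascii; return false; }
        bits = (bits << 6) | base64Value(b);
        nbits += 6;
        if (nbits >= 16)   // enough bits for one UTF-16 code unit
        {
            nbits -= 16;
            c = cast(dchar)((bits >> nbits) & 0xFFFF); // surrogates omitted
            return true;
        }
        return false;
    }

    private static uint base64Value(ubyte b)
    {
        // illustrative; real UTF-7 uses a modified base64 alphabet
        if (b >= 'A' && b <= 'Z') return b - 'A';
        if (b >= 'a' && b <= 'z') return b - 'a' + 26;
        if (b >= '0' && b <= '9') return b - '0' + 52;
        return (b == '+') ? 62 : 63;
    }
}
```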
Aug 17 2004
Confusion abounds! I follow you Jill, but please don't underestimate the usefulness of D arrays. I'll try to explain as we go along ...

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfpv3c$2253$1 digitaldaemon.com...

In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

Might I suggest something along the following lines:

    int utf8ToDChar (char[] input, dchar[] output);
    int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that).

That would be bad. I think it's possible you haven't understood the issues, so I'll try to explain in this post what some of them are, and why you would want to do certain things in certain ways.

I think it's perhaps best to make these kind of things completely independent of any other layer, if at all possible.

I don't have any problem with that.

These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly

I disagree. Transcoding almost never happens in performance-critical code. It happens during input and output. A typical scenario is to get input from a console and then decode it, or to encode a string and then write it to a file. The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor. Of course it still makes sense to do this efficiently, but assembler - given that it's not portable, decreases maintainability, etc. - is probably going a bit too far.

What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently. The latter still matters, and perhaps always will. Still, it was just a suggestion.

Okay, back to these function signatures:

    int utf8ToDChar (char[] input, dchar[] output);
    int dCharToUtf8 (dchar[] input, char[] output);

(1) The encoding is not necessarily known at compile time. This problem would also exist had you used classes/interfaces, of course, but at least with classes or interfaces instead of plain functions, you can rely on polymorphism and factory methods to do the dispatching, giving you a single point of decision. Functions like the above would lead to switch statements all over the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only by a single point of decision can you enforce the IANA encoding names, case conventions, etc..

Agreed. I wouldn't presume to fashion a "complete" solution on /this/ NG <g>. Thus, encoding was deliberately omitted to clarify the means of getting data into and out of these converters. As far as encoding-names go, I would have expected such converters to be implemented as methods in a class; the constructor would be given the encoding identifier.

I see that in "charset.d" you made the encoding name a runtime parameter - but that too is bad, partly because you don't have a single point of decision, but partly also because you're now having to make that runtime check with /every/ fragment of text - not merely at construction time.

Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output.
Convert what you can within the constraints of input & output limits, and tell me the resultant outcome". If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted.

Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.

(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

Wholly agreed: pushback is a big "no no". But it's not an issue when using a pair of arrays in the suggested manner.

In fact, the minimal functionality that a decoder requires, is this: (next() could be called get(), or read(), or whatever). The minimal functionality upon which a decoder would rely, is this:

For comparison, look at the way Walter's format() function uses an underlying put() function to write a single character. He /could/ have used strings throughout, but he recognised (correctly) that the one-byte-at-a-time approach was conceptually at a lower level.
Strings can then be handled /in terms of/ those lower-level functions.

There are several valid ways to skin that particular cat <g>

<snip>

Here's a fuller implementation of the array approach (in pseudo-code):

    class Transcoder
    {
        this (char[] encoding) {...}

        dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
        {
            while (room_for_more_output)
                while (enough_input_for_another_dchar)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_dchars;
        }

        char[] toUtf8 (dchar[] input, char[] output, out int consumed)
        {
            while (room_for_more_output)
                while (enough_input_for_another_char)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_chars;
        }
    }

This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level.

A variation on the theme might initially provide a buffer to house the conversion output instead. There's pros and cons to both approaches.
In this case, you'd probably want to split the transcoding into separate encoding and decoding:

    class Decoder
    {
        private dchar[] unicode;

        this (char[] encoding, dchar[] output)
        {
            do_something_with_encoding;
            unicode = output;
        }

        this (char[] encoding, int outputSize)
        {
            this (encoding, new dchar[outputSize]);
        }

        dchar[] convert (char[] input, out int consumed)
        {
            while (room_for_more_output_in_output_buffer)
                while (enough_input_for_another_dchar)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_dchars;
        }
    }

    class Encoder
    {
        // similar approach to Decoder
    }

These are just suggestions, to take or leave at one's discretion.
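For concreteness, the array-in/array-out idea, fixed to report both counts (consumed input via an out parameter, produced output via the length of the returned slice), might be fleshed out like this - a hedged sketch for the UTF-8 case only; the function name is illustrative and malformed-input validation is omitted:

```d
// Sketch: UTF-8 -> UTF-32 over caller-supplied arrays. Converts as
// much as fits in either array, leaving any trailing partial sequence
// in the input for the next call.
dchar[] utf8ToUtf32(char[] input, dchar[] output, out int consumed)
{
    int i, o;   // indices into input and output
    while (o < output.length && i < input.length)
    {
        ubyte b = input[i];
        int len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        if (i + len > input.length)
            break;                  // incomplete trailing sequence: stop here
        uint c = (len == 1) ? b : (b & (0x3F >> (len - 1)));
        for (int k = 1; k < len; k++)
            c = (c << 6) | (input[i + k] & 0x3F);   // continuation bytes
        output[o++] = cast(dchar) c;
        i += len;
    }
    consumed = i;
    return output[0 .. o];   // slice covering only the converted dchars
}
```

Calling this with room for a single dchar in the output gives exactly the "convert just one" behaviour described above.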
Aug 16 2004
    class Transcoder
    {
        this (char[] encoding) {...}

        dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
        {
            while (room_for_more_output)
                while (enough_input_for_another_dchar)
                    do_actual_conversion_into_output_buffer;
            emit_quantity_of_input_consumed;
            return_slice_of_output_representing_converted_dchars;
        }
    }

Whoops! Those twin while loops should, of course, be a single while() with an && between the two conditions.
Aug 16 2004
In article <cfr0tf$2pgm$1 digitaldaemon.com>, antiAlias says...

These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly

The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor.

What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently.

I would say that here the time spent getting the web page from the server to client across the internet will outweigh the time spent encoding by many orders of magnitude. But I'm not /against/ efficiency. If people want to recode this stuff in assembler then obviously I'm not going to object.

Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.

I think I was confusing you with Nick. My bad.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output. Convert what you can within the constraints of input & output limits, and tell me the resultant outcome". If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted. Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.

Gotcha. Sorry - I misinterpreted the intent of the function signatures.

Wholly agreed: pushback is a big "no no". But it's not an issue when using a pair of arrays in the suggested manner.

Wholly agreed.

There are several valid ways to skin that particular cat <g> Here's a fuller implementation of the array approach (in pseudo-code) <snip> This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level. A variation on the theme might initially provide a buffer to house the conversion output instead. There's pros and cons to both approaches. In this case, you'd probably want to split the transcoding into separate encoding and decoding: <snip> These are just suggestions, to take or leave at one's discretion.

They are good suggestions. They have the benefit of efficiency without losing generality. They have the disadvantage of having a slightly confusing signature, but good documentation should solve that. Nice one.

Arcane Jill
Aug 17 2004
(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together, the text is stored with writeString, which puts a length byte followed by the text.
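The unget mechanism described can be sketched as a buffer carried alongside the stream, consulted before the underlying source - an illustrative skeleton; std.stream's actual implementation differs in detail:

```d
// Sketch of pushback via an unget buffer: next() drains pushed-back
// bytes (most recent first) before touching the underlying source.
abstract class PushbackSource
{
    private ubyte[] unget;

    // Subclasses supply the real byte source.
    protected abstract ubyte readByte();

    ubyte next()
    {
        if (unget.length)
        {
            ubyte b = unget[unget.length - 1];
            unget.length = unget.length - 1;
            return b;
        }
        return readByte();
    }

    // Push a byte back; next() will return it before reading further.
    void ungetc(ubyte b) { unget ~= b; }
}
```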
Aug 17 2004
In article <cfsu27$122d$1 digitaldaemon.com>, Ben Hinkle says...

std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together that the text are stored with writeString which puts a length byte followed by the text.

For the record, this is exactly what my mods to std.utf are for. In fact, unFormat and my stream mods already use them. Most stream routines allow for at least one byte to be put back. Obviously this isn't possible in all cases, but it *is* always possible to carry an unget buffer around with the stream, as std.stream already does. Only the formatted routines check this area for content, and I consider that correct behavior, as some translation may have been done between the stream and the buffer.

Sean
Aug 17 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfog97$12n2$1 digitaldaemon.com...

There have been loads and loads of discussions in recent weeks about Unicode, streams, and transcodings. There seems to be a general belief that "things are happening", but I'm not quite clear on the specifics - hence this post, which is basically a question.

What I am excited about is that D is becoming the premier language to do Unicode in, by a wide margin. And that's thanks to you guys!
Aug 15 2004
In article <cfog97$12n2$1 digitaldaemon.com>, Arcane Jill says...

I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

I'd like to have them. In fact my partial rewrite of stream.d already has this.

Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.

Hard to answer as I don't really know what will happen with Phobos in the long term, however... I would like to see unFormat/readf get into Phobos, though that may have to wait for TypeInfo to be working for pointers, since the current calling convention is still a bit inconsistent with writef (i.e. it requires a format string as the first argument, just like scanf). I could work around this with a big if/else block to get the underlying type of pointer arguments, but I'd prefer to just work off classinfo.name like Walter does for doFormat. Perhaps I'll just drop that into a separate function and replace it later when TypeInfo gets fixed.

As for my std.stream rewrite... I like it better than what's in std.stream now, but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea? Perhaps streams should be dropped from Phobos completely? I consider my version of stream.d to be more of a prototype than a full-featured replacement.

So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't been done yet, basically because we haven't agreed on an architecture, and I for one am not really sure who's doing it anyway.

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion.
Is this enough to start with?

Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders yet, or is it still up in the air?

I guess that depends on what still needs to be done.

But I do think we should nail down the architecture soon, as we're getting a lot of questions and discussion on this. But one thing at a time. Someone tell me where streams are going (with regard to above questions) and then I'll have more suggestions.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode. I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure.

Sean
Aug 16 2004
In article <cfrc06$31g8$1 digitaldaemon.com>, Sean Kelly says...

I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

I'd like to have them. In fact my partial rewrite of stream.d already has this.

Apparently, so does Phobos, although I didn't know that at the time I posted the question. Now isn't that cute - an interface with an undocumented interface!

Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.

Hard to answer as I don't really know what will happen with Phobos in the long term, however...

Okay - I just wasn't sure if you were working for Walter in some capacity. Forgive the dumb question.

As for my std.stream rewrite... I like it better than what's in std.stream now

It's hard to know what's in std.stream now without reading the source. I /really/ wish someone would document it.

but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea?

Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).

Perhaps streams should be dropped from Phobos completely?

Perhaps, but I find it unlikely that that will happen. Only Walter is empowered to do that.

I consider my version of stream.d to be more of a prototype than a full-featured replacement.

Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"?
If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion. Is this enough to start with?

It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.

Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.

I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure. Sean

Well, thanks. I think I've got a picture now of what's going on. I'll post a summary shortly, then we can start calling for volunteers for the missing bits.

Jill
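A character-aware formatter would measure field widths in dchars rather than bytes. A minimal sketch of the idea, using std.utf.toUTF32 to count characters (the function name padLeft is illustrative):

```d
import std.utf;

// Pad on the left to a width measured in characters, not bytes:
// a 5-character string that occupies 7 UTF-8 bytes still gets padding
// when the requested width is 6.
char[] padLeft(char[] s, int width)
{
    int chars = cast(int) toUTF32(s).length;   // count characters
    while (chars < width)
    {
        s = " " ~ s;
        chars++;
    }
    return s;
}
```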
Aug 17 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfscq6$s8u$1 digitaldaemon.com...

std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).

Please allow me to clarify: from recollection, the position has always been that both mango.io and std.streams should be excluded from Phobos. I think that was Matthew's position also. If mango.io turns out to be a better solution, then it can certainly move from its current home if that's what people want; but I'm not holding my breath waiting for a consensus on that one <g>

BTW; mango.io is completely independent from the rest of the Mango Tree (has no dependencies), so it can be easily cut away. In fact, it's almost totally independent of Phobos too ...

Right -- that thread regarding placing the .lib dependencies inside the D source-code might help out with this (to the extent that it can).
Aug 17 2004
Arcane Jill wrote:

but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea?

Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).

Or, since it is open source, you can just compile it in a la std.* and not have a library dependency

Scott
Aug 17 2004
In article <cfscq6$s8u$1 digitaldaemon.com>, Arcane Jill says...

In article <cfrc06$31g8$1 digitaldaemon.com>, Sean Kelly says...

I consider my version of stream.d to be more of a prototype than a full-featured replacement.

Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"? If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?

It's only a prototype in the sense that I haven't really finished it yet. There are some notable functions missing (like ignore), etc. If people are interested then I'll flesh it out a bit. I don't have a ton of free time so I figured I'd see what the response was before I worked any more on it.

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion. Is this enough to start with?

It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.

Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.

unFormat operates entirely in terms of dchars. So the width modifiers are in terms of UTF-32 characters, etc. But I agree. If doFormat doesn't work this way then it probably should. The results are unpredictable otherwise.

Sean
Aug 17 2004