digitalmars.D - Streams and encoding
- Sean Kelly (38/38) Aug 03 2004 I finally got back on my stream mods today and had a question: how shou...
- Arcane Jill (12/14) Aug 03 2004 Simple answer - it shouldn't have to.
- Sean Kelly (7/19) Aug 03 2004 Works for me. So how does a formatted read/write routine know which for...
- Arcane Jill (20/25) Aug 03 2004 Got it in one. Plus, you can have a factory function like createReader(c...
- Sean Kelly (4/17) Aug 03 2004 No worries. If we've got a class per format then it knows implicitly wh...
- parabolis (48/51) Aug 03 2004 I have been wondering who was working on a Stream library. I
- Regan Heath (25/47) Aug 03 2004 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis
- parabolis (18/69) Aug 03 2004 I will concede the order was wrong. However I believe the
- Regan Heath (20/88) Aug 03 2004 So.. ?
- parabolis (36/88) Aug 03 2004 Here I must argue that any knowledge of where C went really
- Regan Heath (26/70) Aug 03 2004 I didn't/don't use slicing. I think you may be confusing two different
- parabolis (14/80) Aug 03 2004 Show me a safe function that takes void* as a parameter. That
- Regan Heath (61/84) Aug 03 2004 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis
- parabolis (15/102) Aug 03 2004 Now that is pretty neat.
- Regan Heath (19/118) Aug 04 2004 Just that bit.. or the whole thing? That bit above was a little hack, it...
- parabolis (53/156) Aug 04 2004 Not really. My DataXXXStream would handle reading all cases
- Regan Heath (16/98) Aug 04 2004 I don't think so.
- Andy Friesen (20/63) Aug 03 2004 Slicing does not create garbage. Arrays really are value types that get...
- Regan Heath (25/54) Aug 03 2004 Really? doesn't slicing create another array structure (the one you have...
- Andy Friesen (11/46) Aug 03 2004 Sure, but the second two can probably be optimized into one and the same...
- parabolis (2/4) Aug 03 2004 Sure there is. Not allocating is infinitely faster. :)
- Andy Friesen (13/20) Aug 03 2004 Passing an array slice as an argument is exactly the same as passing a
- Bent Rasmussen (2/6) Aug 04 2004 That is post-mature optimization. You should never have created
- Sean Kelly (16/18) Aug 04 2004 Not sure I agree in this case.
- parabolis (4/24) Aug 04 2004 I am pretty sure the second read in your example parses it by
- Regan Heath (7/34) Aug 04 2004 It's not guaranteed to be valid. replace x.sizeof with 1000 and it's an
- Andy Friesen (5/25) Aug 04 2004 I changed my mind. You're right. :)
- parabolis (10/48) Aug 03 2004 That is what I meant by a wrapper. It is actually defined in
- parabolis (14/19) Aug 05 2004 This is a good suggestion because void /is/ a much better
- Regan Heath (18/38) Aug 05 2004 I still don't agree with the last bit, void[] gives no _assurance_ at al...
- parabolis (47/98) Aug 06 2004 My argument is that there exists a program in which a bug will
- Regan Heath (60/133) Aug 08 2004 On Fri, 06 Aug 2004 14:29:19 -0400, parabolis
- Arcane Jill (4/9) Aug 06 2004 For all D types, the number of bytes occupied by a T[] of length N is (N...
- parabolis (8/20) Aug 06 2004 Sorry I meant from the docs
- Sean Kelly (6/13) Aug 06 2004 Probably an esoteric question, but I assume that the byte size guarantee ...
- parabolis (6/24) Aug 06 2004 Actually that is not a terribly esoteric question. I do not
- Regan Heath (14/65) Aug 03 2004 For another perspective/idea have a look at my thread entitled "My strea...
- Sean Kelly (19/29) Aug 03 2004 My design really set out to extend the original stream approach, and it see...
- antiAlias (28/66) Aug 03 2004 What you good folks seem to be describing is pretty much how mango.io
- Sean Kelly (7/15) Aug 03 2004 Yup. I've played around with Mango and kind of like it. One of the rea...
- antiAlias (8/24) Aug 03 2004 You are absolutely right. But not many people seem to know about Mango, ...
- parabolis (12/25) Aug 03 2004 I can't help but ask how it manages to do both input and output
- Walter (14/20) Aug 03 2004 with a
- Arcane Jill (19/27) Aug 03 2004 With all due respect, Walter, that's not really feasible. It is very har...
- Sean Kelly (11/16) Aug 04 2004 That reminds me. Which format does the code in utf.d use? I'm thinking...
- Arcane Jill (14/24) Aug 04 2004 Whatever works, works. But I'd make the enum private. Encodings should b...
- Sean Kelly (16/42) Aug 04 2004 std.utf has methods like toUTF16. But does this target the big or littl...
- Ben Hinkle (16/38) Aug 04 2004 handled
- Arcane Jill (28/37) Aug 04 2004 Neither, really. toUTF16 returns an array of wchars, not an array of cha...
- Sean Kelly (11/21) Aug 04 2004 Bah. Of course. So the two UTF schemes just depend on the byte order w...
- Arcane Jill (52/65) Aug 04 2004 Well, from one point of view, the problem we've got here is serializatio...
- Carlos Santander B. (19/19) Aug 04 2004 "Arcane Jill" wrote in the message
- Arcane Jill (17/21) Aug 04 2004 Good question. I guess probably not. If the encoding is known, then it's...
- Regan Heath (26/35) Aug 04 2004 On Wed, 4 Aug 2004 16:58:48 +0000 (UTC), Arcane Jill
- Walter (17/34) Aug 04 2004 able
I finally got back on my stream mods today and had a question: how should the wrapper class know the encoding scheme of the low-level data? For example, say all of the formatted IO code is in a mixin or base class (assume base class for the sake of discussion) that calls a read(void*, size_t) or write(void*, size_t) method in the derived class. Now say I want to read a char, wchar, or dchar from the stream. How many bytes should I read and how do I know what the encoding format is? C++ streams handle this fairly simply by making the char type a template parameter: This has the obvious limitation that the programmer must instantiate the proper type of stream for the data format he is trying to read (as there is only one get/put method for any char type: CharT). But it makes things pretty explicit: Stream!(char) means "this is a stream formatted in UTF8." The other option I can think of offhand would be to have a class member that the derived class could set which specifies the encoding format: This has the benefit of allowing the user to read and write any char type with a single instantiation, but requires greater complexity in the Stream class and in the Derived class. And I wonder if such flexibility is truly necessary. Any other design possibilities? Preferences? I'm really trying more to establish a good formatted IO design than to work out the perfect stream API. Any other weird issues would be welcome also. Sean
Aug 03 2004
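(The two code examples Sean mentions, after "a template parameter:" and "the encoding format:", did not survive in the archive. The sketch below is only a guess at the shape of the two designs in D; the class and enum names are invented for illustration, not taken from his code.)
================================================================
// Option 1: the character type (and therefore the encoding) is a
// template parameter, as in C++ iostreams.
class Stream(CharT)
{
    // raw IO supplied by the derived class
    abstract uint read(void* buf, size_t len);
    abstract uint write(void* buf, size_t len);

    // formatted IO works in units of CharT (multi-unit decoding omitted)
    CharT get()
    {
        CharT c;
        read(&c, CharT.sizeof);
        return c;
    }

    void put(CharT c)
    {
        write(&c, CharT.sizeof);
    }
}

// Option 2: a single stream class, with the encoding selected at run
// time by a member that the derived class sets.
enum Encoding { UTF8, UTF16, UTF32 }

class EncodedStream
{
    Encoding encoding;   // set by the derived class

    abstract uint read(void* buf, size_t len);
    abstract uint write(void* buf, size_t len);

    // read one code unit of the selected encoding (full decoding omitted)
    dchar get()
    {
        if (encoding == Encoding.UTF8)
        {
            char c;
            read(&c, c.sizeof);
            return c;
        }
        else if (encoding == Encoding.UTF16)
        {
            wchar c;
            read(&c, c.sizeof);
            return c;
        }
        else
        {
            dchar c;
            read(&c, c.sizeof);
            return c;
        }
    }
}
================================================================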
In article <ceopfj$1hcl$1 digitaldaemon.com>, Sean Kelly says...I finally got back on my stream mods today and had a question: how should the wrapper class know the encoding scheme of the low-level data?Simple answer - it shouldn't have to. I suggest using a specialized transcoding filter for such things. That's what Java does (Java calls them Readers and Writers), and Java's streams have been hailed as a shining example of how to do things correctly. Then your streams just connect together naturally, as others have shown in other recent threads. e.g.: Windows1252Reader(stdin)))); (or something similar). You can have factory methods to create transcoders where the encoding is not known until runtime. Jill
Aug 03 2004
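(Jill's example chain was mangled in the archive; only "Windows1252Reader(stdin))))" remains. The sketch below illustrates the transcoding-filter idea in D without trying to reconstruct her exact example; every interface and class name here is hypothetical.)
================================================================
// a source of raw bytes (a file, a socket, standard input, ...)
interface ByteSource
{
    uint read(ubyte[] buf);   // returns the number of bytes read, 0 at end
}

// a source of decoded Unicode characters
interface CharSource
{
    dchar readChar();
}

// a transcoding filter: wraps a byte source and hands out decoded characters
class Windows1252Reader : CharSource
{
    private ByteSource src;

    this(ByteSource src)
    {
        this.src = src;
    }

    dchar readChar()
    {
        ubyte[1] b;
        src.read(b);               // end-of-stream handling omitted
        return cast(dchar) b[0];   // table lookup for 0x80..0x9F omitted
    }
}

// further filters wrap another CharSource the same way, so a chain like
//     new SomeOtherFilter(new Windows1252Reader(rawInput))
// composes naturally.
================================================================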
In article <ceor8d$1ihu$1 digitaldaemon.com>, Arcane Jill says...In article <ceopfj$1hcl$1 digitaldaemon.com>, Sean Kelly says...Works for me. So how does a formatted read/write routine know which format it's targeting?I finally got back on my stream mods today and had a question: how should the wrapper class know the encoding scheme of the low-level data?Simple answer - it shouldn't have to.I suggest using a specialized transcoding filter for such things. That's what Java does (Java calls them Readers and Writers), and Java's streams have been hailed as a shining example of how to do things correctly. Then your streams just connect together naturally, as others have shown in other recent threads. e.g.: Windows1252Reader(stdin))));Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format? ie. there would be an UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc? Sean
Aug 03 2004
In article <ceorv6$1iti$1 digitaldaemon.com>, Sean Kelly says...Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format? ie. there would be an UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?Got it in one. Plus, you can have a factory function like createReader(char[]), so you can do Reader r = createReader("UTF-16LE"); etc. (for when the type is known at run time, not compile time, which is usually). The implementation of createReader() is just a big switch statement, with each case returning a new instance of the relevant class. (I swapped your questions around. Here's the first one).Works for me. So how does a formatted read/write routine know which format it's targeting?You got me there. I think the question's too vague, and the answer is application-specific. Generally speaking, at some level, the encoding is known, somehow. Maybe it's specified in the text file itself (XML and HTTP pull this trick - for it to work the very start of the file must comprise only ASCII characters (although they can be encoded in a UTF)); maybe it's specified in a configuration file; maybe it's deduced using some heuristic test; maybe the OS default is assumed. At the level where the encoding is known, decode it (into UTF-8), and then you can use byte streams from then on. As parabolis said, a stream, in the abstract, deals in ubytes, not chars (because that's what you write to files, sockets, etc.). Classes which implement read() or write() in units other than ubyte shouldn't really be called "streams", which of course is why Java calls them Readers and Writers. (Maybe "filters" for the general case). Arcane Jill
Aug 03 2004
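(A minimal sketch of the createReader() factory described above: a switch over the encoding name, with each case returning a different Reader subclass. Only the factory shape comes from the post; the Reader hierarchy is stubbed out with invented names.)
================================================================
abstract class Reader
{
    abstract dchar readChar();
}

// stub subclasses, one per encoding (decoding logic omitted)
class UTF8Reader    : Reader { dchar readChar() { return '?'; } }
class UTF16LEReader : Reader { dchar readChar() { return '?'; } }
class UTF16BEReader : Reader { dchar readChar() { return '?'; } }

Reader createReader(char[] encoding)
{
    // "just a big switch statement, with each case returning a new
    // instance of the relevant class"
    switch (encoding)
    {
        case "UTF-8":    return new UTF8Reader();
        case "UTF-16LE": return new UTF16LEReader();
        case "UTF-16BE": return new UTF16BEReader();
        default:         throw new Exception("unsupported encoding");
    }
}

// usage, as in the post above:
//     Reader r = createReader("UTF-16LE");
================================================================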
In article <ceou9t$1kbq$1 digitaldaemon.com>, Arcane Jill says...In article <ceorv6$1iti$1 digitaldaemon.com>, Sean Kelly says...No worries. If we've got a class per format then it knows implicitly what format to convert to/from. SeanOkay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format? ie. there would be an UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?Got it in one. Plus, you can have a factory function like createReader(char[]), so you can do Reader r = createReader("UTF-16LE"); etc. (for when the type is known at run time, not compile time, which is usually). The implementation of createReader() is just a big swtich statement, with each case return a new instance of the relevant class. (I swapped your questions around. Here's the first one).Works for me. So how does a formatted read/write routine know which format it's targeting?You got me there.
Aug 03 2004
Sean Kelly wrote:I finally got back on my stream mods today and had a question: how should the wrapper class know the encoding scheme of the low-level data?I have been wondering who was working on a Stream library. I have many thoughts, many of which are covered in OT - scanf in Java. Here are some notes: In C (and C++ by extension I would imagine) the char type is the smallest addressable cell in memory. In D the char is a UTF-8 8-bit code unit which is quite a different thing. I would suggest you seriously consider defining basic IO using either the ubyte (which represents a general 8-bit value) or possibly the data type that is the native cell size used in memory (something like size_t I believe). Also I have noticed the tendency for people to not make the distinction between Input and Output streams. This leads to some problems. Say I want to write a class to handle CRC32 on stream data. It is far simpler and less error prone to compute such a digest on a stream in which data flows in only one direction, especially in a multi-threaded environment. Also the Input and Output distinction allows for stream pumps that automatically pull data from one and push data into another. This is especially useful with bifurcating streams that also do logging. As for the templatization of streams I believe a pair of generic data input/output stream classes can be written using templates which will do impedance matching from the 8-bit streams to the n-bit data type you want to read. So you have to write 8, 16, 32 and possibly 64 and 128 bit functions. Here is the foundation of the stream library I imagine: ================================================================ interface DataSink { uint write( ubyte[] data, uint off = 0, uint len = 0); } interface DataSource { uint read( inout ubyte[] data, uint off = 0, uint len = 0); ulong seek( ulong size ); } ================================================================ The data being read/written by native interface classes: ================================================================ FileInputStream : DataSource FileOutputStream : DataSink SocketInputStream : DataSource SocketOutputStream : DataSink MMapInputStream : DataSource MMapOutputStream : DataSink ================================================================ The data is then manipulated providing buffering, digesting, en/decryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
Aug 03 2004
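(A sketch of the "stream pump" parabolis mentions, built directly on the two interfaces above. The pump itself is not part of the post, and it assumes that read() returns 0 at end of stream and that the default off/len arguments mean "use the whole slice".)
================================================================
// pulls everything from a DataSource and pushes it into a DataSink,
// returning the number of bytes moved
uint pump(DataSource src, DataSink dst, uint chunkSize = 4096)
{
    ubyte[] buf = new ubyte[chunkSize];
    uint total = 0;

    for (;;)
    {
        uint got = src.read(buf, 0, chunkSize);
        if (got == 0)                 // assumed end-of-stream convention
            break;
        dst.write(buf[0 .. got]);     // default off/len cover the whole slice
        total += got;
    }
    return total;
}
================================================================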
On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> wrote: <snip>Here is the foundation of the stream library I imagine: ================================================================ interface DataSink { uint write( ubyte[] data, uint off = 0, uint len = 0); } interface DataSource { uint read( inout ubyte[] data, uint off = 0, uint len = 0); ulong seek( ulong size ); } ================================================================I think you need functions in the form: ulong write(void* data, ulong len = 0, ulong off = 0); notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off as you can call with: ubyte[] big = "regan was here"; write(big[6..9]); to achieve both. The void* allows easy specialised write functions, eg. bool write(int x) { write(&x,x.sizeof); } I'm not sure whether uint or ulong should be used, anyone got opinions/reasons for one or the other?The data being read/written by native interface classes: ================================================================ FileInputSream : DataSource FileOutputSream : DataSink SocketInputSream : DataSource SocketOutputSream : DataSink MMapInputStream : DataSource MMapOutputStream : DataSink ================================================================ The data is then manipulated providing buffering, digesting, en/de-crpytion and [de]compressoin, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations. See my earlier post (with source) on how this works. Note there was a problem with it which I have since fixed, changing 'super.' to 'this.' in the stream template class. Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
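(For readers who missed the earlier thread Regan refers to, a rough sketch of the template "bolt-in" idea: the device class supplies only a raw read(), and a templated stream class layers formatted operations on top of whichever device it is instantiated with. All names below are illustrative, not Regan's actual code.)
================================================================
// a minimal "device": raw, unformatted reads only
class MemoryDevice
{
    private ubyte next = 0;

    // stand-in for a real file/socket read: just fills the buffer
    uint read(ubyte[] buf)
    {
        for (uint i = 0; i < buf.length; i++)
            buf[i] = next++;
        return cast(uint) buf.length;
    }
}

// the bolt-in: formatted operations layered on any device type
class InputStream(Device) : Device
{
    bool readInt(out int x)
    {
        ubyte[int.sizeof] tmp;
        if (this.read(tmp) != int.sizeof)
            return false;
        x = *cast(int*) tmp.ptr;
        return true;
    }
}

// bolt the formatted layer onto a concrete device
alias InputStream!(MemoryDevice) MemoryInputStream;
================================================================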
Regan Heath wrote:On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> wrote: <snip>I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.Here is the foundation of the stream library I imagine: ================================================================ interface DataSink { uint write( ubyte[] data, uint off = 0, uint len = 0); } interface DataSource { uint read( inout ubyte[] data, uint off = 0, uint len = 0); ulong seek( ulong size ); } ================================================================I think you need functions in the form: ulong write(void* data, ulong len = 0, ulong off = 0); notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off as you can call with: ubyte[] big = "regan was here"; write(big[6..9]); to achieve both.The void* allows easy specialised write functions, eg. bool write(int x) { write(&x,x.sizeof); }The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.I made an argument that I believe input and output should be clearly seperated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables seperate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.The data being read/written by native interface classes: ================================================================ FileInputSream : DataSource FileOutputSream : DataSink SocketInputSream : DataSource SocketOutputSream : DataSink MMapInputStream : DataSource MMapOutputStream : DataSink ================================================================ The data is then manipulated providing buffering, digesting, en/de-crpytion and [de]compressoin, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.
Aug 03 2004
On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis softhome.net> wrote:Regan Heath wrote:So.. ?On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> wrote: <snip>I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed.Here is the foundation of the stream library I imagine: ================================================================ interface DataSink { uint write( ubyte[] data, uint off = 0, uint len = 0); } interface DataSource { uint read( inout ubyte[] data, uint off = 0, uint len = 0); ulong seek( ulong size ); } ================================================================I think you need functions in the form: ulong write(void* data, ulong len = 0, ulong off = 0); notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off as you can call with: ubyte[] big = "regan was here"; write(big[6..9]); to achieve both.The len and off parameters allow a caller to take either approach.Yeah.. we have default parameters, we can provide both options at no cost, so why not.Correct.The void* allows easy specialised write functions, eg. bool write(int x) { write(&x,x.sizeof); }The void* is a pointer with no associated type.The arrays in D are infinitely better than void* pointers because arrays have extra information.Incorrect. D arrays are better for some things, those that need/want the extra information. Lets ignore our opinions on the use of void* for now, can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?Sure, wanting to do this does not stop you using bolt-ins. I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning. Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/I made an argument that I believe input and output should be clearly seperated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables seperate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.The data being read/written by native interface classes: ================================================================ FileInputSream : DataSource FileOutputSream : DataSink SocketInputSream : DataSource SocketOutputSream : DataSink MMapInputStream : DataSource MMapOutputStream : DataSink ================================================================ The data is then manipulated providing buffering, digesting, en/de-crpytion and [de]compressoin, etc. 
Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.
Aug 03 2004
Regan Heath wrote:On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis softhome.net>Here I must argue that any knowledge of where C went really wrong was with char* which allows buffer overruns because you do not know how long the buffer is... I also do not see how you could have used slicing and a void*. How would you know when to stop reading before you had off and len?The arrays in D are infinitely better than void* pointers because arrays have extra information.Incorrect. D arrays are better for some things, those that need/want the extra information.Lets ignore our opinions on the use of void* for now, can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?I will do both at the same time... (read on)Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that creaed a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin. Having said that... Of course it is possible to read a int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is... ================================================================ int readInt( ubyte buf, uint off = 0 ) { if( buf.length <= off+4 ) throw Error( "Buffer overrun" ); uint result = buf[off+0]; result |= (cast(int)(buf[off+1])) << 8; result |= (cast(int)(buf[off+2])) << 16; result |= (cast(int)(buf[off+3])) << 24; return result; } ================================================================I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)Sure, wanting to do this does not stop you using bolt-ins. I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.I made an argument that I believe input and output should be clearly seperated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables seperate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.The data being read/written by native interface classes: ================================================================ FileInputSream : DataSource FileOutputSream : DataSink SocketInputSream : DataSource SocketOutputSream : DataSink MMapInputStream : DataSource MMapOutputStream : DataSink ================================================================ The data is then manipulated providing buffering, digesting, en/de-crpytion and [de]compressoin, etc. 
Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.
Aug 03 2004
On Tue, 03 Aug 2004 21:41:51 -0400, parabolis <parabolis softhome.net> wrote:Regan Heath wrote:Incorrect. D arrays are better for some things, those that need/want the extra information.Here I must argue that any knowledge of where C went really wrong was with char* which allows buffer overruns because you do not know how long the buffer is...I also do not see how you could have used slicing and a void*.I didn't/don't use slicing. I think you may be confusing two different points I made. My first point was that off and len were not required because you can slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len. My second point was that instead of ubyte[] you should use void* for convenience. If you use void* you definately need len.How would you know when to stop reading before you had off and len?I have always had len, my fn prototype is: ulong write(void* address, ulong length); which simply writes length bytes starting at address.both? .. on I read ..Lets ignore our opinions on the use of void* for now, can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?I will do both at the same time... (read on)Typo, you missed the [], I have added them below.Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that creaed a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin. Having said that... Of course it is possible to read a int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is... ================================================================ int readInt( ubyte buf, uint off = 0 ) {int readInt( ubyte[] buf, uint off = 0 ) { if( buf.length <= off+4 ) throw Error( "Buffer overrun" ); uint result = buf[off+0]; result |= (cast(int)(buf[off+1])) << 8; result |= (cast(int)(buf[off+2])) << 16; result |= (cast(int)(buf[off+3])) << 24; return result; } ================================================================And this is supposed to be nicer/easier/more efficient than.. bool readInt(out int x) { if (read(&x,x.sizeof) != x.sizeof) throw new Exception("Out of data"); return true; } As you can see using void* allows very convenient and totally buffer overrun safe code. <snip>I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)You mean the problem you see with threads and shared buffers? Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
Regan Heath wrote:I didn't/don't use slicing. I think you may be confusing two different points I made. My first point was that off and len were not required because you can slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len. My second point was that instead of ubyte[] you should use void* for convenience. If you use void* you definately need len.I see now. I was confused. Sorry.Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guanratee in read(void*,uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.Typo, you missed the [], I have added them below.Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that creaed a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin. Having said that... Of course it is possible to read a int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is... ================================================================ int readInt( ubyte buf, uint off = 0 ) {int readInt( ubyte[] buf, uint off = 0 ) { if( buf.length <= off+4 ) throw Error( "Buffer overrun" ); uint result = buf[off+0]; result |= (cast(int)(buf[off+1])) << 8; result |= (cast(int)(buf[off+2])) << 16; result |= (cast(int)(buf[off+3])) << 24; return result; } ================================================================And this is supposed to be nicer/easier/more efficient than.. bool readInt(out int x) { if (read(&x,x.sizeof) != x.sizeof) throw new Exception("Out of data"); return true; } As you can see using void* allows very convenient and totally buffer overrun safe code.<snip>Sorry I meant the problem with threads and shared buffers should be easier now. The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)You mean the problem you see with threads and shared buffers?
Aug 03 2004
On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> wrote: <snip>Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guanratee in read(void*,uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void* instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct.. .. I have just discovered you can use ubyte[] and get the same sort of function as my void* one, check out... class Stream { ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0) { if (length == 0) length = buffer.length; buffer[offset..length] = 65; return length-offset; } bool read(out char x) { if (read(cast(ubyte[])(&x)[0..x.sizeof]) != x.sizeof) return false; return true; } } void main() { Stream st = new Stream(); char c; st.read(c); printf("%c\n",c); } as you can see using a cast, a slice and the address of the char we can do the same thing as with a void *. So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider... void badBuggyRead(out char x) { read(cast(ubyte[])(&x)[0..1000]); } so even tho read uses a ubyte[] it can still overrun.:)<snip>Sorry I meant the problem with threads and shared buffers should be easier now.I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)You mean the problem you see with threads and shared buffers?The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg. alias OutputStream!(InputStream!(RawFile)) File; or something, I have not tried splitting them yet, then.. alias CRCReader!(File) CRCFileReader; alias CRCWriter!(File) CRCFileWriter; alias ZIPReader!(File) ZIPFileReader; alias ZIPWriter!(File) ZIPFileWriter; now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used... Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
Regan Heath wrote:On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> wrote:That is a good point.Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guanratee in read(void*,uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void* instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..... I have just discovered you can use ubyte[] and get the same sort of function as my void* one, check out... class Stream { ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0) { if (length == 0) length = buffer.length; buffer[offset..length] = 65;Now that is pretty neat.So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider... void badBuggyRead(out char x) { read(cast(ubyte[])(&x)[0..1000]); } so even tho read uses a ubyte[] it can still overrun.You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.Consider the number of combinations of just Readers that are possible: File,Net,Mem - choose 1 of 3 Compression CRC } - choose any number and in any order Buffering Image,Audio,Video - choose 1 of 3 If I am not to sleepy to be thinking straight then there are rougly 100 combinations of readers with just these 9 classes.:)<snip>Sorry I meant the problem with threads and shared buffers should be easier now.I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)You mean the problem you see with threads and shared buffers?The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg. alias OutputStream!(InputStream!(RawFile)) File; or something, I have not tried splitting them yet, then.. alias CRCReader!(File) CRCFileReader; alias CRCWriter!(File) CRCFileWriter; alias ZIPReader!(File) ZIPFileReader; alias ZIPWriter!(File) ZIPFileWriter; now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used... Regan
Aug 03 2004
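(One way to make that count concrete; the arithmetic is an addition, not from the post: 3 sources × the number of ordered arrangements of any subset of {Compression, CRC, Buffering} (1 + 3 + 6 + 6 = 16) × 3 interpreters gives 3 × 16 × 3 = 144 distinct readers if filter order matters, or 3 × 8 × 3 = 72 if it does not, so "roughly 100" is a fair ballpark.)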
On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> wrote:Regan Heath wrote:Just that bit.. or the whole thing? That bit above was a little hack, it sets the whole buffer to 65 or ascii 'A'.On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> wrote:That is a good point.Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guanratee in read(void*,uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void* instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..... I have just discovered you can use ubyte[] and get the same sort of function as my void* one, check out... class Stream { ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0) { if (length == 0) length = buffer.length; buffer[offset..length] = 65;Now that is pretty neat.But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using: read(void* address, ulong length); you might call that wrong to. I cannot see a difference and void* is easier to use and smaller than void[].So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider... void badBuggyRead(out char x) { read(cast(ubyte[])(&x)[0..1000]); } so even tho read uses a ubyte[] it can still overrun.You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.Yeah.. so? when I need one I make an alias and use it.. when I need another I make an alias and use it, it's no different to simply typing new A(new B(new C))) when you use it, _except_, if you re-use it in several places then my alias is neater. I am not going to alias all x possible combinations right now :) Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/Consider the number of combinations of just Readers that are possible: File,Net,Mem - choose 1 of 3 Compression CRC } - choose any number and in any order Buffering Image,Audio,Video - choose 1 of 3 If I am not to sleepy to be thinking straight then there are rougly 100 combinations of readers with just these 9 classes.:)<snip>Sorry I meant the problem with threads and shared buffers should be easier now.I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)You mean the problem you see with threads and shared buffers?The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...Ahh yes.. I am waiting for an idea to come to me.. 
my first idea is that I combine them in the same way as I combine the ones I currently have eg. alias OutputStream!(InputStream!(RawFile)) File; or something, I have not tried splitting them yet, then.. alias CRCReader!(File) CRCFileReader; alias CRCWriter!(File) CRCFileWriter; alias ZIPReader!(File) ZIPFileReader; alias ZIPWriter!(File) ZIPFileWriter; now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used... Regan
Aug 04 2004
Regan Heath wrote:On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> wrote:Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure then it is expected they will not do it in a way that causes a buffer overrun. Most buffer overruns are a result of the fact that deal with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]). You fail to do that with void*.Regan Heath wrote:But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> wrote: So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider... void badBuggyRead(out char x) { read(cast(ubyte[])(&x)[0..1000]); } so even tho read uses a ubyte[] it can still overrun.You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.read(void* address, ulong length); you might call that wrong to. I cannot see a difference and void* is easier to use and smaller than void[].So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this: ================================================================ alias BufferedInputStream!(FileInputStream) BufferedFileInputStream; alias DecompressionInputStream!(BufferedFileInputStream) DecompressionBufferedFileInputStream; alias CRCInputStream!(DecompressionBufferedFileInputStream) CRCDecompressionBufferedFileInputStream; alias ImageInputStream!(CRCDecompressionBufferedFileInputStream) ImageCRCDecompressionBufferedFileInputStream; CRCInputSream crc_in = new CRCDecompressionBufferedFileInputStream(filename); ImageInputSream iin= new ImageCRCDecompressionBufferedFileInputStream(crc_in); ================================================================ File - 10 times Buffered - 10 times Decompression - 8 times CRC - 7 times Image - 4 times ================================ I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal: ================================================================ CRCInputStream crc_in = new CRCInputStream ( new DecompressionInputStream ( new BufferedInputStream ( new FileInputStream( filename ) ) ) ); ImageInputSream iin = new ImageInputStream( crc_in ); ================================================================ File - 1 time Buffered - 1 time Decompression - 1 time CRC - 2 times Image - 2 times ================Yeah.. so? when I need one I make an alias and use it.. when I need another I make an alias and use it, it's no different to simply typing new A(new B(new C))) when you use it, _except_, if you re-use it in several places then my alias is neater. 
I am not going to alias all x possible combinations right now :)Consider the number of combinations of just Readers that are possible: File,Net,Mem - choose 1 of 3 Compression CRC } - choose any number and in any order Buffering Image,Audio,Video - choose 1 of 3 If I am not to sleepy to be thinking straight then there are rougly 100 combinations of readers with just these 9 classes.:)<snip>Sorry I meant the problem with threads and shared buffers should be easier now.I am glad to hear you decided to split them. I think you will find it makes life simpler. I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)You mean the problem you see with threads and shared buffers?The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg. alias OutputStream!(InputStream!(RawFile)) File; or something, I have not tried splitting them yet, then.. alias CRCReader!(File) CRCFileReader; alias CRCWriter!(File) CRCFileWriter; alias ZIPReader!(File) ZIPFileReader; alias ZIPWriter!(File) ZIPFileWriter; now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used... Regan
Aug 04 2004
On Wed, 04 Aug 2004 11:37:05 -0400, parabolis <parabolis softhome.net> wrote:Regan Heath wrote:I don't think so.On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> wrote:Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure then it is expected they will not do it in a way that causes a buffer overrun. Most buffer overruns are a result of the fact that deal with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]).Regan Heath wrote:But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> wrote: So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider... void badBuggyRead(out char x) { read(cast(ubyte[])(&x)[0..1000]); } so even tho read uses a ubyte[] it can still overrun.You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.You fail to do that with void*.I don't try. Because it's impossible. <snip>Nope. alias ImageStream!(CRCStream!(DecompressStream!(File) CompressedImageCRC; // my 'File' is buffered. CompressedImageCRC f = new CompressedImageCRC(); or more likely 'CompressedImageCRC' will be replaced by a name that has context where I use it, if for example it was an image resource for a game it might be simply 'Image'I am not going to alias all x possible combinations right now :)So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this:================================================================ alias BufferedInputStream!(FileInputStream) BufferedFileInputStream; alias DecompressionInputStream!(BufferedFileInputStream) DecompressionBufferedFileInputStream; alias CRCInputStream!(DecompressionBufferedFileInputStream) CRCDecompressionBufferedFileInputStream; alias ImageInputStream!(CRCDecompressionBufferedFileInputStream) ImageCRCDecompressionBufferedFileInputStream; CRCInputSream crc_in = new CRCDecompressionBufferedFileInputStream(filename); ImageInputSream iin= new ImageCRCDecompressionBufferedFileInputStream(crc_in); ================================================================ File - 10 times Buffered - 10 times Decompression - 8 times CRC - 7 times Image - 4 times ================================ I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal: ================================================================ CRCInputStream crc_in = new CRCInputStream ( new DecompressionInputStream ( new BufferedInputStream ( new FileInputStream( filename ) ) ) ); ImageInputSream iin = new ImageInputStream( crc_in ); ================================================================ File - 1 time Buffered - 1 time Decompression - 1 time CRC - 2 times Image - 2 times ================Now instantiate it 10 times and give me a tally. 
Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
parabolis wrote:Regan Heath wrote:Slicing does not create garbage. Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them. An array is quite literally little more than this: struct Array(T) { T* data; int length; } Might I suggest that DataSources and DataSinks use void[]? void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenerio is uncertain, but they are exceptionally well suited to the task regardless. (incidently, slicing void* is legal as well)On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> wrote: <snip>I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.Here is the foundation of the stream library I imagine: ================================================================ interface DataSink { uint write( ubyte[] data, uint off = 0, uint len = 0); } interface DataSource { uint read( inout ubyte[] data, uint off = 0, uint len = 0); ulong seek( ulong size ); } ================================================================I think you need functions in the form: ulong write(void* data, ulong len = 0, ulong off = 0); notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off as you can call with: ubyte[] big = "regan was here"; write(big[6..9]); to achieve both.The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[]) -- andy
Aug 03 2004
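(A small sketch of the void[] variant Andy suggests: the sink sees an untyped slice that still carries its byte length, and any array converts to it implicitly. The interface and class names are placeholders, not Andy's code.)
================================================================
interface VoidSink
{
    // data.length is always a byte count, whatever the caller passed in
    uint write(void[] data);
}

// trivial sink that just counts bytes, to show the calling side
class CountingSink : VoidSink
{
    uint total = 0;

    uint write(void[] data)
    {
        total += cast(uint) data.length;
        return cast(uint) data.length;
    }
}

void example()
{
    CountingSink sink = new CountingSink();

    ubyte[] raw = new ubyte[16];
    sink.write(raw);             // any array converts implicitly to void[]

    int x = 42;
    sink.write((&x)[0 .. 1]);    // slicing a pointer, as Andy notes, is legal;
                                 // this passes exactly int.sizeof bytes
}
================================================================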
On Tue, 03 Aug 2004 21:30:29 -0700, Andy Friesen <andy ikagames.com> wrote:On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> wrote:Really? doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so.. void foo(char[] a) { } void main() { char[] a = "12345"; foo(a[1..3]); } the above code creates 3 arrays: 1- 'a' at the start of main 2- one for the slice 3- one for the function call. leaving out the slice creates one less copy of the array (not the data) I think that is what parabolis meant.I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.Slicing does not create garbage.Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them. An array is quite literally little more than this: struct Array(T) { T* data; int length; } Might I suggest that DataSources and DataSinks use void[]? void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenerio is uncertain, but they are exceptionally well suited to the task regardless. (incidently, slicing void* is legal as well)I agree void* or void[] should be used. Parabolis's other concern was a buffer overrun, but as I see it neither void[], void * or ubyte[] are any more buffer safe (see my other post for a detailed explaination) Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])
Aug 03 2004
Regan Heath wrote:Sure, but the second two can probably be optimized into one and the same. Besides, it's stack space. Nothing is faster than stack allocation. (sub esp, ...)Slicing does not create garbage.Really? doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so.. void foo(char[] a) { } void main() { char[] a = "12345"; foo(a[1..3]); } the above code creates 3 arrays: 1- 'a' at the start of main 2- one for the slice 3- one for the function call. leaving out the slice creates one less copy of the array (not the data) I think that is what parabolis meant.References are to be preferred over pointers in C++ because constructing a null reference isn't easily possible to do by accident. It's easy to do on purpose, but if you do, Santa will put you on his Naughty list and give you coal. Also, your programs might crash or something. D arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :) -- andyThe whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])I agree void* or void[] should be used. Parabolis's other concern was a buffer overrun, but as I see it neither void[], void * or ubyte[] are any more buffer safe (see my other post for a detailed explaination)
Aug 03 2004
Andy Friesen wrote:Besides, it's stack space. Nothing is faster than stack allocation. (sub esp, ...)Sure there is. Not allocating is infinitely faster. :)
Aug 03 2004
parabolis wrote:Andy Friesen wrote:Passing an array slice as an argument is exactly the same as passing a pointer to its contents and a size. (the exact same code should be emitted) This is why the %.*s trick works with printf. The length gets pushed first, then the pointer, which just so happens to be the same format as expected by %.*s. printf("%.*s\n", str); <===> printf("%.*s\n", str.length, &str[0]); While we're on the topic of speed hacking, though, might I suggest the following for improving application performance: main() { return 0; } (it reduces memory consumption too!) ;) -- andyBesides, it's stack space. Nothing is faster than stack allocation. (sub esp, ...)Sure there is. Not allocating is infinitely faster. :)
Aug 03 2004
While we're on the topic of speed hacking, though, might I suggest the following for improving application performance: main() { return 0; } (it reduces memory consumption too!)That is post-mature optimization. You should never have created application.d in the first place! :-)
Aug 04 2004
In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...D arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)Not sure I agree in this case. Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler). I had actually added wrapper functions to unformatted read/write all primitive types but recently removed them because they seemed redundant. I suppose if there's enough of a demand I'll add them back. Sean
Aug 04 2004
Sean Kelly wrote:
In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
D arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)
Not sure I agree in this case. Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).
I am pretty sure the second read in your example passes it by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to a function.
Aug 04 2004
On Wed, 04 Aug 2004 11:55:43 -0400, parabolis <parabolis softhome.net> wrote:Sean Kelly wrote:It's not guaranteed to be valid. replace x.sizeof with 1000 and it's an invalid ubyte[] array. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...I am pretty sure the second read in your example parses it be treating the address of x as a ubyte array and then slicing into which creates a valid ubyte[] array to pass to a function.D arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)Not sure I agree in this case. Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).
Aug 04 2004
Sean Kelly wrote:In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...I changed my mind. You're right. :) Getting an invalid array is hard, except when you start slicing pointers, at which point it becomes a bit too easy. -- andyD arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)Not sure I agree in this case. Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).
Aug 04 2004
Andy Friesen wrote:That is what I meant by a wrapper. It is actually defined in phobos\internal\adi.d Given that it is a struct it will be created on the stack and thus not GCed. However I still like to have the option to decide between the two. :)I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.Slicing does not create garbage. Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them. An array is quite literally little more than this: struct Array(T) { T* data; int length; }Might I suggest that DataSources and DataSinks use void[]? void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenerio is uncertain, but they are exceptionally well suited to the task regardless. (incidently, slicing void* is legal as well)I had no idea there is a void[] in D and will have to consider it. As I explained in another post this is a textbook example of when *not* to use void*. If void[] exists then its use might be justified but honestly it warps my mind even trying to consider it.The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])
Aug 03 2004
Andy Friesen wrote:
Might I suggest that DataSources and DataSinks use void[]? void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenerio is uncertain, but they are exceptionally well suited to the task regardless.
This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns. However, I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[]. This suggests that at least some people using/writing functions with void[] parameters will do strange things. I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
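A two-line check of the behaviour being described (nothing here is library-specific):

    void main()
    {
        int[] ints = new int[3];
        void[] v = ints;            // any array converts to void[]; no copy is made

        assert(ints.length == 3);   // three elements
        assert(v.length == 12);     // 3 * int.sizeof bytes, because void.sizeof is 1
    }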
Aug 05 2004
On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis softhome.net> wrote:Andy Friesen wrote:I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].Might I suggest that DataSources and DataSinks use void[]? void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenerio is uncertain, but they are exceptionally well suited to the task regardless.This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unspected thing when it gives you a byte count in .length.That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].But then you cannot address each of the 4 bytes of each int.This suggests that at least some people using/writing functions with void[] parameters will do strange things.Have you used 'void' as a type before, I suspect only people who have not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.I believe the ensuing confusion warrants using a ubyte[] which which has behaviour that people will already understand.I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 05 2004
Regan Heath wrote:On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis softhome.net> wrote:My argument is that there exists a program in which a bug will be caught. You argument is that there does not exist a program such that a bug will be caught (or that for all programs there is no program such that a bug is caught). Assuming we have the function: read_bad(void*,uint len) read_good(ubyte[],uint len) A exerpt from program P in which a bug is caught is as follows: ============================== P ============================== ubyte ex[256]; read_bad(ex,0xFFFF_FFFF); // memory overwritten read_good(ex,0xFFFF_FFFF); // exception thrown ================================================================ P contains a bug that is caught using an array parameter. The existance of P simultaneously proves my argument and disproves yours. Yet we have had this discussion before and you seem to insist that since you can find examples where a bug is not caught my argument must be wrong somehow. I am not familiar with any logic in which such claims are expected. Either you will have to explain the logic system you are using to me so I can explain my claim properly or you will have to use the one I am using. Here are some links to mine: http://en.wikipedia.org/wiki/Logic http://en.wikipedia.org/wiki/Predicate_logic http://en.wikipedia.org/wiki/Universal_quantifier http://en.wikipedia.org/wiki/Existential_quantifierAndy Friesen wrote:I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].Might I suggest that DataSources and DataSinks use void[]? void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenerio is uncertain, but they are exceptionally well suited to the task regardless.This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.Wonderful guess. It is entirely more complicated than a ubyte[] being a partition of memory on 8-bit boundries and knowing how the length and sizeof will work.However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unspected thing when it gives you a byte count in .length.That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.Yes that was exactly my point.The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].But then you cannot address each of the 4 bytes of each int.No I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. 
I have used void* frequently in C/C++ but the size of any void* variables is of course the size of any pointer.
This suggests that at least some people using/writing functions with void[] parameters will do strange things.
Have you used 'void' as a type before, I suspect only people who have not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.
Or using ubyte[] will write the documentation for me and provide some assurance that people who did not read the docs will have a chance of getting it right from the start.
I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.
No, actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into a byte forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].
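To make the excerpt above concrete, here is one way the two signatures might be fleshed out. The bodies are stand-ins (assumptions, not any existing library), but they show where the ubyte[] version gets a length it can check and the void* version does not.

    // Stand-in implementations of the read_bad()/read_good() pair from the
    // excerpt; only the length check matters here, the actual read is faked.
    void read_bad(void* dst, uint len)
    {
        // nothing here can tell whether dst really has room for len bytes
    }

    void read_good(ubyte[] dst, uint len)
    {
        if (len > dst.length)
            throw new Exception("read_good: destination buffer too small");
        // ... fill dst[0..len] ...
    }

    void main()
    {
        ubyte[256] ex;
        bool caught = false;

        read_bad(&ex[0], 0xFFFF_FFFF);      // accepted silently
        try
        {
            read_good(ex, 0xFFFF_FFFF);     // the callee can see ex.length == 256
        }
        catch (Exception e)
        {
            caught = true;
        }
        assert(caught);
    }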
Aug 06 2004
On Fri, 06 Aug 2004 14:29:19 -0400, parabolis <parabolis softhome.net> wrote: <snip>Ok.I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].My argument is that there exists a program in which a bug will be caught.You argument is that there does not exist a program such that a bug will be caught (or that for all programs there is no program such that a bug is caught).I make no such argument. In fact I am having a hard time following the above sentence.Assuming we have the function: read_bad(void*,uint len) read_good(ubyte[],uint len) A exerpt from program P in which a bug is caught is as follows: ============================== P ============================== ubyte ex[256]; read_bad(ex,0xFFFF_FFFF); // memory overwritten read_good(ex,0xFFFF_FFFF); // exception thrown ================================================================ P contains a bug that is caught using an array parameter. The existance of P simultaneously proves my argument and disproves yours.I don't think you understand my argument.Yet we have had this discussion before and you seem to insist that since you can find examples where a bug is not caught my argument must be wrong somehow. I am not familiar with any logic in which such claims are expected. Either you will have to explain the logic system you are using to me so I can explain my claim properly or you will have to use the one I am using. Here are some links to mine: http://en.wikipedia.org/wiki/Logic http://en.wikipedia.org/wiki/Predicate_logic http://en.wikipedia.org/wiki/Universal_quantifier http://en.wikipedia.org/wiki/Existential_quantifierYou are missing my point. There is one pivotal fact in this debate and that is that an array is _not_ guaranteed to be correct about it's own length. Consider: void read(void[] a, int length) { if (length > a.length) throw new Exception(..); } void main() { char* p = "0123456789"; read(&p[0..1000],1000); } no exception is throw and memory is overwritten. My point, which you seem to have missed, is simply: "An array is _not_ guaranteed to be correct about it's own length" The reason this point is all important in this debate is that when trying to write basic types and structs you _will_ need to create the array from the basic type, when doing so you _will_ need to define the arrays length manually. So there is _always_ going to be the same risk of error regardless of whether you use void[] or void*. Since the risk is the same in either case I vote for the clearest/cleanest/simplest code, as this reduces the risk of error slightly, the code I propsed: bool read(out int x) { return read(&x,x.sizeof) == x.sizeof; } is cleaner/clearer and simpler than bool read(out int x) { return read(&x[0..x.sizeof],x.sizeof) == x.sizeof); }The semantics of void* are quite well known, anyone who has used it knows what I have described above, anyone who doesn't will read the docs on void* before they start. Anyone who _guesses_ what will happen is asking for trouble.Wonderful guess. It is entirely more complicated than a ubyte[] being a partition of memory on 8-bit boundries and knowing how the length and sizeof will work.However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unspected thing when it gives you a byte count in .length.That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. 
As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.So we agree, void[] works in a logical fashion.Yes that was exactly my point.The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].But then you cannot address each of the 4 bytes of each int.In that case why didn't you read the documentation on the void[] type?No I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++ but the size of any void* variables is of course the size of any pointer.This suggests that at least some people using/writing functions with void[] parameters will do strange things.Have you used 'void' as a type before, I suspect only people who haveRubbish, you're basically asserting that ubyte[] is known of by everyone, and that is simply not true.not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.Or using ubyte[] will write the documentation for me and provide some assurance that in cases in which people did not read the docs will have a chance of getting it right from the start.Ease of use?! How is this: bool read(out int x) { return read(&x[0..x.sizeof],x.sizeof) == x.sizeof); } easier than: bool read(out int x) { return read(&x,x.sizeof) == x.sizeof; } ? There seems to be 2 points in this argument, "what is easier" and "what is safer", my opinion which I have tried to demonstrate is that neither is safer and void* is easier. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/No actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into a byte forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].I believe the ensuing confusion warrants using a ubyte[] which which has behaviour that people will already understand.I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.
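For the record, both wrapper styles compile and behave the same once the slice is spelled in valid D: slicing the pointer, (&x)[0..1], yields an int[1] that converts to a void[] of int.sizeof bytes (the &x[0..x.sizeof] spelling used above does not compile). A self-contained sketch, with stand-in read() primitives that just zero the destination:

    import std.c.string;   // memset

    // Stand-in primitives (assumptions, not any particular library's API).
    size_t read(void[] dst)             // void[] flavour: length travels with the data
    {
        memset(dst.ptr, 0, dst.length);
        return dst.length;
    }

    size_t read(void* dst, size_t len)  // void* flavour: length is a separate argument
    {
        memset(dst, 0, len);
        return len;
    }

    // The two wrapper styles being compared in the post above.
    bool readIntA(out int x) { return read((&x)[0..1]) == x.sizeof; }
    bool readIntB(out int x) { return read(&x, x.sizeof) == x.sizeof; }

    void main()
    {
        int a, b;
        assert(readIntA(a) && readIntB(b));
    }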
Aug 08 2004
In article <ceulss$2fj6$1 digitaldaemon.com>, parabolis says...void[] does a rather unspected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1. Jill
Aug 06 2004
Arcane Jill wrote:
In article <ceulss$2fj6$1 digitaldaemon.com>, parabolis says...
void[] does a rather unspected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].
For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1.
Sorry I meant from the docs http://www.digitalmars.com/d/type.html:
void   no type
bit    single bit
byte   signed 8 bits
ubyte  unsigned 8 bits
....
Aug 06 2004
In article <cf0e4q$mqi$1 digitaldaemon.com>, parabolis says...
Sorry I meant from the docs http://www.digitalmars.com/d/type.html:
void   no type
bit    single bit
byte   signed 8 bits
ubyte  unsigned 8 bits
....
Probably an esoteric question, but I assume that the byte size guarantee is only for machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes, just wondering...
Sean
Aug 06 2004
Sean Kelly wrote:In article <cf0e4q$mqi$1 digitaldaemon.com>, parabolis says...Actually that is not a terribly esoteric question. I do not believe the D byte is the same as the C/C++ char. (Which is what I assume you are referring to in this case.) I would be curious to know the answer as well. I would also be curious how a compiler would deal with Harvard architecture.Sorry I meant from the docs http://www.digitalmars.com/d/type.html: void no type bit single bit byte signed 8 bits ubyte unsigned 8 bits ....Probably an esoteric question, but I assume that the byte size gurantee is only for machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes, just wondering...
Aug 06 2004
For another perspective/idea have a look at my thread entitled "My stream concept". I use template bolt-ins. There was a little problem with it, which was actually trivial to fix, I simply replaced the 'super.' calls with 'this.' calls. It should also be noted that my idea was strictly for creating the base level stream classes from the various devices i.e. File, Socket, Memory etc. The next step is to add filters (as described by Arcane Jill) I am hoping an idea will come to me as to how I can do that, without needing: new MemoryMap(new UTF16Filter(new Stream())); Regan On Tue, 3 Aug 2004 19:36:19 +0000 (UTC), Sean Kelly <sean f4.ca> wrote:I finally got back on my stream mods today and had a question: how should the wrapper class know the encoding scheme of the low-level data? For example, say all of the formatted IO code is in a mixin or base class (assume base class for the same of discussion) that calls a read(void*, size_t) or write(void*, size_t) method in the derived class. Now say I want to read a char, wchar, or dchar from the stream. How many bytes should I read and how do I know what the encoding format is? C++ streams handle this fairly simply by making the char type a template parameter: This has the obvious limitation that the programmer must instantiate the proper type of stream for the data format he is trying to read (as there is only one get/put method for any char type: CharT). But it makes things pretty explicit: Stream!(char) means "this is a stream formatted in UTF8." The other option I can think off offhand would be to have a class member that the derived class could set which specifies the encoding format: This has tbe benefit of allowing the user to read and write any char type with a single instantiation, but requires greater complexity in the Stream class and in the Derived class. And I wonder if such flexibility is truly necessary. Any other design possibilities? Preferences? I'm really trying to establish a good formatted IO design than work out the perfect stream API. Any other weird issues would be welcome also. Sean-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
In article <opsb6d5aw85a2sq9 digitalmars.com>, Regan Heath says...
For another perspective/idea have a look at my thread entitled "My stream concept". I use template bolt-ins. There was a little problem with it, which was actually trivial to fix, I simply replaced the 'super.' calls with 'this.' calls. It should also be noted that my idea was strictly for creating the base level stream classes from the various devices i.e. File, Socket, Memory etc. The next step is to add filters (as described by Arcane Jill) I am hoping an idea will come to me as to how I can do that, without needing: new MemoryMap(new UTF16Filter(new Stream()));
My design really set out to extend the original stream approach, and it seemed the logical extension was pretty C++ like. I ended up creating a basic set of interfaces--Stream, InputStream, and OutputStream--and putting all the implementation in templates meant to be mixins. This was somewhat necessary to support the multiple inheritance type model. So the input file stream looks something like this: }
Works quite well but it's very different from the Java approach. I'm still not sure which I like better, though I'll grant that the Java version is more flexible (at the expense of verbosity). The other potential issue is the top-heaviness of the design. I am warming up to the idea of separate reader/writer adaptor classes.
Sean
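The code snippet referred to above did not survive in this archive (only a stray brace remains), so the following is a guess at the general shape being described rather than Sean's actual code: the low-level primitive lives in the device class, and the formatted helpers are written once in a template and mixed in. A template mixin is instantiated in the scope where it appears, so this.read() binds to the host class.

    import std.c.string;   // memset, used by the stand-in read()

    // Formatted-read helpers, written once and bolted onto any class that
    // provides a read(void*, size_t) primitive. Names are illustrative.
    template FormattedRead()
    {
        int readInt()
        {
            int x;
            this.read(&x, x.sizeof);
            return x;
        }
    }

    class FileInputStream
    {
        mixin FormattedRead!();        // pulls readInt() into this class

        size_t read(void* dst, size_t len)
        {
            memset(dst, 0, len);       // placeholder for the real file read
            return len;
        }
    }

    void main()
    {
        FileInputStream s = new FileInputStream;
        assert(s.readInt() == 0);      // the placeholder read yields zeros
    }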
Aug 03 2004
Sean Kelly wrote:
Works quite well but it's very different from the Java approach. I'm still not sure which I like better, though I'll grant that the Java version is more flexible (at the expense of verbosity). The other potential issue is the top-heaviness of the design. I am warming up to the idea of separate reader/writer adaptor classes.
I probably should have made the argument explicit, but I do believe dealing with incoming and outgoing data at the same time is susceptible to multi-threading issues. If your code is MT safe then you probably did much more work than you had to with little apparent benefit.
Aug 03 2004
Sean Kelly wrote:Works quite well but it's very different from the Java approach. I'm still not sure which I like better, though I'll grant that the Java version is more flexible (at the expense of verbosity). The other potential issue is theI also meant to suggest that I really like much less verbose class names like: FileIS and FileOS...
Aug 03 2004
What you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), so it might be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined). As a bonus, there's a ton of functionality already built on top of mango.io, including http-server, servlet-engine, clustering, logging, local & remote object caching; even tossing remote D executable objects around a local network. The DSP project is also targeting Mango as a delivery mechanism. Check them out over at dsource.org. I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not? "Sean Kelly" <sean f4.ca> wrote in message news:ceopfj$1hcl$1 digitaldaemon.com...I finally got back on my stream mods today and had a question: how shouldthewrapper class know the encoding scheme of the low-level data? For example, say all of the formatted IO code is in a mixin or base class (assume base class for the same of discussion) that calls a read(void*,size_t)or write(void*, size_t) method in the derived class. Now say I want toread achar, wchar, or dchar from the stream. How many bytes should I read andhow doI know what the encoding format is? C++ streams handle this fairly simplybymaking the char type a template parameter: This has the obvious limitation that the programmer must instantiate thepropertype of stream for the data format he is trying to read (as there is onlyoneget/put method for any char type: CharT). But it makes things prettyexplicit:Stream!(char) means "this is a stream formatted in UTF8." The other option I can think off offhand would be to have a class memberthatthe derived class could set which specifies the encoding format: This has tbe benefit of allowing the user to read and write any char typewith asingle instantiation, but requires greater complexity in the Stream classand inthe Derived class. And I wonder if such flexibility is truly necessary. Any other design possibilities? Preferences? I'm really trying toestablish agood formatted IO design than work out the perfect stream API. Any otherweirdissues would be welcome also. Sean
Aug 03 2004
In article <cep4dd$1nde$1 digitaldaemon.com>, antiAlias says...
What you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), so it might be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).
Yup. I've played around with Mango and kind of like it. One of the reasons I started these stream mods was to have an alternate design to compare to Mango for the sake of discussion. i.e. I don't want folks to settle on Mango simply because the other choices are missing features.
I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?
Definitely.
Sean
Aug 03 2004
You are absolutely right. But not many people seem to know about Mango, so the opportunity for "spreading the news" was too great to pass up :-) "Sean Kelly" <sean f4.ca> wrote in message news:cep64d$1o0t$1 digitaldaemon.com...In article <cep4dd$1nde$1 digitaldaemon.com>, antiAlias says...itWhat you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), soreasons Imight be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).Yup. I've played around with Mango and kind of like it. One of thestarted these stream mods was to have an alternate design to compare toMangofor the sake of discussion. ie. I don't want folks to settle on Mangosimplybecause the other choices are missing features.I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?Definately. Sean
Aug 03 2004
antiAlias wrote:
What you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), so it might be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).
I can't help but ask how it manages to do both input and output and still avoid multi-threading issues?
As a bonus, there's a ton of functionality already built on top of mango.io, including http-server, servlet-engine, clustering, logging, local & remote object caching; even tossing remote D executable objects around a local network. The DSP project is also targeting Mango as a delivery mechanism. Check them out over at dsource.org.
I have only started looking over the library. It is rather extensive. The source is well documented and organized. Both are rare to see. I am not fond of the PDF format. Anyway, I am impressed at the surface. I will take a look deeper within.
I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?
On the note of competing libraries, I could not help but notice your primes.d implementation. You might want to look at the primes.d on Deimos and consider using that instead. It is rather cleverly designed and could be tuned to do no worse than your bsearch for all ushort values.
Aug 03 2004
The primes.d thing is now a distant and foggy memory :-) Can I hook you up with a copy of the latest (much better, with annotated source) documentation? You'll see Primes.d is gone, along with some other warts: http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip "parabolis" <parabolis softhome.net> wrote in message news:cep9ee$1ov1$1 digitaldaemon.com...antiAlias wrote:itWhat you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), sodocumented,might be worthwhile checking it out. It's also house-trained,mango.io,and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).I cant help but ask how it manages to do both input and output and still avoid multi-threading issues?As a bonus, there's a ton of functionality already built on top ofremoteincluding http-server, servlet-engine, clustering, logging, local &mechanism.object caching; even tossing remote D executable objects around a local network. The DSP project is also targeting Mango as a deliveryCheck them out over at dsource.org.I have only started looking over the library. It is rather extensive. The source is well documented and organized. Both are rare to see. I am not fond of the pdf format. Anyway I am impressed at the surface. I will take a look deeper within.I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?On the note of competing libraries I could not help but notice your primes.d implementation. You might want to look at the primes.d on Deimos and consider using that instead. It is rather cleverly designed and could be tuned to do no worse than your bsearch for all ushort values.
Aug 03 2004
antiAlias wrote:
The primes.d thing is now a distant and foggy memory :-) Can I hook you up with a copy of the latest (much better, with annotated source) documentation? You'll see Primes.d is gone, along with some other warts: http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip
lol - The Mango Tree... just got it from the docs :) I am now without question under the belief that the mango docs are great. I was going to suggest in my last post that I would like to see some docs that cover more of the concept area than just doxygen stuff. I decided that it would probably be too much to expect :)
Quote:
================================================================
Note that these Tokenizers do not maintain any state of their own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :) However, I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?
Aug 03 2004
"parabolis" wrote..Quote: ================================================================ Note that these Tokenizers do not maintain any state of their own. Thus they are all thread-safe. ================================================================ This is always good to know from documentation. :) However I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?Doing it the introspection way (ala Java) has a bunch of issues all of it's own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though). IPickle was a nice and simple way to approach it; there's no monkey business anywhere (like Java has), it's explicit, and it's very fast. While not an overriding design factor, throughput is one of the main things all the Mango branches/packages keep an watchful eye upon. Frankly, I'd like to see a decent introspection approach emerge along the way; perhaps as a complement rather than a replacement: within Mango there's no obvious reason why the two approaches could not produce an equivalent serialized stream, and therefore be interchangeable at the endpoints. This is one area where I think getting other people involved in the project would help tremendously.
Aug 03 2004
antiAlias wrote:"parabolis" wrote..I think I was premature to suppose D could do that. I just gave the issue some thought and there is just enough introspection to make a shallow copy which is obviously not sufficient.Quote: ================================================================ Note that these Tokenizers do not maintain any state of their own. Thus they are all thread-safe. ================================================================ This is always good to know from documentation. :) However I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?Doing it the introspection way (ala Java) has a bunch of issues all of it's own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though).IPickle was a nice and simple way to approach it; there's no monkey business anywhere (like Java has), it's explicit, and it's very fast. While not an overriding design factor, throughput is one of the main things all the Mango branches/packages keep an watchful eye upon. Frankly, I'd like to see a decent introspection approach emerge along the way; perhaps as a complement rather than a replacement: within Mango there's no obvious reason why the two approaches could not produce an equivalent serialized stream, and therefore be interchangeable at the endpoints.Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However given one of those holds then the serialization ought to be compatible.This is one area where I think getting other people involved in the project would help tremendously.I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)
Aug 03 2004
"parabolis" <parabolis softhome.net> wrote in message news:cepppv$1ugt$1 digitaldaemon.com...antiAlias wrote:it's"parabolis" wrote..Quote: ================================================================ Note that these Tokenizers do not maintain any state of their own. Thus they are all thread-safe. ================================================================ This is always good to know from documentation. :) However I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?Doing it the introspection way (ala Java) has a bunch of issues all ofyet (Iown, and D doesn't have the power to expose all the requisite data asbusinesscould be wrong on the latter though).I think I was premature to suppose D could do that. I just gave the issue some thought and there is just enough introspection to make a shallow copy which is obviously not sufficient.IPickle was a nice and simple way to approach it; there's no monkeyananywhere (like Java has), it's explicit, and it's very fast. While notMangooverriding design factor, throughput is one of the main things all thecomplementbranches/packages keep an watchful eye upon. Frankly, I'd like to see a decent introspection approach emerge along the way; perhaps as atherather than a replacement: within Mango there's no obvious reason whyprojecttwo approaches could not produce an equivalent serialized stream, and therefore be interchangeable at the endpoints.Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However given one of those holds then the serialization ought to be compatible.This is one area where I think getting other people involved in theThere's lots to do <g> Here's some things that have been noted: http://www.dsource.org/forums/viewtopic.php?t=174&sid=f5f234d101f0405ebaf9cb df728af44a And here's some more: http://www.dsource.org/forums/viewtopic.php?t=157&sid=f5f234d101f0405ebaf9cb df728af44a That's just the tip of the iceberg though. For example, there's no Unicode support as yet since we decided to wait until Hauke & AJ released all the requisite pieces (better to do it properly); IO filters/decorators such as companders have not actually been implemented yet, although there's a solid placeholder for them; there's some annoying things that are currently unimplemented on Unix (noted in the documentation todo list); etc. etc. Plenty of room for improvement all over the place, and that's before you hit the upper decks :-) The project is very open to other packages hooking in at any level: as a peer, as part of the Mango Tree itself, or as a package user. For example, there's currently a bit-sliced XML/SAX engine in the works (okay; "byte-sliced" then), plus the DSP project mentioned earlier (which looks to be really uber cool ... everyone should check that one out). Having real-world user-code drive the design and functionality is of truly immense value: the bad stuff is typically identified and removed/replaced rather quickly. Anyone who would like to get involved, please jump on the dsource.org forums!would help tremendously.I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)
Aug 03 2004
"Sean Kelly" <sean f4.ca> wrote in message news:ceopfj$1hcl$1 digitaldaemon.com...This has tbe benefit of allowing the user to read and write any char typewith asingle instantiation, but requires greater complexity in the Stream classand inthe Derived class. And I wonder if such flexibility is truly necessary. Any other design possibilities? Preferences? I'm really trying toestablish agood formatted IO design than work out the perfect stream API. Any otherweirdissues would be welcome also.I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions. Also, formats like UTF-16 have two variants, big end and little end. It should also be able to read data in other formats, such as code pages, and convert them to utf. These cannot be auto-detected.
Aug 03 2004
In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions.With all due respect, Walter, that's not really feasible. It is very hard, for example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not all encodings make life that easy for us. You can't use an enum, because there are an unlimited number of possible encodings. Besides, if you're parsing an HTTP header, and if, within that header, you read "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you know what the encoding of the following document is going to be. Other formats have different indicators (HTML meta tags; Python source file comments; -the list is endless). Only at the application level can you /really/ sort this out, because the application presumably knows what it's looking at.Also, formats like UTF-16 have two variants, big end and little end.Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.It should also be able to read data in other formats, such as code pages, and convert them to utf. These cannot be auto-detected.I think that's the whole point. Windows code pages /are/ encodings. WINDOWS-1252 is an encoding, same as UTF-8. I think people here are talking about encodings generally, not just UTFs. Jill
Aug 03 2004
In article <ceq0mg$20d8$1 digitaldaemon.com>, Arcane Jill says...In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...That reminds me. Which format does the code in utf.d use? I'm thinking I may do something like this for encoding for now: enum Format { UTF8 = 0, UTF16 = 1, UTF16LE = 1, UTF16BE = 2 } So "UTF-16" would actually default to one of the two methods. SeanAlso, formats like UTF-16 have two variants, big end and little end.Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.
Aug 04 2004
In article <ceqv9a$15b$1 digitaldaemon.com>, Sean Kelly says...That reminds me. Which format does the code in utf.d use?To be honest, I don't understand the question.I'm thinking I may do something like this for encoding for now: enum Format { UTF8 = 0, UTF16 = 1, UTF16LE = 1, UTF16BE = 2 } So "UTF-16" would actually default to one of the two methods.Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map name to number. (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?) Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order. Is there a big endian version? Is there a network byte order version? Should there be? Jill
Aug 04 2004
In article <cer4k8$7jj$1 digitaldaemon.com>, Arcane Jill says...In article <ceqv9a$15b$1 digitaldaemon.com>, Sean Kelly says...std.utf has methods like toUTF16. But does this target the big or little endian encoding scheme? I suppose I could assume it corresponds to the byte order of the target machine, but this would imply different behavior on different platforms.That reminds me. Which format does the code in utf.d use?To be honest, I don't understand the question.This raises an interesting question. Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.I'm thinking I may do something like this for encoding for now: enum Format { UTF8 = 0, UTF16 = 1, UTF16LE = 1, UTF16BE = 2 } So "UTF-16" would actually default to one of the two methods.Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map name to number. (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order. Is there a big endian version? Is there a network byte order version?Not currently. This corresponds to the C++ design: unformatted IO is assumed to be in the byte order of the host platform.Should there be?Probably. Or at least one that converts to/from network byte order. I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc. Sean
Aug 04 2004
map nameWhatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can youinto anto number. (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn thathandledenum?)This raises an interesting question. Rather than having the encodingdirectly by the Stream layer perhaps it should be dropped into anotherclass. Ican't imagine coding a base lib to support "Joe's custom encoding scheme."Forthe moment though, I think I'll leave stream.d as-is. This seems like adesignissue that will take a bit of talk to get right.I wonder if delegates could help out here. Instead of subclasses or wrapping a stream in another stream the primary Stream class could have a delegate to sort out big/little endian or encoding issues. I'm not exactly sure how it would work but it's worth investigating. There might be issues with sharing data between the stream and the encoder/decoder delegate.int),Got an unrelated question for you. In the stream function void read(outendian,there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always littlethere aregardless of host architecture, or (b) it's always host-byte order. Isassumed tobig endian version? Is there a network byte order version?Not currently. This corresponds to the C++ design: unformatted IO isbe in the byte order of the host platform.thatShould there be?Probably. Or at least one that converts to/from network byte order. I'll probably have the first cut of stream.d done in a few more days and afterwe can talk about what's wrong with it, etc. Sean
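One possible shape for the delegate idea floated above, purely as a sketch (none of these names exist anywhere): the stream hands each multi-byte value's bytes to a caller-supplied delegate before writing them, so byte order becomes the caller's decision.

    import std.c.stdio;

    class DelegatingWriter
    {
        void delegate(ubyte[] bytes) fixup;   // null means "write host order as-is"

        void writeInt(int x)
        {
            ubyte[int.sizeof] buf;
            *cast(int*) &buf[0] = x;          // host-order bytes of x
            if (fixup !is null)
                fixup(buf);                   // e.g. swap to the wire byte order
            rawWrite(buf);
        }

        void rawWrite(ubyte[] bytes)
        {
            // placeholder: a real stream would push these bytes to its device
            for (size_t i = 0; i < bytes.length; i++)
                printf("%02x ", bytes[i]);
            printf("\n");
        }
    }

    void main()
    {
        // a nested function taken by address gives us a delegate
        void swapBytes(ubyte[] b)
        {
            for (size_t i = 0, j = b.length - 1; i < j; i++, j--)
            {
                ubyte t = b[i]; b[i] = b[j]; b[j] = t;
            }
        }

        DelegatingWriter w = new DelegatingWriter;
        w.writeInt(1);           // host byte order
        w.fixup = &swapBytes;    // every write now runs through the swapper
        w.writeInt(1);           // opposite byte order
    }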
Aug 04 2004
In article <cer7fh$9t5$1 digitaldaemon.com>, Sean Kelly says...std.utf has methods like toUTF16. But does this target the big or little endian encoding scheme? I suppose I could assume it corresponds to the byte order of the target machine, but this would imply different behavior on different platforms.Neither, really. toUTF16 returns an array of wchars, not an array of chars, so (conceptually) there is no byte-order issue involved. A wchar is (conceptually) a sixteen bit wide value, with bit 0 being the low order bit, and bit 15 being the high order bit. Byte ordering doesn't come into it. Problems occur, however, when a wchar or a dchar leaves the nice safe environment of D and heads out into a stream. Only then does byte ordering become an issue (as it does also with arrays of ints, etc.). If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data don't change, only the reference has a different type. In practice, this means you have (inadvertantly) applied a host-byte-order encoding to the array. There doesn't seem to be much that a stream can do about this, so, I reckon the problem here lies not with the stream, but with the cast. In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?This raises an interesting question. Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.Right. Someone writing an application ought to be able to make their own transcoder (extending a library-defined base class; implementing a library-defined interface; whatever). Let's say that (in an application, not a library) I define classes JoesCustomReader and JoesCustomWriter. Now, I should still be able to do: and read the file. If a reader needs to be identified by a globally unique enum, then I can't do that without the possibility of an enum value clash. But if, on the other hand, they are identified by a string, then the possibility of a clash becomes vanishingly small. I do agree with you that registration of readers/writers and the dispatching mechanism is something best left until later, however. Jill
Aug 04 2004
In article <ceraen$c48$1 digitaldaemon.com>, Arcane Jill says...Problems occur, however, when a wchar or a dchar leaves the nice safe environment of D and heads out into a stream. Only then does byte ordering become an issue (as it does also with arrays of ints, etc.).Bah. Of course. So the two UTF schemes just depend on the byte order when serialized. Makes sense.If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data don't change, only the reference has a different type. In practice, this means you have (inadvertantly) applied a host-byte-order encoding to the array. There doesn't seem to be much that a stream can do about this, so, I reckon the problem here lies not with the stream, but with the cast. In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?I think byte order should be specified, perhaps as a quality of the stream. It could default to native and perhaps be switchable? The only other catch I see is that a console stream should probably ignore this setting and always leave everything in native format. In any case, this byte order would affect encoding schemes using > 1 byte characters and perhaps a new set of unformatted IO methods as well. Again something I'm going to ignore for now as it's more complexity than we need quite yet. Sean
Aug 04 2004
In article <cerbsa$d02$1 digitaldaemon.com>, Sean Kelly says...In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?I think byte order should be specified, perhaps as a quality of the stream. It could default to native and perhaps be switchable?Well, from one point of view, the problem we've got here is serialization. How do you serialize an array of primitive types having sizeof > 1? This boils down to a simpler question: how do you serialize a single primitive with sizeof > 1. Let's cut to a clear example - how do you serialize an int? std.stream.Stream.write(int) serializes in little-endian order. But the specs say "Outside of byte, ubyte, and char, the format is implementation-specific and should only be used in conjunction with read." I think this is scary. Perhaps it would be better for a stream to /mandate/ the order. As you suggest, it could be a property of the stream, but there are disadvantages to that - if you chain a whole bunch of streams together, each with different endianness, you could end up with a lot of byteswapping going on. Another possibility might be to ditch the function write(int), and replace it with two functions, writeBE(int) and writeLE(int), (and similarly with all other primitive types). That would be absolutely guaranteed to be platform independent. Of course that applies to wchar and dchar too, but the whole point of encodings (well, /one/ of the points of encodings anyway) is that you never have to spit out anything other than a stream of /bytes/. The encoding itself determines the byte order. There really is no such encoding as "UTF-16" (although calling wchar[]s UTF-16 does make sense). As far as actual encodings are concerned, the name "UTF-16" is just a shorthand way of saying "either UTF-16LE or UTF-16BE". When reading, you have to auto-detect between them, but once you've /established/ the encoding, then you rewind the stream and start reading it again with the now known encoding. When writing, you get to choose, arbitrarily (so you would probably choose native byte order), but you can make it easier for subsequent readers to auto-detect by writing a BOM at the start of the stream. How does this affect users' code? Well, you simply don't allow anyone to write (i.e. you define no such class). Instead, give them a factory method. Make them write: or even (but we said we wouldn't talk about dispatching yet, so let's stick with createUTF16Reader() to keep things simple) The function createUTF16Reader() reads the underlying stream, auto-detects between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a UTF16BEReader, and returns it. Somehow it needs a method of pushing back the characters it's already read into the stream. Then, when the caller calls s.read(), the exact encoding is known, and the stream is (re)read from the start.The only other catch I see is that a console stream should probably ignore this setting and always leave everything in native format.Maybe writeLE() and writeBE() could be supplemented by writeNative(), with the warning that it's no longer cross-platform? 
(Of course, the function write() does that right now, but calling it writeNative() would give you a clue that you were doing something a bit parochial).In any case, this byte order would affect encoding schemes using > 1 byte characters and perhaps a new set of unformatted IO methods as well.I don't think it would affect encodings at all, only the serialization of primitive types other than byte, ubyte and char. Transcoders, as I said, read or write /bytes/ to or from an underlying stream (but have dchar read() and/or void write(dchar) methods for callers to use).Again something I'm going to ignore for now as it's more complexity than we need quite yet.Righty ho. I vaguely remember Hauke saying he was working on a class to do something about transcoding issues, but I don't know the specifics. Arcane Jill
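A sketch of the writeBE/writeLE idea in isolation: serialize with shifts so the byte order is fixed by the function rather than by the host CPU. In a real stream these would sit next to the other write methods and push into the underlying byte sink; here they just fill a caller-supplied buffer (the names are illustrative).

    void putBE(int x, ubyte[] b)    // big-endian: most significant byte first
    {
        b[0] = cast(ubyte)(x >>> 24);
        b[1] = cast(ubyte)(x >>> 16);
        b[2] = cast(ubyte)(x >>> 8);
        b[3] = cast(ubyte) x;
    }

    void putLE(int x, ubyte[] b)    // little-endian: least significant byte first
    {
        b[0] = cast(ubyte) x;
        b[1] = cast(ubyte)(x >>> 8);
        b[2] = cast(ubyte)(x >>> 16);
        b[3] = cast(ubyte)(x >>> 24);
    }

    void main()
    {
        ubyte[4] be, le;
        putBE(0x11223344, be);
        putLE(0x11223344, le);
        assert(be[0] == 0x11 && be[3] == 0x44);   // same result on every host
        assert(le[0] == 0x44 && le[3] == 0x11);
    }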
Aug 04 2004
"Arcane Jill" <Arcane_member pathlink.com> escribió en el mensaje news:cerk3u$i4f$1 digitaldaemon.com | (so you would probably choose native byte order), but you can make it easier for | subsequent readers to auto-detect by writing a BOM at the start of the stream. | | ... | | between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a | UTF16BEReader, and returns it. Somehow it needs a method of pushing back the | characters it's already read into the stream. Then, when the caller calls | s.read(), the exact encoding is known, and the stream is (re)read from the | start. | In the former case (the stream includes a BOM), would re-reading from the start include the BOM? If so, what good would it be for a user who just wants to read the file, independent of the encoding? (did I make myself clear?) ----------------------- Carlos Santander Bernal
Aug 04 2004
In article <ces5mu$r8p$1 digitaldaemon.com>, Carlos Santander B. says...

> In the former case (the stream includes a BOM), would re-reading from the start include the BOM?

Good question. I guess probably not. If the encoding is known, then it's known - since a BOM serves only to identify the encoding, you don't need to re-read it in this instance. That said, it's still best that readers be prepared to ignore it. That is, if a reader reads U+FEFF as the first character, it would be harmless to throw that character away and return instead the second one. Pretty much all BOM-related questions are answered here: http://www.unicode.org/faq/utf_bom.html#BOM

> If so, what good would it be for a user who just wants to read the file, independent of the encoding? (did I make myself clear?)

If you fail to discard a BOM and accidentally treat it as a character, it will appear to your application as the character U+FEFF (ZERO WIDTH NO-BREAK SPACE). It will display as a zero-width space. It has a general category of Cf (which actually makes it a formatting control, not a space!). Basically, it tries as hard as it can to do nothing at all. So it's useless to the "user who just wants to read the file" - useless, but harmless, most especially if you can recognise it and throw it away.

Arcane Jill
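The "throw it away" advice amounts to very little code. As a sketch, assuming the text has already been transcoded to dchar[] (this helper is illustrative, not part of any proposed API):

    // Discard a single leading BOM (U+FEFF) from already-decoded text, if present.
    dchar[] stripLeadingBOM(dchar[] text)
    {
        if (text.length > 0 && text[0] == 0xFEFF)
            return text[1 .. text.length];
        return text;
    }

A stream-level reader would do the equivalent on its first read() call and never surface the character to the caller.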
Aug 04 2004
On Wed, 4 Aug 2004 16:58:48 +0000 (UTC), Arcane Jill <Arcane_member pathlink.com> wrote:

<snip>

> Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host byte order? Is there a big endian version? Is there a network byte order version? Should there be?

I think we go with (b). I think it is best handled with a filter, e.g.

    Stream s = new BigEndian(new FileStream("test.dat", FileMode.READ));

so BigEndian looks like:

    class BigEndian { ... }

You'll need a LittleEndian one too. Using the filter you can guarantee the endian-ness of the data. Of course, if you're sending binary data from a LE to a BE system via sockets, you need to know what you're doing, and you need to decide what endian-ness will be used for the transmission; in this case, on one end of the socket you'll need a toBigEndian/toLittleEndian filter.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
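For what the read side of such a filter might look like, here is a rough sketch in D. The RawSource interface and every name below are invented for illustration; this is not the elided body of Regan's BigEndian class, just one way the idea could work.

    // Hypothetical raw-byte source that the filter wraps.
    interface RawSource
    {
        ubyte getByte();
    }

    // Reads ints stored big-endian in the underlying bytes and returns a
    // correct host-order value on any architecture.
    class BigEndianReader
    {
        private RawSource src;

        this(RawSource src)
        {
            this.src = src;
        }

        int readInt()
        {
            int v = 0;
            for (int i = 0; i < 4; i++)
                v = (v << 8) | src.getByte();
            return v;
        }
    }

A LittleEndianReader would accumulate the bytes in the opposite order; for sockets, the two peers only need to agree which of the two is used on the wire.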
Aug 04 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:ceq0mg$20d8$1 digitaldaemon.com...In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...ableI'm one of those folks who is very much in favor of a file reader beingUTFto automatically detect the encoding in it. Hence, D can auto-detect thecanformatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviorsforbe handled with virtual functions.With all due respect, Walter, that's not really feasible. It is very hard,example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward,but notall encodings make life that easy for us. You can't use an enum, becausethereare an unlimited number of possible encodings.I understand there are limits to this. I think it should be done where possible, and that it should not be precluded by design.Besides, if you're parsing an HTTP header, and if, within that header, youread"Content-Type: text/plain; encoding=MAC-ROMAN", then you can be prettysure youknow what the encoding of the following document is going to be. Otherformatshave different indicators (HTML meta tags; Python source filecomments; -thelist is endless). Only at the application level can you /really/ sort thisout,because the application presumably knows what it's looking at.Yes. And this argues for a capability to switch horses midstream, so to speak.
Aug 04 2004