digitalmars.D - std.file.read implementation contest
- Andrei Alexandrescu (76/76) Feb 16 2009 Someone mentioned an old bug in std.file.read here:
- Steven Schveighoffer (48/89) Feb 16 2009 ...
- Andrei Alexandrescu (35/44) Feb 16 2009 [snip]
- Steven Schveighoffer (14/38) Feb 16 2009 Hm... won't this allocate only 15 bytes max if st_size is nonzero?
- Andrei Alexandrescu (5/50) Feb 16 2009 Guess I'll revert to +1. The effect is likely to be negligible anyway.
- Sean Kelly (7/14) Feb 16 2009 Neither one. std.read is intended to read the contents of a file in
- Steven Schveighoffer (11/22) Feb 16 2009 Then you have just created for yourself a never ending bug report :) I
- Andrei Alexandrescu (9/32) Feb 16 2009 I totally agree. The useful spec of std.file.read should be "reads the
- Sean Kelly (4/7) Feb 16 2009 Okay, that's a fair definition. So the correct behavior for reading an
- Andrei Alexandrescu (4/13) Feb 16 2009 I think that's reasonable and will document it as such. Nice example
- Steve Schveighoffer (9/18) Feb 16 2009 Then why doesn't cp have a warning about copying /dev/random or /dev/
- Andrei Alexandrescu (8/17) Feb 17 2009 I thought more about this and a nice solution is to have std.file.read
- bearophile (5/5) Feb 17 2009 Regarding this really useful part of the std lib, I warmly suggest to al...
- Denis Koroskin (2/9) Feb 17 2009 How does it work with /dev/random etc?
- Sean Kelly (10/30) Feb 16 2009 I wasn't saying that at all. But try using this approach on
Someone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.) Andrei // Implementation 1 void[] read(string name) { invariant fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); void[] buf; auto size = statbuf.st_size; if (size == 0) { /* The size could be 0 if the file is a device or a procFS file, * so we just have to try reading it. */ int readsize = 1024; while (1) { buf = GC.realloc(buf.ptr, size + readsize, GC.BlkAttr.NO_SCAN)[0 .. cast(int)size + readsize]; enforce(buf, "Out of memory"); scope(failure) delete buf; auto toread = readsize; while (toread) { auto numread = std.c.linux.linux.read(fd, buf.ptr + size, toread); cenforce(numread != -1, name); size += numread; if (numread == 0) { if (size == 0) // it really was 0 size delete buf; // don't need the buffer return buf[0 .. size]; // end of file } toread -= numread; } } } else { buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size]; enforce(buf, "Out of memory"); scope(failure) delete buf; cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name); return buf[0 .. size]; } } // Implementation 2 void[] read(string name) { immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024; void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN) [0 .. initialAlloc]; scope(failure) delete result; size_t size = 0; for (;;) { immutable actual = std.c.linux.linux.read(fd, result.ptr + size, result.length - size); cenforce(actual != actual.max, name); size += actual; if (size < result.length) break; auto newAlloc = size + 1024 * 4; result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN) [0 .. newAlloc]; } return result[0 .. size]; }
Feb 16 2009
"Andrei Alexandrescu" wroteSomeone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Implementation 2, save 1 bug:Andrei // Implementation 1...else { buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size]; enforce(buf, "Out of memory"); scope(failure) delete buf; cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name); return buf[0 .. size]; } }This doesn't work for /sys files which consistently say they are 4096 bytes large, even though they can be smaller or larger (encountered this with Tango also). I think you have to kill the check that the size read actually equals the size in stat, AND you have to continue reading even after you reach that size.// Implementation 2 void[] read(string name) { immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;Excellent idea, this allocates the minimum amount of space to tell if a file is larger than the file size returned in fstat. This should handle the /sys case and /proc case properly.void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN) [0 .. initialAlloc]; scope(failure) delete result; size_t size = 0; for (;;) { immutable actual = std.c.linux.linux.read(fd, result.ptr + size, result.length - size); cenforce(actual != actual.max, name); size += actual; if (size < result.length) break;Only bug: You need to check if actual > 0. A driver is allowed to return less data than requested, even if there is more to read. You must read until read returns 0. In this case you need a loop to do the reading. Yes the disk driver never returns less than requested unless EOF is reached, but the /proc files are run by any driver, which might do something like that.auto newAlloc = size + 1024 * 4; result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN) [0 .. newAlloc]; } return result[0 .. size]; }I would rewrite: void[] read(string name) { immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); auto allocSize = statbuf.st_size ? statbuf.st_size + 1 : 1024; void[] result; scope(failure) delete result; size_t size = 0; outer: while(true) { result = GC.realloc(result.ptr, allocSize, GC.BlkAttr.NO_SCAN) [0 .. allocSize]; while(size < allocSize) { immutable actual = std.c.linux.linux.read(fd, result.ptr + size, allocSize - size); if(actual == 0) break outer; cenforce(actual != actual.max, name); size += actual; } allocSize = size + 1024 * 4; } return result[0 .. size]; }
Feb 16 2009
Steven Schveighoffer wrote:"Andrei Alexandrescu" wrote[snip] Aha, cool. Thanks for the info. I've adapted the code to still only use one loop: void[] read(string name) { immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); immutable initialAlloc = statbuf.st_size ? (statbuf.st_size + 16) & 15 : 1024; void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN) [0 .. initialAlloc]; scope(failure) delete result; size_t size = 0; for (;;) { immutable actual = std.c.linux.linux.read(fd, result.ptr + size, result.length - size); cenforce(actual != -1, name); if (actual == 0) break; size += actual; if (size < result.length) continue; auto newAlloc = size + 1024 * 4; result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN) [0 .. newAlloc]; } return result[0 .. size]; } One more improvement suggested by Walter - I allocate a multiple of 16 because that's allocator's granularity. AndreiSomeone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Implementation 2, save 1 bug:
Feb 16 2009
"Andrei Alexandrescu" wroteSteven Schveighoffer wrote:Hm... won't this allocate only 15 bytes max if st_size is nonzero? I think you meant: (statbuf.st_size + 16) & ~15 Two more points: if you allocate a size of 16, I think you'll actually get 32 bytes because of the sentinel byte. Somehow you should account for that, or you automatically double the allocation size each time (or at least a page more than you want). And the increments are not in 16, they are in powers of 2 *starting* with 16. For example, if you allocate a size of 80 bytes (16 * 5), you will actually get 128 bytes. Other than that, it looks good. -Steve"Andrei Alexandrescu" wrote[snip] Aha, cool. Thanks for the info. I've adapted the code to still only use one loop: void[] read(string name) { immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); immutable initialAlloc = statbuf.st_size ? (statbuf.st_size + 16) & 15 : 1024;Someone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Implementation 2, save 1 bug:
Feb 16 2009
Steven Schveighoffer wrote:"Andrei Alexandrescu" wroteStupid!!!Steven Schveighoffer wrote:Hm... won't this allocate only 15 bytes max if st_size is nonzero? I think you meant: (statbuf.st_size + 16) & ~15"Andrei Alexandrescu" wrote[snip] Aha, cool. Thanks for the info. I've adapted the code to still only use one loop: void[] read(string name) { immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY); cenforce(fd != -1, name); scope(exit) std.c.linux.linux.close(fd); struct_stat statbuf = void; cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name); immutable initialAlloc = statbuf.st_size ? (statbuf.st_size + 16) & 15 : 1024;Someone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Implementation 2, save 1 bug:Two more points: if you allocate a size of 16, I think you'll actually get 32 bytes because of the sentinel byte. Somehow you should account for that, or you automatically double the allocation size each time (or at least a page more than you want). And the increments are not in 16, they are in powers of 2 *starting* with 16. For example, if you allocate a size of 80 bytes (16 * 5), you will actually get 128 bytes. Other than that, it looks good.Guess I'll revert to +1. The effect is likely to be negligible anyway. Thanks! Andrei
Feb 16 2009
Andrei Alexandrescu wrote:Someone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_p rting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read. Sean
Feb 16 2009
"Sean Kelly" wroteAndrei Alexandrescu wrote:Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical. I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity. -SteveSomeone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_p rting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.
Feb 16 2009
Steven Schveighoffer wrote:"Sean Kelly" wroteI totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it", not "if the size according to fstat is the same as the actual file size, /proc stuff and concurrent access against the file notwithstanding, reads exactly that many bytes in a buffer and returns it. Otherwise, leaves you scratching your head." And yes, using fstat is just a little irrelevant implementation trick, not something that conditions the workings of read. AndreiAndrei Alexandrescu wrote:Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical. I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity.Someone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_p rting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.
Feb 16 2009
Andrei Alexandrescu wrote:I totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it"Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.
Feb 16 2009
Sean Kelly wrote:Andrei Alexandrescu wrote:I think that's reasonable and will document it as such. Nice example with /dev/random though :o). AndreiI totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it"Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.
Feb 16 2009
On Mon, 16 Feb 2009 16:36:52 -0800, Sean Kelly wrote:Andrei Alexandrescu wrote:Then why doesn't cp have a warning about copying /dev/random or /dev/ zero? I think it's enough to specify what it does (read the entire file) and trust that people understand what they are doing when the attempt to read an unbounded stream. The warning belongs with the stream, not with the simple utility. It probably only takes once for the developer to screw it up to learn ;) But I guess it doesn't hurt... -SteveI totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it"Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.
Feb 16 2009
Sean Kelly wrote:Andrei Alexandrescu wrote:I thought more about this and a nice solution is to have std.file.read take an optional "up to" parameter: void[] read(in char[] filename, size_t upTo = size_t.max); Then you can get a random int with: int rnd = (cast(int[]) read("/dev/random", 4))[0]; Yum. AndreiI totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it"Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.
Feb 17 2009
Regarding this really useful part of the std lib, I warmly suggest to also benchmark the final code against a similar operation done in Perl or Python2.x. Such tests very often show performance bugs/problems. Python code for the test is really simple: file("filename.txt).read() Bye, bearophile
Feb 17 2009
On Tue, 17 Feb 2009 21:08:15 +0300, bearophile <bearophileHUGS lycos.com> wrote:Regarding this really useful part of the std lib, I warmly suggest to also benchmark the final code against a similar operation done in Perl or Python2.x. Such tests very often show performance bugs/problems. Python code for the test is really simple: file("filename.txt).read() Bye, bearophileHow does it work with /dev/random etc?
Feb 17 2009
Steven Schveighoffer wrote:"Sean Kelly" wroteI wasn't saying that at all. But try using this approach on /dev/random, for example. One of the great things about *nix is that _anything_ can be represented as a file and therefore operated on via the same scripting techniques, however, I think it's an open issue whether this behavior lies within or outside the design of std.read. I do think one could argue that std.read should make a best effort and read from such files up to some imposed maximum byte count, but this still seems like stretching its intended purpose to me. SeanAndrei Alexandrescu wrote:Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical.Someone mentioned an old bug in std.file.read here: http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_p rting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.
Feb 16 2009