digitalmars.D.bugs - MmFile
- Kevin Bealer (44/44) Jul 17 2005 This is on linux (32 bit amd cpu), version 1.126 of dmd.
- Ben Hinkle (22/29) Jul 18 2005 Agreed. Users should be given the option of not mapping the whole file. ...
- Kevin Bealer (19/48) Jul 18 2005 I have written something like this in C++ at work, but bigger and perfor...
- Ben Hinkle (8/36) Jul 19 2005 Good idea. You know, I don't mean to be nosy but it sounds like you woul...
- Kevin Bealer (9/47) Jul 19 2005 You can see the part I described are at
- Kevin Bealer (16/66) Oct 25 2005 Sorry - I said I would look at this and I did, but I got distracted by v...
- Ben Hinkle (7/43) Aug 14 2005 I debugged the issue and the fstat and mmap are due to the writef. For e...
This is on linux (32 bit amd cpu), version 1.126 of dmd.

There are several problems with MmFile.

1. The first is that the MmFile class produces output (via printf() on line 227 of mmfile.d) in the case of a not-found file.

2. The second is that despite this error on open(), there seem to be attempts to stat, then map the file. The stat fails, but for some reason the mmap seems to work, even though it is passed "-1" as the file descriptor, as shown below in the strace output. Looking at the source code for MmFile, I can't see how this is even possible; nevertheless, the strace run shows it. It should throw an exception in errNo() in the same control block as the printf(), but for some reason it continues running and dispatching the system calls.

3. The file size variable is specified as "size_t", and there is no way to specify a starting point. There should probably be a pair of 64 bit offsets instead, i.e. begin and end, or start and length. A 32 bit machine can use 64 bit files by mapping in slices of them.

Small section of strace output:

:open("mmaperr.dq", O_RDONLY) = -1 ENOENT (No such file or directory)
:fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
:mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d7b000
:write(1, "\topen error, errno = 2\n", 23 open error, errno = 2
:) = 23

Source code:

:import std.stdio;
:import std.mmfile;
:import std.file;
:
:int main(char[][] x)
:{
:    MmFile mmf;
:    try {
:        mmf = new MmFile("mmaperr.dq");
:    }
:    catch(FileException e) {
:        writefln("A");
:    }
:
:    writefln("B");
:    delete mmf;
:    writefln("C");
:    return 0;
:}

Kevin
Jul 17 2005
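A minimal sketch of point 3 above, written in C rather than D and assuming a POSIX system. It maps only the bytes [start, start + len) of a file, using a 64 bit offset so that a 32 bit process can walk a large file slice by slice. The names slice, align_down, slice_map and slice_unmap are hypothetical and are not part of std.mmfile. The sketch also returns as soon as open() fails, so no mmap is ever attempted on a bad descriptor.

```c
#define _FILE_OFFSET_BITS 64  /* ask for a 64 bit off_t, even on a 32 bit machine */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for one mapped slice; not part of std.mmfile. */
struct slice {
    void          *base;     /* page-aligned address returned by mmap */
    size_t         map_len;  /* length given to mmap, needed by munmap */
    const uint8_t *data;     /* first byte the caller asked for */
    size_t         len;      /* number of bytes the caller asked for */
};

/* Round a file offset down to a multiple of the page size. */
uint64_t align_down(uint64_t offset, uint64_t page)
{
    return offset - (offset % page);
}

/* Map bytes [start, start + len) of the file at path, read-only.
   Returns 0 on success and -1 on any failure. */
int slice_map(const char *path, uint64_t start, size_t len, struct slice *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;                      /* stop here: no fstat, no mmap */

    /* mmap offsets must be page aligned, so map from the page boundary
       below start and remember how many bytes of padding that adds. */
    uint64_t page = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t aligned = align_down(start, page);
    size_t pad = (size_t)(start - aligned);

    void *p = mmap(NULL, len + pad, PROT_READ, MAP_PRIVATE, fd, (off_t)aligned);
    close(fd);                          /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    out->base = p;
    out->map_len = len + pad;
    out->data = (const uint8_t *)p + pad;
    out->len = len;
    return 0;
}

/* Release a slice obtained from slice_map. */
void slice_unmap(struct slice *s)
{
    munmap(s->base, s->map_len);
    s->base = NULL;
    s->data = NULL;
    s->map_len = 0;
    s->len = 0;
}
```

A caller would check the return value of slice_map, read s.data[0] through s.data[s.len - 1], and then call slice_unmap before mapping the next slice.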
> 3. The file size variable is specified as "size_t", and there is no way to specify a starting point. There should probably be a pair of 64 bit offsets instead, i.e. begin and end, or start and length. A 32 bit machine can use 64 bit files by mapping in slices of them.

Agreed. Users should be given the option of not mapping the whole file. I'd like to see something like

class MmFile {
  private ulong start, len;
  ...
  ubyte opIndex(ulong i) {
    // should do bounds checking when implemented for real
    if (i not in mapped region) {
      unmap current region
      map region [len*(i/len), len*(i/len + 1))
    }
    return data[i - start];
  }
  void[] opSlice(ulong i, ulong j) {
    if (slice not in mapped region) {
      unmap current region
      map region [i, j) if j-i > len, otherwise map len block
    }
    ...
  }
  ...
}
Jul 18 2005
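The block arithmetic in the pseudocode above can be sketched as follows, in C and without the actual mmap calls. This is an illustration under assumptions, not the MmFile implementation: struct window, window_contains, window_for_index and window_for_slice are hypothetical names, and block_len is assumed to be a multiple of the page size so that a window start can be used directly as an mmap offset. One variation from the pseudocode: a slice that does not fit in one block is mapped from the start of its first block, rather than from i, to keep the offset aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the mapped window: file bytes [start, end). */
struct window {
    uint64_t start;
    uint64_t end;
};

/* True when index i lies inside the window, so no remap is needed. */
bool window_contains(const struct window *w, uint64_t i)
{
    return i >= w->start && i < w->end;
}

/* The block of block_len bytes that contains index i, clamped to the
   end of the file. */
struct window window_for_index(uint64_t i, uint64_t block_len,
                               uint64_t file_size)
{
    assert(block_len > 0 && i < file_size);

    struct window w;
    w.start = block_len * (i / block_len);
    w.end = w.start + block_len;
    if (w.end > file_size || w.end < w.start)  /* clamp; also guards overflow */
        w.end = file_size;
    return w;
}

/* Window for the slice [i, j): the block containing i when the slice
   fits in it, otherwise that block stretched to reach j. */
struct window window_for_slice(uint64_t i, uint64_t j, uint64_t block_len,
                               uint64_t file_size)
{
    assert(i < j && j <= file_size);

    struct window w = window_for_index(i, block_len, file_size);
    if (j > w.end)
        w.end = j;
    return w;
}
```

An opIndex-style accessor would first call window_contains on the current window, and only when that fails unmap, compute window_for_index, and map the new range.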
In article <dbg5i0$2qfl$1 digitaldaemon.com>, Ben Hinkle says...

>> 3. The file size variable is specified as "size_t", and there is no way to specify a starting point. There should probably be a pair of 64 bit offsets instead, i.e. begin and end, or start and length. A 32 bit machine can use 64 bit files by mapping in slices of them.
>
> Agreed. Users should be given the option of not mapping the whole file. I'd like to see something like
>
> class MmFile {
>   private ulong start, len;
>   ...
>   ubyte opIndex(ulong i) {
>     // should do bounds checking when implemented for real
>     if (i not in mapped region) {
>       unmap current region
>       map region [len*(i/len), len*(i/len + 1))
>     }
>     return data[i - start];
>   }
>   void[] opSlice(ulong i, ulong j) {
>     if (slice not in mapped region) {
>       unmap current region
>       map region [i, j) if j-i > len, otherwise map len block
>     }
>     ...
>   }
>   ...
> }

I have written something like this in C++ at work, but bigger and performance-focused. I call it the "atlas" code because it is a collection of maps (atlas, get it?). It keeps a table of reference-counted mapped areas.

The data sets are often large, in the neighborhood of 13 GB for the largest (it will probably be 25% larger by end of year). The individual files are always less than 1 GB, but working with slices of more than 128 MB proved very cumbersome -- there is always an mmap() call that fails for unclear reasons, possibly internal fragmentation. You can have 1.5 GB of data, but if you don't slice it up small, the process tanks eventually.

(The code and all its dependencies are public domain and CVSable, so if anyone wants to reuse it, let me know and I'll drop a pointer.)

An interesting optimization if anyone uses this technique -- if the file contains a lot of uneven regions (as our files always do), map an extra few blocks on the end of each slice. That way you almost never need to "piece together" parts of two mmapped regions. Data "on the border" don't need their own mmap() call. This technique almost *halves* the number of mmap calls at minimal cost and without copying any bytes.

Kevin
Jul 18 2005
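The overlap optimization described above can be sketched in C as plain offset arithmetic. This is an illustration under assumptions and is not taken from the atlas code: struct layout, slice_index, slice_range and record_in_one_slice are hypothetical names. Slice k nominally starts at k * slice_len, but is mapped with a few extra bytes past its nominal end, so a record that straddles a slice border can still be read from a single mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: slice k is mapped as
   [k * slice_len, (k + 1) * slice_len + overlap), clamped to the file. */
struct layout {
    uint64_t slice_len;  /* distance between consecutive slice starts */
    uint64_t overlap;    /* extra bytes mapped past the nominal slice end */
    uint64_t file_size;  /* total size of the file in bytes */
};

/* Index of the slice whose nominal range contains offset. */
uint64_t slice_index(const struct layout *l, uint64_t offset)
{
    assert(l->slice_len > 0);
    return offset / l->slice_len;
}

/* Byte range actually mapped for slice k, including the overlap. */
void slice_range(const struct layout *l, uint64_t k,
                 uint64_t *begin, uint64_t *end)
{
    *begin = k * l->slice_len;
    *end = *begin + l->slice_len + l->overlap;
    if (*end > l->file_size)
        *end = l->file_size;
}

/* True when the record [rec_begin, rec_end) can be served entirely from
   the slice that contains its first byte, with no second mmap call. */
bool record_in_one_slice(const struct layout *l,
                         uint64_t rec_begin, uint64_t rec_end)
{
    uint64_t begin, end;
    slice_range(l, slice_index(l, rec_begin), &begin, &end);
    return rec_begin >= begin && rec_end <= end;
}
```

With slice_len = 1000, a record at [990, 1050) crosses the border between slices 0 and 1. Without overlap it needs its own mapping; with overlap = 100 it fits inside slice 0, which is mapped as [0, 1100).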
> I have written something like this in C++ at work, but bigger and performance-focused. I call it the "atlas" code because it is a collection of maps (atlas, get it?). It keeps a table of reference-counted mapped areas.
>
> The data sets are often large, in the neighborhood of 13 GB for the largest (it will probably be 25% larger by end of year). The individual files are always less than 1 GB, but working with slices of more than 128 MB proved very cumbersome -- there is always an mmap() call that fails for unclear reasons, possibly internal fragmentation. You can have 1.5 GB of data, but if you don't slice it up small, the process tanks eventually.
>
> (The code and all its dependencies are public domain and CVSable, so if anyone wants to reuse it, let me know and I'll drop a pointer.)

Does it have a URL?

> An interesting optimization if anyone uses this technique -- if the file contains a lot of uneven regions (as our files always do), map an extra few blocks on the end of each slice. That way you almost never need to "piece together" parts of two mmapped regions. Data "on the border" don't need their own mmap() call. This technique almost *halves* the number of mmap calls at minimal cost and without copying any bytes.

Good idea. You know, I don't mean to be nosy but it sounds like you would be a great person to fix up MmFile. :-) I've never used mmfiles to the extent you have, and who knows how much experience other folks around here have with large dataset handling. I know it sucks when you post some bugs or enhancement requests and the response is "you have the code, fix it up and send it to Walter", but please consider it if you have the time.
Jul 19 2005
In article <dbiqat$29q4$1 digitaldaemon.com>, Ben Hinkle says...

>> I have written something like this in C++ at work, but bigger and performance-focused. I call it the "atlas" code because it is a collection of maps (atlas, get it?). It keeps a table of reference-counted mapped areas.
>>
>> The data sets are often large, in the neighborhood of 13 GB for the largest (it will probably be 25% larger by end of year). The individual files are always less than 1 GB, but working with slices of more than 128 MB proved very cumbersome -- there is always an mmap() call that fails for unclear reasons, possibly internal fragmentation. You can have 1.5 GB of data, but if you don't slice it up small, the process tanks eventually.
>>
>> (The code and all its dependencies are public domain and CVSable, so if anyone wants to reuse it, let me know and I'll drop a pointer.)
>
> Does it have a URL?

You can see the part I described at http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8hpp-source.html and http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8cpp-source.html. To get the code to actually compile, bigger pieces are probably required.

Kevin

>> An interesting optimization if anyone uses this technique -- if the file contains a lot of uneven regions (as our files always do), map an extra few blocks on the end of each slice. That way you almost never need to "piece together" parts of two mmapped regions. Data "on the border" don't need their own mmap() call. This technique almost *halves* the number of mmap calls at minimal cost and without copying any bytes.
>
> Good idea. You know, I don't mean to be nosy but it sounds like you would be a great person to fix up MmFile. :-) I've never used mmfiles to the extent you have, and who knows how much experience other folks around here have with large dataset handling. I know it sucks when you post some bugs or enhancement requests and the response is "you have the code, fix it up and send it to Walter", but please consider it if you have the time.

I've been looking for a project; I'll have a look at it.

Kevin
Jul 19 2005
In article <dbjn30$mlo$1 digitaldaemon.com>, Kevin Bealer says...

> In article <dbiqat$29q4$1 digitaldaemon.com>, Ben Hinkle says...
>
>>> I have written something like this in C++ at work, but bigger and performance-focused. I call it the "atlas" code because it is a collection of maps (atlas, get it?). It keeps a table of reference-counted mapped areas.
>>>
>>> The data sets are often large, in the neighborhood of 13 GB for the largest (it will probably be 25% larger by end of year). The individual files are always less than 1 GB, but working with slices of more than 128 MB proved very cumbersome -- there is always an mmap() call that fails for unclear reasons, possibly internal fragmentation. You can have 1.5 GB of data, but if you don't slice it up small, the process tanks eventually.
>>>
>>> (The code and all its dependencies are public domain and CVSable, so if anyone wants to reuse it, let me know and I'll drop a pointer.)
>>
>> Does it have a URL?
>
> You can see the part I described at http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8hpp-source.html and http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8cpp-source.html. To get the code to actually compile, bigger pieces are probably required.
>
> Kevin
>
>>> An interesting optimization if anyone uses this technique -- if the file contains a lot of uneven regions (as our files always do), map an extra few blocks on the end of each slice. That way you almost never need to "piece together" parts of two mmapped regions. Data "on the border" don't need their own mmap() call. This technique almost *halves* the number of mmap calls at minimal cost and without copying any bytes.
>>
>> Good idea. You know, I don't mean to be nosy but it sounds like you would be a great person to fix up MmFile. :-) I've never used mmfiles to the extent you have, and who knows how much experience other folks around here have with large dataset handling. I know it sucks when you post some bugs or enhancement requests and the response is "you have the code, fix it up and send it to Walter", but please consider it if you have the time.
>
> I've been looking for a project; I'll have a look at it.
>
> Kevin

Sorry - I said I would look at this and I did, but I got distracted by various other aspects of my life - some important, most not.

In any case, I have written code for this for Windows and Linux. The basic thing I changed was to allow 64 bit files to be mapped given arbitrary start and end offsets.

I could describe my changes, and/or email the code I have to someone. Or I could merge the differences in and test it further. Is it still needed? I saw something about mmfile fixes, but I haven't checked how complete they are.

(I haven't done anything else in D or even kept up on reading the forum for several months, so my compiler isn't even up to date anymore. I have a couple of half-finished projects that have been stagnating all this time too.)

Kevin
Oct 25 2005
I debugged the issue and the fstat and mmap are due to the writef. For example, try

:void main() {
:    printf("hi\n");
:}

and you'll see the same fstat/mmap before the print. It only happens before the first printf/writef; subsequent prints don't fstat/mmap. In the strace output, the fstat64 is on descriptor 1 (stdout) and the mmap2 is an anonymous 4096-byte mapping, which is why it is passed -1 as the descriptor: it looks like the C runtime setting up its stdout buffer, not MmFile touching the file.

> 2. The second is that despite this error on open(), there seem to be attempts to stat, then map the file. The stat fails, but for some reason the mmap seems to work, even though it is passed "-1" as the file descriptor, as shown below in the strace output. Looking at the source code for MmFile, I can't see how this is even possible; nevertheless, the strace run shows it. It should throw an exception in errNo() in the same control block as the printf(), but for some reason it continues running and dispatching the system calls.
>
> Small section of strace output:
>
> :open("mmaperr.dq", O_RDONLY) = -1 ENOENT (No such file or directory)
> :fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
> :mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d7b000
> :write(1, "\topen error, errno = 2\n", 23 open error, errno = 2
> :) = 23
>
> Source code:
>
> :import std.stdio;
> :import std.mmfile;
> :import std.file;
> :
> :int main(char[][] x)
> :{
> :    MmFile mmf;
> :    try {
> :        mmf = new MmFile("mmaperr.dq");
> :    }
> :    catch(FileException e) {
> :        writefln("A");
> :    }
> :
> :    writefln("B");
> :    delete mmf;
> :    writefln("C");
> :    return 0;
> :}
>
> Kevin
Aug 14 2005