digitalmars.D.bugs - MmFile

Kevin Bealer <Kevin_member pathlink.com> writes:
This is on Linux (32-bit AMD CPU), with version 0.126 of dmd.

There are several problems with MmFile.

1. The first is that the MmFile class produces output (via printf() on line 227
of mmfile.d) in the case of a not-found file.

2. The second is that despite this error on open(), there seem to be attempts to
stat, then map the file.  The stat fails, but for some reason the mmap seems to
work, even though it is passed "-1" as the file descriptor, as shown below in
the strace output.

Looking at the source code for MmFile, I can't see how this is even possible;
nevertheless, the strace run shows it. The constructor should throw an
exception in errNo(), in the same control block as the printf(), but for some
reason it continues running and issuing the system calls.

3. The file size variable is specified as "size_t", and there is no way to
specify a starting point.  There should probably be a pair of 64-bit offsets
instead, i.e. begin and end, or start and length.  A 32-bit machine can use
64-bit files by mapping in slices of them.
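
A minimal sketch of what point 3 asks for, written in C against POSIX mmap() rather than in D, and not taken from mmfile.d. The names file_slice, slice_map and slice_unmap are hypothetical. It assumes a Linux/glibc system, where defining _FILE_OFFSET_BITS as 64 gives a 64-bit off_t on a 32-bit machine, and it assumes the caller keeps the requested range inside the file.

```c
#define _FILE_OFFSET_BITS 64  /* ask glibc for a 64-bit off_t on 32-bit Linux */
#include <assert.h>
#include <fcntl.h>            /* open(), for callers that need a descriptor */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: one mapped slice of a larger file. */
typedef struct {
    void          *base;    /* page-aligned address returned by mmap() */
    size_t         map_len; /* bytes actually mapped, including lead-in */
    const uint8_t *data;    /* first requested byte */
    size_t         len;     /* requested length */
} file_slice;

/* Map bytes [start, start + len) of fd read-only.
 * The range must lie inside the file. Returns 0 on success, -1 on failure. */
static int slice_map(int fd, uint64_t start, size_t len, file_slice *out)
{
    assert(out != NULL && len > 0);

    /* mmap() requires the file offset to be a multiple of the page size,
     * so round start down and remember how far it moved. */
    uint64_t page    = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t aligned = start - (start % page);
    size_t   lead    = (size_t)(start - aligned);

    if (len > SIZE_MAX - lead)
        return -1;  /* lead + len would overflow size_t */

    void *p = mmap(NULL, lead + len, PROT_READ, MAP_PRIVATE, fd, (off_t)aligned);
    if (p == MAP_FAILED)
        return -1;

    out->base    = p;
    out->map_len = lead + len;
    out->data    = (const uint8_t *)p + lead;
    out->len     = len;
    return 0;
}

/* Release a slice obtained from slice_map(). */
static void slice_unmap(file_slice *s)
{
    munmap(s->base, s->map_len);
    s->base = NULL;
    s->data = NULL;
    s->map_len = 0;
    s->len = 0;
}
```

With this shape the 64-bit quantity is only the file offset; the length of any one slice still fits in size_t, which is how a 32-bit process can walk a file larger than its address space.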


Small section of strace output:

:open("mmaperr.dq", O_RDONLY)            = -1 ENOENT (No such file or directory)
:fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
:mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d7b000
:write(1, "\topen error, errno = 2\n", 23        open error, errno = 2
:) = 23

Source code:

:import std.stdio;
:import std.mmfile;
:import std.file;
:
:int main(char[][] x)
:{
:    MmFile mmf;
:    try {
:        mmf = new MmFile("mmaperr.dq");
:    }
:    catch(FileException e) {
:        writefln("A");
:    }
:
:    writefln("B");
:    delete mmf;
:    writefln("C");
:    return 0;
:}

Kevin
Jul 17 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
 3. The file size variable is specified as "size_t", and there is no starting
 point specifiable.  There should probably be a pair of 64 bit offsets instead,
 ie begin and end, or start and length.  A 32 bit machine can use 64 bit files by
 mapping in slices of them.
Agreed. Users should be given the option of not mapping the whole file. I'd
like to see something like

:class MmFile {
:  private ulong start, len;
:  ...
:  ubyte opIndex(ulong i) {
:    // should do bounds checking when implemented for real
:    if (i not in mapped region) {
:      unmap current region
:      map region [len*(i/len), len*(i/len + 1))
:    }
:    return data[i - start];
:  }
:  void[] opSlice(ulong i, ulong j) {
:    if (slice not in mapped region) {
:      unmap current region
:      map region [i, j) if j-i > len, otherwise map a len-sized block
:    }
:    ...
:  }
:  ...
:}
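
The opIndex half of that idea can be sketched in C with POSIX mmap(). This is only an illustration under stated assumptions, not the MmFile implementation: mm_window, mmw_open, mmw_get and mmw_close are hypothetical names, the window length must be a multiple of the page size, and only single-byte reads are shown.

```c
#define _FILE_OFFSET_BITS 64  /* 64-bit off_t on 32-bit Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical windowed view of a file; field names are illustrative. */
typedef struct {
    int            fd;
    uint64_t       file_size;
    uint64_t       win_len;  /* window size, a multiple of the page size */
    uint64_t       start;    /* file offset of the mapped window */
    size_t         mapped;   /* bytes currently mapped, 0 if none */
    const uint8_t *data;
} mm_window;

/* Open path read-only without mapping anything yet. 0 on success, -1 on error. */
static int mmw_open(mm_window *w, const char *path, uint64_t win_len)
{
    struct stat st;
    uint64_t page = (uint64_t)sysconf(_SC_PAGESIZE);

    if (win_len == 0 || win_len % page != 0 || win_len > SIZE_MAX)
        return -1;
    w->fd = open(path, O_RDONLY);
    if (w->fd < 0)
        return -1;
    if (fstat(w->fd, &st) != 0) {
        close(w->fd);
        return -1;
    }
    w->file_size = (uint64_t)st.st_size;
    w->win_len = win_len;
    w->start = 0;
    w->mapped = 0;
    w->data = NULL;
    return 0;
}

/* Return byte i (0..255), remapping the window if i is outside it; -1 on error. */
static int mmw_get(mm_window *w, uint64_t i)
{
    assert(w->win_len > 0);
    if (i >= w->file_size)
        return -1;  /* bounds check */

    if (w->mapped == 0 || i < w->start || i - w->start >= w->mapped) {
        /* Window containing i: [win_len*(i/win_len), win_len*(i/win_len + 1)),
         * clipped to the end of the file. */
        uint64_t start = w->win_len * (i / w->win_len);
        uint64_t left  = w->file_size - start;
        size_t   len   = (size_t)(left < w->win_len ? left : w->win_len);

        if (w->mapped != 0)
            munmap((void *)w->data, w->mapped);
        w->mapped = 0;

        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, w->fd, (off_t)start);
        if (p == MAP_FAILED)
            return -1;
        w->data = p;
        w->start = start;
        w->mapped = len;
    }
    return w->data[i - w->start];
}

/* Unmap the current window, if any, and close the descriptor. */
static void mmw_close(mm_window *w)
{
    if (w->mapped != 0)
        munmap((void *)w->data, w->mapped);
    w->mapped = 0;
    close(w->fd);
}
```

Sequential reads stay inside one window and cost no system calls; a read that lands in a different window costs one munmap() and one mmap().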
Jul 18 2005
Kevin Bealer <Kevin_member pathlink.com> writes:
I have written something like this in C++ at work, but bigger and performance
focused.  I call it the "atlas" code because it is a collection of maps (atlas,
get it?).  It keeps a table of reference counted mapped areas.  The data sets
are often large, in the neighborhood of 13 GB for the largest (it will probably
be 25% larger by end of year).

The individual files are always less than 1 GB, but working with more than
128 MB slices proved very cumbersome -- there is always an mmap() call that
fails for indistinct reasons, possibly internal fragmentation.  You can have
1.5 GB of data, but if you don't slice it up small, the process tanks
eventually.

(The code and all its dependencies are public domain and CVSable, so if anyone
wants to reuse it, let me know and I'll drop a pointer.)

An interesting optimization if anyone uses this technique -- if the file
contains a lot of uneven regions (as our files always do), map an extra few
blocks on the end of each slice.  That way you almost never need to "piece
together" parts of two mmapped regions.  Data "on the border" don't need their
own mmap() call.  This technique almost *halves* the number of mmap calls at
minimal cost and without copying any bytes.

Kevin
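
The arithmetic behind the "extra few blocks" trick can be sketched as follows. This is an illustration only and is not taken from the atlas code; byte_range, slice_extent and fits_in_one_slice are hypothetical names, and the overlap size is an assumption, since the post does not give a number. Only the offsets are computed here; the actual mmap() call is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [begin, end) within a file. */
typedef struct {
    uint64_t begin;
    uint64_t end;
} byte_range;

/* Extent to map for slice number `index`: the slice itself plus `overlap`
 * extra bytes past its end, clipped to the file size. slice_len should be a
 * multiple of the page size so that every begin is a valid mmap() offset. */
static byte_range slice_extent(uint64_t index, uint64_t slice_len,
                               uint64_t overlap, uint64_t file_size)
{
    byte_range r;

    assert(slice_len > 0 && index * slice_len < file_size);
    r.begin = index * slice_len;
    r.end   = r.begin + slice_len + overlap;
    if (r.end > file_size)
        r.end = file_size;
    return r;
}

/* Returns 1 if the request [begin, end) lies entirely inside the padded
 * slice that contains `begin`, so a single mapping can serve it. */
static int fits_in_one_slice(uint64_t begin, uint64_t end, uint64_t slice_len,
                             uint64_t overlap, uint64_t file_size)
{
    byte_range r;

    assert(begin < end && end <= file_size);
    r = slice_extent(begin / slice_len, slice_len, overlap, file_size);
    return begin >= r.begin && end <= r.end;
}
```

Any request no longer than the overlap is guaranteed to fit in the padded slice that contains its first byte, which is why records that straddle a slice boundary stop needing a second mapping.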
Jul 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
 (The code and all its dependencies are public domain and CVSable, so if anyone
 wants to reuse it, let me know and I'll drop a pointer.)
does it have a URL?
 An interesting optimization if anyone uses this technique -- if the file
 contains a lot of uneven regions (as our files always do), map an extra few
 blocks on the end of each slice.  That way you almost never need to "piece
 together" parts of two mmapped regions.  Data "on the border" don't need their
 own mmap() call.  This technique almost *halves* the number of mmap calls at
 minimal cost and without copying any bytes.
Good idea. You know, I don't mean to be nosy but it sounds like you would be a great person to fix up MmFile. :-) I've never used mmfiles to the extent you have and who knows how much experience other folks around here have with large dataset handling. I know it sucks when you post some bugs or enhancement requests and the response is "you have the code, fix it up and send it to Walter" but please consider it if you have the time.
Jul 19 2005
Kevin Bealer <Kevin_member pathlink.com> writes:
In article <dbiqat$29q4$1 digitaldaemon.com>, Ben Hinkle says...
 does it have a URL?
You can see the parts I described at
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8hpp-source.html
and
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8cpp-source.html

To get the code to actually compile, bigger pieces are probably required.

Kevin
 Good idea. You know, I don't mean to be nosy but it sounds like you would be a great person to fix up MmFile. :-) I've never used mmfiles to the extent you have and who knows how much experience other folks around here have with large dataset handling. I know it sucks when you post some bugs or enhancement requests and the response is "you have the code, fix it up and send it to Walter" but please consider it if you have the time.
I've been looking for a project; I'll have a look at it.

Kevin
Jul 19 2005
Kevin Bealer <Kevin_member pathlink.com> writes:
Sorry - I said I would look at this and I did, but I got distracted by various
other aspects of my life - some important, most not.

In any case, I have written code for this for Windows and Linux.  The basic
thing I changed was to allow 64-bit files to be mapped given arbitrary start
and end offsets.

I could describe my changes, and/or email the code I have to someone.  Or I
could merge the differences in and test it further, if it is still needed.  I
saw something about mmfile fixes, but I haven't checked how complete they are.

(I haven't done anything else in D, or even kept up on reading the forum, for
several months, so my compiler isn't even up to date anymore.  I have a couple
of half-finished projects that have been stagnating all this time too.)

Kevin
Oct 25 2005
Ben Hinkle <Ben_member pathlink.com> writes:
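I debugged the issue, and the fstat and mmap are due to the writef; they are
not an attempt to map the missing file. For example, try
: void main() {
:   printf("hi\n");
: }
and you'll see the same fstat/mmap before the print. It only happens before the
first printf/writef; subsequent prints don't fstat/mmap. Note that the fstat64
in the trace is on descriptor 1 (stdout) and the mmap2 is an anonymous
4096-byte mapping, so neither call touches the file that failed to open.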

Aug 14 2005