digitalmars.D.bugs

digitalmars.D.bugs - MmFile

Kevin Bealer <Kevin_member pathlink.com> writes:

This is on linux (32 bit amd cpu), version 1.126 of dmd.

There are several problems with MmFile.

1. The first is that the MmFile class produces output (via printf() on line 227
of mmfile.d) in the case of a not-found file.

2. The second is that despite this error on open(), there seem to be attempts to
stat, then map the file.  The stat fails, but for some reason the mmap seems to
work, even though it is passed "-1" as the file descriptor, as shown below in
the strace output.

Looking at the source code for MmFile, I can't see how this is even possible.
Nevertheless, the strace run shows it -- it should throw an exception in errNo()
in the same control block as the printf(), but for some reason it continues
running and dispatching the system calls.

3. The file size variable is specified as "size_t", and there is no way to
specify a starting point.  There should probably be a pair of 64 bit offsets
instead, i.e. begin and end, or start and length.  A 32 bit machine can use 64
bit files by mapping in slices of them.
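
A rough sketch of the slicing idea in point 3, not code from mmfile.d. It only shows the arithmetic: a 64 bit [begin, end) byte range is turned into a page-aligned file offset plus a length that fits in size_t. The names Window and planWindow and the 4096 byte page size are assumptions made for illustration, and the code is written for a current D compiler rather than the dmd version mentioned above.

```d
import std.stdio;

/// Hypothetical description of one mapped slice of a large file.
struct Window
{
    ulong fileOffset; // page-aligned offset to hand to the mapping call
    size_t length;    // bytes to map; has to fit in the address space
    size_t skip;      // bytes between fileOffset and the requested begin
}

/// Turn a half-open byte range [begin, end) into a page-aligned window.
/// pageSize is assumed to be the system page size (4096 is only a guess).
Window planWindow(ulong begin, ulong end, size_t pageSize = 4096)
{
    assert(begin <= end, "begin must not exceed end");
    immutable ulong aligned = begin - (begin % pageSize);
    immutable ulong span = end - aligned;
    assert(span <= size_t.max, "slice too large for this address space");
    return Window(aligned, cast(size_t) span, cast(size_t)(begin - aligned));
}

void main()
{
    // A 1 MiB slice that starts just past the 5 GiB mark of a large file.
    immutable ulong begin = 5UL * 1024 * 1024 * 1024 + 100;
    auto w = planWindow(begin, begin + 1024 * 1024);
    writefln("offset=%s length=%s skip=%s", w.fileOffset, w.length, w.skip);
}
```

The requested bytes start at skip within the mapped window, so a 32 bit process only ever needs address space for one slice at a time.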


Small section of strace output:

:open("mmaperr.dq", O_RDONLY)            = -1 ENOENT (No such file or directory)
:fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
:mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d7b000
:write(1, "\topen error, errno = 2\n", 23) = 23

Source code:

:import std.stdio;
:import std.mmfile;
:import std.file;
:
:int main(char[][] x)
:{
:    MmFile mmf;
:    try {
:        mmf = new MmFile("mmaperr.dq");
:    }
:    catch(FileException e) {
:        writefln("A");
:    }
:
:    writefln("B");
:    delete mmf;
:    writefln("C");
:    return 0;
:}

Kevin

Jul 17 2005

"Ben Hinkle" <ben.hinkle gmail.com> writes:

 3. The file size variable is specified as "size_t", and there is no starting
 point specifiable.  There should probably be a pair of 64 bit offsets instead,
 ie begin and end, or start and length.  A 32 bit machine can use 64 bit files by
 mapping in slices of them.

Agreed. Users should be given the option of not mapping the whole file. I'd 
like to see something like
  class MmFile {
    private ulong start, len;
    ...
    ubyte opIndex(ulong i) {
      // should do bounds checking when implemented for real
      if (i not in mapped region) {
        unmap current region
        map region [len*(i/len), len*(i/len + 1))
      }
      return data[i - start];
    }
    void[] opSlice(ulong i, ulong j) {
      if (slice not in mapped region) {
        unmap current region
        map region [i, j) if j-i > len, otherwise map a len-sized block
      }
      ...
    }
    ...
  }
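
The opIndex part of the sketch above could look roughly like this in runnable form. This is only an illustration under stated assumptions: a plain in-memory array stands in for the file, "mapping" is just taking a slice of it, and WindowedBytes and remaps are made-up names, not part of std.mmfile. It is written for a current D compiler.

```d
import std.stdio;

/// Stand-in for a windowed mapped file: an array plays the role of the
/// file on disk, and "mapping" a window is just slicing that array.
class WindowedBytes
{
    private const(ubyte)[] file; // pretend this is the file on disk
    private const(ubyte)[] data; // currently "mapped" window
    private ulong start;         // file offset of data[0]
    private ulong len;           // window size
    size_t remaps;               // how many times a window was (re)mapped

    this(const(ubyte)[] file, ulong len)
    {
        assert(len > 0, "window size must be positive");
        this.file = file;
        this.len = len;
    }

    ubyte opIndex(ulong i)
    {
        assert(i < file.length, "index past end of file");
        if (data is null || i < start || i >= start + data.length)
        {
            // Window holding i: [len*(i/len), len*(i/len + 1)), clipped to the file.
            start = len * (i / len);
            immutable ulong stop =
                start + len < file.length ? start + len : file.length;
            data = file[cast(size_t) start .. cast(size_t) stop];
            ++remaps;
        }
        return data[cast(size_t)(i - start)];
    }
}

void main()
{
    auto bytes = new ubyte[](10);
    foreach (i, ref b; bytes)
        b = cast(ubyte) i;
    auto w = new WindowedBytes(bytes, 4);
    writeln(w[1], " ", w[2], " ", w[9], " remaps=", w.remaps);
}
```

Indexes that fall in the same window reuse the current mapping; only an index outside it triggers a remap.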

Jul 18 2005

Kevin Bealer <Kevin_member pathlink.com> writes:

I have written something like this in C++ at work, but bigger and performance
focused.  I call it the "atlas" code because it is a collection of maps (atlas,
get it?).  It keeps a table of reference counted mapped areas.  The data sets
are often large, in the neighborhood of 13gb for the largest (it will probably
be 25% larger by end of year).

The individual files are always less than 1 GB, but working with more than 128MB
slices proved very cumbersome -- there is always an mmap() call that fails for
indistinct reasons, possibly internal fragmentation.  You can have 1.5 gb of
data, but if you don't slice it up small, the process tanks eventually.

(The code and all its dependencies are public domain and CVSable, so if anyone
wants to reuse it, let me know and I'll drop a pointer.)

An interesting optimization if anyone uses this technique -- if the file
contains a lot of uneven regions (as our files always do), map an extra few
blocks on the end of each slice.  That way you almost never need to "piece
together" parts of two mmapped regions.  Data "on the border" don't need their
own mmap() call.  This technique almost *halves* the number of mmap calls at
minimal cost and without copying any bytes.
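
A small sketch of why the extra blocks help, under made-up numbers. It is not taken from the atlas code: fitsInOneSlice, the 1024 byte slice size and the record sizes are all assumptions for illustration, and the code targets a current D compiler. A record that starts in one slice can be read through that slice's mapping alone as long as it ends before the slice's nominal end plus the extra bytes mapped past it.

```d
import std.stdio;

/// True when the record [offset, offset + size) lies entirely inside the
/// slice that contains its first byte, given slices of sliceLen bytes that
/// are each mapped with `extra` additional bytes past their nominal end.
bool fitsInOneSlice(ulong offset, ulong size, ulong sliceLen, ulong extra)
{
    immutable ulong sliceStart = sliceLen * (offset / sliceLen);
    return offset + size <= sliceStart + sliceLen + extra;
}

void main()
{
    enum ulong sliceLen = 1024;
    // Hypothetical uneven record sizes, laid out back to back.
    immutable ulong[] sizes = [300, 500, 400, 700, 200, 900, 100];

    foreach (extra; [0UL, 1024UL])
    {
        ulong offset = 0;
        size_t straddling = 0;
        foreach (size; sizes)
        {
            // Count records that would need a second mapping.
            if (!fitsInOneSlice(offset, size, sliceLen, extra))
                ++straddling;
            offset += size;
        }
        writefln("extra=%s: %s record(s) cross a slice border", extra, straddling);
    }
}
```

The extra bytes cost only address space, since the pages are shared with the next slice's mapping and nothing is copied.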

Kevin

Jul 18 2005

"Ben Hinkle" <ben.hinkle gmail.com> writes:

 I have written something like this in C++ at work, but bigger and performance
 focused.  I call it the "atlas" code because it is a collection of maps (atlas,
 get it?).  It keeps a table of reference counted mapped areas.  The data sets
 are often large, in the neighborhood of 13gb for the largest (it will probably
 be 25% larger by end of year).

 The individual files are always less than 1 GB, but working with more than 128MB
 slices proved very cumbersome -- there is always an mmap() call that fails for
 indistinct reasons, possibly internal fragmentation.  You can have 1.5 gb of
 data, but if you don't slice it up small, the process tanks eventually.

 (The code and all its dependencies are public domain and CVSable, so if anyone
 wants to reuse it, let me know and I'll drop a pointer.)

does it have a URL?

 An interesting optimization if anyone uses this technique -- if the file
 contains a lot of uneven regions (as our files always do), map an extra few
 blocks on the end of each slice.  That way you almost never need to "piece
 together" parts of two mmapped regions.  Data "on the border" don't need their
 own mmap() call.  This technique almost *halves* the number of mmap calls at
 minimal cost and without copying any bytes.

Good idea. You know, I don't mean to be nosy but it sounds like you would be 
a great person to fix up MmFile. :-) I've never used mmfiles to the extent 
you have and who knows how much experience other folks around here have with 
large dataset handling.
I know it sucks when you post some bugs or enhancement requests and the 
response is "you have the code, fix it up and send it to Walter" but please 
consider it if you have the time.

Jul 19 2005

Kevin Bealer <Kevin_member pathlink.com> writes:

In article <dbiqat$29q4$1 digitaldaemon.com>, Ben Hinkle says...
 does it have a URL?

You can see the part I described at
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8hpp-source.html
and
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8cpp-source.html.

To get the code to actually compile, bigger pieces are probably required.

Kevin

 Good idea. You know, I don't mean to be nosy but it sounds like you would be
 a great person to fix up MmFile. :-) I've never used mmfiles to the extent
 you have and who knows how much experience other folks around here have with
 large dataset handling.
 I know it sucks when you post some bugs or enhancement requests and the
 response is "you have the code, fix it up and send it to Walter" but please
 consider it if you have the time.

I've been looking for a project; I'll have a look at it.

Kevin

Jul 19 2005

Kevin Bealer <Kevin_member pathlink.com> writes:

Sorry - I said I would look at this and I did, but I got distracted by various
other aspects of my life - some important, most not.  In any case, I have
written code for this for Windows and Linux.

The basic thing I changed was to allow 64 bit files to be mapped given arbitrary
start and end offsets.

I could describe my changes, and/or email the code I have to someone.  Or, I
could merge the differences in and test it further, if it is still needed.  I saw
something about mmfile fixes, but I haven't checked how complete they are.

(I haven't done anything else in D or even kept up on reading the forum for
several months, so my compiler isn't even up to date anymore.  I have a couple of
half finished projects that have been stagnating all this time too.)

Kevin

Oct 25 2005

Ben Hinkle <Ben_member pathlink.com> writes:

I debugged the issue and the fstat and mmap are due to the writef. For example
try
: void main() {
:   printf("hi\n");
: }
and you'll see the same fstat/mmap before the print. It only happens before the
first printf/writef - subsequent prints don't fstat/mmap.

Aug 14 2005
