
digitalmars.D.learn - randomIO, std.file, core.stdc.stdio

Charles Hixson via Digitalmars-d-learn writes:
Are there reasons why one would use rawRead and rawWrite rather than 
fread and fwrite when doing binary random I/O?  What are the advantages?

In particular, if one is reading and writing structs rather than arrays 
or ranges, are there any advantages?
Jul 25 2016
ketmar <ketmar ketmar.no-ip.org> writes:
On Monday, 25 July 2016 at 18:54:27 UTC, Charles Hixson wrote:
 Are there reasons why one would use rawRead and rawWrite rather 
 than fread and fwrite when doing binary random I/O?  What are 
 the advantages?

 In particular, if one is reading and writing structs rather 
 than arrays or ranges, are there any advantages?
yes: keeping API consistent. ;-) for example, my stream i/o modules work with anything that has `rawRead`/`rawWrite` methods, but don't bother to check for anything else. besides, `rawRead` just looks cleaner, even with all the `(&a)[0..1]` noise. so, a question of style.
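For comparison, a minimal sketch of both styles reading a single struct; `Rec` and the function names are hypothetical, not from the thread:

import std.stdio : File;
import core.stdc.stdio : FILE, fread;

struct Rec { int id; double value; }   // hypothetical record type

// std.stdio style: a length-1 slice over the struct
void readWithRawRead(File f, ref Rec r)
{
    f.rawRead((&r)[0 .. 1]);
}

// core.stdc.stdio style: pointer plus explicit size
void readWithFread(FILE* fp, ref Rec r)
{
    fread(&r, Rec.sizeof, 1, fp);
}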
Jul 25 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/25/2016 05:18 PM, ketmar via Digitalmars-d-learn wrote:
 On Monday, 25 July 2016 at 18:54:27 UTC, Charles Hixson wrote:
 Are there reasons why one would use rawRead and rawWrite rather than 
 fread and fwrite when doing binary random I/O?  What are the advantages?

 In particular, if one is reading and writing structs rather than 
 arrays or ranges, are there any advantages?
yes: keeping API consistent. ;-) for example, my stream i/o modules work with anything that has `rawRead`/`rawWrite` methods, but don't bother to check for anything else. besides, `rawRead` just looks cleaner, even with all the `(&a)[0..1]` noise. so, a question of style.
OK. If it's just a question of "looking cleaner" and "style", then I will prefer the core.stdc.stdio approach. I find its appearance much cleaner...and even that's understating things. I'll probably wrap those routines in a struct to ensure things like files being properly closed, and not have explicit pointers persisting over large areas of code. (I said a lot more, but it was just a rant about how ugly I find the rawRead/rawWrite syntax, so I deleted it.)
Jul 25 2016
ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 26 July 2016 at 01:19:49 UTC, Charles Hixson wrote:
 then I will prefer the core.stdc.stdio approach.  I find its 
 appearance much cleaner...
only if you are really used to writing C code. when you see a pointer or an explicit type size argument in D, it is a sign of C disease.
 I'll probably wrap those routines in a struct to ensure things 
 like files being properly closed, and not have explicit 
 pointers persisting over large areas of code.
exactly what std.stdio.File did! ;-)
Jul 25 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/25/2016 07:11 PM, ketmar via Digitalmars-d-learn wrote:
 On Tuesday, 26 July 2016 at 01:19:49 UTC, Charles Hixson wrote:
 then I will prefer the core.stdc.stdio approach.  I find its 
 appearance much cleaner...
only if you are really used to writing C code. when you see a pointer or an explicit type size argument in D, it is a sign of C disease.
 I'll probably wrap those routines in a struct to ensure things like 
 files being properly closed, and not have explicit pointers 
 persisting over large areas of code.
exactly what std.stdio.File did! ;-)
Yes, but I really despise the syntax they came up with. It's probably good if most of your I/O is ranges, but mine never has been. (Combining ranges with random I/O?)
Jul 25 2016
ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 26 July 2016 at 04:05:22 UTC, Charles Hixson wrote:
 Yes, but I really despise the syntax they came up with.  It's 
 probably good if most of your I/O is ranges, but mine never 
 has been.  (Combining ranges with random I/O?)
that's why i wrote iv.stream, and then iv.vfs, with convenient things like `readNum!T`, for example. you absolutely don't need to reimplement the whole std.stdio.File if all you need is a better API. thanks to UFCS, you can write your new API as free functions accepting std.stdio.File as the first arg. or even a generic stream, like i did in iv.stream:

enum isReadableStream(T) = is(typeof((inout int=0) {
  auto t = T.init;
  ubyte[1] b;
  auto v = cast(void[])b;
  t.rawRead(v);
}));

enum isWriteableStream(T) = is(typeof((inout int=0) {
  auto t = T.init;
  ubyte[1] b;
  t.rawWrite(cast(void[])b);
}));

T readInt(T : ulong, ST) (auto ref ST st) if (isReadableStream!ST) {
  T res;
  ubyte* b = cast(ubyte*)&res;
  foreach (immutable idx; 0..T.sizeof) {
    if (st.rawRead(b[idx..idx+1]).length != 1) throw new Exception("read error");
  }
  return res;
}

and then:

auto fl = File("myfile");
auto i = fl.readInt!uint;

something like that.
Jul 25 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/25/2016 09:22 PM, ketmar via Digitalmars-d-learn wrote:
 On Tuesday, 26 July 2016 at 04:05:22 UTC, Charles Hixson wrote:
 Yes, but I really despise the syntax they came up with.  It's 
 probably good if most of your I/O is ranges, but mine never has 
 been.  (Combining ranges with random I/O?)
that's why i wrote iv.stream, and then iv.vfs, with convenient things like `readNum!T`, for example. you absolutely don't need to reimplement the whole std.stdio.File if all you need is a better API. thanks to UFCS, you can write your new API as free functions accepting std.stdio.File as the first arg. or even a generic stream, like i did in iv.stream:

enum isReadableStream(T) = is(typeof((inout int=0) {
  auto t = T.init;
  ubyte[1] b;
  auto v = cast(void[])b;
  t.rawRead(v);
}));

enum isWriteableStream(T) = is(typeof((inout int=0) {
  auto t = T.init;
  ubyte[1] b;
  t.rawWrite(cast(void[])b);
}));

T readInt(T : ulong, ST) (auto ref ST st) if (isReadableStream!ST) {
  T res;
  ubyte* b = cast(ubyte*)&res;
  foreach (immutable idx; 0..T.sizeof) {
    if (st.rawRead(b[idx..idx+1]).length != 1) throw new Exception("read error");
  }
  return res;
}

and then:

auto fl = File("myfile");
auto i = fl.readInt!uint;

something like that.
That's sort of what I have in mind, but I want to do what in Fortran would be (would have been?) called record I/O, except that I want a file header that specifies a few things like magic number, records allocated, head of free list, etc. In practice I don't see any need for a record size not known at compile time...except that if there are different versions of the program, they might include different things, so, e.g., the size of the file header might need to be variable.

This is a design problem I'm still trying to wrap my head around. Efficiency seems to say "you need to know the size at compile time", but flexibility says "you can't depend on the size at compile time". The only compromise position seems to compromise safety (by depending on void * and record size parameters that aren't guaranteed safe). I'll probably eventually decide in favor of "size fixed at compile time", but I'm still dithering. But clearly efficiency dictates that the read size not be a basic type. I'm currently thinking of a struct that's about 1 KB in size. As far as the I/O routines are concerned this will probably all be uninterpreted bytes, unless I throw in some sequencing for error recovery...but that's probably making things too complex, and should be left for a higher level.

Clearly this is a bit of a specialized case, so I wouldn't be considering implementing all of stdio, only the relevant bits, and those wrapped with an interpretation based around record number. The thing is, I'd probably be writing this wrapper anyway; what I was wondering originally is whether there was any reason to use std.file as the underlying library rather than going directly to core.stdc.stdio.
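For illustration, a sketch of what such a fixed-size header might look like; the field names, types, and the 1 KB size (a figure that comes up later in the thread) are assumptions, not anything settled here:

struct FileHeader {
    ulong magic;             // identifies the file format
    uint  version_;          // lets newer program versions recognize older files
    uint  recordSize;        // bytes per record, fixed when the file is created
    ulong recordsAllocated;  // number of records currently allocated
    ulong freeListHead;      // record number of the first free record
    ubyte[1024 - 2*uint.sizeof - 3*ulong.sizeof] reserved;  // pad to 1024 bytes
}
static assert(FileHeader.sizeof == 1024);

Keeping the version and record size in the header is one way to get flexibility across program versions without giving up a compile-time-sized record struct.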
Jul 26 2016
ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 26 July 2016 at 16:35:26 UTC, Charles Hixson wrote:
 That's sort of what I have in mind, but I want to do what in 
 Fortran would be (would have been?) called record I/O, except 
 that I want a file header that specifies a few things like 
 magic number, records allocated, head of free list, etc.  In 
 practice I don't see any need for record size not known at 
 compile time...except that if there are different
 versions of the program, they might include different things, 
 so, e.g., the size of the file header might need to be variable.
it looks like you want a serialization library. there are some: http://wiki.dlang.org/Serialization_Libraries
Jul 26 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/25/16 9:19 PM, Charles Hixson via Digitalmars-d-learn wrote:
 On 07/25/2016 05:18 PM, ketmar via Digitalmars-d-learn wrote:
 On Monday, 25 July 2016 at 18:54:27 UTC, Charles Hixson wrote:
 Are there reasons why one would use rawRead and rawWrite rather than
 fread and fwrite when doing binary random I/O?  What are the advantages?

 In particular, if one is reading and writing structs rather than
 arrays or ranges, are there any advantages?
yes: keeping API consistent. ;-) for example, my stream i/o modules work with anything that has `rawRead`/`rawWrite` methods, but don't bother to check for anything else. besides, `rawRead` just looks cleaner, even with all the `(&a)[0..1]` noise. so, a question of style.
OK. If it's just a question of "looking cleaner" and "style", then I will prefer the core.stdc.stdio approach. I find its appearance much cleaner...and even that's understating things. I'll probably wrap those routines in a struct to ensure things like files being properly closed, and not have explicit pointers persisting over large areas of code.
It's more than just that. Having a bounded array is safer than separate pointer/length parameters. Literally, rawRead and rawWrite are inferred @safe, whereas fread and fwrite are not.

But D is so nice with UFCS, you don't have to live with APIs you don't like. Allow me to suggest adding a helper function to your code:

void rawReadItem(T)(File f, ref T item) @trusted
{
   f.rawRead((&item)[0 .. 1]);
}

-Steve
Jul 26 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/26/2016 05:31 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
 On 7/25/16 9:19 PM, Charles Hixson via Digitalmars-d-learn wrote:
 On 07/25/2016 05:18 PM, ketmar via Digitalmars-d-learn wrote:
 On Monday, 25 July 2016 at 18:54:27 UTC, Charles Hixson wrote:
 Are there reasons why one would use rawRead and rawWrite rather than
 fread and fwrite when doing binary random I/O?  What are the 
 advantages?

 In particular, if one is reading and writing structs rather than
 arrays or ranges, are there any advantages?
yes: keeping API consistent. ;-) for example, my stream i/o modules work with anything that has `rawRead`/`rawWrite` methods, but don't bother to check for anything else. besides, `rawRead` just looks cleaner, even with all the `(&a)[0..1]` noise. so, a question of style.
OK. If it's just a question of "looking cleaner" and "style", then I will prefer the core.stdc.stdio approach. I find its appearance much cleaner...and even that's understating things. I'll probably wrap those routines in a struct to ensure things like files being properly closed, and not have explicit pointers persisting over large areas of code.
It's more than just that. Having a bounded array is safer than separate pointer/length parameters. Literally, rawRead and rawWrite are inferred @safe, whereas fread and fwrite are not.

But D is so nice with UFCS, you don't have to live with APIs you don't like. Allow me to suggest adding a helper function to your code:

void rawReadItem(T)(File f, ref T item) @trusted
{
   f.rawRead((&item)[0 .. 1]);
}

-Steve
That *does* make the syntax a lot nicer, and I understand the safety advantage of not using separate pointer/length parameters. But I'm going to be wrapping the I/O anyway, and the external interface is going to be more like:

struct RF (T, long magic)
{
   ....
   void read (size_t recNo, ref T val) {...}
   size_t read (ref T val) {...}
   ...
}

where a sequential read returns the record number, or you specify the record number and get an indexed-I/O read. So the length will be T.sizeof, and will be specified at the time the file is opened.

To me this seems to eliminate the advantage of stdfile, and stdfile seems to add a level of indirection. Ranges aren't free, are they? If they are, then I should probably use stdfile, because that is probably less likely to change than core.stdc.stdio. When I see "f.rawRead((&item)[0 .. 1])" it looks to me as if unneeded code is being generated explicitly to be thrown away. (I don't like using pointer/length either, but it's actually easier to understand than this kind of thing, and this LOOKS like it's generating extra code.)

That said, perhaps I should use stdio anyway. When doing I/O it's the disk speed that's the really slow part, and that so dominates things that worrying about trivialities is foolish. And since it's going to be wrapped anyway, the ugly will be confined to a very small routine.
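As a rough sketch only: one way the wrapper described above could sit on top of std.stdio.File, with a fixed record size and a seek per access. The names, the 1024-byte header, and the (absent) error handling are illustrative assumptions, not anything from the thread:

import std.stdio : File;

struct RF (T, long magic)
{
   File f;
   enum headerSize = 1024;           // assumed fixed-size header

   this (string path) { f = File(path, "r+b"); }

   // read record recNo into val
   void read (size_t recNo, ref T val)
   {
      f.seek(headerSize + recNo * T.sizeof);   // default origin is SEEK_SET
      f.rawRead((&val)[0 .. 1]);
   }

   // write val as record recNo
   void write (size_t recNo, const ref T val)
   {
      f.seek(headerSize + recNo * T.sizeof);
      f.rawWrite((&val)[0 .. 1]);
   }
}

The record number is just multiplied into a byte offset, so the ugly slice syntax stays inside these two small routines.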
Jul 26 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/26/16 12:58 PM, Charles Hixson via Digitalmars-d-learn wrote:

 Ranges aren't free, are they? If they are, then I should probably use stdfile,
 because that is probably less likely to change than core.stdc.stdio.
Do you mean slices?
 When I see "f.rawRead((&item)[0 .. 1])" it looks to me as if unneeded code
 is being generated explicitly to be thrown away.  (I don't like using
 pointer/length either, but it's actually easier to understand than this
 kind of thing, and this LOOKS like it's generating extra code.)
This is probably a misunderstanding on your part.

&item is accessing the item as a pointer. Since the compiler already has it as a reference, this is a noop -- just an expression to change the type. [0 .. 1] is constructing a slice out of a pointer. It's all done inline by the compiler (there is no special _d_constructSlice function), so that is very very quick. There is no bounds checking, because pointers do not have bounds checks.

So there is pretty much zero overhead for this. Just push the pointer and length onto the stack (or registers, not sure of ABI), and call rawRead.
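Spelled out step by step (a sketch only; MyStruct and readOne are hypothetical names, not from the thread):

import std.stdio : File;

struct MyStruct { int a; double b; }   // hypothetical record type

void readOne(File f, ref MyStruct item)
{
   MyStruct* p = &item;          // just a type change; the address already exists
   MyStruct[] one = p[0 .. 1];   // pointer + length, no allocation, no bounds check
   f.rawRead(one);               // fills item with MyStruct.sizeof bytes
}

The three lines compile to the same thing as the one-liner f.rawRead((&item)[0 .. 1]); nothing is generated just to be thrown away.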
 That said, perhaps I should use stdio anyway.  When doing I/O it's the
 disk speed that's the really slow part, and that so dominates things
 that worrying about trivialities is foolish.  And since it's going to be
 wrapped anyway, the ugly will be confined to a very small routine.
Having written a very templated io library (https://github.com/schveiguy/iopipe), I can tell you that in my experience, the slowdown comes from 2 things: 1) spending time calling the kernel, and 2) not being able to inline.

This of course assumes that proper buffering is done. Buffering should mitigate most of the slowdown from the disk. It is expensive, but you amortize the expense by buffering.

C's i/o is pretty much as good as it gets for an opaque non-inlinable system, as long as your requirements are simple enough. The std.stdio code should basically inline into the calls you should be making, and it handles a bunch of stuff that optimizes the calls (such as locking the file handle for one complex operation).

-Steve
Jul 26 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/26/2016 10:18 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
 On 7/26/16 12:58 PM, Charles Hixson via Digitalmars-d-learn wrote:

 Ranges aren't free, are they? If they are, then I should probably use stdfile,
 because that is probably less likely to change than core.stdc.stdio.
Do you mean slices?
 When I see "f.rawRead((&item)[0 .. 1])" it looks to me as if unneeded code
 is being generated explicitly to be thrown away.  (I don't like using
 pointer/length either, but it's actually easier to understand than this
 kind of thing, and this LOOKS like it's generating extra code.)
This is probably a misunderstanding on your part. &item is accessing the item as a pointer. Since the compiler already has it as a reference, this is a noop -- just an expression to change the type. [0 .. 1] is constructing a slice out of a pointer. It's done all inline by the compiler (there is no special _d_constructSlice function), so that is very very quick. There is no bounds checking, because pointers do not have bounds checks. So there is pretty much zero overhead for this. Just push the pointer and length onto the stack (or registers, not sure of ABI), and call rawRead.
 That said, perhaps I should use stdio anyway.  When doing I/O it's the
 disk speed that's the really slow part, and that so dominates things
 that worrying about trivialities is foolish.  And since it's going to be
 wrapped anyway, the ugly will be confined to a very small routine.
Having written a very templated io library (https://github.com/schveiguy/iopipe), I can tell you that in my experience, the slowdown comes from 2 things: 1) spending time calling the kernel, and 2) not being able to inline. This of course assumes that proper buffering is done. Buffering should mitigate most of the slowdown from the disk. It is expensive, but you amortize the expense by buffering. C's i/o is pretty much as good as it gets for an opaque non-inlinable system, as long as your requirements are simple enough. The std.stdio code should basically inline into the calls you should be making, and it handles a bunch of stuff that optimizes the calls (such as locking the file handle for one complex operation). -Steve
Thanks. Since there isn't any excess overhead I guess I'll use stdio. Buffering, however, isn't going to help at all since I'm doing random I/O. I know that most of the data the system reads from disk is going to end up getting thrown away, since my records will generally be smaller than 8K, but there's no help for that.
Jul 26 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/26/16 1:57 PM, Charles Hixson via Digitalmars-d-learn wrote:

 Thanks.  Since there isn't any excess overhead I guess I'll use stdio.
 Buffering, however, isn't going to help at all since I'm doing
 random I/O.  I know that most of the data the system reads from disk is
 going to end up getting thrown away, since my records will generally be
 smaller than 8K, but there's no help for that.
Even for doing random I/O buffering is helpful. It depends on the size of your items. Essentially, to read 10 bytes from a file probably costs the same as reading 100,000 bytes from a file. So may as well buffer that in case you need it.

Now, C i/o's buffering may not suit your exact needs. So I don't know how it will perform. You may want to consider mmap, which tells the kernel to link pages of memory directly to disk access. Then the kernel is doing all the buffering for you. Phobos has support for it, but it's pretty minimal from what I can see: http://dlang.org/phobos/std_mmfile.html

-Steve
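A minimal sketch of what touching a fixed-size record through std.mmfile might look like; Rec, the file name, and the offsets are illustrative assumptions, and the exact MmFile behaviour should be checked against its documentation:

import std.mmfile : MmFile;

struct Rec { long id; double[127] payload; }   // hypothetical ~1 KB record

void example()
{
   // map an existing file read/write; size 0 means "use the current file size"
   auto mm = new MmFile("records.bin", MmFile.Mode.readWrite, 0, null);

   size_t recNo = 42;
   size_t off = recNo * Rec.sizeof;

   // slicing the mapping only touches the pages that back this range
   Rec* r = cast(Rec*) mm[off .. off + Rec.sizeof].ptr;
   r.id = 123;   // the kernel writes the dirty page back to the file later
}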
Jul 26 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/26/2016 11:31 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
 On 7/26/16 1:57 PM, Charles Hixson via Digitalmars-d-learn wrote:

 Thanks.  Since there isn't any excess overhead I guess I'll use stdio.
 Buffering, however, isn't going to help at all since I'm doing
 random I/O.  I know that most of the data the system reads from disk is
 going to end up getting thrown away, since my records will generally be
 smaller than 8K, but there's no help for that.
Even for doing random I/O buffering is helpful. It depends on the size of your items. Essentially, to read 10 bytes from a file probably costs the same as reading 100,000 bytes from a file. So may as well buffer that in case you need it. Now, C i/o's buffering may not suit your exact needs. So I don't know how it will perform. You may want to consider mmap which tells the kernel to link pages of memory directly to disk access. Then the kernel is doing all the buffering for you. Phobos has support for it, but it's pretty minimal from what I can see: http://dlang.org/phobos/std_mmfile.html -Steve
I've considered std.mmfile often, but when I read the documentation I end up realizing that I don't understand it. So I look up memory mapped files in other places, and I still don't understand it. It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on. I know that there was an early form of this in a version of BASIC (the version that RISS was written in, but I don't remember which version that was), and in *that* version array elements were read in as needed. (It wasn't spectacularly efficient.) But memory mapped files don't seem to work that way, because people keep talking about how efficient they are. Do you know a good introductory tutorial? I'm guessing that "window size" might refer to the number of bytes available, but what if you need to append to the file? Etc.

A part of the problem is that I don't want this to be a process with an arbitrarily high memory use. Buffering would be fine, if I could use it, but for my purposes sequential access is likely to be rare, and the working layout of the data in RAM doesn't (can't reasonably) match the layout on disk.

IIUC (this is a few decades old) the system buffer size is about 8K. I expect to never need to read that large a chunk, but I'm going to try to keep the chunks in multiples of 1024 bytes, and, if it's reasonable, to exactly 1024 bytes. So I should never need two reads or writes for a chunk. I guess to be sure of this I'd better make sure the file header is also 1024 bytes. (I'm guessing that the seek to position results in the appropriate buffer being read into the system buffer, so if my header were 512 bytes I might occasionally need to do double reads or writes.)

I'm guessing that memory mapped files trade off memory use against speed of access, and for my purposes that's probably a bad trade, even though databases are doing that more and more. I'm likely to need all the memory I can lay my hands on, and even then thrashing wouldn't surprise me. So a fixed buffer size seems a huge advantage.
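As a quick check of the alignment reasoning above (the 1024-byte record and header sizes are his plan; the 4 K/8 K buffer sizes are typical values, not from the thread):

enum headerSize = 1024;
enum recordSize = 1024;

// byte offset of record n: every record starts on a 1024-byte boundary
size_t offsetOf(size_t n) { return headerSize + n * recordSize; }

// since 1024 divides the usual 4 K or 8 K buffer size, a 1024-byte record
// can never straddle a buffer boundary, so one seek means one read or write
static assert(4096 % recordSize == 0 && 8192 % recordSize == 0);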
Jul 26 2016
Adam D. Ruppe <destructionator gmail.com> writes:
On Tuesday, 26 July 2016 at 19:30:35 UTC, Charles Hixson wrote:
 It looks as if the entire file is stored in memory, which is 
 not at all what I want, but I also can't really believe that's 
 what's going on.
It is just mapped to virtual memory without actually being loaded into physical memory, so when you access the array it returns, the kernel loads a page of the file into memory, but it doesn't do that until it actually has to. Think of it as being like this:

struct MagicFile {
   ubyte[] opIndex(size_t idx) {
      auto buffer = new ubyte[](some_block_length);
      fseek(fp, idx, SEEK_SET);
      fread(buffer.ptr, buffer.length, 1);
      return buffer;
   }
}

And something analogous for writing, but instead of being done with overloaded operators in D, it is done with the MMU hardware by the kernel (and the kernel also does smarter buffering than this little example).
 A part of the problem is that I don't want this to be a process 
 with an arbitrarily high memory use.
The kernel will automatically handle physical memory usage too, similarly to a page file. If you haven't read a portion of the file recently, it will discard that page, since it can always read it again off disk if needed, but if you do have memory to spare, it will keep the data in memory for faster access later. So basically the operating system handles a lot of the details, which makes it efficient.

Growing a memory mapped file is a bit tricky though, you need to unmap and remap. Since it is an OS concept, you can always look for C or C++ examples too, like here:
Jul 26 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/26/2016 12:53 PM, Adam D. Ruppe via Digitalmars-d-learn wrote:
 On Tuesday, 26 July 2016 at 19:30:35 UTC, Charles Hixson wrote:
 It looks as if the entire file is stored in memory, which is not at 
 all what I want, but I also can't really believe that's what's going on.
It is just mapped to virtual memory without actually being loaded into physical memory, so when you access the array it returns, the kernel loads a page of the file into memory, but it doesn't do that until it actually has to. Think of it as being like this:

struct MagicFile {
   ubyte[] opIndex(size_t idx) {
      auto buffer = new ubyte[](some_block_length);
      fseek(fp, idx, SEEK_SET);
      fread(buffer.ptr, buffer.length, 1);
      return buffer;
   }
}

And something analogous for writing, but instead of being done with overloaded operators in D, it is done with the MMU hardware by the kernel (and the kernel also does smarter buffering than this little example).
 A part of the problem is that I don't want this to be a process with 
 an arbitrarily high memory use.
The kernel will automatically handle physical memory usage too, similarly to a page file. If you haven't read a portion of the file recently, it will discard that page, since it can always read it again off disk if needed, but if you do have memory to spare, it will keep the data in memory for faster access later. So basically the operating system handles a lot of the details, which makes it efficient. Growing a memory mapped file is a bit tricky though, you need to unmap and remap. Since it is an OS concept, you can always look for C or C++ examples too, like here:
O, dear. It was sounding like such an excellent approach until this last paragraph, but growing the file is going to be one of the common operations. (Certainly at first.) It sounds as if that means the file needs to be closed and re-opened for extensions. And I quote from https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html:

Function: void * mremap (void *address, size_t length, size_t new_length, int flag)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts <https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html#POSIX-Safety-Concepts>.

This function can be used to change the size of an existing memory area. address and length must cover a region entirely mapped in the same mmap statement. A new mapping with the same characteristics will be returned with the length new_length. ... This function is only available on a few systems. Except for performing optional optimizations one should not rely on this function.

So I'm probably better off sticking to using a seek-based i/o system.
Jul 26 2016
Rene Zwanenburg <renezwanenburg gmail.com> writes:
On Wednesday, 27 July 2016 at 02:20:57 UTC, Charles Hixson wrote:
 O, dear.  It was sounding like such an excellent approach until 
 this
 last paragraph, but growing the file is going to be one of the 
 common
 operations.  (Certainly at first.) (...)
 So I'm probably better off sticking to using a seek based i/o 
 system.
Not necessarily. The usual approach is to over-allocate your file so you don't need to grow it that often. This is the exact same strategy used by D's dynamic arrays and grow-able array-backed lists in other languages - the difference between list length and capacity. There is no built-in support for this in std.mmfile afaik. But it's not hard to do yourself.
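A rough sketch of the over-allocate-and-remap idea, using std.mmfile. The doubling policy and every name here are illustrative assumptions (std.mmfile itself provides none of this), and whether mapping with an explicit larger size extends the file should be checked against the MmFile documentation:

import std.mmfile : MmFile;

struct GrowableMap
{
   MmFile mm;
   string path;
   ulong length;     // bytes actually in use
   ulong capacity;   // bytes the file is currently sized/mapped to

   this(string path, ulong initialCapacity)
   {
      this.path = path;
      capacity = initialCapacity;
      // mapping read/write with an explicit size should extend the file to that size
      mm = new MmFile(path, MmFile.Mode.readWrite, capacity, null);
   }

   // ensure at least `need` bytes are usable, remapping only on overflow
   void ensure(ulong need)
   {
      if (need > capacity)
      {
         while (capacity < need) capacity *= 2;   // doubling amortizes the remap cost
         destroy(mm);                             // unmap the old view first
         mm = new MmFile(path, MmFile.Mode.readWrite, capacity, null);
      }
      if (need > length) length = need;
   }
}

With doubling, a file that grows record by record is remapped only O(log n) times, so the close-and-remap cost mentioned earlier stays small.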
Jul 27 2016
Charles Hixson via Digitalmars-d-learn writes:
On 07/27/2016 06:46 AM, Rene Zwanenburg via Digitalmars-d-learn wrote:
 On Wednesday, 27 July 2016 at 02:20:57 UTC, Charles Hixson wrote:
 O, dear.  It was sounding like such an excellent approach until this
 last paragraph, but growing the file is going to be one of the common
 operations.  (Certainly at first.) (...)
 So I'm probably better off sticking to using a seek based i/o system.
Not necessarily. The usual approach is to over-allocate your file so you don't need to grow it that often. This is the exact same strategy used by D's dynamic arrays and grow-able array-backed lists in other languages - the difference between list length and capacity. There is no built-in support for this in std.mmfile afaik. But it's not hard to do yourself.
Well, that would mean I wouldn't need to reopen the file so often, but it sure wouldn't mean I'd never need to re-open it. And it would add considerable complexity. Possibly that would be an optimal approach once the data has mainly been collected, but I won't want to re-write this bit at that point.
Jul 27 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/26/16 3:30 PM, Charles Hixson via Digitalmars-d-learn wrote:
 On 07/26/2016 11:31 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
 Now, C i/o's buffering may not suit your exact needs. So I don't know
 how it will perform. You may want to consider mmap which tells the
 kernel to link pages of memory directly to disk access. Then the
 kernel is doing all the buffering for you. Phobos has support for it,
 but it's pretty minimal from what I can see:
 http://dlang.org/phobos/std_mmfile.html
I've considered mmapfile often, but when I read the documentation I end up realizing that I don't understand it. So I look up memory mapped files in other places, and I still don't understand it. It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on.
Of course that isn't what is happening :)

What happens is that the kernel says memory page 0x12345 (or whatever) is mapped to the file. Then when you access a mapped page, the memory management unit raises a page fault (because that memory isn't loaded), which triggers the kernel to load that page of memory. The kernel sees that the memory is really mapped to that file, and loads the page from the file instead. As you write to the memory location, the page is marked dirty, and at some point the kernel flushes that page back to disk.

Everything is done behind the scenes and is in tune with the filesystem itself, so you get a little extra benefit from that.
 I know that
 there was an early form of this in a version of BASIC (the version that
 RISS was written in, but I don't remember which version that was) and in
 *that* version array elements were read in as needed.  (It wasn't
 spectacularly efficient.)  But memory mapped files don't seem to work
 that way, because people keep talking about how efficient they are.  Do
 you know a good introductory tutorial?  I'm guessing that "window size"
 might refer to the number of bytes available, but what if you need to
 append to the file?  Etc.
To be honest, I'm not super familiar with actually using them, I just have a rough idea of how they work. The actual usage you will have to look up.
 A part of the problem is that I don't want this to be a process with an
 arbitrarily high memory use.
You should know that you can allocate as much memory as you want, as long as you have address space for it, and you won't actually map that to physical memory until you use it. So the management of the memory is done lazily, all supported by the MMU hardware. This is true for actual memory too!

Note that the only "memory" you are using for the mmapped file is page buffers in the kernel, which are likely already being used to buffer the disk reads. It's not like it's loading the entire file into memory, and it probably doesn't even load all sequential pages into memory. It only loads the ones you use.

I'm pretty much at my limit for knowledge of this subject (and maybe I have a few things incorrect); I'm sure others here know much more. I suggest you play a bit with it to see what the performance is like. I have also heard that it's very fast.

-Steve
Jul 26 2016