digitalmars.D.learn - Reading a structured binary file?
- Gary Willoughby (4/4) Aug 02 2013 What library commands do i use to read from a structured binary
- Dicebot (4/8) Aug 02 2013 http://dlang.org/phobos/std_file.html#.read
- Justin Whear (14/18) Aug 02 2013 You can use File.rawRead:
- John Colvin (4/8) Aug 02 2013 How big is the file?
- Gary Willoughby (1/4) Aug 02 2013 Quite large so i'll probably stream it. Thanks guys.
- Jesse Phillips (8/12) Aug 02 2013 You've gotten some help already around functions D provides. But
- Jonathan M Davis (6/10) Aug 02 2013 I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow...
- captaindet (4/14) Aug 02 2013 FWIW
- H. S. Teoh (15/30) Aug 02 2013 [...]
- monarch_dodra (15/62) Aug 03 2013 I did some benching a while back with user bioinfornatics. He had
- H. S. Teoh (11/56) Aug 03 2013 Sorry, I lost the context of this discussion, what algo are you
- Gary Willoughby (4/12) Aug 03 2013 This sounds a great idea but once the file has been opened as a
- Gary Willoughby (6/9) Aug 03 2013 I'm currently doing this:
- John Colvin (11/21) Aug 03 2013 That defeats the object of memory mapping, as the [] at the end
- Jonathan M Davis (7/25) Aug 03 2013 Are you sure about that? Maybe I'm just not familiar enough with mmap, b...
- H. S. Teoh (15/40) Aug 03 2013 [...]
- Jesse Phillips (15/25) Aug 05 2013 You will need to slice the size of the data you want, otherwise
- H. S. Teoh (14/20) Aug 05 2013 I don't know about D's Mmfile, but AFAIK, it maps directly to the OS
- Jonathan M Davis (11/31) Aug 05 2013 mmap is awesome. It makes handling large files _way_ easier, especially ...
- Jonathan M Davis (3/15) Aug 03 2013 Yeah. That's how you do it.
- Jonathan M Davis (9/49) Aug 03 2013 That's what I thought that mmap did, but it's not something that I've st...
What library commands do I use to read from a structured binary file? I want to read the byte stream 1, 2 or maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().
Aug 02 2013
On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
> What library commands do i use to read from a structured binary file? I want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().

http://dlang.org/phobos/std_file.html#.read

?
Aug 02 2013
On Fri, 02 Aug 2013 19:49:54 +0200, Gary Willoughby wrote:
> What library commands do i use to read from a structured binary file? I want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().

You can use File.rawRead:

    ushort[1] myShort;
    file.rawRead(myShort);

Or if you have structures in the file:

    struct Foo
    {
        align(1):
        int bar;
        short k;
        char[7] str;
    }

    Foo[1] foo;
    file.rawRead(foo);
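A complete, minimal sketch of the rawRead approach (the file name "file.dat" and the Record layout are made up for illustration; rawRead copies bytes verbatim, so this assumes the file uses the machine's native endianness):

    import std.stdio : File, writeln;

    // Hypothetical on-disk record; align(1) removes padding so the struct
    // matches the raw byte layout byte-for-byte.
    struct Record
    {
        align(1):
        int id;
        short kind;
        char[7] name;
    }

    void main()
    {
        auto f = File("file.dat", "rb");

        Record[1] rec;
        auto got = f.rawRead(rec[]); // returns the slice it actually filled
        if (got.length == 1)
            writeln(got[0].id, " ", got[0].kind, " ", got[0].name);
    }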
Aug 02 2013
On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
> What library commands do i use to read from a structured binary file? I want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().

How big is the file?

If it's not too huge I'd just read it in with std.file.read and then sort out splitting it up from there.
Aug 02 2013
> How big is the file? If it's not too huge i'd just read it in with std.file.read and then sort out splitting it up from there.

Quite large, so I'll probably stream it. Thanks guys.
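For streaming, one simple option is File.byChunk; a sketch (chunk size and file name are arbitrary placeholders):

    import std.stdio : File;

    void main()
    {
        auto f = File("file.dat", "rb");
        foreach (ubyte[] chunk; f.byChunk(64 * 1024))
        {
            // parse `chunk` here; byChunk reuses its buffer between
            // iterations, so copy anything that must outlive this pass
        }
    }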
Aug 02 2013
On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
> What library commands do i use to read from a structured binary file? I want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().

You've gotten some help already around the functions D provides, but I thought I would mention that I recently tried to do some large-file parsing of binary data and decided to blog about it:

http://he-the-great.livejournal.com/47550.html

I can't say this is the best solution, but it worked. I was parsing a 20 gig OpenStreetMap planet file.
Aug 02 2013
On Friday, August 02, 2013 19:49:54 Gary Willoughby wrote:
> What library commands do i use to read from a structured binary file? I want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().

I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to efficiently operate on the file as a ubyte[] in memory thanks to mmap, and std.bitmanip's peek and read functions make it easy to convert multiple bytes into integral values.

- Jonathan M Davis
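A rough sketch of that combination (the file name, offsets and little-endian layout are assumptions; std.bitmanip defaults to big endian if you don't specify):

    import std.bitmanip : peek, read;
    import std.mmfile : MmFile;
    import std.stdio : writeln;
    import std.system : Endian;

    void main()
    {
        auto mmfile = new MmFile("file.dat");
        auto data = cast(ubyte[]) mmfile[]; // a view of the mapping, not a copy

        // peek reads at a given offset without consuming anything.
        auto magic = data.peek!(uint, Endian.littleEndian)(0);

        // read consumes from the front of the slice it is given.
        auto rest = data[4 .. $];
        auto a = rest.read!(ushort, Endian.littleEndian)();
        auto b = rest.read!(int, Endian.littleEndian)();

        writeln(magic, " ", a, " ", b);
    }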
Aug 02 2013
On 2013-08-02 17:13, Jonathan M Davis wrote:
> On Friday, August 02, 2013 19:49:54 Gary Willoughby wrote:
>> What library commands do i use to read from a structured binary file? I want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these to bytes, shorts and ints respectively. I can't seem to find anything like readByte().
> I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to efficiently operate on the file as a ubyte[] in memory thanks to mmap, and std.bitmanip's peek and read functions make it easy to convert multiple bytes into integral values.
> - Jonathan M Davis

FWIW i have to deal with big data files that can be a few GB. for some data analysis software i wrote in C a while back i did some testing with caching and such. turns out that for Win7-64 the automatic caching done by the OS is really good and any attempt to speed things up actually slowed it down. no kidding, i have seen more than 2GB of data being automatically cached. of course the system RAM must be larger than the file size (if i remember my tests correctly by a factor of ~2, but this is maybe not a linear relationship, i did not actually change the RAM just the size of the data file) and it will hold it in the cache only as long as there are no concurrent applications requiring RAM or caching.

i guess my point is, if your target is Win7 and your files are >5x smaller than the installed RAM i would not bother at all trying to optimize file access. i suppose -nix machines will do a similar good job these days.

/det
Aug 02 2013
On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
[...]
> FWIW i have to deal with big data files that can be a few GB. for some data analysis software i wrote in C a while back i did some testing with caching and such. turns out that for Win7-64 the automatic caching done by the OS is really good and any attempt to speed things up actually slowed it down. no kidding, i have seen more than 2GB of data being automatically cached. of course the system RAM must be larger than the file size (if i remember my tests correctly by a factor of ~2, but this is maybe not a linear relationship, i did not actually change the RAM just the size of the data file) and it will hold it in the cache only as long as there are no concurrent applications requiring RAM or caching. i guess my point is, if your target is Win7 and your files are >5x smaller than the installed RAM i would not bother at all trying to optimize file access. i suppose -nix machines will do a similar good job these days.
[...]

IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :)

There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets. I don't know how well mmap works on windows, though, IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on windows.

T

--
There is no gravity. The earth sucks.
Aug 02 2013
On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
> On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
> [...]
>> FWIW i have to deal with big data files that can be a few GB. for some data analysis software i wrote in C a while back i did some testing with caching and such. turns out that for Win7-64 the automatic caching done by the OS is really good and any attempt to speed things up actually slowed it down. no kidding, i have seen more than 2GB of data being automatically cached. of course the system RAM must be larger than the file size (if i remember my tests correctly by a factor of ~2, but this is maybe not a linear relationship, i did not actually change the RAM just the size of the data file) and it will hold it in the cache only as long as there are no concurrent applications requiring RAM or caching. i guess my point is, if your target is Win7 and your files are >5x smaller than the installed RAM i would not bother at all trying to optimize file access. i suppose -nix machines will do a similar good job these days.
> [...]
> IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :) There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets. I don't know how well mmap works on windows, though, IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on windows.
>
> T

I did some benching a while back with user bioinfornatics. He had to do some pretty large file reads, preferably in very little time. Observations showed my algo was *much* faster under Windows than Linux.

What we observed is that under Windows, as soon as you open a file for reading, Windows starts buffering the file in a parallel thread.

What we did was create two threads. The first did nothing but read the file, store it into chunks of memory, and then pass it to a worker thread. The worker thread did the parsing proper. Doing this *halved* the Linux runtime, tying it with the "monothreaded" Windows run time. Windows saw no change.

FYI, the full thread is here:
http://forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx@forum.dlang.org
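For anyone wanting to try the same split in D, here is a rough sketch using std.concurrency (the chunk size, file name and trivial "parsing" are placeholders; a real worker would decode each chunk instead of just counting bytes):

    import std.concurrency : receiveOnly, send, spawn, thisTid, Tid;
    import std.stdio : File, writeln;

    // Reader thread: pull chunks off disk and hand them to the parser.
    void reader(Tid parser, string path)
    {
        auto f = File(path, "rb");
        foreach (chunk; f.byChunk(4 * 1024 * 1024))
            send(parser, chunk.idup);                // immutable copy crosses threads
        send(parser, cast(immutable(ubyte)[]) null); // empty chunk signals end of file
    }

    void main()
    {
        spawn(&reader, thisTid, "file.dat");

        size_t total;
        for (;;)
        {
            auto chunk = receiveOnly!(immutable(ubyte)[])();
            if (chunk.length == 0)
                break;
            total += chunk.length; // the "parsing" of this sketch
        }
        writeln("read ", total, " bytes");
    }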
Aug 03 2013
On Sat, Aug 03, 2013 at 11:29:01PM +0200, monarch_dodra wrote:
> On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
> [...]
>> On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
>> [...]
>>> FWIW i have to deal with big data files that can be a few GB. for some data analysis software i wrote in C a while back i did some testing with caching and such. turns out that for Win7-64 the automatic caching done by the OS is really good and any attempt to speed things up actually slowed it down. no kidding, i have seen more than 2GB of data being automatically cached. of course the system RAM must be larger than the file size (if i remember my tests correctly by a factor of ~2, but this is maybe not a linear relationship, i did not actually change the RAM just the size of the data file) and it will hold it in the cache only as long as there are no concurrent applications requiring RAM or caching. i guess my point is, if your target is Win7 and your files are >5x smaller than the installed RAM i would not bother at all trying to optimize file access. i suppose -nix machines will do a similar good job these days.
>> [...]
>> IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :) There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets. I don't know how well mmap works on windows, though, IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on windows.
> I did some benching a while back with user bioinfornatics. He had to do some pretty large file reads, preferably in very little time. Observations showed my algo was *much* faster under windows then linux.

Sorry, I lost the context of this discussion, what algo are you referring to?

> What we observed is that under windows, as soon as you open a file for reading, windows starts buffering the file in a parallel thread. What we did was create two threads. The first did nothing but read the file, store it into chunks of memory, and then pass it to a worker thread. The worker thread did the parsing proper. Doing this *halved* the linux runtime, tying it with the "monothreaded" windows run time. Windows saw no change.

Interesting. I wonder if you could, under Linux, mmap a file then have one thread access the first byte of each file block while another thread does the real work with the data.

> FYI, the full thread is here: http://forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx@forum.dlang.org

I'll take a look, thanks.

T

--
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
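A rough sketch of that idea (untested; the 4096-byte page size is an assumption, and on Linux madvise(MADV_WILLNEED) would be a more direct way to ask for readahead):

    import core.thread : Thread;
    import std.mmfile : MmFile;

    void main()
    {
        auto mmfile = new MmFile("file.dat"); // placeholder file name
        auto data = cast(ubyte[]) mmfile[];

        // Prefetch thread: touch one byte per page so the kernel starts
        // faulting pages in ahead of the thread doing the real parsing.
        enum pageSize = 4096;
        auto prefetcher = new Thread({
            ubyte sink;
            for (size_t i = 0; i < data.length; i += pageSize)
                sink ^= data[i];
        });
        prefetcher.start();

        // ... parse `data` here, in parallel with the prefetcher ...

        prefetcher.join();
    }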
Aug 03 2013
On Friday, 2 August 2013 at 22:13:28 UTC, Jonathan M Davis wrote:
> I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to efficiently operate on the file as a ubyte[] in memory thanks to mmap, and std.bitmanip's peek and read functions make it easy to convert multiple bytes into integral values.
> - Jonathan M Davis

This sounds like a great idea, but once the file has been opened as a MmFile how do I convert it to a ubyte[] so the std.bitmanip functions work with it?
Aug 03 2013
On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?

I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[])file[];
    buffer.read!uint();
    // etc.

Is this how you would recommend doing it?
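For reference, the snippet above needs these imports (MmFile lives in std.mmfile, read in std.bitmanip); note also that read defaults to Endian.bigEndian, so pass Endian.littleEndian explicitly for little-endian data:

    import std.mmfile : MmFile;  // memory-mapped file wrapper
    import std.bitmanip : read;  // consumes integral values from a ubyte[]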
Aug 03 2013
On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?
> I'm currently doing this:
> auto file = new MmFile("file.dat");
> ubyte[] buffer = cast(ubyte[])file[];
> buffer.read!uint(); //etc.
> Is this how you would recommend?

That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file into memory.

3 options I can think of:

1) copy read from std.bitmanip and modify it to work nicely with MmFile
2) write a wrapper for MmFile to let it work nicely with read
3) rewrite/modify MmFile

I would love to do 3) at some point, but I'm too busy at the moment.
Aug 03 2013
On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
> On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
>> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>>> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?
>> I'm currently doing this:
>> auto file = new MmFile("file.dat");
>> ubyte[] buffer = cast(ubyte[])file[];
>> buffer.read!uint(); //etc.
>> Is this how you would recommend?
> That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file in to memory.

Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.

- Jonathan M Davis
Aug 03 2013
On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
> On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
>> On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
>>> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>>>> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?
>>> I'm currently doing this:
>>> auto file = new MmFile("file.dat");
>>> ubyte[] buffer = cast(ubyte[])file[];
>>> buffer.read!uint(); //etc.
>>> Is this how you would recommend?
>> That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file in to memory.
> Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.
[...]

I think he meant that the OS will have to load the entire file into memory if you sliced the mmap'ed file, not that you'll copy all the data. I'm not certain this is true, though, because slicing as I understand it only returns the address of the start of the mmap'ed addresses coupled with its length. I don't think the OS will actually load anything into memory until you reference an address within that mmap'ed range. And even then, only those disk blocks that correspond with the referenced addresses will actually be loaded -- this is the point of virtual memory, after all.

T

--
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
Aug 03 2013
On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?
> I'm currently doing this:
> auto file = new MmFile("file.dat");
> ubyte[] buffer = cast(ubyte[])file[];
> buffer.read!uint(); //etc.
> Is this how you would recommend?

You will need to slice only the size of the data you want, otherwise you're effectively doing std.file.read(). It doesn't need to be a single value (as in the example); it could be a block of data which is then individually parsed for the pieces.

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[])file[indexInFile .. indexInFile + uint.sizeof];
    indexInFile += uint.sizeof;
    buffer.read!uint();
    // etc.

The only way I'm seeing to advance through the file is to keep an index of where you're currently reading from. This actually works perfectly for the FileRange I mentioned in the previous post. I'm not familiar with how MmFile manages its memory, though; hopefully there isn't buffer reuse, or a stored slice could be overwritten (not an issue for value data, but it is for string data).
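A sketch of that index-keeping pattern wrapped in a small helper (the MmReader name and layout are made up for illustration, not Phobos API; little-endian data is assumed):

    import std.bitmanip : peek;
    import std.mmfile : MmFile;
    import std.stdio : writeln;
    import std.system : Endian;

    // Sequential reader over an MmFile: slices only what it needs per value
    // and advances an offset, so the whole file is never touched at once.
    struct MmReader
    {
        MmFile file;
        ulong offset;

        T next(T)()
        {
            auto raw = cast(ubyte[]) file[offset .. offset + T.sizeof];
            offset += T.sizeof;
            return raw.peek!(T, Endian.littleEndian)(0);
        }
    }

    void main()
    {
        auto r = MmReader(new MmFile("file.dat")); // placeholder file name
        auto a = r.next!uint();
        auto b = r.next!ushort();
        writeln(a, " ", b);
    }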
Aug 05 2013
On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote:
[...]
> The only way I'm seeing to advance through the file is to keep an index on where you're currently reading from. This actually works perfect for the FileRange I mentioned in the previous post. Though I'm not familiar with how mmfile manages its memory, but hopefully there isn't buffer reuse or storing the slice could be overridden (not an issue for value data, but string data).

I don't know about D's Mmfile, but AFAIK, it maps directly to the OS mmap(), which basically maps a portion of your program's address space to the data on the disk. Meaning that the memory is managed by the OS, and addresses will not change from under you. In the underlying physical memory, pages may get swapped out and reused, but this is invisible to your program, since referencing them will cause the OS to swap the pages back in, so you'll never end up with invalid pointers. The worst that could happen is the I/O performance hit associated with swapping. Such is the utility of virtual memory.

T

--
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
Aug 05 2013
On Monday, August 05, 2013 23:04:58 H. S. Teoh wrote:
> On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote:
> [...]
>> The only way I'm seeing to advance through the file is to keep an index on where you're currently reading from. This actually works perfect for the FileRange I mentioned in the previous post. Though I'm not familiar with how mmfile manages its memory, but hopefully there isn't buffer reuse or storing the slice could be overridden (not an issue for value data, but string data).
> I don't know about D's Mmfile, but AFAIK, it maps directly to the OS mmap(), which basically maps a portion of your program's address space to the data on the disk. Meaning that the memory is managed by the OS, and addresses will not change from under you. In the underlying physical memory, pages may get swapped out and reused, but this is invisible to your program, since referencing them will cause the OS to swap the pages back in, so you'll never end up with invalid pointers. The worst that could happen is the I/O performance hit associated with swapping. Such is the utility of virtual memory.

mmap is awesome. It makes handling large files _way_ easier, especially when you have to worry about performance. It was a huge performance boost for one of our video recorder programs where I work when we switched to using mmap on it (this device is recording multiple video streams from cameras 24/7, and performance is critical). Trying to do what mmap does on your own is incredibly bug-prone and bound to be worse for performance (since you're doing it instead of the kernel). One of our older products tries to do it on its own (probably because the developers didn't know about mmap), and it's a royal mess.

- Jonathan M Davis
Aug 05 2013
On Saturday, August 03, 2013 20:23:55 Gary Willoughby wrote:
> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?
> I'm currently doing this:
> auto file = new MmFile("file.dat");
> ubyte[] buffer = cast(ubyte[])file[];
> buffer.read!uint(); //etc.
> Is this how you would recommend?

Yeah. That's how you do it.

- Jonathan M Davis
Aug 03 2013
On Saturday, August 03, 2013 14:31:16 H. S. Teoh wrote:
> On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
>> On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
>>> On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
>>>> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>>>>> This sounds a great idea but once the file has been opened as a MmFile how to i convert this to a ubyte[] so the std.bitmanip functions work with it?
>>>> I'm currently doing this:
>>>> auto file = new MmFile("file.dat");
>>>> ubyte[] buffer = cast(ubyte[])file[];
>>>> buffer.read!uint(); //etc.
>>>> Is this how you would recommend?
>>> That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file in to memory.
>> Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.
> [...]
> I think he meant that the OS will have to load the entire file into memory if you sliced the mmap'ed file, not that you'll copy all the data. I'm not certain this is true, though, because slicing as I understand it only returns the address of the start of the mmap'ed addresses coupled with its length. I don't think the OS will actually load anything into memory until you reference an address within that mmap'ed range. And even then, only those disk blocks that correspond with the referenced addresses will actually be loaded -- this is the point of virtual memory, after all.

That's what I thought that mmap did, but it's not something that I've studied in detail. Aside from that though, my main complaint about MmFile is the fact that it's a class when it really should be a reference-counted struct. At some point, we should probably create MMFile or somesuch which _is_ a reference-counted struct and then deprecate MmFile. But if we do that, then we should be sure of whatever other changes the implementation needs and do those with it.

- Jonathan M Davis
Aug 03 2013