digitalmars.D.learn - parallel unzip in progress
- Jay Norwood (20/20) Apr 02 2012 I'm working on a parallel unzip. I started with phobos std.zip,
- Jay Norwood (21/21) Apr 03 2012 On Tuesday, 3 April 2012 at 05:27:08 UTC, Jay Norwood wrote:
- dennis luehring (4/25) Apr 04 2012 great idea if it's ok that the directory structure is in an inconsistent
- Jay Norwood (10/20) Apr 04 2012 Yes, I was concerned about race conditions using recursive
- Jay Norwood (14/15) Apr 04 2012 I decided to try the option where the data is stored in the zip
- Jay Norwood (18/24) Apr 07 2012 btw, I posted a fix to setTimes that enables it to update the
I'm working on a parallel unzip. I started with phobos std.zip, but found it to be too monolithic. I needed to separate out the tasks that get the directory entries, create the directory tree, get the compressed data, expand the data, and create the uncompressed files on disk. It currently unzips a 2GB directory structure in about 18 secs, while 7zip takes around 55 secs. Only about 4 seconds of this is the creation of the directory structure and the expanding; the other 14 secs is writing the regular files.

The subtasks needed to be separated not only because of the need to run them in parallel, but also because the current std.zip implementation is a memory hog, keeping the whole compressed and expanded data sections in memory. I was running out of memory in a 32-bit application just attempting to unzip the test file with the std.zip operations. The parallel version peaks at around 150MB of memory used during the operation.

The parallel version is still missing the operation of restoring the original file attributes, and I see no example in the documents of what would normally be done. Am I missing this somewhere? I'll have to dig around...
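A rough sketch of that task split, using plain std.zip and std.parallelism. This is an illustration, not the posted code: unlike the posted version it still reads the whole archive into memory the way std.zip does, and std.zip's expand() isn't documented as thread-safe, so treat it as a shape rather than a drop-in.

    import std.algorithm : endsWith;
    import std.file;
    import std.parallelism : parallel;
    import std.path : dirName;
    import std.zip;

    void parallelUnzip(string zipName, string destName)
    {
        // Read the central directory (this still loads the whole archive
        // into memory, as std.zip does; the posted version avoids that).
        auto zip = new ZipArchive(std.file.read(zipName));

        // Create the full directory tree serially, so the parallel phase
        // never races on mkdir.
        foreach (name; zip.directory.byKey)
        {
            if (name.endsWith("/"))
                mkdirRecurse(destName ~ name);
            else
                mkdirRecurse(dirName(destName ~ name));
        }

        // Expand and write regular files in parallel.
        foreach (am; parallel(zip.directory.values))
        {
            if (am.name.endsWith("/"))
                continue;
            std.file.write(destName ~ am.name, zip.expand(am));
            am.expandedData = null;   // release each member's data promptly
        }
    }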
Apr 02 2012
On Tuesday, 3 April 2012 at 05:27:08 UTC, Jay Norwood wrote:
> ..........

So, to answer my own questions ... I placed the code below in a taskpool parallel foreach loop, where each am is an archive member. It is expanded, and the expanded data is written to a file. The original time info is converted to SysTime using a localTime() tz timezone calculated outside the loop. Then the setTimes call updates the file timestamp in Windows. I just used the same system time st value for modification and access times.

    expand2(am);
    string destFilename = destName ~ am.name;
    std.file.write(destFilename, am.expandedData);
    SysTime st = DosFileTimeToSysTime(am.time, tz);
    std.file.setTimes(destFilename, st, st);

The whole unzip, including restore of file modification times, completed in under 17 seconds for this 2GB expanded directory structure with some 39K files. 7zip took 55 seconds. This particular loop currently excludes restoring times on directory entries, but I suppose I can restore the directory times after all the files have been expanded into the directory structure.
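The enclosing loop might look roughly like this (a sketch for illustration; expand2() and destName are the names used above, passed in here as parameters so the fragment stands on its own):

    import std.datetime;
    import std.file;
    import std.parallelism;
    import std.zip : ArchiveMember;

    void expandAndStamp(ArchiveMember[] members, string destName,
                        void delegate(ArchiveMember) expand2)
    {
        immutable tz = LocalTime();   // timezone computed once, outside the loop
        foreach (am; taskPool.parallel(members))      // regular files only
        {
            expand2(am);                              // inflate this member
            string destFilename = destName ~ am.name;
            std.file.write(destFilename, am.expandedData);
            SysTime st = DosFileTimeToSysTime(am.time, tz);
            std.file.setTimes(destFilename, st, st);  // same st for both times
        }
    }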
Apr 03 2012
On 04.04.2012 08:31, Jay Norwood wrote:
> This particular loop currently excludes restoring times on directory
> entries, but I suppose I can restore the directory times after all the
> files have been expanded into the directory structure.

Great idea, if it's OK that the directory structure is in an inconsistent state on error. That's the main problem with parallel stuff: it's faster if everything works OK, but it's hard to control the state if it doesn't.
Apr 04 2012
On Wednesday, 4 April 2012 at 07:25:25 UTC, dennis luehring wrote:
> On 04.04.2012 08:31, Jay Norwood wrote:
>> This particular loop currently excludes restoring times on directory
>> entries, but I suppose I can restore the directory times after all the
>> files have been expanded into the directory structure.
>
> Great idea, if it's OK that the directory structure is in an inconsistent
> state on error. That's the main problem with parallel stuff: it's faster
> if everything works OK, but it's hard to control the state if it doesn't.

Yes, I was concerned about race conditions using recursive mkdir in parallel, so I have avoided that problem as a rule and concentrated on things that have a better return.

Unfortunately, std.file.setTimes doesn't currently work for folders, and I don't see an alternative in the library docs that can be used to restore times for folders. The setTimes call throws an exception. It could just be an oversight in the implementation, since I don't see any comments in the docs that restrict its use to regular files.
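For what it's worth, the usual Win32 workaround is to open the directory with FILE_FLAG_BACKUP_SEMANTICS, which is what CreateFileW requires for directory handles. A minimal sketch, not the actual Phobos patch (SysTimeToFILETIME is the Windows-only conversion helper in std.datetime):

    version (Windows)
    {
        import core.sys.windows.windows;
        import std.datetime : SysTime, SysTimeToFILETIME;
        import std.utf : toUTF16z;

        // Set access/modification times on a directory, which
        // std.file.setTimes rejects here. FILE_FLAG_BACKUP_SEMANTICS is
        // what lets CreateFileW open a directory handle.
        void setDirTimes(string dir, SysTime accessTime, SysTime modificationTime)
        {
            auto h = CreateFileW(dir.toUTF16z, GENERIC_WRITE,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE, null,
                                 OPEN_EXISTING,
                                 FILE_ATTRIBUTE_NORMAL | FILE_FLAG_BACKUP_SEMANTICS,
                                 null);
            if (h == INVALID_HANDLE_VALUE)
                throw new Exception("could not open " ~ dir);
            scope (exit) CloseHandle(h);

            FILETIME at = SysTimeToFILETIME(accessTime);
            FILETIME mt = SysTimeToFILETIME(modificationTime);
            if (!SetFileTime(h, null, &at, &mt))
                throw new Exception("could not set times on " ~ dir);
        }
    }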
Apr 04 2012
On Wednesday, 4 April 2012 at 07:39:56 UTC, Jay Norwood wrote:
> On Wednesday, 4 April 2012 at 07:25:25 UTC, dennis luehring wrote:
> ..........

I decided to try the option where the data is stored in the zip file uncompressed. Since the folder is just over 2GB, I ran into the stdio File problems with being unable to properly return the size of the file. I see bugs posted on this that go back a few years. Kind of basic problems.

The work-around was to convert all the file operations to use std.stream equivalents, and that worked well, but I see in the bug reports that even that was only working correctly on Windows. So I'm on Windows, and that's OK for me, but it would be too bad to limit use to Windows. Seems like stdio runtime support for File operations above 2GB would be a basic expectation for a "system" language these days.
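A sketch of that workaround with the std.stream API of the time (since deprecated and removed from Phobos; "archive.zip" and the 1024-byte tail read are just illustrative):

    import std.stream;

    void main()
    {
        // The old std.stream File deals in ulong offsets, so the full
        // 64-bit size is available where std.stdio was failing past 2GB.
        auto f = new std.stream.File("archive.zip", FileMode.In);
        scope (exit) f.close();

        ulong fileSize = f.seek(0, SeekPos.End);   // full 64-bit size

        // Position near the end, e.g. to scan for the zip
        // end-of-central-directory record.
        f.seek(cast(long)(fileSize - 1024), SeekPos.Set);
        auto buf = new ubyte[1024];
        f.readExact(buf.ptr, buf.length);
    }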
Apr 04 2012
On Wednesday, 4 April 2012 at 19:41:21 UTC, Jay Norwood wrote:
> The work-around was to convert all the file operations to use std.stream
> equivalents, and that worked well, but I see in the bug reports that even
> that was only working correctly on Windows. So I'm on Windows, and that's
> OK for me, but it would be too bad to limit use to Windows. Seems like
> stdio runtime support for File operations above 2GB would be a basic
> expectation for a "system" language these days.

btw, I posted a fix to setTimes that enables it to update the timestamp on directories as well as regular files, along with the source code of this example.

I also did some research on why ntfs is such a dog when doing delete operations on hard drives, as well as spending several hours looking at procmon logs, and have decided that the problem is primarily related to multiple accesses to the master file table file for the larger files. There is much discussion on the matter of the MFT getting fragmented on these larger drives, and a couple of interesting proposed tweaks in the second link.

http://ixbtlabs.com/articles/ntfs/index3.html
http://www.gilsmethod.com/speed-up-vista-with-these-simple-ntfs-tweaks

The second link shows how to reserve a larger area for the MFT, and the link below looks like it might be able to clean out any files from the reserved MFT space.

http://www.mydefrag.com/index.html
Apr 07 2012