
digitalmars.D.learn - parallel unzip in progress

reply "Jay Norwood" <jayn prismnet.com> writes:
I'm working on a parallel unzip.  I started with phobos std.zip, 
but found it to be too monolithic.  I needed to separate out 
the tasks that get the directory entries, create the directory 
tree, get the compressed data, expand the data, and create the 
uncompressed files on disk.  It currently unzips a 2GB directory 
structure in about 18 secs while 7zip takes around 55 secs. Only 
about 4 seconds of this is the creation of the directory 
structure and the expanding.  The other 14 secs is writing the 
regular files.
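A minimal sketch of that staged layout, for the curious. ArchiveMember is 
the std.zip type; "expand2" stands in for my split-out inflate step (not a 
real Phobos API), so it's taken here as a delegate:

```d
import std.file : mkdirRecurse, write;
import std.parallelism : taskPool;
import std.path : dirName;
import std.zip : ArchiveMember;

void parallelUnzip(ArchiveMember[] members, string destName,
                   void delegate(ArchiveMember) expand2)
{
    // Stage 1 (serial): build the whole directory tree up front, so the
    // parallel loop never races on mkdir.
    foreach (am; members)
        mkdirRecurse(dirName(destName ~ am.name));

    // Stage 2 (parallel): inflate each member and write it to disk.
    foreach (am; taskPool.parallel(members))
    {
        expand2(am);                          // fills am.expandedData
        write(destName ~ am.name, am.expandedData);
        am.expandedData = null;               // keep peak memory low
    }
}
```

Expanding one member at a time and dropping its buffer right after the 
write is what keeps the peak well under the whole-archive-in-memory cost.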

The subtasks needed to be separated not only because of the need 
to run them in parallel, but also because the current std.zip 
implementation is a memory hog, keeping the whole compressed and 
expanded data sections in memory. I was running out of memory in 
a 32 bit application just attempting to unzip the test file with 
the std.zip operations.  The parallel version peaks at around 
150MB memory used during the operation.


The parallel version is still missing the operation of restoring 
the original file attributes, and I see no example in the 
documentation of what would normally be done.  Am I missing this 
somewhere? I'll have to dig around...
Apr 02 2012
parent reply "Jay Norwood" <jayn prismnet.com> writes:
On Tuesday, 3 April 2012 at 05:27:08 UTC, Jay Norwood wrote:

..........
So, to answer my own questions ... I placed the code below in a 
taskpool parallel foreach loop, where each am is an archive 
member.  It is expanded, and the expanded data is written to a 
file.  The original time info is converted to SysTime using a 
localTime() tz timezone calculated outside the loop.  Then the 
setTimes call updates the file timestamp in Windows.  I just used 
the same system time st value for modification and access times.

expand2(am);                                    // decompress this member
string destFilename = destName ~ am.name;
std.file.write(destFilename, am.expandedData);  // write expanded bytes to disk
SysTime st = DosFileTimeToSysTime(am.time, tz); // DOS timestamp -> SysTime
std.file.setTimes(destFilename, st, st);        // access time = mod time = st

The whole unzip, including restore of file modification times,  
completed in under 17 seconds for this 2GB expanded directory 
structure with some 39K files.  7zip took 55 seconds.


This particular loop is currently excluding restore of times on 
directory entries, but I suppose I can restore the directory 
times after all the files have been expanded into the directory 
structure.
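The deferred second pass could look roughly like this (a sketch; it assumes 
a setTimes that accepts folders, which turns out to be its own problem):

```d
import std.algorithm : endsWith;
import std.datetime : DosFileTimeToSysTime, SysTime, TimeZone;
import std.file : setTimes;
import std.zip : ArchiveMember;

// Run only after every regular file has been written, so that writing
// files into a folder can no longer bump the folder's timestamp.
void restoreDirTimes(ArchiveMember[] members, string destName,
                     immutable TimeZone tz)
{
    foreach (am; members)
    {
        if (!am.name.endsWith("/"))  // zip marks directories by a trailing slash
            continue;
        SysTime st = DosFileTimeToSysTime(am.time, tz);
        setTimes(destName ~ am.name, st, st);
    }
}
```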
Apr 03 2012
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 04.04.2012 08:31, schrieb Jay Norwood:
 On Tuesday, 3 April 2012 at 05:27:08 UTC, Jay Norwood wrote:

 ..........
 So, to answer my own questions ... I placed the code below in a
 taskpool parallel foreach loop, where each am is an archive
 member.  It is expanded,  and the expanded data is written to a
 file.  The original time info is coverted to SysTime using a
 localTime() tz timezone calculated outside the loop.  Then the
 setTimes call updates the file timestamp in Windows.  I just used
 the same system time st value for modification and access times.

 expand2(am);
 string destFilename = destName ~ am.name;
 std.file.write(destFilename,am.expandedData);
 SysTime st = DosFileTimeToSysTime(am.time, tz);
 std.file.setTimes(destFilename, st, st);

 The whole unzip, including restore of file modification times,
 completed in under 17 seconds for this 2GB expanded directory
 structure with some 39K files.  7zip took 55 seconds.


 This particular loop is currently excluding restore of times on
 directory entries, but I suppose I can restore the directory
 times after all the files have been expanded into the directory
 structure.
Great idea, if it's okay that the directory structure is in an inconsistent state on error. That's the main problem with parallel stuff: it's faster if everything works ok, but it's hard to control the state if not.
Apr 04 2012
parent reply "Jay Norwood" <jayn prismnet.com> writes:
On Wednesday, 4 April 2012 at 07:25:25 UTC, dennis luehring wrote:
 Am 04.04.2012 08:31, schrieb Jay Norwood:
 This particular loop is currently excluding restore of times on
 directory entries, but I suppose I can restore the directory
 times after all the files have been expanded into the directory
 structure.
Great idea, if it's okay that the directory structure is in an inconsistent state on error. That's the main problem with parallel stuff: it's faster if everything works ok, but it's hard to control the state if not.
Yes, I was concerned about race conditions using recursive mkdir in parallel, so I have avoided that problem as a rule and concentrated on things that have a better return. Unfortunately, std.file.setTimes doesn't currently work for folders, and I don't see an alternative in the library docs that can be used to restore times for folders; the setTimes call throws an exception. It could just be an oversight in the implementation, since I don't see any comments in the docs that restrict its use to regular files.
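For what it's worth, on Windows the underlying limitation is that CreateFile 
refuses to open a directory handle unless FILE_FLAG_BACKUP_SEMANTICS is 
passed. A folder-capable workaround could look roughly like this sketch 
(using std.datetime's Windows-only SysTimeToFILETIME):

```d
version (Windows)
{
    import core.sys.windows.windows;
    import std.datetime : SysTime, SysTimeToFILETIME;
    import std.utf : toUTF16z;

    void setDirTimes(string dir, SysTime accessTime, SysTime modificationTime)
    {
        // FILE_FLAG_BACKUP_SEMANTICS is the flag that makes CreateFileW
        // accept a directory; without it the open fails.
        auto h = CreateFileW(toUTF16z(dir), GENERIC_WRITE,
                FILE_SHARE_READ | FILE_SHARE_WRITE, null,
                OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, null);
        if (h == INVALID_HANDLE_VALUE)
            throw new Exception("could not open directory: " ~ dir);
        scope (exit) CloseHandle(h);

        FILETIME a = SysTimeToFILETIME(accessTime);
        FILETIME m = SysTimeToFILETIME(modificationTime);
        if (!SetFileTime(h, null, &a, &m))  // null: leave creation time alone
            throw new Exception("SetFileTime failed: " ~ dir);
    }
}
```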
Apr 04 2012
parent reply "Jay Norwood" <jayn prismnet.com> writes:
On Wednesday, 4 April 2012 at 07:39:56 UTC, Jay Norwood wrote:
 On Wednesday, 4 April 2012 at 07:25:25 UTC, dennis luehring
I decided to try the option where the data is stored in the zip file uncompressed. Since the folder is just over 2GB, I ran into the std.stdio File problems with being unable to properly return the size of the file. I see bugs posted on this that go back a few years. Kind of basic problems. The work-around was to convert all the file operations to use std.stream equivalents, and that worked well, but I see in the bug reports that even that was only working correctly on Windows. I'm on Windows, so it's ok for me, but it would be too bad to limit use to Windows. Seems like stdio runtime support for File operations above 2GB would be a basic expectation for a "system" language these days.
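As a sketch of the kind of std.stream substitution I mean: its seek 
position is a ulong, so a size above 2GB survives even in a 32-bit build 
where the std.stdio paths choked.

```d
import std.stream : BufferedFile, FileMode, SeekPos;

// Report a file's size via std.stream, which tracks the file position
// as a 64-bit value even in a 32-bit application.
ulong streamFileSize(string name)
{
    auto f = new BufferedFile(name, FileMode.In);
    scope (exit) f.close();
    return f.seek(0, SeekPos.End);  // position after seeking to end = size
}
```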
Apr 04 2012
parent "Jay Norwood" <jayn prismnet.com> writes:
On Wednesday, 4 April 2012 at 19:41:21 UTC, Jay Norwood wrote:
  > The work-around was to convert all the file operations to use
 std.stream equivalents, and that worked well, but I see in the 
 bug reports that even that was only working correctly on 
 windows.  So I'm on windows, and ok for me, but it would be too 
 bad to limit use to Windows.

 Seems like stdio runtime support for File operations above 2GB 
 would be a basic expectation for a "system" language these days.
btw, I posted a fix to setTimes that enables it to update the timestamp on directories as well as regular files, along with the source code of this example.

I also did some research on why ntfs is such a dog when doing delete operations on hard drives, as well as spending several hours looking at procmon logs, and have decided that the problem is primarily related to multiple accesses in the master file table file for the larger files. There is much discussion on the matter of the MFT getting fragmented on these larger drives, and a couple of interesting proposed tweaks in the second link.

http://ixbtlabs.com/articles/ntfs/index3.html
http://www.gilsmethod.com/speed-up-vista-with-these-simple-ntfs-tweaks

The second link shows you how to reserve a larger area for MFT, and the link below looks like it might be able to clean out any files from the reserved MFT spaces.

http://www.mydefrag.com/index.html
Apr 07 2012