
digitalmars.D - Read and write gzip files easily.

"Kamil Slowikowski" <kslowikowski gmail.com> writes:
Hi there, I'm new to D and have a lot of learning ahead of me. It would
be extremely helpful to me if someone with D experience could show me
some code examples.

I'd like to neatly read and write gzipped files for my work. I have read
several threads on these forums on the topic of std.zlib or std.zip and
I haven't been able to figure it out.

Here's a Python script that does what I want. Can you please show me
an example in D that does the same thing?

<code>


import gzip


with gzip.open("input.gz") as stream:
     for line in stream:
         print line


with gzip.open("output.gz", "w") as stream:
     stream.write("some output goes here\n")
</code>


I have a second request. I would like to start using D more in my work,
and in particular I would like to use and extend the BioD library. Artem
Tarasov made a nice module to handle BGZF, and I would like to see an
example like my Python code above using Artem's module.

Read more about BGZF:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

BioD:
https://github.com/biod/BioD/blob/d2bea0a0da63eb820fcf11ae367456b2c367ec04/bio/core/bgzf/compress.d
Feb 19 2014
Artem Tarasov <lomereiter gmail.com> writes:
Wow, that's unexpected :)

Unfortunately, there's no standard module for processing gzip/bz2. The
former can be dealt with using etc.c.zlib, but there's no convenient
interface for working with file as a stream. Thus, the easiest way that I
know of is as follows:

import std.stdio, std.process;
auto pipe = pipeShell("gunzip -c " ~ filename); // replace with pigz if you wish
File input = pipe.stdout;
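
A slightly fuller sketch of the same idea, assuming `gunzip` is available on the PATH. The helper name `readGzipLines` is made up for illustration; `pipeShell`, `escapeShellFileName` and `wait` come from std.process.

```d
import std.process : escapeShellFileName, pipeShell, Redirect, wait;
import std.stdio : writeln;

/// Hypothetical helper: collect the lines of a gzipped text file by piping
/// it through the external `gunzip` program (assumed to be on the PATH).
string[] readGzipLines(string filename)
{
    // Quote the file name so spaces and shell metacharacters are harmless.
    auto pipes = pipeShell("gunzip -c " ~ escapeShellFileName(filename),
                           Redirect.stdout);
    string[] lines;
    foreach (line; pipes.stdout.byLine)
        lines ~= line.idup; // byLine reuses its buffer, so copy each line

    // Reap the child; a non-zero exit status means gunzip reported an error.
    if (wait(pipes.pid) != 0)
        throw new Exception("gunzip failed on " ~ filename);
    return lines;
}

void main(string[] args)
{
    if (args.length == 2)
        foreach (line; readGzipLines(args[1]))
            writeln(line);
}
```

Collecting the lines into an array keeps the sketch short; for large files it would be more natural to process each line inside the loop.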

Regarding your second request, this forum is not an appropriate place to
provide usage examples for a library, so that will go into a private e-mail.


On Wed, Feb 19, 2014 at 7:51 PM, Kamil Slowikowski <kslowikowski gmail.com> wrote:

 I have a second request. I would like to start using D more in my work,
 and in particular I would like to use and extend the BioD library. Artem
 Tarasov made a nice module to handle BGZF, and I would like to see an
 example like my Python code above using Artem's module.
Feb 19 2014
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 19 February 2014 at 16:27:32 UTC, Artem Tarasov wrote:
 Unfortunately, there's no standard module for processing gzip/bz2.

std.zlib handles gzip, but it doesn't present a file or range interface over it. This will work though:

void main() {
    import std.zlib;
    import std.stdio;

    auto uc = new UnCompress();
    foreach(chunk; File("testd.gz").byChunk(1024)) {
        auto uncompressed = uc.uncompress(chunk);
        writeln(cast(string) uncompressed);
    }

    // also look at anything left in the buffer
    writeln(cast(string) uc.flush());
}

And if you are writing, use new Compress(HeaderFormat.gzip), then call the compress method and write what it returns to the file.
Feb 19 2014
Artem Tarasov <lomereiter gmail.com> writes:
Ah, indeed. I dismissed it because it allocates on each call, and heavy GC
usage in a multithreaded app is a performance killer.

On Wed, Feb 19, 2014 at 8:36 PM, Adam D. Ruppe <destructionator gmail.com> wrote:

 std.zlib handles gzip but it doesn't present a file nor range interface
 over it.
Feb 19 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wednesday, 19 February 2014 at 16:36:29 UTC, Adam D. Ruppe wrote:
 On Wednesday, 19 February 2014 at 16:27:32 UTC, Artem Tarasov wrote:
 Unfortunately, there's no standard module for processing gzip/bz2.
 std.zlib handles gzip but it doesn't present a file nor range interface over it. This will work though:

 void main() {
     import std.zlib;
     import std.stdio;

     auto uc = new UnCompress();
     foreach(chunk; File("testd.gz").byChunk(1024)) {
         auto uncompressed = uc.uncompress(chunk);
         writeln(cast(string) uncompressed);
     }

     // also look at anything left in the buffer
     writeln(cast(string) uc.flush());
 }

Regrettably, the above code has a bug. Currently, std.zlib stores a reference to the buffer passed to it, and since byChunk reuses the buffer, the code will fail when uncompressing multiple chunks.
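
One way around the buffer reuse, as a minimal sketch: hand UnCompress a private copy of every chunk with `.dup`. The helper name `gunzipFile` is made up for illustration, and whether later Phobos releases still need the copy is not something this thread establishes; the extra `.dup` is harmless either way.

```d
import std.stdio : File, write;
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical helper: decompress a whole gzip file into memory.
string gunzipFile(string path)
{
    auto uc = new UnCompress(HeaderFormat.gzip);
    ubyte[] result;
    foreach (chunk; File(path, "rb").byChunk(4096))
    {
        // byChunk reuses one buffer, so give UnCompress its own copy.
        result ~= cast(ubyte[]) uc.uncompress(chunk.dup);
    }
    // Collect anything still buffered inside the decompressor.
    result ~= cast(ubyte[]) uc.flush();
    return cast(string) result;
}

void main(string[] args)
{
    if (args.length == 2)
        write(gunzipFile(args[1]));
}
```

The price is one allocation per chunk, which is the GC cost Artem mentions above.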
Feb 19 2014
"Kamil Slowikowski" <kslowikowski gmail.com> writes:
On Wednesday, 19 February 2014 at 16:36:29 UTC, Adam D. Ruppe wrote:
 And if you are writing, use new Compress(HeaderFormat.gzip)
 then call the compress method and write what it returns to the
 file.

I successfully read and printed the contents of a gzipped file, but the documentation is too sparse for me to figure out why I can't write a gzipped file. I'd appreciate any tips.

Here's the output:

- - -
$ echo -e "hi there\nhere's some text in a file\n-K" | gzip > test.gz
$ zcat test.gz
hi there
here's some text in a file
-K
$ ./zfile.d test.gz out.gz
hi there
here's some text in a file
-K
$ zcat out.gz

gzip: out.gz: unexpected end of file
- - -

And the code:

- - -
// zfile.d
import std.stdio, std.stream, std.zlib, std.c.process, std.process, std.file;

void main(string[] args) {
    if (args.length != 3) {
        writefln("Usage: ./%s <file> <output>", args[0]);
        exit(0);
    }

    // Read command line arguments.
    string filename = args[1];
    string outfile = args[2];
    auto len = filename.length;

    std.file.File input;
    // Automatically decompress the file if it ends with "gz".
    if (filename[len - 2 .. len] == "gz") {
        auto pipe = pipeShell("gunzip -c " ~ filename);
        input = pipe.stdout;
    } else {
        input = std.stdio.File(filename);
    }

    // Write data to a stream in memory
    auto mem = new MemoryStream();
    string line;
    while ((line = input.readln()) !is null) {
        mem.write(line);
        // Also write the line to stdout.
        write(line);
    }

    // Put the uncompressed data into a new gz file.
    auto comp = new Compress(HeaderFormat.gzip);
    auto compressed = comp.compress(mem.data);
    //comp.flush(); // Does not fix the problem.

    // See the raw compressed bytes.
    //writeln(cast(ubyte[])compressed);

    // Write compressed output to a file.
    with (new std.stream.File(outfile, FileMode.OutNew)) {
        writeExact(compressed.ptr, compressed.length);
        //write(cast(ubyte[])compressed); // Also does not work.
    }
}
- - -
Feb 19 2014
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 20 February 2014 at 03:58:01 UTC, Kamil Slowikowski wrote:
     auto compressed = comp.compress(mem.data);
     //comp.flush(); // Does not fix the problem.

You need to write each compressed block and the flush. So more like:

writeToFile(comp.compress(mem.data)); // loop over all the data btw
writeToFile(comp.flush());

and that should do it. flush returns the remainder of the data.
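
Spelled out as a minimal sketch with std.stdio.File: `writeGzip` is a made-up name standing in for the `writeToFile` calls, and the sample text is the line from the Python script at the top of the thread.

```d
import std.stdio : File;
import std.zlib : Compress, HeaderFormat;

/// Hypothetical helper standing in for the writeToFile calls above.
void writeGzip(string path, string[] pieces)
{
    auto comp = new Compress(HeaderFormat.gzip);
    auto f = File(path, "wb");
    foreach (piece; pieces)
    {
        // compress() may hand back little or nothing while zlib buffers input.
        f.rawWrite(cast(const(ubyte)[]) comp.compress(piece));
    }
    // flush() returns the remaining data, including the end of the gzip stream.
    f.rawWrite(cast(const(ubyte)[]) comp.flush());
}

void main(string[] args)
{
    if (args.length == 2)
        writeGzip(args[1], ["some output goes here\n"]);
}
```

Leaving out the final flush() write is what produces the "unexpected end of file" from zcat shown earlier: the output stops before the stream has been finished.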
Feb 19 2014
"Kamil Slowikowski" <kslowikowski gmail.com> writes:
On Thursday, 20 February 2014 at 04:03:45 UTC, Adam D. Ruppe wrote:
 On Thursday, 20 February 2014 at 03:58:01 UTC, Kamil Slowikowski wrote:
    auto compressed = comp.compress(mem.data);
    //comp.flush(); // Does not fix the problem.
 You need to write each compressed block and the flush. So more like:

 writeToFile(comp.compress(mem.data)); // loop over all the data btw
 writeToFile(comp.flush());

 and that should do it. flush returns the remainder of the data.

Hey Adam, thanks for the tip. Next problem: the output has strange characters, as shown:

- - -
$ ./zfile.d test.gz out.gz
hi there
here's some text in a file
-K
$ zcat out.gz
hi there
here's some text in a file
-K
$ zcat test.gz | wc -c
39
$ zcat out.gz | wc -c
63
$ zcat test.gz | hexdump
0000000 6968 7420 6568 6572 680a 7265 2765 2073
0000010 6f73 656d 7420 7865 2074 6e69 6120 6620
0000020 6c69 0a65 4b2d 000a
0000027
$ zcat out.gz | hexdump
0000000 0009 0000 0000 0000 6968 7420 6568 6572
0000010 1b0a 0000 0000 0000 6800 7265 2765 2073
0000020 6f73 656d 7420 7865 2074 6e69 6120 6620
0000030 6c69 0a65 0003 0000 0000 0000 4b2d 000a
000003f
- - -

Code:

- - -
import std.stdio, std.stream, std.zlib, std.c.process, std.process, std.file;

void main(string[] args) {
    if (args.length != 3) {
        writefln("Usage: ./%s <file> <output>", args[0]);
        exit(0);
    }

    // Read command line arguments.
    string filename = args[1];
    string outfile = args[2];
    auto len = filename.length;

    std.file.File input;
    // Automatically decompress the file if it ends with "gz".
    if (filename[len - 2 .. len] == "gz") {
        auto pipe = pipeShell("gunzip -c " ~ filename);
        input = pipe.stdout;
    } else {
        input = std.stdio.File(filename);
    }

    // Write data to a stream in memory
    auto mem = new MemoryStream();
    string line;
    while ((line = input.readln()) !is null) {
        mem.write(line);
        // Also write the line to stdout.
        write(line);
    }

    // Put the data into a new gz file.
    auto comp = new Compress(HeaderFormat.gzip);

    // See the raw compressed bytes.
    //writeln(cast(ubyte[])compressed);

    // Write compressed output to a file.
    with (new std.stream.File(outfile, FileMode.OutNew)) {
        auto compressed = comp.compress(mem.data);
        writeExact(compressed.ptr, compressed.length);
        // Get any remaining data.
        compressed = comp.flush();
        writeExact(compressed.ptr, compressed.length);
    }
}
- - -
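
The hexdump hints at where the extra bytes come from: each line is preceded by eight bytes holding its length (0x09, 0x1b and 0x03, i.e. 9, 27 and 3), and 39 + 3 × 8 = 63. That would fit std.stream's write(string) overload, which as far as I know stores the length ahead of the characters, whereas writeString writes the characters only. A minimal sketch that sidesteps the in-memory stream and copies the raw bytes straight into the compressor; `gzipCopy` is a made-up name, and the 64 KiB chunk size is an arbitrary choice.

```d
import std.stdio : File;
import std.zlib : Compress, HeaderFormat;

/// Hypothetical helper: gzip everything readable from `input` into `outfile`.
void gzipCopy(File input, string outfile)
{
    auto comp = new Compress(HeaderFormat.gzip);
    auto output = File(outfile, "wb");
    foreach (chunk; input.byChunk(64 * 1024))
    {
        // Raw bytes only: nothing is prepended to the data.
        // .dup because byChunk reuses its buffer between iterations.
        output.rawWrite(cast(const(ubyte)[]) comp.compress(chunk.dup));
    }
    // Write whatever the compressor still holds to finish the stream.
    output.rawWrite(cast(const(ubyte)[]) comp.flush());
}

void main(string[] args)
{
    if (args.length == 3)
        gzipCopy(File(args[1], "rb"), args[2]);
}
```

Because `input` is an ordinary std.stdio.File, the stdout of a `gunzip -c` pipe can be passed in the same way as a plain file.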
Feb 19 2014
"Kamil Slowikowski" <kslowikowski gmail.com> writes:
On Wednesday, 19 February 2014 at 16:27:32 UTC, Artem Tarasov wrote:
 the easiest way that I know of is as follows:

 import std.stdio, std.process;
 auto pipe = pipeShell("gunzip -c " ~ filename); // replace with pigz if you wish
 File input = pipe.stdout;

Artem, thank you! I've used a similar trick in the past with Python because calling the system's gzip or pigz in a subprocess.Pipe is faster than using the Python gzip module. I'm very glad to see how easy it is in D.

 Regarding your second request, this forum is not an appropriate place to
 provide usage examples for a library, so that will go into a private e-mail.

Thanks, again! I'm looking forward to hearing from you :)

Adam D. Ruppe: Thanks for your example! I couldn't find such an example anywhere on the web.

Craig Dillabaugh: Please feel free to move the thread, sorry for posting in the wrong place.
Feb 19 2014
"Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
  Craig Dillabaugh
 Please feel free to move the thread, sorry for posting in the 
 wrong place.
Actually, the thread can't be moved I believe, it is here forever. Not a big deal though, lots of people new to D post questions here and miss the D.learn forum, so you are not alone. Since I didn't have a good answer to your original question I decided I should let you know about D.learn.
Feb 19 2014
"Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
It is not part of the standard library, but you may want to have a look at the GzipInputStream in vibeD. http://vibed.org/api/vibe.stream.zlib/GzipInputStream
Feb 19 2014
"Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Wednesday, 19 February 2014 at 16:32:54 UTC, Craig Dillabaugh 
wrote:
 On Wednesday, 19 February 2014 at 15:51:53 UTC, Kamil 
 Slowikowski wrote:

 It is not part of the standard library, but you may want to 
 have a look at the GzipInputStream in vibeD.

 http://vibed.org/api/vibe.stream.zlib/GzipInputStream
Also meant to add, this thread belongs in the D.learn forum rather than here.
Feb 19 2014
"nazriel" <spam dzfl.pl> writes:
Hello Kamil :) Feel free to also visit the #d channel on the freenode IRC network.
Feb 19 2014
"Stephan Schiffels" <stephan_schiffels mac.com> writes:
On Wednesday, 19 February 2014 at 15:51:53 UTC, Kamil Slowikowski wrote:
 Hi there, I'm new to D and have a lot of learning ahead of me. It would
 be extremely helpful to me if someone with D experience could show me
 some code examples.

 I'd like to neatly read and write gzipped files for my work. I have read
 several threads on these forums on the topic of std.zlib or std.zip and
 I haven't been able to figure it out.

Hi Kamil,
I am glad someone has the exact same problem as I had. I actually solved this, inspired by the Python API you quoted above. I wrote these classes:
GzipInputRange, GzipByLine, and GzipOut.
Here is how I can now use them:

_____________________

import gzip;
import std.stdio;

void main() {
    auto byLine = new GzipByLine("test.gz");
    foreach(line; byLine)
        writeln(line);

    auto gzipOutFile = new GzipOut("testout.gz");
    gzipOutFile.compress("bla bla bla");
    gzipOutFile.finish();
}

That is all quite convenient and I was wondering whether something like that would be useful even in Phobos. But it's clear that for Phobos things would involve a lot more work to comply with the requirements. This so far simply served my needs and is not as generic as it could be.

Here is the code:

___________gzip.d__________________

import std.zlib;
import std.stdio;
import std.range;
import std.traits;

class GzipInputRange {
    UnCompress uncompressObj;
    File f;
    auto CHUNKSIZE = 0x4000;
    ReturnType!(f.byChunk) chunkRange;
    bool exhausted;
    char[] uncompressedBuffer;
    size_t bufferIndex;

    this(string filename) {
        f = File(filename, "r");
        chunkRange = f.byChunk(CHUNKSIZE);
        uncompressObj = new UnCompress();
        load();
    }

    void load() {
        if(!chunkRange.empty) {
            auto raw = chunkRange.front.dup;
            chunkRange.popFront();
            uncompressedBuffer = cast(char[])uncompressObj.uncompress(raw);
            bufferIndex = 0;
        }
        else {
            if(!exhausted) {
                uncompressedBuffer = cast(char[])uncompressObj.flush();
                exhausted = true;
                bufferIndex = 0;
            }
            else
                uncompressedBuffer.length = 0;
        }
    }

    @property char front() {
        return uncompressedBuffer[bufferIndex];
    }

    void popFront() {
        bufferIndex += 1;
        if(bufferIndex >= uncompressedBuffer.length) {
            load();
            bufferIndex = 0;
        }
    }

    @property bool empty() {
        return uncompressedBuffer.length == 0;
    }
}

class GzipByLine {
    GzipInputRange range;
    char[] buf;

    this(string filename) {
        this.range = new GzipInputRange(filename);
        popFront();
    }

    @property bool empty() {
        return buf.length == 0;
    }

    void popFront() {
        buf.length = 0;
        while(!range.empty && range.front != '\n') {
            buf ~= range.front;
            range.popFront();
        }
        range.popFront();
    }

    string front() {
        return buf.idup;
    }
}

class GzipOut {
    Compress compressObj;
    File f;

    this(string filename) {
        f = File(filename, "w");
        compressObj = new Compress(HeaderFormat.gzip);
    }

    void compress(string s) {
        auto compressed = compressObj.compress(s.dup);
        f.rawWrite(compressed);
    }

    void finish() {
        auto compressed = compressObj.flush();
        f.rawWrite(compressed);
    }
}
Feb 20 2014
"Kamil Slowikowski" <kslowikowski gmail.com> writes:
On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels 
wrote:
 Hi Kamil,
 I am glad someone has the exact same problem as I had. I 
 actually solved this, inspired by the python API you quoted 
 above. I wrote these classes:
 GzipInputRange, GzipByLine, and GzipOut.
Stephan, awesome! Thank you very much for sharing your classes. It's nice to see how you've approached this problem. Your code is very clear and easy to understand (for me). Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.
Feb 20 2014
Artem Tarasov <lomereiter gmail.com> writes:
On Thu, Feb 20, 2014 at 9:05 PM, Kamil Slowikowski <kslowikowski gmail.com> wrote:

 Also, I now see the error in my code: I believe I should use rawWrite to
 write compressed data and not writeExact.

That's not an error; those are two different ways to access files: std.stream.File and std.stdio.File. The latter is the recommended one.
Feb 20 2014
"Stephan Schiffels" <stephan_schiffels mac.com> writes:
On Thursday, 20 February 2014 at 17:05:37 UTC, Kamil Slowikowski
wrote:
 On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan 
 Schiffels wrote:
 Hi Kamil,
 I am glad someone has the exact same problem as I had. I 
 actually solved this, inspired by the python API you quoted 
 above. I wrote these classes:
 GzipInputRange, GzipByLine, and GzipOut.
 Stephan, awesome! Thank you very much for sharing your classes. It's nice to see how you've approached this problem. Your code is very clear and easy to understand (for me). Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.

You're welcome. If you manage to put GzipOut.finish() into the destructor of the class to automatically flush the file upon destruction of the object, let me know. I tried this and it gives a SegFault… I was too lazy to try to understand it but I am sure it must be in principle possible.

Stephan
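
A possible explanation, offered as a guess rather than a diagnosis: a class destructor is run by the garbage collector in no particular order, so by the time it runs, the Compress object it refers to may already have been finalized. A struct is destroyed deterministically when it goes out of scope, which avoids that. A minimal sketch under that assumption; `GzipWriter` and `put` are made-up names, not part of the classes above.

```d
import std.stdio : File;
import std.zlib : Compress, HeaderFormat;

/// Hypothetical struct variant of GzipOut that finishes itself at scope exit.
struct GzipWriter
{
    private Compress comp;
    private File f;
    private bool finished;

    @disable this(this); // no copies, so the stream is finished only once

    this(string filename)
    {
        f = File(filename, "wb");
        comp = new Compress(HeaderFormat.gzip);
    }

    /// Compress one piece of text and write the result to the file.
    void put(const(char)[] s)
    {
        f.rawWrite(cast(const(ubyte)[]) comp.compress(s.dup));
    }

    /// Write the remaining compressed data and close the file.
    void finish()
    {
        if (finished || comp is null)
            return;
        finished = true;
        f.rawWrite(cast(const(ubyte)[]) comp.flush());
        f.close();
    }

    ~this()
    {
        finish(); // runs when the struct leaves scope, not during a GC cycle
    }
}

void main(string[] args)
{
    if (args.length == 2)
    {
        auto w = GzipWriter(args[1]);
        w.put("bla bla bla\n");
    } // the destructor finishes and closes the file here
}
```

Calling finish() explicitly still works; the destructor then does nothing.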
Feb 20 2014
"Per Nordlöw" <per.nordlow gmail.com> writes:
On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
 Hi Kamil,
 I am glad someone has the exact same problem as I had. I
 actually solved this, inspired by the python API you quoted
 above. I wrote these classes:
 GzipInputRange, GzipByLine, and GzipOut.
 Here is how I can now use them:

I've polished your module a bit at:

https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d

Reflections:

- Performance is terrible even with -release -noboundscheck -unittest. About 20 times slower than zcat $F | wc -l. I'm guessing

  _chunkRange.front.dup

  slows things down. I tried removing the .dup but then I get

  std.zlib.ZlibException std/zlib.d(59): data error

  I don't believe we should have to do a copy of _chunkRange.front but I can't figure out how to solve it. Does anybody understand how to fix this?

- Shouldn't GzipOut.finish() call this.close()? Otherwise the file remains unflushed.

- And what about calling this.close() in GzipOut.~this()? Is that needed too?
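
On the data error: Vladimir's note earlier in the thread (std.zlib keeps a reference to the buffer it is given, while byChunk reuses that buffer) would explain why dropping the .dup fails. Another candidate for the slowdown is the character-at-a-time front/popFront in GzipInputRange. A sketch that keeps the per-chunk copy but splits lines on whole decompressed blocks; `eachGzipLine` is a made-up name, and nothing here has been benchmarked.

```d
import std.stdio : File, writeln;
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical helper: call `sink` once per line of a gzipped text file.
void eachGzipLine(string path, scope void delegate(const(char)[]) sink)
{
    auto uc = new UnCompress(HeaderFormat.gzip);
    char[] pending; // text after the last newline seen so far

    // Split one decompressed block into lines, keeping any unfinished tail.
    void feed(const(void)[] block)
    {
        pending ~= cast(const(char)[]) block;
        size_t start = 0;
        foreach (i, c; pending)
        {
            if (c == '\n')
            {
                sink(pending[start .. i]);
                start = i + 1;
            }
        }
        pending = pending[start .. $].dup;
    }

    foreach (chunk; File(path, "rb").byChunk(0x4000))
        feed(uc.uncompress(chunk.dup)); // copy: byChunk reuses its buffer
    feed(uc.flush());

    if (pending.length)
        sink(pending); // last line without a trailing newline
}

void main(string[] args)
{
    if (args.length == 2)
        eachGzipLine(args[1], (const(char)[] line) { writeln(line); });
}
```

The slice handed to `sink` is only valid during the call, so a caller that wants to keep a line has to copy it.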
May 03 2015
"Per Nordlöw" <per.nordlow gmail.com> writes:
 I've polished your module a bit at:

 https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d

Latest at

https://github.com/nordlow/justd/blob/master/zio.d
May 03 2015
Nordlöw <per.nordlow gmail.com> writes:
On Sunday, 3 May 2015 at 14:35:49 UTC, Per Nordlöw wrote:
 Latest at

 https://github.com/nordlow/justd/blob/master/zio.d
Should be https://github.com/nordlow/phobos-next/blob/master/src/zio.d
May 11 2017
Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
And there is Zipios++

http://zipios.sourceforge.net/


May 03 2015