digitalmars.D - Read and write gzip files easily.
- Kamil Slowikowski (34/34) Feb 19 2014 Hi there, I'm new to D and have a lot of learning ahead of me. It
- Artem Tarasov (13/17) Feb 19 2014 Wow, that's unexpected :)
- Adam D. Ruppe (18/20) Feb 19 2014 std.zlib handles gzip but it doesn't present a file nor range
- Artem Tarasov (3/5) Feb 19 2014 Ah, indeed. I dismissed it because it allocates on each call, and heavy ...
- Vladimir Panteleev (5/23) Feb 19 2014 Regrettably, the above code has a bug. Currently, std.zlib stores
- Kamil Slowikowski (71/74) Feb 19 2014 I successfully read and printed the contents of a gzipped file,
- Adam D. Ruppe (9/11) Feb 19 2014 You need to write each compressed block and the flush. So more
- Kamil Slowikowski (79/90) Feb 19 2014 Hey Adam, thanks for the tip.
- Kamil Slowikowski (13/24) Feb 19 2014 Artem, thank you! I've used a similar trick in the past with
- Craig Dillabaugh (5/8) Feb 19 2014 Actually, the thread can't be moved I believe, it is here forever.
- Craig Dillabaugh (5/39) Feb 19 2014 It is not part of the standard library, but you may want to have
- Craig Dillabaugh (4/9) Feb 19 2014 Also meant to add, this thread belongs in the D.learn forum
- nazriel (4/38) Feb 19 2014 Witaj Kamil :)
- Stephan Schiffels (114/123) Feb 20 2014 Hi Kamil,
- Kamil Slowikowski (7/12) Feb 20 2014 Stephan, awesome! Thank you very much for sharing your classes.
- Artem Tarasov (4/6) Feb 20 2014 That's not an error, that's two different ways to access files:
- Stephan Schiffels (8/20) Feb 20 2014 You're welcome. If you manage to put GzipOut.finish() into the
- Per Nordlöw (18/24) May 03 2015 I've polished your module a bit at:
- Per Nordlöw (3/4) May 03 2015 https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb001...
- Nordlöw (3/5) May 11 2017 Should be
- Russel Winder via Digitalmars-d (13/48) May 03 2015 And there is Zipios++
Hi there,

I'm new to D and have a lot of learning ahead of me. It would be extremely helpful to me if someone with D experience could show me some code examples.

I'd like to neatly read and write gzipped files for my work. I have read several threads on these forums on the topic of std.zlib or std.zip and I haven't been able to figure it out.

Here's a Python script that does what I want. Can you please show me an example in D that does the same thing?

<code>
import gzip

with gzip.open("input.gz") as stream:
    for line in stream:
        print line

with gzip.open("output.gz", "w") as stream:
    stream.write("some output goes here\n")
</code>

I have a second request. I would like to start using D more in my work, and in particular I would like to use and extend the BioD library. Artem Tarasov made a nice module to handle BGZF, and I would like to see an example like my Python code above using Artem's module.

Read more about BGZF:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

BioD:
https://github.com/biod/BioD/blob/d2bea0a0da63eb820fcf11ae367456b2c367ec04/bio/core/bgzf/compress.d
Feb 19 2014
Wow, that's unexpected :)

Unfortunately, there's no standard module for processing gzip/bz2. The former can be dealt with using etc.c.zlib, but there's no convenient interface for working with a file as a stream. Thus, the easiest way that I know of is as follows:

    import std.stdio, std.process;

    auto pipe = pipeShell("gunzip -c " ~ filename); // replace with pigz if you wish
    File input = pipe.stdout;

Regarding your second request, this forum is not an appropriate place to provide usage examples for a library, so that will go into a private e-mail.

On Wed, Feb 19, 2014 at 7:51 PM, Kamil Slowikowski <kslowikowski gmail.com> wrote:
> I have a second request. I would like to start using D more in my work, and in particular I would like to use and extend the BioD library. Artem Tarasov made a nice module to handle BGZF, and I would like to see an example like my Python code above using Artem's module.
Feb 19 2014
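The pipe approach from the post above can be sketched a little more fully. This is a minimal sketch, assuming `gunzip` is available on the PATH; `gunzipLines` is a hypothetical helper name, and the call to `escapeShellFileName` is an addition that is not in the original snippet, included so that file names containing spaces or quotes survive the shell.

```d
import std.process : Redirect, escapeShellFileName, pipeShell, wait;
import std.stdio : writeln;

/// Hypothetical helper: returns the decompressed lines of a gzip file
/// by reading the standard output of an external `gunzip -c` process.
string[] gunzipLines(string filename)
{
    // Only stdout is redirected; stdin and stderr are inherited.
    auto pipes = pipeShell("gunzip -c " ~ escapeShellFileName(filename),
                           Redirect.stdout);
    string[] lines;
    foreach (line; pipes.stdout.byLine)
        lines ~= line.idup; // byLine reuses its buffer, so copy each line
    wait(pipes.pid);        // reap the child process
    return lines;
}

void main(string[] args)
{
    if (args.length != 2)
    {
        writeln("Usage: ", args[0], " <file.gz>");
        return;
    }
    foreach (line; gunzipLines(args[1]))
        writeln(line);
}
```

Collecting all lines into an array keeps the sketch short; for large files, iterating over `pipes.stdout.byLine` directly avoids holding everything in memory.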
On Wednesday, 19 February 2014 at 16:27:32 UTC, Artem Tarasov wrote:
> Unfortunately, there's no standard module for processing gzip/bz2.

std.zlib handles gzip, but it doesn't present a file or range interface over it. This will work though:

    void main() {
        import std.zlib;
        import std.stdio;

        auto uc = new UnCompress();
        foreach(chunk; File("testd.gz").byChunk(1024)) {
            auto uncompressed = uc.uncompress(chunk);
            writeln(cast(string) uncompressed);
        }
        // also look at anything left in the buffer
        writeln(cast(string) uc.flush());
    }

And if you are writing, use new Compress(HeaderFormat.gzip), then call the compress method and write what it returns to the file.
Feb 19 2014
Ah, indeed. I dismissed it because it allocates on each call, and heavy GC usage in a multithreaded app is a performance killer.

On Wed, Feb 19, 2014 at 8:36 PM, Adam D. Ruppe <destructionator gmail.com> wrote:
> std.zlib handles gzip but it doesn't present a file nor range interface over it.
Feb 19 2014
On Wednesday, 19 February 2014 at 16:36:29 UTC, Adam D. Ruppe wrote:
> On Wednesday, 19 February 2014 at 16:27:32 UTC, Artem Tarasov wrote:
>> Unfortunately, there's no standard module for processing gzip/bz2.
>
> std.zlib handles gzip but it doesn't present a file nor range interface over it. This will work though:
>
>     void main() {
>         import std.zlib;
>         import std.stdio;
>
>         auto uc = new UnCompress();
>         foreach(chunk; File("testd.gz").byChunk(1024)) {
>             auto uncompressed = uc.uncompress(chunk);
>             writeln(cast(string) uncompressed);
>         }
>         // also look at anything left in the buffer
>         writeln(cast(string) uc.flush());
>     }

Regrettably, the above code has a bug. Currently, std.zlib stores a reference to the buffer passed to it, and since byChunk reuses the buffer, the code will fail when uncompressing multiple chunks.
Feb 19 2014
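One way to sidestep the buffer-reuse problem described above is to hand `UnCompress` a private copy of every chunk. The following is a minimal sketch under that assumption; `readGzip` and its `chunkSize` parameter are hypothetical names, and reading the whole file into one string is a simplification chosen for brevity.

```d
import std.stdio : File, write;
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical helper: decompresses a whole gzip file into memory,
/// feeding zlib one chunk at a time.
string readGzip(string path, size_t chunkSize = 4096)
{
    auto uc = new UnCompress(HeaderFormat.gzip);
    char[] text;
    foreach (chunk; File(path, "rb").byChunk(chunkSize))
    {
        // byChunk reuses its buffer between iterations, so give
        // UnCompress its own copy of each chunk.
        text ~= cast(const(char)[]) uc.uncompress(chunk.dup);
    }
    text ~= cast(const(char)[]) uc.flush(); // anything still buffered
    return text.idup;
}

void main(string[] args)
{
    if (args.length == 2)
        write(readGzip(args[1]));
}
```

The extra `.dup` costs one allocation per chunk, so a larger chunk size reduces the overhead.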
On Wednesday, 19 February 2014 at 16:36:29 UTC, Adam D. Ruppe wrote:
> And if you are writing, use new Compress(HeaderFormat.gzip) then call the compress method and write what it returns to the file.

I successfully read and printed the contents of a gzipped file, but the documentation is too sparse for me to figure out why I can't write a gzipped file. I'd appreciate any tips.

Here's the output:

- - -
$ echo -e "hi there\nhere's some text in a file\n-K" | gzip > test.gz
$ zcat test.gz
hi there
here's some text in a file
-K
$ ./zfile.d test.gz out.gz
hi there
here's some text in a file
-K
$ zcat out.gz

gzip: out.gz: unexpected end of file
- - -

And the code:

- - -
// zfile.d
import std.stdio, std.stream, std.zlib, std.c.process, std.process, std.file;

void main(string[] args) {
    if (args.length != 3) {
        writefln("Usage: ./%s <file> <output>", args[0]);
        exit(0);
    }

    // Read command line arguments.
    string filename = args[1];
    string outfile = args[2];
    auto len = filename.length;

    std.file.File input;
    // Automatically decompress the file if it ends with "gz".
    if (filename[len - 2 .. len] == "gz") {
        auto pipe = pipeShell("gunzip -c " ~ filename);
        input = pipe.stdout;
    } else {
        input = std.stdio.File(filename);
    }

    // Write data to a stream in memory
    auto mem = new MemoryStream();
    string line;
    while ((line = input.readln()) !is null) {
        mem.write(line);
        // Also write the line to stdout.
        write(line);
    }

    // Put the uncompressed data into a new gz file.
    auto comp = new Compress(HeaderFormat.gzip);
    auto compressed = comp.compress(mem.data);
    //comp.flush(); // Does not fix the problem.

    // See the raw compressed bytes.
    //writeln(cast(ubyte[])compressed);

    // Write compressed output to a file.
    with (new std.stream.File(outfile, FileMode.OutNew)) {
        writeExact(compressed.ptr, compressed.length);
        //write(cast(ubyte[])compressed); // Also does not work.
    }
}
- - -
Feb 19 2014
On Thursday, 20 February 2014 at 03:58:01 UTC, Kamil Slowikowski wrote:
> auto compressed = comp.compress(mem.data);
> //comp.flush(); // Does not fix the problem.

You need to write each compressed block and then the flush output. So more like:

    writeToFile(comp.compress(mem.data)); // loop over all the data btw
    writeToFile(comp.flush());

and that should do it. flush returns the remainder of the data.
Feb 19 2014
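The advice above (write every block returned by `compress`, then write what `flush` returns) can be sketched with `std.stdio.File` as follows. This is a minimal sketch, not a definitive implementation; `writeGzip` is a hypothetical helper name and stands in for the `writeToFile` placeholder used in the post.

```d
import std.stdio : File;
import std.zlib : Compress, HeaderFormat;

/// Hypothetical helper: writes the given pieces of text to `path`
/// as a single gzip stream.
void writeGzip(string path, const string[] pieces)
{
    auto comp = new Compress(HeaderFormat.gzip);
    auto f = File(path, "wb");
    foreach (piece; pieces)
    {
        // compress() may return very little until zlib has buffered
        // enough input; write whatever comes back each time.
        f.rawWrite(cast(const(ubyte)[]) comp.compress(piece));
    }
    // flush() returns the rest of the compressed data, including the
    // end of the gzip stream. Without it, gunzip reports
    // "unexpected end of file".
    f.rawWrite(cast(const(ubyte)[]) comp.flush());
}

void main()
{
    writeGzip("output.gz", ["some output goes here\n"]);
}
```

The result can be checked from the shell with `zcat output.gz`.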
On Thursday, 20 February 2014 at 04:03:45 UTC, Adam D. Ruppe wrote:
> On Thursday, 20 February 2014 at 03:58:01 UTC, Kamil Slowikowski wrote:
>> auto compressed = comp.compress(mem.data);
>> //comp.flush(); // Does not fix the problem.
>
> You need to write each compressed block and the flush. So more like:
>
>     writeToFile(comp.compress(mem.data)); // loop over all the data btw
>     writeToFile(comp.flush());
>
> and that should do it. flush returns the remainder of the data.

Hey Adam, thanks for the tip. Next problem: the output has strange characters, as shown:

- - -
$ ./zfile.d test.gz out.gz
hi there
here's some text in a file
-K

Thu Feb 20 00:07:52 kamil W530 ~/work/dlang
$ zcat out.gz
hi there
here's some text in a file
-K

$ zcat test.gz | wc -c
39
$ zcat out.gz | wc -c
63

$ zcat test.gz | hexdump
0000000 6968 7420 6568 6572 680a 7265 2765 2073
0000010 6f73 656d 7420 7865 2074 6e69 6120 6620
0000020 6c69 0a65 4b2d 000a
0000027

$ zcat out.gz | hexdump
0000000 0009 0000 0000 0000 6968 7420 6568 6572
0000010 1b0a 0000 0000 0000 6800 7265 2765 2073
0000020 6f73 656d 7420 7865 2074 6e69 6120 6620
0000030 6c69 0a65 0003 0000 0000 0000 4b2d 000a
000003f
- - -

Code:

- - -
import std.stdio, std.stream, std.zlib, std.c.process, std.process, std.file;

void main(string[] args) {
    if (args.length != 3) {
        writefln("Usage: ./%s <file> <output>", args[0]);
        exit(0);
    }

    // Read command line arguments.
    string filename = args[1];
    string outfile = args[2];
    auto len = filename.length;

    std.file.File input;
    // Automatically decompress the file if it ends with "gz".
    if (filename[len - 2 .. len] == "gz") {
        auto pipe = pipeShell("gunzip -c " ~ filename);
        input = pipe.stdout;
    } else {
        input = std.stdio.File(filename);
    }

    // Write data to a stream in memory
    auto mem = new MemoryStream();
    string line;
    while ((line = input.readln()) !is null) {
        mem.write(line);
        // Also write the line to stdout.
        write(line);
    }

    // Put the data into a new gz file.
    auto comp = new Compress(HeaderFormat.gzip);

    // See the raw compressed bytes.
    //writeln(cast(ubyte[])compressed);

    // Write compressed output to a file.
    with (new std.stream.File(outfile, FileMode.OutNew)) {
        auto compressed = comp.compress(mem.data);
        writeExact(compressed.ptr, compressed.length);
        // Get any remaining data.
        compressed = comp.flush();
        writeExact(compressed.ptr, compressed.length);
    }
}
- - -
Feb 19 2014
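A possible reading of the hexdump in the post above, offered as an assumption rather than something confirmed in the thread: each line in out.gz is preceded by an 8-byte little-endian count (0x09, 0x1b, 0x03) that equals the length of that line. That pattern is what one would expect if `mem.write(line)` stores each string together with its length. The sketch below only checks the arithmetic; `totalBytes` is a hypothetical helper and the 8-byte prefix size is taken from the hexdump.

```d
import std.stdio : writeln;

/// Hypothetical helper: total size when every line is preceded by
/// `prefixSize` extra bytes (0 means the plain text only).
size_t totalBytes(const string[] lines, size_t prefixSize)
{
    size_t total;
    foreach (line; lines)
        total += prefixSize + line.length;
    return total;
}

void main()
{
    // The three lines of the test file from the post above.
    string[] lines = ["hi there\n", "here's some text in a file\n", "-K\n"];

    writeln(totalBytes(lines, 0)); // 39, matching `zcat test.gz | wc -c`
    // Assumption: one 8-byte length prefix per written string.
    writeln(totalBytes(lines, 8)); // 63, matching `zcat out.gz | wc -c`
}
```

If this reading is right, the extra bytes come from how the text was put into the memory stream, not from compression or from the choice between `writeExact` and `rawWrite`.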
On Wednesday, 19 February 2014 at 16:27:32 UTC, Artem Tarasov wrote:
> the easiest way that I know of is as follows:
>
>     import std.stdio, std.process;
>     auto pipe = pipeShell("gunzip -c " ~ filename); // replace with pigz if you wish
>     File input = pipe.stdout;

Artem, thank you! I've used a similar trick in the past with Python because calling the system's gzip or pigz in a subprocess.Pipe is faster than using the Python gzip module. I'm very glad to see how easy it is in D.

> Regarding your second request, this forum is not an appropriate place to provide usage examples for a library, so that will go into a private e-mail.

Thanks, again! I'm looking forward to hearing from you :)

@Adam D. Ruppe
Thanks for your example! I couldn't find such an example anywhere on the web.

@Craig Dillabaugh
Please feel free to move the thread, sorry for posting in the wrong place.
Feb 19 2014
> @Craig Dillabaugh
> Please feel free to move the thread, sorry for posting in the wrong place.

Actually, I believe the thread can't be moved; it is here forever. Not a big deal though. Lots of people new to D post questions here and miss the D.learn forum, so you are not alone. Since I didn't have a good answer to your original question, I decided I should let you know about D.learn.
Feb 19 2014
On Wednesday, 19 February 2014 at 15:51:53 UTC, Kamil Slowikowski wrote:
> I'd like to neatly read and write gzipped files for my work.
> [...]

It is not part of the standard library, but you may want to have a look at the GzipInputStream in vibeD.

http://vibed.org/api/vibe.stream.zlib/GzipInputStream
Feb 19 2014
On Wednesday, 19 February 2014 at 16:32:54 UTC, Craig Dillabaugh wrote:
> It is not part of the standard library, but you may want to have a look at the GzipInputStream in vibeD.
>
> http://vibed.org/api/vibe.stream.zlib/GzipInputStream

Also meant to add, this thread belongs in the D.learn forum rather than here.
Feb 19 2014
On Wednesday, 19 February 2014 at 15:51:53 UTC, Kamil Slowikowski wrote:
> Hi there, I'm new to D and have a lot of learning ahead of me.
> [...]

Witaj Kamil :) Feel free to also visit the #d channel on the freenode IRC network.
Feb 19 2014
On Wednesday, 19 February 2014 at 15:51:53 UTC, Kamil Slowikowski wrote:
> Hi there, I'm new to D and have a lot of learning ahead of me. It would be extremely helpful to me if someone with D experience could show me some code examples. I'd like to neatly read and write gzipped files for my work. I have read several threads on these forums on the topic of std.zlib or std.zip and I haven't been able to figure it out.

Hi Kamil,

I am glad someone has the exact same problem as I had. I actually solved this, inspired by the Python API you quoted above. I wrote these classes: GzipInputRange, GzipByLine, and GzipOut. Here is how I can now use them:

_____________________
import gzip;
import std.stdio;

void main() {
    auto byLine = new GzipByLine("test.gz");
    foreach(line; byLine)
        writeln(line);

    auto gzipOutFile = new GzipOut("testout.gz");
    gzipOutFile.compress("bla bla bla");
    gzipOutFile.finish();
}

That is all quite convenient, and I was wondering whether something like that would be useful even in Phobos. But it's clear that for Phobos things would involve a lot more work to comply with the requirements. This so far simply served my needs and is not as generic as it could be. Here is the code:

___________gzip.d__________________
import std.zlib;
import std.stdio;
import std.range;
import std.traits;

class GzipInputRange {
    UnCompress uncompressObj;
    File f;
    auto CHUNKSIZE = 0x4000;
    ReturnType!(f.byChunk) chunkRange;
    bool exhausted;
    char[] uncompressedBuffer;
    size_t bufferIndex;

    this(string filename) {
        f = File(filename, "r");
        chunkRange = f.byChunk(CHUNKSIZE);
        uncompressObj = new UnCompress();
        load();
    }

    void load() {
        if(!chunkRange.empty) {
            auto raw = chunkRange.front.dup;
            chunkRange.popFront();
            uncompressedBuffer = cast(char[])uncompressObj.uncompress(raw);
            bufferIndex = 0;
        }
        else {
            if(!exhausted) {
                uncompressedBuffer = cast(char[])uncompressObj.flush();
                exhausted = true;
                bufferIndex = 0;
            }
            else
                uncompressedBuffer.length = 0;
        }
    }

    @property char front() {
        return uncompressedBuffer[bufferIndex];
    }

    void popFront() {
        bufferIndex += 1;
        if(bufferIndex >= uncompressedBuffer.length) {
            load();
            bufferIndex = 0;
        }
    }

    @property bool empty() {
        return uncompressedBuffer.length == 0;
    }
}

class GzipByLine {
    GzipInputRange range;
    char[] buf;

    this(string filename) {
        this.range = new GzipInputRange(filename);
        popFront();
    }

    @property bool empty() {
        return buf.length == 0;
    }

    void popFront() {
        buf.length = 0;
        while(!range.empty && range.front != '\n') {
            buf ~= range.front;
            range.popFront();
        }
        range.popFront();
    }

    string front() {
        return buf.idup;
    }
}

class GzipOut {
    Compress compressObj;
    File f;

    this(string filename) {
        f = File(filename, "w");
        compressObj = new Compress(HeaderFormat.gzip);
    }

    void compress(string s) {
        auto compressed = compressObj.compress(s.dup);
        f.rawWrite(compressed);
    }

    void finish() {
        auto compressed = compressObj.flush();
        f.rawWrite(compressed);
    }
}
Feb 20 2014
On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
> Hi Kamil, I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes: GzipInputRange, GzipByLine, and GzipOut.

Stephan, awesome! Thank you very much for sharing your classes. It's nice to see how you've approached this problem. Your code is very clear and easy to understand (for me).

Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.
Feb 20 2014
On Thu, Feb 20, 2014 at 9:05 PM, Kamil Slowikowski <kslowikowski gmail.com> wrote:
> Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.

That's not an error. Those are two different ways to access files: std.stream.File and std.stdio.File. The latter is the recommended one.
Feb 20 2014
On Thursday, 20 February 2014 at 17:05:37 UTC, Kamil Slowikowski wrote:
> On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
>> Hi Kamil, I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes: GzipInputRange, GzipByLine, and GzipOut.
>
> Stephan, awesome! Thank you very much for sharing your classes. It's nice to see how you've approached this problem. Your code is very clear and easy to understand (for me). Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.

You're welcome. If you manage to put GzipOut.finish() into the destructor of the class to automatically flush the file upon destruction of the object, let me know. I tried this and it gives a segfault. I was too lazy to try to understand it, but I am sure it must be possible in principle.

Stephan
Feb 20 2014
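A likely cause of the segfault mentioned above, stated here as an assumption since the thread does not confirm it: when the garbage collector runs a class destructor, other GC-managed objects the class refers to (such as the `Compress` instance) may already have been finalized, and allocating memory at that point is not allowed. A common alternative is to finish the stream deterministically with `scope (exit)`. The sketch below uses a hypothetical `GzipWriter` class with hypothetical `put` and `finish` methods; it is not Stephan's `GzipOut`.

```d
import std.stdio : File;
import std.zlib : Compress, HeaderFormat;

/// Hypothetical writer, similar in spirit to the GzipOut class above.
final class GzipWriter
{
    private Compress comp;
    private File f;
    private bool finished;

    this(string path)
    {
        f = File(path, "wb");
        comp = new Compress(HeaderFormat.gzip);
    }

    /// Compresses `s` and writes whatever zlib returns so far.
    void put(string s)
    {
        f.rawWrite(cast(const(ubyte)[]) comp.compress(s));
    }

    /// Writes the remaining compressed data and closes the file.
    /// Safe to call more than once.
    void finish()
    {
        if (finished)
            return;
        f.rawWrite(cast(const(ubyte)[]) comp.flush());
        f.close();
        finished = true;
    }
}

void main()
{
    auto w = new GzipWriter("testout.gz");
    scope (exit) w.finish(); // runs when main exits, even on exceptions
    w.put("bla bla bla\n");
}
```

With this layout the caller decides when the stream ends, so nothing depends on when, or whether, the garbage collector finalizes the object.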
On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
> Hi Kamil, I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes: GzipInputRange, GzipByLine, and GzipOut. Here is how I can now use them:

I've polished your module a bit at:

https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d

Reflections:

- Performance is terrible even with -release -noboundscheck -unittest. About 20 times slower than zcat $F | wc -l. I'm guessing

      _chunkRange.front.dup

  slows things down. I tried removing the .dup but then I get

      std.zlib.ZlibException std/zlib.d(59): data error

  I don't believe we should have to copy _chunkRange.front, but I can't figure out how to solve it. Does anybody understand how to fix this?

- Shouldn't GzipOut.finish() call this.close()? Otherwise the file remains unflushed.

- And what about calling this.close() in GzipOut.~this()? Is that needed too?
May 03 2015
> I've polished your module a bit at:

https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d

Latest at https://github.com/nordlow/justd/blob/master/zio.d
May 03 2015
On Sunday, 3 May 2015 at 14:35:49 UTC, Per Nordlöw wrote:
> Latest at https://github.com/nordlow/justd/blob/master/zio.d

Should be https://github.com/nordlow/phobos-next/blob/master/src/zio.d
May 11 2017
And there is Zipios++

http://zipios.sourceforge.net/

On Sun, 2015-05-03 at 14:33 +0000, via Digitalmars-d wrote:
> On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
>> Hi Kamil, I am glad someone has the exact same problem as I had.
>> [...]
>
> I've polished your module a bit at:
>
> https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d
>
> [...]

--
Russel.
May 03 2015