digitalmars.D.learn - Using iopipe to stream a gzipped file
Hi,

I have a very large gzipped text file (all ASCII characters and ~500GB) that I want to stream and process line-by-line, and I thought the iopipe library would be perfect for this, but I can't seem to get it to work. So far, this is the closest I have to getting it to work:

import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import iopipe.stream;

void main()
{
    auto fileToRead = openDev("file.gz").bufd.unzip(CompressionFormat.gzip);

    foreach (line; fileToRead.assumeText.byLineRange!false)
    {
        // do stuff
    }
}

but this only processes the first ~200-odd lines (I guess the initial read into the buffer). Can anyone help me out?

Thanks very much

Andrew
Jan 03 2018
On 1/3/18 9:45 AM, Andrew wrote:
> Hi, I have a very large gzipped text file (all ASCII characters and
> ~500GB) that I want to stream and process line-by-line, and I thought
> the iopipe library would be perfect for this, but I can't seem to get
> it to work.
> [...]
> but this only processes the first ~200-odd lines (I guess the initial
> read into the buffer). Can anyone help me out?

Do you have a sample file I can play with? Your iopipe chain looks correct, so I'm not sure why it wouldn't work.

-Steve
Jan 03 2018
On Wednesday, 3 January 2018 at 16:09:19 UTC, Steven Schveighoffer wrote:
> Do you have a sample file I can play with? Your iopipe chain looks
> correct, so I'm not sure why it wouldn't work.

A sample file (about 250MB) can be found here:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

It should have 1,103,800 lines, but the following code only reports 256:

import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import iopipe.stream;
import std.stdio;

void main()
{
    auto fileToRead = openDev("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
        .bufd
        .unzip(CompressionFormat.gzip);

    auto counter = 0;
    foreach (line; fileToRead.assumeText.byLineRange!false)
    {
        counter++;
    }
    writeln(counter);
}

Thanks for looking into this.

Andrew
Jan 03 2018
On 1/3/18 12:03 PM, Andrew wrote:
> Thanks for looking into this.

So it looks like the file you have is a concatenated gzip file. If I gunzip the file and recompress it, it works properly.

Looking at the docs of zlib inflate [1]:

"Unlike the gunzip utility and gzread() ..., inflate() will not automatically decode concatenated gzip streams. inflate() will return Z_STREAM_END at the end of the gzip stream. The state would need to be reset to continue decoding a subsequent gzip stream."

So what is happening is the inflate function is returning Z_STREAM_END, and I'm considering the stream done from that return code.

I'm not sure yet how to fix this. I suppose I can check if any more data exists, and then re-init and continue. I have to look up what a concatenated gzip file is.

gzread isn't good for generic purposes, because it requires an actual file input (I want to support any input type, including memory data).

-Steve

[1] https://github.com/dlang/phobos/blob/master/etc/c/zlib.d#L874
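For anyone following along, here's a minimal sketch of that reset-and-continue idea at the raw zlib level, using etc.c.zlib directly. This is not iopipe's eventual fix, just the underlying mechanism; the file name and buffer sizes are placeholders:

import etc.c.zlib;
import std.stdio;

void main()
{
    auto f = File("file.gz", "rb");
    ubyte[64 * 1024] inBuf;
    ubyte[64 * 1024] outBuf;

    z_stream strm;
    // windowBits 15 = max window; +32 auto-detects gzip/zlib headers
    if (inflateInit2(&strm, 15 + 32) != Z_OK)
        throw new Exception("inflateInit2 failed");
    scope (exit) inflateEnd(&strm);

    size_t total;
    for (;;)
    {
        auto chunk = f.rawRead(inBuf[]);
        if (chunk.length == 0)
            break; // true end of file
        strm.next_in = chunk.ptr;
        strm.avail_in = cast(uint) chunk.length;
        while (strm.avail_in > 0)
        {
            strm.next_out = outBuf.ptr;
            strm.avail_out = cast(uint) outBuf.length;
            auto rc = inflate(&strm, Z_NO_FLUSH);
            total += outBuf.length - strm.avail_out;
            if (rc == Z_STREAM_END)
                inflateReset(&strm); // one member ended; another may follow
            else if (rc != Z_OK)
                throw new Exception("inflate error");
        }
    }
    writeln(total, " bytes decompressed");
}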
Jan 03 2018
On Thursday, 4 January 2018 at 02:44:09 UTC, Steven Schveighoffer wrote:
> So it looks like the file you have is a concatenated gzip file. If I
> gunzip the file and recompress it, it works properly.
> [...]

Ah thank you, that makes sense. These types of files are compressed using the bgzip utility so that the file can be indexed and specific rows extracted quickly (there are more details on this here: http://www.htslib.org/doc/tabix.html, and the code can be found here: https://github.com/samtools/htslib/blob/develop/bgzf.c)
Jan 04 2018
On 1/4/18 7:01 AM, Andrew wrote:
> Ah thank you, that makes sense. These types of files are compressed
> using the bgzip utility so that the file can be indexed and specific
> rows extracted quickly [...]

Hm... that utility seems to say it will result in a .bgz file extension? So this must be an extraction from one of those files?

In any case, I'll figure out how to deal with concatenated gzip files, and update iopipe. The next version will focus on a bunch of stuff relating to the 2 zip threads recently posted here.

Thanks!

-Steve
Jan 04 2018
On Thursday, 4 January 2018 at 12:15:27 UTC, Steven Schveighoffer wrote:
> Hm... that utility seems to say it will result in a .bgz file
> extension? So this must be an extraction from one of those files?
>
> In any case, I'll figure out how to deal with concatenated gzip
> files, and update iopipe.
> [...]

That would be really great for me, thank you! By default bgzip produces a file with the standard .gz extension. Looking at the code, it adds an extra field to the standard gzip header:

/* BGZF/GZIP header (specialized from RFC 1952; little endian):
 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
 | 31|139|  8|  4|              0|  0|255|      6| 66| 67|      2|BLK_LEN|
 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  BGZF extension:
                ^                               ^   ^   ^
                |                               |   |   |
            FLG.EXTRA                         XLEN  B   C

  BGZF format is compatible with GZIP. It limits the size of each
  compressed block to 2^16 bytes and adds an extra "BC" field in the
  gzip header which records the size.
*/

Thanks again!

Andrew
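To make that layout concrete, here's a hypothetical helper (the function name is made up, and it only handles the happy path) that digs the block length out of a BGZF header following the comment above:

import std.stdio;

// Extract the "BLK_LEN" (BSIZE) field from a BGZF block header.
size_t bgzfBlockSize(const(ubyte)[] hdr)
{
    assert(hdr.length >= 18, "need at least one full BGZF header");
    assert(hdr[0] == 31 && hdr[1] == 139, "not a gzip stream");
    assert(hdr[3] & 4, "FLG.EXTRA not set: plain gzip, not BGZF");

    // XLEN: little-endian ushort at offset 10
    const xlen = hdr[10] | (hdr[11] << 8);
    assert(hdr.length >= 12 + xlen, "truncated extra field");

    // Walk the extra subfields looking for SI1='B', SI2='C'
    auto extra = hdr[12 .. 12 + xlen];
    while (extra.length >= 4)
    {
        const slen = extra[2] | (extra[3] << 8);
        if (extra[0] == 'B' && extra[1] == 'C' && slen == 2)
            return (extra[4] | (extra[5] << 8)) + 1; // BSIZE = block size - 1
        extra = extra[4 + slen .. $];
    }
    assert(0, "no BC subfield: not a BGZF block");
}

void main()
{
    auto f = File("file.gz", "rb");
    ubyte[18] hdr;
    f.rawRead(hdr[]);
    writeln("first BGZF block is ", bgzfBlockSize(hdr[]), " bytes");
}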
Jan 04 2018
On 1/4/18 7:23 AM, Andrew wrote:
> That would be really great for me, thank you!

It's now been updated, see version 0.0.3.

Note, the performance isn't something I focused on. I'll note that gzcat | wc -l is 2x faster than your simple example on that file. I can think of a couple reasons for this:

1. gzcat may use mmap to increase read speed.

2. gzcat may read larger chunks at once (this can be tuned using iopipe as well, just set the optimizedReadSize).

3. gzcat file.gz | iopipe_byline -nooutput is about 20% faster than using wc -l, so it's definitely not the line parsing.

Let me know if this works correctly for your other test cases! If not, file an issue: https://github.com/schveiguy/iopipe/issues

-Steve
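For reference, the chain from earlier in the thread should now work unmodified against 0.0.3; a line counter over a bgzip-compressed file (file name shortened here) looks like:

import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import iopipe.stream;
import std.stdio;

void main()
{
    // device -> buffer -> gunzip -> text -> lines
    auto fileToRead = openDev("file.gz")
        .bufd
        .unzip(CompressionFormat.gzip);

    size_t counter = 0;
    foreach (line; fileToRead.assumeText.byLineRange!false)
        counter++;
    writeln(counter);
}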
Jan 04 2018
On Thursday, 4 January 2018 at 15:48:21 UTC, Steven Schveighoffer wrote:
> It's now been updated, see version 0.0.3.
> [...]
> Let me know if this works correctly for your other test cases! If
> not, file an issue: https://github.com/schveiguy/iopipe/issues

That works perfectly, thank you very much!
Jan 04 2018