digitalmars.D.learn - Processing a gzipped csv-file line by line

Nordlöw <per.nordlow gmail.com> writes:
What's the fastest way to decompress on the fly and process a gzipped
csv-file line by line?

Is it possible to combine

http://dlang.org/phobos/std_zlib.html

with some stream variant of

File(path).byLineFast

?
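
(For reference, a rough sketch of the kind of combination I have in mind, assuming a hypothetical gzip file data.csv.gz: std.zlib.UnCompress fed from File.byChunk, with the incomplete last line of each chunk carried over to the next one. Just an illustration, not tuned for speed.)

import std.stdio : File, writeln;
import std.zlib : HeaderFormat, UnCompress;

void main()
{
    auto decompressor = new UnCompress(HeaderFormat.gzip);
    const(char)[] carry;  // incomplete line left over from the previous chunk

    void emitLines(const(char)[] text)
    {
        text = carry ~ text;
        size_t start = 0;
        foreach (i, c; text)
        {
            if (c == '\n')
            {
                writeln(text[start .. i]);  // one complete line, without '\n'
                start = i + 1;
            }
        }
        carry = text[start .. $].dup;  // keep the tail for the next chunk
    }

    // .dup because byChunk reuses its internal buffer
    foreach (chunk; File("data.csv.gz").byChunk(64 * 1024))
        emitLines(cast(const(char)[]) decompressor.uncompress(chunk.dup));
    emitLines(cast(const(char)[]) decompressor.flush());

    if (carry.length)
        writeln(carry);  // last line if the file lacks a trailing newline
}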
May 10 2017
ketmar <ketmar ketmar.no-ip.org> writes:
Nordlöw wrote:

 What's the fastest way to decompress on the fly and process a gzipped
 csv-file line by line?

 Is it possible to combine

 http://dlang.org/phobos/std_zlib.html

 with some stream variant of

 File(path).byLineFast

 ?
iv.vfs[0] can do that (transparently decompress gzip files, and more). it is far from the "fastest", though, so i don't think it will fit. still, i can't miss such a great opportunity for self-promotion.

[0] http://repo.or.cz/iv.d.git/tree/HEAD:/vfs
May 10 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
 What's the fastest way to decompress on the fly and process a
 gzipped csv-file line by line?

 Is it possible to combine

 http://dlang.org/phobos/std_zlib.html

 with some stream variant of

 File(path).byLineFast

 ?
I suggest you take a look at Steven's iopipe (also watch his DConf presentation). It should be very simple.
May 10 2017
H. S. Teoh via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Wed, May 10, 2017 at 11:17:44PM +0000, Nicholas Wilson via
Digitalmars-d-learn wrote:
 On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
 What's the fastest way to decompress on the fly and process a gzipped
 csv-file line by line?
 
 Is it possible to combine
 
 http://dlang.org/phobos/std_zlib.html
 
 with some stream variant of
 
 File(path).byLineFast
 
 ?
I suggest you take a look at Steven's iopipe (also watch his DConf presentation). It should be very simple.
Also, if you need to parse lots of CSV data very fast, you might be
interested in this:

	https://github.com/quickfur/fastcsv


T

-- 
Just because you can, doesn't mean you should.
May 10 2017
Seb <seb wilzba.ch> writes:
On Wednesday, 10 May 2017 at 23:19:15 UTC, H. S. Teoh wrote:
 Also, if you need to parse lots of CSV data very fast, you 
 might be interested in this:

 	https://github.com/quickfur/fastcsv


 T
Or asdf: https://github.com/tamediadigital/asdf
May 10 2017
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/10/17 7:17 PM, Nicholas Wilson wrote:
 On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
 What's the fastest way to decompress on the fly and process a gzipped
 csv-file line by line?

 Is it possible to combine

 http://dlang.org/phobos/std_zlib.html

 with some stream variant of

 File(path).byLineFast

 ?
I suggest you take a look at Steven's iopipe (also watch his DConf presentation). It should be very simple.
Yeah, this should work and be quite fast:

import iopipe.zip;
import iopipe.textpipe;
import iopipe.bufpipe;
import iopipe.stream;

foreach(line; openDev(path).bufd.unzip.decodeText.byLineRange)
{
    // `line` is one decoded line of the decompressed stream
}

I think that was actually one of my slide examples.

-Steve
May 12 2017
Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
 What's the fastest way to decompress on the fly and process a
 gzipped csv-file line by line?

 Is it possible to combine

 http://dlang.org/phobos/std_zlib.html

 with some stream variant of

 File(path).byLineFast

 ?
You can't really parse a CSV file line by line. H.S. Teoh mentioned fastcsv, but it requires all the data to be in memory.

If you can get the zip to decompress into a range of dchar, then std.csv will work with it. It is far from the fastest, though; much of the speed is lost because it supports input ranges and doesn't specialize on any other range type.
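
For illustration, a rough sketch of that std.csv route (the file name, chunk size and the std.zlib plumbing are assumptions, not something from the post above): decompress chunk by chunk, join the chunks into one lazy character range, and hand it to csvReader, which consumes it as an input range of dchar.

import std.algorithm : cache, joiner, map;
import std.csv : csvReader;
import std.stdio : File, writeln;
import std.zlib : HeaderFormat, UnCompress;

void main()
{
    auto decompressor = new UnCompress(HeaderFormat.gzip);

    // One lazy character range over the whole decompressed stream.
    // (A complete version would also append decompressor.flush() after
    // the last chunk to pick up any data still buffered in zlib.)
    auto text = File("data.csv.gz")
        .byChunk(64 * 1024)
        .map!(chunk => cast(const(char)[]) decompressor.uncompress(chunk.dup))
        .cache   // evaluate each chunk's decompression exactly once
        .joiner; // narrow strings autodecode, so csvReader sees dchar

    foreach (record; csvReader!string(text))
    {
        foreach (field; record)
            writeln(field);
    }
}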
May 10 2017
H. S. Teoh via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
[...]
 H.S. Teoh mentioned fastcsv, but it requires all the data to be in memory.
Or you could use std.mmfile. But if it's decompressed data, then it would still need to be small enough to fit in memory. Well, in theory you *could* use an anonymous mapping for std.mmfile as an OS-backed virtual memory buffer to decompress into, but it's questionable whether that's really worth the effort.
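
(A tiny sketch of that std.mmfile route, assuming the data has already been decompressed to a hypothetical data.csv; the mapping gives a character slice over the whole file without reading it eagerly.)

import std.mmfile : MmFile;
import std.string : lineSplitter;

void main()
{
    auto mmf = new MmFile("data.csv");       // read-only mapping
    auto text = cast(const(char)[]) mmf[];   // slice over the whole file
    foreach (line; text.lineSplitter)
    {
        // `line` is a slice into the mapping; pages are faulted in lazily
    }
}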
 If you can get the zip to decompress into a range of dchar, then
 std.csv will work with it. It is far from the fastest, though; much of
 the speed is lost because it supports input ranges and doesn't
 specialize on any other range type.
I actually spent some time today to look into whether fastcsv can
possibly be made to work with general input ranges as long as they
support slicing... and immediately ran into the infamous autodecoding
issue: strings are not random-access ranges because of autodecoding, so
it would require either extensive code surgery to make it work, or ugly
hacks to bypass autodecoding. I'm quite tempted to attempt the latter,
in fact, but not now since it's getting busier at work and I don't have
that much free time to spend on a major refactoring of fastcsv.

Alternatively, I could possibly hack together a version of fastcsv that
took a range of const(char)[] as input (rather than a single string),
so that, in theory, it could handle arbitrarily large input files as
long as the caller can provide a range of data blocks, e.g.,
File.byChunk, or in this particular case, a range of decompressed data
blocks from whatever decompressor is used to extract the data. As long
as you consume the individual rows without storing references to them
indefinitely (don't try to make an array of the entire dataset),
fastcsv's optimizations should still work, since unreferenced blocks
will eventually get cleaned up by the GC when memory runs low.


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
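
(For what it's worth, one common way to sidestep autodecoding, possibly among the "ugly hacks" alluded to above, is std.utf.byCodeUnit, which presents a string as a random-access, sliceable range of char. A minimal illustration:)

import std.range.primitives : hasSlicing, isRandomAccessRange;
import std.utf : byCodeUnit;

void main()
{
    string s = "a,b,c";
    static assert(!isRandomAccessRange!string);  // autodecoding: ranges of dchar
    auto u = s.byCodeUnit;                       // same data, viewed as code units
    static assert(isRandomAccessRange!(typeof(u)));
    static assert(hasSlicing!(typeof(u)));
    assert(u[2] == 'b');                         // O(1) indexing, no decoding
}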
May 11 2017
Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Friday, 12 May 2017 at 00:18:47 UTC, H. S. Teoh wrote:
 On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via 
 Digitalmars-d-learn wrote: [...]
 H.S. Teoh mentioned fastcsv, but it requires all the data to be in
 memory.
Or you could use std.mmfile. But if it's decompressed data, then it would still need to be small enough to fit in memory. Well, in theory you *could* use an anonymous mapping for std.mmfile as an OS-backed virtual memory buffer to decompress into, but it's questionable whether that's really worth the effort.
 If you can get the zip to decompress into a range of dchar, then
 std.csv will work with it. It is far from the fastest, though; much
 of the speed is lost because it supports input ranges and doesn't
 specialize on any other range type.
I actually spent some time today to look into whether fastcsv can
possibly be made to work with general input ranges as long as they
support slicing... and immediately ran into the infamous autodecoding
issue: strings are not random-access ranges because of autodecoding, so
it would require either extensive code surgery to make it work, or ugly
hacks to bypass autodecoding. I'm quite tempted to attempt the latter,
in fact, but not now since it's getting busier at work and I don't have
that much free time to spend on a major refactoring of fastcsv.

Alternatively, I could possibly hack together a version of fastcsv that
took a range of const(char)[] as input (rather than a single string),
so that, in theory, it could handle arbitrarily large input files as
long as the caller can provide a range of data blocks, e.g.,
File.byChunk, or in this particular case, a range of decompressed data
blocks from whatever decompressor is used to extract the data. As long
as you consume the individual rows without storing references to them
indefinitely (don't try to make an array of the entire dataset),
fastcsv's optimizations should still work, since unreferenced blocks
will eventually get cleaned up by the GC when memory runs low.

T
I hacked your code to work with std.experimental.allocator. If I remember correctly, it was a fair bit faster for my use. Let me know if you would like me to tidy it up into a pull request. Thanks for the library.

Also - I sent you an email. Not sure if you got it.

Laeeth
May 11 2017
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/11/17 8:18 PM, H. S. Teoh via Digitalmars-d-learn wrote:
 On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
 If you can get the zip to decompress into a range of dchar, then
 std.csv will work with it. It is far from the fastest, though; much of
 the speed is lost because it supports input ranges and doesn't
 specialize on any other range type.
I actually spent some time today to look into whether fastcsv can possibly be made to work with general input ranges as long as they support slicing... and immediately ran into the infamous autodecoding issue: strings are not random-access ranges because of autodecoding, so it would require either extensive code surgery to make it work, or ugly hacks to bypass autodecoding. I'm quite tempted to attempt the latter, in fact, but not now since it's getting busier at work and I don't have that much free time to spend on a major refactoring of fastcsv.
Yeah, iopipe treats char[] as a random-access sliceable range :) Autodecoding gets annoying if you want to do anything fancy (like chain(somestr, someotherstr)).
 Alternatively, I could possibly hack together a version of fastcsv that
 took a range of const(char)[] as input (rather than a single string), so
 that, in theory, it could handle arbitrarily large input files as long
 as the caller can provide a range of data blocks, e.g., File.byChunk, or
 in this particular case, a range of decompressed data blocks from
 whatever decompressor is used to extract the data.  As long as you
 consume the individual rows without storing references to them
 indefinitely (don't try to make an array of the entire dataset),
 fastcsv's optimizations should still work, since unreferenced blocks
 will eventually get cleaned up by the GC when memory runs low.
I'm interested in getting a fast CSV parser built on top of iopipe. I may fork your code and see if I can get it to work. Since you already work on arrays, it should be quite simple, as arrays are also iopipes by default.

-Steve
May 12 2017
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
 What's the fastest way to decompress on the fly and process a
 gzipped csv-file line by line?

 Is it possible to combine

 http://dlang.org/phobos/std_zlib.html

 with some stream variant of

 File(path).byLineFast

 ?
I was curious what byLineFast was; I'm guessing it's from here: https://github.com/biod/BioD/blob/master/bio/core/utils/bylinefast.d. I didn't test it, but it appears it may pre-date the speed improvements made to std.stdio.byLine perhaps a year and a half ago. If so, it might be worth comparing it to the current Phobos version, and of course to iopipe.

As mentioned in one of the other replies, byLine and its variants aren't appropriate for CSV with escapes. For that, a real CSV parser is needed. As an alternative, run a converter that converts from CSV to another format.

--Jon
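
(A small made-up illustration with std.csv of why splitting on newlines first breaks down for CSV with escapes: a quoted field may itself contain a newline.)

import std.csv : csvReader;
import std.stdio : writeln;

void main()
{
    // A byLine-style split would tear the second record in two;
    // csvReader keeps the quoted, newline-containing field intact.
    auto data = "name,notes\nAlice,\"line one\nline two\"\n";
    foreach (record; csvReader!string(data))
        writeln(record);
}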
May 10 2017