
digitalmars.D.learn - Speed of csvReader

reply data pulverizer <data.pulverizer gmail.com> writes:
I have been reading large text files with D's csv file reader and 
have found it slow compared to R's read.table function, which is 
not known to be particularly fast. Here I am reading Fannie Mae 
mortgage acquisition data, which can be found at 
http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html 
after registering:

D Code:

import std.algorithm;
import std.array;
import std.file;
import std.csv;
import std.stdio;
import std.typecons;
import std.datetime;

alias row_type = Tuple!(string, string, string, string, string, 
string, string, string,
                         string, string, string, string, string, 
string, string, string,
                         string, string, string, string, string, 
string);

void main(){
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
   double time = sw.peek().msecs;
   writeln("Time (s): ", time/1000);
}

Time (s): 13.478

R Code:

system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22)))
    user  system elapsed
   7.810   0.067   7.874


R takes about half as long to read the file. Both read the data 
in the "equivalent" type format. Am I doing something incorrect 
here?
Jan 21 2016
next sibling parent reply Rikki Cattermole <alphaglosined gmail.com> writes:
On 21/01/16 10:39 PM, data pulverizer wrote:
 I have been reading large text files with D's csv file reader and have
 found it slow compared to R's read.table function which is not known to
 be particularly fast. Here I am reading Fannie Mae mortgage acquisition
 data which can be found here
 http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html
 after registering:

 D Code:

 import std.algorithm;
 import std.array;
 import std.file;
 import std.csv;
 import std.stdio;
 import std.typecons;
 import std.datetime;

 alias row_type = Tuple!(string, string, string, string, string, string,
 string, string,
                          string, string, string, string, string, string,
 string, string,
                          string, string, string, string, string, string);

 void main(){
    StopWatch sw;
    sw.start();
    auto buffer = std.file.readText("Acquisition_2009Q2.txt");
    auto records = csvReader!row_type(buffer, '|').array;
    sw.stop();
    double time = sw.peek().msecs;
    writeln("Time (s): ", time/1000);
 }

 Time (s): 13.478

 R Code:

 system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|",
 colClasses = rep("character", 22)))
     user  system elapsed
    7.810   0.067   7.874


 R takes about half as long to read the file. Both read the data in the
 "equivalent" type format. Am I doing something incorrect here?
Okay without registering not gonna get that data.

So usual things to think about, did you turn on release mode?
What about inlining?

Lastly how about disabling the GC?

import core.memory : GC;
GC.disable();

dmd -release -inline code.d
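For reference, a minimal sketch of the original program with those suggestions applied (std.meta.Repeat is used here only to avoid spelling out the 22 string columns; if it is not available in your compiler version, the explicit Tuple from the first post works just as well):

// Build with:  dmd -release -inline file_read.d
import core.memory : GC;
import std.array : array;
import std.csv : csvReader;
import std.datetime : StopWatch;
import std.file : readText;
import std.meta : Repeat;
import std.stdio : writeln;
import std.typecons : Tuple;

alias row_type = Tuple!(Repeat!(22, string));   // 22 string columns, as above

void main()
{
    GC.disable();                    // no collections while parsing

    StopWatch sw;
    sw.start();
    auto buffer = readText("Acquisition_2009Q2.txt");
    auto records = csvReader!row_type(buffer, '|').array;
    sw.stop();

    GC.enable();                     // allow cleanup once parsing is done
    writeln("Time (s): ", sw.peek().msecs / 1000.0);
}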
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole 
wrote:

 Okay without registering not gonna get that data.

 So usual things to think about, did you turn on release mode?
 What about inlining?

 Lastly how about disabling the GC?

 import core.memory : GC;
 GC.disable();

 dmd -release -inline code.d
That helped a lot. I disabled the GC and inlined as you suggested, and the time is now:

Time (s): 8.754

However, R's data.table package gives us:

system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|",
colClasses = rep("character", 22)))
    user  system elapsed
   0.852   0.021   0.872

I should probably have begun with this timing. It's not my intention to turn this into a speed-only competition, but the ingest of files and speed of calculation is very important to me.
Jan 21 2016
next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 10:40:39 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole 
 wrote:

 Okay without registering not gonna get that data.

 So usual things to think about, did you turn on release mode?
 What about inlining?

 Lastly how about disabling the GC?

 import core.memory : GC;
 GC.disable();

 dmd -release -inline code.d
That helped a lot, I disable GC and inlined as you suggested and the time is now: Time (s): 8.754 However, with R's data.table package gives us: system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))) user system elapsed 0.852 0.021 0.872 I should probably have begun with this timing. Its not my intention to turn this into a speed-only competition, however the ingest of files and speed of calculation is very important to me.
I should probably add compiler version info:

~$ dmd --version
DMD64 D Compiler v2.069.2
Copyright (c) 1999-2015 by Digital Mars written by Walter Bright

Running Ubuntu 14.04 LTS
Jan 21 2016
parent reply bachmeier <no spam.com> writes:
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer 
wrote:

 Running Ubuntu 14.04 LTS
In that case, have you looked at http://lancebachmeier.com/rdlang/

If this is a serious bottleneck you can solve it with two lines:

evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))`);
auto x = RMatrix(evalR("x"));

and then you've got access to the data in D.
Jan 21 2016
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 16:25:55 UTC, bachmeier wrote:
 On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer 
 wrote:

 Running Ubuntu 14.04 LTS
In that case, have you looked at http://lancebachmeier.com/rdlang/ If this is a serious bottleneck you can solve it with two lines evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))`); auto x = RMatrix(evalR("x")); and then you've got access to the data in D.
Thanks. That's certainly something to try.
Jan 21 2016
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 01/21/2016 02:40 AM, data pulverizer wrote:

 dmd -release -inline code.d
These two as well please: -O -boundscheck=off
 the ingest of files and
 speed of calculation is very important to me.
We should understand why D is slow in this case. :) Ali
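Putting the flags from this subthread together, the full build command would presumably be:

dmd -O -release -inline -boundscheck=off code.d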
Jan 21 2016
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:
 On 01/21/2016 02:40 AM, data pulverizer wrote:

 dmd -release -inline code.d
These two as well please: -O -boundscheck=off
 the ingest of files and
 speed of calculation is very important to me.
We should understand why D is slow in this case. :) Ali
Thank you, adding those two flags brings the time down a little more:

Time (s): 6.832
Jan 21 2016
prev sibling parent bachmeier <no spam.com> writes:
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:

 We should understand why D is slow in this case. :)

 Ali
fread source is here:

https://github.com/Rdatatable/data.table/blob/master/src/fread.c

Good luck trying to work through that (which explains why I'm using D). I don't know what their magic is, but data.table is many times faster than anything else in R, so I don't think it's trivial.
Jan 21 2016
prev sibling next sibling parent reply Edwin van Leeuwen <edder tkwsping.nl> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? I.e., could you move sw.start() one line down (after the readText call) and see how long just the csvReader part takes?
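For example, the two stages could be timed separately like this (a sketch that reuses the imports and the row_type alias from the first post):

StopWatch sw;

sw.start();
auto buffer = std.file.readText("Acquisition_2009Q2.txt");
sw.stop();
writeln("readText  (s): ", sw.peek().msecs / 1000.0);

sw.reset();
sw.start();
auto records = csvReader!row_type(buffer, '|').array;
sw.stop();
writeln("csvReader (s): ", sw.peek().msecs / 1000.0);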
Jan 21 2016
parent reply Saurabh Das <saurabh.das gmail.com> writes:
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen 
wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Please try this:

auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array;

Can you put up some sample data and share the number of records in the file as well?
Jan 21 2016
parent reply Saurabh Das <saurabh.das gmail.com> writes:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen 
 wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well.
Actually, since you're aiming for speed, this might be better:

sw.start();
auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array;
sw.stop();

Please do verify that the end result is the same - I'm not 100% confident of the cast.

Thanks,
Saurabh
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van 
 Leeuwen wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well.
Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh
Saurabh, I have tried your latest suggestion and the time reduces fractionally to:

Time (s): 6.345

The previous suggestion actually increased the time.

Edwin van Leeuwen: the csvReader is what takes the most time; the readText takes 0.229 s.
Jan 21 2016
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das 
 Actually since you're aiming for speed, this might be better:

 sw.start();
 auto records = 
 File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a =>
cast(dchar)a).csvReader!row_type('|').array
 sw.stop();

 Please do verify that the end result is the same - I'm not 
 100% confident of the cast.

 Thanks,
 Saurabh
Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s
P.S. Saurabh, the result from the cast looks fine. Thanks.
Jan 21 2016
prev sibling next sibling parent reply wobbles <grogan.colin gmail.com> writes:
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das 
 wrote:
 [...]
Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh
Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s
Interesting that reading a file is so slow.

Your timings from R - do they include reading the file as well?
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
 Interesting that reading a file is so slow.

 Your timings from R, is that including reading the file also?
Yes, it's just insane, isn't it?
Jan 21 2016
parent reply Saurabh Das <saurabh.das gmail.com> writes:
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
 Interesting that reading a file is so slow.

 Your timings from R, is that including reading the file also?
Yes, its just insane isn't it?
It is insane. Earlier in the thread we were clearly tackling the wrong problem. Hence the adage, "measure first" :-/.

As suggested by Edwin van Leeuwen, can you give us a timing of:

auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array;

Thanks,
Saurabh
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 17:17:52 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer 
 wrote:
 On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
 Interesting that reading a file is so slow.

 Your timings from R, is that including reading the file also?
Yes, its just insane isn't it?
It is insane. Earlier in the thread we were tackling the wrong problem clearly. Hence the adage, "measure first" :-/. As suggested by Edwin van Leeuwen, can you give us a timing of: auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array; Thanks, Saurabh
Good news and bad news. I was going for something similar to what you have above, and both slash the time a lot:

Time (s): 1.024

But now the output is a little garbled. For some reason the splitter isn't splitting correctly - or we are not applying it properly. Line 0:

["100001703051", "RETAIL", "BANK OF AMERICA, N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", "75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", "", "FRM"]
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 18:31:17 UTC, data pulverizer 
wrote:
 Good news and bad new. I was going for something similar to 
 what you have above and both slash the time alot:

 Time (s): 1.024

 But now the output is a little garbled. For some reason the 
 splitter isn't splitting correctly - or we are not applying it 
 properly. Line 0:

 ["100001703051", "RETAIL", "BANK OF AMERICA, 
 N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", 
 "75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " 
 REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", 
 "", "FRM"]
I should probably include the first few lines of the file:

100000511550|RETAIL|FLAGSTAR CAPITAL MARKETS CORPORATION|5|222000|360|04/2009|06/2009|44|44|2|37|823|NO|NO CASH-OUT REFINANCE|PUD|1|PRINCIPAL|AZ|863||FRM
100001031040|BROKER|SUNTRUST MORTGAGE INC.|4.99|456000|360|03/2009|05/2009|83|83|1|47|744|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|MD|211|12|FRM
100001445182|CORRESPONDENT|CITIMORTGAGE, INC.|4.875|172000|360|05/2009|07/2009|80|80|2|25|797|NO|CASH-OUT REFINANCE|SF|1|PRINCIPAL|TX|758||FRM
100001703051|RETAIL|BANK OF AMERICA, N.A.|4.875|207000|360|03/2009|05/2009|75|75|1|26|806|NO|NO CASH-OUT REFINANCE|PUD|1|INVESTOR|CO|801||FRM
100006033316|CORRESPONDENT|JPMORGAN CHASE BANK, NATIONAL ASSOCIATION|5|170000|360|05/2009|07/2009|80|80|1|23|771|NO|CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|224||FRM

It's interesting that the first array of the output is not the same as the input.
Jan 21 2016
parent reply Justin Whear <justin economicmodeling.com> writes:
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
 
 It's interesting that the output first array is not the same as the
 input
byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
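A small sketch of the two fixes described above (file name taken from earlier in the thread):

import std.algorithm : map;
import std.array : array, split;
import std.stdio : File;

void main()
{
    // Option 1: byLineCopy allocates a fresh string per line, so the slices
    // produced by split remain valid after iteration moves on.
    auto rows1 = File("Acquisition_2009Q2.txt")
                 .byLineCopy
                 .map!(line => line.split("|"))
                 .array;

    // Option 2: keep byLine (which reuses its internal buffer) and copy each
    // line explicitly with idup before the buffer is overwritten.
    auto rows2 = File("Acquisition_2009Q2.txt")
                 .byLine
                 .map!(line => line.idup.split("|"))
                 .array;
}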
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear wrote:
 On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:

 It's interesting that the output first array is not the same 
 as the input
byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
Thanks. It now works with byLineCopy():

Time (s): 1.128
Jan 21 2016
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear 
 wrote:
 On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:

 It's interesting that the output first array is not the same 
 as the input
byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
Thanks. It now works with byLineCopy() Time (s): 1.128
Currently the timing is similar to python pandas:

import pandas as pd
import time

col_types = {'col1': str, 'col2': str, 'col3': str, 'col4': str,
             'col5': str, 'col6': str, 'col7': str, 'col8': str,
             'col9': str, 'col10': str, 'col11': str, 'col12': str,
             'col13': str, 'col14': str, 'col15': str, 'col16': str,
             'col17': str, 'col18': str, 'col19': str, 'col20': str,
             'col21': str, 'col22': str}

begin = time.time()
x = pd.read_csv('Acquisition_2009Q2.txt', sep = '|', dtype = col_types)
end = time.time()
print end - begin

$ python file_read.py
1.19544792175
Jan 21 2016
prev sibling parent Edwin van Leeuwen <edder tkwsping.nl> writes:
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
  Edwin van Leeuwen The csvReader is what takes the most time, 
 the readText takes 0.229 s
The underlying problem most likely is that csvReader has (AFAIK) never been properly optimized/profiled (it is a very old piece of the library).

You could try to implement a rough csvReader using buffer.byLine() and, for each line, use split("|") to split at "|". That should be faster, because it doesn't do any checking.

Untested code:

string[][] res = buffer.byLine().map!((a) => a.split("|").array).array;
Jan 21 2016
prev sibling next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
 R takes about half as long to read the file. Both read the data 
 in the "equivalent" type format. Am I doing something incorrect 
 here?
CsvReader hasn't been compared against and optimized relative to other CSV readers. It does allocate for the parsed string (even if it isn't changed) and it does a number of validation checks.

You may get some improvement by disabling the CSV validation, but again this wasn't tested for performance:

csvReader!(string, Malformed.ignore)(str)

Generally people recommend using GDC/LDC if you need resulting executable performance, but csvReader being slower isn't the most surprising.

Before submitting my library to Phobos I had started a CSV reader that would do no allocations and instead return string slices. This wasn't completed and so it never had performance testing done against it. It could very well be slower.

https://github.com/JesseKPhillips/JPDLibs/blob/csvoptimize/csv/csv.d

My original CSV parser was really slow because I parsed the string twice.
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote:
R takes about half as long to read the file. Both read the data in
the "equivalent" type format. Am I doing something incorrect here?
CsvReader hasn't been compared and optimized from other CSV readers. It does have allocation for the parsed string (even if it isn't changed) and it does a number of validation checks.
[...]

This piqued my interest today, so I decided to take a shot at writing a fast CSV parser. First, I downloaded a sample large CSV file from:

ftp://ftp.census.gov/econ2013/CBP_CSV/cbp13co.zip

This file has over 2 million records, so I thought it would serve as a good dataset to run benchmarks on.

Since the OP wanted the loaded data in an array of records, as opposed to iterating over the records as an input range, I decided that the best way to optimize this use case was to load the entire file into memory and then return an array of slices to this data, instead of wasting time (and memory) copying the data.

Furthermore, since it will be an array of records which are arrays of slices to field values, another optimization is to allocate a large buffer for storing consecutive field slices, and then in the outer array just slice the buffer to represent a record. This greatly cuts down on the number of GC allocations needed.

Once the buffer is full, we don't allocate a larger buffer and copy everything over; this is unnecessary (and wasteful) because the outer array doesn't care where its elements point to. Instead, we allocate a new buffer, leaving previous records pointing to slices of the old buffer, and start appending more field slices in the new buffer, and so on. After all, the records don't have to exist in consecutive slices. There's just a minor overhead in that if we run out of space in the buffer while in the middle of parsing a record, we need to copy the current record's field slices into the new buffer, so that all the fields belonging to this record remain contiguous (so that the outer array can just slice them). This is a very small overhead compared to copying the entire buffer into a new memory block (as would happen if we kept the buffer as a single array that needs to expand), so it ought to be negligible.

So in a nutshell, what we have is an outer array, each element of which is a slice (representing a record) that points to some slice of one of the buffers. Each buffer is a contiguous sequence of slices (representing a field) pointing to some segment of the original data.

Here's the code:

---------------------------------------------------------------------------
/**
 * Experimental fast CSV reader.
 *
 * Based on RFC 4180.
 */
module fastcsv;

/**
 * Reads CSV data from the given filename.
 */
auto csvFromUtf8File(string filename)
{
    import std.file : read;
    return csvFromString(cast(string) read(filename));
}

/**
 * Parses CSV data in a string.
 *
 * Params:
 *  fieldDelim = The field delimiter (default: ',')
 *  data = The data in CSV format.
 */
auto csvFromString(dchar fieldDelim=',', dchar quote='"')(const(char)[] data)
{
    import core.memory;
    import std.array : appender;

    enum fieldBlockSize = 1 << 16;
    auto fields = new const(char)[][fieldBlockSize];
    size_t curField = 0;

    GC.disable();
    auto app = appender!(const(char)[][][]);

    // Scan data
    size_t i;
    while (i < data.length)
    {
        // Parse records
        size_t firstField = curField;
        while (i < data.length && data[i] != '\n' && data[i] != '\r')
        {
            // Parse fields
            size_t firstChar, lastChar;
            if (data[i] == quote)
            {
                i++;
                firstChar = i;
                while (i < data.length && data[i] != fieldDelim &&
                       data[i] != '\n' && data[i] != '\r')
                {
                    i++;
                }
                lastChar = (i < data.length && data[i-1] == quote) ? i-1 : i;
            }
            else
            {
                firstChar = i;
                while (i < data.length && data[i] != fieldDelim &&
                       data[i] != '\n' && data[i] != '\r')
                {
                    i++;
                }
                lastChar = i;
            }
            if (curField >= fields.length)
            {
                // Fields block is full; copy current record fields into new
                // block so that they are contiguous.
                auto nextFields = new const(char)[][fieldBlockSize];
                nextFields[0 .. curField - firstField] =
                    fields[firstField .. curField];
                //fields.length = firstField; // release unused memory?
                curField = curField - firstField;
                firstField = 0;
                fields = nextFields;
            }
            assert(curField < fields.length);
            fields[curField++] = data[firstChar .. lastChar];

            // Skip over field delimiter
            if (i < data.length && data[i] == fieldDelim)
                i++;
        }
        app.put(fields[firstField .. curField]);

        // Skip over record delimiter(s)
        while (i < data.length && (data[i] == '\n' || data[i] == '\r'))
            i++;
    }

    GC.collect();
    GC.enable();
    return app.data;
}

unittest
{
    auto sampleData =
        `123,abc,"mno pqr",0` ~ "\n" ~
        `456,def,"stuv wx",1` ~ "\n" ~
        `78,ghijk,"yx",2`;

    auto parsed = csvFromString(sampleData);
    assert(parsed == [
        [ "123", "abc", "mno pqr", "0" ],
        [ "456", "def", "stuv wx", "1" ],
        [ "78", "ghijk", "yx", "2" ]
    ]);
}

unittest
{
    auto dosData =
        `123,aa,bb,cc` ~ "\r\n" ~
        `456,dd,ee,ff` ~ "\r\n" ~
        `789,gg,hh,ii` ~ "\r\n";

    auto parsed = csvFromString(dosData);
    assert(parsed == [
        [ "123", "aa", "bb", "cc" ],
        [ "456", "dd", "ee", "ff" ],
        [ "789", "gg", "hh", "ii" ]
    ]);
}
---------------------------------------------------------------------------

There are some limitations to this approach: while the current code does try to unwrap quoted values in the CSV, it does not correctly parse escaped double quotes ("") in the fields. This is because to process those values correctly we'd have to copy the field data into a new string and construct its interpreted value, which is slow. So I leave it as an exercise for the reader to implement (it's not hard: when the double double-quote sequence is detected, allocate a new string with the interpreted data instead of slicing the original data. Either that, or just unescape the quotes in the application code itself).

Now, in the first version of this code, I didn't have the GC calls... those were added later when I discovered that the GC was slowing it down to about the same speed (or worse!) as std.csv. A little profiling showed that 80% of the time was spent in the GC mark/collect code. After adding in the code to disable the GC, the performance improved dramatically.

Of course, running without GC collection is not a fair comparison with std.csv, so I added an option to my benchmark program to disable the GC for std.csv as well. While the result was slightly faster, it was still much slower than my fastcsv code. (Though to be fair, std.csv does perform validation checks and so forth that fastcsv doesn't even try to.)

Anyway, here are the performance numbers from one of the benchmark runs (these numbers are pretty typical):

    std.csv (with gc): 2126884 records in 23144 msecs
    std.csv (no gc):   2126884 records in 18109 msecs
    fastcsv (no gc):   2126884 records in 1358 msecs

As you can see, our little array-slicing scheme gives us a huge performance boost over the more generic std.csv range-based code. We managed to cut out over 90% of the total runtime, even when std.csv is run with GC disabled. We even try to be nice in fastcsv by calling GC.collect to clean up after we're done, and this collection time is included in the benchmark.

While this is no fancy range-based code, and one might say it's more hackish and C-like than idiomatic D, the problem is that current D compilers can't quite optimize range-based code to this extent yet. Perhaps in the future optimizers will improve so that more idiomatic, range-based code will have comparable performance with fastcsv. (At least in theory this should be possible.)

Finally, just for the record, here's the benchmark code I used:

---------------------------------------------------------------------------
/**
 * Crude benchmark for fastcsv.
 */
import core.memory;
import std.array;
import std.csv;
import std.file;
import std.datetime;
import std.stdio;
import fastcsv;

int main(string[] argv)
{
    if (argv.length < 2)
    {
        stderr.writeln("Specify std, stdnogc, or fast");
        return 1;
    }

    // Obtained from ftp://ftp.census.gov/econ2013/CBP_CSV/cbp13co.zip
    enum csvFile = "ext/cbp13co.txt";

    string input = cast(string) read(csvFile);
    if (argv[1] == "std")
    {
        auto result = benchmark!({
            auto data = std.csv.csvReader(input).array;
            writefln("std.csv read %d records", data.length);
        })(1);
        writefln("std.csv: %s msecs", result[0].msecs);
    }
    else if (argv[1] == "stdnogc")
    {
        auto result = benchmark!({
            GC.disable();
            auto data = std.csv.csvReader(input).array;
            writefln("std.csv (nogc) read %d records", data.length);
            GC.enable();
        })(1);
        writefln("std.csv: %s msecs", result[0].msecs);
    }
    else if (argv[1] == "fast")
    {
        auto result = benchmark!({
            auto data = fastcsv.csvFromString(input);
            writefln("fastcsv read %d records", data.length);
        })(1);
        writefln("fastcsv: %s msecs", result[0].msecs);
    }
    else
    {
        stderr.writeln("Unknown option: " ~ argv[1]);
        return 1;
    }
    return 0;
}
---------------------------------------------------------------------------

--T
Jan 21 2016
next sibling parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 Of course, running without GC collection is not a fair 
 comparison with std.csv, so I added an option to my benchmark 
 program to disable the GC for std.csv as well.  While the 
 result was slightly faster, it was still much slower than my 
 fastcsv code. (Though to be fair, std.csv does perform 
 validation checks and so forth that fastcsv doesn't even try 
 to.)
As mentioned, validation can be turned off:

auto data = std.csv.csvReader!(string, Malformed.ignore)(input).array;

I forgot to mention that one of the requirements for std.csv was that it worked on the base range type, the input range. Not that slicing wouldn't be a valid addition.

I was also going to do the same thing with my sliced CSV: no fixing of the escaped quote. That would have just been a helper function the user could map over the results.
Jan 21 2016
prev sibling next sibling parent reply Brad Anderson <eco gnuk.net> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 [snip]
 There are some limitations to this approach: while the current 
 code does try to unwrap quoted values in the CSV, it does not 
 correctly parse escaped double quotes ("") in the fields. This 
 is because to process those values correctly we'd have to copy 
 the field data into a new string and construct its interpreted 
 value, which is slow.  So I leave it as an exercise for the 
 reader to implement (it's not hard, when the double 
 double-quote sequence is detected, allocate a new string with 
 the interpreted data instead of slicing the original data. 
 Either that, or just unescape the quotes in the application 
 code itself).
What about wrapping the slices in a range-like interface that would unescape the quotes on demand? You could even set a flag on it during the initial pass to say the field has double quotes that need to be escaped so it doesn't need to take a per-pop performance hit checking for double quotes (that's probably a pretty minor boost, if any, though).
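A rough sketch of that idea, using a plain wrapper with a flag rather than a full range interface (the names here are made up for illustration and are not part of the fastcsv code):

import std.array : replace;

/// Wraps a raw field slice; unescaping is deferred until the value is needed.
struct CsvField
{
    const(char)[] raw;       // slice into the original CSV buffer
    bool hasDoubledQuotes;   // set by the parser during its initial scan

    /// Returns the interpreted value, allocating only when "" actually occurs.
    const(char)[] value() const
    {
        return hasDoubledQuotes ? raw.replace(`""`, `"`) : raw;
    }
}

unittest
{
    auto plain = CsvField("gentle", false);
    assert(plain.value == "gentle");

    auto quoted = CsvField(`He said ""oh my gosh""`, true);
    assert(quoted.value == `He said "oh my gosh"`);
}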
Jan 21 2016
parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 21 January 2016 at 22:13:38 UTC, Brad Anderson wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 [...]
What about wrapping the slices in a range-like interface that would unescape the quotes on demand? You could even set a flag on it during the initial pass to say the field has double quotes that need to be escaped so it doesn't need to take a per-pop performance hit checking for double quotes (that's probably a pretty minor boost, if any, though).
Oh, you discussed range-based later. I should have finished reading before replying.
Jan 21 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 [...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters, but true CSV is one hell of a beast.

Of course most data look like:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,oh my gosh

but you can have delimiters inside a field:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,"14,2",gentle
3,Pinkie Pie,169,oh my gosh

or quotes in a quoted field, in which case you have to double the quotes:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,"He said ""oh my gosh"""

but in that case external quotes aren't required:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,He said ""oh my gosh""

but at least it's always one record per line, no? No? No.

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,"He said ""oh my gosh""
And she replied ""Come on! Have fun!"""

I'll stop there, but you get the picture. Simply splitting by line then separator may work well on most data, but I wouldn't put it in production or in the standard library.

Note that I think you did a great job optimizing your code, and I respect that; it's just a friendly reminder.
Jan 21 2016
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast.
[...] As I stated, I didn't fully implement the parsing of quoted fields. (Or, for that matter, the correct parsing of crazy wrapped values like you pointed out.) This is not finished code; it's more of a proof of concept. T -- Lottery: tax on the stupid. -- Slashdotter
Jan 21 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast.
Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv [...]
 but you can have delimiters inside a field:
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,"14,2",gentle
     3,Pinkie Pie,169,oh my gosh
Fixed.
 or quotes in a quoted field, in that case you have to double the quotes:
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,"He said ""oh my gosh"""
Fixed. Well, except the fact that I don't actually interpret the doubled quotes, but leave it up to the caller to filter them out at the application level.
 but in that case external quotes aren't required:
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,He said ""oh my gosh""
Actually, this has already worked before. (Excepting the untranslated doubled quotes, of course.)
 but at least it's always one record per line, no? No? No.
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,"He said
     ""oh my gosh""
     And she replied
     ""Come on! Have fun!"""
Fixed.
 I'll stop there, but you get the picture. Simply splitting by line
 then separator may work well on most data, but I wouldn't put it in
 production or in the standard library.
Actually, my code does *not* split by line then by separator. Did you read it? ;-) T -- The most powerful one-line C program: #include "/dev/tty" -- IOCCC
Jan 21 2016
parent reply cym13 <cpicard openmailbox.org> writes:
On Friday, 22 January 2016 at 00:26:16 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via 
 Digitalmars-d-learn wrote:
 [...]
Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv [...]
     [...]
Fixed.
     [...]
Fixed. Well, except the fact that I don't actually interpret the doubled quotes, but leave it up to the caller to filter them out at the application level.
     [...]
Actually, this has already worked before. (Excepting the untranslated doubled quotes, of course.)
     [...]
Fixed.
 [...]
Actually, my code does *not* split by line then by separator. Did you read it? ;-) T
Great! Sorry for the separator thing, I didn't read your code carefully. You still lack some things like comments and surely more things that I don't know about, but it's getting there. I didn't think you'd go through the trouble of fixing those things, to be honest; I'm impressed.
Jan 21 2016
next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
 Great! Sorry for the separator thing, I didn't read your code 
 carefully. You still lack some things like comments and surely 
 more things that I don't know about but it's getting there. I 
 didn't think you'd go through the trouble of fixing those 
 things to be honnest, I'm impressed.
CSV doesn't have comments, sorry.
Jan 21 2016
parent cym13 <cpicard openmailbox.org> writes:
On Friday, 22 January 2016 at 01:14:48 UTC, Jesse Phillips wrote:
 On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
 Great! Sorry for the separator thing, I didn't read your code 
 carefully. You still lack some things like comments and surely 
 more things that I don't know about but it's getting there. I 
 didn't think you'd go through the trouble of fixing those 
 things to be honnest, I'm impressed.
CSV doesn't have comments, sorry.
(outside of "" of course) and wrongly assumed it was a standard thing, I stand corrected.
Jan 21 2016
prev sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 12:56:02AM +0000, cym13 via Digitalmars-d-learn wrote:
[...]
 Great! Sorry for the separator thing, I didn't read your code
 carefully. You still lack some things like comments and surely more
 things that I don't know about but it's getting there.
Comments? You mean in the code? 'cos the CSV grammar described in RFC-4180 doesn't seem to have the possibility of comments in the CSV itself...
 I didn't think you'd go through the trouble of fixing those things to
 be honnest, I'm impressed.
They weren't that hard to fix, because the original code already had a separate path for quoted values, so it was just a matter of deleting some of the loop conditions to make the quoted path accept delimiters and newlines. In fact, the original code already accepted doubled quotes in the unquoted field path. It was only the interpretation of doubled quotes that required modifications to both inner loops.

Now having said that, though, I think there are some bugs in the code that might cause an array overrun... and the fix might slow things down yet a bit more. There are also some fundamental limitations:

1) The CSV data has to be loadable into memory in its entirety. This may not be possible for very large files, or on machines with low memory.

2) There is no range-based interface. I *think* this should be possible to add, but it will probably increase the overhead and make the code slower.

3) There is no validation of the input whatsoever. If you feed it malformed CSV, it will give you nonsensical output. Well, it may crash, but hopefully won't anymore after I fix those missing bounds checks... but it will still give you nonsensical output.

4) The accepted syntax is actually a little larger than strict CSV (in the sense of RFC-4180); Unicode input is accepted but RFC-4180 does not allow Unicode. This may actually be a plus, though, because I'm expecting that modern CSV may actually contain Unicode data, not just the ASCII range defined in RFC-4180.

T

--
The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
Jan 21 2016
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
 On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast.
Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv
Oh, forgot to mention, the parsing times are still lightning fast after the fixes I mentioned: still around 1190 msecs or so. Now I'm tempted to actually implement doubled-quote interpretation... as long as the input file doesn't contain unreasonable amounts of doubled quotes, I'm expecting the speed should remain pretty fast. --T
Jan 21 2016
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 04:31:03PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
 On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
[...]
 	https://github.com/quickfur/fastcsv
Oh, forgot to mention, the parsing times are still lightning fast after the fixes I mentioned: still around 1190 msecs or so. Now I'm tempted to actually implement doubled-quote interpretation... as long as the input file doesn't contain unreasonable amounts of doubled quotes, I'm expecting the speed should remain pretty fast.
[...] Done, commits pushed to github. The new code now parses doubled quotes correctly. The performance is slightly worse now, around 1300 msecs on average, even in files that don't have any doubled quotes (it's a penalty incurred by the inner loop needing to detect doubled quote sequences). My benchmark input file doesn't have any doubled quotes, however (code correctness with doubled quotes is gauged by unittests only); so the performance numbers may not accurately reflect true performance in the general case. (But if doubled quotes are rare, as I'm expecting, the actual performance shouldn't change too much in general usage...) Maybe somebody who has a file with lots of ""'s can run the benchmark to see how badly it performs? :-P T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Jan 21 2016
prev sibling next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
 but in that case external quotes aren't required:

     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,He said ""oh my gosh""
std.csv will reject this. If validation is turned off this is fine, but your data will include "".

"A field containing new lines, commas, or double quotes should be enclosed in double quotes (customizable)"

This is because it is not possible to decide what the correct parsing should be. Is the data meant to include two double quotes? What if there was only one quote there; do I have to remember it was there and decide not to throw it out because I didn't see another quote? At this point the data is not following CSV rules, so if I'm validating I'm throwing it out, and if I'm not validating I'm not stripping data.
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 01:13:07AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
but in that case external quotes aren't required:

    number,name,price,comment
    1,Twilight,150,good friend
    2,Fluttershy,142,gentle
    3,Pinkie Pie,169,He said ""oh my gosh""
std.csv will reject this. If validation is turned off this is fine but your data will include "". "A field containing new lines, commas, or double quotes should be enclosed in double quotes (customizable)" This because it is not possible to decide what correct parsing should be. Is the data using including two double quotes? What if there was only one quote there, do I have to remember it was their and decide not to throw it out because I didn't see another quote? At this point the data is not following CSV rules so if I'm validating I'm throwing it out and if I'm not validating I'm not stripping data.
This case is still manageable, because there are no embedded commas. Everything between the last comma and the next comma or newline unambiguously belongs to the current field. As to how to interpret it (should the result contain single or doubled quotes?), though, that could potentially be problematic. And now that you mention this, RFC-4180 does not allow doubled quotes in an unquoted field. I'll take that out of the code (it improves performance :-D). T -- First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Jan 21 2016
parent reply cym13 <cpicard openmailbox.org> writes:
On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
 And now that you mention this, RFC-4180 does not allow doubled 
 quotes in an unquoted field. I'll take that out of the code (it 
 improves performance :-D).
Right, re-reading the RFC would have been a great thing. That said I saw that kind of CSV in the real world, so I don't know what to think of it. I'm not saying it should be supported, but I wonder if there are points outside RFC-4180 that are taken for granted.
Jan 21 2016
parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 22 January 2016 at 01:36:40 UTC, cym13 wrote:
 On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
 And now that you mention this, RFC-4180 does not allow doubled 
 quotes in an unquoted field. I'll take that out of the code 
 (it improves performance :-D).
Right, re-reading the RFC would have been a great thing. That said I saw that kind of CSV in the real world, so I don't know what to think of it. I'm not saying it should be supported, but I wonder if there are points outside RFC-4180 that are taken for granted.
You have to understand that CSV didn't come from a standard. People started using it because it was simple for writing out some tabular data. Then they changed it because their data changed. It's not like their language came with a CSV parser; it was always hand written, and people still do it today. And that is why data is delimited with so many things other than commas (people thought they wouldn't need to escape their data).

So yes, some CSV parsers will accept comments, or assume that two double quotes in unquoted data are just a quote, but then it breaks for those who have that kind of data which isn't escaped. There are also many other issues with CSV data, like whether the file is in ASCII or UTF or some other code page. And many times CSV isn't well formed because the data was output without proper escaping.

std.csv isn't the end-all of CSV parsers, but it will at least handle well-formed CSV that uses different separators or quotes.
Jan 22 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
 [...]
 	https://github.com/quickfur/fastcsv
Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4180). Performance is back in the ~1200 msec range.

T

--
There is no gravity. The earth sucks.
Jan 21 2016
next sibling parent Edwin van Leeuwen <edder tkwsping.nl> writes:
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via 
 Digitalmars-d-learn wrote:
 [...]
 	https://github.com/quickfur/fastcsv
[...] Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4810). Performance is back in the ~1200 msec range. T
That's pretty impressive. Maybe turn it into a dub package so that data pulverizer could easily test it on his data :)
Jan 22 2016
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via 
 Digitalmars-d-learn wrote:
 [...]
 	https://github.com/quickfur/fastcsv
[...] Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4810). Performance is back in the ~1200 msec range. T
Hi H. S. Teoh, I have used your fastcsv on my file:

import std.file;
import fastcsv;
import std.stdio;
import std.datetime;

void main(){
    StopWatch sw;
    sw.start();
    auto input = cast(string) read("Acquisition_2009Q2.txt");
    auto mydata = fastcsv.csvToArray!('|')(input);
    sw.stop();
    double time = sw.peek().msecs;
    writeln("Time (s): ", time/1000);
}

$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.
Jan 22 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 January 2016 at 21:41:46 UTC, data pulverizer wrote:
 On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
 [...]
Hi H. S. Teoh, I have used you fastcsv on my file: import std.file; import fastcsv; import std.stdio; import std.datetime; void main(){ StopWatch sw; sw.start(); auto input = cast(string) read("Acquisition_2009Q2.txt"); auto mydata = fastcsv.csvToArray!('|')(input); sw.stop(); double time = sw.peek().msecs; writeln("Time (s): ", time/1000); } $ dmd file_read_5.d fastcsv.d $ ./file_read_5 Time (s): 0.679 Fastest so far, very nice.
I guess the next step is allowing Tuple rows with mixed types.
Jan 22 2016
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
[...]
$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.
Thanks!
 I guess the next step is allowing Tuple rows with mixed types.
I thought about that a little today. I'm guessing that most of the performance will be dependent on the conversion into the target types. Right now it's extremely fast because, for the most part, it's just taking slices of an existing string.

It shouldn't be too hard to extend the current code so that instead of assembling the string slices in a block buffer, it will run them through std.conv.to instead and store them in an array of some given struct. But there may be performance degradation because now we have to do non-trivial operations on the string slices. Converting from const(char)[] to string probably should be avoided where not necessary, since otherwise it will involve lots and lots of small allocations and the GC will become very slow. Converting to ints may not be too bad... but conversion to types like floating point may be quite slow.

Now, assembling the resulting structs into an array could potentially be slow... but perhaps an analogous block buffer technique can be used to create the array piecemeal in separate blocks, and only perform the final assembly into a single array at the very end (thus avoiding reallocating and copying the growing array as we go along).

But we'll see. Performance predictions are rarely accurate; only a profiler will tell the truth about where the real bottlenecks are. :-)

T

--
LINUX = Lousy Interface for Nefarious Unix Xenophobes.
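As a rough illustration of that conversion step (the record type and field choices below are hypothetical, not the actual fastcsv implementation):

import std.conv : to;

// Hypothetical target record covering a few of the acquisition-file columns.
struct Loan
{
    long id;
    const(char)[] channel;   // kept as a slice to avoid a string copy
    double rate;
    int amount;
}

/// Turn one row of field slices into a typed record via std.conv.to.
Loan toRecord(const(char)[][] fields)
{
    Loan r;
    r.id      = fields[0].to!long;
    r.channel = fields[1];          // no conversion, just keep the slice
    r.rate    = fields[2].to!double;
    r.amount  = fields[3].to!int;
    return r;
}

unittest
{
    auto rec = toRecord(["100000511550", "RETAIL", "5", "222000"]);
    assert(rec.id == 100000511550 && rec.rate == 5.0 && rec.amount == 222000);
}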
Jan 22 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
[...]
 I guess the next step is allowing Tuple rows with mixed types.
Alright. I threw together a new CSV parsing function that loads CSV data into an array of structs. Currently, the implementation is not quite polished yet (it blindly assumes the first row is a header row, which it discards), but it does work, and outperforms std.csv by about an order of magnitude.

The initial implementation was very slow (albeit still somewhat faster than std.csv, by about 10% or so) when given a struct with string fields. However, structs with POD fields are lightning fast (not significantly different from before, in spite of all the calls to std.conv.to!). This suggested that the slowdown was caused by excessive allocations of small strings, causing a heavy GC load.

This suspicion was confirmed when I ran the same input data with a struct where all string fields were replaced with const(char)[] (so that std.conv.to simply returned slices to the data) -- the performance shot back up to about 1700 msecs, a little slower than the original version of reading into an array of array of const(char)[] slices, but about 58 times(!) the performance of std.csv.

So I tried a simple optimization: instead of allocating a string per field, allocate 64KB string buffers and copy string field values into it, then take slices from the buffer to assign to the struct's string fields. With this optimization, running times came down to about the 1900 msec range, which is only marginally slower than the const(char)[] case, about 51 times faster than std.csv.

Here are the actual benchmark values:

1) std.csv:                                    2126883 records, 102136 msecs
2) fastcsv (struct with string fields):        2126883 records, 1978 msecs
3) fastcsv (struct with const(char)[] fields): 2126883 records, 1743 msecs

The latest code is available on github:

    https://github.com/quickfur/fastcsv

The benchmark driver now has 3 new targets:

    stdstruct   - std.csv parsing of CSV into structs
    faststruct  - fastcsv parsing of CSV into struct (string fields)
    faststruct2 - fastcsv parsing of CSV into struct (const(char)[] fields)

Note that the structs are hard-coded into the code, so they will only work with the census.gov test file.

Things still left to do:

- Fix header parsing to have a consistent interface with std.csv, or at least allow the user to configure whether or not the first row should be discarded.
- Support transcription to Tuples?
- Refactor the code to have less copy-pasta.
- Ummm... make it ready for integration with std.csv maybe? ;-)

T

--
Fact is stranger than fiction.
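A minimal sketch of that string-buffer idea, just to make the shape of the technique concrete (this is not the actual fastcsv code; the name StringPool is invented here):

/// Copies small strings into large shared blocks so that each field does not
/// become its own GC allocation; the returned slices point into those blocks.
struct StringPool
{
    enum blockSize = 64 * 1024;
    private char[] block;
    private size_t used;

    string intern(const(char)[] s)
    {
        if (s.length > block.length - used)
        {
            // Start a new block; slices handed out earlier stay valid because
            // they still point into the old block.
            block = new char[](blockSize > s.length ? blockSize : s.length);
            used = 0;
        }
        auto dest = block[used .. used + s.length];
        dest[] = s[];
        used += s.length;
        return cast(string) dest;   // nothing else writes to this slice again
    }
}

unittest
{
    StringPool pool;
    assert(pool.intern("RETAIL") == "RETAIL");
    assert(pool.intern("BROKER") == "BROKER");
}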
Jan 23 2016
parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 24 January 2016 at 01:57:11 UTC, H. S. Teoh wrote:
 - Ummm... make it ready for integration with std.csv maybe? ;-)


 T
My suggestion is to take the unittests used in std.csv and try to get your code working with them. As fastcsv's limitations would prevent replacing the std.csv implementation, the API may not need to match, but keeping close to the same would be best.
Jan 23 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
[...]
 My suggestion is to take the unittests used in std.csv and try to get
 your code working with them. As fastcsv limitations would prevent
 replacing the std.csv implementation the API may not need to match,
 but keeping close to the same would be best.
My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible.

It may be possible to lift some of fastcsv's limitations, now that a few performance bottlenecks have been identified (validation and the excessive number of small allocations being the main ones). The code could be generalized a bit more while preserving the optimizations in these key areas.

T

--
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL
Jan 25 2016
next sibling parent bachmeier <no spam.net> writes:
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
 My thought is to integrate the fastcsv code into std.csv, such 
 that the current std.csv code will serve as fallback in the 
 cases where fastcsv's limitations would prevent it from being 
 used, with fastcsv being chosen where possible.
Wouldn't it be simpler to add a new function? Otherwise you'll end up with very different performance for almost the same data.
Jan 26 2016
prev sibling parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
 On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via 
 Digitalmars-d-learn wrote: [...]
 My suggestion is to take the unittests used in std.csv and try 
 to get your code working with them. As fastcsv limitations 
 would prevent replacing the std.csv implementation the API may 
 not need to match, but keeping close to the same would be best.
My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible.
That is why I suggested starting with the unittests. I don't expect the implementations to share much code; std.csv is written to only use front, popFront, and empty. Most of the work is done in csvNextToken, so it might be able to take advantage of random-access ranges for more performance.

I just think the unittests will help to define where switching algorithms will be required, since they exercise a good portion of the API.
Jan 26 2016
prev sibling next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
 This piqued my interest today, so I decided to take a shot at 
 writing a fast CSV parser.  First, I downloaded a sample large 
 CSV file from: [...]
Hi H. S. Teoh,

I tried to compile your code (fastcsv.d) on my machine but I get crt1.o errors, for example:

.../crt1.o(.debug_info): relocation 0 has invalid symbol index 0

Are there flags that I should be compiling with, or some other thing that I am missing?
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via This piqued
my interest today, so I decided to take a shot at writing a fast CSV
parser.  First, I downloaded a sample large CSV file from: [...]
Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine but I get ctr1.o errors for example: .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 are there flags that I should be compiling with or some other thing that I am missing?
Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T -- Give a man a fish, and he eats once. Teach a man to fish, and he will sit forever.
Jan 21 2016
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via 
 Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
This piqued my interest today, so I decided to take a shot at 
writing a fast CSV parser.  First, I downloaded a sample 
large CSV file from: [...]
Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine but I get ctr1.o errors for example: .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 are there flags that I should be compiling with or some other thing that I am missing?
Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T
Thanks, I got used to getting away with running the "script" file in the same folder as a single file module - it usually works but occasionally (like now) I have to compile both together as you suggested.
Jan 21 2016
prev sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
 are there flags that I should be compiling with or some other 
 thing that I am missing?
Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T
Great benchmarks! This is something else for me to learn from.
Jan 21 2016
prev sibling parent reply Gerald Jansen <gjansen ownmail.net> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say it's 
 more hackish and C-like than idiomatic D, the problem is that 
 current D compilers can't quite optimize range-based code to 
 this extent yet. Perhaps in the future optimizers will improve 
 so that more idiomiatic, range-based code will have comparable 
 performance with fastcsv. (At least in theory this should be 
 possible.)
As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
Jan 26 2016
parent reply Chris Wright <dhasenan gmail.com> writes:
On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:

 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say it's more
 hackish and C-like than idiomatic D, the problem is that current D
 compilers can't quite optimize range-based code to this extent yet.
 Perhaps in the future optimizers will improve so that more idiomiatic,
 range-based code will have comparable performance with fastcsv. (At
 least in theory this should be possible.)
As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
You want to reduce allocations. Ranges often let you do that. However, it's sometimes unsafe to reuse range values that aren't immutable. That means, if you want to keep the values around, you need to copy them -- which introduces an allocation. You can get away with fewer, larger allocations by reading the whole file at once manually and using slices into that large allocation.
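A minimal sketch of the two approaches being contrasted ("data.txt" is just a placeholder file name):

import std.array : array;
import std.file : readText;
import std.stdio : File;
import std.string : lineSplitter;

void main()
{
    // byLine reuses its internal buffer, so keeping a line around
    // requires copying it -- one small allocation per line kept.
    string[] copied;
    foreach (line; File("data.txt").byLine)
        copied ~= line.idup;

    // Alternative: one large allocation up front, then slices into it;
    // no per-line copies are needed.
    auto text = readText("data.txt");
    auto lines = text.lineSplitter.array;   // each element is a slice of `text`
}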
Jan 26 2016
next sibling parent Gerald Jansen <gjansen ownmail.net> writes:
On Tuesday, 26 January 2016 at 20:54:34 UTC, Chris Wright wrote:
 On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say 
 it's more hackish and C-like than idiomatic D, the problem is 
 that current D compilers can't quite optimize range-based 
 code to this extent yet. Perhaps in the future optimizers 
 will improve so that more idiomiatic, range-based code will 
 have comparable performance with fastcsv.
... data crunching ... I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
You can get fewer large allocations by reading the whole file at once manually and using slices into that large allocation.
Sure, that part is clear. Presumably the quoted comment referred to more than just that technique.
Jan 26 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Jan 26, 2016 at 08:54:34PM +0000, Chris Wright via Digitalmars-d-learn
wrote:
 On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
 
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say it's
 more hackish and C-like than idiomatic D, the problem is that
 current D compilers can't quite optimize range-based code to this
 extent yet.  Perhaps in the future optimizers will improve so that
 more idiomiatic, range-based code will have comparable performance
 with fastcsv. (At least in theory this should be possible.)
As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
You want to reduce allocations. Ranges often let you do that. However, it's sometimes unsafe to reuse range values that aren't immutable. That means, if you want to keep the values around, you need to copy them -- which introduces an allocation. You can get fewer large allocations by reading the whole file at once manually and using slices into that large allocation.
Yeah, in the course of this exercise, I found that the one thing that has had the biggest impact on performance is the amount of allocations involved. Basically, I noted that the less allocations are made, the more efficient the code. I'm not sure exactly why this is so, but it's probably something to do with the fact that tracing GCs work better with fewer allocations of larger objects than many allocations of small objects. I have also noted in the past that D's current GC runs collections a little too often; in past projects I've obtained significant speedup (in one case, up to 40% reduction of total runtime) by suppressing automatic collections and scheduling them manually at a lower frequency. In short, I've found that reducing GC load plays a much bigger role in performance than the range vs. loops issue.

The reason I chose to write manual loops at first is to eliminate all possibility of unexpected overhead that might hide behind range primitives, as well as compiler limitations, as current optimizers aren't exactly tuned for range-based idioms, and may fail to recognize certain range-based idioms that would lead to much more efficient code. However, in my second iteration when I made the fastcsv parser return an input range instead of an array, I found only negligible performance differences. This suggests that perhaps range-based code may not perform that badly after all. I have yet to test this hypothesis, as the inner loop that parses fields in a single row is still a manual loop; but my suspicion is that it wouldn't do too badly in range-based form either.

What might make a big difference, though, is the part where slicing is used, since that is essential for reducing the number of allocations. The current iteration of struct-based parsing code, for instance, went through an initial version that was excruciatingly slow for structs with string fields. Why? Because the function takes const(char)[] as input, and you can't legally get strings out of that unless you make a copy of that data (since const means you cannot modify it, but somebody else still might). So std.conv.to would allocate a new string and copy the contents over, every time a string field was parsed, resulting in a large number of small allocations.

To solve this, I decided to use a string buffer: instead of one allocation per string, pre-allocate a large-ish char[] buffer, and every time a string field was parsed, append the data into the buffer. If the buffer becomes full, allocate a new one. Take a slice of the buffer corresponding to that field and cast it to string (this is safe since the algorithm was constructed never to write over previous parts of the buffer). This seemingly trivial optimization won me a performance improvement of an order of magnitude(!). This is particularly enlightening, since it suggests that even the overhead of copying all the string fields out of the original data into a new buffer does not add up to that much.

The new struct-based parser also returns an input range rather than an array; I found that constructing the array directly vs. copying from an input range didn't really make that big of a difference either. What did make a huge difference is reducing the number of allocations.

So the moral of the story is: avoid large numbers of small allocations. If you have to do it, consider consolidating your allocations into a series of allocations of large(ish) buffers instead, and taking slices of the buffers.
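A minimal sketch of the "suppress automatic collections, run them at a lower frequency" idea mentioned above (the once-per-million-records interval is arbitrary, chosen only for illustration):

import core.memory : GC;

void processAll(const(char)[][] records)
{
    GC.disable();               // suppress automatic collection cycles
    scope(exit) GC.enable();    // restore normal behaviour when done

    foreach (i, rec; records)
    {
        // ... parsing/conversion work that allocates ...

        if (i > 0 && i % 1_000_000 == 0)
            GC.collect();       // collect manually, much less often
    }
}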
(And on a tangential note, this backs up Walter's claim that string manipulation in C/C++ ultimately will lose, because of strcpy() and strlen(). Think of how many times in C/C++ code you have to copy string data just because you can't guarantee the incoming string will still be around after you return, and how many times you have to iterate over strings just because arrays are pointers and thus have no length. You couldn't write the equivalent of fastcsv in C/C++, because you'll leak memory and/or get dangling pointers, since you don't know what will happen to the incoming data after you return, so you can't just take slices of it. You'd be forced to malloc() all your strings, and then somehow ensure the caller will clean up properly. Ultimately you'd need a convoluted, unnatural API just to make sure the memory housekeeping is taken care of. Whereas in D, even though the GC is so atrociously slow, it *does* let you freely slice things to your heart's content with zero API complication, no memory leaks, and when done right, can even rival C/C++ performance, and that at a fraction of the mental load required to write leak-free, pointer-bug-free C/C++ code.)

T

--
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
Jan 26 2016
next sibling parent Gerald Jansen <gjansen ownmail.net> writes:
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
 ...
 So the moral of the story is: avoid large numbers of small 
 allocations. If you have to do it, consider consolidating your 
 allocations into a series of allocations of large(ish) buffers 
 instead, and taking slices of the buffers.
Many thanks for the detailed explanation.
Jan 27 2016
prev sibling next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
 Yeah, in the course of this exercise, I found that the one 
 thing that has had the biggest impact on performance is the 
 amount of allocations involved.  [...snip]
Really interesting discussion.
Jan 27 2016
prev sibling parent Laeeth Isharc <laeeth-nospam nospamlaeeth.com> writes:
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
 So the moral of the story is: avoid large numbers of small 
 allocations. If you have to do it, consider consolidating your 
 allocations into a series of allocations of large(ish) buffers 
 instead, and taking slices of the buffers.
Thanks for sharing this, HS Teoh.

I tried replacing allocations with a Region from std.experimental.allocator (with FreeList and Quantizer on top), and then just deallocating everything in one go once I am done with the data. It seems to be a little faster, but I haven't had time to measure it.

I just came across this C++ project, which seems to have astonishing performance: 7 minutes for reading a terabyte, and 2.5 to 4.5 GB/sec for reading a file cold. That's pretty impressive. (Obviously they read in parallel, but I haven't yet read the source to see what the other tricks might be.) It would be nice to be able to match that in D, though practically speaking it's probably easiest just to wrap it:

http://www.wise.io/tech/paratext
https://github.com/wiseio/paratext
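A minimal sketch of the Region idea, stripped down to a bare Region without the FreeList/Quantizer layers (the region size and buffer sizes are arbitrary, for illustration only):

import std.experimental.allocator : makeArray;
import std.experimental.allocator.building_blocks.region : Region;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // One big region: each allocation is just a pointer bump, and
    // everything is released at once when parsing is done.
    auto region = Region!Mallocator(64 * 1024 * 1024);

    auto scratch = makeArray!char(region, 4096);  // e.g. a row buffer
    // ... allocate parse buffers from `region` while reading the data ...

    region.deallocateAll();     // free the lot in one go
}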
Oct 29 2016
prev sibling next sibling parent reply Gerald Jansen <gjansen ownmail.net> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
 I have been reading large text files with D's csv file reader 
 and have found it slow compared to R's read.table function
This great blog post has an optimized FastReader for CSV files: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Jan 21 2016
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 20:46:15 UTC, Gerald Jansen wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
 I have been reading large text files with D's csv file reader 
 and have found it slow compared to R's read.table function
This great blog post has an optimized FastReader for CSV files: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Thanks a lot Gerald, the blog and the discussions were very useful and revealing. For me it shows that you can use the D language to write fast code and then, if you need to, wring out more performance by going as low-level as you want, all without leaving the D language or its tooling ecosystem.
Jan 21 2016
prev sibling parent reply Jon D <jond noreply.com> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
 I have been reading large text files with D's csv file reader 
 and have found it slow compared to R's read.table function 
 which is not known to be particularly fast.
FWIW - I've been implementing a few programs manipulating delimited files, e.g. tab-delimited. Simpler than CSV files because there is no escaping inside the data. I've been trying to do this in relatively straightforward ways, e.g. using byLine rather than byChunk. (Goal is to explore the power of D standard libraries).

I've gotten significant speed-ups in a couple different ways:
* DMD libraries 2.068+ - byLine is dramatically faster
* LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the DMD compiler
* Avoid utf-8 to dchar conversion - This conversion often occurs silently when working with ranges, but is generally not needed when manipulating data.
* Avoid unnecessary string copies. e.g. Don't gratuitously convert char[] to string.

At this point performance of the utilities I've been writing is quite good. They don't have direct equivalents with other tools (such as gnu core utils), so a head-to-head is not appropriate, but generally it seems the tools are quite competitive without needing to do my own buffer or memory management. And, they are dramatically faster than the same tools written in perl (which I was happy with).

--Jon
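As a small illustration of the utf-8 to dchar point above: operating on the raw code units (e.g. via std.string.representation) sidesteps the silent decoding that happens when a string is iterated as a range of dchar. A minimal sketch:

import std.algorithm : count, splitter;
import std.string : representation;

void main()
{
    string line = "a\tb\tc";

    // Splitting the raw ubyte[] avoids decoding each char into a dchar.
    auto fields = line.representation.splitter(cast(ubyte) '\t');
    assert(fields.count == 3);
}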
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote:
[...]
 FWIW - I've been implementing a few programs manipulating delimited
 files, e.g. tab-delimited. Simpler than CSV files because there is no
 escaping inside the data. I've been trying to do this in relatively
 straightforward ways, e.g. using byLine rather than byChunk. (Goal is
 to explore the power of D standard libraries).
 
 I've gotten significant speed-ups in a couple different ways:
 * DMD libraries 2.068+  -  byLine is dramatically faster
 * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the DMD compiler
While byLine has improved a lot, it's still not the fastest thing in the world, because it still performs (at least) one OS roundtrip per line, not to mention it will auto-reencode to UTF-8. If your data is already in a known encoding, reading in the entire file and casting to (|w|d)string then splitting it by line will be a lot faster, since you can eliminate a lot of I/O roundtrips that way.

In any case, it's well-known that gdc/ldc generally produce code that's about 20%-30% faster than dmd-compiled code, sometimes a lot more. While DMD has gotten some improvements in this area recently, it still has a long way to go before it can catch up. For performance-sensitive code I always reach for gdc instead of dmd.
 * Avoid utf-8 to dchar conversion - This conversion often occurs
 silently when working with ranges, but is generally not needed when
 manipulating data.
[...] Yet another nail in the coffin of auto-decoding. I wonder how many more nails we will need before Andrei is convinced... T -- The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
Jan 21 2016
parent Jon D <jond noreply.com> writes:
On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via 
 Digitalmars-d-learn wrote: [...]
 FWIW - I've been implementing a few programs manipulating 
 delimited files, e.g. tab-delimited. Simpler than CSV files 
 because there is no escaping inside the data. I've been trying 
 to do this in relatively straightforward ways, e.g. using 
 byLine rather than byChunk. (Goal is to explore the power of D 
 standard libraries).
 
 I've gotten significant speed-ups in a couple different ways:
 * DMD libraries 2.068+  -  byLine is dramatically faster
 * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the 
 DMD compiler
While byLine has improved a lot, it's still not the fastest thing in the world, because it still performs (at least) one OS roundtrip per line, not to mention it will auto-reencode to UTF-8. If your data is already in a known encoding, reading in the entire file and casting to (|w|d)string then splitting it by line will be a lot faster, since you can eliminate a lot of I/O roundtrips that way.
No disagreement, but I had other goals. At a high level, I'm trying to learn and evaluate D, which partly involves understanding the strengths and weaknesses of the standard library. From this perspective, byLine was a logical starting point.

More specifically, the tools I'm writing are often used in unix pipelines, so input can be a mixture of standard input and files. And, the files can be arbitrarily large. In these cases, reading the entire file is not always appropriate. Buffering usually is, and my code knows when it is dealing with files vs standard input and could handle these differently. However, standard library code could handle these distinctions as well, which was part of the reason for trying the straightforward approach.

Aside - Despite the 'learning D' motivation, the tools are real tools, and writing them in D has been a clear win, especially with the byLine performance improvements in 2.068.
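A minimal sketch of the stdin-vs-file handling described above, using the common pipeline convention that "-" means standard input (openInput is a hypothetical helper, not from the tools being discussed):

import std.stdio : File, stdin;

File openInput(string path)
{
    return (path == "-") ? stdin : File(path, "r");
}

void main(string[] args)
{
    auto input = openInput(args.length > 1 ? args[1] : "-");
    foreach (line; input.byLine)
    {
        // ... process one line at a time; works for files and pipes alike ...
    }
}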
Jan 21 2016