
digitalmars.D.learn - Speed of csvReader

reply data pulverizer <data.pulverizer gmail.com> writes:
I have been reading large text files with D's csv file reader and 
have found it slow compared to R's read.table function, which is 
not known to be particularly fast. Here I am reading Fannie Mae 
mortgage acquisition data, which can be found at 
http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html 
after registering:

D Code:

import std.algorithm;
import std.array;
import std.file;
import std.csv;
import std.stdio;
import std.typecons;
import std.datetime;

alias row_type = Tuple!(string, string, string, string, string, 
string, string, string,
                         string, string, string, string, string, 
string, string, string,
                         string, string, string, string, string, 
string);

void main(){
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
   double time = sw.peek().msecs;
   writeln("Time (s): ", time/1000);
}

Time (s): 13.478

R Code:

system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22)))
    user  system elapsed
   7.810   0.067   7.874


R takes about half as long to read the file. Both read the data 
in the "equivalent" type format. Am I doing something incorrect 
here?
Jan 21 2016
next sibling parent reply Rikki Cattermole <alphaglosined gmail.com> writes:
On 21/01/16 10:39 PM, data pulverizer wrote:
 I have been reading large text files with D's csv file reader and have
 found it slow compared to R's read.table function which is not known to
 be particularly fast. Here I am reading Fannie Mae mortgage acquisition
 data which can be found here
 http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html
 after registering:

 D Code:

 import std.algorithm;
 import std.array;
 import std.file;
 import std.csv;
 import std.stdio;
 import std.typecons;
 import std.datetime;

 alias row_type = Tuple!(string, string, string, string, string, string,
 string, string,
                          string, string, string, string, string, string,
 string, string,
                          string, string, string, string, string, string);

 void main(){
    StopWatch sw;
    sw.start();
    auto buffer = std.file.readText("Acquisition_2009Q2.txt");
    auto records = csvReader!row_type(buffer, '|').array;
    sw.stop();
    double time = sw.peek().msecs;
    writeln("Time (s): ", time/1000);
 }

 Time (s): 13.478

 R Code:

 system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|",
 colClasses = rep("character", 22)))
     user  system elapsed
    7.810   0.067   7.874


 R takes about half as long to read the file. Both read the data in the
 "equivalent" type format. Am I doing something incorrect here?
Okay without registering not gonna get that data.

So usual things to think about, did you turn on release mode?
What about inlining?

Lastly how about disabling the GC?

import core.memory : GC;
GC.disable();

dmd -release -inline code.d
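For reference, a minimal sketch of the original program with those suggestions applied (std.meta.Repeat is used here only to avoid spelling out the 22 string columns; if it is not available in your compiler version, the explicit Tuple from the first post works just as well):

// Build with:  dmd -release -inline file_read.d
import core.memory : GC;
import std.array : array;
import std.csv : csvReader;
import std.datetime : StopWatch;
import std.file : readText;
import std.meta : Repeat;
import std.stdio : writeln;
import std.typecons : Tuple;

alias row_type = Tuple!(Repeat!(22, string));   // 22 string columns, as above

void main()
{
    GC.disable();                    // no collections while parsing

    StopWatch sw;
    sw.start();
    auto buffer = readText("Acquisition_2009Q2.txt");
    auto records = csvReader!row_type(buffer, '|').array;
    sw.stop();

    GC.enable();                     // allow cleanup once parsing is done
    writeln("Time (s): ", sw.peek().msecs / 1000.0);
}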
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole 
wrote:

 Okay without registering not gonna get that data.

 So usual things to think about, did you turn on release mode?
 What about inlining?

 Lastly how about disabling the GC?

 import core.memory : GC;
 GC.disable();

 dmd -release -inline code.d
That helped a lot. I disabled the GC and inlined as you suggested, and the time is now:

Time (s): 8.754

However, R's data.table package gives us:

system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|",
colClasses = rep("character", 22)))
    user  system elapsed
   0.852   0.021   0.872

I should probably have begun with this timing. It's not my intention to turn this into a speed-only competition, but the ingest of files and speed of calculation is very important to me.
Jan 21 2016
next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 10:40:39 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole 
 wrote:

 Okay without registering not gonna get that data.

 So usual things to think about, did you turn on release mode?
 What about inlining?

 Lastly how about disabling the GC?

 import core.memory : GC;
 GC.disable();

 dmd -release -inline code.d
That helped a lot, I disable GC and inlined as you suggested and the time is now: Time (s): 8.754 However, with R's data.table package gives us: system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))) user system elapsed 0.852 0.021 0.872 I should probably have begun with this timing. Its not my intention to turn this into a speed-only competition, however the ingest of files and speed of calculation is very important to me.
I should probably add compiler version info:

~$ dmd --version
DMD64 D Compiler v2.069.2
Copyright (c) 1999-2015 by Digital Mars written by Walter Bright

Running Ubuntu 14.04 LTS
Jan 21 2016
parent reply bachmeier <no spam.com> writes:
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer 
wrote:

 Running Ubuntu 14.04 LTS
In that case, have you looked at http://lancebachmeier.com/rdlang/

If this is a serious bottleneck you can solve it with two lines:

evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))`);
auto x = RMatrix(evalR("x"));

and then you've got access to the data in D.
Jan 21 2016
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 16:25:55 UTC, bachmeier wrote:
 On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer 
 wrote:

 Running Ubuntu 14.04 LTS
In that case, have you looked at http://lancebachmeier.com/rdlang/ If this is a serious bottleneck you can solve it with two lines evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))`); auto x = RMatrix(evalR("x")); and then you've got access to the data in D.
Thanks. That's certainly something to try.
Jan 21 2016
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 01/21/2016 02:40 AM, data pulverizer wrote:

 dmd -release -inline code.d
These two as well please: -O -boundscheck=off
 the ingest of files and
 speed of calculation is very important to me.
We should understand why D is slow in this case. :) Ali
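Putting the flags from this subthread together, the full build command would presumably be:

dmd -O -release -inline -boundscheck=off code.d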
Jan 21 2016
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:
 On 01/21/2016 02:40 AM, data pulverizer wrote:

 dmd -release -inline code.d
These two as well please: -O -boundscheck=off
 the ingest of files and
 speed of calculation is very important to me.
We should understand why D is slow in this case. :) Ali
Thank you, adding those two flags brings the time down a little more:

Time (s): 6.832
Jan 21 2016
prev sibling parent bachmeier <no spam.com> writes:
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:

 We should understand why D is slow in this case. :)

 Ali
fread source is here:

https://github.com/Rdatatable/data.table/blob/master/src/fread.c

Good luck trying to work through that (which explains why I'm using D). I don't know what their magic is, but data.table is many times faster than anything else in R, so I don't think it's trivial.
Jan 21 2016
prev sibling next sibling parent reply Edwin van Leeuwen <edder tkwsping.nl> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? I.e., could you move sw.start() one line down (after the readText call) and see how long just the csvReader part takes?
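For example, the two stages could be timed separately like this (a sketch that reuses the imports and the row_type alias from the first post):

StopWatch sw;

sw.start();
auto buffer = std.file.readText("Acquisition_2009Q2.txt");
sw.stop();
writeln("readText  (s): ", sw.peek().msecs / 1000.0);

sw.reset();
sw.start();
auto records = csvReader!row_type(buffer, '|').array;
sw.stop();
writeln("csvReader (s): ", sw.peek().msecs / 1000.0);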
Jan 21 2016
parent reply Saurabh Das <saurabh.das gmail.com> writes:
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen 
wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Please try this:

auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array;

Can you put up some sample data and share the number of records in the file as well?
Jan 21 2016
parent reply Saurabh Das <saurabh.das gmail.com> writes:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen 
 wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well.
Actually, since you're aiming for speed, this might be better:

sw.start();
auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array;
sw.stop();

Please do verify that the end result is the same - I'm not 100% confident of the cast.

Thanks,
Saurabh
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van 
 Leeuwen wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well.
Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh
Saurabh, I have tried your latest suggestion and the time reduces fractionally to:

Time (s): 6.345

The previous suggestion actually increased the time.

Edwin van Leeuwen: the csvReader is what takes the most time; the readText takes 0.229 s.
Jan 21 2016
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das 
 Actually since you're aiming for speed, this might be better:

 sw.start();
 auto records = 
 File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a =>
cast(dchar)a).csvReader!row_type('|').array
 sw.stop();

 Please do verify that the end result is the same - I'm not 
 100% confident of the cast.

 Thanks,
 Saurabh
Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s
P.S. Saurabh, the result from the cast looks fine. Thanks.
Jan 21 2016
prev sibling next sibling parent reply wobbles <grogan.colin gmail.com> writes:
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das 
 wrote:
 [...]
Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh
Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s
Interesting that reading a file is so slow.

Your timings from R - do they include reading the file as well?
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
 Interesting that reading a file is so slow.

 Your timings from R, is that including reading the file also?
Yes, it's just insane, isn't it?
Jan 21 2016
parent reply Saurabh Das <saurabh.das gmail.com> writes:
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
 Interesting that reading a file is so slow.

 Your timings from R, is that including reading the file also?
Yes, its just insane isn't it?
It is insane. Earlier in the thread we were clearly tackling the wrong problem. Hence the adage, "measure first" :-/.

As suggested by Edwin van Leeuwen, can you give us a timing of:

auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array;

Thanks,
Saurabh
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 17:17:52 UTC, Saurabh Das wrote:
 On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer 
 wrote:
 On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
 Interesting that reading a file is so slow.

 Your timings from R, is that including reading the file also?
Yes, its just insane isn't it?
It is insane. Earlier in the thread we were tackling the wrong problem clearly. Hence the adage, "measure first" :-/. As suggested by Edwin van Leeuwen, can you give us a timing of: auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array; Thanks, Saurabh
Good news and bad news. I was going for something similar to what you have above, and both slash the time a lot:

Time (s): 1.024

But now the output is a little garbled. For some reason the splitter isn't splitting correctly - or we are not applying it properly. Line 0:

["100001703051", "RETAIL", "BANK OF AMERICA, N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", "75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", "", "FRM"]
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 18:31:17 UTC, data pulverizer 
wrote:
 Good news and bad new. I was going for something similar to 
 what you have above and both slash the time alot:

 Time (s): 1.024

 But now the output is a little garbled. For some reason the 
 splitter isn't splitting correctly - or we are not applying it 
 properly. Line 0:

 ["100001703051", "RETAIL", "BANK OF AMERICA, 
 N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", 
 "75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " 
 REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", 
 "", "FRM"]
I should probably include the first few lines of the file:

100000511550|RETAIL|FLAGSTAR CAPITAL MARKETS CORPORATION|5|222000|360|04/2009|06/2009|44|44|2|37|823|NO|NO CASH-OUT REFINANCE|PUD|1|PRINCIPAL|AZ|863||FRM
100001031040|BROKER|SUNTRUST MORTGAGE INC.|4.99|456000|360|03/2009|05/2009|83|83|1|47|744|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|MD|211|12|FRM
100001445182|CORRESPONDENT|CITIMORTGAGE, INC.|4.875|172000|360|05/2009|07/2009|80|80|2|25|797|NO|CASH-OUT REFINANCE|SF|1|PRINCIPAL|TX|758||FRM
100001703051|RETAIL|BANK OF AMERICA, N.A.|4.875|207000|360|03/2009|05/2009|75|75|1|26|806|NO|NO CASH-OUT REFINANCE|PUD|1|INVESTOR|CO|801||FRM
100006033316|CORRESPONDENT|JPMORGAN CHASE BANK, NATIONAL ASSOCIATION|5|170000|360|05/2009|07/2009|80|80|1|23|771|NO|CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|224||FRM

It's interesting that the first array of the output is not the same as the input.
Jan 21 2016
parent reply Justin Whear <justin economicmodeling.com> writes:
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
 
 It's interesting that the output first array is not the same as the
 input
byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
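A small sketch of the two fixes described above (file name taken from earlier in the thread):

import std.algorithm : map;
import std.array : array, split;
import std.stdio : File;

void main()
{
    // Option 1: byLineCopy allocates a fresh string per line, so the slices
    // produced by split remain valid after iteration moves on.
    auto rows1 = File("Acquisition_2009Q2.txt")
                 .byLineCopy
                 .map!(line => line.split("|"))
                 .array;

    // Option 2: keep byLine (which reuses its internal buffer) and copy each
    // line explicitly with idup before the buffer is overwritten.
    auto rows2 = File("Acquisition_2009Q2.txt")
                 .byLine
                 .map!(line => line.idup.split("|"))
                 .array;
}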
Jan 21 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear wrote:
 On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:

 It's interesting that the output first array is not the same 
 as the input
byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
Thanks. It now works with byLineCopy():

Time (s): 1.128
Jan 21 2016
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear 
 wrote:
 On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:

 It's interesting that the output first array is not the same 
 as the input
byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
Thanks. It now works with byLineCopy() Time (s): 1.128
Currently the timing is similar to python pandas:

import pandas as pd
import time

col_types = {'col1': str, 'col2': str, 'col3': str, 'col4': str,
             'col5': str, 'col6': str, 'col7': str, 'col8': str,
             'col9': str, 'col10': str, 'col11': str, 'col12': str,
             'col13': str, 'col14': str, 'col15': str, 'col16': str,
             'col17': str, 'col18': str, 'col19': str, 'col20': str,
             'col21': str, 'col22': str}

begin = time.time()
x = pd.read_csv('Acquisition_2009Q2.txt', sep = '|', dtype = col_types)
end = time.time()
print end - begin

$ python file_read.py
1.19544792175
Jan 21 2016
prev sibling parent Edwin van Leeuwen <edder tkwsping.nl> writes:
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:
 On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
  Edwin van Leeuwen The csvReader is what takes the most time, 
 the readText takes 0.229 s
The underlying problem most likely is that csvReader has (AFAIK) never been properly optimized/profiled (it is a very old piece of the library).

You could try to implement a rough csvReader using buffer.byLine() and, for each line, use split("|") to split at "|". That should be faster, because it doesn't do any checking.

Untested code:

string[][] res = buffer.byLine().map!((a) => a.split("|").array).array;
Jan 21 2016
prev sibling next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
 R takes about half as long to read the file. Both read the data 
 in the "equivalent" type format. Am I doing something incorrect 
 here?
CsvReader hasn't been compared against and optimized relative to other CSV readers. It does allocate for the parsed string (even if it isn't changed) and it does a number of validation checks.

You may get some improvement by disabling the CSV validation, but again this wasn't tested for performance:

csvReader!(string, Malformed.ignore)(str)

Generally people recommend using GDC/LDC if you need resulting executable performance, but csvReader being slower isn't the most surprising.

Before submitting my library to Phobos I had started a CSV reader that would do no allocations and instead return string slices. This wasn't completed and so it never had performance testing done against it. It could very well be slower.

https://github.com/JesseKPhillips/JPDLibs/blob/csvoptimize/csv/csv.d

My original CSV parser was really slow because I parsed the string twice.
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote:
R takes about half as long to read the file. Both read the data in
the "equivalent" type format. Am I doing something incorrect here?
CsvReader hasn't been compared and optimized from other CSV readers. It does have allocation for the parsed string (even if it isn't changed) and it does a number of validation checks.
[...]

This piqued my interest today, so I decided to take a shot at writing a fast CSV parser. First, I downloaded a sample large CSV file from:

ftp://ftp.census.gov/econ2013/CBP_CSV/cbp13co.zip

This file has over 2 million records, so I thought it would serve as a good dataset to run benchmarks on.

Since the OP wanted the loaded data in an array of records, as opposed to iterating over the records as an input range, I decided that the best way to optimize this use case was to load the entire file into memory and then return an array of slices to this data, instead of wasting time (and memory) copying the data.

Furthermore, since it will be an array of records which are arrays of slices to field values, another optimization is to allocate a large buffer for storing consecutive field slices, and then in the outer array just slice the buffer to represent a record. This greatly cuts down on the number of GC allocations needed.

Once the buffer is full, we don't allocate a larger buffer and copy everything over; this is unnecessary (and wasteful) because the outer array doesn't care where its elements point to. Instead, we allocate a new buffer, leaving previous records pointing to slices of the old buffer, and start appending more field slices in the new buffer, and so on. After all, the records don't have to exist in consecutive slices. There's just a minor overhead in that if we run out of space in the buffer while in the middle of parsing a record, we need to copy the current record's field slices into the new buffer, so that all the fields belonging to this record remain contiguous (so that the outer array can just slice them). This is a very small overhead compared to copying the entire buffer into a new memory block (as would happen if we kept the buffer as a single array that needs to expand), so it ought to be negligible.

So in a nutshell, what we have is an outer array, each element of which is a slice (representing a record) that points to some slice of one of the buffers. Each buffer is a contiguous sequence of slices (representing a field) pointing to some segment of the original data.

Here's the code:

---------------------------------------------------------------------------
/**
 * Experimental fast CSV reader.
 *
 * Based on RFC 4180.
 */
module fastcsv;

/**
 * Reads CSV data from the given filename.
 */
auto csvFromUtf8File(string filename)
{
    import std.file : read;
    return csvFromString(cast(string) read(filename));
}

/**
 * Parses CSV data in a string.
 *
 * Params:
 *  fieldDelim = The field delimiter (default: ',')
 *  data = The data in CSV format.
 */
auto csvFromString(dchar fieldDelim=',', dchar quote='"')(const(char)[] data)
{
    import core.memory;
    import std.array : appender;

    enum fieldBlockSize = 1 << 16;
    auto fields = new const(char)[][fieldBlockSize];
    size_t curField = 0;

    GC.disable();
    auto app = appender!(const(char)[][][]);

    // Scan data
    size_t i;
    while (i < data.length)
    {
        // Parse records
        size_t firstField = curField;
        while (i < data.length && data[i] != '\n' && data[i] != '\r')
        {
            // Parse fields
            size_t firstChar, lastChar;
            if (data[i] == quote)
            {
                i++;
                firstChar = i;
                while (i < data.length && data[i] != fieldDelim &&
                       data[i] != '\n' && data[i] != '\r')
                {
                    i++;
                }
                lastChar = (i < data.length && data[i-1] == quote) ? i-1 : i;
            }
            else
            {
                firstChar = i;
                while (i < data.length && data[i] != fieldDelim &&
                       data[i] != '\n' && data[i] != '\r')
                {
                    i++;
                }
                lastChar = i;
            }
            if (curField >= fields.length)
            {
                // Fields block is full; copy current record fields into new
                // block so that they are contiguous.
                auto nextFields = new const(char)[][fieldBlockSize];
                nextFields[0 .. curField - firstField] =
                    fields[firstField .. curField];
                //fields.length = firstField; // release unused memory?
                curField = curField - firstField;
                firstField = 0;
                fields = nextFields;
            }
            assert(curField < fields.length);
            fields[curField++] = data[firstChar .. lastChar];

            // Skip over field delimiter
            if (i < data.length && data[i] == fieldDelim)
                i++;
        }
        app.put(fields[firstField .. curField]);

        // Skip over record delimiter(s)
        while (i < data.length && (data[i] == '\n' || data[i] == '\r'))
            i++;
    }

    GC.collect();
    GC.enable();
    return app.data;
}

unittest
{
    auto sampleData =
        `123,abc,"mno pqr",0` ~ "\n" ~
        `456,def,"stuv wx",1` ~ "\n" ~
        `78,ghijk,"yx",2`;

    auto parsed = csvFromString(sampleData);
    assert(parsed == [
        [ "123", "abc", "mno pqr", "0" ],
        [ "456", "def", "stuv wx", "1" ],
        [ "78", "ghijk", "yx", "2" ]
    ]);
}

unittest
{
    auto dosData =
        `123,aa,bb,cc` ~ "\r\n" ~
        `456,dd,ee,ff` ~ "\r\n" ~
        `789,gg,hh,ii` ~ "\r\n";

    auto parsed = csvFromString(dosData);
    assert(parsed == [
        [ "123", "aa", "bb", "cc" ],
        [ "456", "dd", "ee", "ff" ],
        [ "789", "gg", "hh", "ii" ]
    ]);
}
---------------------------------------------------------------------------

There are some limitations to this approach: while the current code does try to unwrap quoted values in the CSV, it does not correctly parse escaped double quotes ("") in the fields. This is because to process those values correctly we'd have to copy the field data into a new string and construct its interpreted value, which is slow. So I leave it as an exercise for the reader to implement (it's not hard: when the double double-quote sequence is detected, allocate a new string with the interpreted data instead of slicing the original data. Either that, or just unescape the quotes in the application code itself).

Now, in the first version of this code, I didn't have the GC calls... those were added later when I discovered that the GC was slowing it down to about the same speed (or worse!) as std.csv. A little profiling showed that 80% of the time was spent in the GC mark/collect code. After adding in the code to disable the GC, the performance improved dramatically.

Of course, running without GC collection is not a fair comparison with std.csv, so I added an option to my benchmark program to disable the GC for std.csv as well. While the result was slightly faster, it was still much slower than my fastcsv code. (Though to be fair, std.csv does perform validation checks and so forth that fastcsv doesn't even try to.)

Anyway, here are the performance numbers from one of the benchmark runs (these numbers are pretty typical):

    std.csv (with gc): 2126884 records in 23144 msecs
    std.csv (no gc):   2126884 records in 18109 msecs
    fastcsv (no gc):   2126884 records in 1358 msecs

As you can see, our little array-slicing scheme gives us a huge performance boost over the more generic std.csv range-based code. We managed to cut out over 90% of the total runtime, even when std.csv is run with GC disabled. We even try to be nice in fastcsv by calling GC.collect to clean up after we're done, and this collection time is included in the benchmark.

While this is no fancy range-based code, and one might say it's more hackish and C-like than idiomatic D, the problem is that current D compilers can't quite optimize range-based code to this extent yet. Perhaps in the future optimizers will improve so that more idiomatic, range-based code will have comparable performance with fastcsv. (At least in theory this should be possible.)

Finally, just for the record, here's the benchmark code I used:

---------------------------------------------------------------------------
/**
 * Crude benchmark for fastcsv.
 */
import core.memory;
import std.array;
import std.csv;
import std.file;
import std.datetime;
import std.stdio;
import fastcsv;

int main(string[] argv)
{
    if (argv.length < 2)
    {
        stderr.writeln("Specify std, stdnogc, or fast");
        return 1;
    }

    // Obtained from ftp://ftp.census.gov/econ2013/CBP_CSV/cbp13co.zip
    enum csvFile = "ext/cbp13co.txt";

    string input = cast(string) read(csvFile);
    if (argv[1] == "std")
    {
        auto result = benchmark!({
            auto data = std.csv.csvReader(input).array;
            writefln("std.csv read %d records", data.length);
        })(1);
        writefln("std.csv: %s msecs", result[0].msecs);
    }
    else if (argv[1] == "stdnogc")
    {
        auto result = benchmark!({
            GC.disable();
            auto data = std.csv.csvReader(input).array;
            writefln("std.csv (nogc) read %d records", data.length);
            GC.enable();
        })(1);
        writefln("std.csv: %s msecs", result[0].msecs);
    }
    else if (argv[1] == "fast")
    {
        auto result = benchmark!({
            auto data = fastcsv.csvFromString(input);
            writefln("fastcsv read %d records", data.length);
        })(1);
        writefln("fastcsv: %s msecs", result[0].msecs);
    }
    else
    {
        stderr.writeln("Unknown option: " ~ argv[1]);
        return 1;
    }
    return 0;
}
---------------------------------------------------------------------------

--T
Jan 21 2016
next sibling parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 Of course, running without GC collection is not a fair 
 comparison with std.csv, so I added an option to my benchmark 
 program to disable the GC for std.csv as well.  While the 
 result was slightly faster, it was still much slower than my 
 fastcsv code. (Though to be fair, std.csv does perform 
 validation checks and so forth that fastcsv doesn't even try 
 to.)
As mentioned, validation can be turned off:

auto data = std.csv.csvReader!(string, Malformed.ignore)(input).array;

I forgot to mention that one of the requirements for std.csv was that it worked on the base range type, the input range. Not that slicing wouldn't be a valid addition.

I was also going to do the same thing with my sliced CSV: no fixing of the escaped quote. That would have just been a helper function the user could map over the results.
Jan 21 2016
prev sibling next sibling parent reply Brad Anderson <eco gnuk.net> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 [snip]
 There are some limitations to this approach: while the current 
 code does try to unwrap quoted values in the CSV, it does not 
 correctly parse escaped double quotes ("") in the fields. This 
 is because to process those values correctly we'd have to copy 
 the field data into a new string and construct its interpreted 
 value, which is slow.  So I leave it as an exercise for the 
 reader to implement (it's not hard, when the double 
 double-quote sequence is detected, allocate a new string with 
 the interpreted data instead of slicing the original data. 
 Either that, or just unescape the quotes in the application 
 code itself).
What about wrapping the slices in a range-like interface that would unescape the quotes on demand? You could even set a flag on it during the initial pass to say the field has double quotes that need to be escaped so it doesn't need to take a per-pop performance hit checking for double quotes (that's probably a pretty minor boost, if any, though).
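A rough sketch of that idea, using a plain wrapper with a flag rather than a full range interface (the names here are made up for illustration and are not part of the fastcsv code):

import std.array : replace;

/// Wraps a raw field slice; unescaping is deferred until the value is needed.
struct CsvField
{
    const(char)[] raw;       // slice into the original CSV buffer
    bool hasDoubledQuotes;   // set by the parser during its initial scan

    /// Returns the interpreted value, allocating only when "" actually occurs.
    const(char)[] value() const
    {
        return hasDoubledQuotes ? raw.replace(`""`, `"`) : raw;
    }
}

unittest
{
    auto plain = CsvField("gentle", false);
    assert(plain.value == "gentle");

    auto quoted = CsvField(`He said ""oh my gosh""`, true);
    assert(quoted.value == `He said "oh my gosh"`);
}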
Jan 21 2016
parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 21 January 2016 at 22:13:38 UTC, Brad Anderson wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 [...]
What about wrapping the slices in a range-like interface that would unescape the quotes on demand? You could even set a flag on it during the initial pass to say the field has double quotes that need to be escaped so it doesn't need to take a per-pop performance hit checking for double quotes (that's probably a pretty minor boost, if any, though).
Oh, you discussed range-based later. I should have finished reading before replying.
Jan 21 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 [...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters, but true CSV is one hell of a beast.

Of course most data look like:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,oh my gosh

but you can have delimiters inside a field:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,"14,2",gentle
3,Pinkie Pie,169,oh my gosh

or quotes in a quoted field, in which case you have to double the quotes:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,"He said ""oh my gosh"""

but in that case external quotes aren't required:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,He said ""oh my gosh""

but at least it's always one record per line, no? No? No.

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,"He said ""oh my gosh""
And she replied ""Come on! Have fun!"""

I'll stop there, but you get the picture. Simply splitting by line then separator may work well on most data, but I wouldn't put it in production or in the standard library.

Note that I think you did a great job optimizing your code, and I respect that; it's just a friendly reminder.
Jan 21 2016
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast.
[...] As I stated, I didn't fully implement the parsing of quoted fields. (Or, for that matter, the correct parsing of crazy wrapped values like you pointed out.) This is not finished code; it's more of a proof of concept. T -- Lottery: tax on the stupid. -- Slashdotter
Jan 21 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast.
Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv [...]
 but you can have delimiters inside a field:
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,"14,2",gentle
     3,Pinkie Pie,169,oh my gosh
Fixed.
 or quotes in a quoted field, in that case you have to double the quotes:
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,"He said ""oh my gosh"""
Fixed. Well, except the fact that I don't actually interpret the doubled quotes, but leave it up to the caller to filter them out at the application level.
 but in that case external quotes aren't required:
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,He said ""oh my gosh""
Actually, this has already worked before. (Excepting the untranslated doubled quotes, of course.)
 but at least it's always one record per line, no? No? No.
 
     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,"He said
     ""oh my gosh""
     And she replied
     ""Come on! Have fun!"""
Fixed.
 I'll stop there, but you get the picture. Simply splitting by line
 then separator may work well on most data, but I wouldn't put it in
 production or in the standard library.
Actually, my code does *not* split by line then by separator. Did you read it? ;-) T -- The most powerful one-line C program: #include "/dev/tty" -- IOCCC
Jan 21 2016
parent reply cym13 <cpicard openmailbox.org> writes:
On Friday, 22 January 2016 at 00:26:16 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via 
 Digitalmars-d-learn wrote:
 [...]
Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv [...]
     [...]
Fixed.
     [...]
Fixed. Well, except the fact that I don't actually interpret the doubled quotes, but leave it up to the caller to filter them out at the application level.
     [...]
Actually, this has already worked before. (Excepting the untranslated doubled quotes, of course.)
     [...]
Fixed.
 [...]
Actually, my code does *not* split by line then by separator. Did you read it? ;-) T
Great! Sorry for the separator thing, I didn't read your code carefully. You still lack some things like comments and surely more things that I don't know about, but it's getting there. I didn't think you'd go through the trouble of fixing those things, to be honest; I'm impressed.
Jan 21 2016
next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
 Great! Sorry for the separator thing, I didn't read your code 
 carefully. You still lack some things like comments and surely 
 more things that I don't know about but it's getting there. I 
 didn't think you'd go through the trouble of fixing those 
 things to be honnest, I'm impressed.
CSV doesn't have comments, sorry.
Jan 21 2016
parent cym13 <cpicard openmailbox.org> writes:
On Friday, 22 January 2016 at 01:14:48 UTC, Jesse Phillips wrote:
 On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
 Great! Sorry for the separator thing, I didn't read your code 
 carefully. You still lack some things like comments and surely 
 more things that I don't know about but it's getting there. I 
 didn't think you'd go through the trouble of fixing those 
 things to be honnest, I'm impressed.
CSV doesn't have comments, sorry.
(outside of "" of course) and wrongly assumed it was a standard thing, I stand corrected.
Jan 21 2016
prev sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 12:56:02AM +0000, cym13 via Digitalmars-d-learn wrote:
[...]
 Great! Sorry for the separator thing, I didn't read your code
 carefully. You still lack some things like comments and surely more
 things that I don't know about but it's getting there.
Comments? You mean in the code? 'cos the CSV grammar described in RFC-4180 doesn't seem to have the possibility of comments in the CSV itself...
 I didn't think you'd go through the trouble of fixing those things to
 be honnest, I'm impressed.
They weren't that hard to fix, because the original code already had a separate path for quoted values, so it was just a matter of deleting some of the loop conditions to make the quoted path accept delimiters and newlines. In fact, the original code already accepted doubled quotes in the unquoted field path. It was only the interpretation of doubled quotes that required modifications to both inner loops.

Now having said that, though, I think there are some bugs in the code that might cause an array overrun... and the fix might slow things down yet a bit more. There are also some fundamental limitations:

1) The CSV data has to be loadable into memory in its entirety. This may not be possible for very large files, or on machines with low memory.

2) There is no range-based interface. I *think* this should be possible to add, but it will probably increase the overhead and make the code slower.

3) There is no validation of the input whatsoever. If you feed it malformed CSV, it will give you nonsensical output. Well, it may crash, but hopefully won't anymore after I fix those missing bounds checks... but it will still give you nonsensical output.

4) The accepted syntax is actually a little larger than strict CSV (in the sense of RFC-4180); Unicode input is accepted but RFC-4180 does not allow Unicode. This may actually be a plus, though, because I'm expecting that modern CSV may actually contain Unicode data, not just the ASCII range defined in RFC-4180.

T

--
The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
Jan 21 2016
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
 On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast.
Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv
Oh, forgot to mention, the parsing times are still lightning fast after the fixes I mentioned: still around 1190 msecs or so. Now I'm tempted to actually implement doubled-quote interpretation... as long as the input file doesn't contain unreasonable amounts of doubled quotes, I'm expecting the speed should remain pretty fast. --T
Jan 21 2016
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 04:31:03PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
 On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
[...]
 	https://github.com/quickfur/fastcsv
Oh, forgot to mention, the parsing times are still lightning fast after the fixes I mentioned: still around 1190 msecs or so. Now I'm tempted to actually implement doubled-quote interpretation... as long as the input file doesn't contain unreasonable amounts of doubled quotes, I'm expecting the speed should remain pretty fast.
[...] Done, commits pushed to github. The new code now parses doubled quotes correctly. The performance is slightly worse now, around 1300 msecs on average, even in files that don't have any doubled quotes (it's a penalty incurred by the inner loop needing to detect doubled quote sequences). My benchmark input file doesn't have any doubled quotes, however (code correctness with doubled quotes is gauged by unittests only); so the performance numbers may not accurately reflect true performance in the general case. (But if doubled quotes are rare, as I'm expecting, the actual performance shouldn't change too much in general usage...) Maybe somebody who has a file with lots of ""'s can run the benchmark to see how badly it performs? :-P T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Jan 21 2016
prev sibling next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
 but in that case external quotes aren't required:

     number,name,price,comment
     1,Twilight,150,good friend
     2,Fluttershy,142,gentle
     3,Pinkie Pie,169,He said ""oh my gosh""
std.csv will reject this. If validation is turned off this is fine, but your data will include "".

"A field containing new lines, commas, or double quotes should be enclosed in double quotes (customizable)"

This is because it is not possible to decide what the correct parsing should be. Is the data meant to include two double quotes? What if there was only one quote there; do I have to remember it was there and decide not to throw it out because I didn't see another quote? At this point the data is not following CSV rules, so if I'm validating I'm throwing it out, and if I'm not validating I'm not stripping data.
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 01:13:07AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
but in that case external quotes aren't required:

    number,name,price,comment
    1,Twilight,150,good friend
    2,Fluttershy,142,gentle
    3,Pinkie Pie,169,He said ""oh my gosh""
std.csv will reject this. If validation is turned off this is fine but your data will include "". "A field containing new lines, commas, or double quotes should be enclosed in double quotes (customizable)" This because it is not possible to decide what correct parsing should be. Is the data using including two double quotes? What if there was only one quote there, do I have to remember it was their and decide not to throw it out because I didn't see another quote? At this point the data is not following CSV rules so if I'm validating I'm throwing it out and if I'm not validating I'm not stripping data.
This case is still manageable, because there are no embedded commas. Everything between the last comma and the next comma or newline unambiguously belongs to the current field. As to how to interpret it (should the result contain single or doubled quotes?), though, that could potentially be problematic. And now that you mention this, RFC-4180 does not allow doubled quotes in an unquoted field. I'll take that out of the code (it improves performance :-D). T -- First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Jan 21 2016
parent reply cym13 <cpicard openmailbox.org> writes:
On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
 And now that you mention this, RFC-4180 does not allow doubled 
 quotes in an unquoted field. I'll take that out of the code (it 
 improves performance :-D).
Right, re-reading the RFC would have been a great thing. That said I saw that kind of CSV in the real world, so I don't know what to think of it. I'm not saying it should be supported, but I wonder if there are points outside RFC-4180 that are taken for granted.
Jan 21 2016
parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 22 January 2016 at 01:36:40 UTC, cym13 wrote:
 On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
 And now that you mention this, RFC-4180 does not allow doubled 
 quotes in an unquoted field. I'll take that out of the code 
 (it improves performance :-D).
Right, re-reading the RFC would have been a great thing. That said I saw that kind of CSV in the real world, so I don't know what to think of it. I'm not saying it should be supported, but I wonder if there are points outside RFC-4180 that are taken for granted.
You have to understand that CSV didn't come from a standard. People started using it because it was simple for writing out some tabular data. Then they changed it because their data changed. It's not like their language came with a CSV parser; it was always hand written, and people still do it today. And that is why data is delimited with so many things other than commas (people thought they wouldn't need to escape their data).

So yes, some CSV parsers will accept comments, or assume that two double quotes in unquoted data are just a quote, but then it breaks for those who have that kind of data which isn't escaped. There are also many other issues with CSV data, like whether the file is in ASCII or UTF or some other code page. And many times CSV isn't well formed because the data was output without proper escaping.

std.csv isn't the end-all of CSV parsers, but it will at least handle well-formed CSV that uses different separators or quotes.
Jan 22 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
 [...]
 	https://github.com/quickfur/fastcsv
Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4180). Performance is back in the ~1200 msec range.

T

--
There is no gravity. The earth sucks.
Jan 21 2016
next sibling parent Edwin van Leeuwen <edder tkwsping.nl> writes:
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via 
 Digitalmars-d-learn wrote:
 [...]
 	https://github.com/quickfur/fastcsv
[...] Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4810). Performance is back in the ~1200 msec range. T
That's pretty impressive. Maybe turn it into a dub package so that data pulverizer could easily test it on his data :)
Jan 22 2016
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via 
 Digitalmars-d-learn wrote:
 [...]
 	https://github.com/quickfur/fastcsv
[...] Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4810). Performance is back in the ~1200 msec range. T
Hi H. S. Teoh, I have used your fastcsv on my file:

import std.file;
import fastcsv;
import std.stdio;
import std.datetime;

void main(){
    StopWatch sw;
    sw.start();
    auto input = cast(string) read("Acquisition_2009Q2.txt");
    auto mydata = fastcsv.csvToArray!('|')(input);
    sw.stop();
    double time = sw.peek().msecs;
    writeln("Time (s): ", time/1000);
}

$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.
Jan 22 2016
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 January 2016 at 21:41:46 UTC, data pulverizer wrote:
 On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
 [...]
Hi H. S. Teoh, I have used you fastcsv on my file: import std.file; import fastcsv; import std.stdio; import std.datetime; void main(){ StopWatch sw; sw.start(); auto input = cast(string) read("Acquisition_2009Q2.txt"); auto mydata = fastcsv.csvToArray!('|')(input); sw.stop(); double time = sw.peek().msecs; writeln("Time (s): ", time/1000); } $ dmd file_read_5.d fastcsv.d $ ./file_read_5 Time (s): 0.679 Fastest so far, very nice.
I guess the next step is allowing Tuple rows with mixed types.
Jan 22 2016
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
[...]
$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.
Thanks!
 I guess the next step is allowing Tuple rows with mixed types.
I thought about that a little today. I'm guessing that most of the performance will be dependent on the conversion into the target types. Right now it's extremely fast because, for the most part, it's just taking slices of an existing string.

It shouldn't be too hard to extend the current code so that instead of assembling the string slices in a block buffer, it will run them through std.conv.to instead and store them in an array of some given struct. But there may be performance degradation because now we have to do non-trivial operations on the string slices. Converting from const(char)[] to string probably should be avoided where not necessary, since otherwise it will involve lots and lots of small allocations and the GC will become very slow. Converting to ints may not be too bad... but conversion to types like floating point may be quite slow.

Now, assembling the resulting structs into an array could potentially be slow... but perhaps an analogous block buffer technique can be used to create the array piecemeal in separate blocks, and only perform the final assembly into a single array at the very end (thus avoiding reallocating and copying the growing array as we go along).

But we'll see. Performance predictions are rarely accurate; only a profiler will tell the truth about where the real bottlenecks are. :-)

T

--
LINUX = Lousy Interface for Nefarious Unix Xenophobes.
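As a rough illustration of that conversion step (the record type and field choices below are hypothetical, not the actual fastcsv implementation):

import std.conv : to;

// Hypothetical target record covering a few of the acquisition-file columns.
struct Loan
{
    long id;
    const(char)[] channel;   // kept as a slice to avoid a string copy
    double rate;
    int amount;
}

/// Turn one row of field slices into a typed record via std.conv.to.
Loan toRecord(const(char)[][] fields)
{
    Loan r;
    r.id      = fields[0].to!long;
    r.channel = fields[1];          // no conversion, just keep the slice
    r.rate    = fields[2].to!double;
    r.amount  = fields[3].to!int;
    return r;
}

unittest
{
    auto rec = toRecord(["100000511550", "RETAIL", "5", "222000"]);
    assert(rec.id == 100000511550 && rec.rate == 5.0 && rec.amount == 222000);
}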
Jan 22 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
[...]
 I guess the next step is allowing Tuple rows with mixed types.
Alright. I threw together a new CSV parsing function that loads CSV data into an array of structs. Currently, the implementation is not quite polished yet (it blindly assumes the first row is a header row, which it discards), but it does work, and outperforms std.csv by about an order of magnitude.

The initial implementation was very slow (albeit still somewhat faster than std.csv, by about 10% or so) when given a struct with string fields. However, structs with POD fields are lightning fast (not significantly different from before, in spite of all the calls to std.conv.to!). This suggested that the slowdown was caused by excessive allocations of small strings, causing a heavy GC load.

This suspicion was confirmed when I ran the same input data with a struct where all string fields were replaced with const(char)[] (so that std.conv.to simply returned slices to the data) -- the performance shot back up to about 1700 msecs, a little slower than the original version of reading into an array of array of const(char)[] slices, but about 58 times(!) the performance of std.csv.

So I tried a simple optimization: instead of allocating a string per field, allocate 64KB string buffers and copy string field values into it, then take slices from the buffer to assign to the struct's string fields. With this optimization, running times came down to about the 1900 msec range, which is only marginally slower than the const(char)[] case, about 51 times faster than std.csv.

Here are the actual benchmark values:

1) std.csv:                                    2126883 records, 102136 msecs
2) fastcsv (struct with string fields):        2126883 records, 1978 msecs
3) fastcsv (struct with const(char)[] fields): 2126883 records, 1743 msecs

The latest code is available on github:

    https://github.com/quickfur/fastcsv

The benchmark driver now has 3 new targets:

    stdstruct   - std.csv parsing of CSV into structs
    faststruct  - fastcsv parsing of CSV into struct (string fields)
    faststruct2 - fastcsv parsing of CSV into struct (const(char)[] fields)

Note that the structs are hard-coded into the code, so they will only work with the census.gov test file.

Things still left to do:

- Fix header parsing to have a consistent interface with std.csv, or at least allow the user to configure whether or not the first row should be discarded.
- Support transcription to Tuples?
- Refactor the code to have less copy-pasta.
- Ummm... make it ready for integration with std.csv maybe? ;-)

T

--
Fact is stranger than fiction.
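A minimal sketch of that string-buffer idea, just to make the shape of the technique concrete (this is not the actual fastcsv code; the name StringPool is invented here):

/// Copies small strings into large shared blocks so that each field does not
/// become its own GC allocation; the returned slices point into those blocks.
struct StringPool
{
    enum blockSize = 64 * 1024;
    private char[] block;
    private size_t used;

    string intern(const(char)[] s)
    {
        if (s.length > block.length - used)
        {
            // Start a new block; slices handed out earlier stay valid because
            // they still point into the old block.
            block = new char[](blockSize > s.length ? blockSize : s.length);
            used = 0;
        }
        auto dest = block[used .. used + s.length];
        dest[] = s[];
        used += s.length;
        return cast(string) dest;   // nothing else writes to this slice again
    }
}

unittest
{
    StringPool pool;
    assert(pool.intern("RETAIL") == "RETAIL");
    assert(pool.intern("BROKER") == "BROKER");
}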
Jan 23 2016
parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 24 January 2016 at 01:57:11 UTC, H. S. Teoh wrote:
 - Ummm... make it ready for integration with std.csv maybe? ;-)


 T
My suggestion is to take the unittests used in std.csv and try to get your code working with them. As fastcsv's limitations would prevent replacing the std.csv implementation, the API may not need to match, but keeping close to the same would be best.
Jan 23 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
[...]
 My suggestion is to take the unittests used in std.csv and try to get
 your code working with them. As fastcsv limitations would prevent
 replacing the std.csv implementation the API may not need to match,
 but keeping close to the same would be best.
My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible.

It may be possible to lift some of fastcsv's limitations, now that a few performance bottlenecks have been identified (validation and the excessive number of small allocations being the main ones). The code could be generalized a bit more while preserving the optimizations in these key areas.

T

--
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL
Jan 25 2016
next sibling parent bachmeier <no spam.net> writes:
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
 My thought is to integrate the fastcsv code into std.csv, such 
 that the current std.csv code will serve as fallback in the 
 cases where fastcsv's limitations would prevent it from being 
 used, with fastcsv being chosen where possible.
Wouldn't it be simpler to add a new function? Otherwise you'll end up with very different performance for almost the same data.
Jan 26 2016
prev sibling parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
 On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via 
 Digitalmars-d-learn wrote: [...]
 My suggestion is to take the unittests used in std.csv and try 
 to get your code working with them. As fastcsv limitations 
 would prevent replacing the std.csv implementation the API may 
 not need to match, but keeping close to the same would be best.
My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible.
That is why I suggested starting with the unittests. I don't expect the implementations to share much code; std.csv is written to only use front, popFront, and empty. Most of the work is done in csvNextToken, so it might be able to take advantage of random-access ranges for more performance.

I just think the unittests will help to define where switching algorithms will be required, since they exercise a good portion of the API.
Jan 26 2016
prev sibling next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
 This piqued my interest today, so I decided to take a shot at 
 writing a fast CSV parser.  First, I downloaded a sample large 
 CSV file from: [...]
Hi H. S. Teoh,

I tried to compile your code (fastcsv.d) on my machine but I get crt1.o errors, for example:

.../crt1.o(.debug_info): relocation 0 has invalid symbol index 0

Are there flags that I should be compiling with, or some other thing that I am missing?
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via This piqued
my interest today, so I decided to take a shot at writing a fast CSV
parser.  First, I downloaded a sample large CSV file from: [...]
Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine but I get ctr1.o errors for example: .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 are there flags that I should be compiling with or some other thing that I am missing?
Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T -- Give a man a fish, and he eats once. Teach a man to fish, and he will sit forever.
Jan 21 2016
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via 
 Digitalmars-d-learn wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
This piqued my interest today, so I decided to take a shot at 
writing a fast CSV parser.  First, I downloaded a sample 
large CSV file from: [...]
Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine but I get ctr1.o errors for example: .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 are there flags that I should be compiling with or some other thing that I am missing?
Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T
Thanks, I got used to getting away with running the "script" file in the same folder as a single file module - it usually works but occasionally (like now) I have to compile both together as you suggested.
Jan 21 2016
prev sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
 are there flags that I should be compiling with or some other 
 thing that I am missing?
Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T
Great benchmarks! This is something else for me to learn from.
Jan 21 2016
prev sibling parent reply Gerald Jansen <gjansen ownmail.net> writes:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say it's 
 more hackish and C-like than idiomatic D, the problem is that 
 current D compilers can't quite optimize range-based code to 
 this extent yet. Perhaps in the future optimizers will improve 
 so that more idiomiatic, range-based code will have comparable 
 performance with fastcsv. (At least in theory this should be 
 possible.)
As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
Jan 26 2016
parent reply Chris Wright <dhasenan gmail.com> writes:
On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:

 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say it's more
 hackish and C-like than idiomatic D, the problem is that current D
 compilers can't quite optimize range-based code to this extent yet.
 Perhaps in the future optimizers will improve so that more idiomiatic,
 range-based code will have comparable performance with fastcsv. (At
 least in theory this should be possible.)
As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
You want to reduce allocations. Ranges often let you do that. However, it's sometimes unsafe to reuse range values that aren't immutable. That means, if you want to keep the values around, you need to copy them -- which introduces an allocation. You can get away with fewer, larger allocations by reading the whole file at once manually and using slices into that large allocation.
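A minimal sketch of the two approaches being contrasted ("data.txt" is just a placeholder file name):

import std.array : array;
import std.file : readText;
import std.stdio : File;
import std.string : lineSplitter;

void main()
{
    // byLine reuses its internal buffer, so keeping a line around
    // requires copying it -- one small allocation per line kept.
    string[] copied;
    foreach (line; File("data.txt").byLine)
        copied ~= line.idup;

    // Alternative: one large allocation up front, then slices into it;
    // no per-line copies are needed.
    auto text = readText("data.txt");
    auto lines = text.lineSplitter.array;   // each element is a slice of `text`
}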
Jan 26 2016
next sibling parent Gerald Jansen <gjansen ownmail.net> writes:
On Tuesday, 26 January 2016 at 20:54:34 UTC, Chris Wright wrote:
 On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say 
 it's more hackish and C-like than idiomatic D, the problem is 
 that current D compilers can't quite optimize range-based 
 code to this extent yet. Perhaps in the future optimizers 
 will improve so that more idiomiatic, range-based code will 
 have comparable performance with fastcsv.
... data crunching ... I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
You can get fewer large allocations by reading the whole file at once manually and using slices into that large allocation.
Sure, that part is clear. Presumably the quoted comment referred to more than just that technique.
Jan 26 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Jan 26, 2016 at 08:54:34PM +0000, Chris Wright via Digitalmars-d-learn
wrote:
 On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
 
 On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
 While this is no fancy range-based code, and one might say it's
 more hackish and C-like than idiomatic D, the problem is that
 current D compilers can't quite optimize range-based code to this
 extent yet.  Perhaps in the future optimizers will improve so that
 more idiomiatic, range-based code will have comparable performance
 with fastcsv. (At least in theory this should be possible.)
As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
You want to reduce allocations. Ranges often let you do that. However, it's sometimes unsafe to reuse range values that aren't immutable. That means, if you want to keep the values around, you need to copy them -- which introduces an allocation. You can get fewer large allocations by reading the whole file at once manually and using slices into that large allocation.
Yeah, in the course of this exercise, I found that the one thing that has had the biggest impact on performance is the amount of allocations involved. Basically, I noted that the less allocations are made, the more efficient the code. I'm not sure exactly why this is so, but it's probably something to do with the fact that tracing GCs work better with fewer allocations of larger objects than many allocations of small objects. I have also noted in the past that D's current GC runs collections a little too often; in past projects I've obtained significant speedup (in one case, up to 40% reduction of total runtime) by suppressing automatic collections and scheduling them manually at a lower frequency. In short, I've found that reducing GC load plays a much bigger role in performance than the range vs. loops issue.

The reason I chose to write manual loops at first is to eliminate all possibility of unexpected overhead that might hide behind range primitives, as well as compiler limitations, as current optimizers aren't exactly tuned for range-based idioms, and may fail to recognize certain range-based idioms that would lead to much more efficient code. However, in my second iteration when I made the fastcsv parser return an input range instead of an array, I found only negligible performance differences. This suggests that perhaps range-based code may not perform that badly after all. I have yet to test this hypothesis, as the inner loop that parses fields in a single row is still a manual loop; but my suspicion is that it wouldn't do too badly in range-based form either.

What might make a big difference, though, is the part where slicing is used, since that is essential for reducing the number of allocations. The current iteration of struct-based parsing code, for instance, went through an initial version that was excruciatingly slow for structs with string fields. Why? Because the function takes const(char)[] as input, and you can't legally get strings out of that unless you make a copy of that data (since const means you cannot modify it, but somebody else still might). So std.conv.to would allocate a new string and copy the contents over, every time a string field was parsed, resulting in a large number of small allocations.

To solve this, I decided to use a string buffer: instead of one allocation per string, pre-allocate a large-ish char[] buffer, and every time a string field was parsed, append the data into the buffer. If the buffer becomes full, allocate a new one. Take a slice of the buffer corresponding to that field and cast it to string (this is safe since the algorithm was constructed never to write over previous parts of the buffer). This seemingly trivial optimization won me a performance improvement of an order of magnitude(!). This is particularly enlightening, since it suggests that even the overhead of copying all the string fields out of the original data into a new buffer does not add up to that much.

The new struct-based parser also returns an input range rather than an array; I found that constructing the array directly vs. copying from an input range didn't really make that big of a difference either. What did make a huge difference is reducing the number of allocations.

So the moral of the story is: avoid large numbers of small allocations. If you have to do it, consider consolidating your allocations into a series of allocations of large(ish) buffers instead, and taking slices of the buffers.
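A minimal sketch of the "suppress automatic collections, run them at a lower frequency" idea mentioned above (the once-per-million-records interval is arbitrary, chosen only for illustration):

import core.memory : GC;

void processAll(const(char)[][] records)
{
    GC.disable();               // suppress automatic collection cycles
    scope(exit) GC.enable();    // restore normal behaviour when done

    foreach (i, rec; records)
    {
        // ... parsing/conversion work that allocates ...

        if (i > 0 && i % 1_000_000 == 0)
            GC.collect();       // collect manually, much less often
    }
}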
(And on a tangential note, this backs up Walter's claim that string manipulation in C/C++ ultimately will lose, because of strcpy() and strlen(). Think of how many times in C/C++ code you have to copy string data just because you can't guarantee the incoming string will still be around after you return, and how many times you have to iterate over strings just because arrays are pointers and thus have no length. You couldn't write the equivalent of fastcsv in C/C++, because you'll leak memory and/or get dangling pointers, since you don't know what will happen to the incoming data after you return, so you can't just take slices of it. You'd be forced to malloc() all your strings, and then somehow ensure the caller will clean up properly. Ultimately you'd need a convoluted, unnatural API just to make sure the memory housekeeping is taken care of. Whereas in D, even though the GC is so atrociously slow, it *does* let you freely slice things to your heart's content with zero API complication, no memory leaks, and when done right, can even rival C/C++ performance, and that at a fraction of the mental load required to write leak-free, pointer-bug-free C/C++ code.)

T

--
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
Jan 26 2016
next sibling parent Gerald Jansen <gjansen ownmail.net> writes:
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
 ...
 So the moral of the story is: avoid large numbers of small 
 allocations. If you have to do it, consider consolidating your 
 allocations into a series of allocations of large(ish) buffers 
 instead, and taking slices of the buffers.
Many thanks for the detailed explanation.
Jan 27 2016
prev sibling next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
 Yeah, in the course of this exercise, I found that the one 
 thing that has had the biggest impact on performance is the 
 amount of allocations involved.  [...snip]
Really interesting discussion.
Jan 27 2016
prev sibling parent Laeeth Isharc <laeeth-nospam nospamlaeeth.com> writes:
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
 So the moral of the story is: avoid large numbers of small 
 allocations. If you have to do it, consider consolidating your 
 allocations into a series of allocations of large(ish) buffers 
 instead, and taking slices of the buffers.
Thanks for sharing this, HS Teoh.

I tried replacing allocations with a Region from std.experimental.allocator (with FreeList and Quantizer on top), and then just deallocating everything in one go once I am done with the data. It seems to be a little faster, but I haven't had time to measure it.

I just came across this C++ project, which seems to have astonishing performance: 7 minutes for reading a terabyte, and 2.5 to 4.5 GB/sec for reading a file cold. That's pretty impressive. (Obviously they read in parallel, but I haven't yet read the source to see what the other tricks might be.) It would be nice to be able to match that in D, though practically speaking it's probably easiest just to wrap it:

http://www.wise.io/tech/paratext
https://github.com/wiseio/paratext
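A minimal sketch of the Region idea, stripped down to a bare Region without the FreeList/Quantizer layers (the region size and buffer sizes are arbitrary, for illustration only):

import std.experimental.allocator : makeArray;
import std.experimental.allocator.building_blocks.region : Region;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // One big region: each allocation is just a pointer bump, and
    // everything is released at once when parsing is done.
    auto region = Region!Mallocator(64 * 1024 * 1024);

    auto scratch = makeArray!char(region, 4096);  // e.g. a row buffer
    // ... allocate parse buffers from `region` while reading the data ...

    region.deallocateAll();     // free the lot in one go
}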
Oct 29 2016
prev sibling next sibling parent reply Gerald Jansen <gjansen ownmail.net> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
 I have been reading large text files with D's csv file reader 
 and have found it slow compared to R's read.table function
This great blog post has an optimized FastReader for CSV files: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Jan 21 2016
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 January 2016 at 20:46:15 UTC, Gerald Jansen wrote:
 On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
 wrote:
 I have been reading large text files with D's csv file reader 
 and have found it slow compared to R's read.table function
This great blog post has an optimized FastReader for CSV files: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Thanks a lot Gerald, the blog and the discussions were very useful and revealing. For me it shows that you can use the D language to write fast code and then, if you need to, wring out more performance by going as low-level as you want, all without leaving the D language or its tooling ecosystem.
Jan 21 2016
prev sibling parent reply Jon D <jond noreply.com> writes:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
 I have been reading large text files with D's csv file reader 
 and have found it slow compared to R's read.table function 
 which is not known to be particularly fast.
FWIW - I've been implementing a few programs manipulating delimited files, e.g. tab-delimited. Simpler than CSV files because there is no escaping inside the data. I've been trying to do this in relatively straightforward ways, e.g. using byLine rather than byChunk. (Goal is to explore the power of D standard libraries).

I've gotten significant speed-ups in a couple different ways:
* DMD libraries 2.068+ - byLine is dramatically faster
* LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the DMD compiler
* Avoid utf-8 to dchar conversion - This conversion often occurs silently when working with ranges, but is generally not needed when manipulating data.
* Avoid unnecessary string copies. e.g. Don't gratuitously convert char[] to string.

At this point performance of the utilities I've been writing is quite good. They don't have direct equivalents with other tools (such as gnu core utils), so a head-to-head is not appropriate, but generally it seems the tools are quite competitive without needing to do my own buffer or memory management. And, they are dramatically faster than the same tools written in perl (which I was happy with).

--Jon
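As a small illustration of the utf-8 to dchar point above: operating on the raw code units (e.g. via std.string.representation) sidesteps the silent decoding that happens when a string is iterated as a range of dchar. A minimal sketch:

import std.algorithm : count, splitter;
import std.string : representation;

void main()
{
    string line = "a\tb\tc";

    // Splitting the raw ubyte[] avoids decoding each char into a dchar.
    auto fields = line.representation.splitter(cast(ubyte) '\t');
    assert(fields.count == 3);
}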
Jan 21 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote:
[...]
 FWIW - I've been implementing a few programs manipulating delimited
 files, e.g. tab-delimited. Simpler than CSV files because there is no
 escaping inside the data. I've been trying to do this in relatively
 straightforward ways, e.g. using byLine rather than byChunk. (Goal is
 to explore the power of D standard libraries).
 
 I've gotten significant speed-ups in a couple different ways:
 * DMD libraries 2.068+  -  byLine is dramatically faster
 * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the DMD compiler
While byLine has improved a lot, it's still not the fastest thing in the world, because it still performs (at least) one OS roundtrip per line, not to mention it will auto-reencode to UTF-8. If your data is already in a known encoding, reading in the entire file and casting to (|w|d)string then splitting it by line will be a lot faster, since you can eliminate a lot of I/O roundtrips that way.

In any case, it's well-known that gdc/ldc generally produce code that's about 20%-30% faster than dmd-compiled code, sometimes a lot more. While DMD has gotten some improvements in this area recently, it still has a long way to go before it can catch up. For performance-sensitive code I always reach for gdc instead of dmd.
 * Avoid utf-8 to dchar conversion - This conversion often occurs
 silently when working with ranges, but is generally not needed when
 manipulating data.
[...] Yet another nail in the coffin of auto-decoding. I wonder how many more nails we will need before Andrei is convinced... T -- The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
Jan 21 2016
parent Jon D <jond noreply.com> writes:
On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote:
 On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via 
 Digitalmars-d-learn wrote: [...]
 FWIW - I've been implementing a few programs manipulating 
 delimited files, e.g. tab-delimited. Simpler than CSV files 
 because there is no escaping inside the data. I've been trying 
 to do this in relatively straightforward ways, e.g. using 
 byLine rather than byChunk. (Goal is to explore the power of D 
 standard libraries).
 
 I've gotten significant speed-ups in a couple different ways:
 * DMD libraries 2.068+  -  byLine is dramatically faster
 * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the 
 DMD compiler
While byLine has improved a lot, it's still not the fastest thing in the world, because it still performs (at least) one OS roundtrip per line, not to mention it will auto-reencode to UTF-8. If your data is already in a known encoding, reading in the entire file and casting to (|w|d)string then splitting it by line will be a lot faster, since you can eliminate a lot of I/O roundtrips that way.
No disagreement, but I had other goals. At a high level, I'm trying to learn and evaluate D, which partly involves understanding the strengths and weaknesses of the standard library. From this perspective, byLine was a logical starting point.

More specifically, the tools I'm writing are often used in unix pipelines, so input can be a mixture of standard input and files. And, the files can be arbitrarily large. In these cases, reading the entire file is not always appropriate. Buffering usually is, and my code knows when it is dealing with files vs standard input and could handle these differently. However, standard library code could handle these distinctions as well, which was part of the reason for trying the straightforward approach.

Aside - Despite the 'learning D' motivation, the tools are real tools, and writing them in D has been a clear win, especially with the byLine performance improvements in 2.068.
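A minimal sketch of the stdin-vs-file handling described above, using the common pipeline convention that "-" means standard input (openInput is a hypothetical helper, not from the tools being discussed):

import std.stdio : File, stdin;

File openInput(string path)
{
    return (path == "-") ? stdin : File(path, "r");
}

void main(string[] args)
{
    auto input = openInput(args.length > 1 ? args[1] : "-");
    foreach (line; input.byLine)
    {
        // ... process one line at a time; works for files and pipes alike ...
    }
}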
Jan 21 2016