
digitalmars.D.learn - getting started with std.csv

reply "gjansen" <gjansen ownmail.net> writes:
Hi. I'm a D newbie(!) coming from a Fortran/C/Python background. I'm
struggling with the many new concepts needed in order to make any sense
out of the documentation or traceback messages (ranges/templates/...).
For example, the std.csv documentation is great but all the examples
read from a string rather than a file. I feel stupid but I'm having
trouble with the simple step of modifying the examples to read from a
file. I can read the whole file into a string in memory and then read
the records from the string just fine with csvReader (example A below),
or read a line at a time from the file and call csvReader using a single
line (example B below), but neither solution is satisfactory. In
practice I need to read files with up to 80 million records, so I'd like
to understand how to do this properly/efficiently.

tia, Gerald

Example A
=========
import std.stdio, std.file, std.csv;

void main()
{
     std.file.write("test.csv", "0,1,abc\n2,3,def");
     scope(exit) std.file.remove("test.csv");

     auto lines = readText!(string)("test.csv");

     struct Rec { int a,b; char[] c; }
     foreach (Rec r; csvReader!Rec(lines)) {
         writeln("struct -> ", r);
     }
}


Example B
=========
import std.stdio, std.file, std.csv;

void main()
{
     std.file.write("test.csv", "0,1,abc\n2,3,def");
     scope(exit) std.file.remove("test.csv");

     struct Rec { int a,b; char[] c; }
     Rec r;
     foreach (line; File("test.csv", "r").byLine) {
         r = csvReader!Rec(line).front;
         writeln("struct -> ", r);
     }
}

Output
======
struct -> Rec(0, 1, "abc")
struct -> Rec(2, 3, "def")
Apr 06 2015
parent reply "yazd" <yazan.dabain gmail.com> writes:
I got this to work with:

```
import std.stdio, std.file, std.csv, std.range;

void main()
{
	std.file.write("test.csv", "0,1,abc\n2,3,def");
	scope(exit) std.file.remove("test.csv");

	static struct Rec { int a, b; char[] c; }

	auto file = File("test.csv", "r");
	foreach (s; csvReader!Rec(file.byLine().joiner("\n")))
	{
		writeln("struct -> ", s);
	}
}
```

I am not sure about using `file.byLine()` here, because `byLine` 
reuses its buffer, but as far as I tested this is working 
correctly (for some reason; can anyone comment?).
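
On the buffer reuse: a minimal sketch of the case where it does matter, assuming the same `test.csv` as above. Slices kept from `byLine` across iterations may be overwritten when the next line is read, while `idup` copies are safe. The helper name `readLinesCopied` is made up for illustration. The `joiner` version probably gets away with it because the characters are consumed before the next line is read, but that explanation is an assumption, not something checked against the std.csv source.

```d
import std.stdio, std.file;

// Hypothetical helper: collects every line, copying each one out of
// byLine's reused buffer with idup.
string[] readLinesCopied(string path)
{
    string[] result;
    foreach (line; File(path, "r").byLine)
        result ~= line.idup;
    return result;
}

void main()
{
    std.file.write("test.csv", "0,1,abc\n2,3,def");
    scope(exit) std.file.remove("test.csv");

    // Risky: each element is a slice of byLine's internal buffer,
    // so earlier entries may be overwritten by later reads.
    char[][] aliased;
    foreach (line; File("test.csv", "r").byLine)
        aliased ~= line;
    writeln("aliased -> ", aliased); // contents are not guaranteed

    // Safe: every line owns its own copy of the data.
    writeln("copied  -> ", readLinesCopied("test.csv"));
}
```

The copy costs an allocation per line, so it is only worth paying when the lines have to outlive the loop iteration that produced them.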
Apr 06 2015
parent reply "yazd" <yazan.dabain gmail.com> writes:
Btw, `joiner` is a lazy algorithm. In other words, it doesn't join the whole file when it is called; it produces the joined characters only as they are consumed. This reduces the memory requirements, as you never need the whole file in memory at once.
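
A small sketch of that laziness, using an in-memory stand-in for the file. `CountingLines` is a hypothetical range written only for this illustration; it counts how many lines have been popped so the effect is observable.

```d
import std.algorithm : equal, joiner;
import std.range : take;
import std.stdio : writeln;

// Hypothetical input range that counts how many lines were consumed.
struct CountingLines
{
    string[] lines;
    size_t* consumed; // shared counter, since ranges are copied by value

    @property bool empty() { return lines.length == 0; }
    @property string front() { return lines[0]; }
    void popFront() { lines = lines[1 .. $]; ++(*consumed); }
}

void main()
{
    size_t consumed = 0;
    auto source = CountingLines(["0,1,abc", "2,3,def", "4,5,ghi"], &consumed);

    // Building the joined range does not walk the whole input...
    auto joined = source.joiner("\n");

    // ...and reading three characters only needs the first line.
    assert(equal(joined.take(3), "0,1"));
    assert(consumed < 3);
    writeln("lines consumed so far: ", consumed);
}
```

With a real file the same thing happens through `byLine`: a line is only read when the consumer has used up the previous one.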
Apr 06 2015
parent reply "yazd" <yazan.dabain gmail.com> writes:
Also, replace `std.range` with `std.algorithm` in the import line, since that is where `joiner` is defined.
Apr 07 2015
parent reply "gjansen" <gjansen ownmail.net> writes:
Many thanks for the feedback yazd! I've tested the approach with 
a large CSV file and it works fine. csvReader is very convenient, 
but unfortunately it is no speed demon. To my dismay it was 
much slower (about 4x) than a simple approach I am using in 
Python, which is essentially equivalent to 
chomp(line).split(','). I guess I'll have to keep studying and 
learning. Thx again.
Apr 07 2015
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
What compiler are you using? What compilation flags?
Apr 07 2015
parent reply "gjansen" <gjansen ownmail.net> writes:
dmd -O (2.066.1) and gdc -O3 (4.9.2)

But... as I tried to convey, I was comparing apples to oranges. I 
have now rewritten the D test simply using split(',') instead of 
csvReader, to be more similar to the Python test, and it runs 
about 2x faster with dmd and about 4x faster with gdc than the 
Python 3.4.3 version. :-)
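
For reference, a minimal sketch of what a split-based reader can look like; it is not the actual test code. The `Rec` layout is borrowed from the examples earlier in the thread, and `parseLine` is a hypothetical helper. Unlike csvReader, this assumes no quoted fields and no commas inside a field.

```d
import std.stdio, std.file;
import std.array : split;
import std.conv : to;
import std.string : chomp;

struct Rec { int a, b; string c; }

// Hypothetical helper: strips the line ending, splits on commas and
// converts the three fields. No handling of quoting or short lines.
Rec parseLine(const(char)[] line)
{
    auto fields = line.chomp.split(',');
    // idup copies the text out of byLine's reused buffer.
    return Rec(fields[0].to!int, fields[1].to!int, fields[2].idup);
}

void main()
{
    std.file.write("test.csv", "0,1,abc\n2,3,def");
    scope(exit) std.file.remove("test.csv");

    foreach (line; File("test.csv", "r").byLine)
        writeln("struct -> ", parseLine(line));
}
```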

Apr 07 2015
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
Also consider -inline and -release for dmd, and -frelease for gdc. With gdc, if you are building for a specific CPU family (e.g. broadwell), -march= can provide improvements; -march=native targets the same CPU as the host machine.
Apr 07 2015