
digitalmars.D.learn - How to read files fast (I/O operations)

reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
Dear all,

I am looking to parse huge files efficiently, but I think D is lacking 
for this purpose.
To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in 
C++) needs 2 minutes.

My code is maybe not simple, as parsing a fastq file is not easy 
and is even harder when using a memory-mapped file.

I do not see where I can gain performance, as I do not make many copies 
and I use MmFile.
fastxtoolkit does not use a memory-mapped file and stores its result into a 
struct array for each sequence, but it is still faster!

Thanks for any help. I hope we can create a faster parser; otherwise 
D is too slow to use instead of C++.
Feb 04 2013
next sibling parent "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
code: http://dpaste.dzfl.pl/79ab0e17
fastxtoolkit: 
http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
| - fastx_quality_stats.c ->  read_file()
| - libfastx/fastx.c      -> fastx_read_next_record()
Feb 04 2013
prev sibling next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-04 15:04, bioinfornatics wrote:
 I am looking to parse huge files efficiently, but I think D is lacking for this purpose.
 To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.

 My code is maybe not simple, as parsing a fastq file is not easy and is even harder when using a memory-mapped file.
Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks:

    foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
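For illustration, here is a minimal sketch of that chunked approach applied to the letter-counting task (not code from the thread; it assumes plain ASCII input and does no per-record parsing):

//----
import std.stdio;

void main(string[] args)
{
    enum chunkSize = 1 << 20;              // 1 MiB read buffer
    ulong[256] counts;                     // one slot per possible byte value
    auto file = File(args[1], "rb");
    foreach (ubyte[] chunk; file.byChunk(chunkSize))
        foreach (b; chunk)
            ++counts[b];                   // no decoding, no per-line logic
    writefln("A:%s C:%s G:%s T:%s",
             counts['A'], counts['C'], counts['G'], counts['T']);
}
//----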
Feb 04 2013
parent reply Dejan Lekic <dejan.lekic gmail.com> writes:
FG wrote:

 On 2013-02-04 15:04, bioinfornatics wrote:
 I am looking to parse huge files efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.

 My code is maybe not simple, as parsing a fastq file is not easy and is even harder when using a memory-mapped file.
Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks:

    foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
I would go even further, and organise the file so N Data objects fit one page, and read the file page by page. The page size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :)

-- 
Dejan Lekic
dejan.lekic (a) gmail.com
http://dejan.lekic.org
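A rough sketch of that page-by-page idea (not Dejan's code): it assumes a hypothetical fixed-size Data record and hard-codes a 4096-byte page, where a real program would query the page size from the OS:

//----
import std.stdio;

struct Data { ubyte[64] payload; }          // hypothetical fixed-size record

void main(string[] args)
{
    enum pageSize = 4096;                   // assumed; ask the OS in real code
    enum perPage  = pageSize / Data.sizeof; // whole records per page
    auto file = File(args[1], "rb");
    auto page = new Data[perPage];
    while (true)
    {
        auto records = file.rawRead(page);  // reads up to one page of records
        if (records.length == 0)
            break;
        // ... process records[0 .. $] here ...
    }
}
//----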
Feb 04 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:
 FG wrote:

 On 2013-02-04 15:04, bioinfornatics wrote:
 I am looking to parse huge files efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.

 My code is maybe not simple, as parsing a fastq file is not easy and is even harder when using a memory-mapped file.
Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks:

    foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
I would go even further, and organise the file so N Data objects fit one page, and read the file page by page. The page-size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :)
AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach.

I'm just wondering if maybe the reason the D code is slow is not just because of:
- unicode.
- front + popFront.

Ranges in D are "notorious" for being slow to iterate on text, due to the "double decode". If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.
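As an illustration (a sketch, not code from the thread): walk the data as raw bytes and reinterpret a byte as char only where needed, so no UTF-8 decoding ever happens:

//----
import std.stdio;

// Count upper-case letters without any UTF-8 decoding: the buffer is walked
// as raw bytes and each byte is reinterpreted as a char on the fly.
size_t countLetters(const(ubyte)[] data)
{
    size_t n;
    foreach (b; data)              // plain byte iteration, no front/popFront decode
    {
        char c = cast(char) b;     // safe under the pure-ASCII assumption
        if (c >= 'A' && c <= 'Z')
            ++n;
    }
    return n;
}

void main(string[] args)
{
    auto file = File(args[1], "rb");
    size_t total;
    foreach (ubyte[] chunk; file.byChunk(1 << 20))
        total += countLetters(chunk);
    writeln(total, " letters");
}
//----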
Feb 04 2013
next sibling parent Brad Roberts <braddr slice-2.puremagic.com> writes:
On Mon, 4 Feb 2013, monarch_dodra wrote:

 AFAIK, he is reading text data that needs to be parsed line by line, so
 byChunk may not be the best approach. Or at least, not the easiest approach.
 
 I'm just wondering if maybe the reason the D code is slow is not just because
 of:
 - unicode.
 - front + popFront.
First rule of performance analysis.. don't guess, measure.
Feb 04 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-04 20:39, monarch_dodra wrote:

 AFAIK, he is reading text data that needs to be parsed line by line, so
 byChunk may not be the best approach. Or at least, not the easiest
 approach.
He can still read a chunk from the file, or the whole file and then read that chunk line by line.
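For instance (a sketch, not Jacob's code), the whole file can be pulled in with std.file.read and then walked line by line entirely in memory:

//----
import std.algorithm : splitter;
import std.file : read;
import std.stdio;

void main(string[] args)
{
    // One big read (fine as long as the file fits in RAM),
    // then the lines are iterated without any further IO.
    auto data = cast(const(ubyte)[]) read(args[1]);
    size_t lines;
    foreach (line; splitter(data, cast(ubyte) '\n'))
        ++lines;
    writeln(lines, " lines");
}
//----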
 I'm just wondering if maybe the reason the D code is slow is not just
 because of:
 - unicode.
 - front + popFront.

 ranges in D are "notorious" for being slow to iterate on text, due to
 the "double decode".

 If you are *certain* that the file contains nothing but ASCII (which
 should be the case for fastq, right?), you can get more bang for your
 buck if you attempt to iterate over it as an array of bytes, and convert
 the bytes to char on the fly, bypassing any and all unicode processing.
Depending on what you're doing you can blast through the bytes even if it's Unicode. It will of course not validate the Unicode.

-- 
/Jacob Carlborg
Feb 04 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
Instead of calling MmFile's opIndex to read ubyte by ubyte, I tried 
to read into a buffer array of length PAGESIZE.

Code here: http://dpaste.dzfl.pl/25ee34fc

It is not faster: to parse 12 GB I still need 11 minutes. I do not 
see how I could read the file faster!

As a reminder, fastxtoolkit needs 2 minutes!
Feb 06 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 6 February 2013 at 10:43:02 UTC, bioinfornatics 
wrote:
 Instead of calling MmFile's opIndex to read ubyte by ubyte, I tried 
 to read into a buffer array of length PAGESIZE.

 Code here: http://dpaste.dzfl.pl/25ee34fc

 It is not faster: to parse 12 GB I still need 11 minutes. I do not 
 see how I could read the file faster!

 As a reminder, fastxtoolkit needs 2 minutes!
This might be stupid, but I see a "writeln" in your inner loop. You aren't slowed down just by your console by any chance?

If I were you, I'd start benching to try and see who is slowing you down. I'd reorganize the code to parse a file that is, say, 512 MB. The rationale being you can place it entirely in memory at once. Then, I'd shift the logic from "fully process each character before moving to the next character" to "make a full processing pass on the entire data structure, before moving to the next pass". The steps I see that need to be measured are:
* Raw read of file
* Iterating on your file to extract it as a raw array of "Data" objects
* Processing the Data objects
* Outputting the data
(a minimal timing skeleton is sketched below)

Also, (of course), you need to make sure you are compiling in release (might sound obvious, but you never know). Are you using dmd? I heard the "other" compilers are faster.

I'm going to try and see with some example files if I can't get something running faster.
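A minimal skeleton for timing those phases separately (a sketch; the phase bodies are placeholders):

//----
import std.datetime : Clock;
import std.stdio;

void main(string[] args)
{
    auto t0 = Clock.currTime();
    // ... phase 1: raw read of the file ...
    auto t1 = Clock.currTime();
    // ... phase 2: extract the raw array of "Data" objects ...
    auto t2 = Clock.currTime();
    // ... phase 3: process the Data objects ...
    auto t3 = Clock.currTime();
    writeln("read:    ", t1 - t0);
    writeln("extract: ", t2 - t1);
    writeln("process: ", t3 - t2);
}
//----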
Feb 06 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra 
wrote:
 I'm going to try and see with some example files if I can't get 
 something running faster.
Benchmarking and tweaking, I was able to find 3 things that speed up your program:

1) Make the computeLocal a compile-time constant. This will give you a tinsy bit of performance. Depends on if you plan to make it a run-time argument switch I guess.

2) Makes things about 10%-20% faster: your "nucleic" and "amino" hash tables map a character to an index. However, given the range of the characters ('A' to 'Z'), you are better off doing a flat array, where each index represents a character, eg: A is index 0, B is index 1. This way, lookup is a simple array indexing, as opposed to a hash table indexing. You may even get a bigger bang for your buck by simply giving your "_stats" structure a simple "A is index 0, B is index 1" layout, and only "re-order" the data at the end, when you want to read it. (I haven't done this though.) (A sketch of this idea is shown below.)

3) Makes things about 100% faster (ran in half the time on my machine): I don't know how mmFile works, but a simple File + "rawRead" seems to get the job done fast. Also, instead of keeping track of (several) indexes, I merely keep a single slice. The only thing I care about is if my slice is empty, in which case I re-fill it.

The modified code is here: http://dpaste.dzfl.pl/9b9353b8
I'm apparently getting the same output you are, but that doesn't mean there might not be bugs in it. For example, I noticed that you don't strip leading whites, if any, before the first read.

----

I'd be tempted to re-write the parser using a "byLine" approach, since my quick reading about fastq seems to imply it is a line-based format. Or just plain try to write a parser from scratch, putting my own logic and thought into it (all I did was modify your code, without caring about the actual algorithm).
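A minimal sketch of the flat-array lookup from point 2 (illustrative only, not the code behind the dpaste link):

//----
import std.stdio;

ulong['Z' - 'A' + 1] counts;          // one slot per upper-case letter, no hashing

void collect(const(char)[] sequence)
{
    foreach (c; sequence)
        if (c >= 'A' && c <= 'Z')
            ++counts[c - 'A'];        // plain array indexing in the inner loop
}

void main()
{
    collect("GATTACA");
    writeln("A=", counts['A' - 'A'], "  G=", counts['G' - 'A']);
}
//----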
Feb 06 2013
prev sibling parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
I use both gdc and ldc with the "-w -O -release" flags.

The writeln inside the loop is never evaluated, as the computeLocal boolean is 
always false.


Thanks in any case; I continue to read all your answers :-)
Feb 06 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics 
wrote:
 I use both gdc and ldc with the "-w -O -release" flags.

 The writeln inside the loop is never evaluated, as the computeLocal boolean is always false.


 Thanks in any case; I continue to read all your answers :-)
Just to add more information about fastq:
http://www.biomedsearch.com/nih/Sanger-FASTQ-file-format-sequences/20015970.html

And here is a set of fastq files where a parser should succeed or fail:
http://www.biomedsearch.com/attachments/00/20/01/59/20015970/gkp1137_nar-02248-d-2009-File005.gz

The problem is that a sequence line can be split over several lines, and the same goes for the quality line. And I think these lines should be allowed to contain whitespace. The @ is used to identify an identifier line and the + is used to identify a description line, but these characters can also appear as quality values (ubyte).

I agree the format spec is really bad, but it is heavily used in biology, so I would like a fast parser in order to develop some D applications instead of using C++. I will try all the previous recommendations later; thanks to all. It seems that in any case it is not easy to parse a file fast in D.

Note: is it possible to lock a file, to be able to use a pure method?
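One common way to deal with that ambiguity (a sketch, assuming, as the Sanger FASTQ description requires, that the quality data has exactly the same length as the sequence): accumulate quality lines until their total length matches the sequence length, instead of treating '+' or '@' as hard record delimiters:

//----
import std.algorithm : min, startsWith;
import std.stdio;
import std.string : chomp;

// Reads one record from `lines` (here a string[], but File.byLine-with-dup
// would do). Minimal sketch: assumes records start with '@' and that the
// quality data has exactly the same length as the sequence.
bool readRecord(ref string[] lines, out string id, out string seq, out string qual)
{
    if (lines.length == 0) return false;
    id = lines[0].chomp;                              // '@...' identifier line
    size_t i = 1;
    while (i < lines.length && !lines[i].startsWith("+"))
        seq ~= lines[i++].chomp;                      // sequence may span lines
    ++i;                                              // skip the '+' description line
    while (i < lines.length && qual.length < seq.length)
        qual ~= lines[i++].chomp;                     // '+'/'@' here are quality values
    lines = lines[min(i, lines.length) .. $];
    return true;
}

void main()
{
    auto lines = ["@id1", "ACGT", "ACGT", "+", "!!!!", "@AB!"];
    string id, seq, qual;
    while (readRecord(lines, id, seq, qual))
        writeln(id, ": ", seq, " / ", qual);
}
//----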
Feb 06 2013
next sibling parent "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
this/these

sorry
Feb 06 2013
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics 
wrote:
 It seem in any case is not easy to parse fastly a file in D
I don't think that's true. D provides the same "FILE" primitive you'd get in C, so there is no reason for it to be slower than C. It is the "range" approach that, as convenient as it is, is not well adapted for certain things.

As I had said, I tried to write my own program. In it, I devised a range that, instead of exposing things to parse character by character, parses an entire "object" (a ... "genome" ... maybe? I called them "Q" in my program) at once into an object. I decided to use the very simple "byLine" primitive. From there, you can query the object for their name/sequence/quality.

The irony is that by "parsing twice" (once to do the io read, once to do the actual processing of the text), and taking into account I'm allocating each object individually, I'm running twice as fast as my original already improved implementation. Not only is it faster, it is also more convenient, since you can extract an entire Q object at once, and then operate on that as you would so please: separation of algorithm and parsing.

It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character.

Now: keep in mind that this approach allocates (3) new strings for each Q. You could *try* an approach with a pre-allocated re-useable buffer. This would mean you can only operate on 1 Q at once, but you'd probably iterate on them faster.

In any case, you can try it out: http://dpaste.dzfl.pl/8bdd0c84
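A bare-bones sketch of that kind of range (not the dpaste code; it assumes exactly one sequence line and one quality line per record and does no error handling):

//----
import std.stdio;

struct Q { string id; string sequence; string quality; }

// Input range over a fastq File that yields whole Q records at once,
// built on the plain byLine primitive.
struct QRange
{
    private typeof(File.init.byLine()) lines;
    private Q current;
    private bool done;

    this(File f) { lines = f.byLine(); popFront(); }

    @property bool empty() { return done; }
    @property Q front() { return current; }

    void popFront()
    {
        if (lines.empty) { done = true; return; }
        current.id = lines.front.idup;       lines.popFront();   // '@' line
        current.sequence = lines.front.idup; lines.popFront();
        lines.popFront();                                        // skip '+' line
        current.quality = lines.front.idup;  lines.popFront();
    }
}

void main(string[] args)
{
    foreach (q; QRange(File(args[1], "r")))
        writeln(q.id, " has ", q.sequence.length, " bases");
}
//----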
Feb 06 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 6 February 2013 at 16:06:20 UTC, monarch_dodra 
wrote:
 It correctly takes into account that a sequence can be multiple 
 lines. It does not strip whitespace because according to 
 http://maq.sourceforge.net/fastq.shtml whitespace is not a 
 legal character.
Hum, just read your example files. I guess you can have whitespace. In any case, that should not pose any real problem.
would come in handy here.
Feb 06 2013
prev sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
06.02.2013 19:40, bioinfornatics wrote:
 On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
 I agree the spec format is really bad but it is heavily used in biology
 so i would like a fast parser to develop some D application instead to
 use C++.
Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding!

The situation can be improved only if:
1. We will find and kill every text format creator;
2. We will create a really good binary format for each such task and support it in every application we create. So after some time text formats will just die because of evolution as everything will support better formats.

(the second proposal is a real recommendation)

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
Feb 07 2013
parent "Jay Norwood" <jayn prismnet.com> writes:
On Friday, 8 February 2013 at 06:22:18 UTC, Denis Shelomovskij 
wrote:
 06.02.2013 19:40, bioinfornatics wrote:
 On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics 
 wrote:
 I agree the spec format is really bad but it is heavily used 
 in biology
 so i would like a fast parser to develop some D application 
 instead to
 use C++.
Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding!

The situation can be improved only if:
1. We will find and kill every text format creator;
2. We will create a really good binary format for each such task and support it in every application we create. So after some time text formats will just die because of evolution as everything will support better formats.

(the second proposal is a real recommendation)
There is a binary resource format for emf models, which normally use xml files, and some timing improvements stated at this link. It might be worth looking at this if you are thinking about writing your own binary format.
http://www.slideshare.net/kenn.hussey/performance-and-extensibility-with-emf

There is also a fast binary compression library named blosc that is used in some python utilities, measured and presented here, showing that it is faster than doing a memcpy if you have multiple cores.
http://blosc.pytables.org/trac

On the sequential accesses ... I found that windows writes blocks of data all over the place, but the best way to get it to write something in more contiguous locations is to modify the file output routines to specify write-through. The sequential accesses didn't improve read times on ssd. Most of the decent ssds can read big files at 300MB/sec or more now, and you can raid 0 a few of them and read 800MB/sec.
Dec 18 2013
prev sibling parent reply FG <home fgda.pl> writes:
On 2013-02-04 15:04, bioinfornatics wrote:
 I am looking to parse huge files efficiently, but I think D is lacking for this purpose.
 To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.
Haven't compared to fastxtoolkit, but I have some code for you.

I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and expect this syntax (no multiline sequences or whitespace). The file takes up almost 6 GB and processing took 1m45s -- twice as fast as the fastest D solution so far -- all compiled with gdc -O3. I bet your computer has better specs than mine.

The program uses a buffer that should be twice the size of the largest sequence record (counting id, comment and quality data). A chunk of the file is read, then records are scanned on the buffer until the record start pointer passes the middle of the buffer -- then memcpy is used to move all the rest to the beginning of the buffer and the remaining space at the end is filled with another chunk read from the file.

Data contains both the sequence letters and associated quality information. Sequence ID and comment are slices of the buffer, so they have valid info until you move to the next sequence (and the number increments).

This is the code: http://dpaste.1azy.net/8424d4ac
Tell me what timings you can get now.
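A stripped-down sketch of that buffering scheme (not the dpaste code; the record-scanning step is left as a placeholder, and memmove stands in for the memcpy since it is the safe equivalent):

//----
import core.stdc.string : memmove;
import std.stdio;

void main(string[] args)
{
    enum bufSize = 1 << 20;                 // should be >= 2x the largest record
    auto file = File(args[1], "rb");
    auto buf = new ubyte[bufSize];
    size_t filled = file.rawRead(buf).length;
    size_t pos;                             // start of the current, unparsed record

    while (pos < filled)
    {
        // ... scan one record in buf[pos .. filled] and advance pos ...
        pos = filled;                       // placeholder: pretend everything was consumed

        if (pos > buf.length / 2)           // record start passed the middle: refill
        {
            size_t tail = filled - pos;
            memmove(buf.ptr, buf.ptr + pos, tail);   // keep the unconsumed tail at the front
            filled = tail + file.rawRead(buf[tail .. $]).length;
            pos = 0;
        }
    }
}
//----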
Feb 06 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
 On 2013-02-04 15:04, bioinfornatics wrote:
 I am looking to parse huge files efficiently, but I think D is lacking for this purpose.
 To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.
Haven't compared to fastxtoolkit, but I have some code for you. I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so far
Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
 Data contains both sequence letter and associated quality 
 information.
 Sequence ID and comment are slices of the buffer, so they have 
 valid info
 until you move to the next sequence (and the number increments).
Hum. Mine allocates new slices, so they are never invalidated :) Mine also takes into account newlines and lowercase sequences.

Still, it seems you and I both took different approaches. I had mentioned using a re-useable buffer. I'm going to try to consume some of your code to see if I can't improve my implementation.

bioinfornatics, I'm getting real interested in the subject. I'm going to try to write an actual library/framework for working with fastq files in a D environment. This means I'll try to write robust and useable code, with both stability and performance in mind, as opposed to the "proofs of concept" so far.

For now, I'd like to keep it simple: would something that only knows how to parse/write Sanger FASTQ files be of help to you? If I write something, can I have you review it?
Feb 06 2013
next sibling parent reply "bioinfornatics" <bioinfornatics gmail.com> writes:
Thanks monarch and FG,
I will read your code to see where I am failing :-)
And of course, if you are interested in bio formats, I will be really 
happy to work / review together.

In any case, big thanks; that is a very interesting subject.
Feb 06 2013
parent reply Lee Braiden <leebraid gmail.com> writes:
On 06/02/13 22:21, bioinfornatics wrote:
 Thanks monarch and FG,
 I will read your code to see where I am failing :-)
I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task.

I guess you'd need a big task: the task would need to load and return n chunks or n lines, rather than just one line at a time, for example, and the processing/parsing thread (main thread or otherwise) could then churn through that while more IO was done.

It would also depend on the size of the file: no point firing up a thread just to read a tiny file that the filesystem can return in a millisecond. If you're talking about 1+ minutes of loading though, a thread should definitely help.

Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you could split the file in half, find the remaining parts of the record in the second half, move it to the first, and then process the two halves in two threads. If you've a nice function to do that split cleanly, and n cpus, then just call it some more.

-- 
Lee
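A small sketch of the separate-IO-thread idea using std.concurrency (not code from the thread; the per-chunk "parse" step is just a byte count here):

//----
import std.concurrency;
import std.stdio;

// Reader thread: streams file chunks to the owner thread while it parses.
void reader(Tid owner, string path)
{
    auto f = File(path, "rb");
    foreach (ubyte[] chunk; f.byChunk(1 << 20))
        send(owner, chunk.idup);           // idup: byChunk reuses its buffer
    send(owner, true);                     // end-of-file marker
}

void main(string[] args)
{
    auto tid = spawn(&reader, thisTid, args[1]);
    ulong total;
    bool done;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) { total += chunk.length; /* parse here */ },
            (bool eof) { done = true; }
        );
    }
    writeln(total, " bytes processed");
}
//----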
Feb 06 2013
parent FG <home fgda.pl> writes:
On 2013-02-07 00:41, Lee Braiden wrote:
 I wasn't going to mention this as I thought the CPU usage might be trivial, but
 if both CPU and IO are factors, then it would probably be beneficial to have a
 separate IO thread/task.
This wasn't an issue in my version of the program. It took 1m55s to process the file, but then again it takes 1m44s just to read it (as shown previously).
 Also, if you don't strictly need to parse the file in order, then you could
 divide and conquer it by breaking it into more sections/tasks. For example, if
 you're parsing records, you cold split the file in half, find the remaining
 parts of the record in the second half, move it to the first, and then process
 the two halves in two threads.  If you've a nice function to do that split
 cleanly, and n cpus, then just call it some more.
Now, this could make a big difference! If only parsing out of order is acceptable in this case.
Feb 06 2013
prev sibling next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-06 21:43, monarch_dodra wrote:
 On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
 I have processed the file SRR077487_1.filt.fastq from
 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
 and expect this syntax (no multiline sequences or whitespace).
 File takes up almost 6 GB processing took 1m45s - twice as fast as the
 fastest D solution so far
Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
Yes. Maybe CPU is the bottleneck on my end. With DMD32 2.060 on win7-64 compiled with same flags I got: MD: 4m30 / FG: 1m55s - both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s?
 I'm getting real interested on the subject. I'm going to try to write an actual
 library/framework for working with fastq files in a D environment.
Those fastq are contagious. ;)
 This means I'll try to write robust and useable code, with both stability and
 performance in mind, as opposed to the "proofs of concepts in so far".
Yeah, but the big deal was that D is 5.5x slower than C++. You have mentioned something about using byLine. Well, I would have gladly used it instead of looking for line ends myself and pushing stuff with memcpy. But the thing is that while the fgets(char *buf, int bufSize, FILE *f) in fastx is fast in reading the file by line, using file.readln(buf) is unpredictable. :) I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC it can be several times slower. For example, just reading in a loop:

import std.stdio;

enum uint bufferSize = 4096 - 16;

void main(string[] args)
{
    char[] tmp, buf = new char[bufferSize];
    size_t cnt;
    auto f = File(args[1], "r");
    switch (args[2])
    {
        case "raw":
            do tmp = f.rawRead(buf); while (tmp.length);
            break;
        case "readln":
            do cnt = f.readln(buf); while (cnt);
            break;
        default:
            writeln("Use parameters: <filename> raw|readln");
    }
}

Tested on a much smaller SRR077487.filt.fastq:
    DMD32 -release -O -inline:  raw 94ms   / readln 450ms
    GDC64 -O3:                  raw 94ms   / readln 6.76s
Tested on SRR077487_1.filt.fastq:
    DMD32 -release -O -inline:  raw 1m44s  / readln 1m55s
    GDC64 -O3:                  raw 1m48s  / readln 14m16s

Why such a big difference between DMD and GDC (on Windows)?
(or have I missed some switch in GDC?)
Feb 06 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:
 On 2013-02-06 21:43, monarch_dodra wrote:
 On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
 I have processed the file SRR077487_1.filt.fastq from
 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
 and expect this syntax (no multiline sequences or whitespace).
 File takes up almost 6 GB processing took 1m45s - twice as 
 fast as the
 fastest D solution so far
Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
Yes. Maybe CPU is the bottleneck on my end. With DMD32 2.060 on win7-64 compiled with same flags I got: MD: 4m30 / FG: 1m55s - both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s?
Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are?

My attempt was mostly to try and see how fast we could go, while doing it only with high level stuff (eg, no fSomething calls). Probably, going lower level, and parsing the text manually, waiting for magic characters, could yield better results (like what you did).

I'm going to also try playing around with threads: just last week I wrote a program that did exactly this (asynchronous file reads). That said, I'll be making this priority n°2. I'd like to make the parser work perfectly first, and in a way that is easily upgradeable/useable. Mr. bio made it perfectly clear that he needed support for whites and line feeds ;)
Feb 06 2013
parent reply FG <home fgda.pl> writes:
On 2013-02-07 08:26, monarch_dodra wrote:
 You have timed the same file SRR077487_1.filt.fastq at 67s?
Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are?
Ah, now that you mention SSD, I moved the file onto one and it's even more clear that I am CPU-bound here on the Intel E6600 system. Compare:

    7200rpm:  MD 4m30s / FG 1m55s
    SSD:      MD 4m14s / FG 1m44s

Almost the same, but running the utility "wc -l" on the file renders:

    7200rpm:  1m45s
    SSD:      0m33s

In my case threads would be beneficial but only when using the SSD. Reading the file by chunk in D takes 33s on SSD and 1m44s on HDD. Slicing the file in half and reading from both threads would also be fine only on the SSD, because on a HDD I'd lose sequential disk reads jumping between threads (expecting lower performance). Therefore -- threads: yes, but gotta use an SSD. :) Also, threads: yes, if there's gonna be more processing than just counting letters.
Feb 07 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
Little feedback:
I named FG's script f and monarch's script monarch.

  gdmd -O -w -release f.d
~ $ time ./f bigFastq.fastq
['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 
'G':1353023772]

real	2m14.966s
user	0m47.168s
sys	0m15.379s
~ $ gdmd -O -w -release monarch.d
monarch.d:117: no identifier for declarator Lines
monarch.d:117: alias cannot have initializer
monarch.d:130: identifier or integer expected, not assert


I haven't taken the time to look into it more.

But in any case, it seems a memory-mapped file is really slow, 
whereas it is said to be the fastest way to read a file. Creating an 
index when just reading the file takes 12 minutes is useless, as 
reading and computing should only need 2 minutes.
Feb 07 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics 
wrote:
 Little feedback:
 I named FG's script f and monarch's script monarch.

  gdmd -O -w -release f.d
 ~ $ time ./f bigFastq.fastq
 ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 
 'G':1353023772]

 real	2m14.966s
 user	0m47.168s
 sys	0m15.379s
 ~ $ gdmd -O -w -release monarch.d
 monarch.d:117: no identifier for declarator Lines
 monarch.d:117: alias cannot have initializer
 monarch.d:130: identifier or integer expected, not assert


 I haven't taken the time to look into it more.

 But in any case, it seems a memory-mapped file is really slow, 
 whereas it is said to be the fastest way to read a file. Creating an 
 index when just reading the file takes 12 minutes is useless, as 
 reading and computing should only need 2 minutes.
You must be using dmd 2.060. I'm using some 2.061 features, namely the "new style" alias. Just change line 117:

    alias Lines = typeof(File.init.byLine());

to

    alias typeof(File.init.byLine()) Lines;

As for 130, it's a "version(assert)", eg, code that does not get executed in release. Just remove the "version(assert)"; if it gets executed, it is not a big deal.

In any case, I think the code is mostly "proof", I wouldn't use it as is.

------------

BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
Feb 07 2013
next sibling parent "bioinfornatics" <bioinfornatics gmail.com> writes:
On Thursday, 7 February 2013 at 14:42:57 UTC, monarch_dodra wrote:
 [...]

 BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
ubyte; as it is a number, it is maybe easier to understand and to apply a cutoff at some value.
Feb 07 2013
prev sibling parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
And use size_t instead of int as the return type for the getChar/getInt 
methods.

gdmd -w -O -release monarch.d
~ $ time ./monarch 
/env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
globalStats:
A: 1007129068. C: 1350576504. G: 1353023772. M:   0. D:   0. S:   
0. H:   0. N: 39413. V:   0. U:   0. W:   0. R:   0. B:   0. Y:   
0. K:   0. T: 999786820.
time: 176585

real	2m56.635s
user	2m31.376s
sys	0m23.077s


This program is a little slower than FG's program.

About the parser: I would like to create a set of biology parsers and put 
them into a lib with a set of common computations, such as a letter counter.
For example, you could run a letter-counting computation over a fasta or 
fastq file, or rename identifiers over a fasta or fastq file.
Feb 08 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 8 February 2013 at 09:08:48 UTC, bioinfornatics wrote:
 And use size_t instead of int as the return type for the getChar/getInt 
 methods.

 gdmd -w -O -release monarch.d
 ~ $ time ./monarch 
 /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
 globalStats:
 A: 1007129068. C: 1350576504. G: 1353023772. M:   0. D:   0. S:
   0. H:   0. N: 39413. V:   0. U:   0. W:   0. R:   0. B:   0. 
 Y:   0. K:   0. T: 999786820.
 time: 176585

 real	2m56.635s
 user	2m31.376s
 sys	0m23.077s


 This program is a little slower than FG's program.
I've re-tried running both mine and FG's on a HDD based machine, with dmd, -O -release, and optionally -inline.

I also wrote a new parser, which does as FG suggested, and just parses straight up (byLine is indeed more expensive). This one handles whites and line breaks correctly. It also accepts lines of any size (the internal buffer is auto-grow).

My results are different from yours though:

             w/o inline    w/ inline
    FG       105s          77s
    MD       72s           64s
    newMD    61s           59s

I have no idea why you guys are getting better results with FG, and I'm getting better results with mine. Is this a win/linux or dmd/gdc issue? My new parser is based on raw reads, so that should be much faster on your machines.
 About the parser: I would like to create a set of biology parsers and put 
 them into a lib with a set of common computations, such as a letter counter.
 For example, you could run a letter-counting computation over a fasta or 
 fastq file, or rename identifiers over a fasta or fastq file.
I don't really understand what all that means. In any case, I've been able to implement some cool features so far. My parser is a "true" range you can pass around, and you won't have any problems with it. It returns "shallow" objects that reference a mutable string; however, the user can call "dup" or "idup" to have a new object.

Said objects can be printed directly, so there is no need for a specialized "writer". As a matter of fact, this little program will allow you to "clean" a file (strip spaces), and potentially, line-wrap at 80 chars:

//----
import std.stdio;
import fastq.parser;
import fastq.q;

void main(string[] args)
{
    Parser parser = new Parser(args[1]);
    File output = File(args[2], "wb");
    foreach(entry; parser)
        writefln("%80s", entry);
}
//----

I'll submit it for your review, once it is perfectly implemented.
Feb 08 2013
parent "bioinfornatics" <bioinfornatics gmail.com> writes:
Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences below a quality cutoff
- converting to a binary format
More ideas later.
prev sibling parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 02/06/2013 12:43 PM, monarch_dodra wrote:

 with dmd, with -release -O -inline
Going off topic a little: in a recent experiment, I have noticed that adding -inline made a range solution twice as slow. -O -release still helped, but -inline was the culprit.

Ali
Feb 06 2013
prev sibling parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
Instead of using memcpy I tried slicing, around line 136:

_hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];

I get the same performance.
Feb 12 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics 
wrote:
 Instead of using memcpy I tried slicing, around line 136:

 _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];

 I get the same performance.
I think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worse, just 0%. I'd have to try again on my SSD.

Right now, I'm parsing the 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the file takes the EXACT same amount of time as parsing it...

This takes 60 seconds.
//----
auto input = File(args[1], "rb");
ubyte[] buffer = new ubyte[](BufferSize);
do{
    buffer = input.rawRead(buffer);
}while(buffer.length);
//----

This takes 60 seconds too.
//----
Parser parser = new Parser(args[1]);
foreach(q; parser)
    foreach(char c; q.sequence)
        globalNucleic.collect(c);
//----

So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now.

I'm also implementing correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that:
a) You can inspect the erroneous data
b) You can skip the erroneous data, and parse the rest of the file.

Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done in parallel on it.
Feb 12 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:
 [...]

 Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done in parallel on it.
About the threaded version: it is possible to use a get-file-size function to split the file across several threads. Use fseek, read to the end of the section's last record, and return that position to detect where each split ends.
Feb 12 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics 
wrote:
 About the threaded version: it is possible to use a get-file-size function to split the file across several threads. Use fseek, read to the end of the section's last record, and return that position to detect where each split ends.
You'd want to have 2 threads reading the same file at once? I don't think there is much more to be gained anyway, since the IO is the bottleneck.

A better approach would be to have 1 file reader that passes data to two simultaneous parsers. This, however, would make things scary complicated, and I doubt we'd even get much better results: I was not able to measure the actual amount of time spent working when compared to the time spent reading the file.
Feb 12 2013
parent reply FG <home fgda.pl> writes:
On 2013-02-12 17:45, monarch_dodra wrote:
 A better approach would be to have 1 file reader that passes data to two
 simultaneous parsers. This, however, would make things scary complicated, and
 I'd doubt we'd even get much better results: I was not able to measure the
 actual amount of time spent working when compared to the time spent reading the
 file.
Best to keep things simple when the potential benefits aren't certain. :)
Feb 12 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
Sometimes fastq files are compressed to gz, bz2 or xz, as they are often
huge files.
Maybe we need to keep this in mind early in development and use
std.zlib.
Feb 12 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics 
wrote:
 Sometimes fastq files are compressed to gz, bz2 or xz, as they are often
 huge files.
 Maybe we need to keep this in mind early in development and use
 std.zlib.
While working on making the parser multi-threaded compatible, I was able to separate the part that feeds data and the part that parses data.

Long story short, the parser operates on an input range of ubyte[]: it is not responsible any more for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, or a zip decompresser, or just stdin I guess. The range can be transient.

However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorrow (in about 18h).
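A sketch of what such a feeding range could look like for gzip input, built on std.zlib's UnCompress (not monarch_dodra's code; it assumes every compressed chunk yields some decompressed output, which zlib does not strictly guarantee):

//----
import std.stdio;
import std.zlib : HeaderFormat, UnCompress;

// An input range of ubyte[] chunks, decompressed on the fly from a .gz file.
// A parser that accepts "any input range of ubyte[]" could consume this
// exactly like it consumes File.byChunk. No error handling in this sketch.
struct GzipChunks
{
    private typeof(File.init.byChunk(1)) chunks;
    private UnCompress engine;
    private ubyte[] current;
    private bool flushed;

    this(string path, size_t chunkSize = 1 << 20)
    {
        chunks = File(path, "rb").byChunk(chunkSize);
        engine = new UnCompress(HeaderFormat.gzip);
        popFront();
    }

    @property bool empty() { return current.length == 0; }
    @property ubyte[] front() { return current; }

    void popFront()
    {
        if (!chunks.empty)
        {
            current = cast(ubyte[]) engine.uncompress(chunks.front.dup);
            chunks.popFront();
        }
        else if (!flushed)
        {
            current = cast(ubyte[]) engine.flush();   // remaining buffered output
            flushed = true;
        }
        else
            current = null;
    }
}

void main(string[] args)
{
    ulong total;
    foreach (chunk; GzipChunks(args[1]))
        total += chunk.length;             // a real parser would scan records here
    writeln("decompressed ", total, " bytes");
}
//----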
Feb 12 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
 [...]

 I'll *try* to show you what I have so far tomorrow (in about 18h).
Yeah... I played around too much, and the file is dirtier than ever. The good news is that I was able to test out what I was telling you about: accepting any range is OK. I used your ZFile range to plug it into my parser: I can now parse zipped files directly.

The good news is that now I'm not bottlenecked by IO anymore! The bad news is that I'm now bottlenecked by the CPU decompressing. But since I'm using dmd, you may get better results with LDC or GDC.

In any case, I am now parsing the 6 Gig packed into 1.5 Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast that it changes *nothing* in the final result. Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers.

The last good news though is that a CPU bottleneck is always better than an IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads), and be able to process several files at once, multiplying your throughput.
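A sketch of that multiple-files-at-once idea from within one program, using std.parallelism (the per-file work here is a stand-in byte count, not the actual parser):

//----
import std.parallelism : parallel;
import std.stdio;

// Hypothetical per-file work, standing in for "parse one fastq file".
ulong processOneFile(string path)
{
    ulong bytes;
    foreach (ubyte[] chunk; File(path, "rb").byChunk(1 << 20))
        bytes += chunk.length;
    return bytes;
}

void main(string[] args)
{
    auto files = args[1 .. $];
    auto totals = new ulong[files.length];
    // One file per worker from the default task pool, so several files
    // are read and processed at the same time.
    foreach (i, path; parallel(files))
        totals[i] = processOneFile(path);
    foreach (i, path; files)
        writeln(path, ": ", totals[i], " bytes");
}
//----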
Feb 13 2013
next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-13 18:39, monarch_dodra wrote:
 In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds
 (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip,
 1 thread to parse), but again, the actual *parse* phase is so ridiculously
fast,
 that it changes *nothing* to the final result.
Great. Performance aside, we didn't talk much about how this data can be useful - should it only be read sequentially forward or both ways, would there be a need to place some markers or slice the sequence, etc. Our small test case was only about counting nucleotides, so reading order and possibility of further processing was irrelevant. Mr.Bio, what usage cases you'll be interested in, other than those counters?
Feb 13 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
 Mr.Bio, what usage cases you'll be interested in, other than 
 those counters?
Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences below a quality cutoff
- converting to a binary format
- converting to fasta + sff
- merging close sequences into one consensus
- creating a de Bruijn graph
More ideas later.
Feb 14 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 14 February 2013 at 18:31:35 UTC, bioinfornatics 
wrote:
 Mr.Bio, what usage cases you'll be interested in, other than 
 those counters?
 Some ideas, such as:
 - letter counting
 - renaming identifiers
 - trimming sequences below a quality cutoff
 - converting to a binary format
 - converting to fasta + sff
 - merging close sequences into one consensus
 - creating a de Bruijn graph
 More ideas later.
OK. I posted the parser here: http://dpaste.dzfl.pl/37b893ed

This runs on 2.061. I'll have to make a few changes if you need it to run on 2.060, to get around some 2.060-specific bugs.

This contains strictly only the parser. If you want, I'll post the async file reading stuff I wrote to interface with it. The example sections should give you a quick idea of how to use it.

Tell me what you think about it.
Feb 19 2013
parent reply "bioinfornatics" <bioinfornatics fedoraproject.org> writes:
Arf, I am still on dmdfe 2.060.
Feb 22 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 22 February 2013 at 08:53:35 UTC, bioinfornatics wrote:
 Arf, I am still on dmdfe 2.060.
AFAIK, the problems are mostly the "nothrows", and maybe 1 or 2 "new style" alias declarations. That said, what's stopping you from upgrading? We are at 2.062 right now. Does upgrading break anything for you?
Feb 22 2013
prev sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra 
wrote:
 [...]

 The last good news though is that a CPU bottleneck is always better than an IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads), and be able to process several files at once, multiplying your throughput.
I modified the library unzip to make a parallel unzip a while back (at the link below). The execution time scaled very well with the number of cpus for the test case I was using, which was a 2GB unzip'd distribution containing many small files and subdirectories. The parallel operations were by file. I think the execution time gains on ssd drives were from having multiple cores scheduling the writes to separate files in parallel.

https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.d
Dec 18 2013
prev sibling parent FG <home fgda.pl> writes:
On 2013-02-12 17:28, bioinfornatics wrote:
 About the threaded version: it is possible to use a get-file-size function to split the file across several threads. Use fseek, read to the end of the section's last record, and return that position to detect where each split ends.
Yes, but as already mentioned before, it only works well for an SSD. For normal hard drives you'd want the data stored and accessed in sequence, without jumping between cylinders whenever you switch threads.

Do you store your data on an SSD?
Feb 12 2013