digitalmars.D.learn - How to read files fast (I/O operations)
- bioinfornatics (13/13) Feb 04 2013 Dear,
- bioinfornatics (5/5) Feb 04 2013 code: http://dpaste.dzfl.pl/79ab0e17
- FG (4/8) Feb 04 2013 Why are you using mmap? Don't you just go through the file sequentially?
- Dejan Lekic (8/20) Feb 04 2013 I would go even further, and organise the file so N Data objects fit one...
- monarch_dodra (15/37) Feb 04 2013 AFAIK, he is reading text data that needs to be parsed line by
- Brad Roberts (2/9) Feb 04 2013 First rule of performance analysis.. don't guess, measure.
- Jacob Carlborg (7/20) Feb 04 2013 He can still read a chunk from the file, or the whole file and then read...
- bioinfornatics (6/6) Feb 06 2013 instead to call mmFile opIndex to read ubyte by ubyte i tried to
- monarch_dodra (22/28) Feb 06 2013 This might be stupid, but I see a "writeln" in your inner loop.
- monarch_dodra (35/37) Feb 06 2013 Benchmarking and tweaking, I was able to find 3 things that
- bioinfornatics (4/4) Feb 06 2013 i use both gdc / ldc with "-w -O -release" flags
- bioinfornatics (18/22) Feb 06 2013 just to add more information about fastq
- bioinfornatics (2/2) Feb 06 2013 this/these
- monarch_dodra (30/31) Feb 06 2013 I don't think that's true. D provides the same "FILE" primitive
- monarch_dodra (6/10) Feb 06 2013 Hum, just read your example files. I guess you can have white. In
- Denis Shelomovskij (12/16) Feb 07 2013 Yes, lets also create 1 GiB XML files and ask for fast encoding/decoding...
- Jay Norwood (19/36) Dec 18 2013 There is a binary resource format for emf models, which normally
- FG (19/21) Feb 06 2013 Haven't compared to fastxtoolkit, but I have some code for you.
- monarch_dodra (19/36) Feb 06 2013 Do you mean my solution above? I tried your solution with dmd,
- bioinfornatics (5/5) Feb 06 2013 Thanks monarch and FG,
- Lee Braiden (21/23) Feb 06 2013 I wasn't going to mention this as I thought the CPU usage might be
- FG (5/14) Feb 06 2013 This wasn't an issue in my version of the program. It took 1m55s to proc...
- FG (38/50) Feb 06 2013 Yes. Maybe CPU is the bottleneck on my end.
- monarch_dodra (14/32) Feb 06 2013 Yes, that file exactly. That said, I'm working on an SSD, so
- FG (16/19) Feb 07 2013 Ah, now that you mention SSD, I moved the file onto one and it's even mo...
- bioinfornatics (18/18) Feb 07 2013 Little feed back
- monarch_dodra (17/35) Feb 07 2013 You must be using dmd 2.060. I'm using some 2.061 features:
- bioinfornatics (3/43) Feb 07 2013 ubyte as is a number is maybe easier to understand an cuttoff
- bioinfornatics (19/19) Feb 08 2013 And use size_t instead to int for getChar/getInt method as type
- monarch_dodra (39/58) Feb 08 2013 I've re-tried running both mine and FG's on a HDD based machine,
- bioinfornatics (5/5) Feb 09 2013 some idea such as letter counting:
- Ali Çehreli (5/6) Feb 06 2013 Going off topic a little, in a recent experiment, I have noticed that
- bioinfornatics (4/4) Feb 12 2013 instead to use memcpy I try with slicing ~ lines 136 :
- monarch_dodra (45/49) Feb 12 2013 I think I figured out why I'm getting different results than you
- bioinfornatics (5/55) Feb 12 2013 about threaded version is possible to use get file size function
- monarch_dodra (10/78) Feb 12 2013 You'd want to have 2 threads reading the same file at once? I
- FG (2/7) Feb 12 2013 Best to keep things simple when the potential benefits aren't certain. :...
- bioinfornatics (4/4) Feb 12 2013 Some time fastq are comressed to gz bz2 or xz as that is often a
- monarch_dodra (13/17) Feb 12 2013 While working on making the parser multi-threaded compatible, I
- monarch_dodra (22/42) Feb 13 2013 Yeah... I played around too much, and the file is dirtier than
- FG (7/11) Feb 13 2013 Great. Performance aside, we didn't talk much about how this data can be...
- bioinfornatics (8/10) Feb 14 2013 some idea such as letter counting:
- monarch_dodra (11/21) Feb 19 2013 OK. I posted the parser here:
- bioinfornatics (1/1) Feb 22 2013 arf I am always in dmdfe 2.060
- monarch_dodra (5/6) Feb 22 2013 AFAIK, the problems are mostly the "nothrows", and maybe 1 or 2
- Jay Norwood (10/60) Dec 18 2013 I modified the library unzip to make a parallel unzip a while
- FG (5/8) Feb 12 2013 Yes, but like already mentioned before, it only works well for SSD.
Dear all, I am looking to parse huge files efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes. My code is maybe not easy to follow, as it is not easy to parse a fastq file, and it is even harder when using a memory-mapped file. I do not see where I can gain performance, as I do not make many copies and I use MmFile. fastxtoolkit does not use a memory-mapped file and stores its result into a struct array for each sequence, but it is still faster! Thanks for any help; I hope we can create a faster parser, otherwise D is too slow to use instead of C++.
Feb 04 2013
code: http://dpaste.dzfl.pl/79ab0e17 fastxtoolkit: http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2 | - fastx_quality_stats.c -> read_file() | - libfastx/fastx.c -> fastx_read_next_record()
Feb 04 2013
On 2013-02-04 15:04, bioinfornatics wrote:I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min. My code is maybe not easy as is not easy to parse a fastq file and is more harder when using memory mapped file.Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks: foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
Feb 04 2013
FG wrote:On 2013-02-04 15:04, bioinfornatics wrote:I would go even further, and organise the file so N Data objects fit one page, and read the file page by page. The page-size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :) -- Dejan Lekic dejan.lekic (a) gmail.com http://dejan.lekic.orgI am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min. My code is maybe not easy as is not easy to parse a fastq file and is more harder when using memory mapped file.Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks: foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
Feb 04 2013
On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:FG wrote:AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach. I'm just wondering if maybe the reason the D code is slow is not just because of: - unicode. - front + popFront. ranges in D are "notorious" for being slow to iterate on text, due to the "double decode". If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.On 2013-02-04 15:04, bioinfornatics wrote:I would go even further, and organise the file so N Data objects fit one page, and read the file page by page. The page-size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :)I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min. My code is maybe not easy as is not easy to parse a fastq file and is more harder when using memory mapped file.Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks: foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
Feb 04 2013
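[Editor's note: to make the "iterate over raw bytes, skip Unicode decoding" suggestion above concrete, here is a minimal sketch. It is not code from the thread: it counts every byte value in the file over raw chunks (so identifier and quality lines are counted too), and the chunk size is arbitrary. It only illustrates the raw-byte iteration, not FASTQ parsing.]

    import std.stdio;

    void main(string[] args)
    {
        ulong[256] counts;                      // one counter per possible byte value
        auto file = File(args[1], "rb");
        foreach (ubyte[] chunk; file.byChunk(64 * 1024))
            foreach (b; chunk)
                ++counts[b];                    // no UTF-8 decoding, no front/popFront
        writefln("A:%s C:%s G:%s T:%s N:%s",
                 counts['A'], counts['C'], counts['G'], counts['T'], counts['N']);
    }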
On Mon, 4 Feb 2013, monarch_dodra wrote:AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach. I'm just wondering if maybe the reason the D code is slow is not just because of: - unicode. - front + popFront.First rule of performance analysis.. don't guess, measure.
Feb 04 2013
On 2013-02-04 20:39, monarch_dodra wrote:AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach.He can still read a chunk from the file, or the whole file and then read that chunk line by line.I'm just wondering if maybe the reason the D code is slow is not just because of: - unicode. - front + popFront. ranges in D are "notorious" for being slow to iterate on text, due to the "double decode". If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.Depending on what you're doing you can blast through the bytes even if it's Unicode. It will of course not validate the Unicode. -- /Jacob Carlborg
Feb 04 2013
Instead of calling mmFile's opIndex to read ubyte by ubyte, I tried reading into a buffer array of length PAGESIZE. Code here: http://dpaste.dzfl.pl/25ee34fc and it is not faster: to parse 12 GB I still need 11 minutes. I do not see how I could read the file faster! As a reminder, fastxtoolkit needs 2 minutes!
Feb 06 2013
On Wednesday, 6 February 2013 at 10:43:02 UTC, bioinfornatics wrote:instead to call mmFile opIndex to read ubyte by ubyte i tried to put into a buffer array of length PAGESIZE. code here: http://dpaste.dzfl.pl/25ee34fc and is not faster for 12Go to parse i need 11 minutes. I do not see how i could read faster the file! To remember fastxtoolkit need 2 min!This might be stupid, but I see a "writeln" in your inner loop. You aren't slowed down just by your console by any chance? If I were you, I'd start benching to try and see who is slowing you down. I'd reorganize the code to parse a file that is, say 512Mb. The rationale being you can place it entirely at once. Then, I'd shift the logic from "fully proccess each charater before moving to the next character" to "make a full processing pass on the entire data structure, before moving to the next pass". The steps I see that need to be measured are: * Raw read of file * Iterating on your file to extract it as a raw array of "Data" objects * Processing the Data objects * Outputting the data Also, (of course), you need to make sure you are compiling in release (might sound obvious, but you never know). Are you using dmd? I heard the "other" compilers are faster. I'm going to try and see with some example files if I can't get something running faster.
Feb 06 2013
On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra wrote:I'm going to try and see with some example files if I can't get something running faster.Benchmarking and tweaking, I was able to find 3 things that speeds up your program: 1) Make the computeLocal a compile time constant. This will give you a tinsy bit of performance. Depends on if you plan to make it a run-time argument switch I guess. 2) Makes things about 10%-20% faster: Your "nucleic" and "amino" hash tables map a character to an index. However, given the range of the characters ('A' to 'Z'), you are better off doing a flat array, where each index represents a character, eg: A is index 0, B is index 1. This way, lookup is a simple array indexing, as opposed to a hash table indexing. You may even get a bigger bang for your buck by simply giving your "_stats" structure a simple "A is index 0, B is index 1", and only "re-order" the data at the end, when you want to read it. (I haven't done this though). 3) Makes things about 100% faster (ran in half the time on my machine): I don't know how mmFile works, but a simple File + "rawRead" seems to get the job done fast. Also, instead of keeping track of an (several) indexes, I merely keep a single slice. The only thing I care about, is if my slice is empty, in which case I re-fill it. The modified code is here. I'm apparently getting the same output you are, but that doesn't mean there might not be bugs in it. For example, I noticed that you don't strip leading whites, if any, before the first read. http://dpaste.dzfl.pl/9b9353b8 ---- I'd be tempted to re-write the parser using a "byLine" approach, since my quick reading about fastq seems to imply it is a line based format. Or just plain try to write a parser from scratch, putting my own logic and thought into it (all I did was modify your code, without caring about the actual algorithm)
Feb 06 2013
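[Editor's note: a hedged sketch of point 2 above - replacing a char-to-index hash table with a flat 256-entry lookup table. The names (nucleicIndex, stats) are illustrative, not the actual variables from the dpaste code; unexpected characters fall into a spare slot.]

    import std.stdio;

    immutable size_t[256] nucleicIndex = buildIndex();

    size_t[256] buildIndex()
    {
        size_t[256] idx;            // any unexpected character maps to slot 0
        idx['A'] = 1; idx['C'] = 2; idx['G'] = 3; idx['T'] = 4; idx['N'] = 5;
        return idx;
    }

    void main()
    {
        ulong[6] stats;                         // slot 0 collects "other"
        foreach (char c; "ACGTNACGT")
            ++stats[nucleicIndex[c]];           // plain array indexing, no hashing
        writeln(stats);
    }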
I use both GDC and LDC with the "-w -O -release" flags. The writeln inside the loop is never evaluated, as the computeLocal boolean is always false. Thanks in any case; I am continuing to read all your answers :-)
Feb 06 2013
On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
> I use both GDC and LDC with the "-w -O -release" flags. The writeln inside the loop is never evaluated, as the computeLocal boolean is always false. Thanks in any case; I am continuing to read all your answers :-)
Just to add more information about fastq: http://www.biomedsearch.com/nih/Sanger-FASTQ-file-format-sequences/20015970.html And here is a set of fastq files where a parser should succeed or fail: http://www.biomedsearch.com/attachments/00/20/01/59/20015970/gkp1137_nar-02248-d-2009-File005.gz
The problem is that a sequence line can be split across several lines, and the same goes for the quality line. And I think these lines should be allowed to contain whitespace. The '@' is used to identify an identifier line and the '+' is used to identify a description line, but these characters can also appear as quality values (ubyte). I agree the spec format is really bad, but it is heavily used in biology, so I would like a fast parser in order to develop some D applications instead of using C++. I will try all the previous recommendations later; thanks to all. It seems that, in any case, it is not easy to parse a file fast in D. Note: is it possible to lock a file, to be able to use pure methods?
Feb 06 2013
On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics wrote:It seem in any case is not easy to parse fastly a file in DI don't think that's true. D provides the same "FILE" primitive you'd get in C, so there is no reason for it to be slower than C. It is the "range" approach that, as convenient as it is, is not well adapted for certain things. As I had said, I tried to write my own program. In it, I devised a range that, instead of exposing things to parse character by character, parses an entire "object" (a ... "genome" ... maybe ? I called them "Q" in my program) at once into an object. I decided to use the very simple "byLine" primitive. From there, you can query the object for their name/sequence/quality. The irony is that by "parsing twice" (once to do the io read, once to do the actual processing of the text), and taking into account I'm allocating each object individually, I'm running twice as fast as my original already improved implementation. Not only is it faster, it is also more convenient, since you can extract an entire Q object at once, and then operate on that as you would so please: Separation of algorithm and parsing. It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character. Now: Keep in mind that this approach allocates (3) new strings for each Q. You could *try* an approach with a pre-allocated re-useable buffer. This would mean you can only operate on 1 Q at once, but you'd probably iterate on them faster. In any case, you can try it out: http://dpaste.dzfl.pl/8bdd0c84
Feb 06 2013
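[Editor's note: a rough sketch of what the byLine-based record approach described above can look like. It assumes the common 4-line record layout (no wrapped sequence or quality lines), loads everything eagerly, and uses illustrative names (Q, readRecords) rather than the actual types from the dpaste link. A real version would be a lazy range, as the post describes.]

    import std.stdio, std.exception;

    struct Q
    {
        string id;
        string sequence;
        string quality;
    }

    Q[] readRecords(File f)
    {
        Q[] records;
        string[4] block;
        size_t n;
        foreach (line; f.byLine())          // byLine reuses its buffer...
        {
            block[n++] = line.idup;         // ...so dup what we keep
            if (n == 4)
            {
                enforce(block[0].length && block[0][0] == '@', "bad record header");
                records ~= Q(block[0][1 .. $], block[1], block[3]);
                n = 0;
            }
        }
        return records;
    }

    void main(string[] args)
    {
        foreach (q; readRecords(File(args[1], "r")))
            writeln(q.id, ": ", q.sequence.length, " bases");
    }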
On Wednesday, 6 February 2013 at 16:06:20 UTC, monarch_dodra wrote:
> It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character.
Hum, I just read your example files. I guess you can have whitespace after all. In any case, that should not pose any real problem. would come in handy here.
Feb 06 2013
On 06.02.2013 19:40, bioinfornatics wrote:
> On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote: I agree the spec format is really bad but it is heavily used in biology so i would like a fast parser to develop some D application instead to use C++.
Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding! The situation can be improved only if: 1. We will find and kill every text format creator; 2. We will create a really good binary format for each such task and support it in every application we create. So after some time text formats will just die because of evolution, as everything will support better formats. (the second proposal is a real recommendation) -- Денис В. Шеломовский Denis V. Shelomovskij
Feb 07 2013
On Friday, 8 February 2013 at 06:22:18 UTC, Denis Shelomovskij wrote:06.02.2013 19:40, bioinfornatics пишет:There is a binary resource format for emf models, which normally use xml files, and some timing improvements stated at this link. It might be worth looking at this if you are thinking about writing your own binary format. http://www.slideshare.net/kenn.hussey/performance-and-extensibility-with-emf There is also a fast binary compression library named blosc that is used in some python utilities, measured and presented here, showing that it is faster than doing a memcpy if you have multiple cores. http://blosc.pytables.org/trac On the sequential accesses ... I found that windows writes blocks of data all over the place, but the best way to get it to write something in more contiguous locations is to modify the file output routines to use specify write through. The sequential accesses didn't improve read times on ssd. Most of the decent ssds can read big files at 300MB/sec or more now, and you can raid 0 a few of them and read 800MB/sec.On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote: I agree the spec format is really bad but it is heavily used in biology so i would like a fast parser to develop some D application instead to use C++.Yes, lets also create 1 GiB XML files and ask for fast encoding/decoding! The situation can be improved only if: 1. We will find and kill every text format creator; 2. We will create a really good binary format for each such task and support it in every application we create. So after some time text formats will just die because of evolution as everything will support better formats. (the second proposal is a real recommendation)
Dec 18 2013
On 2013-02-04 15:04, bioinfornatics wrote:I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min.Haven't compared to fastxtoolkit, but I have some code for you. I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so far -- all compiled with gdc -O3. I bet your computer has better specs than mine. Program uses a buffer that should be twice the size of the largest sequence record (counting id, comment and quality data). A chunk of file is read, then records are scanned on the buffer until record start pointer passes the middle of the buffer -- then memcpy is used to move all the rest to the begining of the buffer and the remaining space at the end is filled with another chunk read from the file. Data contains both sequence letter and associated quality information. Sequence ID and comment are slices of the buffer, so they have valid info until you move to the next sequence (and the number increments). This is the code: http://dpaste.1azy.net/8424d4ac Tell me what timings you can get now.
Feb 06 2013
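[Editor's note: a sketch of the sliding-buffer scheme FG describes - scan records in a fixed buffer, and once the scan position passes the middle, move the unread tail to the front and top the buffer up from the file. This is not FG's actual code; names and sizes are illustrative, and the record scanning itself is stubbed out.]

    import std.stdio;
    import core.stdc.string : memmove;

    /// Moves buf[pos .. filled] to the front and reads more data after it.
    /// Returns the new number of valid bytes in buf.
    size_t refill(ref File f, ubyte[] buf, size_t pos, size_t filled)
    {
        auto rest = filled - pos;
        memmove(buf.ptr, buf.ptr + pos, rest);          // keep the partial record
        auto got = f.rawRead(buf[rest .. $]).length;    // top up from the file
        return rest + got;
    }

    void main(string[] args)
    {
        enum bufferSize = 1 << 20;      // should exceed twice the largest record
        auto f = File(args[1], "rb");
        auto buf = new ubyte[](bufferSize);
        size_t filled = f.rawRead(buf).length;
        size_t pos = 0;

        while (filled > pos)
        {
            // ... scan one record starting at buf[pos] and advance pos past it ...
            pos = filled;                       // placeholder: "consume" everything
            if (pos > bufferSize / 2)           // past the middle: recycle the buffer
            {
                filled = refill(f, buf, pos, filled);
                pos = 0;
            }
        }
    }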
On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:On 2013-02-04 15:04, bioinfornatics wrote:Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min.Haven't compared to fastxtoolkit, but I have some code for you. I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so farData contains both sequence letter and associated quality information. Sequence ID and comment are slices of the buffer, so they have valid info until you move to the next sequence (and the number increments).Hum. Mine allocates new slices, so they are never invalidated :) Mine also takes into account newlines and and lowercase sequences. Still, it seems you and I both took different approaches. I had mentioned using a re-useable buffer. I'm going to try to consume some of your code to see if I can't improve my implementation. bioinfornatics I'm getting real interested on the subject. I'm going to try to write an actual library/framework for working with fastq files in a D environment. This means I'll try to write robust and useable code, with both stability and performance in mind, as opposed to the "proofs of concepts in so far". For now, I'd like to keep it simple: Would something that only knows how to parse/write Sanger FASTQ files be of help to you? If I write something, can I have you review it?
Feb 06 2013
Thanks monarch and FG, I will read your code to see where I am failing :-) And of course, if you are interested in bio formats, I would be really happy to work on and review them together. In any case, big thanks; this is a very interesting subject.
Feb 06 2013
On 06/02/13 22:21, bioinfornatics wrote:
> Thanks monarch and FG, I will read your code to see where I am failing :-)
I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task. I guess you'd need a big task: the task would need to load and return n chunks or n lines, rather than just one line at a time, for example, and the processing/parsing thread (main thread or otherwise) could then churn through that while more IO was done. It would also depend on the size of the file: no point firing up a thread just to read a tiny file that the filesystem can return in a millisecond. If you're talking about 1+ minutes of loading though, a thread should definitely help. Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you could split the file in half, find the remaining parts of the record in the second half, move it to the first, and then process the two halves in two threads. If you've a nice function to do that split cleanly, and n cpus, then just call it some more. -- Lee
Feb 06 2013
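[Editor's note: a minimal sketch of the reader-thread idea above, using std.concurrency: one thread reads chunks and sends them over a message queue, the main thread consumes them. The chunk size and the trivial "processing" (byte counting) are placeholders, not code from the thread.]

    import std.stdio;
    import std.concurrency;

    void reader(Tid parent, string path)
    {
        auto f = File(path, "rb");
        foreach (chunk; f.byChunk(1 << 20))
            send(parent, chunk.idup);       // idup: byChunk reuses its buffer
        send(parent, cast(immutable(ubyte)[]) null);    // end-of-file marker
    }

    void main(string[] args)
    {
        spawn(&reader, thisTid, args[1]);
        ulong total;
        for (;;)
        {
            // note: without setMaxMailboxSize the mailbox can grow without bound
            auto chunk = receiveOnly!(immutable(ubyte)[])();
            if (chunk is null) break;
            total += chunk.length;          // real code would parse records here
        }
        writeln("bytes seen: ", total);
    }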
On 2013-02-07 00:41, Lee Braiden wrote:I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task.This wasn't an issue in my version of the program. It took 1m55s to process the file, but then again it takes 1m44s just to read it (as shown previously).Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you cold split the file in half, find the remaining parts of the record in the second half, move it to the first, and then process the two halves in two threads. If you've a nice function to do that split cleanly, and n cpus, then just call it some more.Now, this could make a big difference! If only parsing out of order is acceptable in this case.
Feb 06 2013
On 2013-02-06 21:43, monarch_dodra wrote:
> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>> I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB, processing took 1m45s - twice as fast as the fastest D solution so far
> Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
Yes. Maybe CPU is the bottleneck on my end. With DMD32 2.060 on win7-64 compiled with the same flags I got: MD: 4m30s / FG: 1m55s - both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s?
> I'm getting real interested on the subject. I'm going to try to write an actual library/framework for working with fastq files in a D environment.
Those fastq are contagious. ;)
> This means I'll try to write robust and useable code, with both stability and performance in mind, as opposed to the "proofs of concepts in so far".
Yeah, but the big deal was that D is 5.5x slower than C++. You have mentioned something about using byLine. Well, I would have gladly used it instead of looking for line ends myself and pushing stuff with memcpy. But the thing is that while the fgets(char *buf, int bufSize, FILE *f) in fastx is fast at reading a file by line, using file.readln(buf) is unpredictable. :) I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC it can be several times slower. For example, just reading in a loop:

    import std.stdio;

    enum uint bufferSize = 4096 - 16;

    void main(string[] args)
    {
        char[] tmp, buf = new char[bufferSize];
        size_t cnt;
        auto f = File(args[1], "r");
        switch(args[2])
        {
            case "raw":
                do tmp = f.rawRead(buf); while (tmp.length);
                break;
            case "readln":
                do cnt = f.readln(buf); while (cnt);
                break;
            default:
                writeln("Use parameters: <filename> raw|readln");
        }
    }

Tested on a much smaller SRR077487.filt.fastq:
    DMD32 -release -O -inline: raw 94ms / readln 450ms
    GDC64 -O3: raw 94ms / readln 6.76s
Tested on SRR077487_1.filt.fastq:
    DMD32 -release -O -inline: raw 1m44s / readln 1m55s
    GDC64 -O3: raw 1m48s / readln 14m16s

Why such a big difference between DMD and GDC (on Windows)? (Or have I missed some switch in GDC?)
Feb 06 2013
On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:On 2013-02-06 21:43, monarch_dodra wrote:Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are? My attempt was mostly to try and see how fast we could go, while doing it only with high level stuff (eg, no fSomething calls). Probably, going lower level, and parsing the text manually, waiting for magic characters could yield better result (like what you did). I'm going to also try playing around with threads: Just last week I wrote a program that did exactly this (asynchronous file reads). That said, I'll be making this priority n°2. I'd like to make the parser work perfectly first, and in a way that is easily upgradeable/useable. Mr. bio made it perfectly clear that he needed support for whites and line feeds ;)On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:Yes. Maybe CPU is the bottleneck on my end. With DMD32 2.060 on win7-64 compiled with same flags I got: MD: 4m30 / FG: 1m55s - both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s?I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so farDo you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
Feb 06 2013
On 2013-02-07 08:26, monarch_dodra wrote:Ah, now that you mention SSD, I moved the file onto one and it's even more clear that I am CPU-bound here on the Intel E6600 system. Compare: 7200rpm: MS 4m30s / FG 1m55s SSD: MS 4m14s / FG 1m44s Almost the same, but running the utility "wc -l" on the file renders: 7200rpm: 1m45s SSD: 0m33s In my case threads would be beneficial but only when using the SSD. Reading the file by chunk in D takes 33s on SSD and 1m44s on HDD. Slicing the file in half and reading from both threads would also be fine only on the SSD, because on a HDD I'd lose sequential disk reads jumping between threads (expecting lower performance). Therefore - threads: yes, but gotta use an SSD. :) Also, threads: yes, if there's gonna be more processing than just counting letters.You have timed the same file SRR077487_1.filt.fastq at 67s?Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are?
Feb 07 2013
A little feedback. I named FG's script "f" and monarch's script "monarch".

    gdmd -O -w -release f.d
    ~ $ time ./f bigFastq.fastq
    ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]
    real 2m14.966s
    user 0m47.168s
    sys 0m15.379s

    ~ $ gdmd -O -w -release monarch.d
    monarch.d:117: no identifier for declarator Lines
    monarch.d:117: alias cannot have initializer
    monarch.d:130: identifier or integer expected, not assert

I haven't taken the time to look into it more, but in any case it seems memory-mapped files are really slow, whereas they are said to be the fastest way to read a file. Creating an index, where just reading the file needs 12 minutes, is useless when reading and computing take only 2 minutes.
Feb 07 2013
On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:Little feed back i named f the f's script and monarch the monarch's script gdmd -O -w -release f.d ~ $ time ./f bigFastq.fastq ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772] real 2m14.966s user 0m47.168s sys 0m15.379s ~ $ gdmd -O -w -release monarch.d monarch.d:117: no identifier for declarator Lines monarch.d:117: alias cannot have initializer monarch.d:130: identifier or integer expected, not assert i haven't take the time to look more but in any case it seem memory mapped file is really slowly whereas it is said that is the faster way to read file. Create an index where reading the file need 12 min that is useless as for read and compute you need 2 minYou must be using dmd 2.060. I'm using some 2.061 features: Namelly "new style alias". Just change line 117: alias Lines = typeof(File.init.byLine()); to alias typeof(File.init.byLine()) Lines; As for 130, it's a "version(assert)" eg, code that does not get executed in release. Just remove the "version(assert)", if it gets executed, it is not a big deal. In any case, I think the code is mostly "proof", I wouldn't use it as is. ------------ BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
Feb 07 2013
On Thursday, 7 February 2013 at 14:42:57 UTC, monarch_dodra wrote:On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:ubyte as is a number is maybe easier to understand an cuttoff some valueLittle feed back i named f the f's script and monarch the monarch's script gdmd -O -w -release f.d ~ $ time ./f bigFastq.fastq ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772] real 2m14.966s user 0m47.168s sys 0m15.379s ~ $ gdmd -O -w -release monarch.d monarch.d:117: no identifier for declarator Lines monarch.d:117: alias cannot have initializer monarch.d:130: identifier or integer expected, not assert i haven't take the time to look more but in any case it seem memory mapped file is really slowly whereas it is said that is the faster way to read file. Create an index where reading the file need 12 min that is useless as for read and compute you need 2 minYou must be using dmd 2.060. I'm using some 2.061 features: Namelly "new style alias". Just change line 117: alias Lines = typeof(File.init.byLine()); to alias typeof(File.init.byLine()) Lines; As for 130, it's a "version(assert)" eg, code that does not get executed in release. Just remove the "version(assert)", if it gets executed, it is not a big deal. In any case, I think the code is mostly "proof", I wouldn't use it as is. ------------ BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
Feb 07 2013
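[Editor's note: the Sanger FASTQ format referenced earlier in the thread encodes quality as (Phred score + 33) in printable ASCII, so serving quality as ubytes - as preferred above - is a one-line conversion. A small sketch with an illustrative function name; note that older Illumina variants used an offset of 64 instead of 33.]

    import std.stdio;
    import std.algorithm : map;
    import std.array : array;
    import std.string : representation;

    ubyte[] qualityScores(const(char)[] qual)
    {
        // Sanger FASTQ stores (Phred score + 33) as a printable ASCII character.
        return qual.representation.map!(q => cast(ubyte)(q - 33)).array;
    }

    void main()
    {
        writeln(qualityScores("II?+!"));    // prints [40, 40, 30, 10, 0]
    }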
And use size_t instead of int as the return type for the getChar/getInt methods.

    gdmd -w -O -release monarch.d
    ~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
    globalStats: A: 1007129068. C: 1350576504. G: 1353023772. M: 0. D: 0. S: 0. H: 0. N: 39413. V: 0. U: 0. W: 0. R: 0. B: 0. Y: 0. K: 0. T: 999786820. time: 176585
    real 2m56.635s
    user 2m31.376s
    sys 0m23.077s

This program is a little slower than FG's program. About the parser: I would like to create a set of biology parsers and put them into a library, along with a set of common computations such as a letter counter. For example, you could run a letter-counting computation over a fasta or fastq file, or rename identifiers throughout a fasta or fastq file.
Feb 08 2013
On Friday, 8 February 2013 at 09:08:48 UTC, bioinfornatics wrote:
> And use size_t instead of int as the return type for the getChar/getInt methods.
>
>     gdmd -w -O -release monarch.d
>     ~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
>     globalStats: A: 1007129068. C: 1350576504. G: 1353023772. M: 0. D: 0. S: 0. H: 0. N: 39413. V: 0. U: 0. W: 0. R: 0. B: 0. Y: 0. K: 0. T: 999786820. time: 176585
>     real 2m56.635s
>     user 2m31.376s
>     sys 0m23.077s
>
> This program is a little slower than FG's program.
I've re-tried running both mine and FG's on an HDD-based machine, with dmd, -O -release, and optionally -inline. I also wrote a new parser, which does as FG suggested, and just parses straight up (byLine is indeed more expensive). This one handles whitespace and line breaks correctly. It also accepts lines of any size (the internal buffer is auto-grown). My results are different from yours though:

             w/o inline   w/ inline
    FG       105s         77s
    MD       72s          64s
    newMD    61s          59s

I have no idea why you guys are getting better results with FG's version, and I'm getting better results with mine. Is this a win/linux or dmd/gdc issue? My new parser is based on raw reads, so that should be much faster on your machines.
> About the parser: I would like to create a set of biology parsers and put them into a library, along with a set of common computations such as a letter counter. For example, you could run a letter-counting computation over a fasta or fastq file, or rename identifiers throughout a fasta or fastq file.
I don't really understand what all that means. In any case, I've been able to implement some cool features so far. My parser is a "true" range you can pass around, and you won't have any problems with it. It returns "shallow" objects that reference a mutable string; however, the user can call "dup" or "idup" to get a new object. Said objects can be printed directly, so there is no need for a specialized "writer". As a matter of fact, this little program will allow you to "clean" a file (strip spaces), and potentially line-wrap at 80 chars:

    //----
    import std.stdio;
    import fastq.parser;
    import fastq.q;

    void main(string[] args)
    {
        Parser parser = new Parser(args[1]);
        File output = File(args[2], "wb");
        foreach(entry; parser)
            writefln("%80s", entry);
    }
    //----

I'll submit it for your review, once it is perfectly implemented.
Feb 08 2013
Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences at a quality-value cutoff
- converting to a binary format
More ideas later.
Feb 09 2013
On 02/06/2013 12:43 PM, monarch_dodra wrote:with dmd, with -release -O -inlineGoing off topic a little, in a recent experiment, I have noticed that adding -inline made a range solution twice slower. -O -release still helped but -inline was the culprit. Ali
Feb 06 2013
Instead of using memcpy I tried slicing, around line 136: _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get the same performance.
Feb 12 2013
On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:
> Instead of using memcpy I tried slicing, around line 136: _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get the same performance.
I think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worse, just 0%. I'd have to try again on my SSD. Right now, I'm parsing the 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the file takes the EXACT same amount of time as parsing it...

This takes 60 seconds.

    //----
    auto input = File(args[1], "rb");
    ubyte[] buffer = new ubyte[](BufferSize);
    do{
        buffer = input.rawRead(buffer);
    }while(buffer.length);
    //----

This takes 60 seconds too.

    //----
    Parser parser = new Parser(args[1]);
    foreach(q; parser)
        foreach(char c; q.sequence)
            globalNucleic.collect(c);
    //----

So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now. I'm also implementing correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that: a) You can inspect the erroneous data b) You can skip the erroneous data, and parse the rest of the file. Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done on it in parallel.
Feb 12 2013
On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:about threaded version is possible to use get file size function to split it in several thread. Use fseek read end of section return it to detect end of split to usedinstead to use memcpy I try with slicing ~ lines 136 : _hardBuffer[ 0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get same perfI think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worst, just 0%. I'd have to try again on my SSD. Right now, I'm parsing the file 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the files takes the EXACT same amount of time as parsing it... This takes 60 seconds. //---- auto input = File(args[1], "rb"); ubyte[] buffer = new ubyte[](BufferSize); do{ buffer = input.rawRead(buffer); }while(buffer.length); //---- This takes 60 seconds too. //---- Parser parser = new Parser(args[1]); foreach(q; parser) foreach(char c; q.sequence) globalNucleic.collect(c); } //---- So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now. I'm also implementing a correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that: a) You can inspect the erroneous data b) You can skip the erroneous data, and parse the rest of the file. Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done parallel on it.
Feb 12 2013
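[Editor's note: a rough sketch of the fseek/offset-splitting idea above - jump to an approximate offset, then scan forward for the start of a record before handing that offset to a worker. Resyncing on a line that starts with '@' is NOT fully robust for FASTQ ('@' can occur in quality strings, as noted earlier in the thread), and as later replies point out this mainly pays off on SSDs. Treat this purely as an illustration.]

    import std.stdio;

    ulong findRecordStart(string path, ulong approx)
    {
        auto f = File(path, "rb");
        f.seek(approx);
        f.readln();                     // drop the (probably partial) current line
        ulong offset = f.tell();
        foreach (line; f.byLine(KeepTerminator.yes))
        {
            if (line.length && line[0] == '@')
                return offset;          // candidate record header
            offset += line.length;
        }
        return offset;                  // EOF: no further record
    }

    void main(string[] args)
    {
        auto size = File(args[1]).size;
        auto mid = findRecordStart(args[1], size / 2);
        writefln("split file of %s bytes at offset %s", size, mid);
        // each worker would then parse its [start, end) slice with its own File handle
    }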
On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics wrote:On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:You'd want to have 2 threads reading the same file at once? I don't think there is much more to be gained anyways, since the IO is the bottleneck anyways. A better approach would be to have 1 file reader that passes data to two simultaneous parsers. This, however, would make things scary complicated, and I'd doubt we'd even get much better results: I was not able to measure the actual amount of time spent working when compared to the time spent reading the file.On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:about threaded version is possible to use get file size function to split it in several thread. Use fseek read end of section return it to detect end of split to usedinstead to use memcpy I try with slicing ~ lines 136 : _hardBuffer[ 0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get same perfI think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worst, just 0%. I'd have to try again on my SSD. Right now, I'm parsing the file 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the files takes the EXACT same amount of time as parsing it... This takes 60 seconds. //---- auto input = File(args[1], "rb"); ubyte[] buffer = new ubyte[](BufferSize); do{ buffer = input.rawRead(buffer); }while(buffer.length); //---- This takes 60 seconds too. //---- Parser parser = new Parser(args[1]); foreach(q; parser) foreach(char c; q.sequence) globalNucleic.collect(c); } //---- So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now. I'm also implementing a correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that: a) You can inspect the erroneous data b) You can skip the erroneous data, and parse the rest of the file. Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done parallel on it.
Feb 12 2013
On 2013-02-12 17:45, monarch_dodra wrote:A better approach would be to have 1 file reader that passes data to two simultaneous parsers. This, however, would make things scary complicated, and I'd doubt we'd even get much better results: I was not able to measure the actual amount of time spent working when compared to the time spent reading the file.Best to keep things simple when the potential benefits aren't certain. :)
Feb 12 2013
Sometimes fastq files are compressed with gz, bz2 or xz, as they are often huge. Maybe we need to keep this in mind early in development and use std.zlib.
Feb 12 2013
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:
> Sometimes fastq files are compressed with gz, bz2 or xz, as they are often huge. Maybe we need to keep this in mind early in development and use std.zlib.
While working on making the parser multi-threaded compatible, I was able to separate the part that feeds data from the part that parses data. Long story short, the parser operates on an input range of ubyte[]: it is no longer responsible for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, or a zip decompressor, or just stdin I guess. The range can be transient. However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorrow (in about 18h).
Feb 12 2013
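[Editor's note: following up on the compressed-input point, here is a rough sketch of building a gzip-decompressing range of ubyte[] chunks with std.zlib, suitable for feeding a parser that accepts any input range of ubyte[]. The Parser integration is assumed and not shown, and a complete version would also emit decomp.flush() after the last chunk; this is not the thread's actual code.]

    import std.stdio;
    import std.zlib : UnCompress, HeaderFormat;
    import std.algorithm : map;

    auto gzipChunks(string path, size_t chunkSize = 1 << 20)
    {
        auto decomp = new UnCompress(HeaderFormat.gzip);
        return File(path, "rb")
            .byChunk(chunkSize)
            // dup because byChunk reuses its buffer between fronts
            .map!(raw => cast(ubyte[]) decomp.uncompress(raw.dup));
    }

    void main(string[] args)
    {
        ulong total;
        foreach (chunk; gzipChunks(args[1]))
            total += chunk.length;          // a parser would consume chunks here
        writeln("decompressed bytes: ", total);
    }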
On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:Yeah... I played around too much, and the file is dirtier than ever. The good news is that I was able to test out what I was telling you about: accepting any range is ok: I used your ZFile range to plug it into my parser: I can now parse zipped files directly. The good news is that now, I'm not bottle necked by IO anymore! The bad news is that I'm now bottle necked by CPU decompressing. But since I'm using dmd, you may get better results with LDC or GDC. In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast, that it changes *nothing* to the final result. Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers. The last good news though is that CPU bottleneck is always better than IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads), and be able to process several files at once, multiplying your throughput.Some time fastq are comressed to gz bz2 or xz as that is often a huge file. Maybe we need keep in mind this early in developement and use std.zlibWhile working on making the parser multi-threaded compatible, I was able to seperate the part that feeds data, and the part that parses data. Long story short, the parser operates on an input range of ubyte[]: It is not responsible any more for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchroneus file reader, or a zip decompresser, or just stdin I guess. Range can be transient. However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorow (in about 18h).
Feb 13 2013
On 2013-02-13 18:39, monarch_dodra wrote:In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast, that it changes *nothing* to the final result.Great. Performance aside, we didn't talk much about how this data can be useful - should it only be read sequentially forward or both ways, would there be a need to place some markers or slice the sequence, etc. Our small test case was only about counting nucleotides, so reading order and possibility of further processing was irrelevant. Mr.Bio, what usage cases you'll be interested in, other than those counters?
Feb 13 2013
> Mr.Bio, what usage cases you'll be interested in, other than those counters?
Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences at a quality-value cutoff (one simple interpretation is sketched below)
- converting to a binary format
- converting to fasta + sff
- merging close sequences into one consensus
- creating a de Bruijn graph
More ideas later.
Feb 14 2013
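[Editor's note: the quality-trimming item above can be read several ways; one simple interpretation - an assumption, not something specified in the thread - is to drop trailing bases whose Phred+33 score falls below a cutoff. Names are illustrative.]

    import std.stdio;
    import std.typecons : Tuple, tuple;

    Tuple!(string, string) trimTail(string seq, string qual, ubyte cutoff)
    {
        assert(seq.length == qual.length);      // FASTQ pairs them one-to-one
        size_t keep = qual.length;
        while (keep > 0 && cast(ubyte)(qual[keep - 1] - 33) < cutoff)
            --keep;                             // walk back over the low-quality tail
        return tuple(seq[0 .. keep], qual[0 .. keep]);
    }

    void main()
    {
        auto r = trimTail("ACGTACGT", "IIIIII#!", 20);
        writeln(r[0], " / ", r[1]);             // prints "ACGTAC / IIIIII"
    }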
On Thursday, 14 February 2013 at 18:31:35 UTC, bioinfornatics wrote:OK. I posted the parser here: http://dpaste.dzfl.pl/37b893ed This runs on the 2.061. I'll have to make a few changes if you need it to run 2.060, to get around some 2.060 specific bugs. This contains strictly only the parser. If you want, I'll post the async file reading stuff I wrote to interface with it. The example sections should give you a quick idea of how to use it. Tell me what you think about it.Mr.Bio, what usage cases you'll be interested in, other than those counters?some idea such as letter counting: rename identifier trimming sequence from quality value to cutoff convert to a binary format convert to fasta + sff merge close sequence to one concenus create a brujin graph more idea later
Feb 19 2013
Argh, I am still on dmdfe 2.060.
Feb 22 2013
On Friday, 22 February 2013 at 08:53:35 UTC, bioinfornatics wrote:
> Argh, I am still on dmdfe 2.060.
AFAIK, the problems are mostly the "nothrows", and maybe 1 or 2 "new style" alias declarations. That said, what's stopping you from upgrading? We are at 2.062 right now. Does upgrading break anything for you?
Feb 22 2013
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra wrote:On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:I modified the library unzip to make a parallel unzip a while back (at the link below). The execution time scaled very well for the number of cpus for the test case I was using, which was a 2GB unzip'd distribution containing many small files and subdirectories. The parallel operations were by file. I think the execution time gains on ssd drives were from having multiple cores scheduling the writes to separate files in parallel. https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.dOn Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:Yeah... I played around too much, and the file is dirtier than ever. The good news is that I was able to test out what I was telling you about: accepting any range is ok: I used your ZFile range to plug it into my parser: I can now parse zipped files directly. The good news is that now, I'm not bottle necked by IO anymore! The bad news is that I'm now bottle necked by CPU decompressing. But since I'm using dmd, you may get better results with LDC or GDC. In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast, that it changes *nothing* to the final result. Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers. The last good news though is that CPU bottleneck is always better than IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads), and be able to process several files at once, multiplying your throughput.Some time fastq are comressed to gz bz2 or xz as that is often a huge file. Maybe we need keep in mind this early in developement and use std.zlibWhile working on making the parser multi-threaded compatible, I was able to seperate the part that feeds data, and the part that parses data. Long story short, the parser operates on an input range of ubyte[]: It is not responsible any more for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchroneus file reader, or a zip decompresser, or just stdin I guess. Range can be transient. However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorow (in about 18h).
Dec 18 2013
On 2013-02-12 17:28, bioinfornatics wrote:about threaded version is possible to use get file size function to split it in several thread. Use fseek read end of section return it to detect end of split to usedYes, but like already mentioned before, it only works well for SSD. For normal hard drives you'd want the data stored and accessed in sequence without jumping between cylinders whenever you switch threads. Do you store your data on an SSD?
Feb 12 2013