digitalmars.D.learn - How to read files fast (I/O operations)
- bioinfornatics (13/13) Feb 04 2013 Dear,
- bioinfornatics (5/5) Feb 04 2013 code: http://dpaste.dzfl.pl/79ab0e17
- FG (4/8) Feb 04 2013 Why are you using mmap? Don't you just go through the file sequentially?
- Dejan Lekic (8/20) Feb 04 2013 I would go even further, and organise the file so N Data objects fit one...
- monarch_dodra (15/37) Feb 04 2013 AFAIK, he is reading text data that needs to be parsed line by
- Brad Roberts (2/9) Feb 04 2013 First rule of performance analysis.. don't guess, measure.
- Jacob Carlborg (7/20) Feb 04 2013 He can still read a chunk from the file, or the whole file and then read...
- bioinfornatics (6/6) Feb 06 2013 instead to call mmFile opIndex to read ubyte by ubyte i tried to
- monarch_dodra (22/28) Feb 06 2013 This might be stupid, but I see a "writeln" in your inner loop.
- monarch_dodra (35/37) Feb 06 2013 Benchmarking and tweaking, I was able to find 3 things that
- bioinfornatics (4/4) Feb 06 2013 i use both gdc / ldc with "-w -O -release" flags
- bioinfornatics (18/22) Feb 06 2013 just to add more information about fastq
- bioinfornatics (2/2) Feb 06 2013 this/these
- monarch_dodra (30/31) Feb 06 2013 I don't think that's true. D provides the same "FILE" primitive
- monarch_dodra (6/10) Feb 06 2013 Hum, just read your example files. I guess you can have white. In
- Denis Shelomovskij (12/16) Feb 07 2013 Yes, lets also create 1 GiB XML files and ask for fast encoding/decoding...
- Jay Norwood (19/36) Dec 18 2013 There is a binary resource format for emf models, which normally
- FG (19/21) Feb 06 2013 Haven't compared to fastxtoolkit, but I have some code for you.
- monarch_dodra (19/36) Feb 06 2013 Do you mean my solution above? I tried your solution with dmd,
- bioinfornatics (5/5) Feb 06 2013 Thanks monarch and FG,
- Lee Braiden (21/23) Feb 06 2013 I wasn't going to mention this as I thought the CPU usage might be
- FG (5/14) Feb 06 2013 This wasn't an issue in my version of the program. It took 1m55s to proc...
- FG (38/50) Feb 06 2013 Yes. Maybe CPU is the bottleneck on my end.
- monarch_dodra (14/32) Feb 06 2013 Yes, that file exactly. That said, I'm working on an SSD, so
- FG (16/19) Feb 07 2013 Ah, now that you mention SSD, I moved the file onto one and it's even mo...
- bioinfornatics (18/18) Feb 07 2013 Little feed back
- monarch_dodra (17/35) Feb 07 2013 You must be using dmd 2.060. I'm using some 2.061 features:
- bioinfornatics (3/43) Feb 07 2013 ubyte as is a number is maybe easier to understand an cuttoff
- bioinfornatics (19/19) Feb 08 2013 And use size_t instead to int for getChar/getInt method as type
- monarch_dodra (39/58) Feb 08 2013 I've re-tried running both mine and FG's on a HDD based machine,
- bioinfornatics (5/5) Feb 09 2013 some idea such as letter counting:
- Ali Çehreli (5/6) Feb 06 2013 Going off topic a little, in a recent experiment, I have noticed that
- bioinfornatics (4/4) Feb 12 2013 instead to use memcpy I try with slicing ~ lines 136 :
- monarch_dodra (45/49) Feb 12 2013 I think I figured out why I'm getting different results than you
- bioinfornatics (5/55) Feb 12 2013 about threaded version is possible to use get file size function
- monarch_dodra (10/78) Feb 12 2013 You'd want to have 2 threads reading the same file at once? I
- FG (2/7) Feb 12 2013 Best to keep things simple when the potential benefits aren't certain. :...
- bioinfornatics (4/4) Feb 12 2013 Some time fastq are comressed to gz bz2 or xz as that is often a
- monarch_dodra (13/17) Feb 12 2013 While working on making the parser multi-threaded compatible, I
- monarch_dodra (22/42) Feb 13 2013 Yeah... I played around too much, and the file is dirtier than
- FG (7/11) Feb 13 2013 Great. Performance aside, we didn't talk much about how this data can be...
- bioinfornatics (8/10) Feb 14 2013 some idea such as letter counting:
- monarch_dodra (11/21) Feb 19 2013 OK. I posted the parser here:
- bioinfornatics (1/1) Feb 22 2013 arf I am always in dmdfe 2.060
- monarch_dodra (5/6) Feb 22 2013 AFAIK, the problems are mostly the "nothrows", and maybe 1 or 2
- Jay Norwood (10/60) Dec 18 2013 I modified the library unzip to make a parallel unzip a while
- FG (5/8) Feb 12 2013 Yes, but like already mentioned before, it only works well for SSD.
Dear all, I am looking to parse huge files efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes. My code is maybe not easy to follow, as it is not easy to parse a fastq file, and it is even harder when using a memory-mapped file. I do not see where I can gain performance, as I do not make many copies and I use MmFile. fastxtoolkit does not use a memory-mapped file and stores its result into a struct array for each sequence, but it is still faster! Thanks for any help; I hope we can create a faster parser, otherwise D is too slow to use instead of C++.
Feb 04 2013
code: http://dpaste.dzfl.pl/79ab0e17 fastxtoolkit: http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2 | - fastx_quality_stats.c -> read_file() | - libfastx/fastx.c -> fastx_read_next_record()
Feb 04 2013
On 2013-02-04 15:04, bioinfornatics wrote:I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min. My code is maybe not easy as is not easy to parse a fastq file and is more harder when using memory mapped file.Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks: foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
Feb 04 2013
FG wrote:On 2013-02-04 15:04, bioinfornatics wrote:I would go even further, and organise the file so N Data objects fit one page, and read the file page by page. The page-size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :) -- Dejan Lekic dejan.lekic (a) gmail.com http://dejan.lekic.orgI am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min. My code is maybe not easy as is not easy to parse a fastq file and is more harder when using memory mapped file.Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks: foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
Feb 04 2013
On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:FG wrote:AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach. I'm just wondering if maybe the reason the D code is slow is not just because of: - unicode. - front + popFront. ranges in D are "notorious" for being slow to iterate on text, due to the "double decode". If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.On 2013-02-04 15:04, bioinfornatics wrote:I would go even further, and organise the file so N Data objects fit one page, and read the file page by page. The page-size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :)I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min. My code is maybe not easy as is not easy to parse a fastq file and is more harder when using memory mapped file.Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks: foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
Feb 04 2013
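[Editor's note: to make the "iterate over raw bytes, skip Unicode decoding" suggestion above concrete, here is a minimal sketch. It is not code from the thread: it counts every byte value in the file over raw chunks (so identifier and quality lines are counted too), and the chunk size is arbitrary. It only illustrates the raw-byte iteration, not FASTQ parsing.]

    import std.stdio;

    void main(string[] args)
    {
        ulong[256] counts;                      // one counter per possible byte value
        auto file = File(args[1], "rb");
        foreach (ubyte[] chunk; file.byChunk(64 * 1024))
            foreach (b; chunk)
                ++counts[b];                    // no UTF-8 decoding, no front/popFront
        writefln("A:%s C:%s G:%s T:%s N:%s",
                 counts['A'], counts['C'], counts['G'], counts['T'], counts['N']);
    }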
On Mon, 4 Feb 2013, monarch_dodra wrote:AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach. I'm just wondering if maybe the reason the D code is slow is not just because of: - unicode. - front + popFront.First rule of performance analysis.. don't guess, measure.
Feb 04 2013
On 2013-02-04 20:39, monarch_dodra wrote:AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach.He can still read a chunk from the file, or the whole file and then read that chunk line by line.I'm just wondering if maybe the reason the D code is slow is not just because of: - unicode. - front + popFront. ranges in D are "notorious" for being slow to iterate on text, due to the "double decode". If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.Depending on what you're doing you can blast through the bytes even if it's Unicode. It will of course not validate the Unicode. -- /Jacob Carlborg
Feb 04 2013
Instead of calling mmFile's opIndex to read ubyte by ubyte, I tried reading into a buffer array of length PAGESIZE. Code here: http://dpaste.dzfl.pl/25ee34fc and it is not faster: to parse 12 GB I still need 11 minutes. I do not see how I could read the file faster! As a reminder, fastxtoolkit needs 2 minutes!
Feb 06 2013
On Wednesday, 6 February 2013 at 10:43:02 UTC, bioinfornatics wrote:instead to call mmFile opIndex to read ubyte by ubyte i tried to put into a buffer array of length PAGESIZE. code here: http://dpaste.dzfl.pl/25ee34fc and is not faster for 12Go to parse i need 11 minutes. I do not see how i could read faster the file! To remember fastxtoolkit need 2 min!This might be stupid, but I see a "writeln" in your inner loop. You aren't slowed down just by your console by any chance? If I were you, I'd start benching to try and see who is slowing you down. I'd reorganize the code to parse a file that is, say 512Mb. The rationale being you can place it entirely at once. Then, I'd shift the logic from "fully proccess each charater before moving to the next character" to "make a full processing pass on the entire data structure, before moving to the next pass". The steps I see that need to be measured are: * Raw read of file * Iterating on your file to extract it as a raw array of "Data" objects * Processing the Data objects * Outputting the data Also, (of course), you need to make sure you are compiling in release (might sound obvious, but you never know). Are you using dmd? I heard the "other" compilers are faster. I'm going to try and see with some example files if I can't get something running faster.
Feb 06 2013
On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra wrote:I'm going to try and see with some example files if I can't get something running faster.Benchmarking and tweaking, I was able to find 3 things that speeds up your program: 1) Make the computeLocal a compile time constant. This will give you a tinsy bit of performance. Depends on if you plan to make it a run-time argument switch I guess. 2) Makes things about 10%-20% faster: Your "nucleic" and "amino" hash tables map a character to an index. However, given the range of the characters ('A' to 'Z'), you are better off doing a flat array, where each index represents a character, eg: A is index 0, B is index 1. This way, lookup is a simple array indexing, as opposed to a hash table indexing. You may even get a bigger bang for your buck by simply giving your "_stats" structure a simple "A is index 0, B is index 1", and only "re-order" the data at the end, when you want to read it. (I haven't done this though). 3) Makes things about 100% faster (ran in half the time on my machine): I don't know how mmFile works, but a simple File + "rawRead" seems to get the job done fast. Also, instead of keeping track of an (several) indexes, I merely keep a single slice. The only thing I care about, is if my slice is empty, in which case I re-fill it. The modified code is here. I'm apparently getting the same output you are, but that doesn't mean there might not be bugs in it. For example, I noticed that you don't strip leading whites, if any, before the first read. http://dpaste.dzfl.pl/9b9353b8 ---- I'd be tempted to re-write the parser using a "byLine" approach, since my quick reading about fastq seems to imply it is a line based format. Or just plain try to write a parser from scratch, putting my own logic and thought into it (all I did was modify your code, without caring about the actual algorithm)
Feb 06 2013
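[Editor's note: a hedged sketch of point 2 above - replacing a char-to-index hash table with a flat 256-entry lookup table. The names (nucleicIndex, stats) are illustrative, not the actual variables from the dpaste code; unexpected characters fall into a spare slot.]

    import std.stdio;

    immutable size_t[256] nucleicIndex = buildIndex();

    size_t[256] buildIndex()
    {
        size_t[256] idx;            // any unexpected character maps to slot 0
        idx['A'] = 1; idx['C'] = 2; idx['G'] = 3; idx['T'] = 4; idx['N'] = 5;
        return idx;
    }

    void main()
    {
        ulong[6] stats;                         // slot 0 collects "other"
        foreach (char c; "ACGTNACGT")
            ++stats[nucleicIndex[c]];           // plain array indexing, no hashing
        writeln(stats);
    }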
I use both GDC and LDC with the "-w -O -release" flags. The writeln inside the loop is never evaluated, as the computeLocal boolean is always false. Thanks in any case; I am continuing to read all your answers :-)
Feb 06 2013
On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
> I use both GDC and LDC with the "-w -O -release" flags. The writeln inside the loop is never evaluated, as the computeLocal boolean is always false. Thanks in any case; I am continuing to read all your answers :-)
Just to add more information about fastq: http://www.biomedsearch.com/nih/Sanger-FASTQ-file-format-sequences/20015970.html And here is a set of fastq files where a parser should succeed or fail: http://www.biomedsearch.com/attachments/00/20/01/59/20015970/gkp1137_nar-02248-d-2009-File005.gz
The problem is that a sequence line can be split across several lines, and the same goes for the quality line. And I think these lines should be allowed to contain whitespace. The '@' is used to identify an identifier line and the '+' is used to identify a description line, but these characters can also appear as quality values (ubyte). I agree the spec format is really bad, but it is heavily used in biology, so I would like a fast parser in order to develop some D applications instead of using C++. I will try all the previous recommendations later; thanks to all. It seems that, in any case, it is not easy to parse a file fast in D. Note: is it possible to lock a file, to be able to use pure methods?
Feb 06 2013
On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics wrote:It seem in any case is not easy to parse fastly a file in DI don't think that's true. D provides the same "FILE" primitive you'd get in C, so there is no reason for it to be slower than C. It is the "range" approach that, as convenient as it is, is not well adapted for certain things. As I had said, I tried to write my own program. In it, I devised a range that, instead of exposing things to parse character by character, parses an entire "object" (a ... "genome" ... maybe ? I called them "Q" in my program) at once into an object. I decided to use the very simple "byLine" primitive. From there, you can query the object for their name/sequence/quality. The irony is that by "parsing twice" (once to do the io read, once to do the actual processing of the text), and taking into account I'm allocating each object individually, I'm running twice as fast as my original already improved implementation. Not only is it faster, it is also more convenient, since you can extract an entire Q object at once, and then operate on that as you would so please: Separation of algorithm and parsing. It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character. Now: Keep in mind that this approach allocates (3) new strings for each Q. You could *try* an approach with a pre-allocated re-useable buffer. This would mean you can only operate on 1 Q at once, but you'd probably iterate on them faster. In any case, you can try it out: http://dpaste.dzfl.pl/8bdd0c84
Feb 06 2013
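[Editor's note: a rough sketch of what the byLine-based record approach described above can look like. It assumes the common 4-line record layout (no wrapped sequence or quality lines), loads everything eagerly, and uses illustrative names (Q, readRecords) rather than the actual types from the dpaste link. A real version would be a lazy range, as the post describes.]

    import std.stdio, std.exception;

    struct Q
    {
        string id;
        string sequence;
        string quality;
    }

    Q[] readRecords(File f)
    {
        Q[] records;
        string[4] block;
        size_t n;
        foreach (line; f.byLine())          // byLine reuses its buffer...
        {
            block[n++] = line.idup;         // ...so dup what we keep
            if (n == 4)
            {
                enforce(block[0].length && block[0][0] == '@', "bad record header");
                records ~= Q(block[0][1 .. $], block[1], block[3]);
                n = 0;
            }
        }
        return records;
    }

    void main(string[] args)
    {
        foreach (q; readRecords(File(args[1], "r")))
            writeln(q.id, ": ", q.sequence.length, " bases");
    }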
On Wednesday, 6 February 2013 at 16:06:20 UTC, monarch_dodra wrote:
> It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character.
Hum, I just read your example files. I guess you can have whitespace after all. In any case, that should not pose any real problem. would come in handy here.
Feb 06 2013
On 06.02.2013 19:40, bioinfornatics wrote:
> On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote: I agree the spec format is really bad but it is heavily used in biology so i would like a fast parser to develop some D application instead to use C++.
Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding! The situation can be improved only if: 1. We will find and kill every text format creator; 2. We will create a really good binary format for each such task and support it in every application we create. So after some time text formats will just die because of evolution, as everything will support better formats. (the second proposal is a real recommendation) -- Денис В. Шеломовский Denis V. Shelomovskij
Feb 07 2013
On Friday, 8 February 2013 at 06:22:18 UTC, Denis Shelomovskij wrote:06.02.2013 19:40, bioinfornatics пишет:There is a binary resource format for emf models, which normally use xml files, and some timing improvements stated at this link. It might be worth looking at this if you are thinking about writing your own binary format. http://www.slideshare.net/kenn.hussey/performance-and-extensibility-with-emf There is also a fast binary compression library named blosc that is used in some python utilities, measured and presented here, showing that it is faster than doing a memcpy if you have multiple cores. http://blosc.pytables.org/trac On the sequential accesses ... I found that windows writes blocks of data all over the place, but the best way to get it to write something in more contiguous locations is to modify the file output routines to use specify write through. The sequential accesses didn't improve read times on ssd. Most of the decent ssds can read big files at 300MB/sec or more now, and you can raid 0 a few of them and read 800MB/sec.On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote: I agree the spec format is really bad but it is heavily used in biology so i would like a fast parser to develop some D application instead to use C++.Yes, lets also create 1 GiB XML files and ask for fast encoding/decoding! The situation can be improved only if: 1. We will find and kill every text format creator; 2. We will create a really good binary format for each such task and support it in every application we create. So after some time text formats will just die because of evolution as everything will support better formats. (the second proposal is a real recommendation)
Dec 18 2013
On 2013-02-04 15:04, bioinfornatics wrote:I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min.Haven't compared to fastxtoolkit, but I have some code for you. I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so far -- all compiled with gdc -O3. I bet your computer has better specs than mine. Program uses a buffer that should be twice the size of the largest sequence record (counting id, comment and quality data). A chunk of file is read, then records are scanned on the buffer until record start pointer passes the middle of the buffer -- then memcpy is used to move all the rest to the begining of the buffer and the remaining space at the end is filled with another chunk read from the file. Data contains both sequence letter and associated quality information. Sequence ID and comment are slices of the buffer, so they have valid info until you move to the next sequence (and the number increments). This is the code: http://dpaste.1azy.net/8424d4ac Tell me what timings you can get now.
Feb 06 2013
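[Editor's note: a sketch of the sliding-buffer scheme FG describes - scan records in a fixed buffer, and once the scan position passes the middle, move the unread tail to the front and top the buffer up from the file. This is not FG's actual code; names and sizes are illustrative, and the record scanning itself is stubbed out.]

    import std.stdio;
    import core.stdc.string : memmove;

    /// Moves buf[pos .. filled] to the front and reads more data after it.
    /// Returns the new number of valid bytes in buf.
    size_t refill(ref File f, ubyte[] buf, size_t pos, size_t filled)
    {
        auto rest = filled - pos;
        memmove(buf.ptr, buf.ptr + pos, rest);          // keep the partial record
        auto got = f.rawRead(buf[rest .. $]).length;    // top up from the file
        return rest + got;
    }

    void main(string[] args)
    {
        enum bufferSize = 1 << 20;      // should exceed twice the largest record
        auto f = File(args[1], "rb");
        auto buf = new ubyte[](bufferSize);
        size_t filled = f.rawRead(buf).length;
        size_t pos = 0;

        while (filled > pos)
        {
            // ... scan one record starting at buf[pos] and advance pos past it ...
            pos = filled;                       // placeholder: "consume" everything
            if (pos > bufferSize / 2)           // past the middle: recycle the buffer
            {
                filled = refill(f, buf, pos, filled);
                pos = 0;
            }
        }
    }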
On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:On 2013-02-04 15:04, bioinfornatics wrote:Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).I am looking to parse efficiently huge file but i think D lacking for this purpose. To parse 12 Go i need 11 minutes wheras fastxtoolkit (written in c++ ) need 2 min.Haven't compared to fastxtoolkit, but I have some code for you. I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so farData contains both sequence letter and associated quality information. Sequence ID and comment are slices of the buffer, so they have valid info until you move to the next sequence (and the number increments).Hum. Mine allocates new slices, so they are never invalidated :) Mine also takes into account newlines and and lowercase sequences. Still, it seems you and I both took different approaches. I had mentioned using a re-useable buffer. I'm going to try to consume some of your code to see if I can't improve my implementation. bioinfornatics I'm getting real interested on the subject. I'm going to try to write an actual library/framework for working with fastq files in a D environment. This means I'll try to write robust and useable code, with both stability and performance in mind, as opposed to the "proofs of concepts in so far". For now, I'd like to keep it simple: Would something that only knows how to parse/write Sanger FASTQ files be of help to you? If I write something, can I have you review it?
Feb 06 2013
Thanks monarch and FG, I will read your code to see where I am failing :-) And of course, if you are interested in bio formats, I would be really happy to work on and review them together. In any case, big thanks; this is a very interesting subject.
Feb 06 2013
On 06/02/13 22:21, bioinfornatics wrote:
> Thanks monarch and FG, I will read your code to see where I am failing :-)
I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task. I guess you'd need a big task: the task would need to load and return n chunks or n lines, rather than just one line at a time, for example, and the processing/parsing thread (main thread or otherwise) could then churn through that while more IO was done. It would also depend on the size of the file: no point firing up a thread just to read a tiny file that the filesystem can return in a millisecond. If you're talking about 1+ minutes of loading though, a thread should definitely help. Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you could split the file in half, find the remaining parts of the record in the second half, move it to the first, and then process the two halves in two threads. If you've a nice function to do that split cleanly, and n cpus, then just call it some more. -- Lee
Feb 06 2013
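[Editor's note: a minimal sketch of the reader-thread idea above, using std.concurrency: one thread reads chunks and sends them over a message queue, the main thread consumes them. The chunk size and the trivial "processing" (byte counting) are placeholders, not code from the thread.]

    import std.stdio;
    import std.concurrency;

    void reader(Tid parent, string path)
    {
        auto f = File(path, "rb");
        foreach (chunk; f.byChunk(1 << 20))
            send(parent, chunk.idup);       // idup: byChunk reuses its buffer
        send(parent, cast(immutable(ubyte)[]) null);    // end-of-file marker
    }

    void main(string[] args)
    {
        spawn(&reader, thisTid, args[1]);
        ulong total;
        for (;;)
        {
            // note: without setMaxMailboxSize the mailbox can grow without bound
            auto chunk = receiveOnly!(immutable(ubyte)[])();
            if (chunk is null) break;
            total += chunk.length;          // real code would parse records here
        }
        writeln("bytes seen: ", total);
    }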
On 2013-02-07 00:41, Lee Braiden wrote:I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task.This wasn't an issue in my version of the program. It took 1m55s to process the file, but then again it takes 1m44s just to read it (as shown previously).Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you cold split the file in half, find the remaining parts of the record in the second half, move it to the first, and then process the two halves in two threads. If you've a nice function to do that split cleanly, and n cpus, then just call it some more.Now, this could make a big difference! If only parsing out of order is acceptable in this case.
Feb 06 2013
On 2013-02-06 21:43, monarch_dodra wrote:
> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>> I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB, processing took 1m45s - twice as fast as the fastest D solution so far
> Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
Yes. Maybe CPU is the bottleneck on my end. With DMD32 2.060 on win7-64 compiled with the same flags I got: MD: 4m30s / FG: 1m55s - both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s?
> I'm getting real interested on the subject. I'm going to try to write an actual library/framework for working with fastq files in a D environment.
Those fastq are contagious. ;)
> This means I'll try to write robust and useable code, with both stability and performance in mind, as opposed to the "proofs of concepts in so far".
Yeah, but the big deal was that D is 5.5x slower than C++. You have mentioned something about using byLine. Well, I would have gladly used it instead of looking for line ends myself and pushing stuff with memcpy. But the thing is that while the fgets(char *buf, int bufSize, FILE *f) in fastx is fast at reading a file by line, using file.readln(buf) is unpredictable. :) I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC it can be several times slower. For example, just reading in a loop:

    import std.stdio;

    enum uint bufferSize = 4096 - 16;

    void main(string[] args)
    {
        char[] tmp, buf = new char[bufferSize];
        size_t cnt;
        auto f = File(args[1], "r");
        switch(args[2])
        {
            case "raw":
                do tmp = f.rawRead(buf); while (tmp.length);
                break;
            case "readln":
                do cnt = f.readln(buf); while (cnt);
                break;
            default:
                writeln("Use parameters: <filename> raw|readln");
        }
    }

Tested on a much smaller SRR077487.filt.fastq:
    DMD32 -release -O -inline: raw 94ms / readln 450ms
    GDC64 -O3: raw 94ms / readln 6.76s
Tested on SRR077487_1.filt.fastq:
    DMD32 -release -O -inline: raw 1m44s / readln 1m55s
    GDC64 -O3: raw 1m48s / readln 14m16s

Why such a big difference between DMD and GDC (on Windows)? (Or have I missed some switch in GDC?)
Feb 06 2013
On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:On 2013-02-06 21:43, monarch_dodra wrote:Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are? My attempt was mostly to try and see how fast we could go, while doing it only with high level stuff (eg, no fSomething calls). Probably, going lower level, and parsing the text manually, waiting for magic characters could yield better result (like what you did). I'm going to also try playing around with threads: Just last week I wrote a program that did exactly this (asynchronous file reads). That said, I'll be making this priority n°2. I'd like to make the parser work perfectly first, and in a way that is easily upgradeable/useable. Mr. bio made it perfectly clear that he needed support for whites and line feeds ;)On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:Yes. Maybe CPU is the bottleneck on my end. With DMD32 2.060 on win7-64 compiled with same flags I got: MD: 4m30 / FG: 1m55s - both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s?I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and expect this syntax (no multiline sequences or whitespace). File takes up almost 6 GB processing took 1m45s - twice as fast as the fastest D solution so farDo you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).
Feb 06 2013
On 2013-02-07 08:26, monarch_dodra wrote:Ah, now that you mention SSD, I moved the file onto one and it's even more clear that I am CPU-bound here on the Intel E6600 system. Compare: 7200rpm: MS 4m30s / FG 1m55s SSD: MS 4m14s / FG 1m44s Almost the same, but running the utility "wc -l" on the file renders: 7200rpm: 1m45s SSD: 0m33s In my case threads would be beneficial but only when using the SSD. Reading the file by chunk in D takes 33s on SSD and 1m44s on HDD. Slicing the file in half and reading from both threads would also be fine only on the SSD, because on a HDD I'd lose sequential disk reads jumping between threads (expecting lower performance). Therefore - threads: yes, but gotta use an SSD. :) Also, threads: yes, if there's gonna be more processing than just counting letters.You have timed the same file SRR077487_1.filt.fastq at 67s?Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are?
Feb 07 2013
A little feedback. I named FG's script "f" and monarch's script "monarch".

    gdmd -O -w -release f.d
    ~ $ time ./f bigFastq.fastq
    ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]
    real 2m14.966s
    user 0m47.168s
    sys 0m15.379s

    ~ $ gdmd -O -w -release monarch.d
    monarch.d:117: no identifier for declarator Lines
    monarch.d:117: alias cannot have initializer
    monarch.d:130: identifier or integer expected, not assert

I haven't taken the time to look into it more, but in any case it seems memory-mapped files are really slow, whereas they are said to be the fastest way to read a file. Creating an index, where just reading the file needs 12 minutes, is useless when reading and computing take only 2 minutes.
Feb 07 2013
On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:Little feed back i named f the f's script and monarch the monarch's script gdmd -O -w -release f.d ~ $ time ./f bigFastq.fastq ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772] real 2m14.966s user 0m47.168s sys 0m15.379s ~ $ gdmd -O -w -release monarch.d monarch.d:117: no identifier for declarator Lines monarch.d:117: alias cannot have initializer monarch.d:130: identifier or integer expected, not assert i haven't take the time to look more but in any case it seem memory mapped file is really slowly whereas it is said that is the faster way to read file. Create an index where reading the file need 12 min that is useless as for read and compute you need 2 minYou must be using dmd 2.060. I'm using some 2.061 features: Namelly "new style alias". Just change line 117: alias Lines = typeof(File.init.byLine()); to alias typeof(File.init.byLine()) Lines; As for 130, it's a "version(assert)" eg, code that does not get executed in release. Just remove the "version(assert)", if it gets executed, it is not a big deal. In any case, I think the code is mostly "proof", I wouldn't use it as is. ------------ BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
Feb 07 2013
On Thursday, 7 February 2013 at 14:42:57 UTC, monarch_dodra wrote:On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:ubyte as is a number is maybe easier to understand an cuttoff some valueLittle feed back i named f the f's script and monarch the monarch's script gdmd -O -w -release f.d ~ $ time ./f bigFastq.fastq ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772] real 2m14.966s user 0m47.168s sys 0m15.379s ~ $ gdmd -O -w -release monarch.d monarch.d:117: no identifier for declarator Lines monarch.d:117: alias cannot have initializer monarch.d:130: identifier or integer expected, not assert i haven't take the time to look more but in any case it seem memory mapped file is really slowly whereas it is said that is the faster way to read file. Create an index where reading the file need 12 min that is useless as for read and compute you need 2 minYou must be using dmd 2.060. I'm using some 2.061 features: Namelly "new style alias". Just change line 117: alias Lines = typeof(File.init.byLine()); to alias typeof(File.init.byLine()) Lines; As for 130, it's a "version(assert)" eg, code that does not get executed in release. Just remove the "version(assert)", if it gets executed, it is not a big deal. In any case, I think the code is mostly "proof", I wouldn't use it as is. ------------ BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
Feb 07 2013
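[Editor's note: the Sanger FASTQ format referenced earlier in the thread encodes quality as (Phred score + 33) in printable ASCII, so serving quality as ubytes - as preferred above - is a one-line conversion. A small sketch with an illustrative function name; note that older Illumina variants used an offset of 64 instead of 33.]

    import std.stdio;
    import std.algorithm : map;
    import std.array : array;
    import std.string : representation;

    ubyte[] qualityScores(const(char)[] qual)
    {
        // Sanger FASTQ stores (Phred score + 33) as a printable ASCII character.
        return qual.representation.map!(q => cast(ubyte)(q - 33)).array;
    }

    void main()
    {
        writeln(qualityScores("II?+!"));    // prints [40, 40, 30, 10, 0]
    }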
And use size_t instead of int as the return type for the getChar/getInt methods.

    gdmd -w -O -release monarch.d
    ~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
    globalStats: A: 1007129068. C: 1350576504. G: 1353023772. M: 0. D: 0. S: 0. H: 0. N: 39413. V: 0. U: 0. W: 0. R: 0. B: 0. Y: 0. K: 0. T: 999786820. time: 176585
    real 2m56.635s
    user 2m31.376s
    sys 0m23.077s

This program is a little slower than FG's program. About the parser: I would like to create a set of biology parsers and put them into a library, along with a set of common computations such as a letter counter. For example, you could run a letter-counting computation over a fasta or fastq file, or rename identifiers throughout a fasta or fastq file.
Feb 08 2013
On Friday, 8 February 2013 at 09:08:48 UTC, bioinfornatics wrote:
> And use size_t instead of int as the return type for the getChar/getInt methods.
>
>     gdmd -w -O -release monarch.d
>     ~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
>     globalStats: A: 1007129068. C: 1350576504. G: 1353023772. M: 0. D: 0. S: 0. H: 0. N: 39413. V: 0. U: 0. W: 0. R: 0. B: 0. Y: 0. K: 0. T: 999786820. time: 176585
>     real 2m56.635s
>     user 2m31.376s
>     sys 0m23.077s
>
> This program is a little slower than FG's program.
I've re-tried running both mine and FG's on an HDD-based machine, with dmd, -O -release, and optionally -inline. I also wrote a new parser, which does as FG suggested, and just parses straight up (byLine is indeed more expensive). This one handles whitespace and line breaks correctly. It also accepts lines of any size (the internal buffer is auto-grown). My results are different from yours though:

             w/o inline   w/ inline
    FG       105s         77s
    MD       72s          64s
    newMD    61s          59s

I have no idea why you guys are getting better results with FG's version, and I'm getting better results with mine. Is this a win/linux or dmd/gdc issue? My new parser is based on raw reads, so that should be much faster on your machines.
> About the parser: I would like to create a set of biology parsers and put them into a library, along with a set of common computations such as a letter counter. For example, you could run a letter-counting computation over a fasta or fastq file, or rename identifiers throughout a fasta or fastq file.
I don't really understand what all that means. In any case, I've been able to implement some cool features so far. My parser is a "true" range you can pass around, and you won't have any problems with it. It returns "shallow" objects that reference a mutable string; however, the user can call "dup" or "idup" to get a new object. Said objects can be printed directly, so there is no need for a specialized "writer". As a matter of fact, this little program will allow you to "clean" a file (strip spaces), and potentially line-wrap at 80 chars:

    //----
    import std.stdio;
    import fastq.parser;
    import fastq.q;

    void main(string[] args)
    {
        Parser parser = new Parser(args[1]);
        File output = File(args[2], "wb");
        foreach(entry; parser)
            writefln("%80s", entry);
    }
    //----

I'll submit it for your review, once it is perfectly implemented.
Feb 08 2013
Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences at a quality-value cutoff
- converting to a binary format
More ideas later.
Feb 09 2013
On 02/06/2013 12:43 PM, monarch_dodra wrote:with dmd, with -release -O -inlineGoing off topic a little, in a recent experiment, I have noticed that adding -inline made a range solution twice slower. -O -release still helped but -inline was the culprit. Ali
Feb 06 2013
Instead of using memcpy I tried slicing, around line 136: _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get the same performance.
Feb 12 2013
On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:
> Instead of using memcpy I tried slicing, around line 136: _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get the same performance.
I think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worse, just 0%. I'd have to try again on my SSD. Right now, I'm parsing the 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the file takes the EXACT same amount of time as parsing it...

This takes 60 seconds.

    //----
    auto input = File(args[1], "rb");
    ubyte[] buffer = new ubyte[](BufferSize);
    do{
        buffer = input.rawRead(buffer);
    }while(buffer.length);
    //----

This takes 60 seconds too.

    //----
    Parser parser = new Parser(args[1]);
    foreach(q; parser)
        foreach(char c; q.sequence)
            globalNucleic.collect(c);
    //----

So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now. I'm also implementing correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that: a) You can inspect the erroneous data b) You can skip the erroneous data, and parse the rest of the file. Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done on it in parallel.
Feb 12 2013
On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:about threaded version is possible to use get file size function to split it in several thread. Use fseek read end of section return it to detect end of split to usedinstead to use memcpy I try with slicing ~ lines 136 : _hardBuffer[ 0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get same perfI think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worst, just 0%. I'd have to try again on my SSD. Right now, I'm parsing the file 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the files takes the EXACT same amount of time as parsing it... This takes 60 seconds. //---- auto input = File(args[1], "rb"); ubyte[] buffer = new ubyte[](BufferSize); do{ buffer = input.rawRead(buffer); }while(buffer.length); //---- This takes 60 seconds too. //---- Parser parser = new Parser(args[1]); foreach(q; parser) foreach(char c; q.sequence) globalNucleic.collect(c); } //---- So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now. I'm also implementing a correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that: a) You can inspect the erroneous data b) You can skip the erroneous data, and parse the rest of the file. Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done parallel on it.
Feb 12 2013
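[Editor's note: a rough sketch of the fseek/offset-splitting idea above - jump to an approximate offset, then scan forward for the start of a record before handing that offset to a worker. Resyncing on a line that starts with '@' is NOT fully robust for FASTQ ('@' can occur in quality strings, as noted earlier in the thread), and as later replies point out this mainly pays off on SSDs. Treat this purely as an illustration.]

    import std.stdio;

    ulong findRecordStart(string path, ulong approx)
    {
        auto f = File(path, "rb");
        f.seek(approx);
        f.readln();                     // drop the (probably partial) current line
        ulong offset = f.tell();
        foreach (line; f.byLine(KeepTerminator.yes))
        {
            if (line.length && line[0] == '@')
                return offset;          // candidate record header
            offset += line.length;
        }
        return offset;                  // EOF: no further record
    }

    void main(string[] args)
    {
        auto size = File(args[1]).size;
        auto mid = findRecordStart(args[1], size / 2);
        writefln("split file of %s bytes at offset %s", size, mid);
        // each worker would then parse its [start, end) slice with its own File handle
    }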
On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics wrote:On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:You'd want to have 2 threads reading the same file at once? I don't think there is much more to be gained anyways, since the IO is the bottleneck anyways. A better approach would be to have 1 file reader that passes data to two simultaneous parsers. This, however, would make things scary complicated, and I'd doubt we'd even get much better results: I was not able to measure the actual amount of time spent working when compared to the time spent reading the file.On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:about threaded version is possible to use get file size function to split it in several thread. Use fseek read end of section return it to detect end of split to usedinstead to use memcpy I try with slicing ~ lines 136 : _hardBuffer[ 0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition]; I get same perfI think I figured out why I'm getting different results than you guys are, on my windows machine. AFAIK, file reads in windows are done natively asynchronously. I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file, while the main thread parses the read buffers. I'm getting EXACTLY 0% performance improvement. Not better, not worst, just 0%. I'd have to try again on my SSD. Right now, I'm parsing the file 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the files takes the EXACT same amount of time as parsing it... This takes 60 seconds. //---- auto input = File(args[1], "rb"); ubyte[] buffer = new ubyte[](BufferSize); do{ buffer = input.rawRead(buffer); }while(buffer.length); //---- This takes 60 seconds too. //---- Parser parser = new Parser(args[1]); foreach(q; parser) foreach(char c; q.sequence) globalNucleic.collect(c); } //---- So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing. I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now. I'm also implementing a correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file, and placed inside the exception. This means that: a) You can inspect the erroneous data b) You can skip the erroneous data, and parse the rest of the file. Once I deliver the code with the multi-threaded code activated, you should get some better performance on Linux. When "1.0" is ready, I'll create a github project for it, so work can be done parallel on it.
Feb 12 2013
On 2013-02-12 17:45, monarch_dodra wrote:A better approach would be to have 1 file reader that passes data to two simultaneous parsers. This, however, would make things scary complicated, and I'd doubt we'd even get much better results: I was not able to measure the actual amount of time spent working when compared to the time spent reading the file.Best to keep things simple when the potential benefits aren't certain. :)
Feb 12 2013
Sometimes fastq files are compressed with gz, bz2 or xz, as they are often huge. Maybe we need to keep this in mind early in development and use std.zlib.
Feb 12 2013
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:
> Sometimes fastq files are compressed with gz, bz2 or xz, as they are often huge. Maybe we need to keep this in mind early in development and use std.zlib.
While working on making the parser multi-threaded compatible, I was able to separate the part that feeds data from the part that parses data. Long story short, the parser operates on an input range of ubyte[]: it is no longer responsible for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, or a zip decompressor, or just stdin I guess. The range can be transient. However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorrow (in about 18h).
Feb 12 2013
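[Editor's note: following up on the compressed-input point, here is a rough sketch of building a gzip-decompressing range of ubyte[] chunks with std.zlib, suitable for feeding a parser that accepts any input range of ubyte[]. The Parser integration is assumed and not shown, and a complete version would also emit decomp.flush() after the last chunk; this is not the thread's actual code.]

    import std.stdio;
    import std.zlib : UnCompress, HeaderFormat;
    import std.algorithm : map;

    auto gzipChunks(string path, size_t chunkSize = 1 << 20)
    {
        auto decomp = new UnCompress(HeaderFormat.gzip);
        return File(path, "rb")
            .byChunk(chunkSize)
            // dup because byChunk reuses its buffer between fronts
            .map!(raw => cast(ubyte[]) decomp.uncompress(raw.dup));
    }

    void main(string[] args)
    {
        ulong total;
        foreach (chunk; gzipChunks(args[1]))
            total += chunk.length;          // a parser would consume chunks here
        writeln("decompressed bytes: ", total);
    }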
On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:Yeah... I played around too much, and the file is dirtier than ever. The good news is that I was able to test out what I was telling you about: accepting any range is ok: I used your ZFile range to plug it into my parser: I can now parse zipped files directly. The good news is that now, I'm not bottle necked by IO anymore! The bad news is that I'm now bottle necked by CPU decompressing. But since I'm using dmd, you may get better results with LDC or GDC. In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast, that it changes *nothing* to the final result. Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers. The last good news though is that CPU bottleneck is always better than IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads), and be able to process several files at once, multiplying your throughput.Some time fastq are comressed to gz bz2 or xz as that is often a huge file. Maybe we need keep in mind this early in developement and use std.zlibWhile working on making the parser multi-threaded compatible, I was able to seperate the part that feeds data, and the part that parses data. Long story short, the parser operates on an input range of ubyte[]: It is not responsible any more for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchroneus file reader, or a zip decompresser, or just stdin I guess. Range can be transient. However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorow (in about 18h).
Feb 13 2013
On 2013-02-13 18:39, monarch_dodra wrote:In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast, that it changes *nothing* to the final result.Great. Performance aside, we didn't talk much about how this data can be useful - should it only be read sequentially forward or both ways, would there be a need to place some markers or slice the sequence, etc. Our small test case was only about counting nucleotides, so reading order and possibility of further processing was irrelevant. Mr.Bio, what usage cases you'll be interested in, other than those counters?
Feb 13 2013
> Mr.Bio, what usage cases you'll be interested in, other than those counters?
Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences at a quality-value cutoff (one simple interpretation is sketched below)
- converting to a binary format
- converting to fasta + sff
- merging close sequences into one consensus
- creating a de Bruijn graph
More ideas later.
Feb 14 2013
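[Editor's note: the quality-trimming item above can be read several ways; one simple interpretation - an assumption, not something specified in the thread - is to drop trailing bases whose Phred+33 score falls below a cutoff. Names are illustrative.]

    import std.stdio;
    import std.typecons : Tuple, tuple;

    Tuple!(string, string) trimTail(string seq, string qual, ubyte cutoff)
    {
        assert(seq.length == qual.length);      // FASTQ pairs them one-to-one
        size_t keep = qual.length;
        while (keep > 0 && cast(ubyte)(qual[keep - 1] - 33) < cutoff)
            --keep;                             // walk back over the low-quality tail
        return tuple(seq[0 .. keep], qual[0 .. keep]);
    }

    void main()
    {
        auto r = trimTail("ACGTACGT", "IIIIII#!", 20);
        writeln(r[0], " / ", r[1]);             // prints "ACGTAC / IIIIII"
    }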
On Thursday, 14 February 2013 at 18:31:35 UTC, bioinfornatics wrote:OK. I posted the parser here: http://dpaste.dzfl.pl/37b893ed This runs on the 2.061. I'll have to make a few changes if you need it to run 2.060, to get around some 2.060 specific bugs. This contains strictly only the parser. If you want, I'll post the async file reading stuff I wrote to interface with it. The example sections should give you a quick idea of how to use it. Tell me what you think about it.Mr.Bio, what usage cases you'll be interested in, other than those counters?some idea such as letter counting: rename identifier trimming sequence from quality value to cutoff convert to a binary format convert to fasta + sff merge close sequence to one concenus create a brujin graph more idea later
Feb 19 2013
Argh, I am still on dmdfe 2.060.
Feb 22 2013
On Friday, 22 February 2013 at 08:53:35 UTC, bioinfornatics wrote:
> Argh, I am still on dmdfe 2.060.
AFAIK, the problems are mostly the "nothrows", and maybe 1 or 2 "new style" alias declarations. That said, what's stopping you from upgrading? We are at 2.062 right now. Does upgrading break anything for you?
Feb 22 2013
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra wrote:On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:I modified the library unzip to make a parallel unzip a while back (at the link below). The execution time scaled very well for the number of cpus for the test case I was using, which was a 2GB unzip'd distribution containing many small files and subdirectories. The parallel operations were by file. I think the execution time gains on ssd drives were from having multiple cores scheduling the writes to separate files in parallel. https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.dOn Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:Yeah... I played around too much, and the file is dirtier than ever. The good news is that I was able to test out what I was telling you about: accepting any range is ok: I used your ZFile range to plug it into my parser: I can now parse zipped files directly. The good news is that now, I'm not bottle necked by IO anymore! The bad news is that I'm now bottle necked by CPU decompressing. But since I'm using dmd, you may get better results with LDC or GDC. In any case, I am now parsing the 6Gig packed into 1.5Gig in about 53 seconds (down from 61). I also tried doing a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast, that it changes *nothing* to the final result. Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers. The last good news though is that CPU bottleneck is always better than IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads), and be able to process several files at once, multiplying your throughput.Some time fastq are comressed to gz bz2 or xz as that is often a huge file. Maybe we need keep in mind this early in developement and use std.zlibWhile working on making the parser multi-threaded compatible, I was able to seperate the part that feeds data, and the part that parses data. Long story short, the parser operates on an input range of ubyte[]: It is not responsible any more for acquisition of data. The range can be a simple (wrapped) File, a byChunk, an asynchroneus file reader, or a zip decompresser, or just stdin I guess. Range can be transient. However, now that you mention it, I'll make sure it is correctly supported. I'll *try* to show you what I have so far tomorow (in about 18h).
Dec 18 2013
On 2013-02-12 17:28, bioinfornatics wrote:about threaded version is possible to use get file size function to split it in several thread. Use fseek read end of section return it to detect end of split to usedYes, but like already mentioned before, it only works well for SSD. For normal hard drives you'd want the data stored and accessed in sequence without jumping between cylinders whenever you switch threads. Do you store your data on an SSD?
Feb 12 2013