digitalmars.D.learn - parsing fastq files with D
- eastanon (50/50) Mar 23 2016 Fastq is a format for storing DNA sequences together with the
- rikki cattermole (136/136) Mar 23 2016 As a little fun thing to do I implemented it for you.
- eastanon (12/19) Mar 24 2016 Thank you very much. I think you have exposed me to a number of
- rikki cattermole (17/36) Mar 24 2016 Yup, any string usage gets allocated in memory. Since the way I designed...
- Marc =?UTF-8?B?U2Now7x0eg==?= (14/35) Mar 24 2016 Yes, it's read into your processes memory. You can use std.mmfile
- eastanon (3/17) Mar 24 2016 That is very clever. Thank you for the tip and I have implemented
- eastanon (4/27) Mar 29 2016 Do you mind to explain what is really going on at popFront() and
- rikki cattermole (10/36) Mar 29 2016 popFront is a special function used with input ranges.
Fastq is a format for storing DNA sequences together with the associated quality information often encoded in ascii characters. It is typically made of 4 lines for example 2 fastq entries would look like this. seq1 TTATTTTAAT + ?+BBB/DHH seq2 GACCCTTTGCA + ?+BHB/DIH I do not have a lot of D expirience and I am writing a simple parser to help work with these files. Ideally it should be fast with low memory footprint. I am working with very large files of this type and can be up to 1GB. module fastq; import std.stdio; import std.file; import std.exception; import std.algorithm; import std.string; struct Record{ string sequence; string quals; string name; } auto Records(string filename){ static auto toRecords(S)(S str){ auto res = findSplitBefore(str,"+\n"); auto seq = res[0]; auto qual = res[1]; return Record(seq,qual); } string text = cast(string)std.file.read(filename); enforce(text.length > 0 && text[0] == ' '); text = text[1 .. $]; auto entries = splitter(text,' '); return map!toRecords(entries); } The issue with this is that the "+" character can be part of the quality information and I am using it to split the quality information from the sequence information. and ends up splitting the quality information which is wrong. Ideally I do not want to use regex and I have heard of ragel for parsing but never used it. Such a solution would also be welcome, since I read it can be very fast. Which is the idiomatic way to capture, sequence name (starts with character and the first entry) the sequence, (line2) the quality scores( line 4)
Mar 23 2016
As a little fun thing to do I implemented it for you. It won't allocate. Making this perfect for you. With a bit of work you could make Result have buffers for result instead of using the input array allow for the source to be an input range itself. I made this up on dpaste and single quotes were not playing nicely there. So you'll see "\r"[0] as a workaround. struct FastQRecord { const(char)[] sequenceId; const(char)[] sequenceLetters; const(char)[] quality; static auto parse(const(char)[] from) { struct Result { private { const(char)[] source; FastQRecord value; bool isEmpty; } this(const(char)[] source) { this.source = source; popFront; } property { FastQRecord front() { return value; } bool empty() { return isEmpty; } } void popFront() { import std.string : indexOf; if (source is null) { isEmpty = true; return; } void tidyInput() { foreach(i, c; source) { switch(c) { case 0: .. case ' ': break; default: source = source[i .. $]; return; } } source = null; } tidyInput(); if (source is null) return; // sequenceId assert(source[0] == ' '); ptrdiff_t len = source.indexOf("\n"); assert(len > 0); value.sequenceId = source[1 .. len]; if (value.sequenceId[$-1] == "\r"[0]) value.sequenceId = value.sequenceId[0 .. $-1]; source = source[len + 1 .. $]; // sequenceLetters len = source.indexOf("\n"); assert(len > 0); value.sequenceLetters = source[0 .. len]; if (value.sequenceLetters[$-1] == "\r"[0]) value.sequenceLetters = value.sequenceLetters[0 .. $-1]; source = source[len + 1 .. $]; // +sequenceId len = source.indexOf("\n"); assert(len > 0); source = source[len + 1 .. $]; // quality len = source.indexOf("\n"); assert(len > 0); value.quality = source[0 .. len]; if (value.quality[$-1] == "\r"[0]) value.quality = value.quality[0 .. $-1]; if (source.length > len + 1) { source = source[len + 1 .. $]; tidyInput(); } else source = null; } } return Result(from); } } void main() {} unittest { string input = """ seq1 TTATTTTAAT + ?+BBB/DHH seq2 GACCCTTTGCA + ?+BHB/DIH SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 """[1 .. $]; foreach(record; FastQRecord.parse(input)) { import std.stdio; writeln(record); } } --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus
Mar 23 2016
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:As a little fun thing to do I implemented it for you. It won't allocate. Making this perfect for you. With a bit of work you could make Result have buffers for result instead of using the input array allow for the source to be an input range itself. I made this up on dpaste and single quotes were not playing nicely there. So you'll see "\r"[0] as a workaround.Thank you very much. I think you have exposed me to a number of new concepts that I will go through and annotate the code with. I read all input from file as follows. string text = cast(string)std.file.read(inputfile); foreach(record;FastQRecord.parse(text)){ writeln(record); } </naivequestion>Does this mean that text is allocated to memory? and is there a better way to read and process the inputfile? </naivequestion>
Mar 24 2016
On 24/03/16 9:24 PM, eastanon wrote:On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:Yup, any string usage gets allocated in memory. Since the way I designed parse there, it won't allocate during returning of a record. Just don't go around modifying that memory or keeping copies (not critical but I wouldn't). There are better ways to read the input file. For example byChunk would read it in smaller groups of say 4096 chars. But you would need to overload that with support for e.g. lines parsing. The way I'd go is memory mapped files as an input range. After all, let the OS help you out with deciding when to put the file and parts of it into memory. After all, if you have 1gb files you want to parse. Could you just as easily have 2tb worth? Probably since it is DNA after all. You kinda don't want that all in memory at once if you know what I mean. --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirusAs a little fun thing to do I implemented it for you. It won't allocate. Making this perfect for you. With a bit of work you could make Result have buffers for result instead of using the input array allow for the source to be an input range itself. I made this up on dpaste and single quotes were not playing nicely there. So you'll see "\r"[0] as a workaround.Thank you very much. I think you have exposed me to a number of new concepts that I will go through and annotate the code with. I read all input from file as follows. string text = cast(string)std.file.read(inputfile); foreach(record;FastQRecord.parse(text)){ writeln(record); } </naivequestion>Does this mean that text is allocated to memory? and is there a better way to read and process the inputfile? </naivequestion>
Mar 24 2016
On Thursday, 24 March 2016 at 08:24:15 UTC, eastanon wrote:On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:Yes, it's read into your processes memory. You can use std.mmfile [1] to make things a bit more efficient. It will, too, read the data into memory, but it will do so in a way (memory mapping) that only loads what is actually accessed (everything in your case), and that allows the operating system to efficiently release and reload parts of it if memory runs low. Unfortunately there is no example in the documentation, but it works like this (untested): import std.mmfile; auto file = new MmFile(inputfile); string text = cast(string) file[]; ... [1] http://dlang.org/phobos/std_mmfile.htmlAs a little fun thing to do I implemented it for you. It won't allocate. Making this perfect for you. With a bit of work you could make Result have buffers for result instead of using the input array allow for the source to be an input range itself. I made this up on dpaste and single quotes were not playing nicely there. So you'll see "\r"[0] as a workaround.Thank you very much. I think you have exposed me to a number of new concepts that I will go through and annotate the code with. I read all input from file as follows. string text = cast(string)std.file.read(inputfile); foreach(record;FastQRecord.parse(text)){ writeln(record); } </naivequestion>Does this mean that text is allocated to memory? and is there a better way to read and process the inputfile? </naivequestion>
Mar 24 2016
On Thursday, 24 March 2016 at 13:38:32 UTC, Marc Schütz wrote:Yes, it's read into your processes memory. You can use std.mmfile [1] to make things a bit more efficient. It will, too, read the data into memory, but it will do so in a way (memory mapping) that only loads what is actually accessed (everything in your case), and that allows the operating system to efficiently release and reload parts of it if memory runs low. Unfortunately there is no example in the documentation, but it works like this (untested): import std.mmfile; auto file = new MmFile(inputfile); string text = cast(string) file[]; ... [1] http://dlang.org/phobos/std_mmfile.htmlThat is very clever. Thank you for the tip and I have implemented it and it works. I feel like it is a safer way of reading a file.
Mar 24 2016
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:void popFront() { import std.string : indexOf; if (source is null) { isEmpty = true; return; } void tidyInput() { foreach(i, c; source) { switch(c) { case 0: .. case ' ': break; default: source = source[i .. $]; return; } } source = null; } tidyInput();Do you mind to explain what is really going on at popFront() and tidyInput() functions?
Mar 29 2016
On 30/03/2016 1:46 AM, eastanon wrote:On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:popFront is a special function used with input ranges. It has support in the language for e.g. foreach loops. You also need empty and front to go along with it however. With front returning a value and empty if there is any more items. The if statement in popFront is just to make empty set correctly (delayed). Now tidyInput basically just trims (on the left) the source and sets it to null if there is no more of it. That way things like " Myid" will still work fine. If you need an explanation about the rest of popFront, let me know!void popFront() { import std.string : indexOf; if (source is null) { isEmpty = true; return; } void tidyInput() { foreach(i, c; source) { switch(c) { case 0: .. case ' ': break; default: source = source[i .. $]; return; } } source = null; } tidyInput();Do you mind to explain what is really going on at popFront() and tidyInput() functions?
Mar 29 2016