digitalmars.D.learn - Reading a file of words line by line
- mark (45/45) Jan 14 2020 As part of learning D I want to read a file that contains one
- mark (3/3) Jan 14 2020 Should I have closed the file, i.e.,:
- mipri (18/36) Jan 14 2020 One thing I picked up during Advent of Code last year was
- mark (17/17) Jan 15 2020 Thanks for the ideas, I've now reduced the size of the getWords()
- dwdv (12/13) Jan 15 2020 How about this?
- mark (14/14) Jan 15 2020 I really do need a set for the next part of the program, but
- H. S. Teoh (9/10) Jan 15 2020 The .length of a `string` type is the number of bytes that it occupies,
- Jesse Phillips (19/33) Jan 15 2020 Your solution is fine, but also
As part of learning D I want to read a file that contains one word per line (plus optional junk after the word) and create a set of all the unique words of a particular length (uppercased). D doesn't appear to have a set type, so I'm faking one using an associative array whose values are always 0.

I can't help feeling that the foreach loop's block is rather more verbose than it could be?

----
import std.stdio;

immutable WORDFILE = "/usr/share/hunspell/en_GB.dic";
immutable WORDSIZE = 4; // Should be even

alias WordSet = int[string]; // key = word; value = 0

void main() {
    import core.time;
    auto start = MonoTime.currTime;
    auto words = getWords(WORDFILE, WORDSIZE);
    // TODO
    writeln(words.length, " words");
    writeln(MonoTime.currTime - start);
}

WordSet getWords(string filename, int wordsize) {
    import std.conv;
    import std.regex;
    import std.uni;
    WordSet words;
    auto rx = ctRegex!(r"^[a-z]+", "i");
    auto file = File(filename);
    foreach (line; file.byLine) {
        auto match = matchFirst(line, rx);
        if (!match.empty()) {
            auto word = match.hit().to!string; // I hope this assumes UTF-8?
            if (word.length == wordsize) {
                words[word.toUpper] = 0;
            }
        }
    }
    return words;
}
----

PS I'm using ldc on Linux and think that rdmd is excellent. For lots of small Python programs I have, I'm wondering how many would be faster using D and rdmd (which I think caches binaries). Also, I've now got Mike Parker's "Learning D" on order.
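The associative-array-as-set idiom described above can be tried without the dictionary file. This is a minimal sketch, assuming the words arrive as an in-memory array; `toWordSet` and the sample words are hypothetical names for illustration, and `bool` is used as the dummy value type instead of `int` (either should work):

```d
import std.stdio : writeln;
import std.uni : toUpper;

// A "set" of strings faked with an associative array; only the keys matter.
alias WordSet = bool[string];

// Hypothetical helper: keeps the uppercased words whose length matches.
WordSet toWordSet(string[] candidates, size_t wordsize)
{
    WordSet words;
    foreach (word; candidates)
    {
        if (word.length == wordsize) // byte length; fine for ASCII-only words
            words[word.toUpper] = true;
    }
    return words;
}

void main()
{
    auto words = toWordSet(["word", "Word", "lines", "byte"], 4);
    assert(words.length == 2);   // "word" and "Word" collapse into one key
    assert("WORD" in words);     // membership test via the `in` operator
    assert("LINES" !in words);
    writeln(words.keys);         // key order is unspecified
}
```

Membership tests go through `in`, which returns a pointer to the value (or null), so the dummy value itself is never read.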
Jan 14 2020
Should I have closed the file, i.e.:

    auto file = File(filename);
    scope(exit) file.close(); // Add this?
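On the question above: the Phobos documentation describes `std.stdio.File` as reference-counted, closing the underlying handle once the last `File` referring to it goes out of scope, so the explicit `scope(exit)` is usually optional rather than required. A small sketch showing both behaviours, assuming a scratch file in the temp directory; `countLines` and the file name are hypothetical:

```d
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File;

// Hypothetical helper: counts lines, closing the file explicitly on exit.
size_t countLines(string path)
{
    auto file = File(path);
    scope (exit) file.close(); // explicit close; File would also clean up by itself
    size_t n;
    foreach (line; file.byLine)
        ++n;
    return n;
}

void main()
{
    // Hypothetical scratch file, removed again when main exits.
    auto path = buildPath(tempDir, "file_close_sketch.txt");
    write(path, "alpha\nbeta\ngamma\n");
    scope (exit) remove(path);

    assert(countLines(path) == 3);

    File outer;
    {
        auto inner = File(path);
        outer = inner;   // both variables now share the same open file
    }                    // inner goes out of scope; outer keeps the file open
    assert(outer.isOpen);
    outer.close();
    assert(!outer.isOpen);
}
```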
Jan 14 2020
On Tuesday, 14 January 2020 at 16:39:16 UTC, mark wrote:
> I can't help feeling that the foreach loop's block is rather more verbose than it could be?
>
>     WordSet words;
>     auto rx = ctRegex!(r"^[a-z]+", "i");
>     auto file = File(filename);
>     foreach (line; file.byLine) {
>         auto match = matchFirst(line, rx);
>         if (!match.empty()) {
>             auto word = match.hit().to!string; // I hope this assumes UTF-8?
>             if (word.length == wordsize) {
>                 words[word.toUpper] = 0;
>             }
>         }
>     }
>     return words;
> }
> ----

One thing I picked up during Advent of Code last year was std.file.slurp, which was great for reading 90% of the input files from that contest. With that, I'd do this more like

    int[string] words;
    slurp!string("input.txt", "%s").each!(w => words[w] = 0);

Where "%s" is what slurp() expects to find on each line, and 'string' is the type it returns from that. With just a list of words this isn't very interesting. Some of my uses from the contest are:

    auto input = slurp!(int, int, int)(args[1], "<x=%d, y=%d, z=%d>")
        .map!(p => Moon([p[0], p[1], p[2]])).array;

    Tuple!(string, string)[] input = slurp!(string, string)("input.txt", "%s)%s");

Of course if you want to validate the input as you're reading it, you still have to do extra work, but it could be in a .filter!
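To see what `slurp` hands back for a multi-field format, here is a self-contained sketch that writes a small scratch file and reads it with the `<x=%d, y=%d, z=%d>` format string from the post. The helper name `readPositions`, the file name, and the sample rows are assumptions made up for illustration:

```d
import std.file : remove, slurp, tempDir, write;
import std.path : buildPath;
import std.stdio : writeln;
import std.typecons : Tuple;

// Hypothetical helper: with several types, slurp returns one tuple per line.
Tuple!(int, int, int)[] readPositions(string path)
{
    return slurp!(int, int, int)(path, "<x=%d, y=%d, z=%d>");
}

void main()
{
    // Hypothetical scratch file, removed again when main exits.
    auto path = buildPath(tempDir, "slurp_sketch_input.txt");
    write(path, "<x=-1, y=0, z=2>\n<x=2, y=-10, z=-7>\n");
    scope (exit) remove(path);

    auto rows = readPositions(path);
    assert(rows.length == 2);
    assert(rows[0][0] == -1 && rows[0][2] == 2);
    assert(rows[1][1] == -10);
    writeln(rows);
}
```

With a single type, as in `slurp!string(path, "%s")`, the result is a plain array (`string[]`) rather than an array of tuples.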
Jan 14 2020
Thanks for the ideas, I've now reduced the size of the getWords() function (even allowing for moving the imports to the top of the file) to this:

    WordSet getWords(string filename, int wordsize) {
        string bareWord(string line) {
            auto rx = ctRegex!(r"^([a-z]+)", "i");
            auto match = matchFirst(line, rx);
            return match.empty ? "" : match.hit.to!string;
        }
        WordSet words;
        slurp!string(filename, "%s")
            .map!(line => bareWord(line))
            .filter!(word => word.length == wordsize)
            .each!(word => words[word.toUpper] = 0);
        return words;
    }

Is this as compact as it _reasonably_ can be?
Jan 15 2020
On 2020-01-15 16:34, mark via Digitalmars-d-learn wrote:
> Is this as compact as it _reasonably_ can be?

How about this?

    auto uniqueWords(string filename, uint wordsize) {
        import std.algorithm, std.array, std.conv, std.functional, std.uni;
        return File(filename).byLine
            .map!(line => line.until!(not!isAlpha))
            .filter!(word => word.count == wordsize)
            .map!(word => word.to!string.toUpper)
            .array
            .sort
            .uniq;
    }
Jan 15 2020
I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:

    WordSet getWords(string filename, int wordsize) {
        WordSet words;
        File(filename).byLine
            .map!(line => line.until!(not!isAlpha))
            .filter!(word => word.count == wordsize)
            .each!(word => words[word.to!string.toUpper] = 0);
        return words;
    }

This is also 4x faster than my version that used a regex -- thanks!

Why did you use string.count rather than string.length?
Jan 15 2020
On Wed, Jan 15, 2020 at 07:50:31PM +0000, mark via Digitalmars-d-learn wrote:
[...]
> Why did you use string.count rather than string.length?

The .length of a `string` type is the number of bytes that it occupies, which is not necessarily the same thing as the number of characters in the string. E.g., if you receive a Unicode string, there may be multi-byte characters in it.

T

--
A computer doesn't mind if its programs are put to purposes that don't match their names. -- D. Knuth
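The difference between `.length` and `.count` is easy to check on a word with one non-ASCII letter. A minimal sketch; `byteLength` and `codePointCount` are hypothetical wrappers added only to label the two measurements. Note that `count` here counts decoded code points, which is closer to "characters" than bytes but still not the same as user-perceived graphemes. It also appears that the lazy range returned by `until` has no `.length` at all, which would be a second reason for `count` in the pipeline.

```d
import std.algorithm : count;
import std.range : walkLength;
import std.stdio : writefln;

// Hypothetical helper: number of UTF-8 code units (bytes) in the string.
size_t byteLength(string s)
{
    return s.length;
}

// Hypothetical helper: number of code points, found by decoding the string.
size_t codePointCount(string s)
{
    return s.count;
}

void main()
{
    string plain = "word";
    string accented = "caf\u00E9"; // "café": the é takes two bytes in UTF-8

    assert(byteLength(plain) == 4 && codePointCount(plain) == 4);
    assert(byteLength(accented) == 5);
    assert(codePointCount(accented) == 4);
    assert(accented.walkLength == 4); // count without a predicate walks the range

    writefln("%s: %s bytes, %s code points",
        accented, byteLength(accented), codePointCount(accented));
}
```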
Jan 15 2020
On Wednesday, 15 January 2020 at 19:50:31 UTC, mark wrote:
> I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:
>
>     WordSet getWords(string filename, int wordsize) {
>         WordSet words;
>         File(filename).byLine
>             .map!(line => line.until!(not!isAlpha))
>             .filter!(word => word.count == wordsize)
>             .each!(word => words[word.to!string.toUpper] = 0);
>         return words;
>     }
>
> This is also 4x faster than my version that used a regex -- thanks!
>
> Why did you use string.count rather than string.length?

Your solution is fine, but also

    import std.stdio;

    void main() {
        auto file = ["word one", "my word", "word"];
        writeln(uniqueWords(file, 4));
    }

    auto uniqueWords(string[] file, uint wordsize) {
        // std.typecons provides tuple
        import std.algorithm, std.array, std.conv, std.functional, std.typecons, std.uni;
        return file
            .map!(line => line.until!(not!isAlpha))
            .filter!(word => word.count == wordsize)
            .map!(word => word.to!string.toUpper)
            .array
            .sort
            .uniq
            .map!(x => tuple(x, 0))
            .assocArray;
    }
Jan 15 2020
On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
> [...]
>     .map!(word => word.to!string.toUpper)
>     .array
>     .sort
>     .uniq
>     .map!(x => tuple (x, 0))
>     .assocArray ;

.each!(word => words[word.to!string.toUpper] = 0); isn't far off, but could also be (sans imports):

    return File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .assocArray(0.repeat);
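For reference, a complete version of the pipeline above with the imports spelled out. It is a sketch under one assumption: the lines come from an in-memory array rather than `File(filename).byLine`, so that it runs without the dictionary file. The sample lines are made up.

```d
import std.algorithm : count, filter, map, until;
import std.array : assocArray;
import std.conv : to;
import std.functional : not;
import std.range : repeat;
import std.stdio : writeln;
import std.uni : isAlpha, toUpper;

// Same pipeline as in the post, fed from an array instead of a file.
int[string] uniqueWords(string[] lines, size_t wordsize)
{
    return lines
        .map!(line => line.until!(not!isAlpha))   // keep the leading alphabetic run
        .filter!(word => word.count == wordsize)  // length measured in code points
        .map!(word => word.to!string.toUpper)     // materialise and uppercase
        .assocArray(0.repeat);                    // pair every key with the value 0
}

void main()
{
    auto words = uniqueWords(["word/S", "Word", "byte 123", "words", "a"], 4);
    assert(words.length == 2);                  // duplicates collapse in the AA
    assert("WORD" in words && "BYTE" in words);
    writeln(words);                             // key order is unspecified
}
```

Because the associative array deduplicates keys on insertion, the `.array.sort.uniq` steps from the earlier variant are not needed here.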
Jan 16 2020
On Thursday, 16 January 2020 at 10:10:02 UTC, dwdv wrote:
> On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
>> [...]
> [...]
> isn't far off, but could also be (sans imports):
>
>     return File(filename).byLine
>         .map!(line => line.until!(not!isAlpha))
>         .filter!(word => word.count == wordsize)
>         .map!(word => word.to!string.toUpper)
>         .assocArray(0.repeat);

That's what I'm now using -- thanks! (Now I can try the next bit.)
Jan 16 2020