digitalmars.D.learn - Search a file skiping whitespace
- Willy Martinez (14/14) Jul 16 2011 Hello. I'm new to D but bear with me, please.
- Johann MacDonagh (17/18) Jul 16 2011 import std.algorithm;
- Johann MacDonagh (16/34) Jul 16 2011 This is actually better:
- Dmitry Olshansky (20/34) Jul 16 2011 If you wish to avoid storing all of this in an array by using e.g.
- Willy Martinez (31/50) Jul 16 2011 I don't mind storing it in memory. Each .txt file is around 20MB so the ...
- Johann MacDonagh (7/8) Jul 16 2011 I'm confused. Boyer-Moore is when you want to find two or more elements
- Johann MacDonagh (3/11) Jul 16 2011 Ugh, I'm dumb. I misread the documentation.
- Dmitry Olshansky (12/62) Jul 16 2011 Not exactly calling array but I perfectly understand why you have
- Johann MacDonagh (25/38) Jul 16 2011 Nope, that doesn't work.
- Dmitry Olshansky (14/59) Jul 16 2011 Actually with brute force you do O(mn) comparisons in general while
Hello. I'm new to D but bear with me, please. I have several files that look like this: 71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216 I'm trying to read those files and search for sequences of digits inside, hopefully with the Boyer-Moore implementation in std.algorithm. Right now I have made a small script that iterates over the .txt files in the current directory and reads line by line and uses find on it. But I haven't been able to write a range that removes the whitespace and can be used with find. It should generate one long stream of digits like: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 Any help? Thanks
Jul 16 2011
On 7/16/2011 3:05 PM, Willy Martinez wrote:Any help?import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!("!std.ascii.isWhite(a)")(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Outputs: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 true
Jul 16 2011
On 7/16/2011 3:56 PM, Johann MacDonagh wrote:On 7/16/2011 3:05 PM, Willy Martinez wrote:This is actually better: import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!( (c) { return isWhite(c); } )(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Otherwise we're relying on std.algorithm importing std.ascii.Any help?import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!("!std.ascii.isWhite(a)")(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Outputs: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 true
Jul 16 2011
On 16.07.2011 23:05, Willy Martinez wrote:Hello. I'm new to D but bear with me, please. I have several files that look like this: 71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216 I'm trying to read those files and search for sequences of digits inside, hopefully with the Boyer-Moore implementation in std.algorithm. Right now I have made a small script that iterates over the .txt files in the current directory and reads line by line and uses find on it. But I haven't been able to write a range that removes the whitespace and can be used with find. It should generate one long stream of digits like: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216If you wish to avoid storing all of this in an array by using e.g. filter _and_ use Boyer-Moore search on it then: No, you can't do that. The reason is that filter is ForwardRange with an important consequence that you can't look at arbitrary Nth element in O(1). And Boyer-Moore requires such and access to be anywhere efficient. Why doesn't filter not provide O(1) random access ? Because to get Nth element you'd need to check at least N (and potentially unlimited) number of elements before in case they get filtered out.Any help?If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... }Thanks-- Dmitry Olshansky
Jul 16 2011
== Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s articleIf you wish to avoid storing all of this in an array by using e.g. filter _and_ use Boyer-Moore search on it then: No, you can't do that. The reason is that filter is ForwardRange with an important consequence that you can't look at arbitrary Nth element in O(1). And Boyer-Moore requires such and access to be anywhere efficient. Why doesn't filter not provide O(1) random access ? Because to get Nth element you'd need to check at least N (and potentially unlimited) number of elements before in case they get filtered out.I don't mind storing it in memory. Each .txt file is around 20MB so the filtered string should be even smaller. Still, calling array gives this error: ..\..\src\phobos\std\algorithm.d(3252): Error: function std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string haystack) is not callable using argument types (dchar[]) ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (haystack) of type dchar[] to string ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13): Error: template instance std.algorithm.find!(dchar[],result,string) error instantiating From this code: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { auto needle = boyerMooreFinder(args[1]); foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = array(filter!("a >= '0' && a <= '9'")(text)); auto result = find(haystack, needle); writeln(result); } } } I'm using DMD 2.054 on Windows if that helpsAny help?If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... }Thanks
Jul 16 2011
On 7/16/2011 4:41 PM, Willy Martinez wrote:auto needle = boyerMooreFinder(args[1]);I'm confused. Boyer-Moore is when you want to find two or more elements efficiently in a range. When you create a BoyerMooreFinder you give it a range of elements it should search for. When you give it args[1] you are giving it a string which is immutable(char)[], or a range of characters. So if args[1] = "123" you're saying "search for 1, 2, or 3". Are you actually trying to search for multiple elements at the same time?
Jul 16 2011
On 7/16/2011 4:43 PM, Johann MacDonagh wrote:On 7/16/2011 4:41 PM, Willy Martinez wrote:Ugh, I'm dumb. I misread the documentation. Instead of defining "needle", just pass arg[1] into your call to find.auto needle = boyerMooreFinder(args[1]);I'm confused. Boyer-Moore is when you want to find two or more elements efficiently in a range. When you create a BoyerMooreFinder you give it a range of elements it should search for. When you give it args[1] you are giving it a string which is immutable(char)[], or a range of characters. So if args[1] = "123" you're saying "search for 1, 2, or 3". Are you actually trying to search for multiple elements at the same time?
Jul 16 2011
On 17.07.2011 0:41, Willy Martinez wrote:== Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s articleNot exactly calling array but I perfectly understand why you have confused it.If you wish to avoid storing all of this in an array by using e.g. filter _and_ use Boyer-Moore search on it then: No, you can't do that. The reason is that filter is ForwardRange with an important consequence that you can't look at arbitrary Nth element in O(1). And Boyer-Moore requires such and access to be anywhere efficient. Why doesn't filter not provide O(1) random access ? Because to get Nth element you'd need to check at least N (and potentially unlimited) number of elements before in case they get filtered out.I don't mind storing it in memory. Each .txt file is around 20MB so the filtered string should be even smaller. Still, calling array gives this error:Any help?If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... }Thanks..\..\src\phobos\std\algorithm.d(3252): Error: function std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string haystack) is not callable using argument types (dchar[]) ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (haystack) of type dchar[] to string ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13): Error: template instance std.algorithm.find!(dchar[],result,string) error instantiatingLet's drill down to the problem through this barrage of crap: the problem statement is Error: cannot implicitly convert expression (haystack) of type dchar[] to string So (apparently) the problem is that after array(filter!(... you get array of dchars (unicode codepoints)as a result of filtering string (which is UTF-8 under the hood btw) while you are going to search an UTF-8 string. And UTF-8 string is (once again) is not random accessible in sense of codepoints (it's needs an UTF decode though it's clearly not needed in your case). The simplest workaround I can think of is convert needle to dstring: auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.convFrom this code: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { auto needle = boyerMooreFinder(args[1]); foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = array(filter!("a>= '0'&& a<= '9'")(text)); auto result = find(haystack, needle); writeln(result); } } } I'm using DMD 2.054 on Windows if that helps-- Dmitry Olshansky
Jul 16 2011
On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:Let's drill down to the problem through this barrage of crap: the problem statement is Error: cannot implicitly convert expression (haystack) of type dchar[] to string So (apparently) the problem is that after array(filter!(... you get array of dchars (unicode codepoints)as a result of filtering string (which is UTF-8 under the hood btw) while you are going to search an UTF-8 string. And UTF-8 string is (once again) is not random accessible in sense of codepoints (it's needs an UTF decode though it's clearly not needed in your case). The simplest workaround I can think of is convert needle to dstring: auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.convNope, that doesn't work. Here's what he should do: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = filter!("a >= '0' && a <='9'")(text); auto result = find(haystack, args[1]); writeln(result); } } } If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore. Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not.
Jul 16 2011
On 17.07.2011 1:28, Johann MacDonagh wrote:On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:Actually with brute force you do O(mn) comparisons in general while there are Boyer-Moor implementations that will do O(n). Now with an alphabet of 0..9 I'd say you won't get a lot of speed up with Boyer-Moore, in any case I'd suggest OP to at least test both. In this case however my bet is on bruteforce (given an extra allocation on each string with B-M).Let's drill down to the problem through this barrage of crap: the problem statement is Error: cannot implicitly convert expression (haystack) of type dchar[] to string So (apparently) the problem is that after array(filter!(... you get array of dchars (unicode codepoints)as a result of filtering string (which is UTF-8 under the hood btw) while you are going to search an UTF-8 string. And UTF-8 string is (once again) is not random accessible in sense of codepoints (it's needs an UTF decode though it's clearly not needed in your case). The simplest workaround I can think of is convert needle to dstring: auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.convNope, that doesn't work. Here's what he should do: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = filter!("a >= '0' && a <='9'")(text); auto result = find(haystack, args[1]); writeln(result); } } } If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore.Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not.No, the problem of BoyerMoor search is there is some preparatory overhead to construct lookup table, apparently that table is stored in that BoyerMoorFinder. Frankly, std.algorithm might add at least one another sting searching algorithm e.g. http://monge.univ-mlv.fr/~lecroq/string/ -- Dmitry Olshansky
Jul 16 2011