www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Search a file skiping whitespace

reply Willy Martinez <wmartinez thisisnotmyrealemail.com> writes:
Hello. I'm new to D but bear with me, please.

I have several files that look like this:

71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216

I'm trying to read those files and search for sequences of digits inside,
hopefully with the Boyer-Moore implementation in std.algorithm.

Right now I have made a small script that iterates over the .txt files in the
current directory and reads line by line and uses find on it.

But I haven't been able to write a range that removes the whitespace and can
be used with find. It should generate one long stream of digits like:

711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216

Any help?

Thanks
Jul 16 2011
next sibling parent reply Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:
On 7/16/2011 3:05 PM, Willy Martinez wrote:
 Any help?
import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!("!std.ascii.isWhite(a)")(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Outputs: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 true
Jul 16 2011
parent Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:
On 7/16/2011 3:56 PM, Johann MacDonagh wrote:
 On 7/16/2011 3:05 PM, Willy Martinez wrote:
 Any help?
import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!("!std.ascii.isWhite(a)")(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Outputs: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 true
This is actually better: import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!( (c) { return isWhite(c); } )(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Otherwise we're relying on std.algorithm importing std.ascii.
Jul 16 2011
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 16.07.2011 23:05, Willy Martinez wrote:
 Hello. I'm new to D but bear with me, please.

 I have several files that look like this:

 71104 08924 72394 13995 49707 98696
 48245 08311 44066 67172 56025 07952
 00384 37808 90166 13871 94258 37216

 I'm trying to read those files and search for sequences of digits inside,
 hopefully with the Boyer-Moore implementation in std.algorithm.

 Right now I have made a small script that iterates over the .txt files in the
 current directory and reads line by line and uses find on it.

 But I haven't been able to write a range that removes the whitespace and can
 be used with find. It should generate one long stream of digits like:

 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216
If you wish to avoid storing all of this in an array by using e.g. filter _and_ use Boyer-Moore search on it then: No, you can't do that. The reason is that filter is ForwardRange with an important consequence that you can't look at arbitrary Nth element in O(1). And Boyer-Moore requires such and access to be anywhere efficient. Why doesn't filter not provide O(1) random access ? Because to get Nth element you'd need to check at least N (and potentially unlimited) number of elements before in case they get filtered out.
 Any help?
If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... }
 Thanks
-- Dmitry Olshansky
Jul 16 2011
parent reply Willy Martinez <wmartinez thisisnotmyrealemail.com> writes:
== Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s article
 If you wish to avoid storing all of this in an array by using e.g.
 filter _and_  use Boyer-Moore search on it then: No, you can't do that.
 The reason is that filter is ForwardRange with an important consequence
 that you can't look at arbitrary Nth element in O(1). And Boyer-Moore
 requires such and access to be anywhere efficient.
 Why doesn't filter not provide O(1) random access ? Because to get Nth
 element you'd need to check at least N (and potentially unlimited)
 number of elements before in case they get filtered out.
 Any help?
If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... }
 Thanks
I don't mind storing it in memory. Each .txt file is around 20MB so the filtered string should be even smaller. Still, calling array gives this error: ..\..\src\phobos\std\algorithm.d(3252): Error: function std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string haystack) is not callable using argument types (dchar[]) ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (haystack) of type dchar[] to string ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13): Error: template instance std.algorithm.find!(dchar[],result,string) error instantiating From this code: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { auto needle = boyerMooreFinder(args[1]); foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = array(filter!("a >= '0' && a <= '9'")(text)); auto result = find(haystack, needle); writeln(result); } } } I'm using DMD 2.054 on Windows if that helps
Jul 16 2011
next sibling parent reply Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:
On 7/16/2011 4:41 PM, Willy Martinez wrote:
 	auto needle = boyerMooreFinder(args[1]);
I'm confused. Boyer-Moore is when you want to find two or more elements efficiently in a range. When you create a BoyerMooreFinder you give it a range of elements it should search for. When you give it args[1] you are giving it a string which is immutable(char)[], or a range of characters. So if args[1] = "123" you're saying "search for 1, 2, or 3". Are you actually trying to search for multiple elements at the same time?
Jul 16 2011
parent Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:
On 7/16/2011 4:43 PM, Johann MacDonagh wrote:
 On 7/16/2011 4:41 PM, Willy Martinez wrote:
 auto needle = boyerMooreFinder(args[1]);
I'm confused. Boyer-Moore is when you want to find two or more elements efficiently in a range. When you create a BoyerMooreFinder you give it a range of elements it should search for. When you give it args[1] you are giving it a string which is immutable(char)[], or a range of characters. So if args[1] = "123" you're saying "search for 1, 2, or 3". Are you actually trying to search for multiple elements at the same time?
Ugh, I'm dumb. I misread the documentation. Instead of defining "needle", just pass arg[1] into your call to find.
Jul 16 2011
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 17.07.2011 0:41, Willy Martinez wrote:
 == Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s article
 If you wish to avoid storing all of this in an array by using e.g.
 filter _and_  use Boyer-Moore search on it then: No, you can't do that.
 The reason is that filter is ForwardRange with an important consequence
 that you can't look at arbitrary Nth element in O(1). And Boyer-Moore
 requires such and access to be anywhere efficient.
 Why doesn't filter not provide O(1) random access ? Because to get Nth
 element you'd need to check at least N (and potentially unlimited)
 number of elements before in case they get filtered out.
 Any help?
If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... }
 Thanks
I don't mind storing it in memory. Each .txt file is around 20MB so the filtered string should be even smaller. Still, calling array gives this error:
Not exactly calling array but I perfectly understand why you have confused it.
 ..\..\src\phobos\std\algorithm.d(3252): Error: function
 std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string
 haystack) is not callable using argument types (dchar[])
 ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
 expression (haystack) of type dchar[] to string
 ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
 expression (needle.beFound((__error))) of type string to dchar[]
search_seq.d(13):
 Error: template instance std.algorithm.find!(dchar[],result,string) error
 instantiating
Let's drill down to the problem through this barrage of crap: the problem statement is Error: cannot implicitly convert expression (haystack) of type dchar[] to string So (apparently) the problem is that after array(filter!(... you get array of dchars (unicode codepoints)as a result of filtering string (which is UTF-8 under the hood btw) while you are going to search an UTF-8 string. And UTF-8 string is (once again) is not random accessible in sense of codepoints (it's needs an UTF decode though it's clearly not needed in your case). The simplest workaround I can think of is convert needle to dstring: auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
  From this code:

 import std.algorithm;
 import std.array;
 import std.file;
 import std.stdio;

 void main(string[] args) {
 	auto needle = boyerMooreFinder(args[1]);
 	foreach (string name; dirEntries(".", SpanMode.shallow)) {
 		if (name[$-3 .. $] == "txt") {
 			writeln(name);
 			string text = readText(name);
 			auto haystack = array(filter!("a>= '0'&&  a<= '9'")(text));
 			auto result = find(haystack, needle);
 			writeln(result);
 		}
 	}
 }


 I'm using DMD 2.054 on Windows if that helps
-- Dmitry Olshansky
Jul 16 2011
parent reply Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:
On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
 Let's drill down to the problem through this barrage of crap:

 the problem statement is

 Error: cannot implicitly convert expression (haystack) of type dchar[]
 to string

 So (apparently) the problem is that after array(filter!(... you get
 array of dchars (unicode codepoints)as a result of filtering string
 (which is UTF-8 under the hood btw) while you are going to search an
 UTF-8 string.
 And UTF-8 string is (once again) is not random accessible in sense of
 codepoints (it's needs an UTF decode though it's clearly not needed in
 your case).
 The simplest workaround I can think of is convert needle to dstring:
 auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
Nope, that doesn't work. Here's what he should do: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = filter!("a >= '0' && a <='9'")(text); auto result = find(haystack, args[1]); writeln(result); } } } If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore. Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not.
Jul 16 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 17.07.2011 1:28, Johann MacDonagh wrote:
 On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
 Let's drill down to the problem through this barrage of crap:

 the problem statement is

 Error: cannot implicitly convert expression (haystack) of type dchar[]
 to string

 So (apparently) the problem is that after array(filter!(... you get
 array of dchars (unicode codepoints)as a result of filtering string
 (which is UTF-8 under the hood btw) while you are going to search an
 UTF-8 string.
 And UTF-8 string is (once again) is not random accessible in sense of
 codepoints (it's needs an UTF decode though it's clearly not needed in
 your case).
 The simplest workaround I can think of is convert needle to dstring:
 auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
Nope, that doesn't work. Here's what he should do: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = filter!("a >= '0' && a <='9'")(text); auto result = find(haystack, args[1]); writeln(result); } } } If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore.
Actually with brute force you do O(mn) comparisons in general while there are Boyer-Moor implementations that will do O(n). Now with an alphabet of 0..9 I'd say you won't get a lot of speed up with Boyer-Moore, in any case I'd suggest OP to at least test both. In this case however my bet is on bruteforce (given an extra allocation on each string with B-M).
 Correct me if I'm wrong, but I believe find() will do Boyer-Moore if 
 it knows its haystack is random access. This is the power of templated 
 programming. find! can generate an efficient find algorithm if it 
 knows the range is random access, or default to a less efficient one 
 if not.
No, the problem of BoyerMoor search is there is some preparatory overhead to construct lookup table, apparently that table is stored in that BoyerMoorFinder. Frankly, std.algorithm might add at least one another sting searching algorithm e.g. http://monge.univ-mlv.fr/~lecroq/string/ -- Dmitry Olshansky
Jul 16 2011