digitalmars.D.learn - Search a file skiping whitespace

Willy Martinez (14/14) Jul 16 2011 Hello. I'm new to D but bear with me, please.

Johann MacDonagh (17/18) Jul 16 2011 import std.algorithm;

Johann MacDonagh (16/34) Jul 16 2011 This is actually better:

Dmitry Olshansky (20/34) Jul 16 2011 If you wish to avoid storing all of this in an array by using e.g.

Willy Martinez (31/50) Jul 16 2011 I don't mind storing it in memory. Each .txt file is around 20MB so the ...

Johann MacDonagh (7/8) Jul 16 2011 I'm confused. Boyer-Moore is when you want to find two or more elements

Johann MacDonagh (3/11) Jul 16 2011 Ugh, I'm dumb. I misread the documentation.

Dmitry Olshansky (12/62) Jul 16 2011 Not exactly calling array but I perfectly understand why you have

Johann MacDonagh (25/38) Jul 16 2011 Nope, that doesn't work.

Dmitry Olshansky (14/59) Jul 16 2011 Actually with brute force you do O(mn) comparisons in general while

Willy Martinez <wmartinez thisisnotmyrealemail.com> writes:

Hello. I'm new to D but bear with me, please.

I have several files that look like this:

71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216

I'm trying to read those files and search for sequences of digits inside,
hopefully with the Boyer-Moore implementation in std.algorithm.

Right now I have made a small script that iterates over the .txt files in the
current directory and reads line by line and uses find on it.

But I haven't been able to write a range that removes the whitespace and can
be used with find. It should generate one long stream of digits like:

711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216

Any help?

Thanks

Jul 16 2011

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

On 7/16/2011 3:05 PM, Willy Martinez wrote:
 Any help?

import std.algorithm;
import std.ascii;
import std.stdio;

void main(string[] argv)
{
     auto s = "71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216";

     auto t = filter!("!std.ascii.isWhite(a)")(s);

     bool found = canFind(t, "408");

     writeln(t);
     writeln(found);
}

Outputs:

711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216
true

Jul 16 2011

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

On 7/16/2011 3:56 PM, Johann MacDonagh wrote:
 On 7/16/2011 3:05 PM, Willy Martinez wrote:
 Any help?

 import std.algorithm;
 import std.ascii;
 import std.stdio;

 void main(string[] argv)
 {
 auto s = "71104 08924 72394 13995 49707 98696
 48245 08311 44066 67172 56025 07952
 00384 37808 90166 13871 94258 37216";

 auto t = filter!("!std.ascii.isWhite(a)")(s);

 bool found = canFind(t, "408");

 writeln(t);
 writeln(found);
 }

 Outputs:

 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216

 true

This is actually better:

import std.algorithm;
import std.ascii;
import std.stdio;

void main(string[] argv)
{
     auto s = "71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216";

     auto t = filter!( (c) { return isWhite(c); } )(s);

     bool found = canFind(t, "408");

     writeln(t);
     writeln(found);
}

Otherwise we're relying on std.algorithm importing std.ascii.

Jul 16 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 16.07.2011 23:05, Willy Martinez wrote:
 Hello. I'm new to D but bear with me, please.

 I have several files that look like this:

 71104 08924 72394 13995 49707 98696
 48245 08311 44066 67172 56025 07952
 00384 37808 90166 13871 94258 37216

 I'm trying to read those files and search for sequences of digits inside,
 hopefully with the Boyer-Moore implementation in std.algorithm.

 Right now I have made a small script that iterates over the .txt files in the
 current directory and reads line by line and uses find on it.

 But I haven't been able to write a range that removes the whitespace and can
 be used with find. It should generate one long stream of digits like:

 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216

If you wish to avoid storing all of this in an array by using e.g. 
filter _and_  use Boyer-Moore search on it then: No, you can't do that. 
The reason is that filter is ForwardRange with an important consequence 
that you can't look at arbitrary Nth element in O(1). And Boyer-Moore 
requires such and access to be anywhere efficient.
Why doesn't filter not provide O(1) random access ? Because to get Nth 
element you'd need to check at least N (and potentially unlimited)  
number of elements before in case they get filtered out.


 Any help?

If I'd had this sort of problem I'd use something along the lines:

auto file = File("yourfile");
foreach( line; file.ByLine)
{
     auto onlyDigitis = array(filter!((x){   return !isWhite(x); 
})(line)); // this copies all digits to a new array
     auto result = find(onlyDigits, ... ); //your query here
     ///....
}

 Thanks


-- 
Dmitry Olshansky

Jul 16 2011

Willy Martinez <wmartinez thisisnotmyrealemail.com> writes:

== Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s article
 If you wish to avoid storing all of this in an array by using e.g.
 filter _and_  use Boyer-Moore search on it then: No, you can't do that.
 The reason is that filter is ForwardRange with an important consequence
 that you can't look at arbitrary Nth element in O(1). And Boyer-Moore
 requires such and access to be anywhere efficient.
 Why doesn't filter not provide O(1) random access ? Because to get Nth
 element you'd need to check at least N (and potentially unlimited)
 number of elements before in case they get filtered out.
 Any help?

 If I'd had this sort of problem I'd use something along the lines:
 auto file = File("yourfile");
 foreach( line; file.ByLine)
 {
      auto onlyDigitis = array(filter!((x){   return !isWhite(x);
 })(line)); // this copies all digits to a new array
      auto result = find(onlyDigits, ... ); //your query here
      ///....
 }
 Thanks


I don't mind storing it in memory. Each .txt file is around 20MB so the filtered
string should be even smaller.

Still, calling array gives this error:

..\..\src\phobos\std\algorithm.d(3252): Error: function
std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string
haystack) is not callable using argument types (dchar[])
..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
expression (haystack) of type dchar[] to string
..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
expression (needle.beFound((__error))) of type string to dchar[]
search_seq.d(13):
Error: template instance std.algorithm.find!(dchar[],result,string) error
instantiating



From this code:

import std.algorithm;
import std.array;
import std.file;
import std.stdio;

void main(string[] args) {
	auto needle = boyerMooreFinder(args[1]);
	foreach (string name; dirEntries(".", SpanMode.shallow)) {
		if (name[$-3 .. $] == "txt") {
			writeln(name);
			string text = readText(name);
			auto haystack = array(filter!("a >= '0' && a <= '9'")(text));
			auto result = find(haystack, needle);
			writeln(result);
		}
	}
}


I'm using DMD 2.054 on Windows if that helps

Jul 16 2011

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

On 7/16/2011 4:41 PM, Willy Martinez wrote:
 	auto needle = boyerMooreFinder(args[1]);

I'm confused. Boyer-Moore is when you want to find two or more elements 
efficiently in a range. When you create a BoyerMooreFinder you give it a 
range of elements it should search for. When you give it args[1] you are 
giving it a string which is immutable(char)[], or a range of characters. 
So if args[1] = "123" you're saying "search for 1, 2, or 3".

Are you actually trying to search for multiple elements at the same time?

Jul 16 2011

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

On 7/16/2011 4:43 PM, Johann MacDonagh wrote:
 On 7/16/2011 4:41 PM, Willy Martinez wrote:
 auto needle = boyerMooreFinder(args[1]);

 I'm confused. Boyer-Moore is when you want to find two or more elements
 efficiently in a range. When you create a BoyerMooreFinder you give it a
 range of elements it should search for. When you give it args[1] you are
 giving it a string which is immutable(char)[], or a range of characters.
 So if args[1] = "123" you're saying "search for 1, 2, or 3".

 Are you actually trying to search for multiple elements at the same time?

Ugh, I'm dumb. I misread the documentation.

Instead of defining "needle", just pass arg[1] into your call to find.

Jul 16 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 17.07.2011 0:41, Willy Martinez wrote:
 == Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s article
 If you wish to avoid storing all of this in an array by using e.g.
 filter _and_  use Boyer-Moore search on it then: No, you can't do that.
 The reason is that filter is ForwardRange with an important consequence
 that you can't look at arbitrary Nth element in O(1). And Boyer-Moore
 requires such and access to be anywhere efficient.
 Why doesn't filter not provide O(1) random access ? Because to get Nth
 element you'd need to check at least N (and potentially unlimited)
 number of elements before in case they get filtered out.
 Any help?

 If I'd had this sort of problem I'd use something along the lines:
 auto file = File("yourfile");
 foreach( line; file.ByLine)
 {
       auto onlyDigitis = array(filter!((x){   return !isWhite(x);
 })(line)); // this copies all digits to a new array
       auto result = find(onlyDigits, ... ); //your query here
       ///....
 }
 Thanks


 I don't mind storing it in memory. Each .txt file is around 20MB so the
filtered
 string should be even smaller.

 Still, calling array gives this error:

Not exactly calling array but I perfectly understand why you have 
confused it.

 ..\..\src\phobos\std\algorithm.d(3252): Error: function
 std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string
 haystack) is not callable using argument types (dchar[])
 ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
 expression (haystack) of type dchar[] to string
 ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
 expression (needle.beFound((__error))) of type string to dchar[]
search_seq.d(13):
 Error: template instance std.algorithm.find!(dchar[],result,string) error
 instantiating

Let's drill down to the problem through this barrage of crap:

the problem statement is

  Error: cannot implicitly convert expression (haystack) of type dchar[] to
string

So (apparently) the problem is that after array(filter!(... you get array of
dchars (unicode codepoints)as a result of filtering string (which is UTF-8
under the hood btw) while you are going to search an UTF-8 string.
And UTF-8 string is (once again) is not random  accessible in sense of
codepoints (it's needs an UTF decode though it's clearly not needed in your
case).
The simplest workaround I can think of is convert needle to dstring:
auto needle =  boyerMooreFinder(to!dstring(args[1])); //found in std.conv


  From this code:

 import std.algorithm;
 import std.array;
 import std.file;
 import std.stdio;

 void main(string[] args) {
 	auto needle = boyerMooreFinder(args[1]);
 	foreach (string name; dirEntries(".", SpanMode.shallow)) {
 		if (name[$-3 .. $] == "txt") {
 			writeln(name);
 			string text = readText(name);
 			auto haystack = array(filter!("a>= '0'&&  a<= '9'")(text));
 			auto result = find(haystack, needle);
 			writeln(result);
 		}
 	}
 }


 I'm using DMD 2.054 on Windows if that helps


-- 
Dmitry Olshansky

Jul 16 2011

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
 Let's drill down to the problem through this barrage of crap:

 the problem statement is

 Error: cannot implicitly convert expression (haystack) of type dchar[]
 to string

 So (apparently) the problem is that after array(filter!(... you get
 array of dchars (unicode codepoints)as a result of filtering string
 (which is UTF-8 under the hood btw) while you are going to search an
 UTF-8 string.
 And UTF-8 string is (once again) is not random accessible in sense of
 codepoints (it's needs an UTF decode though it's clearly not needed in
 your case).
 The simplest workaround I can think of is convert needle to dstring:
 auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv

Nope, that doesn't work.

Here's what he should do:

import std.algorithm;
import std.array;
import std.file;
import std.stdio;

void main(string[] args) {
	foreach (string name; dirEntries(".", SpanMode.shallow)) {
		if (name[$-3 .. $] == "txt") {
			writeln(name);
			string text = readText(name);
			auto haystack = filter!("a >= '0' && a <='9'")(text);
			auto result = find(haystack, args[1]);
			writeln(result);
		}
	}
}

If you call array() on the filter you're already doing O(n) operations, 
meaning you're not going to get any performance benefit by being able to 
do Boyer-Moore.

Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it 
knows its haystack is random access. This is the power of templated 
programming. find! can generate an efficient find algorithm if it knows 
the range is random access, or default to a less efficient one if not.

Jul 16 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 17.07.2011 1:28, Johann MacDonagh wrote:
 On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
 Let's drill down to the problem through this barrage of crap:

 the problem statement is

 Error: cannot implicitly convert expression (haystack) of type dchar[]
 to string

 So (apparently) the problem is that after array(filter!(... you get
 array of dchars (unicode codepoints)as a result of filtering string
 (which is UTF-8 under the hood btw) while you are going to search an
 UTF-8 string.
 And UTF-8 string is (once again) is not random accessible in sense of
 codepoints (it's needs an UTF decode though it's clearly not needed in
 your case).
 The simplest workaround I can think of is convert needle to dstring:
 auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv

 Nope, that doesn't work.

 Here's what he should do:

 import std.algorithm;
 import std.array;
 import std.file;
 import std.stdio;

 void main(string[] args) {
     foreach (string name; dirEntries(".", SpanMode.shallow)) {
         if (name[$-3 .. $] == "txt") {
             writeln(name);
             string text = readText(name);
             auto haystack = filter!("a >= '0' && a <='9'")(text);
             auto result = find(haystack, args[1]);
             writeln(result);
         }
     }
 }

 If you call array() on the filter you're already doing O(n) 
 operations, meaning you're not going to get any performance benefit by 
 being able to do Boyer-Moore.

Actually with brute force you do O(mn) comparisons in general while 
there are Boyer-Moor implementations that will do O(n). Now with an 
alphabet of 0..9 I'd say you won't get a lot of speed up with 
Boyer-Moore, in any case I'd suggest OP to at least test both. In this 
case  however my bet is on  bruteforce (given an extra allocation on 
each string with B-M).

 Correct me if I'm wrong, but I believe find() will do Boyer-Moore if 
 it knows its haystack is random access. This is the power of templated 
 programming. find! can generate an efficient find algorithm if it 
 knows the range is random access, or default to a less efficient one 
 if not.

No, the problem of BoyerMoor search is there is some preparatory 
overhead to construct lookup table, apparently that table is stored in 
that BoyerMoorFinder. Frankly, std.algorithm might add at least one 
another sting searching algorithm e.g.
http://monge.univ-mlv.fr/~lecroq/string/

-- 
Dmitry Olshansky

Jul 16 2011

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Search a file skiping whitespace