www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Which option is faster...

reply "jicman" <cabrera__SP M__wrc.xerox.com> writes:
Greetings!

I have this code,

foreach (...)
{

   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
}

foreach (...)
{
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
}

thanks.

josé
Aug 05 2013
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 08/05/2013 03:59 PM, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

    if (std.string.tolower(fext[0]) == "doc" ||
      std.string.tolower(fext[0]) == "docx" ||
      std.string.tolower(fext[0]) == "xls" ||
      std.string.tolower(fext[0]) == "xlsx" ||
      std.string.tolower(fext[0]) == "ppt" ||
      std.string.tolower(fext[0]) == "pptx")
     continue;
 }

 foreach (...)
 {
    if (std.string.tolower(fext[0]) == "doc")
      continue;
    if (std.string.tolower(fext[0]) == "docx")
      continue;
    if (std.string.tolower(fext[0]) == "xls")
      continue;
    if (std.string.tolower(fext[0]) == "xlsx")
      continue;
    if (std.string.tolower(fext[0]) == "ppt")
      continue;
    if (std.string.tolower(fext[0]) == "pptx")
     continue;
    ...
    ...
 }

 thanks.

 josé
They are both equally slow.
Aug 05 2013
parent reply "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 14:13:33 UTC, Timon Gehr wrote:
 On 08/05/2013 03:59 PM, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }

 foreach (...)
 {
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
 }

 thanks.

 josé
They are both equally slow.
How would you make it faster in D1?
Aug 05 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
jicman:

 How would you make it faster in D1?
Compute std.string.tolower(fext[0]) and put it in a temporary variable. And then compare that variable with all your string literals. In most cases that's fast enough. If it's not enough, you could create a little finite state machine that represents your directed acyclic word graph, and uses gotos to jump around states. The amount of strings is small, so perhaps there are not enough code cache misses to nullify this optimization. Bye, bearophile
Aug 05 2013
parent "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 14:50:27 UTC, bearophile wrote:
 jicman:

 How would you make it faster in D1?
Compute std.string.tolower(fext[0]) and put it in a temporary variable. And then compare that variable with all your string literals. In most cases that's fast enough. If it's not enough, you could create a little finite state machine that represents your directed acyclic word graph, and uses gotos to jump around states. The amount of strings is small, so perhaps there are not enough code cache misses to nullify this optimization. Bye, bearophile
Thanks. This is great.
Aug 05 2013
prev sibling next sibling parent "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,
First option...
 foreach (...)
 {

   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }
Second option...
 foreach (...)
 {
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
 }

 thanks.

 josé
So, after I saw this post I asked myself, what? So, the question is: which of the two foreach loops options are faster: 1. The concatenated if || 2. The single if I am trying to see if it matters. I have a project with lots of files and if one is faster, then, it will matter to code it the faster way. Thanks.
Aug 05 2013
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
did you benchmarked your current szenario - how do you know that this is 
the slow part - or are you working on an only-extension-compare-tool?

btw: they are both equal and slow - and full of partly code-duplication
std.string.tolower(fext[0]) multiple times, i hope your list isn't going 
much longer

Am 05.08.2013 15:59, schrieb jicman:
 Greetings!

 I have this code,

 foreach (...)
 {

     if (std.string.tolower(fext[0]) == "doc" ||
       std.string.tolower(fext[0]) == "docx" ||
       std.string.tolower(fext[0]) == "xls" ||
       std.string.tolower(fext[0]) == "xlsx" ||
       std.string.tolower(fext[0]) == "ppt" ||
       std.string.tolower(fext[0]) == "pptx")
      continue;
 }

 foreach (...)
 {
     if (std.string.tolower(fext[0]) == "doc")
       continue;
     if (std.string.tolower(fext[0]) == "docx")
       continue;
     if (std.string.tolower(fext[0]) == "xls")
       continue;
     if (std.string.tolower(fext[0]) == "xlsx")
       continue;
     if (std.string.tolower(fext[0]) == "ppt")
       continue;
     if (std.string.tolower(fext[0]) == "pptx")
      continue;
     ...
     ...
 }

 thanks.

 josé
Aug 05 2013
parent reply "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 14:27:43 UTC, dennis luehring wrote:
 did you benchmarked your current szenario - how do you know 
 that this is the slow part - or are you working on an 
 only-extension-compare-tool?

 btw: they are both equal and slow - and full of partly 
 code-duplication
 std.string.tolower(fext[0]) multiple times, i hope your list 
 isn't going much longer
Ok, how would you make it faster?
Aug 05 2013
parent reply dennis luehring <dl.soluz gmx.net> writes:
 Ok, how would you make it faster?
i don't see a better solution here - how to reduce ONE lowercase and SOME compares in any way? (i dont think a hash or something will help) but i know that anything like your continue-party is worth nothing (feels a little bit like script-kiddies "do it with assembler that would it make million times faster" blabla) question: is this the slow part in your project? do you know it for sure or just an emotion - HOW do you benchmark? Am 05.08.2013 16:31, schrieb jicman:
 On Monday, 5 August 2013 at 14:27:43 UTC, dennis luehring wrote:
 did you benchmarked your current szenario - how do you know
 that this is the slow part - or are you working on an
 only-extension-compare-tool?

 btw: they are both equal and slow - and full of partly
 code-duplication
 std.string.tolower(fext[0]) multiple times, i hope your list
 isn't going much longer
Ok, how would you make it faster?
Aug 05 2013
next sibling parent reply "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 14:47:37 UTC, dennis luehring wrote:
 Ok, how would you make it faster?
i don't see a better solution here - how to reduce ONE lowercase and SOME compares in any way? (i dont think a hash or something will help) but i know that anything like your continue-party is worth nothing (feels a little bit like script-kiddies "do it with assembler that would it make million times faster" blabla) question: is this the slow part in your project? do you know it for sure or just an emotion - HOW do you benchmark? Am 05.08.2013 16:31, schrieb jicman:
 On Monday, 5 August 2013 at 14:27:43 UTC, dennis luehring 
 wrote:
 did you benchmarked your current szenario - how do you know
 that this is the slow part - or are you working on an
 only-extension-compare-tool?

 btw: they are both equal and slow - and full of partly
 code-duplication
 std.string.tolower(fext[0]) multiple times, i hope your list
 isn't going much longer
Ok, how would you make it faster?
It is a tool that was a script, but I have turned it into do, which now has taken two hours from the last jscript script. I have not benchmarked it, yet. I may. But I see that a great idea has been provided, which I will use. Thanks for the help.
Aug 05 2013
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 05.08.2013 17:18, schrieb jicman:
 It is a tool that was a script, but I have turned it into do,
 which now has taken two hours from the last jscript script.  I
 have not benchmarked it, yet.  I may.  But I see that a great
 idea has been provided, which I will use.  Thanks for the help.
 have not benchmarked it, yet.  I may.
so its totaly unclear if the presented code is your 2h monster, what was the runtime of your jscript?
 But I see that a great idea has been provided
using a local variable for not lowercasing on each if is not an great idea it is default programming style and i don't think you're be able to implement the tree statemachine when doing such simple performance killer like multiple lowercase calls, and try to help youselfe by introducing "continue"...
Aug 05 2013
parent reply "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 15:27:09 UTC, dennis luehring wrote:
 Am 05.08.2013 17:18, schrieb jicman:
 It is a tool that was a script, but I have turned it into do,
 which now has taken two hours from the last jscript script.  I
 have not benchmarked it, yet.  I may.  But I see that a great
 idea has been provided, which I will use.  Thanks for the help.
 have not benchmarked it, yet.  I may.
so its totaly unclear if the presented code is your 2h monster, what was the runtime of your jscript?
The files are in a network drive, so, that has some slowness already involved because of that. The jscript use to take over 8 hours. The new D program has dropped that to under less than six. This is huge to us. But, I know that I can probably fine tune the program to make it a few minutes less. :-)
 But I see that a great idea has been provided
using a local variable for not lowercasing on each if is not an great idea it is default programming style
This will help. If there are 100K files, which I know that there are more than that, it will help a little bit.
 and i don't think you're be able to implement the tree 
 statemachine
 when doing such simple performance killer like multiple 
 lowercase calls, and try to help youselfe by introducing 
 "continue"...
Perhaps, but it's a good idea, nonetheless.
Aug 05 2013
next sibling parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 08/05/2013 10:04 AM, jicman wrote:

 The files are in a network drive, so, that has some slowness already
 involved because of that.
Which may dominate the running time.
 The jscript use to take over 8 hours.  The new D program has
 dropped that to under less than six.
And that is proof that a sizable amount of time is spent during computations. :)
 But, I know that I can probably fine tune the program to make
 it a few minutes less. :-)
You really should profile where the time is spent. If file I/O is still a big issue, there are WAN optimization products that may solve that problem amazingly efficiently: http://en.wikipedia.org/wiki/WAN_optimization Ali
Aug 05 2013
prev sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 05.08.2013 19:04, schrieb jicman:
 so its totaly unclear if the presented code is your 2h monster,
 what was
 the runtime of your jscript?
The files are in a network drive, so, that has some slowness already involved because of that. The jscript use to take over 8 hours. The new D program has dropped that to under less than six. This is huge to us. But, I know that I can probably fine tune the program to make it a few minutes less. :-)
can you describe more detailed what are you doing - are you also reading these files? why not run your tool on the server and only collect the results over network (wouldn't that be much faster)? what is the impact of using the networkdrive? just copy your biggeste szenario on a local machine an run it against this to get a feeling how painfull the networkdrive communication is
 But I see that a great idea has been provided
using a local variable for not lowercasing on each if is not an great idea it is default programming style
This will help. If there are 100K files, which I know that there are more than that, it will help a little bit.
 and i don't think you're be able to implement the tree
 statemachine
 when doing such simple performance killer like multiple
 lowercase calls, and try to help youselfe by introducing
 "continue"...
Perhaps, but it's a good idea, nonetheless.
maybe your program can be better optimized globaly, but we need more information of what is the programing doing - can you give some psuedo code like: 1. read all filenames recursivley -> 100k filenames 2. reduce down to known extensions -> 10k filenames 3. ... 4. ... 5. ... i don't think that your lower-case is the major part of the 6h, there must be other very slow parts in your project - which you don't find without putting in some bechmarking (and i don't understand why you don't start with benchmarking)
Aug 05 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Aug 05, 2013 at 04:47:36PM +0200, dennis luehring wrote:
 Ok, how would you make it faster?
i don't see a better solution here - how to reduce ONE lowercase and SOME compares in any way? (i dont think a hash or something will help) but i know that anything like your continue-party is worth nothing (feels a little bit like script-kiddies "do it with assembler that would it make million times faster" blabla)
If you really want optimal performance, use std.regex: import std.regex; auto reExtMatch = ctRegex!(`doc|docx|exe|...`, "i"); foreach (...) { if (fext[0].match(reExtMatch)) continue; } The regex is set up to ignore case (the "i" flag), and is compiled at compile-time to generate optimal matching code for the given list of extensions. To get any faster than this, you'll have to hand-optimize or write in assembly. :) T -- Computerese Irregular Verb Conjugation: I have preferences. You have biases. He/She has prejudices. -- Gene Wirchenko
Aug 05 2013
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/5/13, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:
 If you really want optimal performance, use std.regex:
Yes and if you also want to bloat your executable to the point that other parts of the system start slowing down.
Aug 06 2013
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }

 foreach (...)
 {
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
 }

 thanks.

 josé
better: foreach (...) { auto tmp = std.string.tolower(fext[0]); if(tmp == "doc" || tmp == "docx" || tmp == "xls" || tmp == "xlsx" || tmp == "ppt" || tmp == "pptx") { continue; } } but still not super-fast as (unless the compiler is very clever) it still means multiple passes over tmp. Also, it converts the whole string to lower case even when it's not necessary. If you have large numbers of possible matches you will probably want to be clever with your data structures / algorithms. E.g. You could create a tree-like structure to quickly eliminate possibilities as you read successive letters. You read one character, follow the appropriate branch, check if there are any further branches, if not then no match and break. Else, read the next character and follow the appropriate branch and so on.... Infeasible for large (or even medium-sized) character-sets without hashing, but might be pretty fast for a-z and a large number of short strings.
Aug 05 2013
next sibling parent "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 15:18:42 UTC, John Colvin wrote:
 On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

  if (std.string.tolower(fext[0]) == "doc" ||
    std.string.tolower(fext[0]) == "docx" ||
    std.string.tolower(fext[0]) == "xls" ||
    std.string.tolower(fext[0]) == "xlsx" ||
    std.string.tolower(fext[0]) == "ppt" ||
    std.string.tolower(fext[0]) == "pptx")
   continue;
 }

 foreach (...)
 {
  if (std.string.tolower(fext[0]) == "doc")
    continue;
  if (std.string.tolower(fext[0]) == "docx")
    continue;
  if (std.string.tolower(fext[0]) == "xls")
    continue;
  if (std.string.tolower(fext[0]) == "xlsx")
    continue;
  if (std.string.tolower(fext[0]) == "ppt")
    continue;
  if (std.string.tolower(fext[0]) == "pptx")
   continue;
  ...
  ...
 }

 thanks.

 josé
better: foreach (...) { auto tmp = std.string.tolower(fext[0]); if(tmp == "doc" || tmp == "docx" || tmp == "xls" || tmp == "xlsx" || tmp == "ppt" || tmp == "pptx") { continue; } } but still not super-fast as (unless the compiler is very clever) it still means multiple passes over tmp. Also, it converts the whole string to lower case even when it's not necessary. If you have large numbers of possible matches you will probably want to be clever with your data structures / algorithms. E.g. You could create a tree-like structure to quickly eliminate possibilities as you read successive letters. You read one character, follow the appropriate branch, check if there are any further branches, if not then no match and break. Else, read the next character and follow the appropriate branch and so on.... Infeasible for large (or even medium-sized) character-sets without hashing, but might be pretty fast for a-z and a large number of short strings.
As the Hispanic speaking community say, muchas gracias.
Aug 05 2013
prev sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 5 August 2013 at 15:18:42 UTC, John Colvin wrote:
 better:

 foreach (...)
 {
     auto tmp = std.string.tolower(fext[0]);
     if(tmp == "doc" || tmp == "docx"
        || tmp == "xls" || tmp == "xlsx"
        || tmp == "ppt" || tmp == "pptx")
     {
         continue;
     }
 }

 but still not super-fast as (unless the compiler is very 
 clever) it still means multiple passes over tmp. Also, it 
 converts the whole string to lower case even when it's not 
 necessary.

 If you have large numbers of possible matches you will probably 
 want to be clever with your data structures / algorithms. E.g.

 You could create a tree-like structure to quickly eliminate 
 possibilities as you read successive letters. You read one 
 character, follow the appropriate branch, check if there are 
 any further branches, if not then no match and break. Else, 
 read the next character and follow the appropriate branch and 
 so on.... Infeasible for large (or even medium-sized) 
 character-sets without hashing, but might be pretty fast for 
 a-z and a large number of short strings.
Arguably, you'd do even better with: foreach (...) { auto tmp = std.string.tolower(fext[0]); switch(tmp) { case "doc", "docx": case "xls", "xlsx": case "ppt", "pptx": continue; default: } } Since it gives the compiler more wiggle room to optimize things as it wishes. For example, it *could* (who knows :D !) implement the switch as a hash table, or a tree. BTW, a very convenient "tree-like" structure that could be used is a "heap": it is a basic binary tree, but stored inside an array. You could build it during compilation, and then simply search it. A possible optimization is to first switch on string length: foreach (...) { auto tmp = std.string.tolower(fext[0]); switch(tmp.length) { case 3: switch(tmp) { case "doc", "xls", "ppt": continue; default: } break; case 4: switch(tmp) { case "docx", "xlsx", "pptx": continue; default: } break; default: } } That said, I'm not even sure this would be faster, so a bench would be called for. Further more, I'd really be tempted to say that at this point, we are in the realm of premature optimization.
Aug 05 2013
parent "jicman" <cabrera_ _wrc.xerox.com> writes:
On Monday, 5 August 2013 at 17:04:36 UTC, monarch_dodra wrote:
 On Monday, 5 August 2013 at 15:18:42 UTC, John Colvin wrote:
 better:

 foreach (...)
 {
    auto tmp = std.string.tolower(fext[0]);
    if(tmp == "doc" || tmp == "docx"
       || tmp == "xls" || tmp == "xlsx"
       || tmp == "ppt" || tmp == "pptx")
    {
        continue;
    }
 }

 but still not super-fast as (unless the compiler is very 
 clever) it still means multiple passes over tmp. Also, it 
 converts the whole string to lower case even when it's not 
 necessary.

 If you have large numbers of possible matches you will 
 probably want to be clever with your data structures / 
 algorithms. E.g.

 You could create a tree-like structure to quickly eliminate 
 possibilities as you read successive letters. You read one 
 character, follow the appropriate branch, check if there are 
 any further branches, if not then no match and break. Else, 
 read the next character and follow the appropriate branch and 
 so on.... Infeasible for large (or even medium-sized) 
 character-sets without hashing, but might be pretty fast for 
 a-z and a large number of short strings.
Arguably, you'd do even better with: foreach (...) { auto tmp = std.string.tolower(fext[0]); switch(tmp) { case "doc", "docx": case "xls", "xlsx": case "ppt", "pptx": continue; default: } } Since it gives the compiler more wiggle room to optimize things as it wishes. For example, it *could* (who knows :D !) implement the switch as a hash table, or a tree. BTW, a very convenient "tree-like" structure that could be used is a "heap": it is a basic binary tree, but stored inside an array. You could build it during compilation, and then simply search it. A possible optimization is to first switch on string length: foreach (...) { auto tmp = std.string.tolower(fext[0]); switch(tmp.length) { case 3: switch(tmp) { case "doc", "xls", "ppt": continue; default: } break; case 4: switch(tmp) { case "docx", "xlsx", "pptx": continue; default: } break; default: } } That said, I'm not even sure this would be faster, so a bench would be called for. Further more, I'd really be tempted to say that at this point, we are in the realm of premature optimization.
:-) Now that's funny. He he he he... By the way, your coding makes sense. For 100K plus files, it may not mean much, but we are thinking that this amount will be only growing.
Aug 05 2013
prev sibling next sibling parent reply David <d dav1d.de> writes:
Am 05.08.2013 15:59, schrieb jicman:
 
 Greetings!
 
 I have this code,
 
 foreach (...)
 {
 
   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }
 
That's how I would do it. if(["doc", "docx", "xls", "xlsx", "ppt", "pptx"].canFind(fext[0].toLower) { continue; } With the array beeing a runtime/compiletime constant with a sensible name. WORD_EXTENSIONS or something.
Aug 05 2013
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 5 August 2013 at 15:21:25 UTC, David wrote:
 Am 05.08.2013 15:59, schrieb jicman:
 
 Greetings!
 
 I have this code,
 
 foreach (...)
 {
 
   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }
 
That's how I would do it. if(["doc", "docx", "xls", "xlsx", "ppt", "pptx"].canFind(fext[0].toLower) { continue; } With the array beeing a runtime/compiletime constant with a sensible name. WORD_EXTENSIONS or something.
In case some of the strings are actually rather long, one could use toLower lazily (either std.ascii.toLower or std.uni.toLower, depending on which you need): if(WORD_EXT.canFind(fext[0].map!toLower()))
Aug 05 2013
prev sibling next sibling parent reply "JS" <js.mdnq gmail.com> writes:
On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }

 foreach (...)
 {
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
 }

 thanks.

 josé
Both are slow, those trying to optimize out tolower are missing the boat. A good compiler should optimize that out. The issue is that n compare are done in those cases with each successive compare getting slower and slower. if the match is doc then it happens in one compare in both cases. If it's pptx then it is 6. So if there are many pptx matches then the code will perform slower than for doc. This can be good or bad behavior depending on if you are guarantee that your order is representative of what actually happens(many doc's few pptx's). To get O(1) behavior, you must use some sort of jump list. A hash table can provide such behavior. If there are many comparisons needed then at some point this will be, on average, faster. If performance is critical and the string length is small enough then the fastest method is to use a direct lookup table. For n character of char. It is 256^n. For n = 4, this is 4GB of memory! (This is one reason to use a hash table). Since you are not using the full ascii you could compact this a bit, say 36^n, for n = 4 it is only 1.6MB of memory. The cost is packing the strings for lookup and computing the index. One way to speed up the loop is to also compare for what is not used. In this case you can compare just on the first char. if (fext[0][0]) == "d") { ... } if (fext[0][0]) == "x") { ... } if (fext[0][0]) == "p") { ... } This pares down the results. In this case it is only 3 comparisions + 2 comparisons = 5 for pptx rather than 6. You can try and develop your own hash to get optimal amortized performance.
Aug 05 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Aug 05, 2013 at 07:30:36PM +0200, JS wrote:
 On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
Greetings!

I have this code,

foreach (...)
{

  if (std.string.tolower(fext[0]) == "doc" ||
    std.string.tolower(fext[0]) == "docx" ||
    std.string.tolower(fext[0]) == "xls" ||
    std.string.tolower(fext[0]) == "xlsx" ||
    std.string.tolower(fext[0]) == "ppt" ||
    std.string.tolower(fext[0]) == "pptx")
   continue;
}

foreach (...)
{
  if (std.string.tolower(fext[0]) == "doc")
    continue;
  if (std.string.tolower(fext[0]) == "docx")
    continue;
  if (std.string.tolower(fext[0]) == "xls")
    continue;
  if (std.string.tolower(fext[0]) == "xlsx")
    continue;
  if (std.string.tolower(fext[0]) == "ppt")
    continue;
  if (std.string.tolower(fext[0]) == "pptx")
   continue;
  ...
  ...
}

thanks.

josé
Both are slow, those trying to optimize out tolower are missing the boat. A good compiler should optimize that out. The issue is that n compare are done in those cases with each successive compare getting slower and slower. if the match is doc then it happens in one compare in both cases. If it's pptx then it is 6. So if there are many pptx matches then the code will perform slower than for doc. This can be good or bad behavior depending on if you are guarantee that your order is representative of what actually happens(many doc's few pptx's). To get O(1) behavior, you must use some sort of jump list. A hash table can provide such behavior. If there are many comparisons needed then at some point this will be, on average, faster.
[...] There's no need to reinvent the wheel here. std.regex.ctRegex will create the equivalent of jump tables for you. Besides, I doubt the OP's real performance bottleneck is in this code; sounds like the program is I/O bound. Compared to I/O, even repeatedly calling std.string.tolower repeatedly is just roundoff error in terms of total execution time. Using a profiler will reveal the *real* bottleneck. Based on past experience, I'm expecting that it will reveal that the bottleneck isn't anywhere we're predicting it is. :) T -- VI = Visual Irritation
Aug 05 2013
prev sibling next sibling parent =?UTF-8?B?UmFwaGHDq2wgSmFrc2U=?= <raphael.jakse gmail.com> writes:
Le 05/08/2013 15:59, jicman a écrit :
 Greetings!

 I have this code,

 foreach (...)
 {

    if (std.string.tolower(fext[0]) == "doc" ||
      std.string.tolower(fext[0]) == "docx" ||
      std.string.tolower(fext[0]) == "xls" ||
      std.string.tolower(fext[0]) == "xlsx" ||
      std.string.tolower(fext[0]) == "ppt" ||
      std.string.tolower(fext[0]) == "pptx")
     continue;
 }

 foreach (...)
 {
    if (std.string.tolower(fext[0]) == "doc")
      continue;
    if (std.string.tolower(fext[0]) == "docx")
      continue;
    if (std.string.tolower(fext[0]) == "xls")
      continue;
    if (std.string.tolower(fext[0]) == "xlsx")
      continue;
    if (std.string.tolower(fext[0]) == "ppt")
      continue;
    if (std.string.tolower(fext[0]) == "pptx")
     continue;
    ...
    ...
 }

 thanks.

 josé
Like others said, writing if( ...) continue; if(...) continue; versus if(... || ...) continue; is not the problem in the code. Computing the lower case of the string for each comparison is a greater problem that can be spotted at the first glance. Storing the value in a variable and then using this variables in comparisons would be better. Both versions of your code should be equivalent, though this should be verified. I would suggest another way of doing this: switch(std.string.tolower(fext[0])) { case "doc": case "docx": case "xls": case "xlsx": case "ppt": case "pptx": continue; default: // do something else } This way of writing your code seems far more readable and elegant to me. Even more concise: switch(std.string.tolower(fext[0])) { case "doc","docx","xls","xlsx","ppt","pptx": continue; default: // do something else } Much like what morarch_dodra proposed, I guess. But notice that tolower is deprecated in current D2 version, toLower should be used instead. If you still use D1, maybe a better thing to do before considering such optimization would be opting for D2. I would also suggest avoiding usage of continue if you don't need it. Here, it is likely that you can write something more structured like: import std.string : toLower // ... foreach(...) { switch(fext[0].toLower) { case "doc","docx","xls","xlsx","ppt","pptx": // do what is to be done here with these cases, or nothing break; default: // do something for other cases } } If you need to use fext[0].toLower in any case, I would suggest you write this kind of code instead: import std.string : toLower // ... auto ext = fext[0].toLower; switch(ext) { case "doc","docx","xls","xlsx","ppt","pptx": // do what is to be done here with these cases, or nothing // use ext instead of fext[0].toLower, if needed break; default: // do something for other cases // use ext instead of fext[0].toLower, if needed } auto ext = fext[0].toLower; should be placed before the foreach loop if fext[0] isn't changed in the loop (avoid computing a value more than one time when it is possible, though thinking before applying this rule or any other rule is not forbidden). There might exist solutions that could be faster, like testing the first letter before the second / using the length of the string / using finite state machine to save tests, but beware of the maintainability of your code and if you go into this, triple check that these solutions are indeed more efficient in terms of execution. See morarch_dodra, JS, bearophile and others posts for more precise information. If this part of your code is not known to be a bottleneck, I would opt for readability / elegance over over-optimization and for this particular case, I would trust the compiler for optimizing enough with the switch version of the code (though it is not more than a opinion here).
Aug 05 2013
prev sibling next sibling parent reply "Andre Artus" <andre.artus gmail.com> writes:
On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }

 foreach (...)
 {
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
 }

 thanks.

 josé
What exactly are you trying to do with this? I get the impression that there is an attempt at "local optimization" when broader approach could lead to better results. For instance. Using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx") could actually end up being a lot faster. If you could give more detail about what you are trying to achieve then it could be possible to get better results.
Aug 05 2013
parent reply "jicman" <cabrera_ _wrc.xerox.com> writes:
On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
 On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

  if (std.string.tolower(fext[0]) == "doc" ||
    std.string.tolower(fext[0]) == "docx" ||
    std.string.tolower(fext[0]) == "xls" ||
    std.string.tolower(fext[0]) == "xlsx" ||
    std.string.tolower(fext[0]) == "ppt" ||
    std.string.tolower(fext[0]) == "pptx")
   continue;
 }

 foreach (...)
 {
  if (std.string.tolower(fext[0]) == "doc")
    continue;
  if (std.string.tolower(fext[0]) == "docx")
    continue;
  if (std.string.tolower(fext[0]) == "xls")
    continue;
  if (std.string.tolower(fext[0]) == "xlsx")
    continue;
  if (std.string.tolower(fext[0]) == "ppt")
    continue;
  if (std.string.tolower(fext[0]) == "pptx")
   continue;
  ...
  ...
 }

 thanks.

 josé
What exactly are you trying to do with this? I get the impression that there is an attempt at "local optimization" when broader approach could lead to better results. For instance. Using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx") could actually end up being a lot faster. If you could give more detail about what you are trying to achieve then it could be possible to get better results.
The files are in a network drive and doing a list foreach *.doc, *.docx, etc. will be more expensive than getting the list of all the files at once and then processing them accordingly.
Aug 06 2013
next sibling parent reply "Andre Artus" <andre.artus gmail.com> writes:
On Tuesday, 6 August 2013 at 12:32:13 UTC, jicman wrote:
 On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
 On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

 if (std.string.tolower(fext[0]) == "doc" ||
   std.string.tolower(fext[0]) == "docx" ||
   std.string.tolower(fext[0]) == "xls" ||
   std.string.tolower(fext[0]) == "xlsx" ||
   std.string.tolower(fext[0]) == "ppt" ||
   std.string.tolower(fext[0]) == "pptx")
  continue;
 }

 foreach (...)
 {
 if (std.string.tolower(fext[0]) == "doc")
   continue;
 if (std.string.tolower(fext[0]) == "docx")
   continue;
 if (std.string.tolower(fext[0]) == "xls")
   continue;
 if (std.string.tolower(fext[0]) == "xlsx")
   continue;
 if (std.string.tolower(fext[0]) == "ppt")
   continue;
 if (std.string.tolower(fext[0]) == "pptx")
  continue;
 ...
 ...
 }

 thanks.

 josé
What exactly are you trying to do with this? I get the impression that there is an attempt at "local optimization" when broader approach could lead to better results. For instance. Using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx") could actually end up being a lot faster. If you could give more detail about what you are trying to achieve then it could be possible to get better results.
The files are in a network drive and doing a list foreach *.doc, *.docx, etc. will be more expensive than getting the list of all the files at once and then processing them accordingly.
Again, what are you trying to achieve? Your statement is not necessarily true, for a myriad of reasons, but it entirely depends on what you want to do. I would reiterate Dennis Luehring's reply, why are you not benching? It seems like you are guessing at what the problems are, that's hardly ever useful. One of the first rules of network optimization is to reduce the amount od data, that normally means filtering.at the server, the next thing is coarse grained is better than fine (BOCTAOE/L).
Aug 06 2013
parent reply "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 6 August 2013 at 14:49:42 UTC, Andre Artus wrote:
 On Tuesday, 6 August 2013 at 12:32:13 UTC, jicman wrote:
 On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
 On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
 Greetings!

 I have this code,

 foreach (...)
 {

 if (std.string.tolower(fext[0]) == "doc" ||
  std.string.tolower(fext[0]) == "docx" ||
  std.string.tolower(fext[0]) == "xls" ||
  std.string.tolower(fext[0]) == "xlsx" ||
  std.string.tolower(fext[0]) == "ppt" ||
  std.string.tolower(fext[0]) == "pptx")
 continue;
 }

 foreach (...)
 {
 if (std.string.tolower(fext[0]) == "doc")
  continue;
 if (std.string.tolower(fext[0]) == "docx")
  continue;
 if (std.string.tolower(fext[0]) == "xls")
  continue;
 if (std.string.tolower(fext[0]) == "xlsx")
  continue;
 if (std.string.tolower(fext[0]) == "ppt")
  continue;
 if (std.string.tolower(fext[0]) == "pptx")
 continue;
 ...
 ...
 }

 thanks.

 josé
What exactly are you trying to do with this? I get the impression that there is an attempt at "local optimization" when broader approach could lead to better results. For instance. Using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx") could actually end up being a lot faster. If you could give more detail about what you are trying to achieve then it could be possible to get better results.
The files are in a network drive and doing a list foreach *.doc, *.docx, etc. will be more expensive than getting the list of all the files at once and then processing them accordingly.
Again, what are you trying to achieve? Your statement is not necessarily true, for a myriad of reasons, but it entirely depends on what you want to do. I would reiterate Dennis Luehring's reply, why are you not benching? It seems like you are guessing at what the problems are, that's hardly ever useful. One of the first rules of network optimization is to reduce the amount od data, that normally means filtering.at the server, the next thing is coarse grained is better than fine (BOCTAOE/L).
It's a long story and I will return in a few months and give you the whole story, but right now, time is not on my side. I have answers for all the questions you folks have asked, and I appreciate all the input. I have the answer that I was looking for, so in a few months, I will come back and explain the whole story. Thanks for all the response and suggestions. jic
Aug 06 2013
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 07.08.2013 06:30, schrieb jicman:
 Again, what are you trying to achieve?
 Your statement is not necessarily true, for a  myriad of
 reasons, but it entirely depends on what you want to do.
 I would reiterate Dennis Luehring's reply, why are you not
 benching? It seems like you are guessing at what the problems
 are, that's hardly ever useful.
 One of the first rules of network optimization  is to reduce
 the amount od data, that normally means filtering.at the
 server, the next thing is coarse grained is better than fine
 (BOCTAOE/L).
It's a long story and I will return in a few months and give you the whole story, but right now, time is not on my side. I have answers for all the questions you folks have asked, and I appreciate all the input. I have the answer that I was looking for, so in a few months, I will come back and explain the whole story. Thanks for all the response and suggestions.
after makeing us girls all wet to help you - your reply is "no sex on the first date, im a gentlemen... but maybe in a few months" so: you having a jscript doing somehting with files,fileextensions over networkdrive - it runs around 8h you ported that jscript to D - now it runs for 6h you noob-guessed the lowercase-if-party could be evil (btw: it cost more time to guess then to benchmark) you get trivial answers that won't get you very much, the lowercase would not boost your speed that much and the networkdrive latency will kill all the other statemachine ideas you don't answer trivial questions about the big picture - and now you're out of time open questions: -why not collect the data on the server itself - instead of grabbing tiny bits over network? - this is for understanding your environent -how big is the speed drop with your tool on the very same drive localy and over a networkdrive? - this is for understanding the latency -are you also reading this files or just doing filename search (recursively?) and throwing out non office-extensions? this is for getting an idea if buildin OS(operating system) features can help see you in a few months
Aug 06 2013
parent "Andre Artus" <andre.artus gmail.com> writes:
 It's a long story and I will return in a few months and give 
 you
 the whole story, but right now, time is not on my side.  I have
 answers for all the questions you folks have asked, and I
 appreciate all the input.  I have the answer that I was looking
 for, so in a few months, I will come back and explain the whole
 story.  Thanks for all the response and suggestions.
after makeing us girls all wet to help you - your reply is "no sex on the first date, im a gentlemen... but maybe in a few months" so: you having a jscript doing somehting with files,fileextensions over networkdrive - it runs around 8h you ported that jscript to D - now it runs for 6h you noob-guessed the lowercase-if-party could be evil (btw: it cost more time to guess then to benchmark) you get trivial answers that won't get you very much, the lowercase would not boost your speed that much and the networkdrive latency will kill all the other statemachine ideas you don't answer trivial questions about the big picture - and now you're out of time open questions: -why not collect the data on the server itself - instead of grabbing tiny bits over network? - this is for understanding your environent
Just for a sanity check I implemented a quick client-server setup where the daemon takes filespec from the client and returns a line-by-line list compressed into one packet. The total running time on 8 terabytes of files stored over a dozen drives searched recursively: less than 1 minute. Same over slow WiFi, negligible difference (list compresses to a few Kb) with LZMA. I did not even bother to search each physical drive separately, just produced the list sequentially.
 -how big is the speed drop with your tool on the very same 
 drive localy and over a networkdrive? - this is for 
 understanding the latency

 -are you also reading this files or just doing filename search 
 (recursively?) and throwing out non office-extensions?
 this is for getting an idea if buildin OS(operating system) 
 features can help

 see you in a few months
It's impossible to help people who refuse to give basic information.
Aug 06 2013
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 06, 2013 at 02:32:11PM +0200, jicman wrote:
 On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
[...]
What exactly are you trying to do with this? I get the impression
that there is an attempt at "local optimization" when broader
approach could lead to better results.

For instance. Using the OS's facilities to filter (six requests,
one each for "*.doc", "*.docx") could actually end up being a lot
faster.

If you could give more detail about what you are trying to achieve
then it could be possible to get better results.
The files are in a network drive and doing a list foreach *.doc, *.docx, etc. will be more expensive than getting the list of all the files at once and then processing them accordingly.
In this case, the bottleneck most likely is in the network latency, so optimizing string comparisons won't get you very far. You should instead be focusing on how to improve the performance of network accesses. What kind of processing are you doing with the filtered list of files? If it's a complicated operation, you might want to consider having a separate thread for fetching the list of files while work is being done on them in the main thread. T -- Computers shouldn't beep through the keyhole.
Aug 06 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Aug 05, 2013 at 03:59:23PM +0200, jicman wrote:
 
 Greetings!
 
 I have this code,
 
 foreach (...)
 {
 
   if (std.string.tolower(fext[0]) == "doc" ||
     std.string.tolower(fext[0]) == "docx" ||
     std.string.tolower(fext[0]) == "xls" ||
     std.string.tolower(fext[0]) == "xlsx" ||
     std.string.tolower(fext[0]) == "ppt" ||
     std.string.tolower(fext[0]) == "pptx")
    continue;
 }
 
 foreach (...)
 {
   if (std.string.tolower(fext[0]) == "doc")
     continue;
   if (std.string.tolower(fext[0]) == "docx")
     continue;
   if (std.string.tolower(fext[0]) == "xls")
     continue;
   if (std.string.tolower(fext[0]) == "xlsx")
     continue;
   if (std.string.tolower(fext[0]) == "ppt")
     continue;
   if (std.string.tolower(fext[0]) == "pptx")
    continue;
   ...
   ...
 }
[...] It would appear that your bottleneck is not in the continue statements, but in the repeated calls to std.string.tolower. It's probably better to write it this way: foreach (...) { auto ext = std.string.tolower(fext[0]); if (ext == "doc" || ext == "docx" || ... ) continue; } This way you save on the overhead of many identical function calls, which probably outweights any benefit you get by optimizing continue. T -- It's amazing how careful choice of punctuation can leave you hanging:
Aug 05 2013
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, August 05, 2013 15:59:23 jicman wrote:
 Greetings!
 
 I have this code,
 
 foreach (...)
 {
 
    if (std.string.tolower(fext[0]) == "doc" ||
      std.string.tolower(fext[0]) == "docx" ||
      std.string.tolower(fext[0]) == "xls" ||
      std.string.tolower(fext[0]) == "xlsx" ||
      std.string.tolower(fext[0]) == "ppt" ||
      std.string.tolower(fext[0]) == "pptx")
     continue;
 }
 
 foreach (...)
 {
    if (std.string.tolower(fext[0]) == "doc")
      continue;
    if (std.string.tolower(fext[0]) == "docx")
      continue;
    if (std.string.tolower(fext[0]) == "xls")
      continue;
    if (std.string.tolower(fext[0]) == "xlsx")
      continue;
    if (std.string.tolower(fext[0]) == "ppt")
      continue;
    if (std.string.tolower(fext[0]) == "pptx")
     continue;
    ...
    ...
 }
As others have pointed out, the two code snippets are semantically identical, and there's a decent chance that the compiler will generate identical code for both. It wouldn't surprise me if the first snippet resulted in more concise assembly code, but there shouldn't be any performance difference between the two. But the first snippet is cleaner, so between those two, you should use that. Of course, as others have pointed out, it would be far more efficient to simply calculate std.string.tolower(fext[0]) once and re-use the result, which should be a very large performance gain here (assuming that the code around it doesn't dwarf it in cost). However, another possibility that I don't think anyone has pointed out is std.string.icmp, which will do case-insensitive comparison, meaning that you could do if(fext[0].icmp("doc") == 0 || fext[0].icmp("docx") == 0 || ...) and that might be more efficient as it avoids having to allocate a new string when fext[0] isn't all lowercase already. However, you would have to benchmark the code to be sure, since depending on the typical input, the code around this code, and how many times this code gets called, calling toLower once could be more efficient or using icmp could be more efficient, since toLower allocates whereas icmp doesn't, but icmp does more comparisons that a simple ==. But if you really care about efficiency, you should try both ways and see which is faster for your use case. - Jonathan M Davis
Aug 05 2013