digitalmars.D.learn - Replacing tango.text.Ascii.isearch

torhu <torhu yahoo.com> writes:

I need a case-insensitive check to see if a string contains 
another string for a "quick filter" feature. It should 
preferably be perceived as instant by the user, and needs to 
check a few thousand strings in typical cases. Is a regex the 
best option, or what would you suggest?

Oct 05 2022

Steven Schveighoffer <schveiguy gmail.com> writes:

On 10/5/22 12:59 PM, torhu wrote:
 I need a case-insensitive check to see if a string contains another 
 string for a "quick filter" feature. It should preferably be perceived 
 as instant by the user, and needs to check a few thousand strings in 
 typical cases. Is a regex the best option, or what would you suggest?

https://dlang.org/phobos/std_uni.html#asLowerCase

```d
bool isearch(S1, S2)(S1 haystack, S2 needle)
{
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
}
```

untested.

-Steve

Oct 05 2022

torhu <torhu yahoo.com> writes:

On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:
[...]


I wanted to do some quick benchmarking to figure out what works.

When I run this:

```d
import std.stdio;
import std.datetime.stopwatch;

void main()
{
	auto sw = StopWatch();	
	sw.stop();
	writeln(sw.peek().toString());
}
```

It prints this:
 2 weeks, 6 days, 9 hours, 34 minutes, 43 secs, 214 ms, 946 μs, 
 and 7 hnsecs

Am I doing something wrong here?

Oct 05 2022

torhu <torhu yahoo.com> writes:

On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:

 Am I doing something wrong here?

Right, you can instantiate structs without arguments. It's been 
ten years since I last used D; I was thinking of structs as if 
they were classes.

Oct 05 2022

torhu <torhu yahoo.com> writes:

On Wednesday, 5 October 2022 at 20:45:55 UTC, torhu wrote:
 On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:

 Am I doing something wrong here?

 Right, you can instantiate structs without arguments. It's been 
 ten years since I last used D, I was thinking of structs like 
 if they were classes.

I think there should be a sensible default here; it seems like an 
easy trap to remove.

Oct 05 2022

Ali Çehreli <acehreli yahoo.com> writes:

On 10/5/22 13:40, torhu wrote:

      auto sw = StopWatch();

Either this:

     auto sw = StopWatch(AutoStart.yes);

or this:

     auto sw = StopWatch();
     sw.start();

Ali
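
For reference, a minimal sketch of the first variant in use. The helper name `timeIt` and the summing workload are made up for illustration; only `StopWatch`, `AutoStart`, `stop` and `peek` come from `std.datetime.stopwatch`.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

/// Hypothetical helper: runs `work` once and returns the elapsed time.
Duration timeIt(scope void delegate() work)
{
    auto sw = StopWatch(AutoStart.yes); // starts counting immediately
    work();
    sw.stop();
    return sw.peek();
}

void main()
{
    ulong sum = 0;
    auto elapsed = timeIt({
        foreach (i; 0 .. 1_000_000)
            sum += i;
    });
    writeln("sum = ", sum, ", elapsed = ", elapsed);
}
```

Without `AutoStart.yes` (or an explicit `sw.start()`), the stopwatch is never running, so the value returned by `peek()` is not a measurement of the code in between.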

Oct 05 2022

torhu <torhu yahoo.com> writes:

On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:

 ```d
 bool isearch(S1, S2)(S1 haystack, S2 needle)
 {
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
 }
 ```

 untested.

 -Steve

I did some basic testing, and regex was two orders of magnitude 
faster. So now I know, I guess.

Oct 05 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:
 I did some basic testing, and regex was two orders of magnitude 
 faster. So now I know, I guess.

Substring search functionality is currently in very bad shape 
in Phobos. I discovered this myself a few weeks ago when I was 
trying to solve the 
https://www.facebook.com/codingcompetitions/hacker-cup/2022/round-1/problems/A2
puzzle using the D language. Part of the solution requires a fast 
substring search (actually a subarray search), and Phobos doesn't 
offer anything with even remotely acceptable performance.

Phobos does have a Boyer-Moore implementation. This algorithm is 
historically famous in computer science as one of the first 
attempts to optimize substring search, but it also has 
pathologically bad performance on certain input data and probably 
shouldn't be recommended for any practical use nowadays. The 
users of old versions of Python discovered this too: 
https://codeforces.com/blog/entry/106849?#comment-952483

The standard 'find' function from the following simple example 
also becomes awfully slow when arrays 'a' and 'b' are large 
and/or are adversarially constructed:

```D
import std;
void main() {
   auto a = [1, 2, 3, 4];
   auto b = [2, 3];
   writefln("Is %s a subarray of %s? The answer is %s.",
            b, a, a.find(b).empty ? "no" : "yes");
}
```

I think that the best fit for your use case is the 
Knuth-Morris-Pratt algorithm 
(https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm), 
and there's an old issue in Bugzilla about this:
https://issues.dlang.org/show_bug.cgi?id=16066
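
To make the suggestion concrete, here is a minimal sketch of a Knuth-Morris-Pratt search over arrays. It is not code from Phobos or from the linked issue; the names `buildFailureTable` and `kmpIndexOf` are hypothetical. The point is that the haystack index never moves backwards, so the running time stays linear even on adversarial input.

```d
import std.stdio : writeln;

/// fail[i] is the length of the longest proper prefix of
/// needle[0 .. i + 1] that is also a suffix of it.
size_t[] buildFailureTable(T)(const(T)[] needle)
{
    auto fail = new size_t[needle.length];
    size_t k = 0;
    foreach (i; 1 .. needle.length)
    {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            ++k;
        fail[i] = k;
    }
    return fail;
}

/// Returns the index of the first occurrence of needle, or -1 if absent.
ptrdiff_t kmpIndexOf(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length == 0)
        return 0;
    auto fail = buildFailureTable(needle);
    size_t k = 0; // number of needle elements matched so far
    foreach (i, c; haystack)
    {
        while (k > 0 && c != needle[k])
            k = fail[k - 1]; // fall back without re-reading the haystack
        if (c == needle[k])
            ++k;
        if (k == needle.length)
            return cast(ptrdiff_t)(i + 1 - needle.length);
    }
    return -1;
}

void main()
{
    writeln(kmpIndexOf([1, 2, 3, 4], [2, 3])); // 1
    writeln(kmpIndexOf("aaaaab", "aab"));      // 3
    writeln(kmpIndexOf("abc", "xyz"));         // -1
}
```

For strings this compares UTF-8 code units, so a case-insensitive search would still need both sides converted to a common case first.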

BTW, if anyone is curious, one of the possible solutions for the 
hacker-cup/2022/round-1/problems/A2 puzzle in D language can be 
found here: 
https://codeforces.com/blog/entry/106768?#comment-952808

Oct 06 2022

Sergey <kornburn yandex.ru> writes:

On Thursday, 6 October 2022 at 08:15:10 UTC, Siarhei Siamashka 
wrote:
 On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:

Please don’t tell us that D will be slower than Python again?)

Oct 06 2022

rassoc <rassoc posteo.de> writes:

On 10/5/22 23:50, torhu via Digitalmars-d-learn wrote:
 I did some basic testing, and regex was two orders of magnitude faster. So now
I know, I guess.

And what kind of testing was that? Mind sharing? Because I did the following
real quick and wasn't able to measure a "two orders of magnitude" difference.
Sure, the regex version came out on top, but they were both faster than the
Ruby baseline I cooked up.

First, generate a word file with 100k entries of various lengths:

$> dmd -run words.d foobaz 100000
---
import std;

string randomWord(ulong n) {
     static chars = letters.array;
     return generate!(() => chars.choice).take(n).text;
}

void main(string[] args) {
     enforce(args.length == 3, "Usage: dmd -run words.d needle num");

     auto f = File("words.txt", "w");
     foreach (i; 0..args[2].to!ulong) {
         ulong n = uniform(0, 50), m = uniform(0, 50);
         if (i % 2 == 0)
             f.writeln(randomWord(n), args[1], randomWord(m));
         else
             f.writeln(randomWord(n + m));
     }
}
---

And then for the actual measuring:

$> dmd -O -version={range,regex} -of=search-{range,regex} search.d
$> ldc -O -d-version={range,regex} -of=search-{range,regex}-ldc search.d
$> time ./search-{range,regex}{,-ldc} foobaz
---
import std;

void main(string[] args) {
     enforce(args.length == 2, "Usage: search 'needle'");

     version (regex)
         auto rx = regex(args[1], "i");
     else version (range)
         auto needle = args[1].asLowerCase.text;
     else
         static assert(0, "use -version={regex,range}");

     ulong matches;
     foreach (line; File("words.txt").byLine) {
         version (regex)
             if (line.matchFirst(rx))
                 matches++;
         version (range)
             if (line.asLowerCase.canFind(needle))
                 matches++;
     }
     writeln(matches);
}
---

Oct 06 2022

torhu <torhu yahoo.com> writes:

On Thursday, 6 October 2022 at 21:36:48 UTC, rassoc wrote:

 And what kind of testing was that? Mind to share? Because I did 
 the following real quick and wasn't able to measure a "two 
 orders of magnitude" difference. Sure, the regex version came 
 on top, but they were both faster than the ruby baseline I 
 cooked up.

Originally I just loaded a one-megabyte file and searched the 
whole thing. I changed it to split the text into lines (40,000 of 
them) instead; regex is then about ten times faster. I compile 
with -release -O -inline. Here is the second version:

```d
import std;
import std.datetime.stopwatch;

enum FILE = "test.lst";
string text;
string needle;

void test(bool delegate(string haystack) dg)
{

     auto sw = StopWatch(AutoStart.yes);
     int counter = 0;
     foreach (line; lineSplitter(text)) {
         if (dg(line))
             counter++;
     }
     sw.stop();
     writefln("%s", sw.peek());
     writefln("counter: %s", counter);
}

void main(char[][] args)
{
     enforce(args.length > 1, "Need a needle argument.");

     text = cast(string)read(FILE);
     needle = args[1].idup;
     auto re = regex(to!string(escaper(needle)), "i");
     string needleLower = needle.toLower();

     test((h) => !!h.matchFirst(re));
     test((h) => h.asLowerCase().canFind(needleLower));
}
```

Oct 06 2022

rassoc <rassoc posteo.de> writes:

On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:
 regex is about ten times faster then.

Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing
close to 10x. Welp, the woes of superficial benchmarking. :)

Oct 06 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Friday, 7 October 2022 at 00:57:38 UTC, rassoc wrote:
 On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:
 regex is about ten times faster then.

 Interesting! Using your code, I'm seeing a 1.5x max difference 
 for ldc, nothing close to 10x. Welp, the woes of superficial 
 benchmarking. :)

Benchmark results depend on many things, such as the actual text 
in both needle and haystack and the needle length. Are we dealing 
with Unicode text, by the way? One example is searching for 
something like "äußere" in 
https://www.gutenberg.org/ebooks/6343.txt.utf-8

If it's the source code, then searching for 
"sqlite3_value_bytes16" in the sqlite3.c file from 
https://www.sqlite.org/2022/sqlite-amalgamation-3390400.zip may 
be a good test too.

I'm getting at least 5x difference in favor of regex with LDC on 
these two examples.

Also are we allowed to artificially construct needle and haystack 
to blow up this test rather than only benchmarking it on typical 
real data?

Oct 06 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka 
wrote:
 Also are we allowed to artificially construct needle and 
 haystack to blow up this test rather than only benchmarking it 
 on typical real data?

Such as generating the input data via running:

     python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

And then using 
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the 
character 'a' replicated 50 times) as the needle to search for. 
Much longer needles work even better. On Linux the command line 
size is limited to 128K, so there's plenty of room for improvement.

Oct 07 2022

bachmeier <no spam.net> writes:

On Friday, 7 October 2022 at 07:16:19 UTC, Siarhei Siamashka 
wrote:
 On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka 
 wrote:
 Also are we allowed to artificially construct needle and 
 haystack to blow up this test rather than only benchmarking it 
 on typical real data?

 Such as generating the input data via running:

     python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

 And then using 
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the 
 character 'a' replicated 50 times) as the needle to search for. 
 Much longer needles work even better. In Linux the command line 
 size is limited by 128K, so there's a huge room for improvement.

https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

"the longer the pattern is, the faster the algorithm goes"

Oct 07 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
 https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

 "the longer the pattern is, the faster the algorithm goes"

Yes, that's how substring search works in the standard libraries 
of other programming languages. Now please take torhu's 
test program (posted a few comments above) and generate input for 
it using

     python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

Run it to search for 
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the 
character 'a' replicated 50 times) and then compare its 
performance against the other similar programs doing the same 
thing:

Python:

```Python
import sys
assert len(sys.argv) >= 2, "Need a needle argument."
needle = sys.argv[1].lower()
print(sum([1 if line.lower().find(needle) != -1 else 0 for line 
in open("test.lst")]))
```

Ruby/Crystal:

```Ruby
abort "Need a needle argument." unless ARGV.size >= 1
needle = ARGV[0].downcase
puts File.open("test.lst").each_line.sum {|line| 
line.downcase.index(needle) ? 1 : 0 }
```

A generic substring search is tuned to be fast on any input in 
the other programming languages. But in Phobos a simpleminded 
slow search algorithm is used by default. The Boyer-Moore 
algorithm can be used too if it's explicitly requested, but it 
has some gotchas of its own.
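
For completeness, a small sketch of what "explicitly requested" looks like with `boyerMooreFinder` from `std.algorithm.searching`. The wrapper name `containsBM` is hypothetical, and the sketch only shows the call shape; it does not address the gotchas mentioned above.

```d
import std.algorithm.searching : boyerMooreFinder, find;
import std.range.primitives : empty;
import std.stdio : writeln;

/// Hypothetical wrapper: subarray check that asks Phobos for its
/// Boyer-Moore searcher instead of the default search.
bool containsBM(int[] haystack, int[] needle)
{
    return !find(haystack, boyerMooreFinder(needle)).empty;
}

void main()
{
    int[] a = [-1, 0, 1, 2, 3, 4, 5];
    int[] b = [1, 2, 3];
    writeln(find(a, boyerMooreFinder(b))); // tail of `a` starting at the match
    writeln(containsBM(a, b));             // true
    writeln(containsBM(b, a));             // false
}
```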

Oct 07 2022

rassoc <rassoc posteo.de> writes:

On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:
 On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
 python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

That's generating a file with a single line:

$> wc -l test.lst
1 test.lst

Going with an appropriate 100k mixed line file and your mentioned needle, D is
still quite a bit slower, but the results aren't as drastic and nowhere near
"two orders of magnitude".

$> crystal build --release -o search-cr search.cr
abort "Need a needle argument." unless ARGV.size >= 1
needle = ARGV[0].downcase
puts File.open("words.txt").each_line.count(&.downcase.includes? needle)

$> ldc2 -O2 --release -of search-ldc search.d
import std;
void main(string[] args) {
     enforce(args.length > 1, "Need a needle argument.");
     auto needle = args[1].toLower;
     File("words.txt").byLine.count!(ln =>
ln.asLowerCase.canFind(needle)).writeln;
}

Oct 07 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:
 On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn 
 wrote:
 On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
 python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

 That's generating a file with a single line:

 $> wc -l test.lst
 1 test.lst

If you insist on having 100K lines in the input data, then you 
can try a different test (run it as a shell script):

```bash
python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 
10000)" > words.txt
cp words.txt test.lst
wc -l words.txt

NEEDLE=`python -c "print('a' * 10000, end='')"`

echo "=== Python ==="
time python search.py "$NEEDLE"

echo "=== Ruby ==="
time ruby search.rb "$NEEDLE"

echo "=== Crystal ==="
time ./search-cr "$NEEDLE"

echo "=== D (LDC) ==="
time ./search-ldc "$NEEDLE"
```

This tests a 1 MB file with 100K lines and a 10K-character 
needle. Results from my computer (with Python 3.10, because 
earlier versions of Python are slower):

```
100000 words.txt
=== Python ===
1

real    0m0.083s
user    0m0.072s
sys     0m0.010s
=== Ruby ===
1

real    0m0.313s
user    0m0.313s
sys     0m0.000s
=== Crystal ===
1

real    0m1.492s
user    0m1.482s
sys     0m0.010s
=== D (LDC) ===
1

real    1m10.314s
user    1m10.234s
sys     0m0.050s
```

 Going with an appropriate 100k mixed line file and your 
 mentioned needle, D is still quite a bit slower, but the 
 results aren't as drastic

Your "appropriate" file is also entirely artificial and happens 
to be specifically crafted to work best with the current 
".canFind" implementation from Phobos. The primary source of 
slowness is partial matches, such as the `"int i = "` prefix 
when searching for the `"int i = 0;"` substring in a string that 
contains `"int i = 1;"`. Time is wasted on comparing the 
initial 8 characters before encountering a mismatch and bailing 
out. But when searching in randomly generated gibberish, the 
chances of encountering long partial matches are very low. At 
least lower than in real text files.
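
The effect of partial matches can be illustrated with a toy counter. This is a sketch of a generic naive search, not the actual Phobos code; the function name `naiveComparisons` is hypothetical. It counts character comparisons for two haystacks of the same length, neither of which contains the needle.

```d
import std.array : replicate;
import std.stdio : writefln;

/// Counts character comparisons made by a naive left-to-right substring
/// search that restarts at the next position after every mismatch.
size_t naiveComparisons(const(char)[] haystack, const(char)[] needle)
{
    size_t comparisons = 0;
    if (needle.length == 0 || needle.length > haystack.length)
        return comparisons;
    foreach (start; 0 .. haystack.length - needle.length + 1)
    {
        size_t k = 0;
        while (k < needle.length)
        {
            ++comparisons;
            if (haystack[start + k] != needle[k])
                break; // mismatch: give up on this start position
            ++k;
        }
        if (k == needle.length)
            break; // full match found
    }
    return comparisons;
}

void main()
{
    auto needle = "a".replicate(50);
    // Both haystacks are 10_000 characters long and contain no match.
    auto adversarial = ("a".replicate(49) ~ "b").replicate(200);
    auto gibberish = "qwertyuiop".replicate(1000);
    writefln("adversarial: %s comparisons", naiveComparisons(adversarial, needle));
    writefln("gibberish:   %s comparisons", naiveComparisons(gibberish, needle));
}
```

On the gibberish input every start position fails on the first character, while on the adversarial input most start positions burn through a long run of matching characters before failing.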

 and nowhere near "two orders of magnitude".

Does the difference really have to be two orders of magnitude for 
you to acknowledge that there might be a performance problem in 
Phobos? Moreover, you are quoting torhu, who compared two D 
implementations compiled by DMD (regex vs. canFind) rather than 
standard libraries of different programming languages.

 ```D
 import std;
 void main(string[] args) {
     enforce(args.length > 1, "Need a needle argument.");
     auto needle = args[1].toLower;
     File("words.txt").byLine.count!(ln => 
 ln.asLowerCase.canFind(needle)).writeln;
 }
 ```

It's a nice-looking one-liner implementation in the D language and 
everything is fine, except that similar one-liners implemented 
in other programming languages are faster and more versatile 
(they can handle any input data without catastrophic performance 
pitfalls).

BTW, changing ".asLowerCase" to ".toLower" introduces an extra 
memory allocation, but still significantly improves handling of 
the worst case. Conversion to lowercase is expensive for Unicode 
characters, and revisiting the same character to repeat the 
conversion again and again is bad for performance.
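
A minimal sketch of that change, assuming the needle is lowercased once by the caller. The helper name `containsLowered` is hypothetical; `toLower` and `canFind` are the Phobos functions already used in this thread.

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : toLower;

/// Lowercases the haystack eagerly (this may allocate), then searches the
/// resulting array. `lowerNeedle` must already be lowercase.
bool containsLowered(S)(S haystack, const(char)[] lowerNeedle)
{
    return haystack.toLower.canFind(lowerNeedle);
}

void main()
{
    auto needle = "WORLD".toLower; // converted once, outside the loop
    foreach (line; ["Hello World", "Goodbye"])
        writeln(line, ": ", containsLowered(line, needle));
}
```

Each haystack character is converted exactly once here, at the cost of a temporary copy per line.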

Oct 08 2022

rassoc <rassoc posteo.de> writes:

On 10/9/22 03:08, Siarhei Siamashka via Digitalmars-d-learn wrote:
 Does the difference really have to be two orders of magnitude for you to
acknowledge that there might be a performance problem in Phobos? [...] Except
that similar one-liners implemented using other programming languages are
faster and more versatile (can handle any input data without catastrophic
performance pitfalls).

Oh, I get all that. There's no reason to argue with me or win me over; I can
see that the implementation is subpar. Since I'm never hitting this performance
bottleneck in my code, and being a regular dev and not a core maintainer, I
simply haven't been motivated enough to contribute an improvement. Change like
that doesn't happen in the forums. Various optimizations have made it into
Phobos in the past, so I don't think you would get any pushback if you can show
that a new implementation improves the situation in almost all cases while
maintaining compatibility.

Oct 09 2022

bauss <jacobbauss gmail.com> writes:

On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:
 On 10/5/22 12:59 PM, torhu wrote:
 I need a case-insensitive check to see if a string contains 
 another string for a "quick filter" feature. It should 
 preferably be perceived as instant by the user, and needs to 
 check a few thousand strings in typical cases. Is a regex the 
 best option, or what would you suggest?

 https://dlang.org/phobos/std_uni.html#asLowerCase

 ```d
 bool isearch(S1, S2)(S1 haystack, S2 needle)
 {
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
 }
 ```

 untested.

 -Steve

This doesn't actually work properly in all languages. It will 
probably work in most, but it's not entirely correct.

Ex. Turkish will not work with it properly.

Very interesting article: 
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
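
A small sketch of the failure mode, using an eager variant of the helper for simplicity. The name `lowerContains` and the sample word are made up for illustration. `std.uni` maps `I` and `i` to each other regardless of language, while Turkish pairs `i` with dotted `İ` (U+0130) and `I` with dotless `ı` (U+0131).

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : toLower, toUpper;

/// Illustration only: lowercase both sides, then search.
bool lowerContains(string haystack, string needle)
{
    return haystack.toLower.canFind(needle.toLower);
}

void main()
{
    writeln("i".toUpper == "I"); // true: language-independent mapping
    writeln("I".toLower == "i"); // true
    // A Turkish user typing the dotless form does not find the uppercase text.
    writeln(lowerContains("ISPARTA", "\u0131sparta")); // false
    writeln(lowerContains("ISPARTA", "isparta"));      // true
}
```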

Oct 13 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 13/10/2022 9:27 PM, bauss wrote:
 This doesn't actually work properly in all languages. It will probably 
 work in most, but it's not entirely correct.
 
 Ex. Turkish will not work with it properly.
 
 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

Yes, Turkic languages; they require a state machine and quite a few 
LUTs to work correctly.

You also need to provide a language and it has to operate on the whole 
string, not individual characters.

I didn't think it was relevant since Ascii was in the original post ;)

Oct 13 2022

bauss <jacobbauss gmail.com> writes:

On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole 
wrote:
 On 13/10/2022 9:27 PM, bauss wrote:
 This doesn't actually work properly in all languages. It will 
 probably work in most, but it's not entirely correct.
 
 Ex. Turkish will not work with it properly.
 
 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

 Yes turkic languages, they require a state machine and quite a 
 bit of LUTs to work correctly.

 You also need to provide a language and it has to operate on 
 the whole string, not individual characters.

 I didn't think it was relevant since Ascii was in the original 
 post ;)

I think it's relevant when it comes to D since D is arguably a 
Unicode language, not ASCII.

D should strive to be correct, rather than fast.

Oct 13 2022

bauss <jacobbauss gmail.com> writes:

On Thursday, 13 October 2022 at 08:35:50 UTC, bauss wrote:
 On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole 
 wrote:
 On 13/10/2022 9:27 PM, bauss wrote:
 This doesn't actually work properly in all languages. It will 
 probably work in most, but it's not entirely correct.
 
 Ex. Turkish will not work with it properly.
 
 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

 Yes turkic languages, they require a state machine and quite a 
 bit of LUTs to work correctly.

 You also need to provide a language and it has to operate on 
 the whole string, not individual characters.

 I didn't think it was relevant since Ascii was in the original 
 post ;)

 I think it's relevant when it comes to D since D is arguably a 
 unicode language, not ascii.

 D should strive to be correct, rather than fast.

Oh, and to add to this: if and only if you have to do it the hacky 
way, converting to uppercase instead of lowercase should be 
preferred. Not all lowercase characters can round-trip. It's only 
a small group of characters, but using uppercase fixes it, so 
that's a relatively easy fix. A round trip is basically converting 
characters from one culture to another and then back. It's 
impossible with some characters when converting to lowercase, but 
should always be possible when converting to uppercase.

Oct 13 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 13/10/2022 9:42 PM, bauss wrote:
 Oh and to add onto this, IFF you have to do it the hacky way, then 
 converting to uppercase instead of lowercase should be preferred, 
 because not all lowercase characters can perform round trip, although a 
 small group of characters, then using uppercase fixes it, so that's a 
 relatively easy fix. A round trip is basically converting characters 
 from one culture to another and then back. It's impossible with some 
 characters when converting to lowercase, but should always be possible 
 when converting to uppercase.

You will want to repeat this process with normalize to NFKC and 
normalize to NFD before transforming. Otherwise there is a possibility 
that you will miss some transformations as the simplified mappings are 
1:1 for characters and not everything is representable as a single 
character.
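
As a rough sketch of the normalization step, `std.uni.normalize` can bring both strings to the same form before any comparison. The helper name `equalAfterNFC` is hypothetical, and this covers normalization only, not the language-dependent casing discussed above.

```d
import std.stdio : writeln;
import std.uni : NFC, NFD, NFKC, normalize;

/// Compares two strings after bringing both to NFC.
bool equalAfterNFC(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9"; // é as a single code point
    string decomposed = "e\u0301"; // e followed by a combining acute accent
    writeln(precomposed == decomposed);                // false
    writeln(equalAfterNFC(precomposed, decomposed));   // true
    writeln(normalize!NFD(precomposed) == decomposed); // true
    writeln(normalize!NFKC("\uFB01") == "fi");         // true: ligature expands
}
```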

Oct 13 2022

bauss <jacobbauss gmail.com> writes:

On Thursday, 13 October 2022 at 08:48:49 UTC, rikki cattermole 
wrote:
 On 13/10/2022 9:42 PM, bauss wrote:
 Oh and to add onto this, IFF you have to do it the hacky way, 
 then converting to uppercase instead of lowercase should be 
 preferred, because not all lowercase characters can perform 
 round trip, although a small group of characters, then using 
 uppercase fixes it, so that's a relatively easy fix. A round 
 trip is basically converting characters from one culture to 
 another and then back. It's impossible with some characters 
 when converting to lowercase, but should always be possible 
 when converting to uppercase.

 You will want to repeat this process with normalize to NFKC and 
 normalize to NFD before transforming. Otherwise there is a 
 possibility that you will miss some transformations as the 
 simplified mappings are 1:1 for characters and not everything 
 is representable as a single character.

Yeah, text isn't easy :D

Oct 13 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 13/10/2022 9:55 PM, bauss wrote:
 Yeah, text isn't easy :D

Indeed!

It has me a bit concerned actually, I'm wondering if my string stuff 
will even work correctly for UI's due to performance issues.

My string builder for instance allocates like crazy just to do slicing. 
But hey, at least I can feel confident that my general purpose allocator 
& infrastructure is working correctly!

Oct 13 2022

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
 On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
 Schveighoffer wrote:
 On 10/5/22 12:59 PM, torhu wrote:
 I need a case-insensitive check to see if a string contains 
 another string for a "quick filter" feature. It should 
 preferably be perceived as instant by the user, and needs to 
 check a few thousand strings in typical cases. Is a regex the 
 best option, or what would you suggest?

 https://dlang.org/phobos/std_uni.html#asLowerCase

 ```d
 bool isearch(S1, S2)(S1 haystack, S2 needle)
 {
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
 }
 ```

 untested.

 -Steve

 This doesn't actually work properly in all languages. It will 
 probably work in most, but it's not entirely correct.

 Ex. Turkish will not work with it properly.

Greek will also be problematic: two different lowercase sigmas but 
only one uppercase. Other languages may cause issues too, such as 
German, where ß normally uppercases to SS (or not) but not the 
other way round. But here we have already arrived in Unicode land 
and the normalization conundrum.
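
The sigma case can be checked directly with `std.uni`. This is only a sketch; the helper names `equalUpper` and `equalLower` are made up for illustration.

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

bool equalUpper(string a, string b) { return a.toUpper == b.toUpper; }
bool equalLower(string a, string b) { return a.toLower == b.toLower; }

void main()
{
    string medialSigma = "\u03C3"; // σ
    string finalSigma = "\u03C2";  // ς
    // Both lowercase forms map to the single uppercase Σ (U+03A3).
    writeln(equalUpper(medialSigma, finalSigma)); // true
    // Both are already lowercase, so lowercasing keeps them distinct.
    writeln(equalLower(medialSigma, finalSigma)); // false
}
```

So for this pair, comparing uppercase forms treats the two sigmas as equal, while comparing lowercase forms does not.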

Oct 13 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
 ```d
 bool isearch(S1, S2)(S1 haystack, S2 needle)
 {
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
 }
 ```

 untested.

 -Steve

 This doesn't actually work properly in all languages. It will 
 probably work in most, but it's not entirely correct.

 Ex. Turkish will not work with it properly.

 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

Wow, I didn't expect anything like this and just thought that the 
nightmares of handling 8-bit codepages for non-English languages 
ceased to exist nowadays. Too bad. What are the best practices to 
deal with Turkish text in D language?

For example, [Ukrainian letters 'і' and 
'І'](https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)) don't 
share the same codes with Latin 'i' and 'I' and this is working 
fine, except for a possible [phishing 
opportunity](https://www.theguardian.com/technology/2017/apr/19/phishing-url-trick-hackers). 
Why haven't the standards committees done the same for Turkish 'I' yet?

As for the [German letter 
'ß'](https://en.wikipedia.org/wiki/%C3%9F), Wikipedia says that 
the uppercase variant 'ẞ' has existed since 2008 (ISO 10646). Do 
German people use it now?
```D
import std;
void main() {
   "ß".asUpperCase.writeln;             // prints "SS"
   "ẞ".asLowerCase.writeln;             // prints "ß"
   "ẞ".asLowerCase.asUpperCase.writeln; // prints "SS"
}
```

Oct 24 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
 Wow, I didn't expect anything like this and just thought that the 
 nightmares of handling 8-bit codepages for non-English languages ceased 
 to exist nowadays. Too bad. What are the best practices to deal with 
 Turkish text in D language?

std.uni doesn't support it.

For casing it only supports the simple mappings which are 1:1 and not 
language dependent.

I haven't got to it yet for my own string handling library, so I can't 
point you to that (even if it was not ready).

I'm sure somebody has got it but you may end up wanting to use ICU 
unfortunately.

Oct 24 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Tuesday, 25 October 2022 at 06:32:00 UTC, rikki cattermole 
wrote:
 On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
 What are the best practices to deal with Turkish text in D 
 language?

 std.uni doesn't support it.

OK, I'm not specifically interested in this personally and I even 
would be happy to remain blissfully ignorant. Just wondered 
whether a preferred solution already exists, considering that 
this forum has a Turkish section.

Should we ignore the `"D should strive to be correct, rather than 
fast"` comment from bauss for now? Or can some action be taken 
to improve the current situation?

Oct 25 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 26/10/2022 6:06 PM, Siarhei Siamashka wrote:
 Should we ignore the `"D should strive to be correct, rather than fast"` 
 comment from bauss for now? Or some actions can be taken to improve the 
 current situation?

Bauss is correct.

It should be implemented but it does not need to be fast.

But yeah, if you are able to ignore that Unicode is a thing, I'd 
recommend it. It is complicated, as we humans are very complicated ;)

Oct 25 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole 
wrote:
 if you are able to ignore that Unicode is a thing, I'd 
 recommend it. It is complicated, as we humans are very 
 complicated ;)

I can't ignore Unicode, because I frequently have to deal with 
Cyrillic alphabet ;) Also Unicode is significantly simpler than a 
set of various incompatible 8-bit encodings (such as 
[CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants 
of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 
8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were 
simultaneously in use earlier and caused a lot of pain. But I'm 
surely able to ignore the peculiarities of modern Turkish Unicode 
and wait for the other people to come up with a solution for D 
language if they really care.

Oct 25 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 26/10/2022 6:49 PM, Siarhei Siamashka wrote:
 On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole wrote:
 if you are able to ignore that Unicode is a thing, I'd recommend it. 
 It is complicated, as we humans are very complicated ;)

 
 I can't ignore Unicode, because I frequently have to deal with Cyrillic 
 alphabet ;) Also Unicode is significantly simpler than a set of various 
 incompatible 8-bit encodings (such as 
 [CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of 
 [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 
 8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were 
 simultaneously in use earlier and caused a lot of pain. But I'm surely 
 able to ignore the peculiarities of modern Turkish Unicode and wait for 
 the other people to come up with a solution for D language if they 
 really care.

Cyrillic isn't an issue.

Lithuanian, Turkish and Azeri are the ones with the biggest issues.

There are a bunch of non-simple mappings for Latin, Armenian and Greek, 
but they are not language dependent. There are six conditional ones, 
all of them Greek.

So if you are not dealing with these languages (even if you are, a 
simple replace should be easy to do for most), you should be fine with 
the simple mappings supported by std.uni.
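
As a rough sketch of the "simple replace" idea for Turkish: handle the two dotted/dotless letter pairs by hand, then let std.uni do the rest. The helper name `toLowerTurkish` is made up for illustration and is not part of Phobos, and this assumes precomposed input (it does not handle "I" followed by a combining dot above).

```d
import std.array : replace;
import std.stdio : writeln;
import std.uni : toLower;

/// Hypothetical helper, not a Phobos function: lowercases text using the
/// Turkish/Azeri rule that "I" maps to dotless "ı" and "İ" maps to "i".
string toLowerTurkish(string s)
{
    // Apply the language-specific pairs first, then fall back to the
    // language-independent mappings from std.uni for everything else.
    return s.replace("I", "ı").replace("İ", "i").toLower;
}

void main()
{
    writeln(toLowerTurkish("IŞIK"));     // prints "ışık"
    writeln(toLowerTurkish("İSTANBUL")); // prints "istanbul"
}
```

The caller still has to know that the text is Turkish; nothing here detects the language.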

Oct 25 2022

Ali Çehreli <acehreli yahoo.com> writes:

On 10/25/22 22:49, Siarhei Siamashka wrote:

 Unicode is significantly simpler than a set of various
 incompatible 8-bit encodings

Strongly agreed.

 I'm surely
 able to ignore the peculiarities of modern Turkish Unicode

The problem with Unicode is its main aim of allowing characters of 
multiple writing systems in the same text. When multiple writing systems 
are in play, conflicts and ambiguities will appear.

 and wait for
 the other people to come up with a solution for D language if they
 really care.

I solved my problem by writing an Alphabet hierarchy in the past. I 
don't like that code but it still works:

 
https://bitbucket.org/acehreli/ddili/src/4c0552fe8352dfe905c9734a57d84d36ce4ed476/src/alphabet.d#lines-50

It handles capitalization, ordering, etc. I use it when preparing the 
Index section of the Turkish edition of "Programming in D":

   http://ddili.org/ders/d/ix.html

One of the ambiguities is what came up on this thread: Should a word 
that starts with I (capital i) be listed under I (because it's Turkish) 
or under İ (because it's English)? So far, I am lucky because the only 
word that starts with I happens to be the English "IDE", so it goes 
under i (which appears as İ in the Turkish edition) and would make sense 
to a Turkish reader because a Turkish reader might (really?) accept it 
as the capital of ide.

It's confusing but it seems to work. :) It doesn't matter. Life is 
imperfect and things will somehow work in the end.

Ali

Oct 25 2022

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:
 The problem with Unicode is its main aim of allowing characters 
 of multiple writing systems in the same text. When multiple 
 writing systems are in play, conflicts and ambiguities will 
 appear.

I personally don't think that it's a problem with Unicode 
itself. Based on what I can see, it looks like the individuals or 
the committees responsible for mapping the Turkish alphabet to 
Unicode just made a blunder.

For example, let's compare the Latin uppercase "B" and the 
Cyrillic uppercase "В". Looks exactly the same, right? Would it 
be a smart idea for them to share the same index in the Unicode 
table? But wait. What happens if we convert these letters to 
lowercase? The Latin "B" becomes "b" and the Cyrillic "В" becomes 
"в". Oops! So by having different indexes for the Latin uppercase 
"B" and the Cyrillic uppercase "В", we dodged a whole bunch of 
nasty problems.
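
This is easy to check in D. A minimal sketch, assuming nothing beyond `std.uni.toLower` applied to single code points:

```d
import std.stdio : writefln;
import std.uni : toLower;

void main()
{
    dchar latinB = 'B';          // U+0042, Latin capital letter B
    dchar cyrillicVe = '\u0412'; // Cyrillic capital letter Ve, looks like "B"

    // Same glyph shape, but separate code points with separate lowercase forms.
    writefln("U+%04X -> U+%04X", cast(uint) latinB, cast(uint) toLower(latinB));
    writefln("U+%04X -> U+%04X", cast(uint) cyrillicVe, cast(uint) toLower(cyrillicVe));
}
```

The first line shows U+0042 mapping to U+0062 ("b") and the second shows U+0412 mapping to U+0432 ("в"), with no language information needed.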

Another example. Patrick Schluter mentioned the Greek sigma 
letter and the [wikipedia 
article](https://en.wikipedia.org/wiki/Sigma) says: "uppercase Σ, 
lowercase σ, lowercase in word-final position ς", which makes 
everything rather problematic. Now let's compare this to the 
Belarusian language and its letter "у". The Belarusian "у" 
transforms into "ў" depending on context, however this 
transformation doesn't happen for the first letter of proper 
nouns or in acronyms (and this theoretically makes the uppercase 
"ў" redundant). Just imagine an alternative Greek-inspired 
reality, where both "у" and "ў" uppercase to "У". And yet the 
uppercase "Ў" exists in Unicode, so luckily in our reality we 
don't have to deal with uppercase/lowercase round trip failures. 
This is computer-friendly. And as I already mentioned in an 
earlier comment, the Germans also got the uppercase "ẞ" in 
Unicode since 2008 (better late than never).

 I solved my problem by writing an Alphabet hierarchy in the 
 past. I don't like that code but it still works:

 [...]

 It's confusing but it seems to work. :) It doesn't matter. Life 
 is imperfect and things will somehow work in the end.

What's your opinion/conclusion? Is it fine the way it is? Do you 
think that some unique property of the Turkish language/alphabet 
made these difficulties unavoidable? Or do you think that it was 
a mistake, but now it has to live with us forever for 
compatibility reasons? Anything else?

And as for the D language and Phobos, should "ß" still uppercase 
to "SS"? Or can we change it to uppercase "ẞ" and remove German 
from the list of tricky languages at 
https://dlang.org/library/std/uni/to_upper.html ? Should Turkish 
be listed there?

Oct 28 2022

rikki cattermole <rikki cattermole.co.nz> writes:

On 29/10/2022 11:05 AM, Siarhei Siamashka wrote:
 And as for the D language and Phobos, should "ß" still uppercase to 
 "SS"? Or can we change it to uppercase "ẞ" and remove German from the 
 list of tricky languages at 
 https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be 
 listed there?

That particular function is based upon the simple mappings provided by 
UnicodeData.txt and (should be) in compliance with the Unicode standard.

The only thing we need to do is regenerate the tables backing it 
whenever Unicode updates.

Note the behavior you are asking for is defined in the Unicode database 
file SpecialCasing.txt, which has not been implemented.

```
titlecase(uppercase(<es-zed>))
```

That file is how you support languages like Turkish. We currently don't 
have it implemented. It requires operating on a whole string and passing 
in which language rules to apply (i.e. Turkish, Azeri).

Oct 28 2022
