digitalmars.D.learn - Replacing tango.text.Ascii.isearch
- torhu (5/5) Oct 05 2022 I need a case-insensitive check to see if a string contains
- Steven Schveighoffer (12/16) Oct 05 2022 https://dlang.org/phobos/std_uni.html#asLowerCase
- torhu (17/19) Oct 05 2022 On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven
- torhu (4/5) Oct 05 2022 Right, you can instantiate structs without arguments. It's been
- torhu (3/9) Oct 05 2022 I think there should be sensible default here, seems like an easy
- =?UTF-8?Q?Ali_=c3=87ehreli?= (7/8) Oct 05 2022 Either this:
- torhu (4/14) Oct 05 2022 I did some basic testing, and regex was two orders of magnitude
- Siarhei Siamashka (30/32) Oct 06 2022 Substring search functionality is currently in a very bad shape
- Sergey (3/4) Oct 06 2022 Please don’t tell us that D will be slower than Python again?)
- rassoc (48/49) Oct 06 2022 And what kind of testing was that? Mind to share? Because I did the foll...
- torhu (34/39) Oct 06 2022 Originally I just loaded a one megabyte file and searched the
- rassoc (2/3) Oct 06 2022 Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, ...
- Siarhei Siamashka (15/20) Oct 06 2022 Benchmark results depend on many things, such as the actual text
- Siarhei Siamashka (9/12) Oct 07 2022 Such as generating the input data via running:
- bachmeier (4/16) Oct 07 2022 https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/
- Siarhei Siamashka (31/33) Oct 07 2022 Yes, that's how substring search works in the standard libraries
- rassoc (16/18) Oct 07 2022 That's generating a file with a single line:
- Siarhei Siamashka (69/89) Oct 08 2022 If you insist on having 100K lines in the input data, then you
- rassoc (2/3) Oct 09 2022 Oh, I get all that, there's no reason to argue with me or win me over, I...
- bauss (7/24) Oct 13 2022 This doesn't actually work properly in all languages. It will
- rikki cattermole (6/13) Oct 13 2022 Yes turkic languages, they require a state machine and quite a bit of
- bauss (5/19) Oct 13 2022 I think it's relevant when it comes to D since D is arguably a
- bauss (10/32) Oct 13 2022 Oh and to add onto this, IFF you have to do it the hacky way,
- rikki cattermole (6/14) Oct 13 2022 You will want to repeat this process with normalize to NFKC and
- bauss (3/18) Oct 13 2022 Yeah, text isn't easy :D
- rikki cattermole (7/8) Oct 13 2022 Indeed!
- Patrick Schluter (6/32) Oct 13 2022 Greek will also be problematic. 2 different lowercase sigmas but
- Siarhei Siamashka (22/39) Oct 24 2022 Wow, I didn't expect anything like this and just thought that the
- rikki cattermole (8/12) Oct 24 2022 std.uni doesn't support it.
- Siarhei Siamashka (9/13) Oct 25 2022 OK, I'm not specifically interested in this personally and I even
- rikki cattermole (5/8) Oct 25 2022 Bauss is correct.
- Siarhei Siamashka (12/15) Oct 25 2022 I can't ignore Unicode, because I frequently have to deal with
- rikki cattermole (9/23) Oct 25 2022 Cyrillic isn't an issue.
- =?UTF-8?Q?Ali_=c3=87ehreli?= (22/29) Oct 25 2022 The problem with Unicode is its main aim of allowing characters of
- Siarhei Siamashka (39/48) Oct 28 2022 I personally don't think that it's the problem of the Unicode
- rikki cattermole (16/21) Oct 28 2022 That particular function, is based upon the simple mappings provided by
I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferrably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?
Oct 05 2022
On 10/5/22 12:59 PM, torhu wrote:I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferrably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?https://dlang.org/phobos/std_uni.html#asLowerCase ```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -Steve
Oct 05 2022
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote: [...] I wanted to do some quick benchmarking to figure out what works. When I run this: ```d import std.stdio; import std.datetime.stopwatch; void main() { auto sw = StopWatch(); sw.stop(); writeln(sw.peek().toString()); } ``` It prints this:2 weeks, 6 days, 9 hours, 34 minutes, 43 secs, 214 ms, 946 ╬╝s, and 7 hnsecsAm I doing something wrong here?
Oct 05 2022
On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:Am I doing something wrong here?Right, you can instantiate structs without arguments. It's been ten years since I last used D, I was thinking of structs like if they were classes.
Oct 05 2022
On Wednesday, 5 October 2022 at 20:45:55 UTC, torhu wrote:On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:I think there should be sensible default here, seems like an easy trap to remove.Am I doing something wrong here?Right, you can instantiate structs without arguments. It's been ten years since I last used D, I was thinking of structs like if they were classes.
Oct 05 2022
On 10/5/22 13:40, torhu wrote:auto sw = StopWatch();Either this: auto sw = StopWatch(AutoStart.yes); or this: auto sw = StopWatch(); sw.start(); Ali
Oct 05 2022
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote:```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -SteveI did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess.
Oct 05 2022
On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess.Substring search functionality is currently in a very bad shape in Phobos. I discovered this myself a few weeks ago when I was trying to solve the https://www.facebook.com/codingcompetitions/hacker-cup/2022/round-1/problems/A2 puzzle using D language. A part of the solution requires a fast substring search (actually a subarray search) and Phobos doesn't offer anything with even remotely acceptable performance. Phobos does have a Boyer-Moore implementation. This algorithm is historically famous in computer science as one of the first attempts to optimize substring search, but it also has pathologically bad performance on certain input data and probably shouldn't be recommended for any practical use nowadays. The users of old versions of Python discovered this too: https://codeforces.com/blog/entry/106849?#comment-952483 The standard 'find' function from the following simple example also becomes awfully slow when arrays 'a' and 'b' are large and/or are adversarially constructed: ```D import std; void main() { auto a = [1, 2, 3, 4]; auto b = [2, 3]; writefln("Is %s a subarray of %s? The answer is %s.", b, a, a.find(b).empty ? "no" : "yes"); } ``` I think that the best fit for your use case is the https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and there's an old issue in bugzilla about this: https://issues.dlang.org/show_bug.cgi?id=16066 BTW, if anyone is curious, one of the possible solutions for the hacker-cup/2022/round-1/problems/A2 puzzle in D language can be found here: https://codeforces.com/blog/entry/106768?#comment-952808
Oct 06 2022
On Thursday, 6 October 2022 at 08:15:10 UTC, Siarhei Siamashka wrote:On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:Please don’t tell us that D will be slower than Python again?)
Oct 06 2022
On 10/5/22 23:50, torhu via Digitalmars-d-learn wrote:I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess.And what kind of testing was that? Mind to share? Because I did the following real quick and wasn't able to measure a "two orders of magnitude" difference. Sure, the regex version came on top, but they were both faster than the ruby baseline I cooked up. First, generate a word file with 100k entries of various lengths: $> dmd -run words.d foobaz 100000 --- import std; string randomWord(ulong n) { static chars = letters.array; return generate!(() => chars.choice).take(n).text; } void main(string[] args) { enforce(args.length == 3, "Usage: dmd -run words.d needle num"); auto f = File("words.txt", "w"); foreach (i; 0..args[2].to!ulong) { ulong n = uniform(0, 50), m = uniform(0, 50); if (i % 2 == 0) f.writeln(randomWord(n), args[1], randomWord(m)); else f.writeln(randomWord(n + m)); } } --- And then for the actual measuring: $> dmd -O -version={range,regex} -of=search-{range,regex} search.d $> ldc -O -d-version={range,regex} -of=search-{range,regex}-ldc search.d $> time ./search-{range,regex}{,-ldc} foobaz --- import std; void main(string[] args) { enforce(args.length == 2, "Usage: search 'needle'"); version (regex) auto rx = regex(args[1], "i"); else version (range) auto needle = args[1].asLowerCase.text; else static assert(0, "use -version={regex,range}"); ulong matches; foreach (line; File("words.txt").byLine) { version (regex) if (line.matchFirst(rx)) matches++; version (range) if (line.asLowerCase.canFind(needle)) matches++; } writeln(matches); } ---
Oct 06 2022
On Thursday, 6 October 2022 at 21:36:48 UTC, rassoc wrote:And what kind of testing was that? Mind to share? Because I did the following real quick and wasn't able to measure a "two orders of magnitude" difference. Sure, the regex version came on top, but they were both faster than the ruby baseline I cooked up.Originally I just loaded a one megabyte file and searched the whole thing. I changed it to split it into (40 000) lines instead, regex is about ten times faster then. I compile with -release -O -inline. Here is the second version: ```d import std; import std.datetime.stopwatch; enum FILE = "test.lst"; string text; string needle; void test(bool delegate(string haystack) dg) { auto sw = StopWatch(AutoStart.yes); int counter = 0; foreach (line; lineSplitter(text)) { if (dg(line)) counter++; } sw.stop(); writefln("%s", sw.peek()); writefln("counter: %s", counter); } void main(char[][] args) { enforce(args.length > 1, "Need a needle argument."); text = cast(string)read(FILE); needle = args[1].idup; auto re = regex(to!string(escaper(needle)), "i"); string needleLower = needle.toLower(); test((h) => !!h.matchFirst(re)); test((h) => h.asLowerCase().canFind(needleLower)); } ```
Oct 06 2022
On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:regex is about ten times faster then.Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing close to 10x. Welp, the woes of superficial benchmarking. :)
Oct 06 2022
On Friday, 7 October 2022 at 00:57:38 UTC, rassoc wrote:On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:Benchmark results depend on many things, such as the actual text in both needle and haystack and the needle length. Are we dealing with unicode text by the way? One example is searching for something like "äußere" in https://www.gutenberg.org/ebooks/6343.txt.utf-8 If it's the source code, then searching for "sqlite3_value_bytes16" in the sqlite3.c file from https://www.sqlite.org/2022/sqlite-amalgamation-3390400.zip may be a good test too. I'm getting at least 5x difference in favor of regex with LDC on these two examples. Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?regex is about ten times faster then.Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing close to 10x. Welp, the woes of superficial benchmarking. :)
Oct 06 2022
On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka wrote:Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?Such as generating the input data via running: python -c "print(('a' * 49 + 'b') * 20000)" > test.lst And then using "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) as the needle to search for. Much longer needles work even better. In Linux the command line size is limited by 128K, so there's a huge room for improvement.
Oct 07 2022
On Friday, 7 October 2022 at 07:16:19 UTC, Siarhei Siamashka wrote:On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka wrote:https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/ "the longer the pattern is, the faster the algorithm goes"Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?Such as generating the input data via running: python -c "print(('a' * 49 + 'b') * 20000)" > test.lst And then using "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) as the needle to search for. Much longer needles work even better. In Linux the command line size is limited by 128K, so there's a huge room for improvement.
Oct 07 2022
On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/ "the longer the pattern is, the faster the algorithm goes"Yes, that's how substring search works in the standard libraries of the other programming languages. Now please take the torhu's test program (posted a few comments above) and generate input for it using python -c "print(('a' * 49 + 'b') * 20000)" > test.lst Run it to search for "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) and then compare its performance against the other similar programs doing the same thing: Python: ```Python import sys assert len(sys.argv) >= 2, "Need a needle argument." needle = sys.argv[1].lower() print(sum([1 if line.lower().find(needle) != -1 else 0 for line in open("test.lst")])) ``` Ruby/Crystal: ```Ruby abort "Need a needle argument." unless ARGV.size >= 1 needle = ARGV[0].downcase puts File.open("test.lst").each_line.sum {|line| line.downcase.index(needle) ? 1 : 0 } ``` A generic substring search is tuned to be fast on any input in the other programming languages. But in Phobos a simpleminded slow search algorithm is used by default. The Boyer-Moore algorithm can be used too if it's explicitly requested, but it has some gotchas of its own.
Oct 07 2022
On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote: python -c "print(('a' * 49 + 'b') * 20000)" > test.lstThat's generating a file with a single line: $> wc -l test.lst 1 test.lst Going with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drastic and nowhere near "two orders of magnitude". $> crystal build --release -o search-cr search.cr abort "Need a needle argument." unless ARGV.size >= 1 needle = ARGV[0].downcase puts File.open("words.txt").each_line.count(&.downcase.includes? needle) $> ldc2 -O2 --release -of search-ldc search.d import std; void main(string[] args) { enforce(args.length > 1, "Need a needle argument."); auto needle = args[1].toLower; File("words.txt").byLine.count!(ln => ln.asLowerCase.canFind(needle)).writeln; }
Oct 07 2022
On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:If you insist on having 100K lines in the input data, then you can try a different test (run it as a shell script): ```bash python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 10000)" > words.txt cp words.txt test.lst wc -l words.txt NEEDLE=`python -c "print('a' * 10000, end='')"` echo "=== Python ===" time python search.py "$NEEDLE" echo "=== Ruby ===" time ruby search.rb "$NEEDLE" echo "=== Crystal ===" time ./search-cr "$NEEDLE" echo "=== D (LDC) ===" time ./search-ldc "$NEEDLE" ``` It is testing a 1MB file with 100K lines and a 10K characters long needle. Results from my computer (with Python 3.10, because earlier versions of Python are slower): ``` 100000 words.txt === Python === 1 real 0m0.083s user 0m0.072s sys 0m0.010s === Ruby === 1 real 0m0.313s user 0m0.313s sys 0m0.000s === Crystal === 1 real 0m1.492s user 0m1.482s sys 0m0.010s === D (LDC) === 1 real 1m10.314s user 1m10.234s sys 0m0.050s ```On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote: python -c "print(('a' * 49 + 'b') * 20000)" > test.lstThat's generating a file with a single line: $> wc -l test.lst 1 test.lstGoing with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drasticYour "appropriate" file is also entirely artificial and happens to be specifically crafted to work best with the current ".canFind" implementation from Phobos. The primary source of slowness are partial matches. Such as the `"int i = "` prefix when searching for `"int i = 0;"` substring in a string, which contains `"int i = 1;"`. The time is wasted on comparing the initial 8 characters before encountering a mismatch and bailing out. But when searching in a randomly generated gibberish, the chances of encountering long partial matches are very low. At least lower than in the real text files.and nowhere near "two orders of magnitude".Does the difference really have to be two orders of magnitude for you to acknowledge that there might be a performance problem in Phobos? Moreover, you are quoting torhu, who compared two D implementations compiled by DMD (regex vs. canFind) rather than standard libraries of different programming languages.```D import std; void main(string[] args) { enforce(args.length > 1, "Need a needle argument."); auto needle = args[1].toLower; File("words.txt").byLine.count!(ln => ln.asLowerCase.canFind(needle)).writeln; } ```It's a nicely looking one-liner implementation in D language and everything is fine. Except that similar one-liners implemented using other programming languages are faster and more versatile (can handle any input data without catastrophic performance pitfalls). BTW, changing ".asLowerCase" to ".toLower" introduces an extra memory allocation, but still significantly improves handling of the worst case. Conversion to lowercase is expensive for unicode characters and revisiting the same character to repeat such conversion again and again is bad for performance.
Oct 08 2022
On 10/9/22 03:08, Siarhei Siamashka via Digitalmars-d-learn wrote:Does the difference really have to be two orders of magnitude for you to acknowledge that there might be a performance problem in Phobos? [...] Except that similar one-liners implemented using other programming languages are faster and more versatile (can handle any input data without catastrophic performance pitfalls).Oh, I get all that, there's no reason to argue with me or win me over, I can see that the implementation is subpar. Since I'm never hitting this performance bottleneck in my code, and being a regular dev and not a core maintainer, I simply haven't been motivated enough to contribute an improvement. Change like that isn't happening in the forums. Various optimizations have made it into Phobos in the past, don't think you would get any pushback if you can show that a new implementation improves the situation in almost all cases while maintaining compatibility.
Oct 09 2022
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote:On 10/5/22 12:59 PM, torhu wrote:This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.htmlI need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferrably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?https://dlang.org/phobos/std_uni.html#asLowerCase ```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -Steve
Oct 13 2022
On 13/10/2022 9:27 PM, bauss wrote:This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.htmlYes turkic languages, they require a state machine and quite a bit of LUTs to work correctly. You also need to provide a language and it has to operate on the whole string, not individual characters. I didn't think it was relevant since Ascii was in the original post ;)
Oct 13 2022
On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole wrote:On 13/10/2022 9:27 PM, bauss wrote:I think it's relevant when it comes to D since D is arguably a unicode language, not ascii. D should strive to be correct, rather than fast.This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.htmlYes turkic languages, they require a state machine and quite a bit of LUTs to work correctly. You also need to provide a language and it has to operate on the whole string, not individual characters. I didn't think it was relevant since Ascii was in the original post ;)
Oct 13 2022
On Thursday, 13 October 2022 at 08:35:50 UTC, bauss wrote:On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole wrote:Oh and to add onto this, IFF you have to do it the hacky way, then converting to uppercase instead of lowercase should be preferred, because not all lowercase characters can perform round trip, although a small group of characters, then using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.On 13/10/2022 9:27 PM, bauss wrote:I think it's relevant when it comes to D since D is arguably a unicode language, not ascii. D should strive to be correct, rather than fast.This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.htmlYes turkic languages, they require a state machine and quite a bit of LUTs to work correctly. You also need to provide a language and it has to operate on the whole string, not individual characters. I didn't think it was relevant since Ascii was in the original post ;)
Oct 13 2022
On 13/10/2022 9:42 PM, bauss wrote:Oh and to add onto this, IFF you have to do it the hacky way, then converting to uppercase instead of lowercase should be preferred, because not all lowercase characters can perform round trip, although a small group of characters, then using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.You will want to repeat this process with normalize to NFKC and normalize to NFD before transforming. Otherwise there is a possibility that you will miss some transformations as the simplified mappings are 1:1 for characters and not everything is representable as a single character.
Oct 13 2022
On Thursday, 13 October 2022 at 08:48:49 UTC, rikki cattermole wrote:On 13/10/2022 9:42 PM, bauss wrote:Yeah, text isn't easy :DOh and to add onto this, IFF you have to do it the hacky way, then converting to uppercase instead of lowercase should be preferred, because not all lowercase characters can perform round trip, although a small group of characters, then using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.You will want to repeat this process with normalize to NFKC and normalize to NFD before transforming. Otherwise there is a possibility that you will miss some transformations as the simplified mappings are 1:1 for characters and not everything is representable as a single character.
Oct 13 2022
On 13/10/2022 9:55 PM, bauss wrote:Yeah, text isn't easy :DIndeed! It has me a bit concerned actually, I'm wondering if my string stuff will even work correctly for UI's due to performance issues. My string builder for instance allocates like crazy just to do slicing. But hey, at least I can feel confident that my general purpose allocator & infrastructure is working correctly!
Oct 13 2022
On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote:Greek will also be problematic. 2 different lowercase sigmas but only 1 uppercase. Other languages that may make issues, German where normally ß uppercases as SS (or not) but not the other way round, but here we already arrived to Unicode land and the normalization conundrum.On 10/5/22 12:59 PM, torhu wrote:This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly.I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferrably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?https://dlang.org/phobos/std_uni.html#asLowerCase ```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -Steve
Oct 13 2022
On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:Wow, I didn't expect anything like this and just thought that the nightmares of handling 8-bit codepages for non-English languages ceased to exist nowadays. Too bad. What are the best practices to deal with Turkish text in D language? For example, [Ukrainian letters 'і' and 'І'](https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)) don't share the same codes with Latin 'i' and 'I' and this is working fine. Except for a possible [phishing opportunity](https://www.theguardian.com/technology/2017/apr/19/phishing url-trick-hackers). Why haven't the standard committees done the same for Turkish 'I' yet? As for the [German letter 'ß'](https://en.wikipedia.org/wiki/%C3%9F), wikipedia says that the uppercase variant 'ẞ' exists since 2008 (ISO 10646). Do German people use it now? ```D import std; void main() { "ß".asUpperCase.writeln; // prints "SS" "ẞ".asLowerCase.writeln; // prints "ß" "ẞ".asLowerCase.asUpperCase.writeln; // prints "SS" } ``````d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -SteveThis doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Oct 24 2022
On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:Wow, I didn't expect anything like this and just thought that the nightmares of handling 8-bit codepages for non-English languages ceased to exist nowadays. Too bad. What are the best practices to deal with Turkish text in D language?std.uni doesn't support it. For casing it only supports the simple mappings which are 1:1 and not language dependent. I haven't got to it yet for my own string handling library, so I can't point you to that (even if it was not ready). I'm sure somebody has got it but you may end up wanting to use ICU unfortunately.
Oct 24 2022
On Tuesday, 25 October 2022 at 06:32:00 UTC, rikki cattermole wrote:On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:OK, I'm not specifically interested in this personally and I even would be happy to remain blissfully ignorant. Just wondered whether a preferred solution already exists, considering that this forum has a Turkish section. Should we ignore the `"D should strive to be correct, rather than fast"` comment from bauss for now? Or some actions can be taken to improve the current situation?What are the best practices to deal with Turkish text in D language?std.uni doesn't support it.
Oct 25 2022
On 26/10/2022 6:06 PM, Siarhei Siamashka wrote:Should we ignore the `"D should strive to be correct, rather than fast"` comment from bauss for now? Or some actions can be taken to improve the current situation?Bauss is correct. It should be implemented but it does not need to be fast. But yeah, if you are able to ignore that Unicode is a thing, I'd recommend it. It is complicated, as we humans are very complicated ;)
Oct 25 2022
On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole wrote:if you are able to ignore that Unicode is a thing, I'd recommend it. It is complicated, as we humans are very complicated ;)I can't ignore Unicode, because I frequently have to deal with Cyrillic alphabet ;) Also Unicode is significantly simpler than a set of various incompatible 8-bit encodings (such as [CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were simultaneously in use earlier and caused a lot of pain. But I'm surely able to ignore the peculiarities of modern Turkish Unicode and wait for the other people to come up with a solution for D language if they really care.
Oct 25 2022
On 26/10/2022 6:49 PM, Siarhei Siamashka wrote:On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole wrote:Cyrillic isn't an issue. Lithuanian, Turkish and Azeri are the ones with the biggest issues. There is a bunch of non-simple mappings for Latin, Armenian and Greek, but they are not language dependent. There is six conditional ones which are all Greek. So if you are not dealing with these languages (even if you are, a simple replace should be easy to do for most), you should be fine with the simple mappings supported by std.uni.if you are able to ignore that Unicode is a thing, I'd recommend it. It is complicated, as we humans are very complicated ;)I can't ignore Unicode, because I frequently have to deal with Cyrillic alphabet ;) Also Unicode is significantly simpler than a set of various incompatible 8-bit encodings (such as [CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were simultaneously in use earlier and caused a lot of pain. But I'm surely able to ignore the peculiarities of modern Turkish Unicode and wait for the other people to come up with a solution for D language if they really care.
Oct 25 2022
On 10/25/22 22:49, Siarhei Siamashka wrote:Unicode is significantly simpler than a set of various incompatible 8-bit encodingsStrongly agreed.I'm surely able to ignore the peculiarities of modern Turkish UnicodeThe problem with Unicode is its main aim of allowing characters of multiple writing systems in the same text. When multiple writing systems are in play, conflicts and ambiguities will appear.and wait for the other people to come up with a solution for D language if they really care.I solved my problem by writing an Alphabet hierarchy in the past. I don't like that code but it still works: https://bitbucket.org/acehreli/ddili/src/4c0552fe8352dfe905c9734a57d84d36ce4ed476/src/alphabet.d#lines-50 It handles capitalization, ordering, etc. I use it when preparing the Index section of the Turkish edition of "Programming in D": http://ddili.org/ders/d/ix.html One of the ambiguities is what came up on this thread: Should a word that starts with I (capital i) be listed under I (because it's Turkish) or under İ (because it's English)? So far, I am lucky because the only word that starts with I happens to be the English "IDE", so it goes under i (which appears as İ in the Turkish edition) and would make sense to a Turkish reader because a Turkish reader might (really?) accept it as the capital of ide. It's confusing but it seems to work. :) It doesn't matter. Life is imperfect and things will somehow work in the end. Ali
Oct 25 2022
On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:The problem with Unicode is its main aim of allowing characters of multiple writing systems in the same text. When multiple writing systems are in play, conflicts and ambiguities will appear.I personally don't think that it's the problem of the Unicode itself. Based on what I can see, it looks like the individuals or the committees responsible for mapping the Turkish alphabet to Unicode just made a blunder. For example, let's compare the Latin uppercase "B" and the Cyrillic uppercase "В". Looks exactly the same, right? Would it be a smart idea for them to share the same index in the Unicode table? But wait. What happens if we convert these letters to lowercase? The Latin "B" becomes "b" and the Cyrillic "В" becomes "в". Oops! So by having different indexes for the Latin uppercase "B" and the Cyrillic uppercase "В", we dodged a whole bunch of nasty problems. Another example. Patrick Schluter mentioned the Greek sigma letter and the [wikipedia article](https://en.wikipedia.org/wiki/Sigma) says: "uppercase Σ, lowercase σ, lowercase in word-final position ς", which makes everything rather problematic. Now let's compare this to the Belarusian language and its letter "у". The Belarusian "у" transforms into "ў" depending on context, however this transformation doesn't happen for the first letter of proper nouns or in acronyms (and this theoretically makes the uppercase "ў" redundant). Just imagine an alternative Greek-inspired reality, where both "у" and "ў" uppercase to "У". And yet the uppercase "Ў" exists in Unicode, so luckily in our reality we don't have to deal with uppercase/lowercase round trip failures. This is computers friendly. And as I already mentioned in an earlier comment, the Germans also got the uppercase "ẞ" in Unicode since 2008 (better late than never).I solved my problem by writing an Alphabet hierarchy in the past. I don't like that code but it still works: [...] It's confusing but it seems to work. :) It doesn't matter. Life is imperfect and things will somehow work in the end.What's your opinion/conclusion? Is it fine the way it is? Do you think that some unique property of the Turkish language/alphabet made these difficulties unavoidable? Or do you think that it was a mistake, but now it has to live with us forever for compatibility reasons? Anything else? And as for the D language and Phobos, should "ß" still uppercase to "SS"? Or can we change it to uppercase "ẞ" and remove German from the list of tricky languages at https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be listed there?
Oct 28 2022
On 29/10/2022 11:05 AM, Siarhei Siamashka wrote:And as for the D language and Phobos, should "ß" still uppercase to "SS"? Or can we change it to uppercase "ẞ" and remove German from the list of tricky languages at https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be listed there?That particular function, is based upon the simple mappings provided by UnicodeData.txt and (should be) in compliance of the Unicode standard. The only thing we need to do is regenerate the tables backing it whenever Unicode updates. Note the behavior you are asking for is defined in the Unicode database file SpecialCasing.txt which have not been implemented. ``` titlecase(uppercase(<es-zed>)) ``` That file is how you support languages like Turkish. We currently don't have it implemented. It requires operating on a whole string and to pass in what language rules to apply (i.e. Turkish, Azeri).
Oct 28 2022