digitalmars.D - New Lookup Table (MixString)
- Salih Dincer (81/81) Sep 02 2023 Hi,
- Richard (Rikki) Andrew Cattermole (15/15) Sep 02 2023 Lets see:
- Salih Dincer (52/58) Sep 03 2023 I don't think speed is a big issue because a thousand pages and
- Salih Dincer (5/23) Sep 03 2023 ```
- Richard (Rikki) Andrew Cattermole (5/8) Sep 03 2023 Yeah your lookup table is small enough that it won't matter.
Hi,

This is an InputRange and RandomAccessRange combined; it also places a wchar in the otherwise empty (null) upper part of each dchar. Please criticize this code. Each element of a UTF string is matched to its counterpart, stored as a dchar and used as a wchar. The code is self-explanatory; do you think it's useful?

```d
import std.stdio, std.algorithm;
import std.range, std.conv;

enum alphabets
{
  u = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ",
  l = "âabcçdefghıjklmnopqrstuvğiwxyzöşüçîû",
  ASCII65_95  = "AABCCDEFGHIJKLMNOPQRSTUVGIWXYZOSUCIU",
  ASCII96_127 = "aabccdefghijklmnopqrstuvgiwxyzosuciu",
  ASCII65_127 = ASCII65_95 ~ ASCII96_127
}

//enum dictU =!(wchar[]);
enum dictU = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ".to!(wchar[]);
// literal spelled out as for dictU (the initializer is missing in the archived post)
enum dictL = "âabcçdefghıjklmnopqrstuvğiwxyzöşüçîû".to!(wchar[]);

struct MixString(T, T[] leftLiterals)
{
  size_t index;
  dchar[] dict;

  this(string d)
  {
    // load dictionary (low 16 bits of each element hold the key)
    foreach(dchar c; d) dict ~= c;

    // place counterparts in the high 16 bits
    foreach(i, wchar c; leftLiterals)
    {
      dict[i] |= c << 16;
    }
  }

  // input range functions
  bool empty() { return index == dict.length; }
  T front() { return dict[index] & 0x0000_FFFF; }
  void popFront() { ++index; }

  // search elements: returns the 1-based position of key, or 0 if not found
  auto nextIndexOf(wchar key)
  {
    scope(exit) index = 0; // rewind for the next search
    size_t i = 1;
    while(!empty)
    {
      if(front == key)
      {
        return i;
      }
      else i++;
      popFront();
    }
    return 0;
  }
}

alias ConvUpper = MixString!(wchar, dictU);
alias ConvLower = MixString!(wchar, dictL);

void main()
{
  auto test = ConvUpper(alphabets.l);/*
  foreach(wchar c; test)
  {
    c.writefln!"%4X: %s"(c);
  }//*/

  string text = "fıstıkçı şâhap bir insandır!";
  foreach(wchar c; text)
  {
    if(auto result = test.nextIndexOf(c))
    {
      // the counterpart lives in the high 16 bits
      wchar lookup = test.dict[result - 1] >> 16;
      lookup.write;
    }
    else c.write;
  }
  writeln;
} /*
  FISTIKÇI ŞÂHAP BİR İNSANDIR!
*/
```

SDB 79
Sep 02 2023
Let's see:

- O(n) search for alphabet index
- Limited tables that do not scale to other languages.
- Tables limited to BMP.

Not particularly useful generally speaking, but with some improvements it may be useful in a limited capacity.

Search can be replaced with either binary search (where the probability of a particular character is unknown) or Fibonacci search (if the probability is known, with a preference towards the start of the ranges).

Typically such tables would be implemented using a multi-level trie, with the lookup being O(1). Costs more ROM, but is well worth it for the speed.

Unicode Demystified covers the standard method for doing this sort of lookup as well as how to do the case conversion correctly.
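As a rough illustration of the trie suggestion above, here is a minimal two-stage table (a two-level trie keyed on the high and low byte of a UTF-16 code unit). It is only a sketch: the names `TwoStageTable` and `map` are made up for this example, the alphabets are shortened, and it is still limited to the BMP like the original code. Real Unicode case conversion needs more than a one-to-one table.

```d
import std.stdio : writeln;

// Hypothetical two-stage lookup table: stage1 is indexed by the high byte
// of the code unit and selects a 256-entry block indexed by the low byte.
struct TwoStageTable
{
    ushort[256] stage1;    // high byte -> block number (0 = shared empty block)
    ushort[256][] blocks;  // low byte -> mapped code unit (0 = no mapping)

    this(const(wchar)[] src, const(wchar)[] dst)
    {
        assert(src.length == dst.length);
        blocks.length = 1; // block 0 stays all zero and is shared
        foreach (i, wchar c; src)
        {
            immutable hi = c >> 8, lo = c & 0xFF;
            if (stage1[hi] == 0)
            {
                // first mapping for this high byte: allocate a new block
                blocks.length = blocks.length + 1;
                stage1[hi] = cast(ushort)(blocks.length - 1);
            }
            blocks[stage1[hi]][lo] = cast(ushort) dst[i];
        }
    }

    // Two array indexings per lookup, independent of the alphabet size.
    wchar map(wchar c) const
    {
        immutable m = blocks[stage1[c >> 8]][c & 0xFF];
        return m ? cast(wchar) m : c; // unmapped units pass through
    }
}

void main()
{
    auto upper = TwoStageTable("abcçğıi"w, "ABCÇĞIİ"w);
    assert(upper.map('ı') == 'I');
    assert(upper.map('i') == 'İ');
    assert(upper.map('!') == '!');

    wchar[] result;
    foreach (wchar c; "çığ!"w) result ~= upper.map(c);
    writeln(result);
}
```

Only blocks that actually contain a mapping are allocated, which is where the extra ROM goes; every other high byte shares the empty block 0.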
Sep 02 2023
On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:
> Let's see: O(n) search for alphabet index

I don't think speed is a big issue, because an old text of about a thousand pages using possibly 47 letters (kutadgu-bilig-fergana-holograph.txt: ~ 2 MB.) is completed in under 1 second. The conversion includes reading from the file, finding the counterparts, and writing to the file...

For example:

```d
enum abece
{
  b = "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
  k = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
  ele = "gusiocCOISUG".to!(wchar[])
}

void main()
{
  alias MSbyk = MixString!(wchar, abece.b);
  enum bütünSözlük = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; //!string;
  auto büyük = MSbyk(bütünSözlük);

  // Source:
  auto dosya = File("KutadguBilig.txt", "r");
  while (!dosya.eof)
  {
    foreach(wchar c; dosya.readln)
    {
      if(auto result = büyük.nextIndexOf(c))
      {
        wchar lookup = büyük.dict[result - 1] >> 16;
        lookup.write;
      }
      else
      {
        c.write;
      }
    }
    writeln;
  }
} /*
  pico enpi:~/Projeler/NewLookup$ time ./newLookupTable > result.txt

  real  0m0,875s
  user  0m0,859s
  sys   0m0,016s
*/
```

On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:
> Unicode Demystified covers the standard method for doing this sort of lookup as well as how to do the case conversion correctly.

Thank you, I will read the book you mentioned.

SDB 79
Sep 03 2023
On Sunday, 3 September 2023 at 10:36:58 UTC, Salih Dincer wrote:
> For example:
> ```d
> enum abece
> {
>   b = "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
>   k = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
>   ele = "gusiocCOISUG".to!(wchar[])
> }
>
> void main()
> {
>   alias MSbyk = MixString!(wchar, abece.b);
>   enum bütünSözlük = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; //!string;
> ```

I wonder why I can't use abece.k directly. The error it gives is as follows:

core.exception.ArrayIndexError newLookupTable.d(39): index [1] is out of bounds for array of length 1

SDB 79
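One possible explanation, offered as a guess since the failing call is not shown in full: if the enum member was passed through `std.conv.to!string`, the result is the member's name rather than its value, because `to!string` formats enum members by their symbolic names. A dictionary built from the one-character string "k" would have length 1, which would fit the reported error. The sketch below uses a hypothetical, shortened enum `Letters` with a `wstring` base type to show the difference.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical, shortened stand-in for `abece` (assumption: wstring base type)
enum Letters
{
    k = "aeı"w
}

void main()
{
    // to!string on an enum member yields the member's *name* ...
    string name = Letters.k.to!string;
    assert(name == "k");
    assert(name.length == 1);

    // ... while going through the base type first yields the *value*.
    wstring value = Letters.k;      // implicit conversion to the base type
    string text = value.to!string;  // UTF-16 -> UTF-8 transcoding
    assert(text == "aeı");

    writeln(name, " vs ", text);
}
```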
Sep 03 2023
On 03/09/2023 10:36 PM, Salih Dincer wrote:
> I don't think speed is a big issue because a thousand pages and possibly 47 letters old text (kutadgu-bilig-fergana-holograph.txt: ~ 2 MB.) is completed in under 1 second.

Yeah, your lookup table is small enough that it won't matter.

The problem is that it won't scale. Unicode as a whole goes up to 0x10FFFF, with the first plane (the BMP) being 64k. Imagine trying to throw hardware at those sorts of numbers.
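To put those numbers in code, here is a small sketch. The sample characters are arbitrary choices for illustration. It checks the size of the code space and shows that a code point outside the BMP takes two `wchar` code units, so a table keyed on a single `wchar` cannot represent it.

```d
import std.utf : codeLength;

// 17 planes of 0x1_0000 (65,536) code points each cover U+0000 .. U+10FFFF
static assert(17 * 0x1_0000 == 0x10FFFF + 1);

void main()
{
    // A BMP character fits in one UTF-16 code unit ...
    assert(codeLength!wchar('ğ') == 1);
    // ... but a code point beyond the BMP needs a surrogate pair.
    assert(codeLength!wchar('\U0001F600') == 2);

    // Iterating a string as wchar therefore sees two separate units.
    size_t units;
    foreach (wchar c; "\U0001F600") ++units;
    assert(units == 2);
}
```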
Sep 03 2023