digitalmars.D - Notice/Warning on narrowStrings .length
- James Miller (11/11) Apr 23 2012 I'm writing an introduction/tutorial to using strings in D,
- Adam D. Ruppe (6/8) Apr 23 2012 Maybe... but it is important that this works:
- bearophile (20/29) Apr 23 2012 As with strlen() in C, unfortunately the result of
- James Miller (14/30) Apr 23 2012 I was thinking about that. This is quite a vague suggestion, more
- bearophile (8/11) Apr 23 2012 Lot of people in D.learn don't even use "-wi -property" so go
- Jonathan M Davis (15/27) Apr 23 2012 At this point, I don't think that it makes any sense to give a warning f...
- Nick Sabalausky (9/18) Apr 26 2012 I find that most of the time I actually *do* want to use length. Don't k...
- Jonathan M Davis (16/20) Apr 26 2012 You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's...
- Nick Sabalausky (5/29) Apr 26 2012 Yea, I'm not saying that walkLength should deal with graphemes. Just tha...
- H. S. Teoh (27/49) Apr 26 2012 [...]
- Nick Sabalausky (21/67) Apr 26 2012 Interesting. Kinda makes sence that such thing exists, though: The CJK
- H. S. Teoh (66/114) Apr 26 2012 Correction: the official term for this is "full-width" (as opposed to
- Nick Sabalausky (21/84) Apr 26 2012 Yikes!
- H. S. Teoh (42/96) Apr 26 2012 Don't laugh too hard. The original version of vi also had an undo buffer
- Nick Sabalausky (17/37) Apr 26 2012 "We didn't start that flamewar,
- H. S. Teoh (50/71) Apr 26 2012 You think that's crazy, huh? Check this out:
- Nick Sabalausky (3/21) Apr 27 2012 Jesus, I could *easily* mistake that for hardware schematics. That's wil...
- Nathan M. Swan (4/30) Apr 27 2012 It was actually the first human writing ever. Which Phoenician
- Nick Sabalausky (3/12) Apr 27 2012 That's pretty cool.
- Matt Peterson (3/10) Apr 26 2012 I actually recently wrote a lexer generator for D that wouldn't
- H. S. Teoh (8/16) Apr 26 2012 That's awesome! Would you like to give it a shot? ;-)
- Dmitry Olshansky (6/10) Apr 27 2012 Come on, notepad is a real nice in one job only: getting rid of style
- Nick Sabalausky (8/18) Apr 27 2012 That's the #1 biggest thing I use it for!! :) And yes, daily.
- Dmitry Olshansky (4/24) Apr 27 2012 Yup I certainly wouldn't mind a separate "copy with my font settings" ;)
- Andrej Mitrovic (6/8) Apr 26 2012 If you run "edit" in command prompt or the run dialog (well, assuming
- Nick Sabalausky (6/14) Apr 26 2012 Heh, I remember that :)
- Brad Anderson (8/187) Apr 26 2012 I'm not sure if you or others knew or not (I didn't until just
- Jonathan M Davis (4/26) Apr 26 2012 That's a fantastic idea! Of course, that leaves the job of implementing ...
- Dmitry Olshansky (10/56) Apr 27 2012 FSA are based on tables so it's all runs in the circle. Only the layout
- H. S. Teoh (20/41) Apr 27 2012 Yes, but hand-coded tables tend to go out of date, be prone to bugs, or
I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and 16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.
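A minimal sketch of the difference described above (not from the original post): length counts code units, while walkLength counts code points, so the two agree on ASCII and diverge as soon as a multi-byte character shows up. The sample strings are illustrative only, and representation from std.string is used here as one alternative to the cast mentioned above.

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    string ascii = "hello";         // 5 code points, 5 UTF-8 code units
    string accented = "h\u00E9llo"; // 5 code points, 6 UTF-8 code units

    // For pure ASCII the two agree, which is how bugs can lurk unnoticed.
    assert(ascii.length == 5);
    assert(ascii.walkLength == 5);

    // length counts code units; walkLength decodes and counts code points.
    assert(accented.length == 6);
    assert(accented.walkLength == 5);

    // The same text as UTF-16: U+00E9 fits in a single 16-bit code unit.
    wstring wide = "h\u00E9llo"w;
    assert(wide.length == 5);

    // Raw bytes without a cast.
    immutable(ubyte)[] bytes = accented.representation;
    assert(bytes.length == 6);
}
```

Neither number is wrong; they answer different questions, which is what makes a blanket warning hard to justify.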
Apr 23 2012
On Monday, 23 April 2012 at 23:01:59 UTC, James Miller wrote:
> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

Maybe... but it is important that this works:

    string s;
    if(s.length)
        do_something(s);

since that's always right and quite common.
Apr 23 2012
James Miller:
> I realised that when you want the number of characters, you normally actually want to use walkLength, not length.

As with strlen() in C, unfortunately the result of walkLength(somestring) is computed every time you call it, because it doesn't get cached. A partial improvement for this situation is to ensure that walkLength(somestring) is strongly pure, and to ensure the D compiler is able to move this invariant pure computation out of loops.

> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

This is not easy to do, because sometimes you want to know the number of code points, and sometimes the number of code units. I remember even a proposal to rename the "length" field to another name for narrow strings, to avoid such bugs.

-----------------------

Adam D. Ruppe:
> Maybe... but it is important that this works:
>
>     string s;
>     if(s.length)
>         do_something(s);
>
> since that's always right and quite common.

Better:

    if (!s.empty)
        do_something(s);

(or even better, built-in non-nulls, usable for strings too).

Bye,
bearophile
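Until the compiler can hoist the call by itself, the caching can be done by hand. A minimal sketch, where padRight is a hypothetical helper invented for illustration:

```d
import std.range : empty, walkLength;

/// Hypothetical helper: pads `s` on the right with spaces until it holds
/// at least `width` code points.
string padRight(string s, size_t width)
{
    // walkLength is O(n) on narrow strings, so compute it once up front
    // rather than re-testing result.walkLength < width on every iteration,
    // which would rescan the whole string each time.
    const size_t count = s.walkLength;
    string result = s;
    for (size_t i = count; i < width; ++i)
        result ~= ' ';
    return result;
}

void main()
{
    string s;
    // empty needs no decoding; for a string it is the same as s.length == 0.
    assert(s.empty);

    assert(padRight("h\u00E9", 4) == "h\u00E9  ");
    assert(padRight("hello", 3) == "hello");
}
```

The emptiness test never needs walkLength at all, which is why the s.length / s.empty idiom stays correct for narrow strings.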
Apr 23 2012
On Monday, 23 April 2012 at 23:52:41 UTC, bearophile wrote:
> James Miller:
>> I realised that when you want the number of characters, you normally actually want to use walkLength, not length.
>
> As with strlen() in C, unfortunately the result of walkLength(somestring) is computed every time you call it, because it doesn't get cached. A partial improvement for this situation is to ensure that walkLength(somestring) is strongly pure, and to ensure the D compiler is able to move this invariant pure computation out of loops.
>
>> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> This is not easy to do, because sometimes you want to know the number of code points, and sometimes the number of code units. I remember even a proposal to rename the "length" field to another name for narrow strings, to avoid such bugs.

I was thinking about that. This is quite a vague suggestion, more just throwing the idea out there and seeing what people think. I am aware of the issue of walkLength being computed every time, rather than being a constant lookup.

One option would be to make it only a warning in @safe code, so the worst-case scenario is that you mark the function as @trusted. I feel this fits in with the idea of @safe quite well, since you have to explicitly tell the compiler that you know what you're doing.

Another option would be to have some sort of general lint tool that picks up on these kinds of potential errors, though that is a lot bigger in scope...

--
James Miller
Apr 23 2012
James Miller:
> Another option would be to have some sort of general lint tool that picks up on these kinds of potential errors, though that is a lot bigger in scope...

A lot of people in D.learn don't even use "-wi -property", so go figure how many will use a lint :-) As a first approximation you can rely only on what people see when compiling with "dmd foo.d", that is, the most basic compilation only. More serious programmers thankfully activate warnings.

Bye,
bearophile
Apr 23 2012
On Tuesday, April 24, 2012 01:01:57 James Miller wrote:
> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and 16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.

At this point, I don't think that it makes any sense to give a warning for this. The compiler can't possibly know whether using length is a good idea or correct in any particular set of code. If we really want to do something to tackle the problem, then we should create a new string type which better solves the issues. There's a _lot_ more to be worried about due to the fact that strings are variable length encoded than just their length.

There has been talk of creating a new string type, and there has been talk of creating the concept of a variable length encoded range which better handles all of this stuff, but no proposal thus far has gotten anywhere.

As for walkLength being O(n) in many cases (as discussed elsewhere in this thread), I don't think that it's that big a deal. If you know what it's doing, you know that it's O(n), and it's simple enough to simply save the result if you need to call it multiple times.

- Jonathan M Davis
Apr 23 2012
"James Miller" wrote in message news:qdgacdzxkhmhojqcettj forum.dlang.org...
> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and 16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.

I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases.

Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").
Apr 26 2012
On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").

You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's internals) deals with graphemes. It all operates on code points, and strings are considered to be ranges of code points, not graphemes. So, as far as ranges go, walkLength returns the actual length of the range. That's _usually_ the number of characters/graphemes as well, but it's certainly not 100% correct.

We'll need further unicode facilities in Phobos to deal with that though, and I doubt that strings will ever change to be treated as ranges of graphemes, since that would be incredibly expensive computationally. We have enough performance problems with strings as it is. What we'll probably get is extra functions to deal with normalization (and probably something to count the number of graphemes) and probably a wrapper type that does deal in graphemes.

Regardless, you're right about walkLength returning the number of code points rather than graphemes, because strings are considered to be ranges of dchar.

- Jonathan M Davis
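For readers on a newer toolchain: Phobos releases later than this thread did add grapheme support to std.uni. A minimal sketch, assuming a compiler whose std.uni provides byGrapheme (it was not available when the post above was written), in which all three counts disagree for one visible character:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single
    // character, yet it is two code points and three UTF-8 code units.
    string decomposed = "e\u0301";

    assert(decomposed.length == 3);                // code units
    assert(decomposed.walkLength == 2);            // code points
    assert(decomposed.byGrapheme.walkLength == 1); // graphemes
}
```

The grapheme count walks the string through a wrapper range rather than changing what string itself iterates over, in line with the wrapper-type idea in the post above.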
Apr 26 2012
"Jonathan M Davis" wrote in message news:mailman.2166.1335463456.4860.digitalmars-d puremagic.com...
> On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").
>
> You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's internals) deals with graphemes. It all operates on code points, and strings are considered to be ranges of code points, not graphemes. So, as far as ranges go, walkLength returns the actual length of the range. That's _usually_ the number of characters/graphemes as well, but it's certainly not 100% correct.
>
> We'll need further unicode facilities in Phobos to deal with that though, and I doubt that strings will ever change to be treated as ranges of graphemes, since that would be incredibly expensive computationally. We have enough performance problems with strings as it is. What we'll probably get is extra functions to deal with normalization (and probably something to count the number of graphemes) and probably a wrapper type that does deal in graphemes.

Yea, I'm not saying that walkLength should deal with graphemes. Just that if someone wants the number of "characters", then neither length *nor* walkLength are guaranteed to be correct.
Apr 26 2012
On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> "James Miller" wrote in message news:qdgacdzxkhmhojqcettj forum.dlang.org...
>> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and 16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>>
>> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.
>
> I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases.
>
> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").
[...]

And don't forget that some code points (notably from the CJK block) are specified as "double-width", so if you're trying to do text layout, you'll want yet a different length (layoutLength?).

So we really need all four lengths. Ain't unicode fun?! :-)

Array length is simple. walkLength is already implemented. Grapheme length requires recognition of 'combining characters' (or rather, ignoring said characters), and layout length requires recognizing widthless, single- and double-width characters.

I've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all?

What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs), codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding.

The generation algo can then be run just once per unicode release, and everything will Just Work.

T

--
Give me some fresh salted fish, please.
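The simplest case of working at the code-unit level without decoding is the code point count itself. A minimal hand-written sketch (not generated from any table, and assuming the input is valid UTF-8); countCodePoints is a name made up for illustration:

```d
/// Counts code points by inspecting raw UTF-8 code units only.
/// Assumes `s` is valid UTF-8; no decoding to dchar takes place.
size_t countCodePoints(string s)
{
    size_t n = 0;
    // Iterating with `char` visits code units, not decoded code points.
    foreach (char c; s)
    {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

void main()
{
    assert(countCodePoints("") == 0);
    assert(countCodePoints("hello") == 5);
    assert(countCodePoints("h\u00E9llo") == 5); // U+00E9 is 2 bytes
    assert(countCodePoints("\u9F98") == 1);     // 3 bytes, 1 code point
}
```

This is effectively a machine with a single state; grapheme count and layout width need more states, which is where generating them from the Unicode data would pay off.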
Apr 26 2012
"H. S. Teoh" wrote in message news:mailman.2173.1335475413.4860.digitalmars-d puremagic.com...
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> [...]
> And don't forget that some code points (notably from the CJK block) are specified as "double-width", so if you're trying to do text layout, you'll want yet a different length (layoutLength?).

Interesting. Kinda makes sense that such a thing exists, though: The CJK characters (even the relatively simple Japanese *kanas) are detailed enough that they need to be larger to achieve the same readability. And that's the *non*-double-length ones. So I don't doubt there's ones that need to be tagged as "Draw Extra Big!!" :)

For example, I have my font size in Windows Notepad set to a comfortable value. But when I want to use hiragana or katakana, I have to go into the settings and increase the font size so I can actually read it (Well, to what *little* extent I can even read it in the first place ;) ). And those kana tend to be among the simplest CJK characters.

(Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for real coding/writing).

> So we really need all four lengths. Ain't unicode fun?! :-)

No kidding. The *one* thing I really, really hate about Unicode is the fact that most (if not all) of its complexity actually *is* necessary.

Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

> Array length is simple. walkLength is already implemented. Grapheme length requires recognition of 'combining characters' (or rather, ignoring said characters), and layout length requires recognizing widthless, single- and double-width characters.

Yup.

> I've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all?
>
> What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs), codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and everything will Just Work.

While I find that very interesting...I'm afraid I don't actually understand your suggestion :/ (I do understand FSM's and how they work, though) Could you give a little example of what you mean?
Apr 26 2012
On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:
> "H. S. Teoh" wrote in message news:mailman.2173.1335475413.4860.digitalmars-d puremagic.com...
[...]
>> And don't forget that some code points (notably from the CJK block) are specified as "double-width", so if you're trying to do text layout, you'll want yet a different length (layoutLength?).

Correction: the official term for this is "full-width" (as opposed to the "half-width" of the typical European scripts).

> Interesting. Kinda makes sense that such a thing exists, though: The CJK characters (even the relatively simple Japanese *kanas) are detailed enough that they need to be larger to achieve the same readability. And that's the *non*-double-length ones. So I don't doubt there's ones that need to be tagged as "Draw Extra Big!!" :)

Have you seen U+9F98? It's an insanely convoluted glyph composed of *three copies* of an already extremely complex glyph.

    http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

(And yes, that huge thing is supposed to fit inside a SINGLE character... what *were* those ancient Chinese scribes thinking?!)

> For example, I have my font size in Windows Notepad set to a comfortable value. But when I want to use hiragana or katakana, I have to go into the settings and increase the font size so I can actually read it (Well, to what *little* extent I can even read it in the first place ;) ). And those kana tend to be among the simplest CJK characters.
>
> (Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for real coding/writing).

LOL... love the fact that you felt obligated to justify your use of notepad. :-P

>> So we really need all four lengths. Ain't unicode fun?! :-)
>
> No kidding. The *one* thing I really, really hate about Unicode is the fact that most (if not all) of its complexity actually *is* necessary.

We're lucky the more imaginative scribes of the world have either been dead for centuries or have restricted themselves to writing fictional languages. :-) The inventions of the dead ones have been codified and simplified by the unfortunate people who inherited their overly complex systems (*cough*CJK glyphs*cough), and the inventions of the living ones are largely ignored by the world due to the fact that, well, their scripts are only useful for writing fictional languages. :-)

So despite the fact that there is still some crazy convoluted stuff out there, such as Arabic or Indic scripts with pair-wise substitution rules in Unicode, overall things are relatively tame. At least the subcomponents of CJK glyphs are no longer productive (actively being used to compose new characters by script users) -- can you imagine the insanity if Unicode had to support composition by those radicals and subparts? Or if Unicode had to support a script like this one:

    http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

whose components are graphically composed in, shall we say, entirely non-trivial ways (see the composed samples at the bottom of the page)?

> Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

In an ideal world, where memory is not an issue and bus width is indefinitely wide, a Unicode string would simply be a sequence of integers (of arbitrary size). Things like combining diacritics, etc., would have dedicated bits/digits for representing them, so there's no need of the complexity of UTF-8, UTF-16, etc.. Everything fits into a single character. Every possible combination of diacritics on every possible character has a unique representation as a single integer. String length would be equal to glyph count.

In such an ideal world, screens would also be of indefinitely detailed resolution, so anything can fit inside a single grid cell, so there's no need of half-width/double-width distinctions. You could port ancient ASCII-centric C code just by increasing sizeof(char), and things would Just Work.

Yeah I know. Totally impossible. But one can dream, right? :-)

[...]
>> I've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all?
>>
>> What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs), codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding.
>>
>> The generation algo can then be run just once per unicode release, and everything will Just Work.
>
> While I find that very interesting...I'm afraid I don't actually understand your suggestion :/ (I do understand FSM's and how they work, though) Could you give a little example of what you mean?
[...]

Currently, std.uni code (argh the pun!!) is hand-written with tables of which character belongs to which class, etc.. These hand-coded tables are error-prone and unnecessary. For example, think of computing the layout width of a UTF-8 stream. Why waste time decoding into dchar, and then doing all sorts of table lookups to compute the width? Instead, treat the stream as a byte stream, with certain sequences of bytes evaluating to length 2, others to length 1, and yet others to length 0.

A lexer engine is perfectly suited for recognizing these kinds of sequences with optimal speed. The only difference from a real lexer is that instead of spitting out tokens, it keeps a running total (layout) length, which is output at the end.

So what we should do is to write a tool that processes UnicodeData.txt (the official table of character properties from the Unicode standard) and generates lexer engines that compute various Unicode properties (grapheme count, layout length, etc.) for each of the UTF encodings.

This way, we get optimal speed for these algorithms, plus we don't need to manually maintain tables and stuff, we just run the tool on UnicodeData.txt each time there's a new Unicode release, and the correct code will be generated automatically.

T

--
Public parking: euphemism for paid parking. -- Flora
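To give a flavour of what such a generated engine might look like, here is a small hand-written sketch of the byte-stream idea. It is an illustration under loud assumptions, not the proposed tool: layoutWidth is a hypothetical name, the input must be valid UTF-8, and only two ranges are special-cased (U+0300..U+036F combining diacritical marks as width 0, U+4E00..U+9FFF CJK unified ideographs as width 2). Every other code point counts as width 1, which is wrong for many real characters; a generated machine would cover the full Unicode tables.

```d
/// Hypothetical sketch: layout width computed directly on UTF-8 bytes.
/// Assumes valid UTF-8 and knows only two illustrative ranges.
size_t layoutWidth(string s)
{
    enum State { start, afterCD, afterE4 }

    size_t width = 0;
    auto state = State.start;

    foreach (char c; s) // visits UTF-8 code units, no decoding
    {
        final switch (state)
        {
        case State.start:
            if ((c & 0xC0) == 0x80)
                break; // trailing byte of a sequence already accounted for
            if (c == 0xCC)
                break; // 0xCC xx is U+0300..U+033F: all combining, width 0
            if (c == 0xCD)
                state = State.afterCD; // width depends on the next byte
            else if (c == 0xE4)
                state = State.afterE4; // width depends on the next byte
            else if (c >= 0xE5 && c <= 0xE9)
                width += 2;            // U+5000..U+9FFF
            else
                width += 1;
            break;

        case State.afterCD:
            // 0xCD 0x80..0xAF is U+0340..U+036F (combining, width 0);
            // 0xCD 0xB0..0xBF is U+0370..U+037F (width 1).
            width += (c <= 0xAF) ? 0 : 1;
            state = State.start;
            break;

        case State.afterE4:
            // 0xE4 0xB8..0xBF starts U+4E00..U+4FFF (width 2);
            // lower second bytes fall outside the CJK range used here.
            width += (c >= 0xB8) ? 2 : 1;
            state = State.start;
            break;
        }
    }
    return width;
}

void main()
{
    assert(layoutWidth("abc") == 3);
    assert(layoutWidth("e\u0301") == 1);       // base letter + combining accent
    assert(layoutWidth("\u4E2D\u6587") == 4);  // two CJK ideographs
    assert(layoutWidth("\u9F98") == 2);
}
```

The three states and the byte ranges were worked out by hand from the UTF-8 encoding of the two ranges; the point of the proposal is that a tool would derive and minimise such states from the Unicode data files instead.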
Apr 26 2012
"H. S. Teoh" wrote in message news:mailman.2179.1335486409.4860.digitalmars-d puremagic.com...
> Have you seen U+9F98? It's an insanely convoluted glyph composed of *three copies* of an already extremely complex glyph.
>
>     http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png
>
> (And yes, that huge thing is supposed to fit inside a SINGLE character... what *were* those ancient Chinese scribes thinking?!)

Yikes!

>> For example, I have my font size in Windows Notepad set to a comfortable value. But when I want to use hiragana or katakana, I have to go into the settings and increase the font size so I can actually read it (Well, to what *little* extent I can even read it in the first place ;) ). And those kana tend to be among the simplest CJK characters.
>>
>> (Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for real coding/writing).
>
> LOL... love the fact that you felt obligated to justify your use of notepad. :-P

Heh, any usage of Notepad *needs* to be justified. For example, it has an undo buffer of exactly ONE change. And the stupid thing doesn't even handle Unix-style newlines. *Everything* handles Unix-style newlines these days, even on Windows. Windows *BATCH* files even accept Unix-style newlines, for goddsakes! But not Notepad.

It is nice in its leanness and no-nonsense-ness. But it desperately needs some updates.

At least it actually supports Unicode though. (Which actually I find somewhat surprising.)

'Course, this is all XP. For all I know maybe they have finally updated it in MS OSX, erm, I mean Vista and Win7...

>>> So we really need all four lengths. Ain't unicode fun?! :-)
>>
>> No kidding. The *one* thing I really, really hate about Unicode is the fact that most (if not all) of its complexity actually *is* necessary.
>
> We're lucky the more imaginative scribes of the world have either been dead for centuries or have restricted themselves to writing fictional languages. :-) The inventions of the dead ones have been codified and simplified by the unfortunate people who inherited their overly complex systems (*cough*CJK glyphs*cough), and the inventions of the living ones are largely ignored by the world due to the fact that, well, their scripts are only useful for writing fictional languages. :-)
>
> So despite the fact that there is still some crazy convoluted stuff out there, such as Arabic or Indic scripts with pair-wise substitution rules in Unicode, overall things are relatively tame. At least the subcomponents of CJK glyphs are no longer productive (actively being used to compose new characters by script users) -- can you imagine the insanity if Unicode had to support composition by those radicals and subparts? Or if Unicode had to support a script like this one:
>
>     http://www.arthaey.com/conlang/ashaille/writing/sarapin.html
>
> whose components are graphically composed in, shall we say, entirely non-trivial ways (see the composed samples at the bottom of the page)?

That's insane!

And yet, very very interesting...

>> While I find that very interesting...I'm afraid I don't actually understand your suggestion :/ (I do understand FSM's and how they work, though) Could you give a little example of what you mean?
>
> [...]
> Currently, std.uni code (argh the pun!!)

Hah! :)

> is hand-written with tables of which character belongs to which class, etc.. These hand-coded tables are error-prone and unnecessary. For example, think of computing the layout width of a UTF-8 stream. Why waste time decoding into dchar, and then doing all sorts of table lookups to compute the width? Instead, treat the stream as a byte stream, with certain sequences of bytes evaluating to length 2, others to length 1, and yet others to length 0.
>
> A lexer engine is perfectly suited for recognizing these kinds of sequences with optimal speed. The only difference from a real lexer is that instead of spitting out tokens, it keeps a running total (layout) length, which is output at the end.
>
> So what we should do is to write a tool that processes UnicodeData.txt (the official table of character properties from the Unicode standard) and generates lexer engines that compute various Unicode properties (grapheme count, layout length, etc.) for each of the UTF encodings.
>
> This way, we get optimal speed for these algorithms, plus we don't need to manually maintain tables and stuff, we just run the tool on UnicodeData.txt each time there's a new Unicode release, and the correct code will be generated automatically.

I see. I think that's a very good observation, and a great suggestion. In fact, I'd imagine it'd be considerably simpler than a typical lexer generator. Much less of the fancy regexy-ness would be needed. Maybe put together a pull request if you get the time...?
Apr 26 2012
On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote: [...]Heh, any usage of Notepad *needs* to be justified. For example, it has an undo buffer of exactly ONE change.Don't laugh too hard. The original version of vi also had an undo buffer of depth 1. In fact, one of the *current* vi's still only has an undo buffer of depth 1. (Fortunately vim is much much saner.)And the stupid thing doesn't even handle Unix-style newlines. *Everything* handes Unix-style newlines these days, even on Windows. Windows *BATCH* files even accept Unix-style newlines, for goddsakes! But not Notepad. It is nice in it's leanness and no-nonsence-ness. But it desperately needs some updates.Back in the day, my favorite editor ever was Norton Editor. It's tiny (only about 50k or less, IIRC) yet had innovative (for its day) features... like split pane editing, ^V which flips capitalization to EOL (so a single function serves for both uppercasing and lowercasing, and you just apply it twice to do a single word). Unfortunately it's a DOS-only program. I think it works in the command prompt, but I've never tested it (the modern windows command prompt is subtly different from the old DOS command prompt, so things may not quite work as they used to). It's ironic how useless Notepad is compared to an ancient DOS program from the dinosaur age.At least it actually supports Unicode though. (Which actually I find somewhat surprising.)Now in that, at least, it surpasses Norton Editor. :-) But had Norton not been bought over by Symantec, we'd have a modern, much more powerful version of NE today. But, oh well. Things have moved on. Vim beats the crap out of NE, Notepad, and just about any GUI editor out there. It also beats the snot out of emacs, but I don't want to start *that* flamewar. :-P [...]Here's more: http://www.omniglot.com/writing/conscripts2.htm Imagine if some of the more complicated scripts there were actually used in a real language, and Unicode had to support it... 
Like this one: http://www.omniglot.com/writing/talisman.htm Or, if you *really* wanna go all-out: http://www.omniglot.com/writing/ssioweluwur.php (Check out the sample text near the bottom of the page and gape in awe at what creative minds let loose can produce... and horror at the prospect of Unicode being required to support it.) [...]http://www.arthaey.com/conlang/ashaille/writing/sarapin.html whose components are graphically composed in, shall we say, entirely non-trivial ways (see the composed samples at the bottom of the page)?That's insane! And yet, very very interesting...[...] When I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it... T -- What do you mean the Internet isn't filled with subliminal messages? What about all those buttons marked "submit"??Currently, std.uni code (argh the pun!!)Hah! :)is hand-written with tables of which character belongs to which class, etc.. These hand-coded tables are error-prone and unnecessary. For example, think of computing the layout width of a UTF-8 stream. Why waste time decoding into dchar, and then doing all sorts of table lookups to compute the width? Instead, treat the stream as a byte stream, with certain sequences of bytes evaluating to length 2, others to length 1, and yet others to length 0. A lexer engine is perfectly suited for recognizing these kinds of sequences with optimal speed. The only difference from a real lexer is that instead of spitting out tokens, it keeps a running total (layout) length, which is output at the end. So what we should do is to write a tool that processes Unicode.txt (the official table of character properties from the Unicode standard) and generates lexer engines that compute various Unicode properties (grapheme count, layout length, etc.) for each of the UTF encodings. 
This way, we get optimal speed for these algorithms, plus we don't need to manually maintain tables and stuff, we just run the tool on Unicode.txt each time there's a new Unicode release, and the correct code will be generated automatically.
I see. I think that's a very good observation, and a great suggestion. In fact, I'd imagine it'd be considerably simpler than a typical lexer generator. Much less of the fancy regexy-ness would be needed. Maybe put together a pull request if you get the time...?
Apr 26 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.2182.1335490591.4860.digitalmars-d puremagic.com...Now in that, at least, it surpasses Norton Editor. :-) But had Norton not been bought over by Symantec, we'd have a modern, much more powerful version of NE today. But, oh well. Things have moved on. Vim beats the crap out of NE, Notepad, and just about any GUI editor out there. It also beats the snot out of emacs, but I don't want to start *that* flamewar. :-P"We didn't start that flamewar, It was always burning, Since the world's been turning..."Here's more: http://www.omniglot.com/writing/conscripts2.htm Imagine if some of the more complicated scripts there were actually used in a real language, and Unicode had to support it... Like this one: http://www.omniglot.com/writing/talisman.htm Or, if you *really* wanna go all-out: http://www.omniglot.com/writing/ssioweluwur.php (Check out the sample text near the bottom of the page and gape in awe at what creative minds let loose can produce... and horror at the prospect of Unicode being required to support it.)Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul (sp?), at least to my untrained eye. And then others are just *really* interesting-looking, like: http://www.omniglot.com/writing/12480.htm http://www.omniglot.com/writing/ayeri.htm http://www.omniglot.com/writing/oxidilogi.htm You're right though, if I were in charge of Unicode and tasked with handling some of those, I think I'd just say "Screw it. Unicode is now depricated. Use ASCII instead. Doesn't have the characters for your langauge? Tough! Fix your language!" :)When I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it...Heh, yea. It is a tempting project, but my plate's overflowing too. 
(Now if only I could make the same happen to my bank account...!)
Apr 26 2012
On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote: [...]Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul (sp?), at least to my untrained eye. And then others are just *really* interesting-looking, like: http://www.omniglot.com/writing/12480.htm http://www.omniglot.com/writing/ayeri.htm http://www.omniglot.com/writing/oxidilogi.htm You're right though, if I were in charge of Unicode and tasked with handling some of those, I think I'd just say "Screw it. Unicode is now depricated. Use ASCII instead. Doesn't have the characters for your langauge? Tough! Fix your language!" :)You think that's crazy, huh? Check this out: http://www.omniglot.com/writing/sumerian.htm Now take a deep breath... ... this writing was *actually used* in ancient times. Yeah. Which means it probably has a Unicode block assigned to it, right now. :-)[...] On the other hand though, sometimes it's refreshing to take a break from "serious" low-level core language D code, and just write plain ole normal boring application code in D. It's good to be reminded just how easy and pleasant it is to write application code in D. For example, just today I was playing around with a regex-based version of formattedRead: you pass in a regex and a bunch of pointers, and the function uses compile-time introspection to convert regex matches into the correct value types. So you could call it like this: int year; string month; int day; regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, &year, &month, &day); Basically, each pair of parentheses corresponds with a pointer argument; non-capturing parentheses (?:) can be used for grouping without assigning to an item. Its current implementation is still kinda crude, but it does support assigning to user-defined types if you define a fromString() method that does the requisite conversion from the matching substring. 
The next step is to standardize on enums in user-defined types that specify a regex substring to be used for matching items of that type, so that the caller doesn't have to know what kind of string pattern is expected by fromString(). I envision something like this: struct MyDate { enum stdFmt = `(\d{4}-\d{2}-\d{2})`; enum americanFmt = `(\d{2}-\d{2}-\d{4})`; static MyDate fromString(Char)(Char[] value) { ... } } ... string label1, label2; MyDate dt1, dt2; regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.stdFmt~`\s*$`, &label1, &dt1); regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.americanFmt~`\s*$`, &label2, &dt2); So the user can specify, in the regex, which date format to use in parsing the dates. I think this is a vast improvement over the current straitjacketed formattedRead. ;-) And it's so much fun to code (and use). T -- Let X be the set not defined by this sentence...When I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it...Heh, yea. It is a tempting project, but my plate's overflowing too. (Now if only I could make the same happen to bank account...!)
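The regexRead idea described above can be sketched in a few lines of D. This is not the implementation from the post (which was not shown); it is a minimal guess at the shape, under the assumption that each capture group maps to one pointer argument and that `std.conv.to` is good enough for the conversions. The name `regexReadSketch` is made up for this example, and user-defined types with `fromString()` are left out.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

/// Hypothetical sketch: fill each pointer from the matching capture group.
/// Returns false if the pattern does not match or the group count differs
/// from the number of pointers. A failed conversion throws ConvException.
bool regexReadSketch(Targets...)(string input, string pattern, Targets targets)
{
    auto m = matchFirst(input, regex(pattern));

    // Group 0 is the whole match, so we expect one capture more than
    // there are pointer arguments.
    if (m.empty || m.length != Targets.length + 1)
        return false;

    // foreach over a parameter tuple is unrolled at compile time, so
    // typeof(*target) is known statically for every argument.
    foreach (i, target; targets)
        *target = m[i + 1].to!(typeof(*target));

    return true;
}
```

Called as in the post, `regexReadSketch(input, pattern, &year, &month, &day)` on the input `"2012 April 26"` would leave `year == 2012`, `month == "April"` and `day == 26`.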
Apr 26 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.1.1335507187.22023.digitalmars-d puremagic.com...On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote: [...]Jesus, I could *easily* mistake that for hardware schematics. That's wild.Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul (sp?), at least to my untrained eye. And then others are just *really* interesting-looking, like: http://www.omniglot.com/writing/12480.htm http://www.omniglot.com/writing/ayeri.htm http://www.omniglot.com/writing/oxidilogi.htm You're right though, if I were in charge of Unicode and tasked with handling some of those, I think I'd just say "Screw it. Unicode is now depricated. Use ASCII instead. Doesn't have the characters for your langauge? Tough! Fix your language!" :)You think that's crazy, huh? Check this out: http://www.omniglot.com/writing/sumerian.htm Now take a deep breath... ... this writing was *actually used* in ancient times. Yeah.
Apr 27 2012
On Friday, 27 April 2012 at 06:12:01 UTC, H. S. Teoh wrote:On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote: [...]It was actually the first human writing ever. Which Phoenician scribe knew that his innovation of the alphabet would make programming easier thousands of years later?Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul (sp?), at least to my untrained eye. And then others are just *really* interesting-looking, like: http://www.omniglot.com/writing/12480.htm http://www.omniglot.com/writing/ayeri.htm http://www.omniglot.com/writing/oxidilogi.htm You're right though, if I were in charge of Unicode and tasked with handling some of those, I think I'd just say "Screw it. Unicode is now depricated. Use ASCII instead. Doesn't have the characters for your langauge? Tough! Fix your language!" :)You think that's crazy, huh? Check this out: http://www.omniglot.com/writing/sumerian.htm Now take a deep breath... ... this writing was *actually used* in ancient times. Yeah. Which means it probably has a Unicode block assigned to it, right now. :-)
Apr 27 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.1.1335507187.22023.digitalmars-d puremagic.com...For example, just today I was playing around with a regex-based version of formattedRead: you pass in a regex and a bunch of pointers, and the function uses compile-time introspection to convert regex matches into the correct value types. So you could call it like this: int year; string month; int day; regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, &year, &month, &day); [...]That's pretty cool.
Apr 27 2012
On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:When I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it...I actually recently wrote a lexer generator for D that wouldn't be that hard to adapt to something like this.
Apr 26 2012
On Fri, Apr 27, 2012 at 04:12:25AM +0200, Matt Peterson wrote:On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:That's awesome! Would you like to give it a shot? ;-) Also, I'm in love with lexer generators... I'd love to make good use of your lexer generator if the code is available somewhere. T -- Nothing in the world is more distasteful to a man than to take the path that leads to himself. -- Herman HesseWhen I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it...I actually recently wrote a lexer generator for D that wouldn't be that hard to adapt to something like this.
Apr 26 2012
On 27.04.2012 5:36, H. S. Teoh wrote:
On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote: [...] Heh, any usage of Notepad *needs* to be justified. For example, it has an undo buffer of exactly ONE change.

Come on, Notepad is really nice for one job only: getting rid of the style and fonts of a copied text fragment. I use it as a clean-up scratch pool daily. Would be a shame if they ever add fonts and layout to it ;)
-- Dmitry Olshansky
Apr 27 2012
"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message news:jndkji$23ni$2 digitalmars.com...On 27.04.2012 5:36, H. S. Teoh wrote:I frequently wish I had a global setting for "Don't include style in the clipboard", and maybe a *separate* "Copy with style" command. Or at least a standard "copy without style", or "remove style from clipboard" command. *Something*. 99% of the times I copy/paste text I *don't* want to include style. Drives me crazy.On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote: [...]Come on, notepad is a real nice in one job only: getting rid of style and fonts of a copied text fragment. I use it as clean-up scratch pool daily. Would be a shame if they ever add fonts and layout to it ;)Heh, any usage of Notepad *needs* to be justified. For example, it has an undo buffer of exactly ONE change.
Apr 27 2012
On 27.04.2012 12:31, Nick Sabalausky wrote:"Dmitry Olshansky"<dmitry.olsh gmail.com> wrote in message news:jndkji$23ni$2 digitalmars.com...Yup I certainly wouldn't mind a separate "copy with my font settings" ;) -- Dmitry OlshanskyOn 27.04.2012 5:36, H. S. Teoh wrote:I frequently wish I had a global setting for "Don't include style in the clipboard", and maybe a *separate* "Copy with style" command. Or at least a standard "copy without style", or "remove style from clipboard" command. *Something*. 99% of the times I copy/paste text I *don't* want to include style. Drives me crazy.On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote: [...]Come on, notepad is a real nice in one job only: getting rid of style and fonts of a copied text fragment. I use it as clean-up scratch pool daily. Would be a shame if they ever add fonts and layout to it ;)Heh, any usage of Notepad *needs* to be justified. For example, it has an undo buffer of exactly ONE change.
Apr 27 2012
On 4/27/12, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:It's ironic how useless Notepad is compared to an ancient DOS program from the dinosaur age.If you run "edit" in command prompt or the run dialog (well, assuming you had a win32 box somewhere), you'd actually get a pretty decent dos-based editor that is still better than Notepad. It has split windows, a tab stop setting, and even a whole bunch of color settings. :P
Apr 26 2012
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message news:mailman.2183.1335491333.4860.digitalmars-d puremagic.com...On 4/27/12, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:Heh, I remember that :) Holy crap, even in XP, they updated it to use the Windows standard key combos for cut/copy/paste. I had no idea, all this time. Back in DOS, it used that old "Shift-Ins" stuff.It's ironic how useless Notepad is compared to an ancient DOS program from the dinosaur age.If you run "edit" in command prompt or the run dialog (well, assuming you had a win32 box somewhere), you'd actually get a pretty decent dos-based editor that is still better than Notepad. It has split windows, a tab stop setting, and even a whole bunch of color settings. :P
Apr 26 2012
On Friday, 27 April 2012 at 00:25:44 UTC, H. S. Teoh wrote:On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:I'm not sure if you or others knew or not (I didn't until just now as there hasn't been an announcement) but one of the accepted GSOC projects is extending unicode support by Dmitry Olshansky. Maybe take up this idea with him. https://www.google-melange.com/gsoc/project/google/gsoc2012/dolsh/31002 Regards, Brad Anderson"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.2173.1335475413.4860.digitalmars-d puremagic.com...[...]Correction: the official term for this is "full-width" (as opposed to the "half-width" of the typical European scripts).And don't forget that some code points (notably from the CJK block) are specified as "double-width", so if you're trying to do text layout, you'll want yet a different length (layoutLength?).Interesting. Kinda makes sense that such a thing exists, though: The CJK characters (even the relatively simple Japanese *kanas) are detailed enough that they need to be larger to achieve the same readability. And that's the *non*-double-length ones. So I don't doubt there's ones that need to be tagged as "Draw Extra Big!!" :)Have you seen U+9F98? It's an insanely convoluted glyph composed of *three copies* of an already extremely complex glyph. http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png (And yes, that huge thing is supposed to fit inside a SINGLE character... what *were* those ancient Chinese scribes thinking?!)For example, I have my font size in Windows Notepad set to a comfortable value. But when I want to use hiragana or katakana, I have to go into the settings and increase the font size so I can actually read it (Well, to what *little* extent I can even read it in the first place ;) ). And those kana's tend to be among the simplest CJK characters. (Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for real coding/writing).LOL... 
love the fact that you felt obligated to justify your use of notepad. :-PWe're lucky the more imaginative scribes of the world have either been dead for centuries or have restricted themselves to writing fictional languages. :-) The inventions of the dead ones have been codified and simplified by the unfortunate people who inherited their overly complex systems (*cough*CJK glyphs*cough), and the inventions of the living ones are largely ignored by the world due to the fact that, well, their scripts are only useful for writing fictional languages. :-) So despite the fact that there are still some crazy convoluted stuff out there, such as Arabic or Indic scripts with pair-wise substitution rules in Unicode, overall things are relatively tame. At least the subcomponents of CJK glyphs are no longer productive (actively being used to compose new characters by script users) -- can you imagine the insanity if Unicode had to support composition by those radicals and subparts? Or if Unicode had to support a script like this one: http://www.arthaey.com/conlang/ashaille/writing/sarapin.html whose components are graphically composed in, shall we say, entirely non-trivial ways (see the composed samples at the bottom of the page)?So we really need all four lengths. Ain't unicode fun?! :-)No kidding. The *one* thing I really, really hate about Unicode is the fact that most (if not all) of its complexity actually *is* necessary.Unicode *itself* is undisputably necessary, but I do sure miss ASCII.In an ideal world, where memory is not an issue and bus width is indefinitely wide, a Unicode string would simply be a sequence of integers (of arbitrary size). Things like combining diacritics, etc., would have dedicated bits/digits for representing them, so there's no need of the complexity of UTF-8, UTF-16, etc.. Everything fits into a single character. Every possible combination of diacritics on every possible character has a unique representation as a single integer. 
String length would be equal to glyph count. In such an ideal world, screens would also be of indefinitely detailed resolution, so anything can fit inside a single grid cell, so there's no need of half-width/double-width distinctions. You could port ancient ASCII-centric C code just by increasing sizeof(char), and things would Just Work. Yeah I know. Totally impossible. But one can dream, right? :-) [...][...] Currently, std.uni code (argh the pun!!) is hand-written with tables of which character belongs to which class, etc.. These hand-coded tables are error-prone and unnecessary. For example, think of computing the layout width of a UTF-8 stream. Why waste time decoding into dchar, and then doing all sorts of table lookups to compute the width? Instead, treat the stream as a byte stream, with certain sequences of bytes evaluating to length 2, others to length 1, and yet others to length 0. A lexer engine is perfectly suited for recognizing these kinds of sequences with optimal speed. The only difference from a real lexer is that instead of spitting out tokens, it keeps a running total (layout) length, which is output at the end. So what we should do is to write a tool that processes Unicode.txt (the official table of character properties from the Unicode standard) and generates lexer engines that compute various Unicode properties (grapheme count, layout length, etc.) for each of the UTF encodings. This way, we get optimal speed for these algorithms, plus we don't need to manually maintain tables and stuff, we just run the tool on Unicode.txt each time there's a new Unicode release, and the correct code will be generated automatically. TI've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all? 
What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs), codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding. The generation algo can then be run just once per unicode release, and everything will Just Work.
While I find that very interesting... I'm afraid I don't actually understand your suggestion :/ (I do understand FSMs and how they work, though.) Could you give a little example of what you mean?
Apr 26 2012
On Thursday, April 26, 2012 17:26:40 H. S. Teoh wrote:Currently, std.uni code (argh the pun!!) is hand-written with tables of which character belongs to which class, etc.. These hand-coded tables are error-prone and unnecessary. For example, think of computing the layout width of a UTF-8 stream. Why waste time decoding into dchar, and then doing all sorts of table lookups to compute the width? Instead, treat the stream as a byte stream, with certain sequences of bytes evaluating to length 2, others to length 1, and yet others to length 0. A lexer engine is perfectly suited for recognizing these kinds of sequences with optimal speed. The only difference from a real lexer is that instead of spitting out tokens, it keeps a running total (layout) length, which is output at the end. So what we should do is to write a tool that processes Unicode.txt (the official table of character properties from the Unicode standard) and generates lexer engines that compute various Unicode properties (grapheme count, layout length, etc.) for each of the UTF encodings. This way, we get optimal speed for these algorithms, plus we don't need to manually maintain tables and stuff, we just run the tool on Unicode.txt each time there's a new Unicode release, and the correct code will be generated automatically.That's a fantastic idea! Of course, that leaves the job of implementing it... :) - Jonathan M Davis
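To make the byte-stream idea a bit more concrete, here is a hand-written toy in D. It is only a sketch of what a generated recognizer might boil down to, not output from any real tool: it assumes well-formed UTF-8, and it knows just two special ranges picked for illustration (combining diacritics U+0300..U+036F as width 0, CJK unified ideographs U+4E00..U+9FFF as width 2). Everything else counts as width 1. The function name `layoutWidthSketch` is invented here.

```d
/// Toy layout-width counter that never decodes to dchar: it branches on
/// the raw UTF-8 code units only. Assumes valid UTF-8 input.
size_t layoutWidthSketch(const(char)[] s)
{
    size_t width = 0;
    size_t i = 0;
    while (i < s.length)
    {
        immutable uint b0 = s[i];
        // Second code unit, or 0 if the input is truncated.
        immutable uint b1 = i + 1 < s.length ? s[i + 1] : 0;

        if (b0 < 0x80)
        {
            // ASCII: one byte, one column.
            width += 1;
            i += 1;
        }
        else if (b0 < 0xE0)
        {
            // Two-byte sequence. U+0300..U+036F encodes as CC 80 .. CD AF.
            immutable bool combining =
                (b0 == 0xCC && b1 >= 0x80) ||
                (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
            width += combining ? 0 : 1;
            i += 2;
        }
        else if (b0 < 0xF0)
        {
            // Three-byte sequence. U+4E00..U+9FFF encodes as
            // E4 B8 80 .. E9 BF BF.
            immutable bool wide =
                (b0 == 0xE4 && b1 >= 0xB8) ||
                (b0 >= 0xE5 && b0 <= 0xE9);
            width += wide ? 2 : 1;
            i += 3;
        }
        else
        {
            // Four-byte sequence: treated as width 1 in this toy.
            width += 1;
            i += 4;
        }
    }
    return width;
}
```

A generated version would cover the full property tables and merge equivalent states; the point of the toy is only that the lead byte (plus at most one more byte) already decides the width, so no dchar is ever materialized.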
Apr 26 2012
On 27.04.2012 1:23, H. S. Teoh wrote:On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:Of course they are generated."James Miller"<james aatch.net> wrote in message news:qdgacdzxkhmhojqcettj forum.dlang.org...[...] And don't forget that some code points (notably from the CJK block) are specified as "double-width", so if you're trying to do text layout, you'll want yet a different length (layoutLength?). So we really need all four lengths. Ain't unicode fun?! :-) Array length is simple. Walklength is already implemented. Grapheme length requires recognition of 'combining characters' (or rather, ignoring said characters), and layout length requires recognizing widthless, single- and double-width characters. I've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all?I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and 16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is is reasonable for the compiler to pick this up during semantic analysis and point out this situation? It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases. 
Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").

What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs),

FSAs are based on tables, so it all runs in a circle. Only the layout changes. Yet the speed gains of non-decoding are huge.

codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding. The generation algo can then be run just once per unicode release, and everything will Just Work.

This year Unicode in D will receive a nice upgrade. Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)
-- Dmitry Olshansky
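The difference between the lengths discussed in this post can be shown with a small D snippet. It assumes a Phobos version that ships `std.uni.byGrapheme` (added after this thread was written); the `Lengths` struct and `measure` function are names made up for the illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Three of the lengths from the discussion, for one string.
struct Lengths
{
    size_t codeUnits;   // array length: UTF-8 bytes
    size_t codePoints;  // walkLength: decoded code points
    size_t graphemes;   // user-perceived characters
}

Lengths measure(string s)
{
    return Lengths(s.length, s.walkLength, s.byGrapheme.walkLength);
}
```

For `"e\u0301"` (an `e` followed by a combining acute accent) this gives 3 code units, 2 code points and 1 grapheme, which is the gap between `length`, `walkLength` and the "character" count that the post points out.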
Apr 27 2012
On Fri, Apr 27, 2012 at 12:20:13PM +0400, Dmitry Olshansky wrote:On 27.04.2012 1:23, H. S. Teoh wrote:[...]Yes, but hand-coded tables tend to go out of date, be prone to bugs, or are missing optimizations done by an FSA generator (e.g. a lexer generator). Collapsed FSA states, for example, can greatly reduce table size and speed things up.What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs),FSA are based on tables so it's all runs in the circle. Only the layout changes. Yet the speed gains of non-decoding are huge.codegen'd, and then optimized again at the assembly level by the[...] One area where autogenerated Unicode algos will be very useful is in normalization. Unicode normalization is non-trivial, to say the least; it involves looking up various character properties and performing mappings between them in a specified order. If we can encode this process as FSA, then we can let an automated FSA optimizer produce code that maps directly between the (non-decoded!) source string and the target (non-decoded!) normalized string. Similar things can be done for string concatenation (which requires arbitrarily-distant scanning in either direction from the joining point, though in normal use cases the distance should be very short). T -- Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANGcompiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding. The generation algo can then be run just once per unicode release, and everything will Just Work.This year Unicode in D will receive a nice upgrade. Anyway keep me posted if you have these FSA ever come to soil your sleep ;)
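As a small illustration of why normalization and concatenation interact, here is a D sketch using `std.uni.normalize` from current Phobos. It is the ordinary decode-and-lookup library route, not the FSA approach proposed above, and the helper name `sameText` is invented for the example.

```d
import std.uni : NFC, normalize;

/// Compare two strings after bringing both into NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}
```

The precomposed `"\u00E9"` and the two-code-point `"e\u0301"` differ byte for byte, yet `sameText` treats them as equal. The same holds for `"e" ~ "\u0301"`: each operand is already in NFC on its own, but their concatenation is not, which is why the joining point has to be rescanned.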
Apr 27 2012