digitalmars.D - Of possible interest: fast UTF8 validation
- Andrei Alexandrescu (1/1) May 16 2018 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_str...
- Ethan Watson (10/11) May 16 2018 I re-implemented some common string functionality at Remedy using
- Andrei Alexandrescu (6/18) May 16 2018 Is it workable to have a runtime-initialized flag that controls using
- Ethan Watson (14/16) May 16 2018 Sure, it's workable with these kind of speed gains. Although the
- Walter Bright (26/29) May 16 2018 I used to do things like that a simpler way. 3 functions would be create...
- Ethan (28/41) May 16 2018 It certainly sounds reasonable enough for 99% of use cases. But
- Walter Bright (3/8) May 16 2018 Linkers already do that. Alignment is specified on all symbols emitted b...
- rikki cattermole (2/11) May 16 2018 Would allowing align attribute on functions, make sense here for Ethan?
- Ethan (16/19) May 17 2018 Mea culpa. Upon further thinking, two things strike me:
- xenon325 (16/26) May 16 2018 Is this basically the same as Function MultiVersioning [1] ?
- Walter Bright (2/5) May 16 2018 It would be nice to get this technique put into std.algorithm!
- Jack Stouffer (6/7) May 16 2018 D doesn't seem to have C definitions for the x86 SIMD intrinsics,
- Ethan Watson (13/15) May 16 2018 Replying to highlight this.
- David Nadlinger (24/27) May 18 2018 To provide some context here: LDC only supports the types from
- Joakim (4/5) May 16 2018 Sigh, this reminds me of the old quote about people spending a
- Dmitry Olshansky (7/13) May 16 2018 Validating UTF-8 is super common, most text protocols and files
- Joakim (4/18) May 16 2018 I think you know what I'm referring to, which is that UTF-8 is a
- Jack Stouffer (6/9) May 16 2018 UTF-8 seems like the best option available given the problem
- Andrei Alexandrescu (9/27) May 16 2018 I find this an interesting minority opinion, at least from the
- Walter Bright (5/7) May 16 2018 Me too. I think UTF-8 is brilliant (and I suffered for years under the l...
- Jonathan M Davis (22/29) May 16 2018 I'm inclined to think that the redundancy is a serious flaw. I'd argue t...
- Joakim (39/73) May 16 2018 Thanks for the link, skipped to the part about text encodings,
- Patrick Schluter (45/122) May 17 2018 This is not practical, sorry. What happens when your message
- Joakim (46/91) May 17 2018 Why would it lose the header? TCP guarantees delivery and
- Patrick Schluter (98/192) May 17 2018 What does TCP/IP got to do with anything in discussion here.
- H. S. Teoh (23/24) May 17 2018 Yes. Imagine if we standardized on a header-based string encoding, and
- Patrick Schluter (6/21) May 17 2018 That's what rtf with code pages was essentially. I'm happy that
- Neia Neutuladh (35/60) May 18 2018 You'd have three data structures: Strand, Rope, and Slice.
- Andrei Alexandrescu (2/4) May 17 2018 Impressive! Is that the Europarl?
- Patrick Schluter (24/28) May 17 2018 No, Euramis. The central translation memory developed by the
- Walter Bright (4/13) May 17 2018 It sounds like the main issue is that a header based encoding would take...
- Dmitry Olshansky (10/27) May 17 2018 Indeed, and some other compression/deduplication options that
- Ethan (7/9) May 17 2018 Quoting to highlight and agree.
- Joakim (37/73) May 18 2018 The point wasn't that TCP is handling all the errors, it was a
- Nemanja Boric (5/12) May 18 2018 Welcome to my world (and probably world of most Europeans) where
- Joakim (41/67) May 17 2018 In general, you would be wrong, a carefully designed binary
- H. S. Teoh (10/27) May 17 2018 My bet is on the LZW being *far* better than a header-based encoding.
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
May 16 2018
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/

I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

The code linked doesn't seem to use any instructions newer than SSE2, so it's perfectly safe to run on any x64 processor. Could probably be sped up with newer SSE instructions if you're only ever running internally on hardware you control.
May 16 2018
On 05/16/2018 08:47 AM, Ethan Watson wrote:
> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>
> I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

Is it workable to have a runtime-initialized flag that controls using SSE vs. conservative?

> The code linked doesn't seem to use any instructions newer than SSE2, so it's perfectly safe to run on any x64 processor. Could probably be sped up with newer SSE instructions if you're only ever running internally on hardware you control.

Even better! Contributions would be very welcome.

Andrei
May 16 2018
On Wednesday, 16 May 2018 at 13:54:05 UTC, Andrei Alexandrescu wrote:
> Is it workable to have a runtime-initialized flag that controls using SSE vs. conservative?

Sure, it's workable with these kinds of speed gains. Although the conservative code path ends up being slightly worse off - an extra fetch, compare and branch get introduced.

My preferred method though is to just build multiple sets of binaries as DLLs/SOs/DYNLIBs, then load in the correct libraries dependent on the CPUID test at program initialisation. Current Xbox/Playstation hardware is pretty terrible when it comes to branching, so compiling with minimal branching and deploying the exact binaries for the hardware capabilities is the way I generally approach things.

We never got around to setting something like that up for the PC release of Quantum Break, although we definitely talked about it.
May 16 2018
On 5/16/2018 7:38 AM, Ethan Watson wrote:
> My preferred method though is to just build multiple sets of binaries as DLLs/SOs/DYNLIBs, then load in the correct libraries dependent on the CPUID test at program initialisation.

I used to do things like that a simpler way. 3 functions would be created:

  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature(). It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible.

The advantage is there was only one binary.

----

The PDP-11 had an optional chipset to do floating point. The compiler generated function calls that emulated the floating point:

  call FPADD
  call FPSUB
  ...

Those functions would check to see if the FPU existed. If it did, it would in-place patch the binary to replace the calls with FPU instructions! Of course, that won't work these days because of protected code pages.

----

In the bad old DOS days, emulator calls were written out by the compiler. Special relocation fixup records were emitted for them. The emulator or the FPU library was then linked in, and included special relocation fixup values which tricked the linker fixup mechanism into patching those instructions with either emulator calls or FPU instructions. It was just brilliant!
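[Editor's note: a minimal D sketch of the lazy-dispatch pattern described above. It is not from the original post; the use of core.cpuid's sse42 check and the validator signatures are assumptions made purely for illustration.]

  import core.cpuid : sse42;

  alias Validator = bool function(const(char)[]);

  // Placeholders standing in for a real hardware/emulated pair.
  bool featureInHardware(const(char)[] s) { /* SSE 4.2 code path */ return true; }
  bool emulateFeature(const(char)[] s)    { /* portable fallback */  return true; }

  // The first call goes through select(), which swaps the pointer to the real
  // implementation; every later call pays only the one indirect call.
  bool select(const(char)[] s)
  {
      doIt = sse42 ? &featureInHardware : &emulateFeature;
      return doIt(s);
  }

  Validator doIt = &select;   // all callers go through doIt(...)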
May 16 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature(). It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible. The advantage is there was only one binary.

It certainly sounds reasonable enough for 99% of use cases. But I'm definitely the 1% here ;-)

Indirect calls invoke the wrath of the branch predictor on XB1/PS4 (ie an AMD Jaguar processor). But there's certainly some more interesting non-processor behaviour, at least on MSVC compilers. The provided auto-DLL loading in that environment performs a call to your DLL-boundary-crossing function, which actually winds up in a jump table that performs a jump instruction to actually get to your DLL code. I suspect this is more costly than the indirect jump at a "write a basic test" level. Doing an indirect call as the only action in a for-loop is guaranteed to bring out the costly branch predictor on the Jaguar.

Without getting in and profiling a bunch of stuff, I'm not entirely sure which approach I'd prefer in the general case. Certainly, as far as this particular thread goes, every general purpose function of a few lines that I write that uses intrinsics is forced inline. No function calls, indirect or otherwise. And on top of that, the inlined code usually pushes the branches in the code out across byte boundary lines just far enough that the simple branch predictor is only ever invoked.

(Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
May 16 2018
On 5/16/2018 10:28 AM, Ethan wrote:
> (Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)

Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.
May 16 2018
On 17/05/2018 8:34 AM, Walter Bright wrote:
> On 5/16/2018 10:28 AM, Ethan wrote:
>> (Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
>
> Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.

Would allowing an align attribute on functions make sense here for Ethan?
May 16 2018
And at the risk of getting this topic back on track:

On Wednesday, 16 May 2018 at 20:34:26 UTC, Walter Bright wrote:
> Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.

Mea culpa. Upon further thinking, two things strike me:

1) As suggested, there's no way to instruct the front-end to align functions to byte boundaries outside of "optimise for speed" command line flags
2) I would have heavily relied on incremental linking to iterate on these tests when trying to work out how the processor behaved. I expect MSVC's incremental linker would turn out to be just rubbish enough to not care about how those flags originally behaved.

On Wednesday, 16 May 2018 at 20:36:10 UTC, Walter Bright wrote:
> It would be nice to get this technique put into std.algorithm!

The code I wrote originally was C++ code with intrinsics. But I can certainly look at adapting it to DMD/LDC. The DMD frontend providing natural mappings for Intel's published intrinsics would be massively beneficial here.
May 17 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature(). It costs an indirect call [...]

Is this basically the same as Function MultiVersioning [1]?

I never had a need to use it and always wondered how it works out in real life. From the description it seems this would incur an indirection:

"To keep the cost of dispatching low, the IFUNC [2] mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump indirect instruction."

In the linked article [2] Ian Lance Taylor says glibc uses this for memcpy(), so this should be pretty efficient (but then again, one doesn't call memcpy() in hot loops too often).

[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning
[2] https://www.airs.com/blog/archives/403

-- Alexander
May 16 2018
On 5/16/2018 5:47 AM, Ethan Watson wrote:
> I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

It would be nice to get this technique put into std.algorithm!
May 16 2018
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/

D doesn't seem to have C definitions for the x86 SIMD intrinsics, which is a bummer: https://issues.dlang.org/show_bug.cgi?id=18865

It's too bad that nothing came of std.simd.
May 16 2018
On Wednesday, 16 May 2018 at 14:25:07 UTC, Jack Stouffer wrote:
> D doesn't seem to have C definitions for the x86 SIMD intrinsics, which is a bummer

Replying to highlight this.

There's core.simd, which doesn't look anything like the SSE/AVX intrinsics at all and looks a lot more like a wrapper for writing assembly instructions directly. And even better - LDC doesn't support core.simd and has its own intrinsics that don't match the SSE/AVX intrinsics API published by Intel. And since I'm a multi-platform developer, the "What about NEON intrinsics?" question always sits in the back of my mind.

I ended up implementing my own SIMD primitives in Binderoo, but they're all versioned out for LDC at the moment until I look into it and complete the implementation.
May 16 2018
On Wednesday, 16 May 2018 at 14:48:54 UTC, Ethan Watson wrote:
> And even better - LDC doesn't support core.simd and has its own intrinsics that don't match the SSE/AVX intrinsics API published by Intel.

To provide some context here: LDC only supports the types from core.simd, but not the __simd "assembler macro" that DMD uses to more or less directly emit the corresponding x86 opcodes.

LDC does support most of the GCC-style SIMD builtins for the respective target (x86, ARM, …), but there are two problems with this:

1) As Ethan pointed out, the GCC API does not match Intel's intrinsics; for example, it is `__builtin_ia32_vfnmsubpd256_mask3` instead of `_mm256_mask_fnmsub_pd`, and the argument orders differ as well.

2) The functions that LDC exposes as intrinsics are those that are intrinsics on the LLVM IR level. However, some operations can be directly represented in normal, instruction-set-independent LLVM IR – no explicit intrinsics are provided for these.

Unfortunately, LLVM doesn't seem to provide any particularly helpful tools for implementing Intel's intrinsics API. x86intrin.h is manually implemented for Clang as a collection of various macros and functions. It would be seriously cool if someone could write a small tool to parse those headers, (semi-)automatically convert them to D, and generate tests for comparing the emitted IR against Clang. I'm happy to help with the LDC side of things.

— David
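[Editor's note: for reference, a tiny sketch of the portable subset mentioned above - the core.simd vector types with their built-in element-wise operators, which both DMD and LDC accept; no __simd, no GCC-style builtins, no Intel intrinsics involved.]

  import core.simd;

  // Element-wise addition on a 128-bit vector type; on x86 this should lower
  // to a single PADDD without any explicit intrinsic call.
  int4 addFours(int4 a, int4 b)
  {
      return a + b;
  }

  void main()
  {
      int4 x = [1, 2, 3, 4];
      int4 y = [10, 20, 30, 40];
      int4 z = addFours(x, y);  // [11, 22, 33, 44]
  }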
May 18 2018
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/

Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
May 16 2018
On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>
> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.

Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so.

I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
May 16 2018
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>
>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>
> Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...

I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
May 16 2018
On Wednesday, 16 May 2018 at 17:18:06 UTC, Joakim wrote:
> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

UTF-8 seems like the best option available given the problem space. Junk data is going to be a problem with any possible string format, given that encoding translations and programmer error will always be prevalent.
May 16 2018
On 5/16/18 1:18 PM, Joakim wrote:
> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>
>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>
>> Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
>
> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Andrei
May 16 2018
On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!

Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.
May 16 2018
On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
>> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!
>
> Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder! Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.

I'm inclined to think that the redundancy is a serious flaw. I'd argue that if it were truly well-designed, there would be exactly one way to represent every character - including clear up to grapheme clusters where multiple code points are involved (i.e. there would be no normalization issues in valid Unicode, because there would be only one valid normalization). But there may be some technical issues that I'm not aware of that would make that problematic.

Either way, the issues that I have with UTF-8 are issues that UTF-16 and UTF-32 have as well, since they're really issues relating to code points. Overall, I think that UTF-8 is by far the best encoding that we have, and I don't think that we're going to get anything better, but I'm also definitely inclined to think that it's still flawed - just far less flawed than the alternatives.

And in general, I have to wonder if there would be a way to make Unicode less complicated if we could do it from scratch without worrying about any kind of compatibility, since what we have is complicated enough that most programmers don't come close to understanding it, and it's just way too hard to get right. But I suspect that if efficiency matters, there's enough inherent complexity that we'd just be screwed on that front even if we could do a better job than was done with Unicode as we know it.

- Jonathan M Davis
May 16 2018
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
> On 5/16/18 1:18 PM, Joakim wrote:
>> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>>
>>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>>
>>> Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
>>
>> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
>
> I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.

> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does.

This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again. I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992 as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is that those priorities are at all relevant today, when billions of smartphone users are regularly not using ASCII, and these tech companies are the largest private organizations on the planet, ie they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.

I think a header-based scheme would be _much_ better today and the reason I know Dmitry knows that is that I have discussed privately with him over email that I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.
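[Editor's note: to make the discussion concrete, here is one purely hypothetical shape such a header-based format could take. Nothing in this sketch comes from the prototype mentioned above, which is not specified in the thread; every name and field is an illustrative assumption.]

  // Hypothetical: each segment declares a language/code-page id and a length,
  // and its payload is then (mostly) single-byte in that language's table.
  struct Segment
  {
      ushort lang;    // illustrative language/code-page identifier
      uint length;    // payload bytes covered by this segment
  }

  struct HeaderString
  {
      Segment[] header;   // the "header": one entry per language run
      ubyte[] payload;    // concatenated segment payloads
  }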
May 16 2018
On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
> On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
>> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!
>
> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does.
> [...]
> Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.

This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled. That's exactly what happened with code page based texts when you don't know in which code page it is encoded. It has the supplemental inconvenience that mixing languages becomes impossible or at least very cumbersome.

UTF-8 has several properties that are difficult to have with other schemes.
- It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or a previous byte.
- It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check wikipedia's front page for example).
- The multi-byte nature of other alphabets is not as bad as people think, because texts in computers do not live on their own, meaning that they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few bytes more in the text do not weigh that much.

I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002, when we handled only 11 languages, of which only 1 was of another alphabet (Greek). Everything was based on RTF with codepages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings it was completely unmanageable because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).

2 years ago we also implemented support for Chinese. The nice thing was that we didn't have to change much to do that, thanks to Unicode. The second surprise was with the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but generally 1 ideogram replaces many letters. The ideogram 亿 replaces "one hundred million", for example; which of them takes more bytes? So if CJK indeed requires more bytes to encode, it is firstly because they NEED many more bits in the first place (there are around 30000 CJK codepoints in the BMP alone; add to it the 60000 that are in the SIP and we need 17 bits just to encode them).
May 17 2018
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
> This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled.

Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.

I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format, it's better done by other layers, like transport protocols or filesystems, which will guard against such losses much more reliably and efficiently. For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.

> That's exactly what happened with code page based texts when you don't know in which code page it is encoded. It has the supplemental inconvenience that mixing languages becomes impossible or at least very cumbersome. UTF-8 has several properties that are difficult to have with other schemes.
> - It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or a previous byte.

I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.

> - It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check wikipedia's front page for example).

I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.

> - The multi-byte nature of other alphabets is not as bad as people think, because texts in computers do not live on their own, meaning that they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few bytes more in the text do not weigh that much.

Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

> I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages and without UTF-8 and UTF-16 it would be unmanageable. [...] (addresses and names are often written in their original form, for instance).

I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.

> 2 years ago we also implemented support for Chinese. The nice thing was that we didn't have to change much to do that, thanks to Unicode. The second surprise was with the file sizes: Chinese documents were generally smaller than their European counterparts. [...]

That's not the relevant criteria: nobody cares if the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Because almost nobody cares about which translation version is smaller, they care that the text they sent in Chinese or Korean is as small as it can be.

Anyway, I didn't mean to restart this debate, so I'll leave it here.
May 17 2018
On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
> On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
>> This is not practical, sorry. What happens when your message loses the header?
>
> Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.

What does TCP/IP have to do with anything in the discussion here? UTF-8 (or UTF-16 or UTF-32) has nothing to do with network protocols. That's completely unrelated. A file encoded on a disk may never leave the machine it is written on and may never see a wire in its lifetime, and its encoding is still of vital importance. That's why a header encoding is too restrictive.

> I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format, it's better done by other layers, like transport protocols or filesystems, which will guard against such losses much more reliably and efficiently.

No. A text format cannot depend on a network protocol. It would be as if you could only listen to a music file or watch a video on streaming and never save it to an offline file, because the information of what that blob of bytes represents would be stored nowhere. It doesn't make any sense.

> For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.

That's the job of the other layers. Any other file would have the same problem. At least with UTF-8 there will only ever be at most 1 codepoint lost or changed; no other encoding would fare better. This said, if a checksum header for your document is important you can add it externally anyway.

> I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.

Again, orthogonal to UTF-8. When I speak above of streams it is not limited to sockets; files are also read in streams. So stop equating UTF-8 with the Internet, these are 2 different domains. The Internet and its protocols were defined and invented long before Unicode, and Unicode is very useful offline as well.

> I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.

Ok, show me how you transmit that, I'm curious:

 <prop type="Txt::Doc. No.">E2010C0002</prop>
 <tuv lang="EN-GB"><seg>EFTA Surveillance Authority Decision</seg></tuv>
 <tuv lang="DE-DE"><seg>Beschluss der EFTA-Überwachungsbehörde</seg></tuv>
 <tuv lang="DA-01"><seg>EFTA-Tilsynsmyndighedens beslutning</seg></tuv>
 <tuv lang="EL-01"><seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg></tuv>
 <tuv lang="ES-ES"><seg>Decisión del Órgano de Vigilancia de la AELC</seg></tuv>
 <tuv lang="FI-01"><seg>EFTAn valvontaviranomaisen päätös</seg></tuv>
 <tuv lang="FR-FR"><seg>Décision de l'Autorité de surveillance AELE</seg></tuv>
 <tuv lang="IT-IT"><seg>Decisione dell’Autorità di vigilanza EFTA</seg></tuv>
 <tuv lang="NL-NL"><seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg></tuv>
 <tuv lang="PT-PT"><seg>Decisão do Órgão de Fiscalização da EFTA</seg></tuv>
 <tuv lang="SV-SE"><seg>Beslut av Eftas övervakningsmyndighet</seg></tuv>
 <tuv lang="LV-01"><seg>EBTA Uzraudzības iestādes Lēmums</seg></tuv>
 <tuv lang="CS-01"><seg>Rozhodnutí Kontrolního úřadu ESVO</seg></tuv>
 <tuv lang="ET-01"><seg>EFTA järelevalveameti otsus</seg></tuv>
 <tuv lang="PL-01"><seg>Decyzja Urzędu Nadzoru EFTA</seg></tuv>
 <tuv lang="SL-01"><seg>Odločba Nadzornega organa EFTE</seg></tuv>
 <tuv lang="LT-01"><seg>ELPA priežiūros institucijos sprendimas</seg></tuv>
 <tuv lang="MT-01"><seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg></tuv>
 <tuv lang="SK-01"><seg>Rozhodnutie Dozorného orgánu EZVO</seg></tuv>
 <tuv lang="BG-01"><seg>Решение на Надзорния орган на ЕАСТ</seg></tuv>
</tu>
<tu>

> Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

They aren't dying off, it's getting worse by the day; that's why I mentioned Akoma Ntoso and XLIFF, they will be used more and more. The world is not limited to webshit (see n-gate.com for the reference).

> I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.

I doubt it, because the issue has nothing to do with network protocols as you seem to imply. It is about data format, i.e. the content that may be shuffled over a net, but can also stay on a disk, be printed on paper (gasp, so old tech) or used interactively in a GUI.

> That's not the relevant criteria: nobody cares if the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Because almost nobody cares about which translation version is smaller, they care that the text they sent in Chinese or Korean is as small as it can be.

At most 50% more, but if the size is really that important one can use UTF-16, which is the same size as Big-5 or Shift-JIS, or as Walter suggested they would better compress the file in that case.

> Anyway, I didn't mean to restart this debate, so I'll leave it here.

- the auto-synchronization and the statelessness are big deals.
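[Editor's note: a small sketch of what the auto-synchronization property buys in practice - from an arbitrary byte offset you can find the start of the current code point purely locally, because continuation bytes always have the form 10xxxxxx. This code is illustrative and not from the post.]

  // Back up from any byte index to the first byte of the code point it falls
  // in; no header, table, or earlier state is needed.
  size_t codePointStart(const(ubyte)[] s, size_t i)
  {
      while (i > 0 && (s[i] & 0xC0) == 0x80)
          --i;    // 0b10xxxxxx marks a UTF-8 continuation byte
      return i;
  }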
May 17 2018
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote:
[...]
> - the auto-synchronization and the statelessness are big deals.

Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents? Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

T

--
Famous last words: I *think* this will work...
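[Editor's note: for contrast, the cheap slicing referred to above as it exists in D today - a substring of a UTF-8 string is just a view of the same bytes. Illustrative sketch.]

  // Slicing a D string is O(1), copies nothing and allocates nothing; the only
  // requirement is that the offsets fall on code unit boundaries, which UTF-8's
  // self-synchronization lets you check locally.
  void main()
  {
      string s = "EFTA Surveillance Authority Decision";
      string sub = s[5 .. 17];    // shares s's memory
      assert(sub == "Surveillance");
  }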
May 17 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms. [...]

That's what rtf with code pages was, essentially. I'm happy that we got rid of it and that it was replaced by xml; even if Microsoft's document xml is a bloated, ridiculous mess, it's still an order of magnitude less problematic than rtf (I mean at the text encoding level).
May 17 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote:
> [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

You'd have three data structures: Strand, Rope, and Slice. A Strand is a series of bytes with an encoding. A Rope is a series of Strands. A Slice is a pair of location references within a Rope. You probably want a special datastructure to name a location within a Rope: Strand offset, then byte offset. Total of five words instead of two to pass a Slice, but zero dynamic allocations.

This would be a problem for data locality. However, rope-style datastructures are handy for some types of string manipulation.

As an alternative, you might have a separate document specifying what encodings apply to what byte ranges. Slices would then be three words long (pointer to the string struct, start offset, end offset). Iterating would cost O(log(S) + M), where S is the number of encoded segments and M is the number of bytes in the slice.

Anyway, you either get a more complex data structure, or you have terrible time complexity, but you don't have both.

> And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents? Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

"Header" implies that all encoding data appears at the start of the document, or in a separate metadata segment. (Call it a start index and two bytes to specify the encoding; reserve the first few bits of the encoding to specify the width.) It also brings to mind HTTP, and reminds me that most documents are either mostly ASCII or a heavy mix of ASCII and something else (HTML and XML being the forerunners).

If the encoding succeeded at making most scripts single-byte, then, testing with https://ar.wikipedia.org/wiki/Main_Page, you might get within 15% of UTF-8's efficiency. And then a simple sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as long in this encoding as UTF-8, since it has ten encoded segments, each with overhead. (Assuming the header supports strings up to 2^32 bytes long.)

If it didn't succeed at making Latin and Arabic single-byte scripts (and Latin contains over 800 characters in Unicode, while Arabic has over three hundred), it would be worse than UTF-16.
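[Editor's note: a rough D sketch of the three structures described above; the exact field layout is an assumption, shown only to make the "five words" count concrete.]

  struct Strand { ubyte[] bytes; ushort encoding; } // bytes in one known encoding
  struct Rope   { Strand[] strands; }               // a text is a sequence of strands

  struct Loc    { size_t strand; size_t offset; }   // strand index + byte offset
  struct Slice                                      // rope pointer + two Locs:
  {                                                 // five words, zero allocations
      Rope* rope;
      Loc   start;
      Loc   end;
  }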
May 18 2018
On 05/17/2018 09:14 AM, Patrick Schluter wrote:
> I'm in charge at the European Commission of the biggest translation memory in the world.

Impressive! Is that the Europarl?
May 17 2018
On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu wrote:
> On 05/17/2018 09:14 AM, Patrick Schluter wrote:
>> I'm in charge at the European Commission of the biggest translation memory in the world.
>
> Impressive! Is that the Europarl?

No, Euramis. The central translation memory developed by the Commission and used also by the other institutions. The database contains more than a billion segments from parallel texts and is afaik the biggest of its kind. One of the big strengths of the Euramis TM is its multi-target-language store; this allows fuzzy searches in all combinations, including indirect translations (i.e. if a document written in English was translated into Romanian and into Maltese, it is then possible to search for alignments between ro and mt). It's not the only system to do that, but at that volume it is quite unique.

Every year we also publish an extract of it covering the published legislation [1] from the Official Journal so that it can be used by the research community. All the machine translation engines use it. It is one of the most accessed data collections on the European Open Data portal [2].

The very uncommon thing about the backend software of EURAMIS is that it is written in C. Pure unadulterated C. I'm trying to introduce D, but with the strange (to say it politely) configurations our servers have it is quite challenging.

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
[2]: http://data.europa.eu/euodp/fr/data
May 17 2018
On 5/16/2018 10:01 PM, Joakim wrote:
> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.

It sounds like the main issue is that a header based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
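[Editor's note: a quick way one might test that hypothesis, sketched with std.zlib - DEFLATE rather than LZW, so only an approximation of what Walter suggests, and the sample string is arbitrary.]

  import std.stdio : writefln;
  import std.zlib : compress;

  void main()
  {
      // Any non-ASCII UTF-8 sample would do; Cyrillic is ~2 bytes per letter.
      string sample = "Решение на Надзорния орган на ЕАСТ";
      auto packed = compress(sample);   // general-purpose compression layer
      writefln("raw UTF-8: %s bytes, compressed: %s bytes",
               sample.length, packed.length);
  }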
May 17 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, [...]
>
> It sounds like the main issue is that a header based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

Indeed, and there are some other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element, for instance).

Anything that depends on external information and is not self-sync is awful for interchange. Internally the application can do some smarts, though; even then things like interning (partial interning) might be a more valuable approach.

TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.
May 17 2018
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.

Quoting to highlight and agree. TCP is reliable because it resends dropped packets and delivers them in order. I don't write TCP packets to my long-term storage medium. UTF as a transportation protocol? Unicode is *far* more useful than just sending across a network.
May 17 2018
On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
> On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
>> TCP being reliable just plain doesn't cut it. Corruption of a single bit is very real.
> Quoting to highlight and agree. TCP is reliable because it resends dropped packets and delivers them in order. I don't write TCP packets to my long-term storage medium. UTF as a transportation protocol for Unicode is *far* more useful than just sending across a network.

The point wasn't that TCP is handling all the errors; it was a throwaway example of one other layer of the system, the network transport layer, that actually has a checksum which will detect a single bit flip, something UTF-8 usually will not. I mentioned that the filesystem and several other layers have their own error detection too, yet you guys latch on to the TCP example alone.

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote:
> [...]
>> - the auto-synchronization and the statelessness are big deals.
> Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementational nightmare and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

As we discussed when I first raised this header scheme years ago, you're right that slicing could be more expensive, depending on whether you choose to allocate a new header for the substring or not. The question is whether the optimizations available from such a header, which tells you where all the language substrings are in a multi-language string, make up for having to expensively process the entire UTF-8 string to get that or other data. I think it's fairly obvious the design tradeoff of the header would beat out UTF-8 for all but a few degenerate cases, but maybe you don't see it.

> And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?

It would bloat the header to some extent, but less so than a UTF-8 string. You may want to use special header encodings for such edge cases too, if you want to maintain the same large performance lead over UTF-8 that you'd have for the common case.

> Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into three segments, complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

Personally, I don't consider emojis worth implementing :) they shouldn't be part of Unicode. But since they are, I'm fairly certain header-based text messages with emojis would be significantly smaller than using UTF-8/16. I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why this was, and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS itself, but I strongly suspect it is widespread.

Anyway, I can see the arguments about UTF-8 this time around are as bad as the first time I raised this five years back, so I'll leave this thread here.
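To make the slicing tradeoff in this exchange concrete, here is a rough D sketch. The HeaderString layout below is purely hypothetical, invented only for illustration and not anything specified in this thread; the point is simply that built-in D strings already give O(1), allocation-free UTF-8 slicing, while any metadata-carrying representation has to rebuild that metadata when sliced.

import std.stdio;

// A purely hypothetical header-based layout, for illustration only:
// each Run says "the next `length` payload bytes are in language `lang`".
struct Run { ushort lang; size_t length; }

struct HeaderString
{
    Run[] header;    // one entry per language segment
    ubyte[] payload; // single-byte text for each segment

    // Taking a substring by payload offsets means walking the header and
    // building a new one, so unlike a D slice it allocates and costs
    // O(number of runs) rather than O(1).
    HeaderString substring(size_t from, size_t to) const
    {
        Run[] newHeader;
        size_t pos = 0;
        foreach (run; header)
        {
            immutable start = pos;
            immutable end = pos + run.length;
            if (end > from && start < to)
            {
                immutable lo = from > start ? from : start;
                immutable hi = to < end ? to : end;
                newHeader ~= Run(run.lang, hi - lo);
            }
            pos = end;
        }
        return HeaderString(newHeader, payload[from .. to].dup);
    }
}

void main()
{
    // Built-in UTF-8: a substring is just a slice -- O(1), no copy.
    string s = "Hello, мир!";
    writeln(s[7 .. 13]); // "мир"

    // The header-based sketch must rebuild metadata for the same operation.
    auto h = HeaderString([Run(1, 5), Run(2, 5), Run(1, 5)],
                          cast(ubyte[]) "aaaaabbbbbccccc".dup);
    auto sub = h.substring(3, 8); // crosses a run boundary
    writeln(cast(string) sub.payload, " -> ", sub.header.length, " run(s)");
}

Whether that rebuild cost is outweighed by knowing where every language segment sits is exactly the tradeoff being argued above.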
May 18 2018
On Friday, 18 May 2018 at 08:44:41 UTC, Joakim wrote:
> I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why this was, and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS itself, but I strongly suspect it is widespread.

Welcome to my world (and probably the world of most Europeans): I haven't typed ć, č, ž and other non-ASCII letters since the early 2000s, even though SMS is mostly flat rate these days and people chat via WhatsApp anyway.
May 18 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
> It sounds like the main issue is that a header-based encoding would take less size?

Yes, and it would be easier to process.

> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

In general, you would be wrong: a carefully designed binary format will usually beat the pants off general-purpose compression: https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format to specific types of data, text in this case, and take advantage of patterns in that subset, as specialized image compression formats do. In this case, though, I haven't compared this scheme to general compression of UTF-8 strings, so I don't know which would compress better. However, that would mostly matter for network transmission; another big gain of a header-based scheme that doesn't use compression is much faster string processing in memory. Yes, the average end user doesn't care about this, but giant consumers of text data, like search engines, would benefit greatly from it.

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that would allow limited random access / slicing (by decoding a single "block" to access an element, for instance).

Possibly competitive for compression, but only for transmission over the network; unlikely for processing, as noted for Walter's idea.

> Anything that depends on external information and is not self-sync is awful for interchange.

You are describing the vast majority of all formats and protocols; it's amazing how we got by with them all this time.

> Internally the application can do some smarts though, but even then things like interning (partial interning) might be a more valuable approach. TCP being reliable just plain doesn't cut it. Corruption of a single bit is very real.

You seem to have missed my point entirely: UTF-8 will not catch most bit flips either, only those that happen to corrupt certain key bits in a certain way, a minority of the possibilities. Nobody is arguing that data corruption doesn't happen or that error correction shouldn't be done somewhere. The question is whether the extremely limited robustness that UTF-8's significant redundancy buys is a good tradeoff. I think it's obvious that it isn't, and I posit that anybody who knows anything about error-correcting codes would agree with that assessment. You would be much better off with a more compact header-based transfer format, layering on the level of error correction you need at a different level, which as I noted is already done at the link and transport layers and in various other parts of the system. If you need more error correction than that, do it right, not in the broken way UTF-8 does. Honestly, error detection/correction is the most laughably broken part of UTF-8; it is amazing that people even bring it up as a benefit.
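For what it's worth, the bit-flip claim is easy to check with nothing beyond Phobos. This small sketch, assuming only std.utf, flips single bits in the two-byte sequence for 'é' and asks the validator what it sees:

import std.stdio;
import std.utf : validate, UTFException;

// True if the byte sequence passes UTF-8 validation.
bool accepted(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(string) bytes.idup);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // U+00E9 'é' encodes as the two bytes 0xC3 0xA9.
    ubyte[] original = [0xC3, 0xA9];
    writeln(accepted(original)); // true

    // Flip the lowest bit of the continuation byte (0xA9 -> 0xA8):
    // still valid UTF-8, the text just silently became U+00E8 'è'.
    ubyte[] lowFlip = [0xC3, 0xA8];
    writeln(accepted(lowFlip)); // true -- corruption goes undetected

    // Flip the top bit of the continuation byte (0xA9 -> 0x29, ASCII ')'):
    // the lead byte 0xC3 now lacks its continuation, so validation fails.
    ubyte[] highFlip = [0xC3, 0x29];
    writeln(accepted(highFlip)); // false -- this kind of flip is caught
}

Flips that stay within the payload bits land on another valid sequence; only the ones that break the lead/continuation structure are caught, which is the "minority of the possibilities" point above.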
May 17 2018
On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
> It sounds like the main issue is that a header-based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

My bet is on LZW being *far* better than a header-based encoding. Natural language, which a large part of textual data consists of, tends to have a lot of built-in redundancy and is therefore highly compressible. A proper compression algorithm will beat any header-based size-reduction scheme, while still maintaining the context-free nature of UTF-8.

T

--
In a world without fences, who needs Windows and Gates? -- Christian Surchi
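The compression half of this bet is also easy to try. The sketch below uses Phobos's std.zlib, which is DEFLATE rather than LZW, so it is only an approximation of the suggestion; it just shows that ordinary, repetitive prose compresses well on top of plain UTF-8.

import std.stdio;
import std.zlib : compress;

void main()
{
    // Deliberately repetitive natural-language text, standing in for prose.
    string text;
    foreach (i; 0 .. 50)
        text ~= "the quick brown fox jumps over the lazy dog. ";

    // General-purpose compression exploits that redundancy directly,
    // while the UTF-8 underneath stays context-free and self-synchronizing.
    auto packed = compress(text);
    writefln("%s bytes of UTF-8 -> %s bytes compressed",
             text.length, packed.length);
}

On text this repetitive the compressed size comes out far smaller; real prose won't shrink as dramatically, and whether it beats a tailored header format is the open question in this subthread.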
May 17 2018