
digitalmars.D - Of possible interest: fast UTF8 validation

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
May 16 2018
next sibling parent reply Ethan Watson <gooberman gmail.com> writes:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware. The code linked doesn't seem to use any instructions newer than SSE2, so it's perfectly safe to run on any x64 processor. Could probably be sped up with newer SSE instructions if you're only ever running internally on hardware you control.
May 16 2018
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/16/2018 08:47 AM, Ethan Watson wrote:
 On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.
Is it workable to have a runtime-initialized flag that controls using SSE vs. conservative?
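For concreteness, a minimal sketch of what such a runtime-initialized switch could look like in D, assuming core.cpuid exposes an sse42 query and using Phobos' validate as the conservative path; the vectorized routine is a hypothetical placeholder, not existing code:

  import core.cpuid : sse42;
  import std.utf : UTFException, validate;

  // Conservative path: decode-based validation from Phobos.
  bool validateScalar(const(char)[] s)
  {
      try { validate(s); return true; }
      catch (UTFException) { return false; }
  }

  // Hypothetical vectorized path (e.g. a port of the code in the article);
  // here it just forwards to the scalar version as a placeholder.
  bool validateVector(const(char)[] s)
  {
      return validateScalar(s);
  }

  immutable bool useVector;

  shared static this()
  {
      useVector = sse42(); // queried once at program start
  }

  bool validateUtf8(const(char)[] s)
  {
      // One cheap, well-predicted branch, kept out of the per-byte hot loop.
      return useVector ? validateVector(s) : validateScalar(s);
  }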
 The code linked doesn't seem to use any instructions newer than SSE2, so 
 it's perfectly safe to run on any x64 processor. Could probably be sped 
 up with newer SSE instructions if you're only ever running internally on 
 hardware you control.
Even better! Contributions would be very welcome. Andrei
May 16 2018
parent reply Ethan Watson <gooberman gmail.com> writes:
On Wednesday, 16 May 2018 at 13:54:05 UTC, Andrei Alexandrescu 
wrote:
 Is it workable to have a runtime-initialized flag that controls 
 using SSE vs. conservative?
Sure, it's workable with these kinds of speed gains, although the conservative code path ends up being slightly worse off: an extra fetch, compare and branch get introduced.

My preferred method, though, is to just build multiple sets of binaries as DLLs/SOs/DYNLIBs, then load in the correct libraries dependent on the CPUID test at program initialisation. Current Xbox/PlayStation hardware is pretty terrible when it comes to branching, so compiling with minimal branching and deploying the exact binaries for the hardware's capabilities is the way I generally approach things. We never got around to setting something like that up for the PC release of Quantum Break, although we definitely talked about it.
May 16 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/16/2018 7:38 AM, Ethan Watson wrote:
 My preferred method though is to just build multiple sets of binaries as
 DLLs/SOs/DYNLIBs, then load in the correct libraries dependent on the CPUID
 test at program initialisation.
I used to do things like that a simpler way. 3 functions would be created:

  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature().

It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible.

The advantage is there was only one binary.

----

The PDP-11 had an optional chipset to do floating point. The compiler generated function calls that emulated the floating point:

  call FPADD
  call FPSUB
  ...

Those functions would check to see if the FPU existed. If it did, it would in-place patch the binary to replace the calls with FPU instructions! Of course, that won't work these days because of protected code pages.

----

In the bad old DOS days, emulator calls were written out by the compiler. Special relocation fixup records were emitted for them. The emulator or the FPU library was then linked in, and included special relocation fixup values which tricked the linker fixup mechanism into patching those instructions with either emulator calls or FPU instructions. It was just brilliant!
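Spelled out as a minimal, self-contained D sketch of that self-selecting function pointer; the two implementations and the core.cpuid feature test are stand-ins, only the dispatch mechanism is the point:

  import core.cpuid : sse42;
  import std.stdio : writeln;

  void featureInHardware() { writeln("SSE 4.2 path"); }
  void emulateFeature()    { writeln("portable fallback path"); }

  // Starts out pointing at select(); after the first call it points
  // straight at whichever implementation was chosen. Note that module
  // variables are thread-local by default in D.
  void function() doIt = &select;

  void select()
  {
      doIt = sse42() ? &featureInHardware : &emulateFeature;
      doIt();
  }

  void main()
  {
      doIt(); // first call: selects and runs the chosen implementation
      doIt(); // later calls: one indirect call, no re-selection
  }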
May 16 2018
next sibling parent reply Ethan <gooberman gmail.com> writes:
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
 I used to do things like that a simpler way. 3 functions would 
 be created:

   void FeatureInHardware();
   void EmulateFeature();
   void Select();
   void function() doIt = &Select;

 I.e. the first time doIt is called, it calls the Select 
 function which then resets doIt to either FeatureInHardware() 
 or EmulateFeature().

 It costs an indirect call, but if you move it up the call 
 hierarchy a bit so it isn't in the hot loops, the indirect 
 function call cost is negligible.

 The advantage is there was only one binary.
It certainly sounds reasonable enough for 99% of use cases. But I'm definitely the 1% here ;-)

Indirect calls invoke the wrath of the branch predictor on XB1/PS4 (ie an AMD Jaguar processor). But there's certainly some more interesting non-processor behaviour, at least on MSVC compilers. The provided auto-DLL loading in that environment performs a call to your DLL-boundary-crossing function, which actually winds up in a jump table that performs a jump instruction to get to your DLL code. I suspect this is more costly than the indirect jump at a "write a basic test" level. Doing an indirect call as the only action in a for-loop is guaranteed to bring out the costly branch predictor on the Jaguar. Without getting in and profiling a bunch of stuff, I'm not entirely sure which approach I'd prefer as a general approach.

Certainly, as far as this particular thread goes, every general-purpose function of a few lines that I write that uses intrinsics is forced inline. No function calls, indirect or otherwise. And on top of that, the inlined code usually pushes the branches out across the byte-boundary lines just far enough that only the simple branch predictor is ever invoked.

(Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
May 16 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/16/2018 10:28 AM, Ethan wrote:
 (Related: one feature I'd really really really love for linkers to implement
is 
 the ability to mark up certain functions to only ever be linked at a certain 
 byte boundary. And that's purely because Jaguar branch prediction often made
my 
 profiling tests non-deterministic between compiles. A NOP is a legit 
 optimisation on those processors.)
Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.
May 16 2018
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 17/05/2018 8:34 AM, Walter Bright wrote:
 On 5/16/2018 10:28 AM, Ethan wrote:
 (Related: one feature I'd really really really love for linkers to 
 implement is the ability to mark up certain functions to only ever be 
 linked at a certain byte boundary. And that's purely because Jaguar 
 branch prediction often made my profiling tests non-deterministic 
 between compiles. A NOP is a legit optimisation on those processors.)
Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.
Would allowing the align attribute on functions make sense here for Ethan?
May 16 2018
prev sibling parent Ethan <gooberman gmail.com> writes:
And at the risk of getting this topic back on track:

On Wednesday, 16 May 2018 at 20:34:26 UTC, Walter Bright wrote:
 Linkers already do that. Alignment is specified on all symbols 
 emitted by the compiler, and the linker uses that info.
Mea culpa. Upon further thinking, two things strike me:

1) As suggested, there's no way to instruct the front-end to align functions to byte boundaries outside of "optimise for speed" command line flags.

2) I would have heavily relied on incremental linking to iterate on these tests when trying to work out how the processor behaved. I expect MSVC's incremental linker would turn out to be just rubbish enough to not care about how those flags originally behaved.

On Wednesday, 16 May 2018 at 20:36:10 UTC, Walter Bright wrote:
 It would be nice to get this technique put into std.algorithm!
The code I wrote originally was C++ code with intrinsics. But I can certainly look at adapting it to DMD/LDC. The DMD frontend providing natural mappings for Intel's published intrinsics would be massively beneficial here.
May 17 2018
prev sibling parent xenon325 <anm programmer.net> writes:
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
 I used to do things like that a simpler way. 3 functions would 
 be created:

   void FeatureInHardware();
   void EmulateFeature();
   void Select();
   void function() doIt = &Select;

 I.e. the first time doIt is called, it calls the Select 
 function which then resets doIt to either FeatureInHardware() 
 or EmulateFeature().

 It costs an indirect call [...]
Is this basically the same as Function MultiVersioning [1]? I never had a need to use it and always wondered how it works out in real life.

From the description it seems this would incur indirection: "To keep the cost of dispatching low, the IFUNC [2] mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump indirect instruction."

In the linked article [2], Ian Lance Taylor says glibc uses this for memcpy(), so this should be pretty efficient (but then again, one doesn't call memcpy() in hot loops too often).

[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning
[2] https://www.airs.com/blog/archives/403

-- Alexander
May 16 2018
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/16/2018 5:47 AM, Ethan Watson wrote:
 I re-implemented some common string functionality at Remedy using SSE 4.2 
 instructions. Pretty handy. Except we had to turn that code off for released 
 products since nowhere near enough people are running SSE 4.2 capable hardware.
It would be nice to get this technique put into std.algorithm!
May 16 2018
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
D doesn't seem to have C definitions for the x86 SIMD intrinsics, which is a bummer: https://issues.dlang.org/show_bug.cgi?id=18865

It's too bad that nothing came of std.simd.
May 16 2018
parent reply Ethan Watson <gooberman gmail.com> writes:
On Wednesday, 16 May 2018 at 14:25:07 UTC, Jack Stouffer wrote:
 D doesn't seem to have C definitions for the x86 SIMD 
 intrinsics, which is a bummer
Replying to highlight this. There's core.simd, which doesn't look anything like the SSE/AVX intrinsics at all and looks a lot more like a wrapper for writing assembly instructions directly. And even better - LDC doesn't support core.simd and has its own intrinsics that don't match the SSE/AVX intrinsics API published by Intel. And since I'm a multi-platform developer, the "What about NEON intrinsics?" question always sits in the back of my mind. I ended up implementing my own SIMD primitives in Binderoo, but they're all versioned out for LDC at the moment until I look into it and complete the implementation.
May 16 2018
parent David Nadlinger <code klickverbot.at> writes:
On Wednesday, 16 May 2018 at 14:48:54 UTC, Ethan Watson wrote:
 And even better - LDC doesn't support core.simd and has its own 
 intrinsics that don't match the SSE/AVX intrinsics API 
 published by Intel.
To provide some context here: LDC only supports the types from core.simd, but not the __simd "assembler macro" that DMD uses to more or less directly emit the corresponding x86 opcodes.

LDC does support most of the GCC-style SIMD builtins for the respective target (x86, ARM, …), but there are two problems with this:

1) As Ethan pointed out, the GCC API does not match Intel's intrinsics; for example, it is `__builtin_ia32_vfnmsubpd256_mask3` instead of `_mm256_mask_fnmsub_pd`, and the argument orders differ as well.

2) The functions that LDC exposes as intrinsics are those that are intrinsics on the LLVM IR level. However, some operations can be directly represented in normal, instruction-set-independent LLVM IR – no explicit intrinsics are provided for these.

Unfortunately, LLVM doesn't seem to provide any particularly helpful tools for implementing Intel's intrinsics API. x86intrin.h is manually implemented for Clang as a collection of various macros and functions. It would be seriously cool if someone could write a small tool to parse those headers, (semi-)automatically convert them to D, and generate tests for comparing the emitted IR against Clang. I'm happy to help with the LDC side of things.

— David
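For illustration, a minimal sketch of the part that is portable across DMD and LDC today, namely the core.simd vector types with ordinary element-wise operators (exact compiler and target support is an assumption to verify; anything fancier currently means DMD's __simd on one side and GCC/LLVM-style builtins on the other, as described above):

  import core.simd;

  void main()
  {
      // The vector types and element-wise operators work on both compilers.
      float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
      float4 b = 2.0f;        // scalar broadcast to all four lanes
      float4 c = a * b + a;   // lowered to SSE (or NEON on ARM) by the backend

      assert(c.array[0] == 3.0f && c.array[3] == 12.0f);

      // Shuffles, saturating arithmetic, masked operations and the like are
      // where the per-compiler intrinsics APIs diverge.
  }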
May 18 2018
prev sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
May 16 2018
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
 On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
 wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
Validating UTF-8 is super common; most text protocols and files these days use it, and others have an option to do so. I'd like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
May 16 2018
parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
 On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
 On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
 wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
Validating UTF-8 is super common, most text protocols and files these days would use it, other would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
May 16 2018
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 16 May 2018 at 17:18:06 UTC, Joakim wrote:
 I think you know what I'm referring to, which is that UTF-8 is 
 a badly designed format, not that input validation shouldn't be 
 done.
UTF-8 seems like the best option available given the problem space. Junk data is going to be a problem with any possible string format given that encoding translations and programmer error will always be prevalent.
May 16 2018
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/16/18 1:18 PM, Joakim wrote:
 On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
 On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
 On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
Validating UTF-8 is super common, most text protocols and files these days would use it, other would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/ If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt! Andrei
May 16 2018
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
 If you could share some details on why you think UTF8 is badly designed and
 how you believe it could be/have been better, I'd be in your debt!
Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder! Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.
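As a concrete illustration of that redundancy, a small sketch using Phobos: std.utf.validate should reject the overlong two-byte encoding of U+0000, though treat the exact behaviour as an assumption to verify:

  import std.exception : assertThrown;
  import std.utf : UTFException, validate;

  void main()
  {
      // 0xC0 0x80 is an "overlong" two-byte encoding of U+0000: the code
      // point fits in one byte, so UTF-8 declares this longer form illegal.
      immutable ubyte[] overlong = [0xC0, 0x80];
      assertThrown!UTFException(validate(cast(string) overlong));

      // The canonical one-byte encoding of the same code point validates fine.
      validate("\0");
  }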
May 16 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
 On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
 If you could share some details on why you think UTF8 is badly designed
 and how you believe it could be/have been better, I'd be in your debt!
Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder! Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.
I'm inclined to think that the redundancy is a serious flaw. I'd argue that if it were truly well-designed, there would be exactly one way to represent every character - including clear up to grapheme clusters where multiple code points are involved (i.e. there would be no normalization issues in valid Unicode, because there would be only one valid normalization). But there may be some technical issues that I'm not aware of that would make that problematic. Either way, the issues that I have with UTF-8 are issues that UTF-16 and UTF-32 have as well, since they're really issues relating to code points.

Overall, I think that UTF-8 is by far the best encoding that we have, and I don't think that we're going to get anything better, but I'm also definitely inclined to think that it's still flawed - just far less flawed than the alternatives.

And in general, I have to wonder if there would be a way to make Unicode less complicated if we could do it from scratch without worrying about any kind of compatibility, since what we have is complicated enough that most programmers don't come close to understanding it, and it's just way too hard to get right. But I suspect that if efficiency matters, there's enough inherent complexity that we'd just be screwed on that front even if we could do a better job than was done with Unicode as we know it.

- Jonathan M Davis
May 16 2018
prev sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu 
wrote:
 On 5/16/18 1:18 PM, Joakim wrote:
 On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky 
 wrote:
 On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
 On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei 
 Alexandrescu wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
Validating UTF-8 is super common, most text protocols and files these days would use it, other would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/
Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.
 If you could share some details on why you think UTF8 is badly 
 designed and how you believe it could be/have been better, I'd 
 be in your debt!
Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.

I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream:

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992 as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is whether those priorities are at all relevant today, when billions of smartphone users are regularly not using ASCII, and these tech companies are the largest private organizations on the planet, ie they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.

I think a header-based scheme would be _much_ better today, and the reason I know Dmitry knows that is that I have discussed privately with him over email that I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.
May 16 2018
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
 On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu 
 wrote:
 On 5/16/18 1:18 PM, Joakim wrote:
 On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky 
 wrote:
 On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
 On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei 
 Alexandrescu wrote:
 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
Validating UTF-8 is super common, most text protocols and files these days would use it, other would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/
Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.
 If you could share some details on why you think UTF8 is badly 
 designed and how you believe it could be/have been better, I'd 
 be in your debt!
Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme,
This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled. That's exactly what happened with code-page-based texts when you don't know which code page they are encoded in. It has the supplemental inconvenience that mixing languages becomes impossible, or at least very cumbersome.

UTF-8 has several properties that are difficult to have with other schemes:

- It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or a previous byte.

- It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check Wikipedia's front page for example).

- The multi-byte nature of other alphabets is not as bad as people think, because texts in computers do not live on their own; they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few extra bytes in the text do not weigh that much.

I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages, and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002, when we handled only 11 languages, of which only 1 used another alphabet (Greek). Everything was based on RTF with codepages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings it was completely unmanageable, because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).

2 years ago we also implemented support for Chinese. The nice thing was that we didn't have to change much to do that, thanks to Unicode. The second surprise was with the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but generally 1 ideogram replaces many letters. The ideogram 亿 replaces "One hundred million", for example; which of them takes more bytes? So if CJK indeed requires more bytes to encode, it is firstly because it NEEDS many more bits in the first place (there are around 30000 CJK codepoints in the BMP alone; add to that the 60000 that are in the SIP and we need 17 bits just to encode them).
 as they mostly were except for obviously the Asian CJK 
 languages. That way, you optimize for the common string, ie one 
 that contains a single language or at least no CJK, rather than 
 pessimizing every non-ASCII language by doubling its character 
 width, as UTF-8 does. This UTF-8 issue is one of the first 
 topics I raised in this forum, but as you noted at the time 
 nobody agreed and I don't want to dredge that all up again.

 I have been researching this a bit since then, and the stated 
 goals for UTF-8 at inception were that it _could not overlap 
 with ASCII anywhere for other languages_, to avoid issues with 
 legacy software wrongly processing other languages as ASCII, 
 and to allow seeking from an arbitrary location within a byte 
 stream:

 https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

 I have no dispute with these priorities at the time, as they 
 were optimizing for the institutional and tech realities of 
 1992 as Dylan also notes, and UTF-8 is actually a nice hack 
 given those constraints. What I question is that those 
 priorities are at all relevant today, when billions of 
 smartphone users are regularly not using ASCII, and these tech 
 companies are the largest private organizations on the planet, 
 ie they have the resources to design a new transfer format. I 
 see basically no relevance for the streaming requirement today, 
 as I noted in this forum years ago, but I can see why it might 
 have been considered important in the early '90s, before 
 packet-based networking protocols had won.

 I think a header-based scheme would be _much_ better today and 
 the reason I know Dmitry knows that is that I have discussed 
 privately with him over email that I plan to prototype a format 
 like that in D. Even if UTF-8 is already fairly widespread, 
 something like that could be useful as a better intermediate 
 format for string processing, and maybe someday could replace 
 UTF-8 too.
May 17 2018
next sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
 This is not practical, sorry. What happens when your message 
 loses the header? Exactly, the rest of the message is garbled.
Why would it lose the header? TCP guarantees delivery and checksums the data; that's effective enough at the transport layer.

I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format; it's better done by other layers, like transport protocols or filesystems, which will guard against such losses much more reliably and efficiently. For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.
 That's exactly what happened with code page based texts when 
 you don't know in which code page it is encoded. It has the 
 supplemental inconvenience that mixing languages becomes 
 impossible or at least very cumbersome.
 UTF-8 has several properties that are difficult to have with 
 other schemes.
 - It is state-less, means any byte in a stream always means the 
 same thing. Its meaning  does not depend on external or a 
 previous byte.
I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.
 - It can mix any language in the same stream without acrobatics 
 and if one thinks that mixing languages doesn't happen often 
 should get his head extracted from his rear, because it is very 
 common (check wikipedia's front page for example).
I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.
 - The multi byte nature of other alphabets is not as bad as 
 people think because texts in computer do not live on their 
 own, meaning that they are generally embedded inside file 
 formats, which more often than not are extremely bloated (xml, 
 html, xliff, akoma ntoso, rtf etc.). The few bytes more in the 
 text do not weigh that much.
Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.
 I'm in charge at the European Commission of the biggest 
 translation memory in the world. It handles currently 30 
 languages and without UTF-8 and UTF-16 it would be 
 unmanageable. I still remember when I started there in 2002 
 when we handled only 11 languages of which only 1 was of 
 another alphabet (Greek). Everything was based on RTF with 
 codepages and it was a braindead mess. My first job in 2003 was 
 to extend the system to handle the 8 newcomer languages and 
 with ASCII based encodings it was completely unmanageable 
 because every document processed mixes languages and alphabets 
 freely (addresses and names are often written in their original 
 form for instance).
I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up (of course they would); the question is whether a hypothetical header-based standard would be better than the current continuation-byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.
 2 years ago we implemented also support for Chinese. The nice 
 thing was that we didn't have to change much to do that thanks 
 to Unicode. The second surprise was with the file sizes, 
 Chinese documents were generally smaller than their European 
 counterparts. Yes CJK requires 3 bytes for each ideogram, but 
 generally 1 ideogram replaces many letters. The ideogram 亿 
 replaces "One hundred million" for example, which of them take 
 more bytes? So if CJK indeed requires more bytes to encode, it 
 is firstly because they NEED many more bits in the first place 
 (there are around 30000 CJK codepoints in the BMP alone, add to 
 it the 60000 that are in the SIP and we have a need of 17 bits 
 only to encode them.
That's not the relevant criterion: nobody cares whether the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Almost nobody cares about which translation version is smaller; they care that the text they sent in Chinese or Korean is as small as it can be.

Anyway, I didn't mean to restart this debate, so I'll leave it here.
May 17 2018
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
 On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter 
 wrote:
 This is not practical, sorry. What happens when your message 
 loses the header? Exactly, the rest of the message is garbled.
Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.
What does TCP/IP have to do with anything in the discussion here? UTF-8 (or UTF-16 or UTF-32) has nothing to do with network protocols; that's completely unrelated. A file encoded on a disk may never leave the machine it is written on and may never see a wire in its lifetime, and its encoding is still of vital importance. That's why a header-based encoding is too restrictive.
 I agree that UTF-8 is a more redundant format, as others have 
 mentioned earlier, and is thus more robust to certain types of 
 data loss than a header-based scheme. However, I don't consider 
 that the job of the text format, it's better done by other 
 layers, like transport protocols or filesystems, which will 
 guard against such losses much more reliably and efficiently.
No. A text format cannot depend on a network protocol. It would be as if you could only listen to music or watch a video as a stream and never save it to an offline file, because the information about what that blob of bytes represents would be stored nowhere. It doesn't make any sense.
 For example, a random bitflip somewhere in the middle of a 
 UTF-8 string will not be detectable most of the time. However, 
 more robust error-correcting schemes at other layers of the 
 system will easily catch that.
That's the job of the other layers. Any other file would have the same problem. At least with UTF-8, at most 1 codepoint will ever be lost or changed; no other encoding would fare better. That said, if a checksum for your document is important, you can add it externally anyway.
 That's exactly what happened with code page based texts when 
 you don't know in which code page it is encoded. It has the 
 supplemental inconvenience that mixing languages becomes 
 impossible or at least very cumbersome.
 UTF-8 has several properties that are difficult to have with 
 other schemes.
 - It is state-less, means any byte in a stream always means 
 the same thing. Its meaning  does not depend on external or a 
 previous byte.
I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.
Again, this is orthogonal to UTF-8. When I speak above of streams, I am not limiting myself to sockets; files are also read as streams. So stop equating UTF-8 with the Internet; these are 2 different domains. The Internet and its protocols were defined and invented long before Unicode, and Unicode is very useful offline as well.
 - It can mix any language in the same stream without 
 acrobatics and if one thinks that mixing languages doesn't 
 happen often should get his head extracted from his rear, 
 because it is very common (check wikipedia's front page for 
 example).
I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.
Ok, show me how you transmit that, I'm curious:

<prop type="Txt::Doc. No.">E2010C0002</prop>
<tuv lang="EN-GB"> <seg>EFTA Surveillance Authority Decision</seg> </tuv>
<tuv lang="DE-DE"> <seg>Beschluss der EFTA-Überwachungsbehörde</seg> </tuv>
<tuv lang="DA-01"> <seg>EFTA-Tilsynsmyndighedens beslutning</seg> </tuv>
<tuv lang="EL-01"> <seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg> </tuv>
<tuv lang="ES-ES"> <seg>Decisión del Órgano de Vigilancia de la AELC</seg> </tuv>
<tuv lang="FI-01"> <seg>EFTAn valvontaviranomaisen päätös</seg> </tuv>
<tuv lang="FR-FR"> <seg>Décision de l'Autorité de surveillance AELE</seg> </tuv>
<tuv lang="IT-IT"> <seg>Decisione dell’Autorità di vigilanza EFTA</seg> </tuv>
<tuv lang="NL-NL"> <seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg> </tuv>
<tuv lang="PT-PT"> <seg>Decisão do Órgão de Fiscalização da EFTA</seg> </tuv>
<tuv lang="SV-SE"> <seg>Beslut av Eftas övervakningsmyndighet</seg> </tuv>
<tuv lang="LV-01"> <seg>EBTA Uzraudzības iestādes Lēmums</seg> </tuv>
<tuv lang="CS-01"> <seg>Rozhodnutí Kontrolního úřadu ESVO</seg> </tuv>
<tuv lang="ET-01"> <seg>EFTA järelevalveameti otsus</seg> </tuv>
<tuv lang="PL-01"> <seg>Decyzja Urzędu Nadzoru EFTA</seg> </tuv>
<tuv lang="SL-01"> <seg>Odločba Nadzornega organa EFTE</seg> </tuv>
<tuv lang="LT-01"> <seg>ELPA priežiūros institucijos sprendimas</seg> </tuv>
<tuv lang="MT-01"> <seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg> </tuv>
<tuv lang="SK-01"> <seg>Rozhodnutie Dozorného orgánu EZVO</seg> </tuv>
<tuv lang="BG-01"> <seg>Решение на Надзорния орган на ЕАСТ</seg> </tuv>
</tu>
<tu>
 - The multi byte nature of other alphabets is not as bad as 
 people think because texts in computer do not live on their 
 own, meaning that they are generally embedded inside file 
 formats, which more often than not are extremely bloated (xml, 
 html, xliff, akoma ntoso, rtf etc.). The few bytes more in the 
 text do not weigh that much.
Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.
They don't; it's getting worse by the day. That's why I mentioned Akoma Ntoso and XLIFF: they will be used more and more. The world is not limited to webshit (see n-gate.com for the reference).
 I'm in charge at the European Commission of the biggest 
 translation memory in the world. It handles currently 30 
 languages and without UTF-8 and UTF-16 it would be 
 unmanageable. I still remember when I started there in 2002 
 when we handled only 11 languages of which only 1 was of 
 another alphabet (Greek). Everything was based on RTF with 
 codepages and it was a braindead mess. My first job in 2003 
 was to extend the system to handle the 8 newcomer languages 
 and with ASCII based encodings it was completely unmanageable 
 because every document processed mixes languages and alphabets 
 freely (addresses and names are often written in their 
 original form for instance).
I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.
I doubt it, because the issue has nothing to do with network protocols, as you seem to imply it does. It is about the data format, i.e. content that may be shuffled over a network, but can also stay on a disk, be printed on paper (gasp, such old tech) or be used interactively in a GUI.
 2 years ago we implemented also support for Chinese. The nice 
 thing was that we didn't have to change much to do that thanks 
 to Unicode. The second surprise was with the file sizes, 
 Chinese documents were generally smaller than their European 
 counterparts. Yes CJK requires 3 bytes for each ideogram, but 
 generally 1 ideogram replaces many letters. The ideogram 亿 
 replaces "One hundred million" for example, which of them take 
 more bytes? So if CJK indeed requires more bytes to encode, it 
 is firstly because they NEED many more bits in the first place 
 (there are around 30000 CJK codepoints in the BMP alone, add 
 to it the 60000 that are in the SIP and we have a need of 17 
 bits only to encode them.
That's not the relevant criteria: nobody cares if the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Because almost nobody cares about which translation version is smaller, they care that the text they sent in Chinese or Korean is as small as it can be.
At most 50% more, but if size is really that important one can use UTF-16, which is the same size as Big-5 or Shift-JIS; or, as Walter suggested, they would do better to compress the file in that case.
 Anyway, I didn't mean to restart this debate, so I'll leave it 
 here.
- the auto-synchronization and the statelessness are big deals.
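A minimal sketch of what auto-synchronization buys you: from an arbitrary byte offset (or after corruption), the next code-point boundary can be found purely locally, by skipping continuation bytes of the form 10xxxxxx, with no header or external state:

  // Advance i to the next code-point boundary at or after position i.
  // Continuation bytes always look like 10xxxxxx, so no prior context is
  // needed to find where the next character starts.
  size_t resync(const(ubyte)[] s, size_t i)
  {
      while (i < s.length && (s[i] & 0xC0) == 0x80)
          ++i;
      return i;
  }

  void main()
  {
      auto bytes = cast(const(ubyte)[]) "más café";
      size_t i = 2;              // lands inside the two-byte 'á'
      i = resync(bytes, i);      // recovers the next boundary locally
      assert(bytes[i] == 's');
  }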
May 17 2018
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d
wrote:
[...]
 - the auto-synchronization and the statelessness are big deals.
Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents? Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.


T

-- 
Famous last words: I *think* this will work...
May 17 2018
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
 On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via 
 Digitalmars-d wrote: [...]
 [...]
Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms. [...]
That's essentially what RTF with code pages was. I'm happy that we got rid of it and that it was replaced by XML; even if Microsoft's document XML is a bloated, ridiculous mess, it's still an order of magnitude less problematic than RTF (I mean at the text encoding level).
May 17 2018
prev sibling parent Neia Neutuladh <neia ikeran.org> writes:
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
 On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via 
 Digitalmars-d wrote: [...]
 - the auto-synchronization and the statelessness are big deals.
Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.
You'd have three data structures: Strand, Rope, and Slice. A Strand is a series of bytes with an encoding. A Rope is a series of Strands. A Slice is a pair of location references within a Rope. You probably want a special data structure to name a location within a Rope: Strand offset, then byte offset. Total of five words instead of two to pass a Slice, but zero dynamic allocations.

This would be a problem for data locality. However, rope-style data structures are handy for some types of string manipulation.

As an alternative, you might have a separate document specifying what encodings apply to what byte ranges. Slices would then be three words long (pointer to the string struct, start offset, end offset). Iterating would cost O(log(S) + M), where S is the number of encoded segments and M is the number of bytes in the slice.

Anyway, you either accept a more complex data structure or you accept terrible time complexity; you don't get a simple structure and good complexity at the same time.
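Purely for illustration, a minimal D sketch of the hypothetical Strand/Rope/Slice layout described above (the field names and the Encoding tag are assumptions, not an existing library):

  enum Encoding : ushort { ascii, latin1, utf8, utf16 }

  // A run of bytes in one encoding.
  struct Strand
  {
      Encoding encoding;
      immutable(ubyte)[] bytes;
  }

  // A string is a sequence of strands.
  struct Rope
  {
      Strand[] strands;
  }

  // Names a position in a rope: which strand, then the byte offset within it.
  struct Location
  {
      size_t strand;
      size_t offset;
  }

  // A slice is a rope reference plus two locations: five words instead of
  // two, but taking a substring allocates nothing.
  struct Slice
  {
      Rope* rope;
      Location begin;
      Location end;
  }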
 And that's assuming we have a sane header-based encoding for 
 strings that contain segments in multiple languages in the 
 first place. Linguistic analysis articles, for example, would 
 easily contain many such segments within a paragraph, or 
 perhaps in the same sentence. How would a header-based encoding 
 work for such documents?  Nevermind the recent trend of 
 liberally sprinkling emojis all over regular text. If every 
 emoticon embedded in a string requires splitting the string 
 into 3 segments complete with their own headers, I dare not 
 imagine what the code that manipulates such strings would look 
 like.
"Header" implies that all encoding data appears at the start of the document, or in a separate metadata segment. (Call it a start index and two bytes to specify the encoding; reserve the first few bits of the encoding to specify the width.) It also brings to mind HTTP, and reminds me that most documents are either mostly ASCII or a heavy mix of ASCII and something else (HTML and XML being the forerunners). If the encoding succeeded at making most scripts single-byte, then, testing with https://ar.wikipedia.org/wiki/Main_Page, you might get within 15% of UTF-8's efficiency. And then a simple sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as long in this encoding as UTF-8, since it has ten encoded segments, each with overhead. (Assuming the header supports strings up to 2^32 bytes long.) If it didn't succeed at making Latin and Arabic single-byte scripts (and Latin contains over 800 characters in Unicode, while Arabic has over three hundred), it would be worse than UTF-16.
May 18 2018
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/17/2018 09:14 AM, Patrick Schluter wrote:
 I'm in charge at the European Commission of the biggest translation 
 memory in the world.
Impressive! Is that the Europarl?
May 17 2018
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu 
wrote:
 On 05/17/2018 09:14 AM, Patrick Schluter wrote:
 I'm in charge at the European Commission of the biggest 
 translation memory in the world.
Impressive! Is that the Europarl?
No, Euramis, the central translation memory developed by the Commission and used also by the other institutions. The database contains more than a billion segments from parallel texts and is, afaik, the biggest of its kind. One of the big strengths of the Euramis TM is its multi-target-language store; this allows fuzzy searches in all combinations, including indirect translations (i.e. if a document written in English was translated into Romanian and into Maltese, it is then possible to search for alignments between ro and mt). It's not the only system to do that, but at that volume it is quite unique.

We also publish every year an extract of it covering the published legislation [1] from the Official Journal, so that it can be used by the research community. All the machine translation engines use it. It is one of the most accessed data collections on the European Open Data portal [2].

The very uncommon thing about the backend software of EURAMIS is that it is written in C. Pure unadulterated C. I'm trying to introduce D, but with the strange (to say it politely) configurations our servers have, it is quite challenging.

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
[2]: http://data.europa.eu/euodp/fr/data
May 17 2018
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/16/2018 10:01 PM, Joakim wrote:
 Unicode was a standardization of all the existing code pages and then added
 these new transfer formats, but I have long thought that they'd have been
 better off going with a header-based format that kept most languages in a
 single-byte scheme, as they mostly were except for obviously the Asian CJK
 languages. That way, you optimize for the common string, ie one that contains
 a single language or at least no CJK, rather than pessimizing every non-ASCII
 language by doubling its character width, as UTF-8 does. This UTF-8 issue is
 one of the first topics I raised in this forum, but as you noted at the time
 nobody agreed and I don't want to dredge that all up again.
It sounds like the main issue is that a header-based encoding would take up less space? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
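A rough sketch of that hypothesis in D, using std.zlib's DEFLATE as a stand-in for LZW (an approximation of the idea, not the exact scheme suggested):

  import std.zlib : compress, uncompress;

  void main()
  {
      // Cyrillic text: every letter costs two bytes in UTF-8.
      string text = "Привет, мир! Привет, мир! Привет, мир! Привет, мир!";

      auto packed   = compress(text, 9);               // max compression level
      auto restored = cast(string) uncompress(packed); // round-trips losslessly

      assert(restored == text);
      // For repetitive non-ASCII text like this, the compressed size drops
      // well below the raw UTF-8 size, which is the effect the
      // compression-layer hypothesis above is counting on.
  }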
May 17 2018
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
 On 5/16/2018 10:01 PM, Joakim wrote:
 Unicode was a standardization of all the existing code pages 
 and then added these new transfer formats, but I have long 
 thought that they'd have been better off going with a 
 header-based format that kept most languages in a single-byte 
 scheme, as they mostly were except for obviously the Asian CJK 
 languages. That way, you optimize for the common string, ie 
 one that contains a single language or at least no CJK, rather 
 than pessimizing every non-ASCII language by doubling its 
 character width, as UTF-8 does. This UTF-8 issue is one of the 
 first topics I raised in this forum, but as you noted at the 
 time nobody agreed and I don't want to dredge that all up 
 again.
It sounds like the main issue is that a header based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
Indeed, and some other compression/deduplication options would allow limited random access / slicing (by decoding a single “block” to access an element, for instance). Anything that depends on external information and is not self-synchronizing is awful for interchange. Internally the application can do some smarts, but even then things like interning (partial interning) might be a more valuable approach. TCP being reliable just plain doesn't cut it; corruption of a single bit is very real.
May 17 2018
parent reply Ethan <gooberman gmail.com> writes:
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
 TCP being  reliable just plain doesn’t cut it. Corruption of
 single bit is very real.
Quoting to highlight and agree. TCP is reliable because it resends dropped packets and delivers them in order. I don't write TCP packets to my long-term storage medium. UTF as a transportation protocol for Unicode is *far* more useful than just for sending data across a network.
May 17 2018
parent reply Joakim <dlang joakim.fea.st> writes:
On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
 On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky 
 wrote:
 TCP being  reliable just plain doesn’t cut it. Corruption of
 single bit is very real.
Quoting to highlight and agree. TCP is reliable because it resends dropped packets and delivers them in order. I don't write TCP packets to my long-term storage medium. UTF as a transportation protocol Unicode is *far* more useful than just sending across a network.
The point wasn't that TCP is handling all the errors; it was a throwaway example of one other layer of the system, the network transport layer, that actually has a checksum that will detect a single bitflip, which UTF-8 will not usually detect. I mentioned that the filesystem and several other layers have their own such error detection, yet you guys crazily latch on to the TCP example alone.

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
 On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via 
 Digitalmars-d wrote: [...]
 - the auto-synchronization and the statelessness are big deals.
Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.
As we discussed when I first raised this header scheme years ago, you're right that slicing could be more expensive, depending on whether you chose to allocate a new header for the substring or not. The question is whether the optimizations available from such a header telling you where all the language substrings are in a multi-language string make up for having to expensively process the entire UTF-8 string to get that or other data. I think it's fairly obvious the design tradeoff of the header would beat out UTF-8 for all but a few degenerate cases, but maybe you don't see it.
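For concreteness, a toy D sketch of one way such a header could look; the layout and the names are invented purely for illustration, not a proposed format:

/// One run of single-byte text in a given language/code page (invented layout).
struct LangRun
{
    ubyte lang;     // language / code-page id
    size_t offset;  // start of the run in `payload`
    size_t length;  // run length in bytes (one byte per character)
}

/// Toy header-based string: a header listing the runs, plus a single-byte payload.
struct HeaderString
{
    LangRun[] header;
    immutable(ubyte)[] payload;

    /// Substring must rebuild a header covering only the runs it overlaps.
    HeaderString substring(size_t from, size_t to) const
    {
        LangRun[] runs;
        foreach (r; header)
        {
            auto lo = from > r.offset ? from : r.offset;
            auto hi = to < r.offset + r.length ? to : r.offset + r.length;
            if (lo < hi)
                runs ~= LangRun(r.lang, lo - from, hi - lo);
        }
        return HeaderString(runs, payload[from .. to]);
    }
}

A UTF-8 substring, by contrast, is just s[from .. to] with no allocation; the header only pays off if knowing the language runs up front saves enough work later, which is exactly the tradeoff in question.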
 And that's assuming we have a sane header-based encoding for 
 strings that contain segments in multiple languages in the 
 first place. Linguistic analysis articles, for example, would 
 easily contain many such segments within a paragraph, or 
 perhaps in the same sentence. How would a header-based encoding 
 work for such documents?
It would bloat the header to some extent, but still less than the corresponding UTF-8 encoding would. You may want to use special header encodings for such edge cases too, if you want to maintain the same large performance lead over UTF-8 that you'd have for the common case.
 Nevermind the recent trend of
 liberally sprinkling emojis all over regular text. If every 
 emoticon embedded in a string requires splitting the string 
 into 3 segments complete with their own headers, I dare not 
 imagine what the code that manipulates such strings would look 
 like.
Personally, I don't consider emojis worth implementing, :) they shouldn't be part of Unicode. But since they are, I'm fairly certain header-based text messages with emojis would be significantly smaller than using UTF-8/16.
I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why this was and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS, but I strongly suspect this is widespread.
Anyway, I can see the arguments about UTF-8 this time around are as bad as the first time I raised it five years back, so I'll leave this thread here.
May 18 2018
parent Nemanja Boric <4burgos gmail.com> writes:
On Friday, 18 May 2018 at 08:44:41 UTC, Joakim wrote:
 I was surprised to see that adding an emoji to a text message I 
 sent last year cut my message character quota in half.  I 
 googled why this was and it turns out that when you add an 
 emoji, the text messaging client actually changes your message 
 encoding from UTF-8 to UTF-16! I don't know if this is a 
 limitation of the default Android messaging client, my telco 
 carrier, or SMS, but I strongly suspect this is widespread.
Welcome to my world (and probably the world of most Europeans), where I haven't typed ć, č, ž and other non-ASCII letters since the early 2000s, even though SMS is mostly flat rate today and people chat via WhatsApp anyway.
May 18 2018
prev sibling next sibling parent Joakim <dlang joakim.fea.st> writes:
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
 On 5/16/2018 10:01 PM, Joakim wrote:
 Unicode was a standardization of all the existing code pages 
 and then added these new transfer formats, but I have long 
 thought that they'd have been better off going with a 
 header-based format that kept most languages in a single-byte 
 scheme, as they mostly were except for obviously the Asian CJK 
 languages. That way, you optimize for the common string, ie 
 one that contains a single language or at least no CJK, rather 
 than pessimizing every non-ASCII language by doubling its 
 character width, as UTF-8 does. This UTF-8 issue is one of the 
 first topics I raised in this forum, but as you noted at the 
 time nobody agreed and I don't want to dredge that all up 
 again.
It sounds like the main issue is that a header based encoding would take less size?
Yes, and be easier to process.
 If that's correct, then I hypothesize that adding an LZW 
 compression layer would achieve the same or better result.
In general, you would be wrong; a carefully designed binary format will usually beat the pants off general-purpose compression: https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results
Of course, that's because you can tailor your binary format for specific types of data, text in this case, and take advantage of patterns in that subset, just as specialized image compression formats do. In this case though, I haven't compared this scheme to general compression of UTF-8 strings, so I don't know which would compress better.
However, that would mostly matter for network transmission; another big gain of a header-based scheme that doesn't use compression is much faster string processing in memory. Yes, the average end user doesn't care about this, but giant consumers of text data, like search engines, would benefit greatly from it.
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
 Indeed, and some other compression/deduplication options that 
 would allow limited random access / slicing (by decoding a 
 single “block” to access an element for instance).
Possibly competitive on compression for transmission over the network, but unlikely for processing, as noted above for Walter's idea.
 Anything that depends on external information and is not 
 self-sync is awful for interchange.
You are describing the vast majority of all formats and protocols; amazing how we got by with them all this time.
 Internally the application can do some smarts though, but even
 then things like interning (partial interning) might be a more
 valuable approach. TCP being reliable just plain doesn’t cut
 it. Corruption of a single bit is very real.
You seem to have missed my point entirely: UTF-8 will not catch most bit flips either, only those that happen to corrupt certain key bits in a certain way, a minority of the possibilities.
Nobody is arguing that data corruption doesn't happen or that some error correction shouldn't be done somewhere. The question is whether the extremely limited robustness that UTF-8 buys with its significant redundancy is a good tradeoff. I think it's obvious that it isn't, and I posit that anybody who knows anything about error-correcting codes would agree with that assessment.
You would be much better off with a more compact header-based transfer format, layering on whatever level of error correction you need at a different level, which, as I noted, is already done at the link and transport layers and in various other parts of the system. If you need more error correction than that, do it right, not in the broken way UTF-8 does.
Honestly, error detection/correction is the most laughably broken part of UTF-8; it is amazing that people even bring it up as a benefit.
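To make that concrete, a minimal D sketch (the bytes are chosen purely for illustration): flipping one low bit in a UTF-8 continuation byte still yields well-formed UTF-8 that simply reads as a different character, while even a simple CRC32 at another layer notices the change. Only flips that hit the lead bits tend to be caught, the minority of cases mentioned above.

import std.digest.crc : crc32Of;
import std.stdio : writeln;
import std.utf : validate;

void main()
{
    // "hi é" -- U+00E9 encodes as 0xC3 0xA9 in UTF-8.
    ubyte[] msg = [0x68, 0x69, 0x20, 0xC3, 0xA9];
    auto original = msg.dup;

    msg[4] ^= 0x01;   // flip one bit in the continuation byte: 0xA9 -> 0xA8

    // Still well-formed UTF-8: the string now reads "hi è" and validate() does not throw.
    validate(cast(const(char)[]) msg);
    writeln("UTF-8 validation passed despite the bit flip");

    // A checksum at another layer does notice the change.
    writeln(crc32Of(original) != crc32Of(msg)
            ? "CRC32 detects the flip"
            : "CRC32 missed it");
}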
May 17 2018
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
 On 5/16/2018 10:01 PM, Joakim wrote:
 Unicode was a standardization of all the existing code pages and
 then added these new transfer formats, but I have long thought that
 they'd have been better off going with a header-based format that
 kept most languages in a single-byte scheme, as they mostly were
 except for obviously the Asian CJK languages. That way, you optimize
 for the common string, ie one that contains a single language or at
 least no CJK, rather than pessimizing every non-ASCII language by
 doubling its character width, as UTF-8 does. This UTF-8 issue is one
 of the first topics I raised in this forum, but as you noted at the
 time nobody agreed and I don't want to dredge that all up again.
 It sounds like the main issue is that a header based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
My bet is on the LZW being *far* better than a header-based encoding. Natural language, which a large part of textual data consists of, tends to have a lot of built-in redundancy, and therefore is highly compressible. A proper compression algorithm will beat any header-based size reduction scheme, while still maintaining the context-free nature of UTF-8.
T
-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
May 17 2018