digitalmars.D - Of possible interest: fast UTF8 validation
- Andrei Alexandrescu (1/1) May 16 2018 https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_str...
- Ethan Watson (10/11) May 16 2018 I re-implemented some common string functionality at Remedy using
- Andrei Alexandrescu (6/18) May 16 2018 Is it workable to have a runtime-initialized flag that controls using
- Ethan Watson (14/16) May 16 2018 Sure, it's workable with these kind of speed gains. Although the
- Walter Bright (26/29) May 16 2018 I used to do things like that a simpler way. 3 functions would be create...
- Ethan (28/41) May 16 2018 It certainly sounds reasonable enough for 99% of use cases. But
- Walter Bright (3/8) May 16 2018 Linkers already do that. Alignment is specified on all symbols emitted b...
- rikki cattermole (2/11) May 16 2018 Would allowing align attribute on functions, make sense here for Ethan?
- Ethan (16/19) May 17 2018 Mea culpa. Upon further thinking, two things strike me:
- xenon325 (16/26) May 16 2018 Is this basically the same as Function MultiVersioning [1] ?
- Walter Bright (2/5) May 16 2018 It would be nice to get this technique put into std.algorithm!
- Jack Stouffer (6/7) May 16 2018 D doesn't seem to have C definitions for the x86 SIMD intrinsics,
- Ethan Watson (13/15) May 16 2018 Replying to highlight this.
- David Nadlinger (24/27) May 18 2018 To provide some context here: LDC only supports the types from
- Joakim (4/5) May 16 2018 Sigh, this reminds me of the old quote about people spending a
- Dmitry Olshansky (7/13) May 16 2018 Validating UTF-8 is super common, most text protocols and files
- Joakim (4/18) May 16 2018 I think you know what I'm referring to, which is that UTF-8 is a
- Jack Stouffer (6/9) May 16 2018 UTF-8 seems like the best option available given the problem
- Andrei Alexandrescu (9/27) May 16 2018 I find this an interesting minority opinion, at least from the
- Walter Bright (5/7) May 16 2018 Me too. I think UTF-8 is brilliant (and I suffered for years under the l...
- Jonathan M Davis (22/29) May 16 2018 I'm inclined to think that the redundancy is a serious flaw. I'd argue t...
- Joakim (39/73) May 16 2018 Thanks for the link, skipped to the part about text encodings,
- Patrick Schluter (45/122) May 17 2018 This is not practical, sorry. What happens when your message
- Joakim (46/91) May 17 2018 Why would it lose the header? TCP guarantees delivery and
- Patrick Schluter (98/192) May 17 2018 What does TCP/IP got to do with anything in discussion here.
- H. S. Teoh (23/24) May 17 2018 Yes. Imagine if we standardized on a header-based string encoding, and
- Patrick Schluter (6/21) May 17 2018 That's what rtf with code pages was essentially. I'm happy that
- Neia Neutuladh (35/60) May 18 2018 You'd have three data structures: Strand, Rope, and Slice.
- Andrei Alexandrescu (2/4) May 17 2018 Impressive! Is that the Europarl?
- Patrick Schluter (24/28) May 17 2018 No, Euramis. The central translation memory developed by the
- Walter Bright (4/13) May 17 2018 It sounds like the main issue is that a header based encoding would take...
- Dmitry Olshansky (10/27) May 17 2018 Indeed, and some other compression/deduplication options that
- Ethan (7/9) May 17 2018 Quoting to highlight and agree.
- Joakim (37/73) May 18 2018 The point wasn't that TCP is handling all the errors, it was a
- Nemanja Boric (5/12) May 18 2018 Welcome to my world (and probably world of most Europeans) where
- Joakim (41/67) May 17 2018 In general, you would be wrong, a carefully designed binary
- H. S. Teoh (10/27) May 17 2018 My bet is on the LZW being *far* better than a header-based encoding.
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
May 16 2018
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/

I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

The code linked doesn't seem to use any instructions newer than SSE2, so it's perfectly safe to run on any x64 processor. Could probably be sped up with newer SSE instructions if you're only ever running internally on hardware you control.
May 16 2018
On 05/16/2018 08:47 AM, Ethan Watson wrote:
> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>
> I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

Is it workable to have a runtime-initialized flag that controls using SSE vs. conservative?

> The code linked doesn't seem to use any instructions newer than SSE2, so it's perfectly safe to run on any x64 processor. Could probably be sped up with newer SSE instructions if you're only ever running internally on hardware you control.

Even better! Contributions would be very welcome.

Andrei
May 16 2018
On Wednesday, 16 May 2018 at 13:54:05 UTC, Andrei Alexandrescu wrote:
> Is it workable to have a runtime-initialized flag that controls using SSE vs. conservative?

Sure, it's workable with these kinds of speed gains. Although the conservative code path ends up being slightly worse off - an extra fetch, compare and branch get introduced.

My preferred method though is to just build multiple sets of binaries as DLLs/SOs/DYNLIBs, then load in the correct libraries dependent on the CPUID test at program initialisation. Current Xbox/Playstation hardware is pretty terrible when it comes to branching, so compiling with minimal branching and deploying the exact binaries for the hardware capabilities is the way I generally approach things.

We never got around to setting something like that up for the PC release of Quantum Break, although we definitely talked about it.
May 16 2018
On 5/16/2018 7:38 AM, Ethan Watson wrote:
> My preferred method though is to just build multiple sets of binaries as DLLs/SOs/DYNLIBs, then load in the correct libraries dependent on the CPUID test at program initialisation.

I used to do things like that a simpler way. 3 functions would be created:

  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature(). It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible.

The advantage is there was only one binary.

----

The PDP-11 had an optional chipset to do floating point. The compiler generated function calls that emulated the floating point:

  call FPADD
  call FPSUB
  ...

Those functions would check to see if the FPU existed. If it did, it would in-place patch the binary to replace the calls with FPU instructions! Of course, that won't work these days because of protected code pages.

----

In the bad old DOS days, emulator calls were written out by the compiler. Special relocation fixup records were emitted for them. The emulator or the FPU library was then linked in, and included special relocation fixup values which tricked the linker fixup mechanism into patching those instructions with either emulator calls or FPU instructions. It was just brilliant!
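[Editor's note: a minimal D sketch of the lazy-dispatch pattern described above. It is not from the original post; the use of core.cpuid's sse42 check and the validator signatures are assumptions made purely for illustration.]

  import core.cpuid : sse42;

  alias Validator = bool function(const(char)[]);

  // Placeholders standing in for a real hardware/emulated pair.
  bool featureInHardware(const(char)[] s) { /* SSE 4.2 code path */ return true; }
  bool emulateFeature(const(char)[] s)    { /* portable fallback */  return true; }

  // The first call goes through select(), which swaps the pointer to the real
  // implementation; every later call pays only the one indirect call.
  bool select(const(char)[] s)
  {
      doIt = sse42 ? &featureInHardware : &emulateFeature;
      return doIt(s);
  }

  Validator doIt = &select;   // all callers go through doIt(...)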
May 16 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature(). It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible. The advantage is there was only one binary.

It certainly sounds reasonable enough for 99% of use cases. But I'm definitely the 1% here ;-)

Indirect calls invoke the wrath of the branch predictor on XB1/PS4 (ie an AMD Jaguar processor). But there's certainly some more interesting non-processor behaviour, at least on MSVC compilers. The provided auto-DLL loading in that environment performs a call to your DLL-boundary-crossing function, which actually winds up in a jump table that performs a jump instruction to actually get to your DLL code. I suspect this is more costly than the indirect jump at a "write a basic test" level. Doing an indirect call as the only action in a for-loop is guaranteed to bring out the costly branch predictor on the Jaguar.

Without getting in and profiling a bunch of stuff, I'm not entirely sure which approach I'd prefer in the general case. Certainly, as far as this particular thread goes, every general purpose function of a few lines that I write that uses intrinsics is forced inline. No function calls, indirect or otherwise. And on top of that, the inlined code usually pushes the branches in the code out across byte boundary lines just far enough that the simple branch predictor is only ever invoked.

(Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
May 16 2018
On 5/16/2018 10:28 AM, Ethan wrote:
> (Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)

Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.
May 16 2018
On 17/05/2018 8:34 AM, Walter Bright wrote:
> On 5/16/2018 10:28 AM, Ethan wrote:
>> (Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
>
> Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.

Would allowing an align attribute on functions make sense here for Ethan?
May 16 2018
And at the risk of getting this topic back on track:

On Wednesday, 16 May 2018 at 20:34:26 UTC, Walter Bright wrote:
> Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.

Mea culpa. Upon further thinking, two things strike me:

1) As suggested, there's no way to instruct the front-end to align functions to byte boundaries outside of "optimise for speed" command line flags
2) I would have heavily relied on incremental linking to iterate on these tests when trying to work out how the processor behaved. I expect MSVC's incremental linker would turn out to be just rubbish enough to not care about how those flags originally behaved.

On Wednesday, 16 May 2018 at 20:36:10 UTC, Walter Bright wrote:
> It would be nice to get this technique put into std.algorithm!

The code I wrote originally was C++ code with intrinsics. But I can certainly look at adapting it to DMD/LDC. The DMD frontend providing natural mappings for Intel's published intrinsics would be massively beneficial here.
May 17 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature(). It costs an indirect call [...]

Is this basically the same as Function MultiVersioning [1]?

I never had a need to use it and always wondered how it works out in real life. From the description it seems this would incur an indirection:

"To keep the cost of dispatching low, the IFUNC [2] mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump indirect instruction."

In the linked article [2] Ian Lance Taylor says glibc uses this for memcpy(), so this should be pretty efficient (but then again, one doesn't call memcpy() in hot loops too often).

[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning
[2] https://www.airs.com/blog/archives/403

-- Alexander
May 16 2018
On 5/16/2018 5:47 AM, Ethan Watson wrote:
> I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

It would be nice to get this technique put into std.algorithm!
May 16 2018
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/

D doesn't seem to have C definitions for the x86 SIMD intrinsics, which is a bummer: https://issues.dlang.org/show_bug.cgi?id=18865

It's too bad that nothing came of std.simd.
May 16 2018
On Wednesday, 16 May 2018 at 14:25:07 UTC, Jack Stouffer wrote:
> D doesn't seem to have C definitions for the x86 SIMD intrinsics, which is a bummer

Replying to highlight this.

There's core.simd, which doesn't look anything like the SSE/AVX intrinsics at all and looks a lot more like a wrapper for writing assembly instructions directly. And even better - LDC doesn't support core.simd and has its own intrinsics that don't match the SSE/AVX intrinsics API published by Intel. And since I'm a multi-platform developer, the "What about NEON intrinsics?" question always sits in the back of my mind.

I ended up implementing my own SIMD primitives in Binderoo, but they're all versioned out for LDC at the moment until I look into it and complete the implementation.
May 16 2018
On Wednesday, 16 May 2018 at 14:48:54 UTC, Ethan Watson wrote:
> And even better - LDC doesn't support core.simd and has its own intrinsics that don't match the SSE/AVX intrinsics API published by Intel.

To provide some context here: LDC only supports the types from core.simd, but not the __simd "assembler macro" that DMD uses to more or less directly emit the corresponding x86 opcodes.

LDC does support most of the GCC-style SIMD builtins for the respective target (x86, ARM, …), but there are two problems with this:

1) As Ethan pointed out, the GCC API does not match Intel's intrinsics; for example, it is `__builtin_ia32_vfnmsubpd256_mask3` instead of `_mm256_mask_fnmsub_pd`, and the argument orders differ as well.

2) The functions that LDC exposes as intrinsics are those that are intrinsics on the LLVM IR level. However, some operations can be directly represented in normal, instruction-set-independent LLVM IR – no explicit intrinsics are provided for these.

Unfortunately, LLVM doesn't seem to provide any particularly helpful tools for implementing Intel's intrinsics API. x86intrin.h is manually implemented for Clang as a collection of various macros and functions. It would be seriously cool if someone could write a small tool to parse those headers, (semi-)automatically convert them to D, and generate tests for comparing the emitted IR against Clang. I'm happy to help with the LDC side of things.

— David
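[Editor's note: for reference, a tiny sketch of the portable subset mentioned above - the core.simd vector types with their built-in element-wise operators, which both DMD and LDC accept; no __simd, no GCC-style builtins, no Intel intrinsics involved.]

  import core.simd;

  // Element-wise addition on a 128-bit vector type; on x86 this should lower
  // to a single PADDD without any explicit intrinsic call.
  int4 addFours(int4 a, int4 b)
  {
      return a + b;
  }

  void main()
  {
      int4 x = [1, 2, 3, 4];
      int4 y = [10, 20, 30, 40];
      int4 z = addFours(x, y);  // [11, 22, 33, 44]
  }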
May 18 2018
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/

Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
May 16 2018
On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>
> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.

Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so.

I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
May 16 2018
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>
>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>
> Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...

I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
May 16 2018
On Wednesday, 16 May 2018 at 17:18:06 UTC, Joakim wrote:
> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

UTF-8 seems like the best option available given the problem space. Junk data is going to be a problem with any possible string format, given that encoding translations and programmer error will always be prevalent.
May 16 2018
On 5/16/18 1:18 PM, Joakim wrote:
> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>
>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>
>> Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
>
> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Andrei
May 16 2018
On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!

Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.
May 16 2018
On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
>> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!
>
> Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder! Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.

I'm inclined to think that the redundancy is a serious flaw. I'd argue that if it were truly well-designed, there would be exactly one way to represent every character - including clear up to grapheme clusters where multiple code points are involved (i.e. there would be no normalization issues in valid Unicode, because there would be only one valid normalization). But there may be some technical issues that I'm not aware of that would make that problematic.

Either way, the issues that I have with UTF-8 are issues that UTF-16 and UTF-32 have as well, since they're really issues relating to code points. Overall, I think that UTF-8 is by far the best encoding that we have, and I don't think that we're going to get anything better, but I'm also definitely inclined to think that it's still flawed - just far less flawed than the alternatives.

And in general, I have to wonder if there would be a way to make Unicode less complicated if we could do it from scratch without worrying about any kind of compatibility, since what we have is complicated enough that most programmers don't come close to understanding it, and it's just way too hard to get right. But I suspect that if efficiency matters, there's enough inherent complexity that we'd just be screwed on that front even if we could do a better job than was done with Unicode as we know it.

- Jonathan M Davis
May 16 2018
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
> On 5/16/18 1:18 PM, Joakim wrote:
>> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>>
>>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>>
>>> Validating UTF-8 is super common; most text protocols and files these days would use it, and others would have an option to do so. I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
>>
>> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
>
> I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.

> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does.

This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again. I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992 as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is that those priorities are at all relevant today, when billions of smartphone users are regularly not using ASCII, and these tech companies are the largest private organizations on the planet, ie they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.

I think a header-based scheme would be _much_ better today and the reason I know Dmitry knows that is that I have discussed privately with him over email that I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.
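[Editor's note: to make the discussion concrete, here is one purely hypothetical shape such a header-based format could take. Nothing in this sketch comes from the prototype mentioned above, which is not specified in the thread; every name and field is an illustrative assumption.]

  // Hypothetical: each segment declares a language/code-page id and a length,
  // and its payload is then (mostly) single-byte in that language's table.
  struct Segment
  {
      ushort lang;    // illustrative language/code-page identifier
      uint length;    // payload bytes covered by this segment
  }

  struct HeaderString
  {
      Segment[] header;   // the "header": one entry per language run
      ubyte[] payload;    // concatenated segment payloads
  }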
May 16 2018
On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
> On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
>> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!
>
> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does.
> [...]
> Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.

This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled. That's exactly what happened with code page based texts when you don't know in which code page it is encoded. It has the supplemental inconvenience that mixing languages becomes impossible or at least very cumbersome.

UTF-8 has several properties that are difficult to have with other schemes.
- It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or a previous byte.
- It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check wikipedia's front page for example).
- The multi-byte nature of other alphabets is not as bad as people think, because texts in computers do not live on their own, meaning that they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few bytes more in the text do not weigh that much.

I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002, when we handled only 11 languages, of which only 1 was of another alphabet (Greek). Everything was based on RTF with codepages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings it was completely unmanageable because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).

2 years ago we also implemented support for Chinese. The nice thing was that we didn't have to change much to do that, thanks to Unicode. The second surprise was with the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but generally 1 ideogram replaces many letters. The ideogram 亿 replaces "one hundred million", for example; which of them takes more bytes? So if CJK indeed requires more bytes to encode, it is firstly because they NEED many more bits in the first place (there are around 30000 CJK codepoints in the BMP alone; add to it the 60000 that are in the SIP and we need 17 bits just to encode them).
May 17 2018
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
> This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled.

Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.

I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format, it's better done by other layers, like transport protocols or filesystems, which will guard against such losses much more reliably and efficiently. For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.

> That's exactly what happened with code page based texts when you don't know in which code page it is encoded. It has the supplemental inconvenience that mixing languages becomes impossible or at least very cumbersome. UTF-8 has several properties that are difficult to have with other schemes.
> - It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or a previous byte.

I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.

> - It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check wikipedia's front page for example).

I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.

> - The multi-byte nature of other alphabets is not as bad as people think, because texts in computers do not live on their own, meaning that they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few bytes more in the text do not weigh that much.

Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

> I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages and without UTF-8 and UTF-16 it would be unmanageable. [...] (addresses and names are often written in their original form, for instance).

I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.

> 2 years ago we also implemented support for Chinese. The nice thing was that we didn't have to change much to do that, thanks to Unicode. The second surprise was with the file sizes: Chinese documents were generally smaller than their European counterparts. [...]

That's not the relevant criteria: nobody cares if the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Because almost nobody cares about which translation version is smaller, they care that the text they sent in Chinese or Korean is as small as it can be.

Anyway, I didn't mean to restart this debate, so I'll leave it here.
May 17 2018
On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
> On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
>> This is not practical, sorry. What happens when your message loses the header?
>
> Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.

What does TCP/IP have to do with anything in the discussion here? UTF-8 (or UTF-16 or UTF-32) has nothing to do with network protocols. That's completely unrelated. A file encoded on a disk may never leave the machine it is written on and may never see a wire in its lifetime, and its encoding is still of vital importance. That's why a header encoding is too restrictive.

> I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format, it's better done by other layers, like transport protocols or filesystems, which will guard against such losses much more reliably and efficiently.

No. A text format cannot depend on a network protocol. It would be as if you could only listen to a music file or watch a video on streaming and never save it to an offline file, because the information of what that blob of bytes represents would be stored nowhere. It doesn't make any sense.

> For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.

That's the job of the other layers. Any other file would have the same problem. At least with UTF-8 there will only ever be at most 1 codepoint lost or changed; no other encoding would fare better. This said, if a checksum header for your document is important you can add it externally anyway.

> I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.

Again, orthogonal to UTF-8. When I speak above of streams it is not limited to sockets; files are also read in streams. So stop equating UTF-8 with the Internet, these are 2 different domains. The Internet and its protocols were defined and invented long before Unicode, and Unicode is very useful offline as well.

> I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.

Ok, show me how you transmit that, I'm curious:

 <prop type="Txt::Doc. No.">E2010C0002</prop>
 <tuv lang="EN-GB"><seg>EFTA Surveillance Authority Decision</seg></tuv>
 <tuv lang="DE-DE"><seg>Beschluss der EFTA-Überwachungsbehörde</seg></tuv>
 <tuv lang="DA-01"><seg>EFTA-Tilsynsmyndighedens beslutning</seg></tuv>
 <tuv lang="EL-01"><seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg></tuv>
 <tuv lang="ES-ES"><seg>Decisión del Órgano de Vigilancia de la AELC</seg></tuv>
 <tuv lang="FI-01"><seg>EFTAn valvontaviranomaisen päätös</seg></tuv>
 <tuv lang="FR-FR"><seg>Décision de l'Autorité de surveillance AELE</seg></tuv>
 <tuv lang="IT-IT"><seg>Decisione dell’Autorità di vigilanza EFTA</seg></tuv>
 <tuv lang="NL-NL"><seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg></tuv>
 <tuv lang="PT-PT"><seg>Decisão do Órgão de Fiscalização da EFTA</seg></tuv>
 <tuv lang="SV-SE"><seg>Beslut av Eftas övervakningsmyndighet</seg></tuv>
 <tuv lang="LV-01"><seg>EBTA Uzraudzības iestādes Lēmums</seg></tuv>
 <tuv lang="CS-01"><seg>Rozhodnutí Kontrolního úřadu ESVO</seg></tuv>
 <tuv lang="ET-01"><seg>EFTA järelevalveameti otsus</seg></tuv>
 <tuv lang="PL-01"><seg>Decyzja Urzędu Nadzoru EFTA</seg></tuv>
 <tuv lang="SL-01"><seg>Odločba Nadzornega organa EFTE</seg></tuv>
 <tuv lang="LT-01"><seg>ELPA priežiūros institucijos sprendimas</seg></tuv>
 <tuv lang="MT-01"><seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg></tuv>
 <tuv lang="SK-01"><seg>Rozhodnutie Dozorného orgánu EZVO</seg></tuv>
 <tuv lang="BG-01"><seg>Решение на Надзорния орган на ЕАСТ</seg></tuv>
</tu>
<tu>

> Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

They aren't dying off, it's getting worse by the day; that's why I mentioned Akoma Ntoso and XLIFF, they will be used more and more. The world is not limited to webshit (see n-gate.com for the reference).

> I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.

I doubt it, because the issue has nothing to do with network protocols as you seem to imply. It is about data format, i.e. the content that may be shuffled over a net, but can also stay on a disk, be printed on paper (gasp, so old tech) or used interactively in a GUI.

> That's not the relevant criteria: nobody cares if the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Because almost nobody cares about which translation version is smaller, they care that the text they sent in Chinese or Korean is as small as it can be.

At most 50% more, but if the size is really that important one can use UTF-16, which is the same size as Big-5 or Shift-JIS, or as Walter suggested they would better compress the file in that case.

> Anyway, I didn't mean to restart this debate, so I'll leave it here.

- the auto-synchronization and the statelessness are big deals.
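[Editor's note: a small sketch of what the auto-synchronization property buys in practice - from an arbitrary byte offset you can find the start of the current code point purely locally, because continuation bytes always have the form 10xxxxxx. This code is illustrative and not from the post.]

  // Back up from any byte index to the first byte of the code point it falls
  // in; no header, table, or earlier state is needed.
  size_t codePointStart(const(ubyte)[] s, size_t i)
  {
      while (i > 0 && (s[i] & 0xC0) == 0x80)
          --i;    // 0b10xxxxxx marks a UTF-8 continuation byte
      return i;
  }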
May 17 2018
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote:
[...]
> - the auto-synchronization and the statelessness are big deals.

Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents? Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

T

--
Famous last words: I *think* this will work...
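[Editor's note: for contrast, the cheap slicing referred to above as it exists in D today - a substring of a UTF-8 string is just a view of the same bytes. Illustrative sketch.]

  // Slicing a D string is O(1), copies nothing and allocates nothing; the only
  // requirement is that the offsets fall on code unit boundaries, which UTF-8's
  // self-synchronization lets you check locally.
  void main()
  {
      string s = "EFTA Surveillance Authority Decision";
      string sub = s[5 .. 17];    // shares s's memory
      assert(sub == "Surveillance");
  }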
May 17 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms. [...]

That's what rtf with code pages was, essentially. I'm happy that we got rid of it and that it was replaced by xml; even if Microsoft's document xml is a bloated, ridiculous mess, it's still an order of magnitude less problematic than rtf (I mean at the text encoding level).
May 17 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote:
> [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc.. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

You'd have three data structures: Strand, Rope, and Slice. A Strand is a series of bytes with an encoding. A Rope is a series of Strands. A Slice is a pair of location references within a Rope. You probably want a special datastructure to name a location within a Rope: Strand offset, then byte offset. Total of five words instead of two to pass a Slice, but zero dynamic allocations.

This would be a problem for data locality. However, rope-style datastructures are handy for some types of string manipulation.

As an alternative, you might have a separate document specifying what encodings apply to what byte ranges. Slices would then be three words long (pointer to the string struct, start offset, end offset). Iterating would cost O(log(S) + M), where S is the number of encoded segments and M is the number of bytes in the slice.

Anyway, you either get a more complex data structure, or you have terrible time complexity, but you don't have both.

> And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents? Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

"Header" implies that all encoding data appears at the start of the document, or in a separate metadata segment. (Call it a start index and two bytes to specify the encoding; reserve the first few bits of the encoding to specify the width.) It also brings to mind HTTP, and reminds me that most documents are either mostly ASCII or a heavy mix of ASCII and something else (HTML and XML being the forerunners).

If the encoding succeeded at making most scripts single-byte, then, testing with https://ar.wikipedia.org/wiki/Main_Page, you might get within 15% of UTF-8's efficiency. And then a simple sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as long in this encoding as UTF-8, since it has ten encoded segments, each with overhead. (Assuming the header supports strings up to 2^32 bytes long.)

If it didn't succeed at making Latin and Arabic single-byte scripts (and Latin contains over 800 characters in Unicode, while Arabic has over three hundred), it would be worse than UTF-16.
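[Editor's note: a rough D sketch of the three structures described above; the exact field layout is an assumption, shown only to make the "five words" count concrete.]

  struct Strand { ubyte[] bytes; ushort encoding; } // bytes in one known encoding
  struct Rope   { Strand[] strands; }               // a text is a sequence of strands

  struct Loc    { size_t strand; size_t offset; }   // strand index + byte offset
  struct Slice                                      // rope pointer + two Locs:
  {                                                 // five words, zero allocations
      Rope* rope;
      Loc   start;
      Loc   end;
  }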
May 18 2018
On 05/17/2018 09:14 AM, Patrick Schluter wrote:
> I'm in charge at the European Commission of the biggest translation memory in the world.

Impressive! Is that the Europarl?
May 17 2018
On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu wrote:
> On 05/17/2018 09:14 AM, Patrick Schluter wrote:
>> I'm in charge at the European Commission of the biggest translation memory in the world.
>
> Impressive! Is that the Europarl?

No, Euramis. The central translation memory developed by the Commission and used also by the other institutions. The database contains more than a billion segments from parallel texts and is afaik the biggest of its kind. One of the big strengths of the Euramis TM is its multi-target-language store; this allows fuzzy searches in all combinations, including indirect translations (i.e. if a document written in English was translated into Romanian and into Maltese, it is then possible to search for alignments between ro and mt). It's not the only system to do that, but at that volume it is quite unique.

Every year we also publish an extract of it covering the published legislation [1] from the Official Journal so that it can be used by the research community. All the machine translation engines use it. It is one of the most accessed data collections on the European Open Data portal [2].

The very uncommon thing about the backend software of EURAMIS is that it is written in C. Pure unadulterated C. I'm trying to introduce D, but with the strange (to say it politely) configurations our servers have it is quite challenging.

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
[2]: http://data.europa.eu/euodp/fr/data
May 17 2018
On 5/16/2018 10:01 PM, Joakim wrote:
> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.

It sounds like the main issue is that a header based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
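[Editor's note: a quick way one might test that hypothesis, sketched with std.zlib - DEFLATE rather than LZW, so only an approximation of what Walter suggests, and the sample string is arbitrary.]

  import std.stdio : writefln;
  import std.zlib : compress;

  void main()
  {
      // Any non-ASCII UTF-8 sample would do; Cyrillic is ~2 bytes per letter.
      string sample = "Решение на Надзорния орган на ЕАСТ";
      auto packed = compress(sample);   // general-purpose compression layer
      writefln("raw UTF-8: %s bytes, compressed: %s bytes",
               sample.length, packed.length);
  }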
May 17 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, [...]
>
> It sounds like the main issue is that a header based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

Indeed, and there are some other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element, for instance).

Anything that depends on external information and is not self-sync is awful for interchange. Internally the application can do some smarts, though; even then things like interning (partial interning) might be a more valuable approach.

TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.
May 17 2018
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.

Quoting to highlight and agree. TCP is reliable because it resends dropped packets and delivers them in order. I don't write TCP packets to my long-term storage medium. UTF as a transportation protocol? Unicode is *far* more useful than just sending across a network.
May 17 2018
On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
> On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
>> TCP being reliable just plain doesn't cut it. Corruption of a single bit is very real.
> Quoting to highlight and agree. TCP is reliable because it resends dropped packets and delivers them in order. I don't write TCP packets to my long-term storage medium. UTF as a transportation protocol for Unicode is *far* more useful than just sending across a network.

The point wasn't that TCP is handling all the errors; it was a throwaway example of one other layer of the system, the network transport layer, that actually has a checksum which will detect a single bit flip, something UTF-8 usually will not. I mentioned that the filesystem and several other layers have their own error detection too, yet you guys latch on to the TCP example alone.

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote:
> [...]
>> - the auto-synchronization and the statelessness are big deals.
> Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementational nightmare and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the nogc guys would be up in arms.

As we discussed when I first raised this header scheme years ago, you're right that slicing could be more expensive, depending on whether you choose to allocate a new header for the substring or not. The question is whether the optimizations available from such a header, which tells you where all the language substrings are in a multi-language string, make up for having to expensively process the entire UTF-8 string to get that or other data. I think it's fairly obvious the design tradeoff of the header would beat out UTF-8 for all but a few degenerate cases, but maybe you don't see it.

> And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?

It would bloat the header to some extent, but less so than a UTF-8 string. You may want to use special header encodings for such edge cases too, if you want to maintain the same large performance lead over UTF-8 that you'd have for the common case.

> Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into three segments, complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

Personally, I don't consider emojis worth implementing :) they shouldn't be part of Unicode. But since they are, I'm fairly certain header-based text messages with emojis would be significantly smaller than using UTF-8/16. I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why this was, and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS itself, but I strongly suspect it is widespread.

Anyway, I can see the arguments about UTF-8 this time around are as bad as the first time I raised this five years back, so I'll leave this thread here.
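To make the slicing tradeoff in this exchange concrete, here is a rough D sketch. The HeaderString layout below is purely hypothetical, invented only for illustration and not anything specified in this thread; the point is simply that built-in D strings already give O(1), allocation-free UTF-8 slicing, while any metadata-carrying representation has to rebuild that metadata when sliced.

import std.stdio;

// A purely hypothetical header-based layout, for illustration only:
// each Run says "the next `length` payload bytes are in language `lang`".
struct Run { ushort lang; size_t length; }

struct HeaderString
{
    Run[] header;    // one entry per language segment
    ubyte[] payload; // single-byte text for each segment

    // Taking a substring by payload offsets means walking the header and
    // building a new one, so unlike a D slice it allocates and costs
    // O(number of runs) rather than O(1).
    HeaderString substring(size_t from, size_t to) const
    {
        Run[] newHeader;
        size_t pos = 0;
        foreach (run; header)
        {
            immutable start = pos;
            immutable end = pos + run.length;
            if (end > from && start < to)
            {
                immutable lo = from > start ? from : start;
                immutable hi = to < end ? to : end;
                newHeader ~= Run(run.lang, hi - lo);
            }
            pos = end;
        }
        return HeaderString(newHeader, payload[from .. to].dup);
    }
}

void main()
{
    // Built-in UTF-8: a substring is just a slice -- O(1), no copy.
    string s = "Hello, мир!";
    writeln(s[7 .. 13]); // "мир"

    // The header-based sketch must rebuild metadata for the same operation.
    auto h = HeaderString([Run(1, 5), Run(2, 5), Run(1, 5)],
                          cast(ubyte[]) "aaaaabbbbbccccc".dup);
    auto sub = h.substring(3, 8); // crosses a run boundary
    writeln(cast(string) sub.payload, " -> ", sub.header.length, " run(s)");
}

Whether that rebuild cost is outweighed by knowing where every language segment sits is exactly the tradeoff being argued above.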
May 18 2018
On Friday, 18 May 2018 at 08:44:41 UTC, Joakim wrote:
> I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why this was, and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS itself, but I strongly suspect it is widespread.

Welcome to my world (and probably the world of most Europeans): I haven't typed ć, č, ž and other non-ASCII letters since the early 2000s, even though SMS is mostly flat rate these days and people chat via WhatsApp anyway.
May 18 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
> It sounds like the main issue is that a header-based encoding would take less size?

Yes, and it would be easier to process.

> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

In general, you would be wrong: a carefully designed binary format will usually beat the pants off general-purpose compression: https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format to specific types of data, text in this case, and take advantage of patterns in that subset, as specialized image compression formats do. In this case, though, I haven't compared this scheme to general compression of UTF-8 strings, so I don't know which would compress better. However, that would mostly matter for network transmission; another big gain of a header-based scheme that doesn't use compression is much faster string processing in memory. Yes, the average end user doesn't care about this, but giant consumers of text data, like search engines, would benefit greatly from it.

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that would allow limited random access / slicing (by decoding a single "block" to access an element, for instance).

Possibly competitive for compression, but only for transmission over the network; unlikely for processing, as noted for Walter's idea.

> Anything that depends on external information and is not self-sync is awful for interchange.

You are describing the vast majority of all formats and protocols; it's amazing how we got by with them all this time.

> Internally the application can do some smarts though, but even then things like interning (partial interning) might be a more valuable approach. TCP being reliable just plain doesn't cut it. Corruption of a single bit is very real.

You seem to have missed my point entirely: UTF-8 will not catch most bit flips either, only those that happen to corrupt certain key bits in a certain way, a minority of the possibilities. Nobody is arguing that data corruption doesn't happen or that error correction shouldn't be done somewhere. The question is whether the extremely limited robustness that UTF-8's significant redundancy buys is a good tradeoff. I think it's obvious that it isn't, and I posit that anybody who knows anything about error-correcting codes would agree with that assessment. You would be much better off with a more compact header-based transfer format, layering on the level of error correction you need at a different level, which as I noted is already done at the link and transport layers and in various other parts of the system. If you need more error correction than that, do it right, not in the broken way UTF-8 does. Honestly, error detection/correction is the most laughably broken part of UTF-8; it is amazing that people even bring it up as a benefit.
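For what it's worth, the bit-flip claim is easy to check with nothing beyond Phobos. This small sketch, assuming only std.utf, flips single bits in the two-byte sequence for 'é' and asks the validator what it sees:

import std.stdio;
import std.utf : validate, UTFException;

// True if the byte sequence passes UTF-8 validation.
bool accepted(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(string) bytes.idup);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // U+00E9 'é' encodes as the two bytes 0xC3 0xA9.
    ubyte[] original = [0xC3, 0xA9];
    writeln(accepted(original)); // true

    // Flip the lowest bit of the continuation byte (0xA9 -> 0xA8):
    // still valid UTF-8, the text just silently became U+00E8 'è'.
    ubyte[] lowFlip = [0xC3, 0xA8];
    writeln(accepted(lowFlip)); // true -- corruption goes undetected

    // Flip the top bit of the continuation byte (0xA9 -> 0x29, ASCII ')'):
    // the lead byte 0xC3 now lacks its continuation, so validation fails.
    ubyte[] highFlip = [0xC3, 0x29];
    writeln(accepted(highFlip)); // false -- this kind of flip is caught
}

Flips that stay within the payload bits land on another valid sequence; only the ones that break the lead/continuation structure are caught, which is the "minority of the possibilities" point above.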
May 17 2018
On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
> It sounds like the main issue is that a header-based encoding would take less size? If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

My bet is on LZW being *far* better than a header-based encoding. Natural language, which a large part of textual data consists of, tends to have a lot of built-in redundancy and is therefore highly compressible. A proper compression algorithm will beat any header-based size-reduction scheme, while still maintaining the context-free nature of UTF-8.

T

--
In a world without fences, who needs Windows and Gates? -- Christian Surchi
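The compression half of this bet is also easy to try. The sketch below uses Phobos's std.zlib, which is DEFLATE rather than LZW, so it is only an approximation of the suggestion; it just shows that ordinary, repetitive prose compresses well on top of plain UTF-8.

import std.stdio;
import std.zlib : compress;

void main()
{
    // Deliberately repetitive natural-language text, standing in for prose.
    string text;
    foreach (i; 0 .. 50)
        text ~= "the quick brown fox jumps over the lazy dog. ";

    // General-purpose compression exploits that redundancy directly,
    // while the UTF-8 underneath stays context-free and self-synchronizing.
    auto packed = compress(text);
    writefln("%s bytes of UTF-8 -> %s bytes compressed",
             text.length, packed.length);
}

On text this repetitive the compressed size comes out far smaller; real prose won't shrink as dramatically, and whether it beats a tailored header format is the open question in this subthread.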
May 17 2018