
digitalmars.D - Fix Phobos dependencies on autodecoding

reply Walter Bright <newshound2 digitalmars.com> writes:
We don't yet have a good plan on how to remove autodecoding and yet provide 
backward compatibility with autodecoding-reliant projects, but one thing we can 
do is make Phobos work properly with and without autodecoding.

To that end, I created a build of Phobos that disables autodecoding:

https://github.com/dlang/phobos/pull/7130

Of course, it fails. If people want impactful things to work on, fixing each 
failure is worthwhile (each in separate PRs).

Note that this is neither trivial nor mindless code editing. Each case has to be 
examined as to why it is doing autodecoding, whether autodecoding is necessary, and 
whether to replace it with byChar, byDchar, or simply hardcoded decoding logic.
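For anyone picking up one of these cases, the difference between the options is easiest to see on a small string (a hedged sketch, not taken from the PR): byChar iterates raw code units with no decoding at all, while byDchar makes the decoding explicit at the call site.

```d
import std.range : walkLength;
import std.utf : byChar, byDchar;

void main()
{
    string s = "héllo"; // 'é' takes 2 UTF-8 code units

    // Autodecoding: range primitives over a string yield dchar code points.
    assert(s.walkLength == 5);

    // byChar: iterate raw char code units, no decoding at all.
    assert(s.byChar.walkLength == 6);

    // byDchar: still decodes, but explicitly and visibly.
    assert(s.byDchar.walkLength == 5);
}
```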
Aug 13
next sibling parent reply a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 We don't yet have a good plan on how to remove autodecoding and 
 yet provide backward compatibility with autodecoding-reliant 
 projects, but one thing we can do is make Phobos work properly 
 with and without autodecoding.

 To that end, I created a build of Phobos that disables 
 autodecoding:

 https://github.com/dlang/phobos/pull/7130

 Of course, it fails. If people want impactful things to work 
 on, fixing each failure is worthwhile (each in separate PRs).

 Note that this is neither trivial nor mindless code editing. 
 Each case has to be examined as to why it is doing 
 autodecoding, is autodecoding necessary, and deciding to 
 replace it with byChar, byDchar, or simply hardcoding the 
 decoding logic.
imo autodecoding is the right thing. maybe it would be better to leave it as is and just add
 immutable(ubyte)[] bytes( string str ) @nogc nothrow {
     return *cast( immutable(ubyte)[]* )&str;
 }
and use it as
 foreach( b; "Привет, Мир!".bytes ) // Hello, World in RU
     writefln( "%x", b );           // 21 bytes, 12 runes
why did you decide to fight autodecoding?
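For what it's worth, Phobos already ships exactly this reinterpretation as std.string.representation, so the cast isn't needed; a small sketch:

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    string s = "Привет, Мир!"; // "Hello, World!" in Russian

    // representation reinterprets the string as its code units, no copy.
    immutable(ubyte)[] bytes = s.representation;

    assert(bytes.length == 21); // 21 UTF-8 bytes
    assert(s.walkLength == 12); // 12 code points ("runes")
}
```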
Aug 13
next sibling parent reply Alexandru Ermicioi <alexandru.ermicioi gmail.com> writes:
On Tuesday, 13 August 2019 at 07:31:28 UTC, a11e99z wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 We don't yet have a good plan on how to remove autodecoding 
 and yet provide backward compatibility with 
 autodecoding-reliant projects, but one thing we can do is make 
 Phobos work properly with and without autodecoding.

 To that end, I created a build of Phobos that disables 
 autodecoding:

 https://github.com/dlang/phobos/pull/7130

 Of course, it fails. If people want impactful things to work 
 on, fixing each failure is worthwhile (each in separate PRs).

 Note that this is neither trivial nor mindless code editing. 
 Each case has to be examined as to why it is doing 
 autodecoding, is autodecoding necessary, and deciding to 
 replace it with byChar, byDchar, or simply hardcoding the 
 decoding logic.
imo autodecoding is one of right thing. maybe will be better to leave it as is and just to add
 immutable(ubyte)[] bytes( string str )  nogc nothrow {
     return *cast( immutable(ubyte)[]* )&str;
 }
and use it as
 foreach( b; "Привет, Мир!".bytes) // Hello world in RU
     writefln( "%x", b );          // 21 bytes, 12 runes
? why u decide to fight with autodecoding?
One of the reasons is that it adds unnecessary complexity to templated code that works with ranges. Check the function prototypes of some algorithms in the std.algorithm package; you're bound to find special treatment for autodecoded strings.

It also subverts user expectations: when you apply a range function to a string, front suddenly gives you a dchar instead of a char.

Best regards,
Alexandru
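The special treatment is visible directly in the range traits; a small sketch illustrating the mismatch:

```d
import std.range.primitives : ElementEncodingType, ElementType,
    isRandomAccessRange;

void main()
{
    // Because of autodecoding, a string is not treated as a random-access
    // range, even though it is an array with O(1) indexing:
    static assert(!isRandomAccessRange!string);
    static assert( isRandomAccessRange!(int[]));

    // And its range element type (dchar) differs from its actual
    // element type (char):
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementEncodingType!string == char));
}
```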
Aug 13
parent reply a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 07:51:23 UTC, Alexandru Ermicioi 
wrote:
 On Tuesday, 13 August 2019 at 07:31:28 UTC, a11e99z wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright
One of the reasons is that it adds unnecessary complexity for templated code that is working with ranges. Check function prototypes for some algorithms found in std.algorithm package, you're bound to find special treatment for autodecoding strings. It also messes up user expectation when suddenly applying a range function on a string instead of front char you're getting dchar.
imo this is a contrived problem. a string contains chars, not in the sense of the "char" type, but runes or code points. and since the world is not perfect, those runes are stored as UTF-8 code units. in a world where "char" is an alias for "byte"/"ubyte", this view was a problem: is this buffer a string (a sequence of chars) or just raw bytes? how should it be enumerated? but we have a better world with distinct bytes and chars. probably a better name for "char" would have been "utf8cp" or something (not to be confused with the C/C++ type), and when you see a string from that point of view, everything falls into place. I don't see a problem with str.front returning a code point in 0..0x10ffff while str.length returns 21 and str.count returns 12. but somebody sees a problem here, so again: a contrived problem. and now this difference in vision will mean rewriting/rechecking tons of code. I thought WB didn't want to change code peremptorily. It must be a BIG problem if he does.
Aug 13
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, August 13, 2019 2:52:58 AM MDT a11e99z via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 07:51:23 UTC, Alexandru Ermicioi

 wrote:
 On Tuesday, 13 August 2019 at 07:31:28 UTC, a11e99z wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright
One of the reasons is that it adds unnecessary complexity for templated code that is working with ranges. Check function prototypes for some algorithms found in std.algorithm package, you're bound to find special treatment for autodecoding strings. It also messes up user expectation when suddenly applying a range function on a string instead of front char you're getting dchar.
imo this is a contrived problem. string contains chars, not in meaning "char" as type but runes or codepoints. and world is not perfect so chars/runes are stored as utf8 codepoints. in world where "char" is alias for "byte"/"ubyte" such vision was a problem: is this buffer string(seq of chars) or just raw bytes? how it should be enumerated? but we have better world with different bytes and chars. probably better was naming for "char" as "utf8cp"/orSomething (don't mix with C/C++ type) and when u/anybody see string from that point everything falls into place. I don't see problem that str.front returns codepoint from 0..0x10ffff and when str.length returns 21 and str.count=12. but somebody see problem here, so again this is a contrived problem. and for now this vision problem will recreate/recheck tons of code. I thought that WB don't want change code peremptorily. Should be BIG problem when he does.
Code points are almost always the wrong level to be operating at. Many algorithms can operate at the code unit level with no problem, whereas those that require decoding usually need to operate at the grapheme level so that the actual, conceptual characters are being compared. Just like code units aren't necessarily full characters, code points aren't necessarily full characters.

Auto-decoding was introduced because, at the time, Andrei did not have a solid enough understanding of Unicode and thought that code points were always entire characters and didn't know about graphemes.

Having auto-decoding has caused us tons of problems. It's inefficient, gives a false sense of code correctness, requires special-casing all over the place, and the whole "narrow string" concept causes all kinds of grief where algorithms don't work properly with strings, because they don't consider them to be random access, have a different type for their range element type than for their actual element type, etc.

Pretty much all of the big D contributors have thought for years now that auto-decoding was a mistake, and we've wanted to get rid of it. Many of us actually thought that autodecoding was a good idea at first, but we've all come to understand how terrible it is. Walter is one of the few that understood from the get-go, but he wasn't paying much attention to Phobos (since he usually focuses on the compiler) and didn't catch Andrei's mistake. If he had, autodecoding would never have been a thing in Phobos.

The only reason that auto-decoding still exists in Phobos is because of how hard it is to remove without breaking code. Making Phobos not rely on autodecoding, and making it so that it will work regardless of whether the character type for a range is char, wchar, dchar, or a grapheme, is exactly what we need to be doing. Some work has been done in that direction already, but nowhere near enough.

Once that's done, then we can look at how to fully remove autodecoding, be it Phobos v2 (which Andrei has already proposed) or some other clever solution. But regardless of how we go about removing auto-decoding - or even if we ultimately end up leaving it in place - we need to make Phobos autodecoding-agnostic so that it's not forced on everything.

- Jonathan M Davis
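The code unit / code point / grapheme distinction above can be made concrete with std.uni.byGrapheme; a small sketch (the combining-accent example is mine, not from the post):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible
    // character, two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(s.length == 3);                // code units
    assert(s.walkLength == 2);            // code points (autodecoding view)
    assert(s.byGrapheme.walkLength == 1); // graphemes: one character
}
```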
Aug 13
parent a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 09:15:30 UTC, Jonathan M Davis 
wrote:
 On Tuesday, August 13, 2019 2:52:58 AM MDT a11e99z via 
 Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 07:51:23 UTC, Alexandru Ermicioi
we've wanted to get rid of it. Many of us actually thought that autodecoding was a good idea at first, but we've all come to
thx for the explanations. probably I am at that stage too. ok, I can live with .byRunes and .byBytes
Aug 13
prev sibling next sibling parent Daniel Kozak <kozzi11 gmail.com> writes:
On Tue, Aug 13, 2019 at 9:35 AM a11e99z via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 imo autodecoding is one of right thing.
 maybe will be better to leave it as is and just to add
 immutable(ubyte)[] bytes( string str )  nogc nothrow {
     return *cast( immutable(ubyte)[]* )&str;
 }
and use it as
 foreach( b; "Привет, Мир!".bytes) // Hello world in RU
     writefln( "%x", b );          // 21 bytes, 12 runes
? why u decide to fight with autodecoding?
I hate autodecoding for many reasons; one of them is that it is not done right: https://run.dlang.io/is/IHECPf

```
import std.stdio;

void main()
{
    string strd = "é🜢🜢࠷❻𐝃";

    foreach (i, wchar c; strd)
        write(i);
    writeln();

    foreach (i, char c; strd)
        write(i);
    writeln();

    foreach (i, dchar c; strd)
        write(i);
}
```
Aug 13
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 07:31:28AM +0000, a11e99z via Digitalmars-d wrote:
[...]
 imo autodecoding is one of right thing.
[...]
 why u decide to fight with autodecoding?
Because it *appears* to be right, but it's actually wrong. For example:

	import std.range : retro;
	import std.stdio;

	void main() {
		writeln("привет".retro);
		writeln("приве́т".retro);
	}

Expected output:
	тевирп
	те́вирп

Actual output:
	тевирп
	т́евирп

The problem is that autodecoding makes the assumption that Unicode code point == grapheme, but this is not true. It's usually true for European languages, but it fails for many other languages. So auto-decoding gives you the illusion of correctness, but when you ship your product to Asia suddenly you get a ton of bug reports.

To guarantee correctness you need to work with graphemes (see .byGrapheme). But we can't make that the default because it's a big performance hit, and many string algorithms don't actually need grapheme segmentation.

Ultimately, the correct solution is to put the onus on the programmer to select the iteration scheme (by code units, code points, or graphemes) depending on what's actually needed at the application level. Arbitrarily choosing one of them to be the default leads to a false sense of security.


T

-- 
That's not a bug; that's a feature!
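For comparison, reversing at the grapheme level keeps the accent attached to its base letter. A hedged sketch of one possible idiom (not the only way to write it):

```d
import std.algorithm.iteration : joiner, map;
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byGrapheme;

void main()
{
    string s = "приве\u0301т";

    // Reverse graphemes, not code points, so that "е" and its combining
    // accent travel together as one unit:
    string reversed = s.byGrapheme.array
        .retro
        .map!(g => g[].text) // g[] yields the grapheme's code points
        .joiner
        .text;

    assert(reversed == "те\u0301вирп");
}
```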
Aug 13
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 [snip]

 Because it *appears* to be right, but it's actually wrong. For 
 example:

 	import std.range : retro;
 	import std.stdio;

 	void main() {
 		writeln("привет".retro);
 		writeln("приве́т".retro);
 	}

 Expected output:
 	тевирп
 	те́вирп

 Actual output:
 	тевирп
 	т́евирп
Huh, those two look the same.
Aug 13
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 04:29:33PM +0000, jmh530 via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 [snip]
 
 Because it *appears* to be right, but it's actually wrong. For example:
 
 	import std.range : retro;
 	import std.stdio;
 
 	void main() {
 		writeln("привет".retro);
 		writeln("приве́т".retro);
 	}
 
 Expected output:
 	тевирп
 	те́вирп
 
 Actual output:
 	тевирп
 	т́евирп
 
Huh, those two look the same.
The location of the acute accent on the second line is wrong. T -- GEEK = Gatherer of Extremely Enlightening Knowledge
Aug 13
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
Aug 13
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output. - Jonathan M Davis
Aug 13
next sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
wrote:
 On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via 
 Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output. - Jonathan M Davis
We must be seeing different things then. I've taken a screenshot of how the post looks to me: http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png
Aug 13
next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, August 13, 2019 11:43:19 AM MDT Gregor Mückl via Digitalmars-d 
wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis

 wrote:
 On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via

 Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output. - Jonathan M Davis
We must be seeing different things then. I've taken a screenshot of how the post looks to me: http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png
I suspect that some clients are not handling the text correctly (probably due to bugs in their Unicode handling).

If I view this thread on forum.dlang.org in firefox, then the text ends up with the accent on the T in the code, with it being on the B in the expected output and on the e in the actual output. If I view it in chrome, the code has it on the e, the expected output has it on the e, and the actual output has it on the T - which is exactly what happens in my e-mail client.

If I run the program on run.dlang.io in either firefox or chrome, it does the same thing as chrome and my e-mail client do with the forum post, putting the accent on the e in the code and putting it on the T in the output. The same thing happens when I run it locally in my console on FreeBSD.

In no case do I see the accent on the e in the actual output, but it probably wouldn't be hard for a bug in a program's Unicode handling to put it on the e. Unicode is stupidly hard to process correctly, and the correct output of this program isn't something that you would normally see in real text.

- Jonathan M Davis
Aug 13
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 05:43:19PM +0000, Gregor Mückl via Digitalmars-d wrote:
[...]
 We must be seeing different things then. I've taken a screenshot of
 how the post looks to me:
 
 http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png
Did you copy-n-paste the code and run it? If you did, the browser may have done some Unicode processing on the string literal and munged the results. Maybe spelling out the second string literal might help:

	writeln("приве\u0301т".retro);

Basically, the issue here is that "е\u0301" should be processed as a single grapheme, but since it's two separate code points, auto-decoding splits the grapheme, and when .retro is applied, the \u0301 ends up attached to the wrong code point.

This is probably not the best example, since е\u0301 isn't really how Russian is normally written (it could be used in some learner dictionaries to indicate stress, but it's non-standard and most printed material doesn't do that). Perhaps a better example might be Hangul Jamo or Arabic ligatures, but I'm unfamiliar with those languages so I don't know how to come up with a realistic example.

But the point is that according to Unicode, a grapheme consists of a base character followed by zero or more combining diacritics. Auto-decoding treats the base character separately from any combining diacritics, because it iterates over code points rather than graphemes, so when the application is logically dealing with graphemes, you'll get incorrect results. But if you're working only with code points, then auto-decoding works.

The problem is that most of the time, either (1) you're working with "characters" ("visual" characters, i.e. graphemes), or (2) you don't actually care about the string contents but just need to copy / move / erase a substring. For (1), auto-decoding gives the wrong results. For (2), auto-decoding wastes time decoding code units: you could have just used a straight memcpy / memcmp / etc.

Unless you're implementing Unicode algorithms, you rarely need to work with code points directly. And if you're implementing Unicode algorithms, you already know (or should already know) at which level you need to be working (code units, code points, or graphemes), so you hardly need the default iteration to be code points (just write .byCodePoint for clarity). It doesn't make sense to have Phobos iterate over code points *by default* when it's not the common use case, represents a hidden performance hit, and in spite of all that is still not 100% correct anyway.


T

-- 
Век живи - век учись. А дураком помрёшь.
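As an illustration of case (2), exact substring search can stay entirely at the code-unit level, since UTF-8 guarantees that a valid encoded sequence can only match at sequence boundaries; a hedged sketch using std.utf.byCodeUnit:

```d
import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string haystack = "hello, мир";

    // No decoding needed: comparing raw UTF-8 code units is safe for
    // exact substring search.
    assert(haystack.byCodeUnit.canFind("мир".byCodeUnit));
    assert(!haystack.byCodeUnit.canFind("мыр".byCodeUnit));
}
```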
Aug 13
parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 8/13/19 2:11 PM, H. S. Teoh wrote:
 But if you're working only with code points,
 then auto-decoding works.
Albeit much slower than necessary in most cases...
Aug 15
prev sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
wrote:
 [snip]

 It's not on the e in both of them. It's on the e on the second 
 line of the "expected" output, but it's on the T in the second 
 line of the "actual" output.

 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
Aug 13
next sibling parent Dukc <ajieskola gmail.com> writes:
On Tuesday, 13 August 2019 at 18:24:23 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
 wrote:
 [snip]

 It's not on the e in both of them. It's on the e on the second 
 line of the "expected" output, but it's on the T in the second 
 line of the "actual" output.

 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
And for me, both on e at Windows but the bottom one on T at Linux, on the same browser (Firefox)!
Aug 13
prev sibling next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 13 August 2019 at 18:24:23 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
 wrote:
 [snip]

 It's not on the e in both of them. It's on the e on the second 
 line of the "expected" output, but it's on the T in the second 
 line of the "actual" output.

 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
You're not alone, on my firefox on windows 10 pro the accents are both on the e.
Aug 13
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 06:24:23PM +0000, jmh530 via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
 [snip]
 
 It's not on the e in both of them. It's on the e on the second line
 of the "expected" output, but it's on the T in the second line of
 the "actual" output.
 
 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
Probably what Jonathan said about the browser munging the Unicode.

Unicode is notoriously hard to process correctly, and I wouldn't be surprised if the majority of applications out there actually don't handle it correctly in all cases. The whole auto-decoding deal is a prime example of this: even an expert programmer like Andrei fell into the wrong assumption that code point == grapheme. I have no confidence that less capable programmers, who form the majority of today's programmers and write the bulk of the industry's code, are any more likely to get it right. (For years I myself didn't even know there was such a thing as "graphemes".)

In fact, almost every day I see "enterprise" code that commits atrocities against Unicode -- because QA hasn't thought to pass a *real* Unicode string as test input yet. The day the idea occurs to them, a LOT of code (and I mean a LOT) will need to be rewritten, probably from scratch.


T

-- 
"Real programmers can write assembly code in any language. :-)" -- Larry Wall
Aug 13
prev sibling next sibling parent a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 16:51:57 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
an accent in a wchar array can look like: приве'т - accent on the vowel 'е'.. two glyphs combined into one. but in reverse it can become: т'евирп - accent on the consonant 'т'.. which is wrong: an accent can only go on vowels (for RU).

that's Russian, which has no additions to glyphs other than the accent. but other languages do, and one letter can be represented as 2 wchars or just 1.. depends on the editor.. and it looks the same with the same meaning.

OT: about Russian (Cyrillic) letters with additions:
Е and Ё are 2 different letters, both vowels, sometimes interchangeable (when you cannot find Ё on the keyboard you can use Е).
И and Й are 2 different letters, a vowel and a consonant, not interchangeable; they are totally different.
Ё (upper) ё (lower), Й (upper) й (lower) - letters where the addition is part of the letter itself; it cannot be separated.
Е - transcription is "jˈe"
Ё - "jˈɵ"
И - "ˈi"
Й - "j"
Russian has no additions to glyphs except the accent, used for the correct reading of unknown words or in dictionaries like Oxford, Wiki, etc. Many European languages - Latin or Cyrillic - can have additions to letters; idk their meanings.
Aug 13
prev sibling parent dangbinghoo <dangbinghoo gmail.com> writes:
On Tuesday, 13 August 2019 at 16:51:57 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
we can take Chinese chars as an example; it's clear:

```
writeln("汉语&中国🇨🇳".retro);
writeln("汉字🐠中国🇨🇳".retro);
```

expected:
🇨🇳国中&语汉
🇨🇳国中🐠字汉

actual:
🇳🇨国中&语汉
🇳🇨国中🐠字汉

-- 
binghoo dang
Aug 13
prev sibling parent reply matheus <matheus gmail.com> writes:
On Tuesday, 13 August 2019 at 16:29:33 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 ...
 Expected output:
 	тевирп
 	те́вирп

 Actual output:
 	тевирп
 	т́евирп
Huh, those two look the same.
Copy and paste the Expected and Actual output into notepad and you will see the difference, or just take a look at the HTML page source in your browser (search for "Expected output"):

<span class="forum-quote-prefix">&gt; </span>Expected output:
<span class="forum-quote-prefix">&gt; </span>	тевирп
<span class="forum-quote-prefix">&gt; </span>	те́вирп
<span class="forum-quote-prefix">&gt;</span>
<span class="forum-quote-prefix">&gt; </span>Actual output:
<span class="forum-quote-prefix">&gt; </span>	тевирп
<span class="forum-quote-prefix">&gt; </span>	т́евирп

For me it shows the difference pretty clearly.

Matheus.
Aug 13
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
 [snip]

 Copy and paste Expected and Actual output on notepad and you 
 will see the difference, or just take a look at the HTML page 
 source on your browser (Search for Expected Output):

 <span class="forum-quote-prefix">&gt; </span>Expected output:
 <span class="forum-quote-prefix">&gt; </span>	тевирп
 <span class="forum-quote-prefix">&gt; </span>	те́вирп
 <span class="forum-quote-prefix">&gt;</span>
 <span class="forum-quote-prefix">&gt; </span>Actual output:
 <span class="forum-quote-prefix">&gt; </span>	тевирп
 <span class="forum-quote-prefix">&gt; </span>	т́евирп

 For me it shows the difference pretty clear.

 Matheus.
Interestingly enough, what you have there does not look any different. However, if I actually do what you say and post it to notepad or something, then it does look different.
Aug 13
prev sibling parent reply matheus <matheus gmail.com> writes:
On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
 ...
Like others said, you may not be able to see it in the browser, because the renderer may "fix" this.

Here's how it looks through the HTML code inspector:

https://i.imgur.com/e57wCZp.png

Notice the position of the '´' character.

Matheus.
Aug 13
parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 8/13/19 3:17 PM, matheus wrote:
 On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
 ...
Like others said you may not be able to see through the Browser, because the render may "fix" this.
Jesus, haven't browser devs learned *ANYTHING* from their very own, INFAMOUS, "Let's completely fuck up 'the reliability principle'" debacle? I guess not. Cult of the amateurs wins out again...
Aug 15
prev sibling parent reply Argolis <argolis gmail.com> writes:
On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:

 But we can't make that the default because it's a big 
 performance hit, and many string algorithms don't actually need 
 grapheme segmentation.
Can you provide examples of algorithms and use cases that don't need grapheme segmentation? Are they really SO common that the correct default is to go for code points? Wouldn't it be better to have grapheme segmentation, the correct way of handling a string, as the default instead?
Aug 14
next sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Wednesday, 14 August 2019 at 07:15:54 UTC, Argolis wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:

 But we can't make that the default because it's a big 
 performance hit, and many string algorithms don't actually 
 need grapheme segmentation.
Can you provide example of algorithms and use cases that don't need grapheme segmentation? Are they really SO common that the correct default is go for code points? Is it not better to have as a default the grapheme segmentation, the correct way of handling a string, instead?
There is no single universally correct way to segment a string. Grapheme segmentation requires a correct assumption about the text encoding of the string, and also the assumption that the encoding is flawless. Neither may be guaranteed in general. There are a lot of ways to corrupt UTF-8 strings, for example.

And then there is the question of the length of a grapheme: IIRC they can consist of up to 6 or 7 code points, with each of them encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. So what data type do you use for representing graphemes that is both not wasteful and doesn't require dynamic memory management?

Then there are other nasty quirks around graphemes: their encoding is not unique. This Unicode TR gives a good impression of how complex this single aspect is: https://unicode.org/reports/tr15/

So if you want to use graphemes, do you want to keep the original encoding or do you implicitly convert them to NFC or NFD? NFC tends to be better for language processing, NFD tends to be better for text rendering (with exceptions). If you don't normalize, semantically equivalent graphemes may not compare equal.

At this point you're probably approaching the complexity of libraries like ICU. You can take a look at it if you want a good scare. ;)
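The normalization point can be made concrete with std.uni.normalize; a small sketch (NFC composes, NFD decomposes):

```d
import std.uni : NFC, NFD, normalize;

void main()
{
    string precomposed = "\u00E9";  // 'é' as a single code point
    string decomposed  = "e\u0301"; // 'e' + combining acute accent

    // Semantically the same character, but bytewise different:
    assert(precomposed != decomposed);

    // Equal again once normalized to a common form:
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);
}
```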
Aug 14
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Aug 14, 2019 at 09:29:30AM +0000, Gregor Mückl via Digitalmars-d wrote:
[...]
 At this point you're probably approaching the complexity of libraries
 like ICU. You can take a look at it if you want a good scare. ;)
Or, instead of homebrewing your own string-handling algorithms and probably getting it all wrong, actually *use* ICU to handle Unicode strings for you instead. Saves you from writing more code, and from unintentional bugs. T -- Truth, Sir, is a cow which will give [skeptics] no more milk, and so they are gone to milk the bull. -- Sam. Johnson
Aug 14
prev sibling parent Argolis <argolis gmail.com> writes:
On Wednesday, 14 August 2019 at 09:29:30 UTC, Gregor Mückl wrote:

 There is no single universally correct way to segment a string. 
 Grapheme segmentation requires a correct assumption of the text 
 encoding in the string and also the assumption that the 
 encoding is flawless. Neither may be guaranteed in general. 
 There is a lot of ways to corrupt UTF-8 strings, for example.
Do you mean that there's no way to verify those assumptions? Sorting algorithms in Phobos return a SortedRange, after all.
 And then there is a question of the length of a grapheme: IIRC 
 they can consist of up to 6 or 7 code points with each of them 
 encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. 
 So what data type do you use for representing graphemes then 
 that is both not wasteful and doesn't require dynamic memory 
 management?
Is performance the rationale for not using dynamic memory management, if it is unavoidable for correct behaviour?
 Then there are other nasty quirks around graphemes: their 
 encoding is not unique. This Unicode TR gives a good impression 
 of how complex this single aspect is: 
 https://unicode.org/reports/tr15/
 So if you want to use graphemes, do you want to keep the 
 original encoding or do you implicitly convert them to NFC or 
 NFD? NFC tends to be better for language processing, NFD tends 
 to be better for text rendering (with exceptions). If you don't 
 normalize, semantically equivalent graphemes may not be equal 
 under comparison.
Is performance the rationale for not using normalisation, which would solve all the problems you have mentioned above?
 At this point you're probably approaching the complexity of 
 libraries like ICU. You can take a look at it if you want a 
 good scare. ;)
The original question is still not answered: can you provide examples of algorithms and use cases that don't need grapheme segmentation?
Aug 15
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Aug 14, 2019 at 07:15:54AM +0000, Argolis via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 
 But we can't make that the default because it's a big performance
 hit, and many string algorithms don't actually need grapheme
 segmentation.
Can you provide example of algorithms and use cases that don't need grapheme segmentation?
Most cases of string processing involve:

- Taking substrings: does not need grapheme segmentation; you just slice the string.

- Copying one string to another: does not need grapheme segmentation, you just use memcpy (or equivalent).

- Concatenating n strings: does not need grapheme segmentation, you just use memcpy (or equivalent). In D, you just use array append, or std.array.appender if you get fancy.

- Comparing one string to another: does not need grapheme segmentation; you either use strcmp/memcmp, or if you need more delicate semantics, call one of the standard Unicode string collation algorithms (std.uni, meaning your code does not need to worry about grapheme segmentation, and besides, Unicode collation algorithms operate at the code point level, not at the grapheme level).

- Matching a substring: does not need grapheme segmentation; most applications just need subarray matching, i.e., treat the substring as an opaque blob of bytes, and match it against the target. If you need more delicate semantics, there are standard Unicode algorithms for substring matching (i.e., user code does not need to worry about the low-level details -- the inputs are basically opaque Unicode strings whose internal structure is unimportant).

You really only need grapheme segmentation when:

- Implementing a text layout algorithm where you need to render glyphs to some canvas. Usually, this is already taken care of by the GUI framework or the terminal emulator, so user code rarely has to worry about this.

- Measuring the size of some piece of text for output alignment purposes: in this case, grapheme segmentation isn't enough; you need font size information and other such details (like kerning, spacing parameters, etc.). Usually, you wouldn't write this yourself, but use a text rendering library. So most user code doesn't actually have to worry about this.
(Note that iterating by graphemes does NOT give you the correct value for width even with a fixed-width font in a text mode terminal emulator, because there are such things as double-width characters in Unicode, which occupy two cells each. And also zero-width characters, which count as distinct (empty) graphemes but occupy no space.)

And as an appendix, the way most string processing code is done in C/C++ (iterate over characters) is actually wrong w.r.t. Unicode, because it's really only reliable for ASCII inputs. For "real" Unicode strings, you can't really get away with the "character by character" approach, even if you use grapheme segmentation: in some writing systems like Arabic, breaking up a string like this can cause incorrect behaviour like breaking ligatures, which may not be intended. For this sort of operation the application really needs to be using the standard Unicode algorithms, which depend on the *purpose* of the function, not the mechanics of iterating over characters, e.g., find suitable line breaks, find suitable hyphenation points, etc. There's a reason the Unicode Consortium defines standard algorithms for these operations: it's because naïvely iterating over graphemes, in general, does *not* yield the correct results in all cases.

Ultimately, the whole point behind removing autodecoding is to put the onus on the user code to decide what kind of iteration it wants: code units, code points, or graphemes. (Or just use one of the standard algorithms and don't reinvent the square wheel.)
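The three levels of iteration can be seen side by side in a minimal sketch (the example string is illustrative; it assumes Phobos's std.utf.byCodeUnit and std.uni.byGrapheme):

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit;
import std.range : walkLength;

void main()
{
    // "ö" written as 'o' + U+0308 (combining diaeresis)
    string s = "o\u0308";

    assert(s.byCodeUnit.walkLength == 3);  // UTF-8 code units (bytes)
    assert(s.walkLength == 2);             // code points (autodecoded today)
    assert(s.byGrapheme.walkLength == 1);  // user-perceived characters
}
```

All three counts are "the length of the string"; which one is correct depends entirely on what the caller is doing, which is exactly why no single level should be silently imposed as a default.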
 Are they really SO common that the correct default is go for code
 points?
The whole point behind removing autodecoding is so that we do NOT default to code points, which is currently the default. We want to put the choice in the user's hand, not silently default to iteration by code point under the illusion of correctness, which is actually incorrect for non-trivial inputs.
 Is it not better to have as a default the grapheme segmentation, the
 correct way of handling a string, instead?
Grapheme segmentation is very complex, and therefore very slow. Most string processing doesn't actually need grapheme segmentation. Setting that as the default would mean D string processing would be excruciatingly slow by default, and furthermore all that extra work would be mostly for nothing, because most of the time we don't need it anyway.

Not to repeat that most naïve iterations over graphemes actually do *not* yield what one might think is the correct result. For example, measuring the size of a piece of text in a fixed-width font in a text-mode terminal by counting graphemes is actually wrong, due to double-width and zero-width characters. T -- The most powerful one-line C program: #include "/dev/tty" -- IOCCC
Aug 14
parent reply Argolis <argolis gmail.com> writes:
On Wednesday, 14 August 2019 at 17:12:00 UTC, H. S. Teoh wrote:

 - Taking substrings: does not need grapheme segmentation; you 
 just slice the string.
What is the use case of slicing some multi-codeunit encoded grapheme in the middle?
 - Copying one string to another: does not need grapheme 
 segmentation, - you just use memcpy (or equivalent).
 - Concatenating n strings: does not need grapheme segmentation, 
 you just use memcpy (or equivalent).  In D, you just use array 
 append,  or  std.array.appender if you get fancy.
That use case is not string processing, but general memory handling of an opaque type
 - Comparing one string to another: does not need grapheme  
 segmentation;
   you either use strcmp/memcmp
That use case is not string processing, but general memory comparison of an opaque type
, or if you need more delicate semantics,
 call one of the standard Unicode string collation algorithms 
 (std.uni, meaning, your code does not need to worry about 
 grapheme segmentation, and besides, Unicode collation 
 algorithms operate at the code point  level, not at the 
 grapheme level).
So this use case needs proper handling of encoded code units, and can't be satisfied simply by removing auto decoding.
 - Matching a substring: does not need grapheme segmentation;  
 most
   applications just need subarray matching, i.e., treat the  
 substring as
   an opaque blob of bytes, and match it against the target.
That use case is not string processing, but general memory comparison of an opaque type
 If  you need more delicate semantics, there are standard 
 Unicode  algorithms for
 substring matching (i.e., user code does not need to worry 
 about the low-level details -- the inputs are basically opaque 
 Unicode strings whose internal structure is unimportant).
Again, removing auto decoding does not change anything for that.
 You really only need grapheme segmentation when:
 - Implementing a text layout algorithm where you need to render 
 glyphs
 to some canvas.
 - Measuring the size of some piece of text for output alignment
   purposes: in this case, grapheme segmentation isn't enough; 
 you need font size information and other such details (like 
 kerning, spacing parameters, etc.).
What about all the examples above in the thread of the wrong way auto decoding works right now? Retro, correct substring slicing, correct indexing, et cetera.
 Ultimately, the whole point behind removing autodecoding is to 
 put the onus on the user code to decide what kind of iteration 
 it wants: code units, code points, or graphemes. (Or just use 
 one of the standard algorithms and don't reinvent the square 
 wheel.)
There will be always a default way to iterate, see below
 Are they really SO common that the correct default is go for 
 code points?
The whole point behind removing autodecoding is so that we do NOT default to code points, which is currently the default. We want to put the choice in the user's hand, not silently default to iteration by code point under the illusion of correctness, which is actually incorrect for non-trivial inputs.
The illusion of correctness should be turned into correctness, then.
 Is it not better to have as a default the grapheme 
 segmentation, the correct way of handling a string, instead?
Grapheme segmentation is very complex, and therefore, very slow. Most string processing doesn't actually need grapheme segmentation.
Can you provide an example of string processing that doesn't need grapheme segmentation? The examples listed above are not string processing examples.
 Setting that as the default would mean D string processing will 
 be excruciatingly slow by default, and furthermore all that 
 extra work will be mostly for nothing because most of the time 
 we don't need it anyway.
From the examples above, most of the time you simply need opaque memory management, so decaying the string/dstring/wstring to a binary blob; but that's not string processing.

My (refined) point still stands: can you provide examples of (text processing) algorithms and use cases that don't need grapheme segmentation?
Aug 15
next sibling parent nkm1 <t4nk074 openmailbox.org> writes:
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:

 My (refined) point still stands: can you provide example of 
 (text processing) algorithms and use cases that don't need 
 grapheme segmentation?
Parsing XML, HTML and other such things is what people usually have in mind. In general, all sorts of text where human-readable parts are interleaved with (easier to handle) machine instructions.
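For instance, scanning for an ASCII delimiter in a UTF-8 document needs neither graphemes nor code points, because UTF-8 never reuses ASCII byte values inside a multi-byte sequence. A sketch of that idea (the document string is illustrative; it assumes Phobos's byCodeUnit and countUntil):

```d
import std.algorithm.searching : countUntil;
import std.utf : byCodeUnit;

void main()
{
    string xml = "<p>Привет, Мир!</p>";

    // Raw code-unit search is safe here: the bytes of '<', '/' and 'p'
    // can never occur inside a multi-byte UTF-8 sequence, so no
    // decoding is needed to locate the closing tag.
    auto idx = xml.byCodeUnit.countUntil("</p>".byCodeUnit);
    assert(xml[idx .. $] == "</p>");
}
```

The human-readable payload between the tags is carried along as an opaque, correctly-encoded blob; only the machine-readable ASCII structure is ever inspected.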
Aug 15
prev sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
 From the examples above, most of the time you simply need 
 opaque memory management, so decaying the 
 string/dstring/wstring to a binary blob, but that's not string 
 processing
This is the point we're trying to get across to you: this isn't sufficient. Depending on the context and the script/language, you need access to the string at various levels. E.g. a font renderer sometimes needs to iterate code points, not graphemes, in order to compose the correct glyphs.

Binary blob comparisons are *also* not sufficient for comparing strings, again depending on both the script/language of the text in the string and the context in which the comparison is performed. If the comparison is to be purely semantic, the following strings should be equal: "\u00f6" and "\u006f\u0308". They both represent the same "Latin Small Letter O with Diaeresis". Their in-memory representations clearly aren't equal, so a memcmp won't yield the correct result. The same applies to sorting.

If you decide to force a specific string normalization internally, you put the burden on the user to explicitly select a different normalization when they require it. Plus, there is no way to perfectly reconstruct the input binary representation of a string, e.g. when it was given in a non-normalized form (e.g. a mix of NFC and NFD). Once such a string is through a normalization algorithm, the exact input is unrecoverable. This makes interfacing with other code that has idiosyncrasies around all of this hard to impossible. One such system that I worked on in the past was a small embedded, microcontroller-driven HCI module with very limited capabilities, but with the requirement to be multilingual. I carefully worked out that for the languages that were required, a UTF-8 encoding with a very specific normalization would just about work. This choice was viable because the user interface was created in a custom tool where I could control the code and data generation just enough to make it work.

Another case where normalization is troublesome is ligatures. Ligatures that are purely stylistic like "ff", "ffi", "fft", "st", "ct" etc. have their own code points. Yet it is a purely stylistic choice whether to use them. So in terms of the contained text, the ligature \ufb00 is equal to the string "ff", but it is not the same grapheme. Whether you can normalize this depends on the context. The user may have selected the ligature representation deliberately to have it appear as such on screen. If you want to do spell checking, on the other hand, you would need to resolve the ligature to its individual letters.

And then there is Hangul: this is a prime example of a writing system that is "weird" to westerners. It is based on 40 symbols (19 consonants, 21 vowels) which aren't written individually, but merged syllable by syllable into rectangular blocks of two or three such symbols. These symbols get arranged in different layouts depending on which symbols there are in a syllable. As far as I understand, this follows a clear algorithm. This results in approximately 6500 individual graphemes that are actually written. Yet each of these is a group of two or three letters and parsed as such. So depending on whether you're interested in individual letters or syllables, you need to use a different string representation for processing that language.

OK, these are all just examples that come to my mind while brainstorming the question a little bit. However, none of us are experts in language processing, so whatever examples we can come up with are very likely just the very tip of the iceberg. There is a reason why libraries like ICU give the user a lot of control over string handling and expose a lot of variants of functions depending on the user intent and context. This design rests on a lot of expert knowledge that we don't have, but we know that it is sound. Going against that wisdom is inviting trouble. Autodecoding is an example of doing just that.
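The Hangul case can be illustrated with a short D sketch (the example syllable is mine; it assumes Phobos's std.uni normalize and byGrapheme, which follows the UAX #29 grapheme cluster rules):

```d
import std.uni : byGrapheme, normalize, NFD;
import std.range : walkLength;

void main()
{
    string syllable = "\uD55C";            // 한 as one precomposed code point
    string jamo = normalize!NFD(syllable); // decomposed into leading
                                           // consonant, vowel and trailing
                                           // consonant (3 code points)

    assert(jamo.walkLength == 3);            // three letters...
    assert(jamo.byGrapheme.walkLength == 1); // ...one written syllable block
}
```

So "how long is this string" already has two defensible answers for a single syllable, depending on whether a spell checker wants letters or a renderer wants blocks.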
Aug 15
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
In my not-so-humble opinion, the introduction of "normalization" to Unicode was 
a huge mistake. It's not necessary and causes nothing but grief. They should 
have consulted with me first :-)
Aug 15
next sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Thursday, 15 August 2019 at 19:11:14 UTC, Walter Bright wrote:
 In my not-so-humble opinion, the introduction of 
 "normalization" to Unicode was a huge mistake. It's not 
 necessary and causes nothing but grief. They should have 
 consulted with me first :-)
I am not sure that you can go entirely without normalization for all languages in existence. But Unicode conflates semantic representation and rendering in ways that are effectively layering violations. The LTR and RTL control characters are nice examples of that. Why should a Unicode string be able to specify the displayed direction of the script? The same goes for the stylistic ligatures I pointed out. These should be handled exclusively by the font rendering subsystem. There's a substitution table in OpenType for that, FFS! Well, I guess that Unicode is the best we have despite all this maddening cruft. Attempting to do better would just result in text encoding "standard" N+1. And we know how much the world needs that. ;)
Aug 15
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 12:38 PM, Gregor Mückl wrote:
 I am not sure that you can go entirely without normalization for all languages 
 in existence. But Unicode conflates semantic representation and rendering in 
 ways that are effectively layering violations. The LTR and RTL control 
 characters are nice examples of that. Why should a Unicode string be able to 
 specify the displayed direction of the script? The same goes for the stylistic 
 ligatures I pointed out. These should be handled exclusively by the font 
 rendering subsystem. There's a substitution table in OpenType for that, FFS!
Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with the Unicode<=>print round-trip not losing information. Naturally, people have already used such to trick people, track people, etc. Another thing I hate about Unicode is there are articles about how people can get their vanity symbol into Unicode! And they do. They invent glyphs, and get them in. This goes on all the time. Unicode started out as a cool idea, and turned rather quickly into a cesspool.
Aug 15
parent reply ag0aep6g <anonymous example.com> writes:
On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is invisible 
 to the rendering. It should have stuck with the Unicode<=>print 
 round-trip not losing information.
 
 Naturally, people have already used such to trick people, track people, 
 etc.
'I' and 'l' are (virtually) identical in many fonts.
Aug 15
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 11:38:08PM +0200, ag0aep6g via Digitalmars-d wrote:
 On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is
 invisible to the rendering. It should have stuck with the
 Unicode<=>print round-trip not losing information.
 
 Naturally, people have already used such to trick people, track
 people, etc.
'I' and 'l' are (virtually) identical in many fonts.
And 0 and O are also identical in many fonts. But none of us would seriously entertain the idea that O and 0 ought to be the same character. T -- Indifference will certainly be the downfall of mankind, but who cares? -- Miquel van Smoorenburg
Aug 15
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 2:38 PM, ag0aep6g wrote:
 On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is invisible to the 
 rendering. It should have stuck with the Unicode<=>print round-trip not losing 
 information.

 Naturally, people have already used such to trick people, track people, etc.
'I' and 'l' are (virtually) identical in many fonts.
That's a problem with some fonts, not the concept. When such fonts are used, the distinguishment comes from the context, not the symbol itself. On the other hand, the Unicode spec itself routinely shows identical glyphs for different code points.

Consider also: (800)555-1212. You know it's a phone number because of the context. The digits used in it are NOT actually numbers; they do not have any mathematical properties. Should Unicode have separate code points for these?

The point is, the meaning of the symbol comes from its context, not the symbol itself. This is the fundamental error Unicode made.
Aug 15
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 03:21:32PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 2:38 PM, ag0aep6g wrote:
 On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is
 invisible to the rendering. It should have stuck with the
 Unicode<=>print round-trip not losing information.
 
 Naturally, people have already used such to trick people, track
 people, etc.
'I' and 'l' are (virtually) identical in many fonts.
That's a problem with some fonts, not the concept. When such fonts are used, the distinguishment comes from the context, not the symbol itself.
[...] And there you go: you're basically saying that "symbol" is different from "glyph", and therefore, you're contradicting your own axiom that character == glyph. "Symbol" is basically an abstract notion of a character that exists *apart from the glyph used to render it*. And now that you agree that character encoding should be based on "symbol" rather than "glyph", the next step is the realization that, in the wide world of international languages out there, there exist multiple "symbols" that are rendered with the *same* glyph. This is a hard fact of reality, and no matter how you wish it to be otherwise, it simply ain't so. Your ideal of "character == glyph" simply doesn't work in real life. T -- There's light at the end of the tunnel. It's the oncoming train.
Aug 15
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 3:56 PM, H. S. Teoh wrote:
 And now that you agree that character encoding should be based on
 "symbol" rather than "glyph", the next step is the realization that, in
 the wide world of international languages out there, there exist
 multiple "symbols" that are rendered with the *same* glyph.  This is a
 hard fact of reality, and no matter how you wish it to be otherwise, it
 simply ain't so.  Your ideal of "character == glyph" simply doesn't
 work in real life.
Splitting semantic hares is pointless, as the fact remains it worked just fine in real life before Unicode, it's called "printing" on paper. As for not working in real life, that's Unicode.
Aug 15
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 06:28:30 UTC, Walter Bright wrote:
 On 8/15/2019 3:56 PM, H. S. Teoh wrote:
 And now that you agree that character encoding should be based 
 on
 "symbol" rather than "glyph", the next step is the realization 
 that, in
 the wide world of international languages out there, there 
 exist
 multiple "symbols" that are rendered with the *same* glyph.  
 This is a
 hard fact of reality, and no matter how you wish it to be 
 otherwise, it
 simply ain't so.  Your ideal of "character == glyph" simply 
 doesn't
 work in real life.
Splitting semantic hares is pointless, as the fact remains it worked just fine in real life before Unicode, it's called "printing" on paper.
Sorry, no, it didn't work in reality before Unicode. Multi-language systems were a mess. My job is on the biggest translation memory in the world, the Euramis system of the European Union, and when I started there in 2002, the system supported only 11 languages. The data in the Oracle database was already in Unicode, but the whole supporting translation chain was codepage based. It was a catastrophe, and the amount of crap, especially in Greek data, was staggering. The issues H. S. Teoh described above were indeed a real pain point. In Greek text it was very frequent to have Latin characters mixed in with Greek characters from codepage 1253. Was the A an alpha or a \x41? This crap made a lot of the algorithms used downstream from the database (CAT tools, automatic translation etc.) go completely bonkers.

For the 2004 extension of the EU we had to support one more alphabet (Cyrillic for Bulgarian) and 4 more codepages (CP-1250 Latin-2 Extended-A, CP-1251 Cyrillic, CP-1257 Baltic and ISO-8859-3 Maltese). It would have been such a mess that we decided to convert everything to Unicode. We don't have that crap data anymore. Our code is not perfect, far from it, but adopting Unicode through and through and dropping all support for the old codepage crap simplified our lives tremendously. When we got the request in 2010 from the EEAS (European External Action Service) to also support languages other than the 24 official EU languages, i.e. Russian, Arabic and Chinese, we didn't break a sweat to implement it, thanks to Unicode.
 As for not working in real life, that's Unicode.
Unicode works much, much better than anything that existed before. The issue is that not a lot of people work in a multi-language environment, so they don't have a clue of the unholy mess it was before.
Aug 16
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 2:20 AM, Patrick Schluter wrote:
 Sorry, no it didn't work in reality before Unicode. Multi language system were
a 
 mess.
I have several older books that move facilely between multiple languages. It's not a mess. Since the reader can figure all this out without invisible semantic information in the glyphs, that invisible information is not necessary. Once you print/display the Unicode string, all that semantic information is gone. It is not needed.
 Unicode works much, much better than anything that existed before. The issue
is 
 that not a lot of people work in a multi-language environment and don't have a 
 clue of the unholy mess it was before.
Actually, I do. Zortech C++ supported multiple code pages, multiple multibyte encodings, and had error messages in 4 languages. Unicode, in its original vision, solved those problems.
Aug 16
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 09:34:21 UTC, Walter Bright wrote:
 On 8/16/2019 2:20 AM, Patrick Schluter wrote:
 Sorry, no it didn't work in reality before Unicode. Multi 
 language system were a mess.
I have several older books that move facilely between multiple languages. It's not a mess. Since the reader can figure all this out without invisible semantic information in the glyphs, that invisible information is not necessary.
Unicode's purpose is not limited to the output at the end of the processing chain. It's the whole processing chain that is the point.
 Once you print/display the Unicode string, all that semantic 
 information is gone. It is not needed.
As said, printing is only a minor part of language processing. To give an example from the EU again, just to illustrate: we have exactly three laser printers (one is a photocopier) on each floor of our offices. You may say: oh, you're the IT guys, you don't need to print that much. To which I respond: half of the floor is populated by the English translation unit, and while they indeed use the printers more than us, printing is not a significant part of their workflow.
 Unicode works much, much better than anything that existed 
 before. The issue is that not a lot of people work in a 
 multi-language environment and don't have a clue of the unholy 
 mess it was before.
Actually, I do. Zortech C++ supported multiple code pages, multiple multibyte encodings, and had error messages in 4 languages.
Each string was in its own language. We have to deal with texts that are mixed languages. Sentences in Bulgarian with an office address in Greece, embedded in a xml file. Codepages don't work in that case, or you have to introduce an escaping scheme much more brittle and annoying than utf-8 or utf-16 encoding. European Parliament's session logs are what is called panaché documents, i.e. the transcripts are in native language of intervening MEP's. So completely mixed documents.
 Unicode, in its original vision, solved those problems.
Unicode is not perfect, and indeed the crap with emoji is crap, but Unicode is better than what was used before. And to insist again, Unicode is mostly about "DATA PROCESSING". Sometimes it may produce a human-readable result, but that is only one part of its purpose.
Aug 16
next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, August 16, 2019 4:32:06 AM MDT Patrick Schluter via Digitalmars-d 
wrote:
 Unicode, in its original vision, solved those problems.
Unicode is not perfect and indeed the crap with emoji is crap, but Unicode is better than what was used before. And to insist again, Unicode is mostly about "DATA PROCESSING". Sometime it might result to a human readable result, but that is only one part of its purpose.
I don't think that anyone is arguing that Unicode is worse than what we had before. The problem is that there are aspects of Unicode that are screwed up, making it far worse to deal with than it should be. We'd be way better off if those mistakes had not been made. So, we're better off than we were but also definitely worse off than we should be. - Jonathan M Davis
Aug 16
prev sibling next sibling parent reply Abdulhaq <alynch4047 gmail.com> writes:
On Friday, 16 August 2019 at 10:32:06 UTC, Patrick Schluter wrote:
 On Friday, 16 August 2019 at 09:34:21 UTC, Walter Bright wrote:
 [...]
Unicode's purpose is not limited to the output at the end the processing chain. It's the whole processing chain that is the point.
 [...]
As said, printing is only a minor part of language processing. To give an example from the EU again, and just to illustrate, we have exactly three laser printer (one is a photocopier) on each floor of our offices. You may say; o you're the IT guys, you don't need to print that much, to which I respond, half of the floor is populated with the english translation unit and while they indeed use the printers more than us, it is not a significant part of their workflow.
 [...]
Each string was in its own language. We have to deal with texts that are mixed languages. Sentences in Bulgarian with an office address in Greece, embedded in a xml file. Codepages don't work in that case, or you have to introduce an escaping scheme much more brittle and annoying than utf-8 or utf-16 encoding. European Parliament's session logs are what is called panaché documents, i.e. the transcripts are in native language of intervening MEP's. So completely mixed documents.
 [...]
Unicode is not perfect and indeed the crap with emoji is crap, but Unicode is better than what was used before. And to insist again, Unicode is mostly about "DATA PROCESSING". Sometime it might result to a human readable result, but that is only one part of its purpose.
These are great examples and I totally agree with you (and HS Teoh). It's no coincidence that those people who can read, write and speak more than one language with more than one script are those who think Unicode is beneficial. It seems that those who are stuck in the world of anglo/latin characters just don't have the experience required to understand why their simpler schemes won't work.
Aug 16
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 04:41:01PM +0000, Abdulhaq via Digitalmars-d wrote:
[...]
 It's no coincidence that those people who can read, write and speak
 more than one language with more than one script are those who think
 Unicode is beneficial.
To be clear, there are aspects of Unicode that I don't agree with. But what Walter is proposing (1 glyph == 1 character) simply does not work. It fails to handle the inherent complexities of working with multi-lingual strings.
 It seems that those who are stuck in the world of anglo/latin
 characters just don't have the experience required to understand why
 their simpler schemes won't work.
Walter claims to have experience working with code translated into 4 languages. I suspect (Walter, please correct me if I'm wrong) that it mostly just involved selecting a language at the beginning of the program, and substituting strings with translations into said language during output. If this is the case, his stance of 1 glyph == 1 character makes sense, because that's all that's needed to support this limited functionality.

Where this scheme falls down is when you need to perform automatic processing of multi-lingual strings -- an unavoidable inevitability in this day and age of global communications. It makes no sense for a single letter to have two different encodings just because your user decided to use a different font, but that's exactly what Walter is proposing -- I wonder if he realizes that.

T

-- 
Written on the window of a clothing store: No shirt, no shoes, no service.
Aug 16
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 3:32 AM, Patrick Schluter wrote:
 Unicode is not perfect and indeed the crap with emoji is crap, but Unicode is 
 better than what was used before.
I'm not arguing otherwise.
 And to insist again, Unicode is mostly about "DATA PROCESSING". Sometime it 
 might result to a human readable result, but that is only one part of its
purpose.
And that's mission creep, which came later and should not have occurred. With such mission creep, there will be no end of intractable problems. People assign new semantic meanings to characters all the time. Trying to embed that into Unicode is beyond impractical.

To repeat an example:

    a + b = c

Why not have special Unicode code points for when letters are used as mathematical symbols?

    18004775555

Maybe some special Unicode code points for phone numbers? How about Social Security digits? Credit card digits?
Aug 16
parent reply lithium iodate <whatdoiknow doesntexist.net> writes:
On Friday, 16 August 2019 at 20:14:33 UTC, Walter Bright wrote:
 To repeat an example:

     a + b = c

 Why not have special Unicode code points for when letters are 
 used as mathematical symbols?
Uhm, well, the Unicode block "Mathematical Alphanumeric Symbols" already exists and is basically that.
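And it is easy to verify that the block exists; here is a quick Python check (Python's standard `unicodedata` module, used purely for illustration):

```python
import unicodedata

math_a = "\U0001D44E"  # from the Mathematical Alphanumeric Symbols block
plain_a = "a"          # ordinary Latin letter

print(unicodedata.name(math_a))   # MATHEMATICAL ITALIC SMALL A
print(unicodedata.name(plain_a))  # LATIN SMALL LETTER A
print(math_a == plain_a)          # False: same letter shape, distinct code points
```

So "a used as a mathematical symbol" already has its own code point, distinct from plain 'a'.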
Aug 16
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 1:26 PM, lithium iodate wrote:
 On Friday, 16 August 2019 at 20:14:33 UTC, Walter Bright wrote:
 To repeat an example:

     a + b = c

 Why not have special Unicode code points for when letters are used as 
 mathematical symbols?
Uhm, well, the Unicode block "Mathematical Alphanumeric Symbols" already exists and is basically that.
ye gawds: https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols I see they forgot the phone number code points.
Aug 16
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 02:34:21AM -0700, Walter Bright via Digitalmars-d wrote:
[...]
 Once you print/display the Unicode string, all that semantic
 information is gone. It is not needed.
[...]

So in other words, we should encode 1, I, |, and l with exactly the same value, because in print, they aII look about the same anyway, and the user is well able to figure out from context which one is meant. After a11, once you print the string the semantic distinction is gone anyway, and human beings are very good at te||ing what was actually intended in spite of the ambiguity.

Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite you with a context-sensitive algorithm that figures out whether we meant 11, ||, II, or ll in our source code encoded in Walter Encoding.

T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Aug 16
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 10:52 AM, H. S. Teoh wrote:
 So in other words, we should encode 1, I, |, and l with exactly the same
 value, because in print, they aII look about the same anyway, and the
 user is well able to figure out from context which one is meant. After
 a11, once you print the string the semantic distinction is gone anyway,
 and human beings are very good at te||ing what was actually intended in
 spite of the ambiguity.
 
 Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite
 you with a context-sensitive algorithm that figures out whether we meant
 11, ||, II, or ll in our source code encoded in Walter Encoding.
Fonts people use for programming take pains to distinguish them.
Aug 16
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 01:18:54PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/16/2019 10:52 AM, H. S. Teoh wrote:
 So in other words, we should encode 1, I, |, and l with exactly the
 same value, because in print, they aII look about the same anyway,
 and the user is well able to figure out from context which one is
 meant. After a11, once you print the string the semantic distinction
 is gone anyway, and human beings are very good at te||ing what was
 actually intended in spite of the ambiguity.
 
 Bye-bye unambiguous D lexer, we hardly knew you; now we need to
 rewrite you with a context-sensitive algorithm that figures out
 whether we meant 11, ||, II, or ll in our source code encoded in
 Walter Encoding.
Fonts people use for programming take pains to distinguish them.
So you're saying that what constitutes a "character" should be determined by fonts?? T -- Programming is not just an act of telling a computer what to do: it is also an act of telling other programmers what you wished the computer to do. Both are important, and the latter deserves care. -- Andrew Morton
Aug 16
prev sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, August 15, 2019 1:11:14 PM MDT Walter Bright via Digitalmars-d 
wrote:
 In my not-so-humble opinion, the introduction of "normalization" to
 Unicode was a huge mistake. It's not necessary and causes nothing but
 grief. They should have consulted with me first :-)
IMHO, the fact that Unicode normalization is a thing is one of those things that proves that Unicode is unnecessarily complex. There should only be a single way to represent a given character. Unfortunately, that's definitely not the way they went, and we suffer that much more because of it. Honestly, I question that very many applications exist which actually handle Unicode fully correctly. Its level of complexity is way past the point that the average programmer has much chance of getting it right. - Jonathan M Davis
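The multiple-representations problem is easy to demonstrate in a few lines; here is a minimal Python sketch (Python's `unicodedata` chosen just for brevity, the same applies to any normalization-aware library):

```python
import unicodedata

precomposed = "\u00e9"  # 'é' as a single code point (NFC form)
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

# Logically the same character, but bitwise different strings
print(precomposed == decomposed)  # False

# Normalization maps both spellings onto one canonical form
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Any code that compares strings without first normalizing them gets the wrong answer for inputs like these.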
Aug 15
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a
 single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point. After all, when we write:

    a + b = c

we don't use a separate code point for the letters. Also,

    a) one item
    b) another item

we don't use a separate code point, either. I've debated this point with Unicode people, and their arguments for separate glyphs fall to pieces when I point this out.
Aug 15
next sibling parent reply a11e99z <black80 bk.ru> writes:
On Thursday, 15 August 2019 at 19:59:34 UTC, Walter Bright wrote:
 On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a
 single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point.
if it was not sarcasm: different code points can refer to the same glyph, not vice versa: A(EN,\u0041), A(RU,\u0410), A(EL,\u0391), else sorting for non-English will not work, and even ordering (A<B) will be wrong. For example, the RU glyphs ABCEHKMOPTXacepuxy correspond by sound or meaning to the English letters AVSENKMORTHaserihu. As you can see, the uppers and lowers don't even exist as pairs, and have different meanings.
Aug 15
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 2:26 PM, a11e99z wrote:
 On Thursday, 15 August 2019 at 19:59:34 UTC, Walter Bright wrote:
 On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a
 single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point.
if it was not sarcasm: different code points can ref to same glyphs not vice verse: A(EN,\u0041), A(RU,\u0410), A(EL,\u0391) else sorting for non English will not work. even order(A<B) will be wrong for example such RU glyphs ABCEHKMOPTXacepuxy corresponds to next English letters by sound or meaning AVSENKMORTHaserihu as u can see even uppers and lowers don't exists as pairs and have different meanings
Yes, I've heard this argument before. The answer is that language should not be embedded in Unicode. It will lead to nothing but problems. The language is something externally assigned to a block of text, not the text itself, just like in printed text.

Again,

    a + b = c

Should those be separate code points? How about:

    a) one thing
    b) another
Aug 15
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 02:42:50PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 2:26 PM, a11e99z wrote:
[...]
 if it was not sarcasm:
 different code points can ref to same glyphs not vice verse:
 A(EN,\u0041), A(RU,\u0410), A(EL,\u0391)
 else sorting for non English will not work.
 
 even order(A<B) will be wrong for example such RU glyphs
 ABCEHKMOPTXacepuxy
 corresponds to next English letters by sound or meaning
 AVSENKMORTHaserihu
 as u can see even uppers and lowers don't exists as pairs and have
 different meanings
Yes, I've heard this argument before. The answer is that language should not be embedded in Unicode. It will lead to nothing but problems. The language is something externally assigned to a block of text, not the text itself, just like in printed text.
[...]

You cannot avoid conveying language in a string. Certain characters only exist in certain languages, and the existence of the character itself already encodes language. But that's a peripheral issue.

The more pertinent point is that *different* languages may reuse the *same* glyphs for different (often completely unrelated) purposes. And because of these different purposes, it changes the way the *same* glyph is printed / laid out, and may affect other things in the surrounding context as well.

Put it this way: you agree that the encoding of a character ought not to change depending on font, right? If so, consider your proposal to identify characters by glyph shape. A letter with the shape 'u', by your argument, ought to be represented by one, and only one, Unicode code point -- because, after all, it has the same glyph shape. Correct?

If so, now you have a problem: the shape 'u' in Cyrillic is the cursive lowercase form of и. So now you're essentially saying that all occurrences of 'u' in Cyrillic text must be substituted with и when you change the font from cursive to non-cursive. Which is a contradiction of the initial axiom that character encoding should not be font-dependent.

Please explain how you solve this problem.

T

-- 
Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
Aug 15
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 3:16 PM, H. S. Teoh wrote:
 Please explain how you solve this problem.
The same way printers solved the problem for the last 500 years.
Aug 15
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 06:29:50 UTC, Walter Bright wrote:
 On 8/15/2019 3:16 PM, H. S. Teoh wrote:
 Please explain how you solve this problem.
The same way printers solved the problem for the last 500 years.
They didn't have to do automatic processing of the represented data, i.e. it was for pure human consumption. When the data is to be processed automatically, it is a whole other problem. I'm quite sure that you sometimes appreciate the results of automatic translation (Google Translate, Yandex, Systran, etc.). While the results are far from perfect, they would be absolutely impossible if we used what you propose here.
Aug 16
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 2:27 AM, Patrick Schluter wrote:
 While the results are far from 
 perfect, they would be absolutely impossible if we used what you propose here.
Google translate can (and does) figure it out from the context, just like a human reader would. Sentences written in mixed languages *are* written for human consumption. I have many books written that way. They are quite readable, and don't have any need to clue in the reader "the next word is in french/latin/greek/german".

And frankly, if data processing software is totally reliant on using the correct language-specific glyph, it will fail, because people will not type in the correct one, and visually they cannot proof it for correctness. Anything that does OCR is going to completely fail at this.

Robust data processing software is going to be forced to accept and allow for multiple encodings of the same glyph, pretty much rendering the semantic difference meaningless.

I bet in 10 or 20 years of being clobbered by experience you'll reluctantly agree with me that assigning semantics to individual code points was a mistake. :-)

BTW, I was a winner in the 1986 Obfuscated C Code Contest with:

-------------------------
#include <stdio.h>
#define O1O printf
#define OlO putchar
#define O10 exit
#define Ol0 strlen
#define QLQ fopen
#define OlQ fgetc
#define O1Q abs
#define QO0 for
typedef char lOL;
lOL*QI[] = {"Use:\012\011dump file\012","Unable to open file '\x25s'\012", "\012"," ",""};
main(I,Il) lOL*Il[]; { FILE *L; unsigned lO; int Q,OL[' '^'0'],llO = EOF, O=1,l=0,lll=O+O+O+l,OQ=056; lOL*llL="%2x ";
(I != 1<<1&&(O1O(QI[0]),O10(1011-1010))), ((L = QLQ(Il[O],"r"))==0&&(O1O(QI[O],Il[O]),O10(O)));
lO = I-(O<<l<<O);
while (L-l,1) { QO0(Q = 0L;((Q &~(0x10-O))== l); OL[Q++] = OlQ(L)); if (OL[0]==llO) break; O1O("\0454x: ",lO);
if (I == (1<<1)) { QO0(Q=Ol0(QI[O<<O<<1]);Q<Ol0(QI[0]); Q++)O1O((OL[Q]!=llO)?llL:QI[lll],OL[Q]);/*" O10(QI[1O])*/ O1O(QI[lll]);{} }
QO0 (Q=0L;Q<1<<1<<1<<1<<1;Q+=Q<0100) { (OL[Q]!=llO)? /* 0010 10lOQ 000LQL */ ((D(OL[Q])==0&&(*(OL+O1Q(Q-l))=OQ)), OlO(OL[Q])): OlO(1<<(1<<1<<1)<<1); }
O1O(QI[01^10^9]); lO+=Q+0+l;} }
D(l) { return l>=' '&&l<='\~'; }
-------------------------

http://www.formation.jussieu.fr/ars/2000-2001/C/cours/COMPLEMENTS/DOC/www.ioccc.org/years.html#1986_bright

I am indeed aware of the problems with confusing O0l1|. D does take steps to be more tolerant of bad fonts, such as 10l being allowed in C, but not D. I seriously considered banning the identifiers l and O. Perhaps I should have. | is not a problem because the grammar (i.e. the context) detects errors with it.
Aug 16
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 01:44:20PM -0700, Walter Bright via Digitalmars-d wrote:
[...]
 Google translate can (and does) figure it out from the context, just
 like a human reader would.
Ha! Actually, IME, randomly substituting lookalike characters from other languages in the input to Google Translate often transmutes the result from passably-understandable to outright hilarious (and ridiculous). Or the poor befuddled software just gives up and spits the input back at you verbatim. [...]
 And frankly, if data processing software is totally reliant on using
 the correct language-specific glyph, it will fail, because people will
 not type in the correct one, and visually they cannot proof it for
 correctness.  Anything that does OCR is going to completely fail at
 this.
 
 Robust data processing software is going to be forced to accept and
 allow for multiple encodings of the same glyph, pretty much rendering
 the semantic difference meaningless.
It's not a hard problem. You just need a preprocessing stage to normalize such stray glyphs into the correct language-specific code points, and all subsequent stages in your software pipeline will Just Work(tm). Think of it as a rudimentary "OCR" stage to sanitize your inputs. This option would be unavailable if you used an encoding scheme that *cannot* encode language as part of the string.
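Such a preprocessing stage can be sketched in a few lines of Python. Note the mapping table below is a hand-picked, hypothetical sample; a real implementation would draw on Unicode's published confusables data rather than this illustration:

```python
# Hypothetical table of Latin lookalikes and the Cyrillic code points
# they are confusable with (a real tool would use Unicode's
# confusables.txt data, not this hand-picked sample).
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
}

def to_cyrillic(text):
    """Replace stray Latin lookalikes in text assumed to be Cyrillic."""
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in text)

# The Russian word for "cat" mistyped with a Latin 'o' in the middle
mistyped = "к" + "o" + "т"
print(to_cyrillic(mistyped) == "к\u043eт")  # True: all-Cyrillic after cleanup
```

After this sanitizing pass, every later stage sees consistent, language-correct code points.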
 I bet in 10 or 20 years of being clobbered by experience you'll
 reluctantly agree with me that assigning semantics to individual code
 points was a mistake. :-)
That remains to be seen. :-)
 BTW, I was a winner in the 1986 Obfuscated C Code Contest with:
[...]
 I am indeed aware of the problems with confusing O0l1|. D does take
 steps to be more tolerant of bad fonts, such as 10l being allowed in
 C, but not D. I seriously considered banning the identifiers l and O.
 Perhaps I should have.  | is not a problem because the grammar (i.e.
 the context) detects errors with it.
I also won an IOCCC award once, albeit anonymously (see 2005/anon)... though it had nothing to do with lookalike characters, but more to do with what I call M.A.S.S. (Memory Allocated by Stack-Smashing), in which the program does not declare any variables (besides the two parameters to main()) nor calls any memory allocation functions, but happily manipulates arrays of data. :-D T -- The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
Aug 16
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 11:29:50PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 3:16 PM, H. S. Teoh wrote:
 Please explain how you solve this problem.
The same way printers solved the problem for the last 500 years.
Please elaborate. Because you appear to be saying that Unicode should encode the specific glyph, i.e., every font will have unique encodings for its glyphs, because every unique glyph corresponds to a unique encoding. This is patently absurd, since your string encoding becomes dependent on font selection.

How do you reconcile these two things:

(1) The encoding of a character should not be font-dependent. I.e., it should encode the abstract "symbol" rather than the physical rendering of said symbol.

(2) In the real world, there exist different symbols that share the same glyph shape.

T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.
Aug 16
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 10:01:57AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 How do you reconcile these two things:
 
 (1) The encoding of a character should not be font-dependent. I.e., it
     should encode the abstract "symbol" rather than the physical
     rendering of said symbol.
 
 (2) In the real world, there exist different symbols that share the same
     glyph shape.
Or, to use a different example that stems from the same underlying issue, let's say we take a Russian string: Я тебя люблю. In a cursive font, it might look something like this: Я mеδя ∧юδ∧ю. (I'm deliberately substituting various divergent Unicode characters to make a point.)

According to your proposal, т and m ought to be encoded differently. So that means that Cyrillic lowercase т has *two* different encodings (and ditto for the other lookalikes). This is obviously absurd, because it's the SAME LETTER in Cyrillic. Insisting that they be encoded differently means your string encoding depends on font, which is in itself already ridiculous, and worse yet, it means that if you're writing a web script that accepts input from users, you have no idea which encoding they will use when they want to write Cyrillic lowercase т. You end up with two strings that are logically identical, but bitwise different because the user happened to have a font where т is displayed as m. Goodbye, sane substring search; goodbye, sane automatic string processing; goodbye, consistent string rendering code.

This is equivalent to saying that English capital A in serif ought to have a different encoding from English capital A in sans serif, because their glyph shapes are different. If you follow that route, pretty soon you'll have a different encoding for bolded A, another encoding for slanted A (which is different from italic A), and the combinatorial explosion of useless redundant encodings thereof. It simply does not make any sense.

The only sane way out of this mess is the way Unicode has taken: you encode *not* the glyph, but the logical entity behind the glyph, i.e., the "symbol" as you call it, or in Unicode parlance, the code point. Cyrillic lowercase т is a unique entity that should correspond with exactly one code point, notwithstanding that some of its forms are lookalikes to Latin lowercase m. Even if the font ultimately uses literally the same glyph to render them, they remain distinct entities in the encoding because they are *logically different things*.

In today's age of international communications and multilingual strings, the fact of different logical characters sharing the same rendered form is an unavoidable, harsh reality. You either face it and deal with it in a sane way, or you can hold on to broken old approaches that don't work and fade away in the rearview mirror. Your choice. :-D

T

-- 
Без труда не выловишь и рыбку из пруда.
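The practical consequence for string processing is easy to show; here is a quick Python demonstration (Python used only as a convenient way to spell out code points):

```python
latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A: identical glyph in most fonts

# Distinct logical entities, so they compare unequal despite looking alike
print(latin_a == cyrillic_a)  # False

# A Cyrillic word mistyped with the Latin lookalike defeats naive search
mistyped = "б" + latin_a + "нк"  # renders exactly like "банк" (bank)
print("б\u0430нк" in mistyped)   # False: substring search misses it
```

This is exactly the "two strings that are logically identical, but bitwise different" failure mode described above.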
Aug 16
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 12:59:34PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point.
[...]

It's not as simple as you imagine. Letter shapes across different languages can look alike, but have zero correspondence with each other. Conflating two distinct letter forms just because they happen to look alike is the beginning of the road to madness.

First and foremost, the exact glyph shape depends on the font -- a cursive M is a different shape from a serif upright M, which is different from a sans-serif bolded M. They are logically the exact same character, but they are rendered differently depending on the font. What's the problem with that, you say?

Here's the problem: if we follow your suggestion of identifying characters by rendered glyph, that means a lowercase English 'u' ought to be the same character as the cursive form of Cyrillic и (because that's how it's written in cursive). However, non-cursive Cyrillic и is printed as и (i.e., the equivalent of a "backwards" small-caps English N). You cannot be seriously suggesting that и and u should be the same character, right?! The point is that this changes *based on the font*; Russian speakers recognize the two *distinct* glyphs as the SAME letter. They also recognize that it's a DIFFERENT letter from English u, in spite of the fact the glyphs are identical.

This is just one of many such examples. Yet another Cyrillic example: lowercase cursive Т is written with a glyph that, for all practical purposes, is identical to the glyph for English 'm'. Again, conflating the two based on your idea is outright ridiculous. Just because the user changes the font should not mean that the character becomes a different letter! (Or that the program needs to rewrite all и's into lowercase u's!) How a letter is rendered is a question of *font*, and I'm sure you'll agree that it doesn't make sense to make decisions on character identity based on which font you happen to be using.

Then take an example from Chinese: the character for "one" is, once you strip away the stylistic embellishments (which is an issue of font, and ought not to come into play with a character encoding), basically the same shape as a hyphen. You cannot seriously be telling me that we should treat the two as the same thing.

Basically, there is no sane way to avoid detaching the character encoding from the physical appearance of the character. It simply makes no sense to have a different character for every variation of glyph across a set of fonts. You *have* to work on a more abstract level, at the level of the *logical* identity of the character, not its specific physical appearance per font. But that *inevitably* means you'll end up with multiple distinct characters that happen to share the same glyph (again, modulo which font the user selected for displaying the text). See the Cyrillic examples above.

There are many other examples of logically-distinct characters from different languages that happen to share the same glyph shape with some English letter, which you cannot possibly conflate without ending up with nonsensical results. You cannot eliminate dependence on the specific font if you insist on identifying characters by shape. The only sane solution is to work on the abstract level, where the same logical character (e.g., Cyrillic letter N) can have multiple different glyphs depending on the font (in cursive, for example, capital И looks like English U). But once you work at the abstract level, you cannot avoid some logically-distinct letters coinciding in glyph shape (e.g., English lowercase u vs. Cyrillic и).

And once you start on that slippery slope, you're not very far from descending into the "chaos" of the current Unicode standard -- because inevitably you'll have to make distinctions like "lowercase Greek mu as used in mathematics" vs. "lowercase Greek mu as used by Greeks to write their language" -- because although historically the two were identical, over time their usage has diverged and now there exist some contexts where you have to differentiate between the two.

The fact of the matter is that human language is inherently complex (not to mention it *changes over time* -- something many people don't consider), and no amount of cleverness is going to surmount that without producing an inherently-complex solution.

T

-- 
Why ask rhetorical questions? -- JC
Aug 15
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 3:04 PM, H. S. Teoh wrote:
 [...]
And yet somehow people manage to read printed material without all these problems.
Aug 15
parent reply xenon325 <anm programmer.net> writes:
On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
 And yet somehow people manage to read printed material without 
 all these problems.
If same glyphs had same codes, what will you do with these:

1) Sort strings. In my phone's contact list there are entries in Russian, in English, and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Which is pretty easy to search/navigate. What would the order be if Unicode worked the way you want?

2) Convert cases:
- in English: 'B'.toLower == 'b'
- in Russian: 'В'.toLower == 'в'
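Both points are easy to verify with per-code-point Unicode semantics; a quick Python sketch (Python's built-in case mapping and code-point ordering, shown purely for illustration):

```python
# Case conversion is per code point, so the lookalikes diverge correctly:
print("B".lower())       # 'b' (Latin)
print("\u0412".lower())  # 'в' (Cyrillic В lowercases to в, not to 'b')

# Plain code-point sorting groups the scripts separately, as on the phone:
names = ["Борис", "Alice", "Анна", "Bob"]
print(sorted(names))     # Latin entries first, then Cyrillic
```

If В and B shared one code, toLower could not produce the right answer for both without extra out-of-band language information.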
Aug 16
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 9:32 AM, xenon325 wrote:
 On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
 And yet somehow people manage to read printed material without all these 
 problems.
If same glyphs had same codes, what will you do with these: 1) Sort string. In my phone's contact lists there are entries in russian, in english and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Wich is pretty easy to search/navigate.
Except that there's no guarantee that whoever entered the data used the right code point. The pragmatic solution, again, is to use context. I.e. if a glyph is surrounded by Russian characters, it's likely a Russian glyph. If it is surrounded by characters that form a common Russian word, it's likely a Russian glyph. Of course it isn't perfect, but I bet using context will work better than expecting the code points to have been entered correctly.

I note that you had to tag В with (ru), because otherwise no human reader or OCR would know what it was. This is exactly the problem I'm talking about. Writing software that relies on invisible semantic information is never going to work.
Aug 16
next sibling parent Gregor =?UTF-8?B?TcO8Y2ts?= <gregormueckl gmx.de> writes:
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
 On 8/16/2019 9:32 AM, xenon325 wrote:
 On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright 
 wrote:
 And yet somehow people manage to read printed material 
 without all these problems.
If same glyphs had same codes, what will you do with these: 1) Sort string. In my phone's contact lists there are entries in russian, in english and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Wich is pretty easy to search/navigate.
Except that there's no guarantee that whoever entered the data used the right code point.
Depends. On smartphones, switching the keyboard language is easy (just a swipe on Android), so users that are regularly multilingual should be fine there. Windows also offers keyboard layout switching on the fly with an awkward keyboard shortcut, but it is pretty well hidden. So again, users that are multilingual in their daily routines should really be fine.

But taking a step back and trying to take a bird's eye view on this discussion, it becomes clear to me that the argument could be settled if there were a clear separation of text representations: one for processing (sorting, spell checking, whatever other NLP you can think of) and a completely separate one for display. The transformation to the latter would naturally be lossy and not perfectly reversible. The funny thing is that text rendering with OpenType fonts is *already* doing exactly this transformation to derive the font-specific glyph indices from the text. But all the bells and whistles in Unicode blur this boundary way too much. And this is what we are getting hung up over, I think.

Man, we really managed to go off track in this thread, didn't we? ;)
Aug 16
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
 On 8/16/2019 9:32 AM, xenon325 wrote:
 On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright 
 wrote:
 And yet somehow people manage to read printed material 
 without all these problems.
If same glyphs had same codes, what will you do with these: 1) Sort string. In my phone's contact lists there are entries in russian, in english and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Wich is pretty easy to search/navigate.
Except that there's no guarantee that whoever entered the data used the right code point.
From my experience, that was an issue we encountered often before Unicode: the uppercase letters in Greek texts were mixes of ASCII (A 0x41) and Greek (Α 0xC1 in CP-1253). It was so bad that the Greek translation department didn't use Euramis for a significant amount of time. It was only when we got completely rid of this crap (and also the RTF file format) and embraced Unicode that we got rid of this issue of misused encodings.

While I get that Unicode is (over-)complicated and in some aspects silly, it has nonetheless 2 essential virtues that no other encoding scheme was ever able to achieve:
- it is a norm that is widely used, almost universal.
- it is a norm that is widely used, almost universal.
Yeah, I'm lame, I repeated it twice :-)

The fact that it is widely adopted even in the Far East makes it really something essential. Could they have defined things differently or more simply? Maybe, but I doubt it, as the complexity of Unicode comes from the complexity of languages themselves.
 The pragmatic solution, again, is to use context. I.e. if a 
 glyphy is surrounded by russian characters, it's likely a 
 russian glyph. If it is surrounded by characters that form a 
 common russian word, it's likely a russian glyph.
No, that doesn't work for panaché documents, we've been there, we had that and it sucks. UTF was such a relief. Here little example from our configuration. The regular expression used to detect a document reference in a text as a replaceable: 0:UN:EC_N:((№|č.|nr.|št.|αριθ.|No|nr|N:o|Uimh.|br.|n.|Nr.|Nru|[Nn][º o]|[Nn].[º°o])[  ][0-9]+/[0-9]+/(EC|ES|EF|EG|EK|EΚ|CE|EÜ|EY|CE|EZ|EB|KE|WE)) What is the context here? Btw the EC is Cyrillic and the first EK is Greek and their substitution expressions T:BG:EC_N:№\2/ЕС T:CS:EC_N:č.\2/ES T:DA:EC_N:nr.\2/EF T:DE:EC_N:Nr.\2/EG T:EL:EC_N:αριθ.\2/EΚ T:EN:EC_N:No\2/EC T:ES:EC_N:nº\2/CE T:ET:EC_N:nr\2/EÜ T:FI:EC_N:N:o\2/EY T:FR:EC_N:nº\2/CE T:GA:EC_N:Uimh.\2/CE T:HR:EC_N:br.\2/EZ T:IT:EC_N:n.\2/CE T:LT:EC_N:Nr.\2/EB T:LV:EC_N:Nr.\2/EK T:MT:EC_N:Nru\2/KE T:NL:EC_N:nr.\2/EG T:PL:EC_N:nr\2/WE T:PT:EC_N:n.º\2/CE T:RO:EC_N:nr.\2/CE T:SK:EC_N:č.\2/ES T:SL:EC_N:št.\2/ES T:SV:EC_N:nr\2/EG and as said before, such a number can be in a citation in the language of the citation not in the language of the document.
 Of course it isn't perfect, but I bet using context will work 
 better than expecting the code points to have been entered 
 correctly.

 I note that you had to tag В with (ru), because otherwise no 
 human reader or OCR would know what it was. This is exactly the 
 problem I'm talking about.
Yeah, but what you propose makes it even worse not better.
 Writing software that relies on invisible semantic information 
 is never going to work.
Invisible to your eyes, but not invisible to the machines; that's the whole point. Why do we need to annotate all the functions in D with these annoying attributes if the compiler could detect them automagically via context? Because in general it can't: the semantic information must be provided somehow.
Aug 17
prev sibling parent sarn <sarn theartofmachinery.com> writes:
On Friday, 16 August 2019 at 16:32:05 UTC, xenon325 wrote:
 If same glyphs had same codes, what will you do with these:
 ...
 2) Convert cases:
 - in english: 'B'.toLower == 'b'
 - in russian: 'В'.toLower == 'в'
FWIW, we have that problem today with Unicode and the letter i: https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing
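The locale problem shows up directly in D's case mapping, which (like Unicode's default simple case folding) is locale-blind. A small sketch using std.uni:

```d
import std.uni : toLower;

void main()
{
    // Unicode's default case mapping always maps Latin 'I' to 'i',
    // which is wrong for Turkish, where 'I' should map to dotless
    // 'ı' (U+0131).
    assert(toLower('I') == 'i');

    // The Russian example from the quote works correctly, because
    // Cyrillic В (U+0412) is a code point distinct from Latin B:
    assert(toLower('В') == 'в');
}
```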
Aug 16
prev sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
 Basically, there is no sane way to avoid detaching the 
 character encoding from the physical appearance of the 
 character.  It simply makes no sense to have a different 
 character for every variation of glyph across a set of fonts.  
 You *have* to work on a more abstract level, at the level of 
 the *logical* identity of the character, not its specific 
 physical appearance per font.
OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same, and should appear exactly the same (the latter doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded in two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön". This is unnecessary duplication.
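The two encodings of "schön" can be reconciled with Unicode normalization. A minimal sketch using std.uni.normalize (NFC, the composed form, chosen here for illustration):

```d
import std.uni : NFC, normalize;

void main()
{
    string precomposed = "sch\u00F6n";  // ö as single code point U+00F6
    string decomposed  = "scho\u0308n"; // 'o' + U+0308 COMBINING DIAERESIS

    // Byte-wise (and code-point-wise), the strings differ...
    assert(precomposed != decomposed);

    // ...but they compare equal after normalizing both to NFC.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
}
```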
Aug 15
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 10:37:57PM +0000, Gregor Mückl via Digitalmars-d wrote:
 On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
 Basically, there is no sane way to avoid detaching the character
 encoding from the physical appearance of the character.  It simply
 makes no sense to have a different character for every variation of
 glyph across a set of fonts.  You *have* to work on a more abstract
 level, at the level of the *logical* identity of the character, not
 its specific physical appearance per font.
OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same and should appear the exact same (the later doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön".
prev sibling next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, August 15, 2019 4:59:45 PM MDT H. S. Teoh via Digitalmars-d 
wrote:
On Thu, Aug 15, 2019 at 10:37:57PM +0000, Gregor Mückl via Digitalmars-d 
wrote:
 On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
 Basically, there is no sane way to avoid detaching the character
 encoding from the physical appearance of the character.  It simply
 makes no sense to have a different character for every variation of
 glyph across a set of fonts.  You *have* to work on a more abstract
 level, at the level of the *logical* identity of the character, not
 its specific physical appearance per font.
OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same and should appear the exact same (the later doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön".
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 05:06:57PM -0600, Jonathan M Davis via Digitalmars-d
wrote:
 On Thursday, August 15, 2019 4:59:45 PM MDT H. S. Teoh via Digitalmars-d 
 wrote:
[...]
 Unicode does have some dark corners like that.[*]
[...]
 [*] And some worse-than-dark-corners, like the whole codepage
 dedicated to emoji *and* combining marks for said emoji that changes
 their *appearance* -- something that ought not to have any place in
 a character encoding scheme!  Talk about scope creep...
Considering that emojis are supposed to be pictures formed with letters (simple ASCII art, basically), they have no business being part of an encoding scheme in the first place -- but having combining marks to change their appearance definitely makes it that much worse.
[...]

It's not just emojis; GUI icons are already a thing in Unicode. If this trend of encoding graphics in a string continues, in about a decade's time we'll be able to reinvent Nethack with graphical tiles inside a text-mode terminal, using Unicode RPG icon "characters" which you can animate by attaching various "combining diacritics". It would be kewl. But also utterly pointless and ridiculous.

(In fact, I wouldn't be surprised if you can already do this to some extent using emojis and GUI icon "characters". Just add a few more Unicode "characters" for in-game objects and a few more "diacritics" for animation frames, and we're already there. Throw in a zero-width, non-spacing "animation frame variant selector" "character", and we could have an entire animation sequence encoded as a string. Who even needs PNGs and animated SVGs anymore?!)

T

-- 
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Aug 15
prev sibling parent Argolis <argolis gmail.com> writes:
On Thursday, 15 August 2019 at 19:05:32 UTC, Gregor Mückl wrote:
 On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
 [...]
This is the point we're trying to get across to you: this isn't sufficient. Depending on the context and the script/language, you need access to the string at various levels. E.g. a font renderer needs to sometimes iterate code points, not graphemes in order to compose the correct glyphs. [...]
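The different levels of access described in the quote above can be seen side by side in D. A minimal sketch using std.utf and std.uni:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "noël" written with a combining diaeresis (U+0308) after the 'e'.
    string s = "noe\u0308l";

    assert(s.byCodeUnit.walkLength == 6); // UTF-8 code units
    assert(s.byDchar.walkLength == 5);    // code points
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```

A font shaper, a regex engine, and a text editor's cursor movement each legitimately need a different one of these three views.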
I want to thank you; that was really inspiring to me in trying to dig deeper into the problem!
Aug 16
prev sibling next sibling parent reply GreatSam4sure <greatsam4sure gmail.com> writes:
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 We don't yet have a good plan on how to remove autodecoding and 
 yet provide backward compatibility with autodecoding-reliant 
 projects, but one thing we can do is make Phobos work properly 
 with and without autodecoding.

 To that end, I created a build of Phobos that disables 
 autodecoding:

 https://github.com/dlang/phobos/pull/7130

 Of course, it fails. If people want impactful things to work 
 on, fixing each failure is worthwhile (each in separate PRs).

 Note that this is neither trivial nor mindless code editing. 
 Each case has to be examined as to why it is doing 
 autodecoding, is autodecoding necessary, and deciding to 
 replace it with byChar, byDchar, or simply hardcoding the 
 decoding logic.
Thanks for your effort in this direction. I once saw the massive discussion on auto decoding. Recently I have witnessed a massive effort from you, Andrei and the entire community on the D language. I must confess you have a beautiful language already. The D language promises a lot through its elegance, compilation speed, runtime speed, generics, and the multiple programming techniques it supports.

I don't have a problem with the language that much, but with the libraries, tutorials, documentation, and IDEs. Almost every time I download a library from the package repository, there is one error or another. I will be happy if the tools and libraries just work out of the box. The tools and the library should be set up so that a novice like me can use them. I don't have much expertise in programming, so I can't contribute to D for now.
Aug 13
parent Andre Pany <andre s-e-a-p.de> writes:
On Tuesday, 13 August 2019 at 11:01:30 UTC, GreatSam4sure wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 [...]
[...]
I started to create GitHub issues every time I see some errors in libraries. This already helps a lot. What would really be useful is to see the build status of libraries on code.dlang.org. With the new CI/CD functionality of GitHub (free for open source projects), this becomes a lot more feasible and easy to set up.

Kind regards
Andre
Aug 13
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/13/2019 12:08 AM, Walter Bright wrote:
 If people want impactful things to work on, fixing each 
 failure is worthwhile (each in separate PRs).
First fix: https://github.com/dlang/phobos/pull/7133
Aug 14
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 https://github.com/dlang/phobos/pull/7130
Thank you for working on this! Surprisingly, the amount of breakage this causes seems rather small. I sent a few PRs for the modules that I am listed as a code owner of.

However, I noticed that one kind of breakage is silent (the code compiles and runs, but behaves differently). This makes me uneasy, as it would be difficult to ensure that programs are fully and correctly updated for a (hypothetical) transition to no-autodecode. I found two cases of such silent breakage.

One was in std.stdio: https://github.com/dlang/phobos/pull/7140 If there were a warning, or it were an error to implicitly convert char to dchar, the breakage would have been detected during compilation. I'm sure we discussed this before. (Allowing a char, which might have a value >= 0x80, to implicitly convert to dchar, which would be nonsense, is problematic, etc.)

The other instance of silent breakage was in std.regex. This unittest assert started failing: https://github.com/dlang/phobos/blob/5cb4d927e56725a38b0b1ea1548d9954083d3290/std/regex/package.d#L629 I haven't looked into that; perhaps someone more familiar with std.regex and std.uni could have a look.
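The silent-breakage mechanism can be sketched in isolation. A hypothetical minimal example (not the actual std.stdio code), showing how an implicit char-to-dchar conversion silently produces the wrong character:

```d
void main()
{
    string s = "é";  // U+00E9, encoded in UTF-8 as 0xC3 0xA9

    // Without autodecoding, the first element of s is a char: the
    // first code unit 0xC3, not the code point 'é'.
    char c = s[0];

    // This implicit char -> dchar conversion compiles without any
    // warning, but yields U+00C3 ('Ã'), not U+00E9 -- silently wrong.
    dchar d = c;
    assert(d == 0xC3);
    assert(d != 'é');
}
```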
Aug 15
next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev 
wrote:
 I haven't looked into that, perhaps someone more familiar with 
 std.regex and std.uni could have a look.
In std.uni, there is genericDecodeGrapheme, which needs to:

1. Work with strings of any width
2. Work with input ranges of dchars
3. Advance the given range by ref

With autodecoding on, the first case is handled by .front / .popFront. With autodecoding off, there is no direct equivalent any more. The problem is that the function needs to peek ahead (which can be multiple range elements for ranges of narrow char types), which is not possible for input ranges.

- Replacing .front / .popFront with std.utf.decodeFront does not work, because the function does not do these operations in the same place, so we need to save the range before decodeFront advances it, but we can't .save() input ranges from case 2 above.

- Using byDchar does not work because .byDchar does not take its range by ref, so advancing the byDchar range will not advance the range passed by ref to genericDecodeGrapheme. I tried to use std.range.refRange for this but hit a compiler ICE ("precedence not defined for token 'cantexp'").

Perhaps there is already a construct in Phobos that can solve this?
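For reference, this is the decodeFront behavior described above: it advances the range as it decodes, so there is no way to peek at the next code point without consuming code units. A small sketch:

```d
import std.utf : decodeFront;

void main()
{
    string s = "héllo";
    size_t consumed;

    // decodeFront pops the code units it decodes off the front of s.
    dchar c = s.decodeFront(consumed);
    assert(c == 'h' && consumed == 1);

    c = s.decodeFront(consumed);
    assert(c == 'é' && consumed == 2); // 'é' is two UTF-8 code units

    assert(s == "llo"); // the range has been advanced in place
}
```

For forward ranges, .save() before decoding gives a way to back up; for input ranges there is no such escape hatch, which is exactly the difficulty above.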
Aug 15
prev sibling next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev 
wrote:
 I haven't looked into that, perhaps someone more familiar with 
 std.regex and std.uni could have a look.
I should add that the std.uni "silent" breakage was also due to `dchar c = str.front`, and would have been found by disallowing the char->dchar implicit conversion.
Aug 15
prev sibling next sibling parent reply Les De Ridder <les lesderid.net> writes:
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 https://github.com/dlang/phobos/pull/7130
[...] However, I noticed that one kind of the breakage is silent (the code compiles and runs, but behaves differently). This makes me uneasy, as it would be difficult to ensure that programs are fully and correctly updated for a (hypothetical) transition to no-autodecode.
I remembered this article from the wiki where you pointed this out back in 2014: https://wiki.dlang.org/Element_type_of_string_ranges See also the forum thread that it links to.
Aug 15
next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 15:01:22 UTC, Les De Ridder wrote:
 I remembered this article from the wiki where you pointed this 
 out back
 in 2014:

 https://wiki.dlang.org/Element_type_of_string_ranges
I completely forgot about that. Thanks for bringing it up, looks like it's still relevant :)
Aug 15
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
I ran into that as well with the 3 PRs I did:

fix array(String) to work with no autodecode
https://github.com/dlang/phobos/pull/7133

fix assocArray() unittests for no autodecode
https://github.com/dlang/phobos/pull/7134

fix unittests for array.join() for no autodecode
https://github.com/dlang/phobos/pull/7135

More specifically, the ElementType template returns a dchar for an
autodecodable 
string, and char/wchar for a non-autodecodable string. I suspect that most 
people are not aware of this, and code that uses ElementType may already be 
subtly broken.
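A quick way to see the current behavior is to compare ElementType with ElementEncodingType. A sketch under today's (autodecoding) Phobos:

```d
import std.range : ElementEncodingType, ElementType;

void main()
{
    // With autodecoding, narrow strings iterate as ranges of dchar...
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementType!wstring == dchar));

    // ...even though the underlying code unit types differ.
    static assert(is(ElementEncodingType!string == immutable(char)));
    static assert(is(ElementEncodingType!wstring == immutable(wchar)));
}
```

Code that calls ElementType on a string and assumes it matches the stored code unit type is exactly the kind that breaks silently when autodecoding is disabled.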

Note that the documentation for ElementType is also wrong,

   https://github.com/dlang/phobos/pull/

because isNarrowString is NOT THE SAME THING as an autodecoding string! The 
difference is isNarrowString excludes stringish aggregates and enums with a 
string base type, while autodecoding types include them.

Does this confuse anyone? It confuses me. I can never remember which is which.

Autodecoding is not only a conceptual mistake; the way it is implemented is a 
buggy, confusing disaster. (isNarrowString is often incorrectly used instead of 
isAutodecodableString in Phobos.)

I think the only solution is to "rip the band-aid off" and have ElementType
give 
the code unit type when autodecoding is disabled.
Aug 15
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
 I sent a few PRs for the modules that I am listed as a code owner of.
Can you please add a link to those PRs in https://github.com/dlang/phobos/pull/7130 ? I think such references to how Phobos fixed its dependencies on autodecode will be valuable to programmers who want to fix theirs.
Aug 15
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 21:21:33 UTC, Walter Bright wrote:
 On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
 I sent a few PRs for the modules that I am listed as a code 
 owner of.
Can you please add a link to those PRs in https://github.com/dlang/phobos/pull/7130 ?
I added a link to #7130 to the PR descriptions, which should do it. Noticed you added some comments just now doing the same, too.
Aug 15
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 2:25 PM, Vladimir Panteleev wrote:
 On Thursday, 15 August 2019 at 21:21:33 UTC, Walter Bright wrote:
 On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
 I sent a few PRs for the modules that I am listed as a code owner of.
Can you please add a link to those PRs in https://github.com/dlang/phobos/pull/7130 ?
I added a link to #7130 to the PR derciptions, which should do it. Noticed you added some comments just now too doing the same.
I went one better, I added a [no autodecode] label!
Aug 15