digitalmars.D - Why not extend do to allow unicode in ID's?

Bert (80/80) Jun 29 2019 It would greatly expand the coverage.

sarn (8/8) Jun 29 2019 D already allows non-latin characters in identifiers, just not

Bert (3/11) Jun 30 2019 Yeah, I noticed some work but many do not and I'm not even sure

Dennis (27/33) Jun 30 2019 Currently D allows "universal alphas" in identifiers, so Greek

Bert (21/54) Jun 30 2019 Thanks. I guess I could create a small routine that hacks the

Dennis (11/21) Jul 01 2019 I don't have much Nim experience myself, so maybe you should ask

Martin Krejcirik (4/7) Jul 01 2019 I think a source code should be easily editable by anyone using a

Bert (11/19) Jul 01 2019 It's time to grow up? How can progress be made if we don't

Martin Krejcirik (4/7) Jul 02 2019 Editors maybe, but what about keyboards ? Can you easily write my

Timon Gehr (4/11) Jul 02 2019 Of course. I can easily write this even on a US keyboard. My editor is

XavierAP (4/8) Jul 03 2019 auto Krejčiřík = 0;

Jonathan M Davis (19/22) Jul 01 2019 ...

rikki cattermole (4/30) Jul 01 2019 No DIP is required. The lexer just needs updating to match to the

Jonathan M Davis (9/41) Jul 01 2019 If a character is a Unicode alpha character, then yes. However if it's n...

XavierAP (8/9) Jul 03 2019 A good example of a character that should not be allowed in

Bert (49/60) Jul 04 2019 Maybe, maybe not. It could be useful in some contexts... probably

Bert <Bert gmail.com> writes:

It would greatly expand the coverage.

It would be nice to use certain characters that are truly 
meaningful.

In fact, it would be nice for ops too.

░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼
∩ε≡φ±≥≤⌠⌡÷≈∙·√ⁿ²■☺♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼

I realize the excuse is going to be "It makes the code look ugly 
or hard to read", not all editors will support it, etc...

So? Those are lame excuses. People can abuse anything, you can't 
police the world. Stopping all legitimate uses because someone 
might use it illegitimately is ignorant and harmful(which is why 
it is ignorant).

It is best to enable unicode support and then have standards and 
guidelines and let some people shoot themselves in the foot if 
they want... that is the best way to learn not to do it again.

Imagine being able to write proper mathematical formula ID's:

∞
δ
Ω
Θ
Φ
τ
µ
σ
ε
φ

or using valid mathematical operators:

∩
≡
±
≥
≤
÷
≈
∙
√
ⁿ
²

or when you write a card game:

♥
♦
♣
♠


These are much better than the verbose words that we have to use 
now. I know some will say the opposite, but they can say it and 
be wrong. Trying to stop me from shooting myself in the foot when 
I don't own a gun is abusive to me and just like shooting me in 
the foot! I don't write any code that anyone else reads so let me 
make the choice for myself rather than create arbitrary rules 
that limit expressiveness. There is a reason the first amendment 
exists in the US, because the founders knew what limitations of 
expression would do. The same applies to all things.

Maybe one could use a switch to enable such a language with a 
compiler warning about such use. Maybe we can have a special D 
code page for useful symbols so there is a standard code for each 
that one could properly map using their editor of choice?

For example, we could have each symbol map to a long name that 
one could use to replace the source:

♥ = Symbol_Heart_0x2665    // or even just __Symbol__0x2665
♦ = Symbol_Diamond_0x2666
♣ = Symbol_Club_0x2663
♠ = Symbol_Spade_0x2660

And one could then change any source code between the symbolic 
form and the verbose form using a command line utility.

E.g.,


int ♥ = 3;

Can be converted to

int Symbol_Heart_0x2665 = 3;

and back without issue(99.9999999999% of code).
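
A minimal sketch of such a round-trip utility, assuming a small hypothetical symbol table. The names `Mapping`, `table`, `toVerbose` and `toSymbolic` are illustrative only, and the hex suffix in each long name is the symbol's Unicode code point. It uses plain text replacement, so it would also rewrite symbols inside string literals and comments, which is the meta programming caveat mentioned in this post.

```d
import std.array : replace;
import std.stdio : writeln;

// Hypothetical pairing of a symbol with a verbose ASCII identifier.
struct Mapping
{
    string symbol;
    string name;
}

// Illustrative table only; nothing like this exists in D itself.
static immutable Mapping[] table = [
    Mapping("♥", "Symbol_Heart_0x2665"),
    Mapping("♦", "Symbol_Diamond_0x2666"),
    Mapping("♣", "Symbol_Club_0x2663"),
    Mapping("♠", "Symbol_Spade_0x2660"),
];

/// Replaces every known symbol with its verbose name.
string toVerbose(string source)
{
    foreach (m; table)
        source = source.replace(m.symbol, m.name);
    return source;
}

/// Replaces every verbose name with its symbol.
string toSymbolic(string source)
{
    foreach (m; table)
        source = source.replace(m.name, m.symbol);
    return source;
}

void main()
{
    string symbolic = "int ♥ = 3;";
    string verbose = toVerbose(symbolic);
    writeln(verbose); // int Symbol_Heart_0x2665 = 3;
    assert(toSymbolic(verbose) == symbolic);
}
```

A real tool would need to lex the source rather than replace raw text, so that only identifiers are touched.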

This would potentially cause issue with meta programming when 
comparing string of the id names but this is a minor issue. In 
fact, internally D could just use the long symbol name and 
require the programmer to use them.

E.g.,

static if (id == "♥") // invalid if id gets converted to long name internally.


symbol to its long name

There are solutions to the problems... let's work on finding one 
to make D better.

Jun 29 2019

sarn <sarn theartofmachinery.com> writes:

D already allows non-latin characters in identifiers, just not 
arbitrary symbols:

import std.stdio;

void main()
{
	double φ = 1.61803398874989484820;
	writeln(φ);
}

Jun 29 2019

Bert <Bert gmail.com> writes:

On Sunday, 30 June 2019 at 03:12:55 UTC, sarn wrote:
 D already allows non-latin characters in identifiers, just not 
 arbitrary symbols:

 import std.stdio;

 void main()
 {
 	double φ = 1.61803398874989484820;
 	writeln(φ);
 }

Yeah, I noticed some work but many do not and I'm not even sure 
what does or doesn't ;/

Jun 30 2019

Dennis <dkorpel gmail.com> writes:

On Saturday, 29 June 2019 at 22:38:06 UTC, Bert wrote:
 Imagine being able to write proper mathematical formula ID's:

Currently D allows "universal alphas" in identifiers, so Greek 
letters are allowed already. See: 
https://dlang.org/spec/lex.html#identifiers
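
One way to find out which characters a given compiler accepts is to ask the compiler itself. The sketch below assumes that `__traits(compiles, ...)` reports a string mixin with a rejected character as non-compiling. The helper name `isValidIdentifier` is made up for illustration, and the results may differ between compiler versions if the identifier tables change.

```d
// True if the compiler accepts `name` as the name of a local variable.
// Hypothetical helper, not part of Phobos or druntime.
enum bool isValidIdentifier(string name) =
    __traits(compiles, { mixin("int " ~ name ~ ";"); });

// Letters, including Greek and accented Latin, are accepted.
static assert( isValidIdentifier!"φ");
static assert( isValidIdentifier!"Krejčiřík");

// Arbitrary symbols and math operators are rejected.
static assert(!isValidIdentifier!"♥");
static assert(!isValidIdentifier!"±");

void main() {}
```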

 or using valid mathematical operators:

In D you can't add operators, but if you want math notation on 
existing ones, you might be interested in fonts with programming 
ligatures such as:
https://github.com/tonsky/FiraCode

 or when you write a card game:

Custom literals can be added with templates, for example:
octal!377
(https://github.com/dlang/phobos/blob/d57be4690fc923a1974a4ef4d8b84a951131d219/std/conv.d#L4062)
tok!"if"
(https://github.com/dlang-community/libdparse/blob/5270739bcd1962418784c7760773e24d28b6009b/src/dparse/lexer.d#L115)

Since in strings any Unicode is allowed, you can do something 
similar:
suit!"♥"
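
A minimal sketch of what such a template could look like. The `Suit` enum and the `suit` template are hypothetical names for illustration, not an existing library API.

```d
import std.stdio : writeln;

enum Suit { hearts, diamonds, clubs, spades }

/// Maps a suit symbol, passed as a compile-time string, to a Suit value.
template suit(string symbol)
{
    static if (symbol == "♥")
        enum suit = Suit.hearts;
    else static if (symbol == "♦")
        enum suit = Suit.diamonds;
    else static if (symbol == "♣")
        enum suit = Suit.clubs;
    else static if (symbol == "♠")
        enum suit = Suit.spades;
    else
        static assert(false, "unknown suit symbol: " ~ symbol);
}

// The lookup happens entirely at compile time.
static assert(suit!"♥" == Suit.hearts);

void main()
{
    Suit s = suit!"♠";
    writeln(s); // spades
}
```

An unknown symbol such as `suit!"x"` fails at compile time with the message from the `static assert`.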

 I don't write any code that anyone else reads so let me make 
 the choice for myself rather than create arbitrary rules that 
 limit expressiveness.

If it's only for yourself, you can add a build step that 
substitutes your custom symbols with valid identifiers before 
compiling. Or use your own fork of the compiler, you probably 
only need to remove this line:
https://github.com/dlang/dmd/blob/2599559d624275bfcff298b3a8b31f9d82ae534f/src/dmd/lexer.d#L524

Finally, if you truly long for ultimate freedom in how you write 
code, then Nim might be the right language for you since it 
aligns more with your "putting full trust in the programmer" view 
than D. In Nim, any non-ascii character is valid for identifiers, 
so even invalid Unicode characters are allowed.
https://nim-lang.org/docs/manual.html#lexical-analysis-identifiers-amp-keywords

Jun 30 2019

Bert <Bert gmail.com> writes:

On Sunday, 30 June 2019 at 10:10:41 UTC, Dennis wrote:
 On Saturday, 29 June 2019 at 22:38:06 UTC, Bert wrote:
 Imagine being able to write proper mathematical formula ID's:

 Currently D allows "universal alphas" in identifiers, so Greek
 letters are allowed already. See:
 https://dlang.org/spec/lex.html#identifiers

 or using valid mathematical operators:

 In D you can't add operators, but if you want math notation on
 existing ones, you might be interested in fonts with
 programming ligatures such as:
 https://github.com/tonsky/FiraCode

 or when you write a card game:

 Custom literals can be added with templates, for example:
 octal!377
 (https://github.com/dlang/phobos/blob/d57be4690fc923a1974a4ef4d8b84a951131d219/std/conv.d#L4062)
 tok!"if"
 (https://github.com/dlang-community/libdparse/blob/5270739bcd1962418784c7760773e24d28b6009b/src/dparse/lexer.d#L115)

 Since in strings any Unicode is allowed, you can do something
 similar:
 suit!"♥"

 I don't write any code that anyone else reads so let me make
 the choice for myself rather than create arbitrary rules that
 limit expressiveness.

 If it's only for yourself, you can add a build step that
 substitutes your custom symbols with valid identifiers before
 compiling. Or use your own fork of the compiler, you probably
 only need to remove this line:
 https://github.com/dlang/dmd/blob/2599559d624275bfcff298b3a8b31f9d82ae534f/src/dmd/lexer.d#L524

Thanks. I guess I could create a small routine that hacks the
binary that reverses the if check. This would be easiest to
maintain as I wouldn't have to recompile dmd every release, just
install the new one and patch.

 Finally, if you truly long for ultimate freedom in how you
 write code, then Nim might be the right language for you since
 it aligns more with your "putting full trust in the programmer"
 view than D. In Nim, any non-ascii character is valid for
 identifiers, so even invalid Unicode characters are allowed.
 https://nim-lang.org/docs/manual.html#lexical-analysis-identifiers-amp-keywords

I've heard of nim but never really looked in to it much.... but
every time I hear about it I am more and more enticed.

It seems well put together but the syntax is a little off
putting. I'm sure I could get used to it.

I have a few questions:

1. There doesn't seem to be good IDE support. I mainly use Visual
Studio and I see a nim for VSC which I don't use ;/ Is there any
really good IDE support?

2. How does meta programming of Nim compare to D's? The main
reason I use D is its meta programming.

3. Nim seems to have somewhat of a strong categorical and
functional foundation. Is it more like Haskell than D? (In the
sense of catering to strongly structured programming(functors,
natural transformations, etc))

I'll try to read over the manual. Maybe my next program will be
in Nim.

Jun 30 2019

Dennis <dkorpel gmail.com> writes:

On Sunday, 30 June 2019 at 23:27:56 UTC, Bert wrote:
 I have a few questions:

 1. There doesn't seem to be good IDE support. I mainly use 
 Visual Studio and I see a nim for VSC which I don't use ;/  Is 
 there any really good IDE support?

I don't have much Nim experience myself, so maybe you should ask 
on the Nim forum.

 2. How does meta programming of Nim compare to D's? The main 
 reason I use D is its meta programming.

It also has static if ('when'), CTFE, type reflection 
('typedesc') and templates. In addition, it has AST macros which 
D will not have. (You can find long past discussions why, or 
Google 'The Lisp Curse' for something related).

 3. Nim seems to have somewhat of a strong categorical and 
 functional foundation. Is it more like Haskell than D? (In the 
 sense of catering to strongly structured programming(functors, 
 natural transformations, etc))

Both are system programming languages that support mutation, 
loops and pointers, so you can write C-style procedural code in 
either language. Whether Nim's higher level constructs are 
similar to Haskell is something I cannot judge.

Jul 01 2019

Martin Krejcirik <mk-junk i-line.cz> writes:

On Saturday, 29 June 2019 at 22:38:06 UTC, Bert wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly 
 meaningful.


I think a source code should be easily editable by anyone using a 
keyboard and plain editor. Extended characters only complicate 
things.

Jul 01 2019

Bert <Bert gmail.com> writes:

On Monday, 1 July 2019 at 17:14:08 UTC, Martin Krejcirik wrote:
 On Saturday, 29 June 2019 at 22:38:06 UTC, Bert wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly 
 meaningful.


 I think a source code should be easily editable by anyone using 
 a keyboard and plain editor. Extended characters only 
 complicate things.

It's time to grow up? How can progress be made if we don't 
progress? 99% of all modern text editors support UTF-8... with 
your logic we could say that ascii characters only complicate 
things, why not just force everyone to code in binary? That would 
be the simplest thing to do, right?

What you are telling me is that you want to force me to use your 
view but you don't want me to force you to use mine.

What you are actually doing is assuming it would be a problem 
without actually knowing or having any evidence it would be. You 
should ponder that a little.

Jul 01 2019

Martin Krejcirik <mk-junk i-line.cz> writes:

On Monday, 1 July 2019 at 23:52:25 UTC, Bert wrote:
 It's time to grow up? How can progress be made if we don't 
 progress. 99% of all modern text editors support UTF-8... with 
 your logic we could say that ascii characters only complicate

Editors maybe, but what about keyboards? Can you easily write my 
name (Krejčiřík) without copy and paste or character selector 
tool?

Jul 02 2019

Timon Gehr <timon.gehr gmx.ch> writes:

On 02.07.19 11:10, Martin Krejcirik wrote:
 On Monday, 1 July 2019 at 23:52:25 UTC, Bert wrote:
 It's time to grow up? How can progress be made if we don't progress. 
 99% of all modern text editors support UTF-8... with your logic we 
 could say that ascii characters only complicate

 
 Editors maybe, but what about keyboards? Can you easily write my name 
 (Krejčiřík) without copy and paste or character selector tool?

Of course. I can easily write this even on a US keyboard. My editor is 
set up to translate Krej\vci\vr\'ik to Krejčiřík as I type.

This is not a hard problem.

Jul 02 2019

XavierAP <n3minis-git yahoo.es> writes:

On Tuesday, 2 July 2019 at 18:28:06 UTC, Timon Gehr wrote:
 Of course. I can easily write this even on a US keyboard. My 
 editor is set up to translate Krej\vci\vr\'ik to Krejčiřík as I 
 type.

 This is not a hard problem.

	auto Krejčiřík = 0;
	static assert(is(typeof(Krejčiřík)));

Already supported :)

Jul 03 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, June 29, 2019 4:38:06 PM MDT Bert via Digitalmars-d wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly
 meaningful.

...

Like most major languages, D supports identifiers with alphanumeric
characters plus underscore with the first character not being allowed to be
numeric. However, unlike most languages, it expands that to include Unicode
alpha characters, meaning that quite a lot of Unicode is supported in
identifiers. So, it already goes far beyond what most languages do.

That being said, I think that you'll find that most folks will not be in
favor of using Unicode in identifiers outside of code intended for people of
a specific language who actually use those characters normally (e.g.
Japanese characters when all of the programmers involved read and write
Japanese and have keyboards that support it). The fact that a character is
not a key on a typical keyboard means that anyone using an identifier with
that character in it will almost certainly have to copy-paste it, and that's
really not going to go over well with most people. If you really feel strongly
about the matter, you can always create a DIP to propose a language change
to allow more Unicode characters in identifiers, but I would not expect it
to be accepted.

- Jonathan M Davis

Jul 01 2019

rikki cattermole <rikki cattermole.co.nz> writes:

On 02/07/2019 2:17 PM, Jonathan M Davis wrote:
 On Saturday, June 29, 2019 4:38:06 PM MDT Bert via Digitalmars-d wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly
 meaningful.

 ...
 
 Like most major languages, D supports identifiers with alphanumeric
 characters plus underscore with the first character not being allowed to be
 numeric. However, unlike most languages, it expands that to include Unicode
 alpha characters, meaning that quite a lot of Unicode is supported in
 identifiers. So, it already goes far beyond what most languages do.
 
 That being said, I think that you'll find that most folks will not be in
 favor of using Unicode in identifiers outside of code intended for people of
 a specific language who actually use those characters normally (e.g.
 Japanese characters when all of the programmers involved read and write
 Japanese and have keyboards that support it). The fact that a character is
 not a key on a typical keyboard means that anyone using an identifier with
 that character in it will almost certainly have to copy-paste it, and that's
 really not going to go over well with most people. If you really feel strongly
 about the matter, you can always create a DIP to propose a language change
 to allow more Unicode characters in identifiers, but I would not expect it
 to be accepted.
 
 - Jonathan M Davis

No DIP is required. The lexer just needs updating to match to the 
(current) Unicode spec.

https://github.com/dlang/dmd/blob/master/src/dmd/lexer.d#L1082

Jul 01 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, July 1, 2019 8:56:55 PM MDT rikki cattermole via Digitalmars-d 
wrote:
 On 02/07/2019 2:17 PM, Jonathan M Davis wrote:
 On Saturday, June 29, 2019 4:38:06 PM MDT Bert via Digitalmars-d wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly
 meaningful.

 ...

 Like most major languages, D supports identifiers with alphanumeric
 characters plus underscore with the first character not being allowed to
 be numeric. However, unlike most languages, it expands that to include
 Unicode alpha characters, meaning that quite a lot of Unicode is
 supported in identifiers. So, it already goes far beyond what most
 languages do.

 That being said, I think that you'll find that most folks will not be in
 favor of using Unicode in identifiers outside of code intended for
 people of a specific language who actually use those characters
 normally (e.g. Japanese characters when all of the programmers involved
 read and write Japanese and have keyboards that support it). The fact
 that a character is not a key on a typical keyboard means that anyone
 using an identifier with that character in it will almost certainly have
 to copy-paste it, and that's really not going to go over well with most
 people. If you really feel strongly about the matter, you can always
 create a DIP to propose a language change to allow more Unicode
 characters in identifiers, but I would not expect it to be accepted.

 - Jonathan M Davis

 No DIP is required. The lexer just needs updating to match to the
 (current) Unicode spec.

 https://github.com/dlang/dmd/blob/master/src/dmd/lexer.d#L1082

If a character is a Unicode alpha character, then yes. However if it's not,
then that would definitely be a language change and would require a DIP. The
spec is quite specific about it requiring Unicode alpha characters, and the
code does the same. Without looking at the Unicode spec, I have no clue
which characters are alpha characters, but I'd be extremely surprised if a
character like ± or ♥ qualified.
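
As a rough check, Phobos can classify characters by the Unicode Alphabetic property via `std.uni.isAlpha`. This is only an approximation: the table the compiler actually uses for identifiers is defined by the language spec and need not match `std.uni` exactly. The helper name `isLetterLike` is made up for this sketch.

```d
import std.stdio : writeln;
import std.uni : isAlpha;

/// Approximates "could start an identifier": underscore or alphabetic.
/// Hypothetical helper; the compiler's own table may differ.
bool isLetterLike(dchar c)
{
    return c == '_' || isAlpha(c);
}

void main()
{
    // Letters from non-Latin and accented alphabets are alphabetic.
    assert( isLetterLike('φ'));
    assert( isLetterLike('č'));

    // Math operators and pictographic symbols are not.
    assert(!isLetterLike('±'));
    assert(!isLetterLike('♥'));

    foreach (dchar c; "φč±♥")
        writeln(c, " -> ", isLetterLike(c));
}
```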

- Jonathan M Davis

Jul 01 2019

XavierAP <n3minis-git yahoo.es> writes:

On Tuesday, 2 July 2019 at 04:34:42 UTC, Jonathan M Davis wrote:
 a character like ±

A good example of a character that should not be allowed in 
identifiers, because it has a meaning of operator (and in general 
in theory we may want to reserve it for such future use).

ISO or Unicode define which characters (not all of them) are 
letters or alphanumeric:

https://dlang.org/spec/lex.html#identifiers

https://docs.microsoft.com/en-us/dotnet/api/system.char.isletter#remarks

Jul 03 2019

Bert <Bert gmail.com> writes:

On Wednesday, 3 July 2019 at 23:21:19 UTC, XavierAP wrote:
 On Tuesday, 2 July 2019 at 04:34:42 UTC, Jonathan M Davis wrote:
 a character like ±

 A good example of a character that should not be allowed in 
 identifiers, because it has a meaning of operator (and in 
 general in theory we may want to reserve it for such future 
 use).

 ISO or Unicode define which characters (not all of them) are 
 letters or alphanumeric:

 https://dlang.org/spec/lex.html#identifiers

 https://docs.microsoft.com/en-us/dotnet/api/system.char.isletter#remarks

Maybe, maybe not. It could be useful in some contexts... probably 
could be more confusing but -, +, ± can be very useful as sub or 
superscripts for special mathematical situations(I've seen it 
used many times, such as representing the even and odd sets of 
things or for lower and raising operations that are encoded in 
symbolic form(such as momentum operations that can be computed by 
multiplication)).

It may not be worth allowing because s_-*s_++3 would be very 
ambiguous... as would s±4+3. Especially if ± is also defined as an 
operator...

But ± should be allowed to be used as an operator as that is the 
most useful case.

4 ± 3

could be a mathematical object containing two values.

a ± b could be a mathematical object containing up to 2·m·n values, 
where a contains m values and b contains n.

(4 ± 3)*(±6) contains 4 values = 42, -42, 6, -6.
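
Since D does not allow new operators, the idea can only be sketched today with named functions plus overloading of an existing operator. In the sketch below, `Values` and `pm` are hypothetical names standing in for the proposed ± operator, and only `*` is overloaded.

```d
import std.stdio : writeln;

/// A multi-valued quantity: every value the expression could take.
struct Values
{
    int[] items;

    /// Overloads the existing `*` operator: all pairwise products.
    Values opBinary(string op : "*")(const Values rhs) const
    {
        int[] result;
        foreach (a; items)
            foreach (b; rhs.items)
                result ~= a * b;
        return Values(result);
    }
}

/// Stand-in for the binary form a ± b.
Values pm(int a, int b)
{
    return Values([a + b, a - b]);
}

/// Stand-in for the unary form ±x.
Values pm(int x)
{
    return Values([x, -x]);
}

void main()
{
    auto product = pm(4, 3) * pm(6); // (4 ± 3) * (±6)
    writeln(product.items);          // [42, -42, 6, -6]
    assert(product.items == [42, -42, 6, -6]);
}
```

Here 4 ± 3 yields the two values 7 and 1, ±6 yields 6 and -6, and the product has 2 × 2 = 4 values.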


So D could go through the unicode list and determine which 
symbols are best suited for operators and which for identifiers 
and then enable their usage. Many symbols that are not 
appropriate for id's would be appropriate for operators: ▌╚█

These are ugly in some sense but they could have good meaning in 
relation to operations. █ could mean boxing: █a means box a.

But they could also be useful for Id's...  █ could mean rectangle.

Symbols are arbitrary. We know millions of symbols. Our brain has 
no issues decoding them after we learn the meaning. The only 
problem is that it's nice to have consistency so we don't have to 
learn many different purposes for the same symbol(but we already 
do, it's not a huge deal, it does slow us down a little but  
usually context is clear).

I think having it more open ended is better. It might require 
people exercising their neurons a little bit but it is a good thing 
in the long run. Obviously people could make it very difficult by 
making code very terse but I doubt that would happen much. People 
don't code in D to make their life more difficult, they do it to 
make it less. Virtually everyone will choose the symbols in a 
logical way that will make sense.



What could be done is that any unicode character in an id could 
have some ascii equivalent.

someÆx is also

some::432::x

or whatever. If a good symbol could be found instead of ::. Then 
IDE's could learn to support the syntax and convert between them. 
A simple hotkey could work between the two and code pages could 
be flipped to change the keyboard. a pragma(codepage, 43) could 
inform the IDE to use a codepage. These might have issues but 
without trying different things the optimal solution can't be 
found.

Jul 04 2019
