digitalmars.D - Why not extend D to allow unicode in ID's?

Bert <Bert gmail.com> writes:
It would greatly expand the coverage.

It would be nice to use certain characters that are truly 
meaningful.

In fact, it would be nice for ops too.

░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼
∩ε≡φ±≥≤⌠⌡÷≈∙·√ⁿ²■☺♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼

I realize the excuse is going to be "It makes the code look ugly 
or hard to read", not all editors will support it, etc...

So? Those are lame excuses. People can abuse anything; you can't 
police the world. Stopping all legitimate uses because someone 
might use it illegitimately is ignorant and harmful (which is why 
it is ignorant).

It is best to enable unicode support and then have standards and 
guidelines and let some people shoot themselves in the foot if 
they want... that is the best way to learn not to do it again.

Imagine being able to write proper mathematical formula ID's:

∞
δ
Ω
Θ
Φ
τ
µ
σ
ε
φ

or using valid mathematical operators:

∩
≡
±
≥
≤
÷
≈
∙
√
ⁿ
²

or when you write a card game:

♥
♦
♣
♠


These are much better than the verbose words that we have to use 
now. I know some will say the opposite, but they can say it and 
be wrong. Trying to stop me from shooting myself in the foot when 
I don't own a gun is abusive to me and just like shooting me in 
the foot! I don't write any code that anyone else reads so let me 
make the choice for myself rather than create arbitrary rules 
that limit expressiveness. There is a reason the first amendment 
exists in the US, because the founders knew what limitations of 
expression would do. The same applies to all things.

Maybe one could use a switch to enable such a language with a 
compiler warning about such use. Maybe we can have a special D 
code page for useful symbols so there is a standard code for each 
that one could properly map using their editor of choice?

For example, we could have each symbol map to a long name that 
one could use to replace the source:

♥ = Symbol_Heart_0x2665    // or even just __Symbol__0x2665
♦ = Symbol_Diamond_0x2666
♣ = Symbol_Club_0x2663
♠ = Symbol_Spade_0x2660

And one could then change any source code between the symbolic 
form and the verbose form using a command line utility.

E.g.,

int ♥ = 3;

Can be converted to

int Symbol_Heart_0x2665 = 3;

and back without issue (99.9999999999% of code).
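
A minimal sketch of what such a utility could do, assuming a fixed 
table and plain text replacement. The Entry struct, the table and 
both function names are made up for illustration, and the long names 
use the actual Unicode code points (♥ is U+2665, ♦ is U+2666, ♣ is 
U+2663, ♠ is U+2660):

```d
import std.array : replace;
import std.stdio : writeln;

// Hypothetical mapping between a symbol and its verbose identifier.
struct Entry
{
    string symbol;
    string name;
}

// Illustrative table only; the names are assumptions, not a standard.
immutable Entry[] table = [
    Entry("♥", "Symbol_Heart_0x2665"),
    Entry("♦", "Symbol_Diamond_0x2666"),
    Entry("♣", "Symbol_Club_0x2663"),
    Entry("♠", "Symbol_Spade_0x2660"),
];

// Symbolic form -> verbose form (plain text replacement, no lexing).
string toVerbose(string src)
{
    foreach (e; table)
        src = src.replace(e.symbol, e.name);
    return src;
}

// Verbose form -> symbolic form.
string toSymbolic(string src)
{
    foreach (e; table)
        src = src.replace(e.name, e.symbol);
    return src;
}

void main()
{
    immutable original = "int ♥ = 3;";
    immutable verbose = toVerbose(original);
    writeln(verbose); // int Symbol_Heart_0x2665 = 3;
    assert(toSymbolic(verbose) == original);
}
```

Because this works on raw text, it would also rewrite symbols inside 
string literals and comments; a real tool would probably need to lex 
the source first.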

This could cause issues with metaprogramming when comparing 
strings of identifier names, but this is a minor issue. In fact, 
internally D could just use the long symbol names and require the 
programmer to use them.

E.g.,

// invalid if id gets converted to long name internally.
static if (id == "♥")
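
For reference, a small sketch of the kind of name comparison this 
refers to, using an identifier D already accepts (φ) rather than ♥. 
The variable is only an example; __traits(identifier, ...) is the 
existing D feature that yields a symbol's name as a string:

```d
void main()
{
    double φ = 1.618;

    // The identifier's name is available at compile time as a string,
    // so code can compare against it. If the compiler stored a long
    // name internally instead, this comparison would change meaning.
    static assert(__traits(identifier, φ) == "φ");
}
```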


symbol to its long name

There are solutions to the problems... let's work on finding one 
to make D better.
Jun 29 2019
sarn <sarn theartofmachinery.com> writes:
D already allows non-latin characters in identifiers, just not 
arbitrary symbols:

import std.stdio;

void main()
{
	double φ = 1.61803398874989484820;
	writeln(φ);
}
Jun 29 2019
Bert <Bert gmail.com> writes:
On Sunday, 30 June 2019 at 03:12:55 UTC, sarn wrote:
 D already allows non-latin characters in identifiers, just not 
 arbitrary symbols:

 import std.stdio;

 void main()
 {
 	double φ = 1.61803398874989484820;
 	writeln(φ);
 }
Yeah, I noticed some work, but many do not, and I'm not even sure 
which do and which don't ;/
Jun 30 2019
Dennis <dkorpel gmail.com> writes:
On Saturday, 29 June 2019 at 22:38:06 UTC, Bert wrote:
 Imagine being able to write proper mathematical formula ID's:
Currently D allows "universal alphas" in identifiers, so Greek 
letters are allowed already. See:
https://dlang.org/spec/lex.html#identifiers
 or using valid mathematical operators:
In D you can't add operators, but if you want math notation on 
existing ones, you might be interested in fonts with programming 
ligatures such as:
https://github.com/tonsky/FiraCode
 or when you write a card game:
Custom literals can be added with templates, for example:

octal!377
(https://github.com/dlang/phobos/blob/d57be4690fc923a1974a4ef4d8b84a951131d219/std/conv.d#L4062)
tok!"if"
(https://github.com/dlang-community/libdparse/blob/5270739bcd1962418784c7760773e24d28b6009b/src/dparse/lexer.d#L115)

Since in strings any Unicode is allowed, you can do something 
similar:

suit!"♥"
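
A minimal sketch of how such a suit template could look. The Suit 
enum and the template body are assumptions for illustration, not 
code from Phobos or libdparse:

```d
import std.stdio : writeln;

// Hypothetical enum for the four suits.
enum Suit { hearts, diamonds, clubs, spades }

// Maps a string "literal" to a Suit at compile time.
template suit(string symbol)
{
    static if (symbol == "♥")
        enum suit = Suit.hearts;
    else static if (symbol == "♦")
        enum suit = Suit.diamonds;
    else static if (symbol == "♣")
        enum suit = Suit.clubs;
    else static if (symbol == "♠")
        enum suit = Suit.spades;
    else
        static assert(false, "unknown suit symbol: " ~ symbol);
}

void main()
{
    Suit trump = suit!"♥";
    static assert(suit!"♠" == Suit.spades);
    writeln(trump); // prints "hearts"
}
```

An unknown symbol such as suit!"x" is rejected at compile time by the 
final static assert.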
 I don't write any code that anyone else reads so let me make 
 the choice for myself rather than create arbitrary rules that 
 limit expressiveness.
If it's only for yourself, you can add a build step that substitutes 
your custom symbols with valid identifiers before compiling. Or use 
your own fork of the compiler; you probably only need to remove this 
line:
https://github.com/dlang/dmd/blob/2599559d624275bfcff298b3a8b31f9d82ae534f/src/dmd/lexer.d#L524

Finally, if you truly long for ultimate freedom in how you write 
code, then Nim might be the right language for you, since it aligns 
more with your "putting full trust in the programmer" view than D. In 
Nim, any non-ASCII character is valid for identifiers, so even 
invalid Unicode characters are allowed.
https://nim-lang.org/docs/manual.html#lexical-analysis-identifiers-amp-keywords
Jun 30 2019
Bert <Bert gmail.com> writes:
On Sunday, 30 June 2019 at 10:10:41 UTC, Dennis wrote:
 If it's only for yourself, you can add a build step that 
 substitutes your custom symbols with valid identifiers before 
 compiling. Or use your own fork of the compiler, you probably only 
 need to remove this line:
 https://github.com/dlang/dmd/blob/2599559d624275bfcff298b3a8b31f9d82ae534f/src/dmd/lexer.d#L524
Thanks. I guess I could create a small routine that patches the 
binary to reverse the if check. This would be easiest to maintain, as 
I wouldn't have to recompile dmd every release, just install the new 
one and patch.
 Finally, if you truly long for ultimate freedom in how you 
 write code, then Nim might be the right language for you since 
 it aligns more with your "putting full trust in the programmer" 
 view than D. In Nim, any non-ascii character is valid for 
 identifiers, so even invalid Unicode characters are allowed.
 https://nim-lang.org/docs/manual.html#lexical-analysis-identifiers-amp-keywords
I've heard of Nim but never really looked into it much... but every 
time I hear about it I am more and more enticed. It seems well put 
together, but the syntax is a little off-putting. I'm sure I could 
get used to it.

I have a few questions:

1. There doesn't seem to be good IDE support. I mainly use Visual 
Studio and I see a Nim for VSC, which I don't use ;/ Is there any 
really good IDE support?

2. How does metaprogramming in Nim compare to D's? The main reason I 
use D is its metaprogramming.

3. Nim seems to have somewhat of a strong categorical and functional 
foundation. Is it more like Haskell than D? (In the sense of catering 
to strongly structured programming (functors, natural 
transformations, etc.))

I'll try to read over the manual. Maybe my next program will be in 
Nim.
Jun 30 2019
Dennis <dkorpel gmail.com> writes:
On Sunday, 30 June 2019 at 23:27:56 UTC, Bert wrote:
 I have a few questions:

 1. There doesn't seem to be good IDE support. I mainly use 
 Visual Studio and I see a nim for VSC which I don't use ;/  Is 
 there any really good IDE support?
I don't have much Nim experience myself, so maybe you should ask on the Nim forum.
 2. How does metaprogramming in Nim compare to D's? The main 
 reason I use D is its metaprogramming.
It also has static if ('when'), CTFE, type reflection ('typedesc') 
and templates. In addition, it has AST macros, which D will not have. 
(You can find long-past discussions of why, or Google 'The Lisp 
Curse' for something related.)
 3. Nim seems to have somewhat of a strong categorical and 
 functional foundation. Is it more like Haskell than D? (In the 
 sense of catering to strongly structured programming (functors, 
 natural transformations, etc.))
Both are systems programming languages that support mutation, loops 
and pointers, so you can write C-style procedural code in either 
language. Whether Nim's higher-level constructs are similar to 
Haskell is something I cannot judge.
Jul 01 2019
Martin Krejcirik <mk-junk i-line.cz> writes:
On Saturday, 29 June 2019 at 22:38:06 UTC, Bert wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly 
 meaningful.
I think source code should be easily editable by anyone using a 
keyboard and a plain editor. Extended characters only complicate 
things.
Jul 01 2019
Bert <Bert gmail.com> writes:
On Monday, 1 July 2019 at 17:14:08 UTC, Martin Krejcirik wrote:
 I think source code should be easily editable by anyone using a 
 keyboard and a plain editor. Extended characters only complicate 
 things.
It's time to grow up? How can progress be made if we don't progress. 
99% of all modern text editors support UTF-8... with your logic we 
could say that ascii characters only complicate things, why not just 
force everyone to code in binary? That would be the simplest thing to 
do, right?

What you are telling me is that you want to force me to use your view 
but you don't want me to force you to use mine.

What you are actually doing is assuming it would be a problem without 
actually knowing or having any evidence it would be. You should 
ponder that a little.
Jul 01 2019
Martin Krejcirik <mk-junk i-line.cz> writes:
On Monday, 1 July 2019 at 23:52:25 UTC, Bert wrote:
 It's time to grow up? How can progress be made if we don't 
 progress. 99% of all modern text editors support UTF-8... with 
 your logic we could say that ascii characters only complicate
Editors maybe, but what about keyboards? Can you easily write my 
name (Krejčiřík) without copy and paste or a character selector 
tool?
Jul 02 2019
Timon Gehr <timon.gehr gmx.ch> writes:
On 02.07.19 11:10, Martin Krejcirik wrote:
 On Monday, 1 July 2019 at 23:52:25 UTC, Bert wrote:
 It's time to grow up? How can progress be made if we don't progress. 
 99% of all modern text editors support UTF-8... with your logic we 
 could say that ascii characters only complicate

 Editors maybe, but what about keyboards? Can you easily write my 
 name (Krejčiřík) without copy and paste or a character selector 
 tool?
Of course. I can easily write this even on a US keyboard. My editor 
is set up to translate Krej\vci\vr\'ik to Krejčiřík as I type.

This is not a hard problem.
Jul 02 2019
XavierAP <n3minis-git yahoo.es> writes:
On Tuesday, 2 July 2019 at 18:28:06 UTC, Timon Gehr wrote:
 Of course. I can easily write this even on a US keyboard. My 
 editor is set up to translate Krej\vci\vr\'ik to Krejčiřík as I 
 type.

 This is not a hard problem.
auto Krejčiřík = 0;
static assert(is(typeof(Krejčiřík)));

Already supported :)
Jul 03 2019
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, June 29, 2019 4:38:06 PM MDT Bert via Digitalmars-d wrote:
 It would greatly expand the coverage.

 It would be nice to use certain characters that are truly
 meaningful.
...

Like most major languages, D supports identifiers with alphanumeric 
characters plus underscore, with the first character not being 
allowed to be numeric. However, unlike most languages, it expands 
that to include Unicode alpha characters, meaning that quite a lot of 
Unicode is supported in identifiers. So, it already goes far beyond 
what most languages do.

That being said, I think that you'll find that most folks will not be 
in favor of using Unicode in identifiers outside of code intended for 
people of a specific language who actually use those characters 
normally (e.g. Japanese characters when all of the programmers 
involved read and write Japanese and have keyboards that support 
it). The fact that a character is not a key on a typical keyboard 
means that anyone using an identifier with that character in it will 
almost certainly have to copy-paste it, and that's really not going 
to go over well with most people.

If you really feel strongly about the matter, you can always create a 
DIP to propose a language change to allow more Unicode characters in 
identifiers, but I would not expect it to be accepted.

- Jonathan M Davis
Jul 01 2019
rikki cattermole <rikki cattermole.co.nz> writes:
On 02/07/2019 2:17 PM, Jonathan M Davis wrote:
 If you really feel strongly about the matter, you can always create 
 a DIP to propose a language change to allow more Unicode characters 
 in identifiers, but I would not expect it to be accepted.
No DIP is required. The lexer just needs updating to match the 
(current) Unicode spec.
https://github.com/dlang/dmd/blob/master/src/dmd/lexer.d#L1082
Jul 01 2019
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, July 1, 2019 8:56:55 PM MDT rikki cattermole via Digitalmars-d 
wrote:
 No DIP is required. The lexer just needs updating to match the 
 (current) Unicode spec.
 https://github.com/dlang/dmd/blob/master/src/dmd/lexer.d#L1082
If a character is a Unicode alpha character, then yes. However, if 
it's not, then that would definitely be a language change and would 
require a DIP. The spec is quite specific about it requiring Unicode 
alpha characters, and the code does the same. Without looking at the 
Unicode spec, I have no clue which characters are alpha characters, 
but I'd be extremely surprised if a character like ± or ♥ qualified.

- Jonathan M Davis
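
One rough way to check this without reading the Unicode spec is to 
ask Phobos. This is only a sketch: std.uni.isAlpha tests the Unicode 
Alphabetic property, which is used here as a stand-in and is an 
assumption, since it is not the table the DMD lexer itself consults, 
so the two can disagree on some characters.

```d
import std.stdio : writefln;
import std.uni : isAlpha;

void main()
{
    // Characters mentioned in this thread.
    foreach (dchar c; "φč±♥"d)
    {
        writefln("U+%04X %s: %s", cast(uint) c, c,
                 isAlpha(c) ? "alphabetic" : "not alphabetic");
    }
}
```

Letters such as φ and č are reported as alphabetic, while ± and ♥ 
are not.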
Jul 01 2019
XavierAP <n3minis-git yahoo.es> writes:
On Tuesday, 2 July 2019 at 04:34:42 UTC, Jonathan M Davis wrote:
 a character like ±
A good example of a character that should not be allowed in 
identifiers, because it has the meaning of an operator (and in 
general, in theory, we may want to reserve it for such future use).

ISO or Unicode defines which characters (not all of them) are letters 
or alphanumeric:

https://dlang.org/spec/lex.html#identifiers
https://docs.microsoft.com/en-us/dotnet/api/system.char.isletter#remarks
Jul 03 2019
Bert <Bert gmail.com> writes:
On Wednesday, 3 July 2019 at 23:21:19 UTC, XavierAP wrote:
 On Tuesday, 2 July 2019 at 04:34:42 UTC, Jonathan M Davis wrote:
 a character like ±
 A good example of a character that should not be allowed in 
 identifiers, because it has the meaning of an operator (and in 
 general, in theory, we may want to reserve it for such future use).

 ISO or Unicode defines which characters (not all of them) are 
 letters or alphanumeric:

 https://dlang.org/spec/lex.html#identifiers
 https://docs.microsoft.com/en-us/dotnet/api/system.char.isletter#remarks
Maybe, maybe not. It could be useful in some contexts... it could 
probably be more confusing, but -, +, ± can be very useful as sub- or 
superscripts for special mathematical situations (I've seen it used 
many times, such as representing the even and odd sets of things, or 
for lowering and raising operations that are encoded in symbolic form 
(such as momentum operations that can be computed by 
multiplication)).

It may not be worth allowing, because s_-*s_++3 would be very 
ambiguous... as would s±4+3, especially if ± is also defined as an 
operator...

But ± should be allowed to be used as an operator, as that is the 
most useful case. 4 ± 3 could be a mathematical object containing two 
values. a ± b could be a mathematical object containing up to 2mn 
values, where m and n are the numbers of values that a and b contain. 
(4 ± 3)*(±6) contains 4 values: 42, -42, 6, -6.

So D could go through the Unicode list and determine which symbols 
are best suited for operators and which for identifiers, and then 
enable their usage. Many symbols that are not appropriate for IDs 
would be appropriate for operators:

▌╚█

These are ugly in some sense, but they could have good meaning in 
relation to operations. █ could mean boxing: █a means box a. But they 
could also be useful for IDs... █ could mean rectangle.

Symbols are arbitrary. We know millions of symbols. Our brain has no 
issues decoding them after we learn the meaning. The only problem is 
that it's nice to have consistency so we don't have to learn many 
different purposes for the same symbol (but we already do; it's not a 
huge deal. It does slow us down a little, but usually context is 
clear).

I think having it more open-ended is better. It might require people 
exercising their neurons a little bit, but it is a good thing in the 
long run. Obviously people could make it very difficult by making 
code very terse, but I doubt that would happen much. People don't 
code in D to make their life more difficult, they do it to make it 
less so. Virtually everyone will choose the symbols in a logical way 
that will make sense.

What could be done is that any Unicode character in an ID could have 
some ASCII equivalent: someÆx is also some::432::x or whatever, if a 
good symbol could be found instead of ::. Then IDEs could learn to 
support the syntax and convert between them. A simple hotkey could 
switch between the two, and code pages could be flipped to change the 
keyboard. A pragma(codepage, 43) could inform the IDE to use a code 
page.

These might have issues, but without trying different things the 
optimal solution can't be found.
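
To make the ± arithmetic above concrete, here is a rough sketch in 
today's D, where ± cannot be an operator. The functions plusMinus and 
times are made-up names standing in for the hypothetical ± and * on 
multi-valued objects, which are modelled simply as arrays: each 
pairing of a value from a with a value from b contributes one sum and 
one difference.

```d
import std.stdio : writeln;

// Hypothetical stand-in for "a ± b": every x + y and x - y.
int[] plusMinus(const int[] a, const int[] b)
{
    int[] result;
    foreach (x; a)
        foreach (y; b)
        {
            result ~= x + y;
            result ~= x - y;
        }
    return result;
}

// Hypothetical stand-in for "*" on multi-valued objects:
// every pairwise product.
int[] times(const int[] a, const int[] b)
{
    int[] result;
    foreach (x; a)
        foreach (y; b)
            result ~= x * y;
    return result;
}

void main()
{
    auto fourPmThree = plusMinus([4], [3]); // 4 ± 3 -> [7, 1]
    auto pmSix = plusMinus([0], [6]);       // ±6    -> [6, -6]
    writeln(times(fourPmThree, pmSix));     // [42, -42, 6, -6]
}
```

Duplicates are not removed in this sketch, so an expression like 
0 ± 0 would list the same value twice.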
Jul 04 2019