digitalmars.D - Updating D beyond Unicode 2.0
- Neia Neutuladh (34/34) Sep 21 2018 D's currently accepted identifier characters are based on Unicode
- Walter Bright (12/12) Sep 21 2018 When I originally started with D, I thought non-ASCII identifiers with U...
- Adam D. Ruppe (13/18) Sep 21 2018 Do you look at Japanese D code much? Or Turkish? Or Chinese?
- Ali Çehreli (20/22) Sep 23 2018 Fully agreed but as far as I know, Turkish companies use English in
- Kagamin (3/4) Sep 23 2018 You even contributed to
- Neia Neutuladh (13/19) Sep 21 2018 ...you *do* know that not every codebase has people working on it
- Thomas Mader (25/27) Sep 22 2018 This topic boils down to diversity vs. productivity.
- Steven Schveighoffer (7/31) Sep 22 2018 But aren't we arguing about the wrong thing here? D already accepts
- Neia Neutuladh (19/24) Sep 22 2018 Walter was doing that thing that people in the US who only speak
- Erik van Velzen (11/18) Sep 22 2018 On Saturday, 22 September 2018 at 16:56:10 UTC, Neia Neutuladh
- Neia Neutuladh (3/5) Sep 22 2018 I did. https://git.ikeran.org/dhasenan/muzikilo
- Adam D. Ruppe (14/16) Sep 22 2018 This is the obvious observation bias I alluded to before: of
- sarn (5/7) Sep 22 2018 You can find a lot more Japanese D code on this blogging platform:
- Shachar Shemesh (4/13) Sep 22 2018 Comments in Japanese. Identifiers in English. Not advancing your point,
- sarn (4/14) Sep 23 2018 Well, I knew that when I posted, so I honestly have no idea what
- Shachar Shemesh (10/25) Sep 23 2018 I don't know what point you were trying to make. That's precisely why I
- aliak (3/6) Sep 23 2018 https://forum.dlang.org/post/piwvbtetcwyxlalocxkw@forum.dlang.org
- Steven Schveighoffer (37/62) Sep 24 2018 I don't think he was doing that. I think what he was saying was, D tried...
- Joakim (17/33) Sep 21 2018 To wit, Windows linker error with Unicode symbol:
- Neia Neutuladh (4/10) Sep 21 2018 The compiler doesn't have to do much with Unicode processing,
- Jonathan M Davis (18/28) Sep 22 2018 Unicode identifiers may make sense in a code base that is going to be us...
- Shachar Shemesh (5/14) Sep 22 2018 Thank Allah that someone said it before I had to. I could not agree
- Thomas Mader (8/11) Sep 22 2018 The goal of Unicode is to support diversity, if you argue against
- Jonathan M Davis (9/20) Sep 22 2018 Unicode is supposed to be a universal way of representing every characte...
- Shachar Shemesh (6/11) Sep 22 2018 To be fair to them, that word is part of the "Arabic-representation
- Thomas Mader (17/24) Sep 22 2018 At least since the incorporation of Emojis it's not supposed to
- Shachar Shemesh (8/11) Sep 22 2018 If memory serves me right, hieroglyphs actually represent consonants
- Neia Neutuladh (21/26) Sep 22 2018 Egyptian hieroglyphics uses logographs (symbols representing
- Ali Çehreli (5/8) Sep 23 2018 I had the misconception of each Chinese character meaning a word until I...
- Steven Schveighoffer (4/14) Sep 22 2018 But aren't some (many?) Chinese/Japanese characters representing whole
- Jonathan M Davis (13/27) Sep 22 2018 It's true that they're not characters in the sense that Roman characters...
- Steven Schveighoffer (7/34) Sep 24 2018 But there are tons of emojis that have nothing to do with sequences of
- sarn (6/9) Sep 22 2018 Kind of hair-splitting, but it's more accurate to say that some
- Neia Neutuladh (11/17) Sep 22 2018 You have a problem when you need to share a codebase between two
- Jonathan M Davis (32/49) Sep 22 2018 My point is that if your code base is definitely only going to be used
- Walter Bright (16/18) Sep 23 2018 In the earlier days of D, I put on the web pages a google widget what wo...
- Neia Neutuladh (4/9) Sep 23 2018 Okay, that's why you previously selected C99 as the standard for
- Walter Bright (2/5) Sep 23 2018 I wasn't aware it changed in C11.
- Neia Neutuladh (9/14) Sep 23 2018 http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page
- Steven Schveighoffer (41/59) Sep 24 2018 I searched around for the current state of symbol names in C, and found
- Adam D. Ruppe (17/25) Sep 24 2018 Eh, those are kinda opaque sequences anyway, since the meanings
- Steven Schveighoffer (7/16) Sep 24 2018 Well, even on top of that, the standard library is full of English words...
- Martin Tschierschke (7/26) Sep 24 2018 You might get really funny error messages.
- Steven Schveighoffer (5/29) Sep 24 2018 Haha, it could be cynical as well
- Patrick Schluter (7/12) Sep 24 2018 Indeed. IBM mainframes have C compilers too but not ASCII. They
- Steven Schveighoffer (11/23) Sep 24 2018 Right. But it's just a side-note -- I'd guess all modern compilers
- Dennis (10/15) Sep 23 2018 I always thought D supported Unicode with the goal of going
- Walter Bright (3/4) Sep 23 2018 D the language is well suited to the development of Unicode apps. D sour...
- Dennis (10/12) Sep 24 2018 But in the article you specifically talk about the use of Unicode
- Jonathan M Davis (7/19) Sep 24 2018 Given that the typical keyboard has none of those characters, maintainin...
- Dennis (9/16) Sep 24 2018 Note that I'm not trying to argue either way, it's just that I
- Adam D. Ruppe (7/9) Sep 24 2018 It is pretty easy to type them with a little keyboard config
- Abdulhaq (30/38) Sep 23 2018 According to the Unicode website,
- Walter Bright (6/8) Sep 25 2018 Small character sets are much more implementable on primitive systems li...
- aliak (11/27) Sep 23 2018 Not seeing identifiers in languages you don't program in or can
- Walter Bright (21/23) Sep 23 2018 On the other hand, I've been programming for 40 years. I've customized m...
- 0xEAB (30/33) Sep 24 2018 I'm a native German speaker.
- 0xEAB (2/4) Sep 24 2018 addendum: I've been using the English version since VS2017
- Ali Çehreli (6/7) Sep 25 2018 This is something I had heard from a Digital Research programmer in
- Simen Kjærås (13/19) Sep 25 2018 My ex-girlfriend tried to learn SQL from a book that had gotten a
- Patrick Schluter (4/10) Sep 25 2018 The K&R in German was of the same "quality". That happens when
- ShadoLight (13/18) Sep 26 2018 [snip]
- abcde1234 (5/26) Sep 26 2018 In case you missed it, this was well spread in the tech news
- Ali Çehreli (5/5) Sep 26 2018 A delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it...
- Jonathan M Davis (4/8) Sep 26 2018 Was it any good? ;)
- Andrea Fontana (4/9) Sep 27 2018 You can't even imagine how many italian words and recipes are
- Paolo Invernizzi (3/14) Sep 27 2018 +1 :-P
- Andrea Fontana (9/15) Sep 26 2018 Yes please. Keep them in english.
- Jonathan M Davis (13/16) Sep 26 2018 It reminds me of one of the reasons that Bryan Cantrill thinks that many
- Erik van Velzen (7/7) Sep 21 2018 Agreed with Walter.
- Seb (11/18) Sep 21 2018 A: Wait. Using emojis as identifiers is not a good idea?
- Neia Neutuladh (9/16) Sep 21 2018 The C11 spec says that emoji should be allowed in identifiers
- rikki cattermole (3/7) Sep 21 2018 This can be strongly mitigated by using a compose key. But they are not
- Kagamin (5/9) Sep 23 2018 It's not like we have a lot of good fonts (I know only one), and
- FeepingCreature (7/12) Sep 25 2018 I just want to chime in that I've definitely used greek letters
- Dukc (15/24) Sep 25 2018 When I make code that I expect to be only used around here, I
- Shachar Shemesh (11/15) Sep 25 2018 This sounded like a very compelling example, until I gave it a second
- Dukc (9/14) Sep 26 2018 How so?
- Shachar Shemesh (12/28) Sep 26 2018 Sure you can. It's just very poor design.
- Dukc (10/15) Sep 26 2018 Two years ago, I took part in implementing a commercial game. It
- Steven Schveighoffer (7/24) Sep 26 2018 Hm... I could see actually some "clever" use of opDispatch being used to...
- Walter Bright (3/5) Sep 26 2018 Also, there are usually common ASCII versions of city names, such as Col...
- Jacob Carlborg (11/54) Sep 25 2018 I'm not a native English speaker but I write all my public and private
- rjframe (11/19) Sep 26 2018 I just want to point out since this thread is still living that there ha...
- Steven Schveighoffer (20/44) Sep 26 2018 This is a non-starter. We can't break people's code, especially for
- Walter Bright (10/13) Sep 26 2018 We're not going to remove it, because there's not much to gain from it.
- Adam D. Ruppe (3/6) Sep 26 2018 http://ddili.org/ders/d/
- Steven Schveighoffer (16/19) Sep 26 2018 It may be the weight is already there in the form of unicode symbol
- Neia Neutuladh (6/8) Sep 26 2018 Yes, a lot of languages that don't use the Latin alphabet have standard
- aliak (37/55) Sep 27 2018 It's not that they don't know English. It's that non-English
- Shachar Shemesh (5/11) Sep 27 2018 I'm sorry I keep bringing this up, but context is really important here.
- aliak (6/20) Sep 27 2018 The point was that being able to use non-English in code is
- Shachar Shemesh (10/14) Sep 27 2018 If you wish to make a point about something irrelevant to the
- aliak (9/25) Sep 27 2018 English doesn't mean ascii. You can write non-English in ascii,
- sarn (4/33) Sep 27 2018 Shachar seems to be aiming for an internet high score by shooting
- Dukc (3/7) Sep 28 2018 I believe you're being too harsh. It's easy to miss a part of a
- sarn (14/15) Sep 28 2018 That's very true, and it's always good to give people the benefit
- Shachar Shemesh (6/14) Sep 28 2018 A minor correction: Aliak is not accusing me of missing a part of the
- Dukc (4/7) Sep 29 2018 I know you meant Sarn, but still... can you please be a bit less
- Shachar Shemesh (26/34) Sep 29 2018 That is the word used by the article *you* linked to, in reference to
- Shachar Shemesh (4/14) Sep 29 2018 You are 100% correct. My most sincere apologies.
- Walter Bright (3/9) Sep 27 2018 Nobody is suggesting D not support Unicode in strings, comments, and the...
- Walter Bright (3/4) Sep 26 2018 Feel free to write one, but its chances of getting incorporated are remo...
D's currently accepted identifier characters are based on Unicode 2.0:

* ASCII range values are handled specially.
* Letters and combining marks from Unicode 2.0 are accepted.
* Numbers outside the ASCII range are accepted.
* Eight random punctuation marks are accepted.

This follows the C99 standard, as do Python and ECMAScript, just to name a few. A small number of languages reject non-ASCII characters: Dart, Perl. Some languages are weirdly generous: Swift and C11 allow everything outside the Basic Multilingual Plane.

I'd like to update that so that D accepts something as a valid identifier character if it's a letter, combining mark, or modifier symbol that's present in Unicode 11, or a non-ASCII number. This allows the 146 most popular writing systems and a lot more characters from those writing systems. This *would* reject those eight random punctuation marks, so I'll keep them in as legacy characters.

It would mean we don't have to reference the C99 standard when enumerating the allowed characters; we just have to refer to the Unicode standard, which we already need to talk about in the lexical part of the spec.

It might also make the lexer a tiny bit faster; it reduces the number of valid-ident-char segments to search from 245 to 134. On the other hand, it will change the ident char ranges from wchar to dchar, which means the table takes up marginally more memory.

And, of course, it lets you write programs entirely in Linear B, and that's a marketing ploy not to be missed.

I've got this coded up and can submit a PR, but I thought I'd get feedback here first. Does anyone see any horrible potential problems here? Or is there an interestingly better option? Does this need a DIP?
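The proposed rule above ("letter, combining mark, or modifier symbol, or a non-ASCII number") can be sketched against a modern Unicode database. This is a rough Python approximation using the interpreter's bundled `unicodedata` tables, not DMD's actual lexer tables; the exact set of general categories is my reading of the proposal:

```python
import unicodedata

def is_ident_char(c: str) -> bool:
    """Approximate the proposed D identifier-character rule."""
    if ord(c) < 128:
        # ASCII range is handled specially, as in the current lexer.
        return c.isalnum() or c == '_'
    cat = unicodedata.category(c)
    return cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo',   # letters
                   'Mn', 'Mc',                     # combining marks
                   'Sk',                           # modifier symbols
                   'Nd')                           # digits (non-ASCII here)

print(is_ident_char('é'))    # True: Latin letter (Ll)
print(is_ident_char('日'))   # True: CJK letter (Lo)
print(is_ident_char('𐀀'))   # True: Linear B syllable (Lo)
print(is_ident_char('€'))    # False: currency symbol (Sc)
```

Note that the results depend on which Unicode version the host Python ships; the point of pinning the spec to Unicode 11 is to make this answer deterministic.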
Sep 21 2018
When I originally started with D, I thought non-ASCII identifiers with Unicode were a good idea. I've since slowly become less and less enthusiastic about it.

First off, D source text simply must (and does) fully support Unicode in comments, characters, and string literals. That's not an issue.

But identifiers? I've hardly seen any use of non-ASCII identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

Extending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

Absent a much more compelling rationale for it, I'd say no.
Sep 21 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:

> But identifiers? I've hardly seen any use of non-ASCII identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases.

Do you look at Japanese D code much? Or Turkish? Or Chinese?

I know there are decently sized D communities in those languages, and I am pretty sure I have seen identifiers in their languages before, but I can't find it right now. Just there's a pretty clear potential for observation bias here. Even our search engine queries are going to be biased toward English-language results, so there can be a whole D world kinda invisible to you and I. We should reach out and get solid stats before making a final decision.

> most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

Well, for example, with a Chinese company, they may very well find forced English identifiers to be an annoyance.
Sep 21 2018
On 09/21/2018 04:18 PM, Adam D. Ruppe wrote:

> Well, for example, with a Chinese company, they may very well find forced English identifiers to be an annoyance.

Fully agreed, but as far as I know, Turkish companies use English in source code. The Turkish alphabet is Latin-based, where dotted and undotted versions of Latin letters are distinct and produce different meanings. Quick examples:

sık: dense (n), squeeze (v), ...
sik: penis (n), f*ck (v) [1]
şık: one of multiple choices (1), swanky (2)
döndür: return
dondur: make frozen
sök: disassemble, dismantle, ...
sok: insert, install, ...
şok: shock

Hence, non-Unicode is unacceptable in Turkish code unless we reserve programming to English speakers only, which is unacceptable because it would be exclusionary and would produce English identifiers that are frequently amusing. I've seen the latter in code of English learners. :)

Ali

[1] https://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
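Ali's examples can be demonstrated concretely: the dotless ı (U+0131) and dotted i (U+0069) are distinct code points, and any naive "fold to ASCII" of Turkish identifiers collapses distinct words. A small Python sketch (the folding table is illustrative, not a real transliteration standard):

```python
# A naive ASCII-folding table, as one might apply to force Turkish
# identifiers into ASCII. Mapping is illustrative and not exhaustive.
fold = str.maketrans({'ı': 'i', 'ş': 's', 'ö': 'o',
                      'ü': 'u', 'ç': 'c', 'ğ': 'g'})

print('sık'.translate(fold))     # 'sik' -- now spells a very different word
print('şık'.translate(fold))     # 'sik' -- and collides with it, too
print('döndür'.translate(fold))  # 'dondur' -- "return" becomes "make frozen"
```

Three unrelated words end up sharing one ASCII spelling, which is exactly the ambiguity (and embarrassment) the post describes.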
Sep 23 2018
On Sunday, 23 September 2018 at 11:18:42 UTC, Ali Çehreli wrote:

> Hence, non-Unicode is unacceptable in Turkish code

You even contributed to http://code.google.com/p/trileri/source/browse/trunk/tr/yazi.d
Sep 23 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:

> But identifiers? I've hardly seen any use of non-ASCII identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

...you *do* know that not every codebase has people working on it who only know English, right?

If I took a software development job in China, I'd need to learn Chinese. I'd expect the codebase to be in Chinese. Because a Chinese company generally operates in Chinese, and they're likely to have a lot of employees who only speak Chinese. And no, you can't just transcribe Chinese into ASCII. Same for Spanish, Norwegian, German, Polish, Russian -- heck, it's almost easier to list out the languages you *don't* need non-ASCII characters for.

Anyway, here's some more D code using non-ASCII identifiers, in case you need examples: https://git.ikeran.org/dhasenan/muzikilo
Sep 21 2018
On Saturday, 22 September 2018 at 01:08:26 UTC, Neia Neutuladh wrote:

> ...you *do* know that not every codebase has people working on it who only know English, right?

This topic boils down to diversity vs. productivity. Supporting diversity in this case is questionable.

I work in a German-speaking company and we have no developers who are not speaking German for now. In fact, all are native speakers. Still we write our code, comments and commit messages in English. Even at university you learn that you should use English to code. The reasoning is simple: you never know who will work on your code in the future. If a company writes code in Chinese, they will have a hard time expanding the development of their codebase, even though Chinese is spoken by that many people.

So even though you could use all sorts of characters, in a productive environment you better choose not to do so. You might end up shooting yourself in the foot in the long run.

Diversity is important in other areas but I don't see much advantage here. At least for now, because the spoken languages of today don't differ tremendously in what they are capable of expressing. This is also true for today's programming languages. Most of them are just different syntax for the very same ideas and concepts. That's not very helpful to bring people together and advance. My understanding is that even life with its great diversity just has one language (DNA) to define it.
Sep 22 2018
On 9/21/18 9:08 PM, Neia Neutuladh wrote:

> On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
>> But identifiers? I've hardly seen any use of non-ASCII identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.
>
> ...you *do* know that not every codebase has people working on it who only know English, right?
> [...]
> Anyway, here's some more D code using non-ASCII identifiers, in case you need examples: https://git.ikeran.org/dhasenan/muzikilo

But aren't we arguing about the wrong thing here? D already accepts non-ASCII identifiers. What languages need an upgrade to unicode symbol names? In other words, what symbols aren't possible with the current support?

Or maybe I'm misunderstanding something.

-Steve
Sep 22 2018
On Saturday, 22 September 2018 at 12:35:27 UTC, Steven Schveighoffer wrote:

> But aren't we arguing about the wrong thing here? D already accepts non-ASCII identifiers.

Walter was doing that thing that people in the US who only speak English tend to do: forgetting that other people speak other languages, and that people who speak English can learn other languages to work with people who don't speak English. He was saying it's inevitably a mistake to use non-ASCII characters in identifiers and that nobody does use them in practice.

Walter talking like that sounds like he'd like to remove support for non-ASCII identifiers from the language. I've gotten by without maintaining a set of personal patches on top of DMD so far, and I'd like it if I didn't have to start.

> What languages need an upgrade to unicode symbol names? In other words, what symbols aren't possible with the current support?

Chinese and Japanese have gained about eleven thousand symbols since Unicode 2. Unicode 2 covers 25 writing systems, while Unicode 11 covers 146. Just updating to Unicode 3 would give us Cherokee, Ge'ez (multiple languages), Khmer (Cambodian), Mongolian, Burmese, Sinhala (Sri Lanka), Thaana (Maldivian), Canadian aboriginal syllabics, and Yi (Nuosu).
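For comparison, languages whose identifier rules track a current Unicode database already accept those Unicode 3.0 scripts. Python's `str.isidentifier()` follows the XID_Start/XID_Continue properties of its bundled Unicode tables, so this shows what a post-2.0 table buys (sample words are mine, chosen only to exercise each script):

```python
# Scripts added in Unicode 3.0, per the post above. All are valid
# identifiers under a modern XID_Start/XID_Continue table.
samples = {
    'Cherokee':  'Ꭰ',         # U+13A0
    'Mongolian': 'ᠮᠣᠩᠭᠣᠯ',
    'Sinhala':   'අකුර',      # includes a combining vowel sign (XID_Continue)
}
for script, word in samples.items():
    print(script, word.isidentifier())   # True for each
```

Under D's current Unicode 2.0 tables, none of these would lex as identifiers.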
Sep 22 2018
On Saturday, 22 September 2018 at 16:56:10 UTC, Neia Neutuladh wrote:

> Walter was doing that thing that people in the US who only speak English tend to do: forgetting that other people speak other languages, and that people who speak English can learn other languages to work with people who don't speak English. He was saying it's inevitably a mistake to use non-ASCII characters in identifiers and that nobody does use them in practice.

There's a more charitable view, and that's that even furriners usually use English identifiers. Nobody in this thread so far has said they are programming in non-ASCII. If there was a contingent of Japanese or Chinese users doing that, then surely they would speak up here or in Bugzilla to advocate for this feature?
Sep 22 2018
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen wrote:

> Nobody in this thread so far has said they are programming in non-ASCII.

I did. https://git.ikeran.org/dhasenan/muzikilo
Sep 22 2018
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen wrote:

> Nobody in this thread so far has said they are programming in non-ASCII.

This is the obvious observation bias I alluded to before: of course people who don't read and write English aren't in this thread, since they cannot read or write the English used in this thread! Ditto for bugzilla.

Absence of evidence CAN be evidence of absence... but not when the absence is so easily explained by our shared bias.

Neia Neutuladh posted one link. I have seen Japanese D code before on twitter, but cannot find it now (surely because the search engines also share this bias). Perhaps those are the only two examples in existence, but I stand by my belief that we must reach out to these other communities somehow and do a proper, proactive study before dismissing the possibility.
Sep 22 2018
On Sunday, 23 September 2018 at 00:18:06 UTC, Adam D. Ruppe wrote:

> I have seen Japanese D code before on twitter, but cannot find it now (surely because the search engines also share this bias).

You can find a lot more Japanese D code on this blogging platform: https://qiita.com/tags/dlang

Here's the most recent post to save you a click: https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62
Sep 22 2018
On 23/09/18 04:29, sarn wrote:

> On Sunday, 23 September 2018 at 00:18:06 UTC, Adam D. Ruppe wrote:
>> I have seen Japanese D code before on twitter, but cannot find it now (surely because the search engines also share this bias).
>
> You can find a lot more Japanese D code on this blogging platform: https://qiita.com/tags/dlang
>
> Here's the most recent post to save you a click: https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62

Comments in Japanese. Identifiers in English. Not advancing your point, I think.

Shachar
Sep 22 2018
On Sunday, 23 September 2018 at 06:53:21 UTC, Shachar Shemesh wrote:

> On 23/09/18 04:29, sarn wrote:
>> You can find a lot more Japanese D code on this blogging platform: https://qiita.com/tags/dlang
>>
>> Here's the most recent post to save you a click: https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62
>
> Comments in Japanese. Identifiers in English. Not advancing your point, I think.
>
> Shachar

Well, I knew that when I posted, so I honestly have no idea what point you assumed I was making.
Sep 23 2018
On 23/09/18 15:38, sarn wrote:

> On Sunday, 23 September 2018 at 06:53:21 UTC, Shachar Shemesh wrote:
>> Comments in Japanese. Identifiers in English. Not advancing your point, I think.
>
> Well, I knew that when I posted, so I honestly have no idea what point you assumed I was making.

I don't know what point you were trying to make. That's precisely why I posted.

I don't think D currently or ever enforces what type of (legal UTF-8) text you could use in comments or strings. This thread is about what's legal to use in identifiers. The example you brought does not use Unicode in identifiers, and is, therefore, irrelevant to the discussion we're having.

That was the point *I* was trying to make.

Shachar
Sep 23 2018
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen wrote:

> If there was a contingent of Japanese or Chinese users doing that then surely they would speak up here or in Bugzilla to advocate for this feature?

https://forum.dlang.org/post/piwvbtetcwyxlalocxkw@forum.dlang.org
Sep 23 2018
On 9/22/18 12:56 PM, Neia Neutuladh wrote:

> On Saturday, 22 September 2018 at 12:35:27 UTC, Steven Schveighoffer wrote:
>> But aren't we arguing about the wrong thing here? D already accepts non-ASCII identifiers.
>
> Walter was doing that thing that people in the US who only speak English tend to do: forgetting that other people speak other languages, and that people who speak English can learn other languages to work with people who don't speak English.

I don't think he was doing that. I think what he was saying was, D tried to accommodate users who don't normally speak English, and they still use English (for the most part) for coding.

I'm actually surprised there isn't much code out there that is written with identifiers other than ASCII, given that C99 supported them. I assumed it was because they weren't supported. Now I learn that they are supported, yet almost all C code I've ever seen is written in English. Perhaps that's just because I don't frequent foreign language sites though :) But many people here speak English as a second language, and vouch for their cultures still using English to write code.

> He was saying it's inevitably a mistake to use non-ASCII characters in identifiers and that nobody does use them in practice.

I would expect people probably do try to use them in practice, it's just that the problems they run into aren't worth the effort (tool/environment support). But I have no first- or even second-hand experience with this. It does seem like Walter has a lot of experience with it though.

> Walter talking like that sounds like he'd like to remove support for non-ASCII identifiers from the language. I've gotten by without maintaining a set of personal patches on top of DMD so far, and I'd like it if I didn't have to start.

I don't think he was saying that. I think he was against expanding support for further Unicode identifiers because the first effort did not produce any measurable benefit. I'd be shocked, given the recent positions of Walter and Andrei, if they decided to remove non-ASCII identifiers that are currently supported, thereby breaking any existing code.

> Chinese and Japanese have gained about eleven thousand symbols since Unicode 2. Unicode 2 covers 25 writing systems, while Unicode 11 covers 146. Just updating to Unicode 3 would give us Cherokee, Ge'ez (multiple languages), Khmer (Cambodian), Mongolian, Burmese, Sinhala (Sri Lanka), Thaana (Maldivian), Canadian aboriginal syllabics, and Yi (Nuosu).

Very interesting! I would agree that we should at least add support for Unicode symbols that are used in spoken languages, especially if we already have support for symbols that aren't ASCII. I don't see the downside, especially if you can already use Unicode 2.0 symbols for identifiers (the ship has already sailed).

It could be a good incentive to get kids in countries where English isn't commonly spoken to try D out as a first programming language ;) Using your native language to show example code could be a huge benefit for teaching coding.

My recommendation is to put the PR up for review (that you said you had ready) and see what happens. Having an actual patch to talk about could change minds. At the very least, it's worth not wasting the effort you have already spent. Even if it does need a DIP, the PR can show that one less piece of effort is needed to get it implemented.

-Steve
Sep 24 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:

> When I originally started with D, I thought non-ASCII identifiers with Unicode were a good idea. I've since slowly become less and less enthusiastic about it.
> [...]
> Extending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

To wit, Windows linker error with Unicode symbol: https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161

> Absent a much more compelling rationale for it, I'd say no.

I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

Someone linked this Swift chapter on Unicode handling in an earlier forum thread; read the section on emoji in particular: https://oleb.net/blog/2017/11/swift-4-strings/

I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.

I believe Swift just punts their Unicode support to ICU, like most any other project these days. That's a horrible sign, that you've created a spec so grotesquely complicated that most everybody relies on a single project to not have to deal with it.
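The zero-width-joiner composition being joked about is easy to see from any language that exposes code points. The "family" emoji is not one character but three emoji glued together with U+200D ZERO WIDTH JOINER:

```python
# man (U+1F468) + ZWJ + woman (U+1F469) + ZWJ + girl (U+1F467)
family = '\U0001F468\u200D\U0001F469\u200D\U0001F467'

print(family)                  # renders as a single glyph on capable terminals
print(len(family))             # 5 code points make up "one" emoji
print(family.count('\u200D'))  # 2 zero-width joiners hold it together
```

A renderer that doesn't understand the sequence falls back to drawing the three emoji side by side, which is why the same "character" looks different across platforms.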
Sep 21 2018
On Saturday, 22 September 2018 at 04:54:59 UTC, Joakim wrote:

> To wit, Windows linker error with Unicode symbol: https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161

That's a good argument for sticking to ASCII for name mangling.

> I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

The compiler doesn't have to do much with Unicode processing, fortunately.
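One battle-tested way to keep mangled symbols ASCII-only while still allowing Unicode identifiers is the trick internationalized domain names use: Punycode (RFC 3492). The sketch below is purely illustrative; the `mangle`/`demangle` functions and the `_U` prefix are made up for this example and are not how DMD actually mangles symbols:

```python
def mangle(ident: str) -> str:
    """Hypothetical ASCII-only mangling of one identifier via Punycode."""
    if ident.isascii():
        return ident                            # ASCII names pass through
    enc = ident.encode('punycode').decode('ascii')
    return '_U' + enc                           # '_U' marker is invented here

def demangle(sym: str) -> str:
    """Reverse the hypothetical mangling; Punycode is lossless."""
    if not sym.startswith('_U'):
        return sym
    return sym[2:].encode('ascii').decode('punycode')

print(mangle('count'))            # unchanged
print(mangle('müzik'))            # ASCII-only, so any linker can carry it
print(demangle(mangle('müzik')))  # round-trips back to 'müzik'
```

The appeal is that tools which only understand ASCII symbol names (older linkers, debuggers, object-file dumpers) keep working, at the cost of human-readable symbols for non-ASCII identifiers.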
Sep 21 2018
On Friday, September 21, 2018 10:54:59 PM MDT Joakim via Digitalmars-d wrote:

> I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

Unicode identifiers may make sense in a code base that is going to be used solely by a group of developers who speak a particular language that uses a number of non-ASCII characters (especially languages like Chinese or Japanese), but it has no business in any code that's intended for international use. It just causes problems. At best, a particular, regional keyboard may be able to handle a particular symbol, but most other keyboards won't be able to. So, using that symbol causes problems for all of the developers from other parts of the world, even if those developers also have Unicode symbols in their native languages.

> Someone linked this Swift chapter on Unicode handling in an earlier forum thread; read the section on emoji in particular: https://oleb.net/blog/2017/11/swift-4-strings/
>
> I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.

Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.

- Jonathan M Davis
Sep 22 2018
On 22/09/18 11:52, Jonathan M Davis wrote:Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job. - Jonathan M DavisThank Allah that someone said it before I had to. I could not agree more. Encoding whole words as single Unicode code points makes no sense. U+FDF2 Shachar
Sep 22 2018
On Saturday, 22 September 2018 at 10:24:48 UTC, Shachar Shemesh wrote:Thank Allah that someone said it before I had to. I could not agree more. Encoding whole words as single Unicode code points makes no sense.The goal of Unicode is to support diversity, if you argue against that you don't need Unicode at all. What you are saying is basically that you would remove Chinese too. Emojis are not my world either but it is an expression system / language.
Sep 22 2018
On Saturday, September 22, 2018 4:51:47 AM MDT Thomas Mader via Digitalmars- d wrote:On Saturday, 22 September 2018 at 10:24:48 UTC, Shachar Shemesh wrote:Unicode is supposed to be a universal way of representing every character in every language. Emojis are not characters. They are sequences of characters that people use to represent images. I do not understand how an argument can even be made that they belong in Unicode. As I said, it's exactly the same as arguing that words should be represented in Unicode. Unfortunately, however, at least some of them are in there. :| - Jonathan M DavisThank Allah that someone said it before I had to. I could not agree more. Encoding whole words as single Unicode code points makes no sense.The goal of Unicode is to support diversity, if you argue against that you don't need Unicode at all. What you are saying is basically that you would remove Chinese too. Emojis are not my world either but it is an expression system / language.
Sep 22 2018
On 22/09/18 14:28, Jonathan M Davis wrote:As I said, it's exactly the same as arguing that words should be represented in Unicode. Unfortunately, however, at least some of them are in there. :| - Jonathan M DavisTo be fair to them, that word is part of the "Arabic Presentation Forms" section. The "Presentation Forms" sections are meant as backwards compatibility for code points that existed before, and are not meant to be generated by Unicode-aware applications. Shachar
Sep 22 2018
On Saturday, 22 September 2018 at 11:28:48 UTC, Jonathan M Davis wrote:Unicode is supposed to be a universal way of representing every character in every language. Emojis are not characters. They are sequences of characters that people use to represent images. I do not understand how an argument can even be made that they belong in Unicode. As I said, it's exactly the same as arguing that words should be represented in Unicode. Unfortunately, however, at least some of them are in there. :|At least since the incorporation of emojis it's not supposed to be a universal way of representing characters anymore. :-) Maybe there was a time when that was true, I don't know, but I think they see Unicode as a way to express all language symbols. And emojis are nothing other than a language where each symbol stands for an emotion/word/sentence. If Unicode only allows languages with characters which are used to form words, it's excluding languages which use other ways of expressing something. Would you suggest removing such writing systems from Unicode? What should a museum do which is in need of software to somehow manage Egyptian hieroglyphs? Unicode was made to support all sorts of writing systems, and using multiple characters per word is just one way to form a writing system.
Sep 22 2018
On 22/09/18 15:13, Thomas Mader wrote:Would you suggest to remove such writing systems out of Unicode? What should a museum do which is in need of a software to somehow manage Egyptian hieroglyphs?If memory serves me right, hieroglyphs actually represent consonants (vowels are implicit), and as such, are most definitely "characters". The only language I can think of, off the top of my head, where words have distinct signs is sign language. It is a good question whether Unicode should include such a language (difficulty of representing motion in a font aside). Shachar
Sep 22 2018
On Saturday, 22 September 2018 at 12:24:49 UTC, Shachar Shemesh wrote:If memory serves me right, hieroglyphs actually represent consonants (vowels are implicit), and as such, are most definitely "characters".Egyptian hieroglyphics uses logographs (symbols representing whole words, which might be multiple syllables), letters, and determinatives (which don't represent a word themselves but disambiguate the surrounding words). Looking things up serves me better than memory, usually.The only language I can think of, off the top of my head, where words have distinct signs is sign language.Logographic writing systems. There is one logographic writing system still in common use, and it's the standard writing system for Chinese and Japanese. That's about 1.4 billion people. It was used in Korea until hangul became popularized. Unicode also aims to support writing systems that aren't used anymore. That means Mayan, cuneiform (several variants), Egyptian hieroglyphics and demotic script, several extinct variants on the Chinese writing system, and Luwian. Sign languages generally don't have writing systems. They're also not generally related to any ambient spoken languages (for instance, American Sign Language is derived from French Sign Language), so if you speak sign language and can write, you're bilingual. Anyway, without writing systems, sign languages are irrelevant to Unicode.
Sep 22 2018
On 09/22/2018 09:27 AM, Neia Neutuladh wrote:Logographic writing systems. There is one logographic writing system still in common use, and it's the standard writing system for Chinese and Japanese.I had the misconception of each Chinese character meaning a word until I read "The Chinese Language, Fact and Fantasy" by John DeFrancis. One thing I learned was that Chinese is not purely logographic. Ali
Sep 23 2018
On 9/22/18 4:52 AM, Jonathan M Davis wrote:But aren't some (many?) Chinese/Japanese characters representing whole words? -SteveI was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.
Sep 22 2018
On Saturday, September 22, 2018 6:37:09 AM MDT Steven Schveighoffer via Digitalmars-d wrote:On 9/22/18 4:52 AM, Jonathan M Davis wrote:It's true that they're not characters in the sense that Roman characters are characters, but they're still part of the alphabets for those languages. Emojis are specifically formed from sequences of characters - e.g. :) is two characters which are already expressible on their own. They're meant to represent a smiley face, but it's a sequence of characters already. There's no need whatsoever to represent anything extra in Unicode. It's already enough of a disaster that there are multiple ways to represent the same character in Unicode without nonsense like emojis. It's stuff like this that really makes me wish that we could come up with a new standard that would replace Unicode, but that's likely a pipe dream at this point. - Jonathan M DavisBut aren't some (many?) Chinese/Japanese characters representing whole words?I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.
Sep 22 2018
On 9/22/18 8:58 AM, Jonathan M Davis wrote:On Saturday, September 22, 2018 6:37:09 AM MDT Steven Schveighoffer via Digitalmars-d wrote:But there are tons of emojis that have nothing to do with sequences of characters. Like houses, or planes, or whatever. I don't even know what the sequences of characters are for them. I think it started out like that, but turned into something else. Either way, I can't imagine any benefit from using emojis in symbol names. -SteveOn 9/22/18 4:52 AM, Jonathan M Davis wrote:It's true that they're not characters in the sense that Roman characters are characters, but they're still part of the alphabets for those languages. Emojis are specifically formed from sequences of characters - e.g. :) is two characters which are already expressible on their own. They're meant to represent a smiley face, but it's a sequence of characters already. There's no need whatsoever to represent anything extra in Unicode. It's already enough of a disaster that there are multiple ways to represent the same character in Unicode without nonsense like emojis. It's stuff like this that really makes me wish that we could come up with a new standard that would replace Unicode, but that's likely a pipe dream at this point.But aren't some (many?) Chinese/Japanese characters representing whole words?I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.
Sep 24 2018
On Saturday, 22 September 2018 at 12:37:09 UTC, Steven Schveighoffer wrote:But aren't some (many?) Chinese/Japanese characters representing whole words? -SteveKind of hair-splitting, but it's more accurate to say that some Chinese/Japanese words can be written with one character. Like how English speakers wouldn't normally say that "A" and "I" are characters representing whole words.
Sep 22 2018
On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis wrote:Unicode identifiers may make sense in a code base that is going to be used solely by a group of developers who speak a particular language that uses a number a of non-ASCII characters (especially languages like Chinese or Japanese), but it has no business in any code that's intended for international use. It just causes problems.You have a problem when you need to share a codebase between two organizations using different languages. "Just use ASCII" is not the solution. "Use a language that most developers in both organizations can use" is. That's *usually* going to be English, but not always. For instance, a Belorussian company doing outsourcing work for a Russian company might reasonably write code in Russian. If you're writing for a global audience, as most open source code is, you're usually going to use the most widely spoken language.
Sep 22 2018
On Saturday, September 22, 2018 10:07:38 AM MDT Neia Neutuladh via Digitalmars-d wrote:On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis wrote:My point is that if your code base is definitely only going to be used within a group of people who are using a keyboard that supports a Unicode character that you want to use, then it's not necessarily a problem to use it, but if you're writing code that may be seen or used by a general audience (especially if it's going to be open source), then it needs to be in ASCII, or it's a serious problem. Even if it's a character like lambda that most everyone is going to understand, many, many programmers are not going to be able to type it on their keyboards, and that's going to cause nothing but problems. For better or worse, English is the international language of science and engineering, and that includes programming. So, any programs that are intended to be seen and used by the world at large need to be in ASCII. And the biggest practical issue with that is whether a character is even on a typical keyboard. Using a Unicode character in a program makes it so that many programmers cannot type it. And even given the large breadth of Unicode characters, you could have a keyboard that supports a number of Unicode characters and still not have the Unicode character in question. So, open source programs need to be in ASCII. Now, I don't know that it's a problem to support a wide range of Unicode characters in identifiers when you consider the issues of folks whose native language is not English (especially when it's a language like Chinese or Japanese), but open source programs should only be using ASCII identifiers. And unfortunately, sometimes, the fact that a language supports Unicode identifiers has led English speakers to do stupid things like use the lambda character in identifiers. 
So, I can understand Walter's reticence to go further with supporting Unicode identifiers, but on the other hand, when you consider how many people there are on the planet who use a language that doesn't even use the latin alphabet, it's arguably a good idea to fully support Unicode identifiers. - Jonathan M DavisUnicode identifiers may make sense in a code base that is going to be used solely by a group of developers who speak a particular language that uses a number a of non-ASCII characters (especially languages like Chinese or Japanese), but it has no business in any code that's intended for international use. It just causes problems.You have a problem when you need to share a codebase between two organizations using different languages. "Just use ASCII" is not the solution. "Use a language that most developers in both organizations can use" is. That's *usually* going to be English, but not always. For instance, a Belorussian company doing outsourcing work for a Russian company might reasonably write code in Russian. If you're writing for a global audience, as most open source code is, you're usually going to use the most widely spoken language.
Sep 22 2018
On 9/22/2018 6:01 PM, Jonathan M Davis wrote:For better or worse, English is the international language of science and engineering, and that includes programming.In the earlier days of D, I put on the web pages a Google widget that would automatically translate the page into any language Google supported. This was eventually removed (not by me) because nobody wanted it. Nobody (besides me) even noticed it was removed. And the D community is a very international one. Supporting Unicode in identifiers gives users a false sense that it's a good idea to use them. Lots of programming tools don't work well with Unicode. Even Windows doesn't by default - you've got to run "chcp 65001" each time you open a console window. Filesystems don't work reliably with Unicode. Heck, the reason module names should be lower case in D is because mixed case doesn't work reliably across filesystems. D supports Unicode in identifiers because C and C++ do, and we want to be able to interoperate with them. Extending Unicode identifier support off into other directions, especially ones that break such interoperability, is just doing a disservice to users.
Sep 23 2018
On Sunday, 23 September 2018 at 21:12:13 UTC, Walter Bright wrote:D supports Unicode in identifiers because C and C++ do, and we want to be able to interoperate with them. Extending Unicode identifier support off into other directions, especially ones that break such interoperability, is just doing a disservice to users.Okay, that's why you previously selected C99 as the standard for what characters to allow. Do you want to update to match C11? It's been out for the better part of a decade, after all.
Sep 23 2018
On 9/23/2018 3:23 PM, Neia Neutuladh wrote:Okay, that's why you previously selected C99 as the standard for what characters to allow. Do you want to update to match C11? It's been out for the better part of a decade, after all.I wasn't aware it changed in C11.
Sep 23 2018
On Monday, 24 September 2018 at 01:39:43 UTC, Walter Bright wrote:On 9/23/2018 3:23 PM, Neia Neutuladh wrote:http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 522 (PDF numbering) or 504 (internal numbering). Outside the BMP, almost everything is allowed, including many things that are not currently mapped to any Unicode value. Within the BMP, a heck of a lot of stuff is allowed, including a lot that D doesn't currently allow. GCC hasn't even updated to the C99 standard here, as far as I can tell, but clang-5.0 is up to date.Okay, that's why you previously selected C99 as the standard for what characters to allow. Do you want to update to match C11? It's been out for the better part of a decade, after all.I wasn't aware it changed in C11.
Sep 23 2018
On 9/24/18 12:23 AM, Neia Neutuladh wrote:On Monday, 24 September 2018 at 01:39:43 UTC, Walter Bright wrote:I searched around for the current state of symbol names in C, and found some really crappy rules, though maybe this site isn't up to date?: https://en.cppreference.com/w/c/language/identifier What I understand from that is: 1. Yes, you can use any unicode character you want in C/C++ (seemingly since C99) 2. There are no rules about what *encoding* is acceptable, it's implementation defined. So various compilers have different rules as to what will be accepted in the actual source code. In fact, I read somewhere that not even ASCII is guaranteed to be supported. The result being, that you have to write the identifiers with an ASCII escape sequence in order for it to be actually portable. Which to me, completely defeats the purpose of using such identifiers in the first place. For example, on that page, they have a line that works in clang, not in GCC (tagged as implementation defined): char *🐱 = "cat"; The portable version looks like this: char *\U0001f431 = "cat"; Seriously, who wants to use that? Now, D can potentially do better (especially when all front-ends are the same) and support such things in the spec, but I think the argument "because C supports it" is kind of bunk. Or am I reading it wrong? In any case, I would expect that symbol name support should be focused only on languages which people use, not emojis. If there are words in Chinese or Japanese that can't be expressed using D, while other words can, it would seem inconsistent to a Chinese or Japanese speaking user, and I think we should work to fix that. I just have no idea what the state of that is. I also tend to agree that most code is going to be written in English, even when the primary language of the user is not. Part of the reason, which I haven't read here yet, is that all the keywords are in English. 
Someone has to kind of understand those to get the meaning of some constructs, and it's going to read strangely with the non-english words. One group which I believe hasn't spoken up yet is the group making the hunt framework, whom I believe are all Chinese? At least their web site is. It would be good to hear from a group like that which has large experience writing mature D code (it appears all to be in English) and how they feel about the support. -SteveOn 9/23/2018 3:23 PM, Neia Neutuladh wrote:http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 522 (PDF numbering) or 504 (internal numbering). Outside the BMP, almost everything is allowed, including many things that are not currently mapped to any Unicode value. Within the BMP, a heck of a lot of stuff is allowed, including a lot that D doesn't currently allow. GCC hasn't even updated to the C99 standard here, as far as I can tell, but clang-5.0 is up to date.Okay, that's why you previously selected C99 as the standard for what characters to allow. Do you want to update to match C11? It's been out for the better part of a decade, after all.I wasn't aware it changed in C11.
Sep 24 2018
On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:Part of the reason, which I haven't read here yet, is that all the keywords are in English.Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)One group which I believe hasn't spoken up yet is the group making the hunt framework, whom I believe are all Chinese? At least their web site is.I know they used a lot of my code as a starting point, and I, of course, wrote it in English, so that could have biased it a bit too. Though that might be a general point: when you want to use existing libraries, you end up using whatever language they are written in. Even so, I still find it kinda hard to believe that everybody everywhere uses only English in all their code. Maybe our efforts should be going toward the Chinese market via natural language support instead of competing with Rust on computer language features :PIt would be good to hear from a group like that which has large experience writing mature D code (it appears all to be in English) and how they feel about the support.definitely.
Sep 24 2018
On 9/24/18 10:14 AM, Adam D. Ruppe wrote:On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English). I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all... -StevePart of the reason, which I haven't read here yet, is that all the keywords are in English.Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
Sep 24 2018
On Monday, 24 September 2018 at 14:34:21 UTC, Steven Schveighoffer wrote:On 9/24/18 10:14 AM, Adam D. Ruppe wrote:You might get really funny error messages. 🙂 can't be casted to int. :-) And if you have to increment the number of cars you can write: 🚗++; This might give really funny looking programs!On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English). I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all... -StevePart of the reason, which I haven't read here yet, is that all the keywords are in English.Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
Sep 24 2018
On 9/24/18 2:20 PM, Martin Tschierschke wrote:On Monday, 24 September 2018 at 14:34:21 UTC, Steven Schveighoffer wrote:Haha, it could be cynical as well int can’t be casted to int🤔 Oh, the games we could play. -SteveOn 9/24/18 10:14 AM, Adam D. Ruppe wrote:You might get really funny error messages. 🙂 can't be casted to int.On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English). I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all...Part of the reason, which I haven't read here yet, is that all the keywords are in English.Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
Sep 24 2018
On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:2. There are no rules about what *encoding* is acceptable, it's implementation defined. So various compilers have different rules as to what will be accepted in the actual source code. In fact, I read somewhere that not even ASCII is guaranteed to be supported.Indeed. IBM mainframes have C compilers too but not ASCII. They code in EBCDIC. That's why, for instance, a test like if(c >= 'A' && c <= 'Z') printf("CAPITAL LETTER\n"); is not portable - it doesn't hold in EBCDIC, where the capital letters aren't contiguous.
Sep 24 2018
On 9/24/18 3:18 PM, Patrick Schluter wrote:On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:Right. But it's just a side-note -- I'd guess all modern compilers support ASCII, and definitely ones that we would want to interoperate with. Besides, that example is more concerned with *input data* encoding, not *source code* encoding. If the above is written in ASCII, then I would assume that the bytes in the source file are the ASCII bytes, and probably the IBM compilers would not know what to do with such files (it would all be gibberish if you opened it in an EBCDIC editor). You'd first have to translate it to EBCDIC, which is a red flag that likely this isn't going to work :) -Steve2. There are no rules about what *encoding* is acceptable, it's implementation defined. So various compilers have different rules as to what will be accepted in the actual source code. In fact, I read somewhere that not even ASCII is guaranteed to be supported.Indeed. IBM mainframes have C compilers too but not ASCII. They code in EBCDIC. That's why, for instance, a test like if(c >= 'A' && c <= 'Z') printf("CAPITAL LETTER\n"); is not portable - it doesn't hold in EBCDIC.
Sep 24 2018
On Sunday, 23 September 2018 at 21:12:13 UTC, Walter Bright wrote:D supports Unicode in identifiers because C and C++ do, and we want to be able to interoperate with them. Extending Unicode identifier support off into other directions, especially ones that break such interoperability, is just doing a disservice to users.I always thought D supported Unicode with the goal of going forward with it while C was stuck with ASCII: http://www.drdobbs.com/cpp/time-for-unicode/228700405 "The D programming language has already driven stakes in the ground, saying it will not support 16 bit processors, processors that don't have 8 bit bytes, and processors with crippled, non-IEEE floating point. Is it time to drive another stake in and say the time for Unicode has come? " Have you changed your mind since?
Sep 23 2018
On 9/23/2018 6:06 PM, Dennis wrote:Have you changed your mind since?D the language is well suited to the development of Unicode apps. D source code is another matter.
Sep 23 2018
On Monday, 24 September 2018 at 01:32:38 UTC, Walter Bright wrote:D the language is well suited to the development of Unicode apps. D source code is another matter.But in the article you specifically talk about the use of Unicode in the context of source code instead of apps: "With the D programming language, we continuously run up against the problem that ASCII has reached its expressivity limits." "There are the chevrons « and » which serve as another set of brackets to lighten the overburdened ambiguities of ( ). There are the dot-product and cross-product characters · and × which would make lovely infix operator tokens for math libraries. The greek letters would be great for math variable names."
Sep 24 2018
On Monday, September 24, 2018 4:19:31 AM MDT Dennis via Digitalmars-d wrote:On Monday, 24 September 2018 at 01:32:38 UTC, Walter Bright wrote:Given that the typical keyboard has none of those characters, maintaining code that used any of them would be a royal pain. It's one thing if they're used in the occasional string as data, but it's quite another if they're used as identifiers or operators. I don't see how that would be at all maintainable. You'd be forced to constantly copy and paste rather than type. - Jonathan M DavisD the language is well suited to the development of Unicode apps. D source code is another matter.But in the article you specifically talk about the use of Unicode in the context of source code instead of apps: "With the D programming language, we continuously run up against the problem that ASCII has reached its expressivity limits." "There are the chevrons « and » which serve as another set of brackets to lighten the overburdened ambiguities of ( ). There are the dot-product and cross-product characters · and × which would make lovely infix operator tokens for math libraries. The greek letters would be great for math variable names."
Sep 24 2018
On Monday, 24 September 2018 at 10:36:50 UTC, Jonathan M Davis wrote:Given that the typical keyboard has none of those characters, maintaining code that used any of them would be a royal pain.Note that I'm not trying to argue either way, it's just that I used to think of Walter's stance on D and Unicode as: "D would fully embrace Unicode if only editors/debuggers etc. would embrace it too" But now I read:D supports Unicode in identifiers because C and C++ do, and we want to be able to interoperate with them."So I wonder what changed. I guess it's mostly answered in the first reply:When I originally started with D, I thought non-ASCII identifiers with Unicode was a good idea. I've since slowly become less and less enthusiastic about it.
Sep 24 2018
On Monday, 24 September 2018 at 10:36:50 UTC, Jonathan M Davis wrote:Given that the typical keyboard has none of those characters, maintaining code that used any of them would be a royal pain.It is pretty easy to type them with a little keyboard config change, and editors like vim can even pick those up from comments in the file, though you have to train your fingers to know how to use it effectively too... but if you were maintaining something long term, you'd just do that.
Sep 24 2018
On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis wrote:Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.According to the Unicode website, http://unicode.org/standard/WhatIsUnicode.html, """ Support of Unicode forms the foundation for the representation of languages and symbols in all major operating systems, search engines, browsers, laptops, and smart phones—plus the Internet and World Wide Web (URLs, HTML, XML, CSS, JSON, etc.)""" Note, Unicode supports symbols, not just characters. The smiley face symbol predates its ':-)' usage in ASCII text, https://www.smithsonianmag.com/arts-culture/who-really-invented-the-smiley-face-2058483/. It's fundamentally a symbol, not a sequence of characters. Therefore it is not unreasonable for it to be encoded with a Unicode number. I do agree though, of course, that it would seem bizarre to use an emoji as a D identifier. The early history of computer science is completely dominated by cultures who use Latin-script-based characters, and hence, quite reasonably, text encoding and its automated visual representation by computer-based devices is dominated by the requirements of Latin-script languages. However, the world keeps turning and, despite DT's best efforts, China et al. look to become dominant. Even if not China, the chances are that eventually a non-Latin-script-based language will become very important. Parochial views like "all open source code should be in ASCII" will look silly. However, until that time D developers have to spend their time where it can be most useful. 
Hence the decision whether to apply Neia's patch / ideas mainly depends on how much the downstream effort will be (debuggers etc., as Walter pointed out), and how much the gain is. As Unicode 2.0 is already supported, I would guess that the vast majority of people with access to a computer can already enter identifiers in D that are rich enough for them. As Adam said though, it would be a good idea to at least ask!
Sep 23 2018
On 9/23/2018 12:06 PM, Abdulhaq wrote:The early history of computer science is completely dominated by cultures who use latin script based characters,Small character sets are much more implementable on primitive systems like telegraphs and electro-mechanical ttys. It wasn't even practical to display a rich character set until the early 1980's or so. There wasn't enough memory. Glass ttys at the time could barely, and I mean barely, display ASCII. I know because I designed and built one.
Sep 25 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:When I originally started with D, I thought non-ASCII identifiers with Unicode was a good idea. I've since slowly become less and less enthusiastic about it. First off, D source text simply must (and does) fully support Unicode in comments, characters, and string literals. That's not an issue. But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.Not seeing identifiers in languages you don't program in or can't read is to be expected. If it's supported, it will be used; for example, Japanese Swift: https://speakerdeck.com/codelynx/programming-swift-in-japaneseExtending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc. Absent a much more compelling rationale for it, I'd say no.More compelling than "there are 6 billion people in this world who don't speak English"? Allowing people to program in their own language while reducing the cognitive friction for people who want to learn programming in the majority of the world seems like a no-brainer thing to do.
Sep 23 2018
On 9/23/2018 9:52 AM, aliak wrote:Not seeing identifiers in languages you don't program in or can read in is expected.On the other hand, I've been programming for 40 years. I've customized my C++ compiler to emit error messages in various languages: https://github.com/DigitalMars/Compiler/blob/master/dm/src/dmc/msgsx.c I've implemented SHIFT-JIS encodings, along with .950 (Chinese) and .949 (Korean) code pages in the C++ compiler. I've worked in Japan writing software for Japanese companies. I've sold compilers internationally for 30 years (mostly to Germany and Japan). I did the tech support, meaning I'd see their code. --- There's a reason why dmd doesn't have international error messages. My experience with it is that international users don't want it. They prefer the english messages. I'm sure if you look hard enough you'll find someone using non-ASCII characters in identifiers. --- When I visited Remedy Games in Finland a few years back, I was surprised that everyone in the company was talking in english. I asked if they were doing that out of courtesy to me. They laughed, and said no, they talked in English because they came from all over the world, and english was the only language they had in common.
Sep 23 2018
On Sunday, 23 September 2018 at 20:49:39 UTC, Walter Bright wrote:There's a reason why dmd doesn't have international error messages. My experience with it is that international users don't want it. They prefer the english messages.I'm a native German speaker. As for my part, I agree on this, indeed. There are several reasons for this:
- Usually such translations are terrible, simply put.
- Discontinuous translations [0]
- Non-idiomatic sentences that still sound like English somehow.
- Translations of tech terms [1]
- Non-idiomatic translations of tech terms [2]
However, well done translations might be quite nice; back in VS 2010 I was happy with the German error messages. I'm not sure whether it was just delusion, but I think it got worse with some later version, though.
[0] There's nothing worse than every single sentence being treated on its own during the translation process. At least that's what you'd often think when you face a longer error message. Usually you're confronted with non-linked and kindergarten-like sentences that don't seem to be meant to be put together. Often you'd think there were several translators. Favorite problem with this: 2 different terms for the same thing in two sentences.
[1] e.g. "integer type" -> "ganzzahliger Datentyp". This just sounds weird. Anyone using "int" in their code knows what it means anyway... Nevertheless, there are some common translations that are fine (primarily because they're common), e.g. "error" -> "Fehler".
[2] e.g. "assertion" -> "Assertionsfehler". This particular one can be found in Windows 10 and is not even proper German.
Sep 24 2018
On Monday, 24 September 2018 at 15:17:14 UTC, 0xEAB wrote:German error messages.addendum: I've been using the English version since VS2017
Sep 24 2018
On 09/24/2018 08:17 AM, 0xEAB wrote:- Non-idiomatic translations of tech terms [2]This is something I had heard from a Digital Research programmer in early 90s: English message was something like "No memory left" and the German translation was "No memory on the left hand side" :) Ali
Sep 25 2018
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli wrote:On 09/24/2018 08:17 AM, 0xEAB wrote:My ex-girlfriend tried to learn SQL from a book that had gotten a prize for its use of Norwegian. As a result, every single concept used a different name from what everybody else uses, and while it may be possible to learn some SQL from this, it made googling an absolute nightmare. Just imagine a whole book saying CHOOSE for SELECT, IF for WHERE, and USING instead of FROM - only worse, since it's a different language. It even used SQL pseudo-code with these made-up names, and showed how to translate it to proper SQL as more of an afterthought. -- Simen- Non-idiomatic translations of tech terms [2]This is something I had heard from a Digital Research programmer in early 90s: English message was something like "No memory left" and the German translation was "No memory on the left hand side" :)
Sep 25 2018
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli wrote:On 09/24/2018 08:17 AM, 0xEAB wrote:The K&R in German was of the same "quality". That happens when the translator is not an IT person himself.- Non-idiomatic translations of tech terms [2]This is something I had heard from a Digital Research programmer in early 90s: English message was something like "No memory left" and the German translation was "No memory on the left hand side" :)
Sep 25 2018
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli wrote:On 09/24/2018 08:17 AM, 0xEAB wrote:[snip]- Non-idiomatic translations of tech terms [2]English message was something like "No memory left" and the German translation was "No memory on the left hand side" :) AliNot sure if this was not just some urban legend, but there was a delightful story back in the late 80s/early 90s about the early translation programs. They were in particular not very good at idiomatic translations, so people would play with idiomatic expressions from language X (say english) to language Y, and then back from Y to X - and then see what was returned. Apparently the expression "the spirit is willing but the flesh is weak" translated to Russian and back was returned by one such program as: "The vodka is good but the meat is rotten!"
Sep 26 2018
On Wednesday, 26 September 2018 at 12:57:21 UTC, ShadoLight wrote:On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli wrote:In case you missed it, this spread widely in the tech news a month or so ago: https://translate.google.fr/?hl=fr#so/en/ngoo%20m%20goon%20goob%20goo%20goo%20goo%20mgoo%20goo%20goo%20goo%20goo%20goo%20m%20goo There's still progress to be made.On 09/24/2018 08:17 AM, 0xEAB wrote:[snip]- Non-idiomatic translations of tech terms [2]English message was something like "No memory left" and the German translation was "No memory on the left hand side" :) AliNot sure if this was not just some urban legend, but there was a delightful story back in the late 80s/early 90s about the early translation programs. They were in particular not very good at idiomatic translations, so people would play with idiomatic expressions from language X (say english) to language Y, and then back from Y to X - and then see what was returned. Apparently the expression "the spirit is willing but the flesh is weak" translated to Russian and back was returned by one such program as: "The vodka is good but the meat is rotten!"
Sep 26 2018
A delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it so happens that "kabak" also means "zucchini" in Turkish. Imagine my shock when I came across that dessert recipe in English that used zucchini as the ingredient! :) Ali
Sep 26 2018
On Wednesday, September 26, 2018 11:15:01 PM MDT Ali Çehreli via Digitalmars-d wrote:A delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it so happens that "kabak" also means "zucchini" in Turkish. Imagine my shock when I came across that dessert recipe in English that used zucchini as the ingredient! :)Was it any good? ;) - Jonathan M Davis
Sep 26 2018
On Thursday, 27 September 2018 at 05:15:01 UTC, Ali Çehreli wrote:A delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it so happens that "kabak" also means "zucchini" in Turkish. Imagine my shock when I came across that dessert recipe in English that used zucchini as the ingredient! :) AliYou can't even imagine how many Italian words and recipes are distorted... Andrea
Sep 27 2018
On Thursday, 27 September 2018 at 07:03:51 UTC, Andrea Fontana wrote:On Thursday, 27 September 2018 at 05:15:01 UTC, Ali Çehreli wrote:+1 :-PA delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it so happens that "kabak" also means "zucchini" in Turkish. Imagine my shock when I came across that dessert recipe in English that used zucchini as the ingredient! :) AliYou can't even imagine how many Italian words and recipes are distorted... Andrea
Sep 27 2018
On Sunday, 23 September 2018 at 20:49:39 UTC, Walter Bright wrote:On 9/23/2018 9:52 AM, aliak wrote: There's a reason why dmd doesn't have international error messages. My experience with it is that international users don't want it. They prefer the english messages.Yes please. Keep them in English. But please, add an error code too in front of them.I'm sure if you look hard enough you'll find someone using non-ASCII characters in identifiers.It depends on what I'm developing. If I'm writing a public library I'm planning to release on github, I use English identifiers. But of course, if it is a piece of software for my company or for myself, I use Italian identifiers. Andrea
Sep 26 2018
On Sunday, September 23, 2018 2:49:39 PM MDT Walter Bright via Digitalmars-d wrote:There's a reason why dmd doesn't have international error messages. My experience with it is that international users don't want it. They prefer the english messages.It reminds me of one of the reasons that Bryan Cantrill thinks that many folks use Linux - they want to be able to google their stack traces. Of course, that same argument would be a reason to use C/C++ rather than switching to D, but having an error be in a format that's more common and therefore more likely to have been posted somewhere where you might be able to find a discussion on it and therefore maybe be able to find the solution for it can be valuable - and that's without even getting into all of the translation issues discussed elsewhere in this thread. And it's not like compiler error messages - or programming speak in general - are really traditional English anyway. - Jonathan M Davis
Sep 26 2018
Agreed with Walter. I'm all on board with i18n but I see no need for non-ASCII identifiers. Even identifiers with a non-Latin origin are usually written in the Latin script. As for real-world usage, I've seen Cyrillic identifiers a few times in PHP.
Sep 21 2018
On Friday, 21 September 2018 at 23:00:45 UTC, Erik van Velzen wrote:Agreed with Walter. I'm all on board with i18n but I see no need for non-ascii identifiers. Even identifiers with a non-latin origin are usually written in the latin script. As for real-world usage I've seen Cyrillic identifiers a few times in PHP.A: Wait. Using emojis as identifiers is not a good idea? B: Yes. A: But the cool kids are doing it: https://codepen.io/andresgalante/pen/jbGqXj In all seriousness, I hate it when someone thought it was funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it. (This is already supported in D.)
Sep 21 2018
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:A: Wait. Using emojis as identifiers is not a good idea? B: Yes. A: But the cool kids are doing it:The C11 spec says that emoji should be allowed in identifiers (ISO publication N1570 page 504/522), so it's not just the cool kids. I'm not in favor of emoji in identifiers.In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.It's supported because λ is a letter in a language spoken by thirteen million people. I mean, would you want to have to name a variable "lumиnosиty" because someone got annoyed at people using "i" as a variable name?
Sep 21 2018
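As an editorial aside, the distinction Neia draws here is exactly how Unicode itself classifies these code points: λ has the general category "letter", while emoji are "symbols". A quick sketch using Python's unicodedata module (used purely as a convenient category lookup, not as a statement about D's lexer):

```python
import unicodedata

# U+03BB GREEK SMALL LETTER LAMDA is a lowercase letter (category Ll),
# while a typical emoji such as U+1F600 GRINNING FACE is a symbol (So).
assert unicodedata.category("\u03bb") == "Ll"
assert unicodedata.category("\U0001F600") == "So"
```

An identifier rule that accepts "letters" thus naturally takes λ and rejects emoji, with no special-casing needed.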
On 22/09/2018 11:17 AM, Seb wrote:In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it. (This is already supported in D.)This can be strongly mitigated by using a compose key. But they are not terribly common unfortunately.
Sep 21 2018
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:A: Wait. Using emojis as identifiers is not a good idea? B: Yes. A: But the cool kids are doing it: https://codepen.io/andresgalante/pen/jbGqXjIt's not like we have a lot of good fonts (I know only one), and even fewer of them are suitable for code; they can't realistically be expected to cover everything, and monospace fonts are often ASCII-only.
Sep 23 2018
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it. (This is already supported in D.)I just want to chime in that I've definitely used greek letters in "ordinary" code - it's handy when writing math and feeling lazy. Note that on Linux, with a simple configuration tweak (Windows key mapped to Compose, and https://gist.githubusercontent.com/zkat/6718053/raw/4535a2e2a988aa90937a69dbb8f10e6a43b4010/.XCompose ), you can for instance type "<windows key> l a m" to make the lambda symbol, or other greek letters very easily.
Sep 25 2018
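For readers unfamiliar with XCompose, a minimal sketch of the kind of entry involved looks like the following in ~/.XCompose (the "l a m" mnemonic follows the post; the exact sequences in the linked gist may differ):

```
# Keep the locale's default compose sequences, then add our own.
include "%L"

# Compose, l, a, m -> U+03BB GREEK SMALL LETTER LAMDA
<Multi_key> <l> <a> <m> : "λ" U03BB
```

With the Windows key mapped to Multi_key (Compose), pressing it followed by l, a, m then inserts λ in any X11 application.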
When I make code that I expect to be only used around here, I generally write the code itself in English but comments in my own language. I agree that in general, it's better to stick with English in identifiers when the programming language and the standard library are in English. On Tuesday, 25 September 2018 at 09:28:33 UTC, FeepingCreature wrote:On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:On the other hand, Unicode identifiers still have their value IMO. The quote above is one reason for that: if there is a very specialized codebase, it may be just impractical to transliterate everything. Another reason is that something may not have a good translation to English. If there is an enum type listing city names, it is IMO better to write them as normal, using Unicode. CityName.seinäjoki, not CityName.seinaejoki.In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it. (This is already supported in D.)I just want to chime in that I've definitely used greek letters in "ordinary" code - it's handy when writing math and feeling lazy.
Sep 25 2018
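Dukc's CityName example can be sanity-checked against a language that already tracks the Unicode identifier recommendation (UAX #31): in Python 3, str.isidentifier() implements that rule, so the untransliterated spelling is a perfectly legal name there:

```python
# Python 3 identifiers follow Unicode UAX #31, so the native spelling
# is just as valid as the ASCII transliteration.
assert "seinäjoki".isidentifier()
assert "seinaejoki".isidentifier()
# Punctuation is still rejected, as you'd expect:
assert not "sein-joki".isidentifier()
```

This is only an analogy for what the proposed D change would allow; D's current lexer is the Unicode 2.0-era rule under discussion.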
On 25/09/18 15:35, Dukc wrote:Another reason is that something may not have a good translation to English. If there is an enum type listing city names, it is IMO better to write them as normal, using Unicode. CityName.seinäjoki, not CityName.seinaejoki.This sounded like a very compelling example, until I gave it a second thought. I now fail to see how this example translates to a real-life scenario. City names (data, changes over time) as enums (compile time set) seem like a horrible idea. That may sound like a very technical objection to an otherwise valid point, but I really think that's not the case. The properties that cause city names to be poor candidates for enum values are the same as those that make them Unicode candidates. Shachar
Sep 25 2018
On Wednesday, 26 September 2018 at 06:50:47 UTC, Shachar Shemesh wrote:The properties that cause city names to be poor candidates for enum values are the same as those that make them Unicode candidates.How so?City names (data, changes over time) as enums (compile time set) seem like a horrible idea.In most cases yes. But not always. You might be doing some sort of game where certain cities are a central concept, not just data with properties. Another possibility is that you're using code as data, AKA scripting. And who says anyway you can't make a program that's designed specifically for certain cities?
Sep 26 2018
On 26/09/18 10:26, Dukc wrote:On Wednesday, 26 September 2018 at 06:50:47 UTC, Shachar Shemesh wrote:Sure you can. It's just very poor design. I think, when asking such questions, two types of answers are relevant. One is hypotheticals where you say "this design requires this". For such answers, the design needs to be a good one. It makes no sense to design a language to support a hypothetical design which is not a good one. The other type of answer is "it's being done in the real world". If it's in active use in the real world, it might make sense to support it, even if we can agree that the design is not optimal. Since your answer is hypothetical, I think arguing this is not a good way to code is a valid one. ShacharThe properties that cause city names to be poor candidates for enum values are the same as those that make them Unicode candidates.How so?City names (data, changes over time) as enums (compile time set) seem like a horrible idea.In most cases yes. But not always. You might me doing some sort of game where certain cities are a central concept, not just data with properties. Another possibility is that you're using code as data, AKA scripting. And who says anyway you can't make a program that's designed specificially for certain cities?
Sep 26 2018
On Wednesday, 26 September 2018 at 07:37:28 UTC, Shachar Shemesh wrote:The other type of answer is "it's being done in the real world". If it's in active use in the real world, it might make sense to support it, even if we can agree that the design is not optimal. ShacharTwo years ago, I took part in implementing a commercial game. It would have faced the same thing, were it used. Anyway, the game has three characters with completely different abilities. The abilities were unique enough that it made sense to name some functions after the characters. One of the characters really has a non-ASCII character in his name, and that meant naming him differently in the code.
Sep 26 2018
On 9/26/18 2:50 AM, Shachar Shemesh wrote:On 25/09/18 15:35, Dukc wrote:Hm... I could see actually some "clever" use of opDispatch being used to define cities or other such names. In any case, I think the biggest pro for supporting Unicode symbol names is -- we already support Unicode symbol names. It doesn't make a whole lot of sense to only support some of them. -SteveAnother reason is that something may not have a good translation to English. If there is an enum type listing city names, it is IMO better to write them as normal, using Unicode. CityName.seinäjoki, not CityName.seinaejoki.This sounded like a very compelling example, until I gave it a second thought. I now fail to see how this example translates to a real-life scenario. City names (data, changes over time) as enums (compile time set) seem like a horrible idea. That may sound like a very technical objection to an otherwise valid point, but it really think that's not the case. The properties that cause city names to be poor candidates for enum values are the same as those that make them Unicode candidates.
Sep 26 2018
On 9/25/2018 11:50 PM, Shachar Shemesh wrote:This sounded like a very compelling example, until I gave it a second thought. I now fail to see how this example translates to a real-life scenario.Also, there are usually common ASCII versions of city names, such as Cologne for Köln.
Sep 26 2018
On 2018-09-21 18:27, Neia Neutuladh wrote:D's currently accepted identifier characters are based on Unicode 2.0: * ASCII range values are handled specially. * Letters and combining marks from Unicode 2.0 are accepted. * Numbers outside the ASCII range are accepted. * Eight random punctuation marks are accepted. This follows the C99 standard, as do Python and ECMAScript, just to name a few. A small number of languages reject non-ASCII characters: Dart, Perl. Some languages are weirdly generous: Swift and C11 allow everything outside the Basic Multilingual Plane. I'd like to update that so that D accepts something as a valid identifier character if it's a letter or combining mark or modifier symbol that's present in Unicode 11, or a non-ASCII number. This allows the 146 most popular writing systems and a lot more characters from those writing systems. This *would* reject those eight random punctuation marks, so I'll keep them in as legacy characters. It would mean we don't have to reference the C99 standard when enumerating the allowed characters; we just have to refer to the Unicode standard, which we already need to talk about in the lexical part of the spec. It might also make the lexer a tiny bit faster; it reduces the number of valid-ident-char segments to search from 245 to 134. On the other hand, it will change the ident char ranges from wchar to dchar, which means the table takes up marginally more memory. And, of course, it lets you write programs entirely in Linear B, and that's a marketing ploy not to be missed. I've got this coded up and can submit a PR, but I thought I'd get feedback here first. Does anyone see any horrible potential problems here? Or is there an interestingly better option? Does this need a DIP?I'm not a native English speaker but I write all my public and private code in English. Anyone I work with, I will expect and make sure that they're writing the code in English as well. English is not enough either; it has to be American English.
Despite this I think that D should support as much of the Unicode as possible (including using Unicode for identifiers). It should not be up to the programming language to decide which language the developer should write the code in. -- /Jacob Carlborg
Sep 25 2018
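The rule Neia proposes in the quoted post (letter, combining mark, or modifier symbol from Unicode 11, plus non-ASCII numbers, with ASCII handled specially) can be paraphrased in a few lines. This sketch uses Python's unicodedata only to illustrate the category test; the exact category set is my reading of the proposal, not the actual PR, which would presumably bake the result into dmd's lookup tables:

```python
import unicodedata

def is_proposed_ident_char(c: str) -> bool:
    """Sketch of the proposed rule: letters (L*), combining marks (M*),
    modifier symbols (Sk), or non-ASCII numbers (N*)."""
    if ord(c) < 128:                      # ASCII stays handled specially
        return c.isalnum() or c == "_"
    cat = unicodedata.category(c)
    return cat[0] in "LMN" or cat == "Sk"

assert is_proposed_ident_char("ä")            # Latin letter
assert is_proposed_ident_char("\u03bb")       # Greek letter (λ)
assert is_proposed_ident_char("\u0663")       # ARABIC-INDIC DIGIT THREE (Nd)
assert not is_proposed_ident_char("\U0001F600")  # emoji (So) stays out
```

Note how the "eight random punctuation marks" grandfathered in from C99 fall outside these categories, which is why the proposal keeps them only as legacy characters.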
On Fri, 21 Sep 2018 16:27:46 +0000, Neia Neutuladh wrote:I've got this coded up and can submit a PR, but I thought I'd get feedback here first. Does anyone see any horrible potential problems here? Or is there an interestingly better option? Does this need a DIP?I just want to point out since this thread is still living that there have been very few answers to the actual question ("should I submit my PR?"). Walter did answer the question, with the reasons that Unicode identifier support is not useful/helpful and could cause issues with tooling. Which is likely correct; and if we really want to follow this logic, Unicode identifier support should be removed from D entirely. I don't recall seeing anyone in favor providing technical reasons, save the OP. Especially since the work is done, it makes sense to me to ask for the PR for review. Worst case scenario, it sits there until we need it.
Sep 26 2018
On 9/26/18 5:54 AM, rjframe wrote:On Fri, 21 Sep 2018 16:27:46 +0000, Neia Neutuladh wrote:This is a non-starter. We can't break people's code, especially for trivial reasons like 'you shouldn't code that way because others don't like it'. I'm pretty sure Walter would be against removing Unicode support for identifiers.I've got this coded up and can submit a PR, but I thought I'd get feedback here first. Does anyone see any horrible potential problems here? Or is there an interestingly better option? Does this need a DIP?I just want to point out since this thread is still living that there have been very few answers to the actual question ("should I submit my PR?"). Walter did answer the question, with the reasons that Unicode identifier support is not useful/helpful and could cause issues with tooling. Which is likely correct; and if we really want to follow this logic, Unicode identifier support should be removed from D entirely.I don't recall seeing anyone in favor providing technical reasons, save the OP.There doesn't necessarily need to be a technical reason. In fact, there really isn't one -- people can get by with using ASCII identifiers just fine (and many/most people do). Supporting Unicode would be purely for social or inclusive reasons (it may make D more approachable to non-English speaking schoolchildren for instance). As an only-English speaking person, it doesn't bother me either way to have Unicode identifiers. But the fact that we *already* support Unicode identifiers leads me to expect that we support *all* Unicode identifiers. It doesn't make a whole lot of sense to only support some of them.Especially since the work is done, it makes sense to me to ask for the PR for review. Worst case scenario, it sits there until we need it.I suggested this as well. https://forum.dlang.org/post/poaq1q$its$1@digitalmars.com I think it stands a good chance of getting incorporated, just for the simple fact that it's enabling and not disruptive. -Steve
Sep 26 2018
On 9/26/2018 5:46 AM, Steven Schveighoffer wrote:This is a non-starter. We can't break people's code, especially for trivial reasons like 'you shouldn't code that way because others don't like it'. I'm pretty sure Walter would be against removing Unicode support for identifiers.We're not going to remove it, because there's not much to gain from it. But expanding it seems of vanishingly little value. Note that each thing that gets added to D adds weight to it, and it needs to pull its weight. Nothing is free. I don't see a scenario where someone would be learning D and not know English. Non-English D instructional material is nearly non-existent. dlang.org is all in English. Don't most languages have a Romaji-like representation? C/C++ have made efforts in the past to support non-ASCII coding - digraphs, trigraphs, and alternate keywords. They've all failed miserably. The only people who seem to know those features even exist are language lawyers.
Sep 26 2018
On Wednesday, 26 September 2018 at 20:43:47 UTC, Walter Bright wrote:I don't see a scenario where someone would be learning D and not know English. Non-English D instructional material is nearly non-existent.http://ddili.org/ders/d/
Sep 26 2018
On 9/26/18 4:43 PM, Walter Bright wrote:But expanding it seems of vanishingly little value. Note that each thing that gets added to D adds weight to it, and it needs to pull its weight. Nothing is free.It may be the weight is already there in the form of unicode symbol support, just the range of the characters supported isn't good enough for some languages. It might be like replacing your refrigerator -- you get an upgrade, but it's not going to take up any more space because you get rid of the old one. I would like to see the PR before passing judgment on the heft of the change. The value is simply in the consistency -- when some of the words for your language can be valid symbols but others can't, then it becomes a weird guessing game as to what is supported. It would be like saying all identifiers can have any letters except `q`. Sure, you can get around that, but it's weirdly exclusive. I claim complete ignorance as to what is required, it hasn't been technically laid out what is at stake, and I'm not bilingual anyway. It could be true that I'm completely misunderstanding the positions of others. -Steve
Sep 26 2018
On 09/26/2018 01:43 PM, Walter Bright wrote:Don't most languages have a Romanji-like representation?Yes, a lot of languages that don't use the Latin alphabet have standard transcriptions into the Latin alphabet. Standard transcriptions into ASCII are much less common, and newer Unicode versions include more Latin characters to better support languages (and other use cases) using the Latin alphabet.
Sep 26 2018
On Wednesday, 26 September 2018 at 20:43:47 UTC, Walter Bright wrote:On 9/26/2018 5:46 AM, Steven Schveighoffer wrote:It's not that they don't know English. It's that non-English speakers can process words and sentences in non-English much more efficiently than in English. Knowing a language is not binary. Here's an example from this year's spring semester at NTNU (Norwegian uni): http://folk.ntnu.no/frh/grprog/eksempel/eks_20.cpp ... That's the basic programming course. Whether the professor would use that I guess would depend on the ratio of English/non-English speakers. But it's there nonetheless. Of course Norway is a bad example because the English level here is, arguably, higher than in many English-speaking countries :p But it's a great example because even if you're great at English, still sometimes people are more comfortable/confident/efficient in their own native language. Some tech meetups from different countries try and do things in English and mostly it works. But it's been seen consistently with non-English audiences that presentations given in English result in silence, whereas if it's in their native language you have actual engagement. I fail to understand how supporting a version of Unicode from (not sure when it was released) 3 billion decades ago should just be left as is, and why it cannot be removed, when there's someone who's willing to update it.This is a non-starter. We can't break people's code, especially for trivial reasons like 'you shouldn't code that way because others don't like it'. I'm pretty sure Walter would be against removing Unicode support for identifiers.We're not going to remove it, because there's not much to gain from it. But expanding it seems of vanishingly little value. Note that each thing that gets added to D adds weight to it, and it needs to pull its weight. Nothing is free. I don't see a scenario where someone would be learning D and not know English. Non-English D instructional material is nearly non-existent.
dlang.org is all in English. Don't most languages have a Romanji-like representation?C/C++ have made efforts in the past to support non-ASCII coding - digraphs, trigraphs, and alternate keywords. They've all failed miserably. The only people who seem to know those features even exist are language lawyers.This is not relevant. Trigraphs and digraphs did indeed fail miserably but they do not represent any non-ascii characters. The existential reasons for those abominations were different. Anyway, on a related note: D itself (not identifiers, but std) also supports unicode 6 or something. That's from 2010. That's a decade ago. We're at unicode 11 now. And I've already had someone tell me (while trying to get them to use D) - "hold on it supports unicode from a decade ago? Nah I'm not touching it". Not that it's the same as supporting identifiers in code, but still the reaction is relevant. Cheers, - Ali
Sep 27 2018
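The "which Unicode version does the implementation track" complaint above applies to most language runtimes, not just D. As a minimal illustration (this sketch uses Python 3 rather than D, purely because Python exposes the relevant metadata directly), a runtime pins both its character database and its identifier rules to a specific Unicode version:

```python
import unicodedata

# CPython pins its string handling and its identifier grammar (Unicode
# UAX #31, via the XID_Start/XID_Continue properties) to the Unicode
# Character Database version it was built with:
print(unicodedata.unidata_version)  # e.g. "13.0.0" on CPython 3.9

# str.isidentifier() applies those identifier rules directly:
print("面積".isidentifier())  # True: a CJK ideograph may start an identifier
print("2abc".isidentifier())  # False: a digit may not
```

The same pinning happens in any language whose lexer embeds Unicode tables, which is why an identifier set lags behind the current Unicode release until someone regenerates those tables.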
On 27/09/18 10:35, aliak wrote:Here's an example from this year's spring semester at NTNU (a Norwegian university): http://folk.ntnu.no/frh/grprog/eksempel/eks_20.cpp ... That's the basic programming course. Whether the professor would use that I guess would depend on the ratio of English/non-English speakers. But it's there nonetheless.I'm sorry I keep bringing this up, but context is really important here. The program you link to has non-ASCII in the comments and in the literals, but not in the identifiers. Nobody is opposed to having those. Shachar
Sep 27 2018
On Thursday, 27 September 2018 at 08:16:00 UTC, Shachar Shemesh wrote:On 27/09/18 10:35, aliak wrote:Here's an example from this year's spring semester at NTNU (a Norwegian university): http://folk.ntnu.no/frh/grprog/eksempel/eks_20.cpp ... That's the basic programming course. Whether the professor would use that I guess would depend on the ratio of English/non-English speakers. But it's there nonetheless.I'm sorry I keep bringing this up, but context is really important here. The program you link to has non-ASCII in the comments and in the literals, but not in the identifiers. Nobody is opposed to having those. ShacharThe point was that being able to use non-English in code is demonstrably both helpful and useful to people. Norwegian happens to be easily anglicizable. I've already linked to non-ASCII code examples in a previous post if you want those too.
Sep 27 2018
On 27/09/18 16:38, aliak wrote:The point was that being able to use non-English in code is demonstrably both helpful and useful to people. Norwegian happens to be easily anglicize-able. I've already linked to non ascii code versions in a previous post if you want that too.If you wish to make a point about something irrelevant to the discussion, that's fine. It is, however, irrelevant, mostly because it is uncontested. This thread is about the use of non-English in *identifiers*. This thread is not about comments. It is not about literals (i.e. - strings). Only about identifiers (function names, variable names etc.). If you have real world examples of those, that would be both interesting and relevant. Shachar
Sep 27 2018
On Thursday, 27 September 2018 at 13:59:48 UTC, Shachar Shemesh wrote:On 27/09/18 16:38, aliak wrote:The point was that being able to use non-English in code is demonstrably both helpful and useful to people. Norwegian happens to be easily anglicizable. I've already linked to non-ASCII code examples in a previous post if you want those too.If you wish to make a point about something irrelevant to the discussion, that's fine. It is, however, irrelevant, mostly because it is uncontested. This thread is about the use of non-English in *identifiers*. This thread is not about comments. It is not about literals (i.e. - strings). Only about identifiers (function names, variable names etc.). If you have real world examples of those, that would be both interesting and relevant. ShacharEnglish doesn't mean ASCII. You can write non-English in ASCII, which you would have noticed if you'd opened the link, which has identifiers in Norwegian (which is not English). And again, I've already posted a link that shows non-ASCII identifiers. I'll paste it again here in case you don't want to read the thread: https://speakerdeck.com/codelynx/programming-swift-in-japanese
Sep 27 2018
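For readers who haven't opened the slides, here is a minimal sketch of what non-ASCII identifiers look like in practice. The sketch is in Python 3 (whose identifier grammar, like Swift's, follows Unicode UAX #31) rather than D, and the Japanese names are illustrative ones chosen for this example: 面積 = "area", 横 = "width", 縦 = "height".

```python
# Non-ASCII identifiers, in the spirit of the Swift-in-Japanese slides.
# Python 3 accepts these because its lexer follows Unicode UAX #31.
def 面積(横, 縦):
    """Return the area of a 横 x 縦 rectangle."""
    return 横 * 縦

print(面積(3, 4))  # prints 12
```

Whether code like this is a good idea is exactly what the thread is debating; the point here is only that mainstream languages accept it.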
On Thursday, 27 September 2018 at 16:34:37 UTC, aliak wrote:On Thursday, 27 September 2018 at 13:59:48 UTC, Shachar Shemesh wrote:On 27/09/18 16:38, aliak wrote:The point was that being able to use non-English in code is demonstrably both helpful and useful to people. Norwegian happens to be easily anglicizable. I've already linked to non-ASCII code examples in a previous post if you want those too.If you wish to make a point about something irrelevant to the discussion, that's fine. It is, however, irrelevant, mostly because it is uncontested. This thread is about the use of non-English in *identifiers*. This thread is not about comments. It is not about literals (i.e. - strings). Only about identifiers (function names, variable names etc.). If you have real world examples of those, that would be both interesting and relevant. ShacharEnglish doesn't mean ASCII. You can write non-English in ASCII, which you would have noticed if you'd opened the link, which has identifiers in Norwegian (which is not English). And again, I've already posted a link that shows non-ASCII identifiers. I'll paste it again here in case you don't want to read the thread: https://speakerdeck.com/codelynx/programming-swift-in-japaneseShachar seems to be aiming for an internet high score by shooting down threads without reading them. You have better things to do. http://www.paulgraham.com/vb.html
Sep 27 2018
On Friday, 28 September 2018 at 02:23:32 UTC, sarn wrote:Shachar seems to be aiming for an internet high score by shooting down threads without reading them. You have better things to do. http://www.paulgraham.com/vb.htmlI believe you're being too harsh. It's easy to miss a part of a post sometimes.
Sep 28 2018
On Friday, 28 September 2018 at 11:37:10 UTC, Dukc wrote:It's easy to miss a part of a post sometimes.That's very true, and it's always good to give people the benefit of the doubt. But most people are able to post constructively here without:

* Abrasively and condescendingly declaring others' posts to be completely pointless
* Doing that based on one single aspect of a post, without bothering to check the whole post or the parent post
* Doubling down even after getting a hint that the poster might not have posted 100% cluelessly
* Doing all this more than once in a thread

If Shachar starts posting constructively, I'll happily engage. I mean that. Otherwise I won't waste my time, and I'll tell others not to waste theirs, too.
Sep 28 2018
On 28/09/18 14:37, Dukc wrote:On Friday, 28 September 2018 at 02:23:32 UTC, sarn wrote:Shachar seems to be aiming for an internet high score by shooting down threads without reading them. You have better things to do. http://www.paulgraham.com/vb.htmlI believe you're being too harsh. It's easy to miss a part of a post sometimes.A minor correction: Aliak is not accusing me of missing a part of the post. He's accusing me of not taking into account something he said in a different part of the *thread*. I.e. - I missed something he said in one of the other (as of this writing, 98) posts of this thread, thus causing Dukc to label me a bullshitter.
Sep 28 2018
On Saturday, 29 September 2018 at 02:22:55 UTC, Shachar Shemesh wrote:I missed something he said in one of the other (as of this writing, 98) posts of this thread, thus causing Dukc to label me a bullshitter.I know you meant Sarn, but still... can you please be a bit less aggressive with your wording?
Sep 29 2018
On 29/09/18 16:52, Dukc wrote:On Saturday, 29 September 2018 at 02:22:55 UTC, Shachar Shemesh wrote:I missed something he said in one of the other (as of this writing, 98) posts of this thread, thus causing Dukc to label me a bullshitter.I know you meant Sarn, but still... can you please be a bit less aggressive with your wording?From the article (the furthest point I read in it):When I ask myself what I've found life is too short for, the word that pops into my head is "bullshit."That is the word used by the article *you* linked to, in reference to me. If being accused of *calling* someone that offends you, just imagine how I felt being *called* that very same name. Seriously, I don't make it a habit of being offended by random people on the Internet, but that is more a conscious decision than a naturally thick skin. Seeing that label hurt. Don't worry. I've been on the Internet since 1991. That's longer than the median age here (i.e. - I've been on the Internet since before most of you were born). I've had my fair share of flame wars, including some that, to my chagrin, I started. In other words, I got over it. I did not reply, big though the temptation was. But the right time to be sensitive about which words are used was *before* you linked to the article. Taking offense at being called out for calling someone something you find offensive is hypocritical. I never understood the focus on words. It's not the use of that word that offended me; it's the fact that you thought anything I did justified using it. I don't think using "cattle excrement" instead would have been any less hurtful. And it's not that the rest of your post was thoughtful, considerate, and took pains to give constructive criticism, with or without hurting anyone's feelings. It's just that that doesn't seem to be the part that bothers you. Shachar
Sep 29 2018
On Saturday, 29 September 2018 at 16:19:38 UTC, ag0aep6g wrote:On 09/29/2018 04:19 PM, Shachar Shemesh wrote:On 29/09/18 16:52, Dukc wrote:[...]I know you meant Sarn, but still... can you please be a bit less aggressive with your wording?From the article (the furthest point I read in it):When I ask myself what I've found life is too short for, the word that pops into my head is "bullshit."Dukc didn't post that link. sarn did.You are 100% correct. My most sincere apologies. I am going to stop responding to this thread now. Shachar
Sep 29 2018
On 9/27/2018 12:35 AM, aliak wrote:Anyway, on a related note: D itself (not identifiers, but the standard library) also supports Unicode 6 or something. That's from 2010, nearly a decade ago. We're at Unicode 11 now. And I've already had someone tell me (while trying to get them to use D), "hold on, it supports Unicode from a decade ago? Nah, I'm not touching it". Not that it's the same as supporting identifiers in code, but the reaction is still relevant.Nobody is suggesting D not support Unicode in strings, comments, and the standard library. Please file any issues on Bugzilla, and PRs to fix them.
Sep 27 2018
On 9/26/2018 5:46 AM, Steven Schveighoffer wrote:Does this need a DIP?Feel free to write one, but its chances of getting incorporated are remote and would require a pretty strong rationale that I haven't seen yet.
Sep 26 2018