D - D is too english centric

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:

Hi,

I have noted that C99 allows *any* unicode character to be used in
identifiers using \u. The D specification limits characters in identifiers
to letters, digits, and '_', but does not even define what a letter is. The
DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].
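
For illustration, a minimal C sketch of the ASCII-only rule described above: a letter is 'A'..'Z' or 'a'..'z', and an identifier is letters, digits, and '_'. The function names are hypothetical, and the rule that an identifier may not start with a digit is an assumption borrowed from C. This is not DMD's actual lexer code.

```c
#include <stdbool.h>
#include <stddef.h>

/* A "letter" in the sense the post attributes to DMD: ASCII only. */
static bool is_ascii_letter(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static bool is_ascii_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Returns true if s is a non-empty run of letters, digits and '_'
   that does not start with a digit (the last part is an assumption). */
bool is_ascii_identifier(const char *s)
{
    if (s == NULL || *s == '\0')
        return false;
    unsigned char first = (unsigned char)s[0];
    if (!is_ascii_letter(first) && first != '_')
        return false;
    for (size_t i = 1; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!is_ascii_letter(c) && !is_ascii_digit(c) && c != '_')
            return false;
    }
    return true;
}
```

Under this rule a Danish name such as "grøn" is rejected as soon as the first byte of the UTF-8 encoded 'ø' is reached.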

I find this unfortunate, and in contrast to one of the main goals of D:
link compatibility with C.

It has previously been argued that only English should be used for
identifiers in order to support reuse better across language boundaries. But
that argument isn't always valid. For example, half a decade ago, I was
involved in building the IT infrastructure for a nationwide real estate
network. One of the requirements was that *everything* was in Danish. It
involved lots of developers nationwide, but no one outside Denmark. Of
course, identifiers couldn't be fully Danish, and that introduced
inconsistency in how things were named. But that was only a limitation of C
back then, which might not be an issue a few years from now. If D has this
limitation, it might be a valid reason to deselect D in favor of other
languages. After all, English is the native language of only a minority.

Regards,
Martin M. Pedersen

May 27 2003

"Walter" <walter digitalmars.com> writes:

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
news:bb0sqs$1t1k$1 digitaldaemon.com...
 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

No, only characters that fall into certain unicode ranges.

 The D specification limits characters in identifiers
 to letters, digits, and '_', but does not even define what a letter is. The
 DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].
 I find this unfortunate, and in contrast to one of the main goals of D:
 link compatibility with C.

It's a good idea to change it to match C for the reasons you state.

May 27 2003

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:

"Walter" <walter digitalmars.com> wrote in message
news:bb1c8v$2e2l$1 digitaldaemon.com...
 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

 No, only characters that fall into certain unicode ranges.

I haven't found that, but you are the expert, so I believe you. It makes
sense too.

 Link compatibility with C.

 It's a good idea to change it to match C for the reasons you state.

I'm glad we are in line here :-)

Regards,
Martin M. Pedersen

May 28 2003

"Walter" <walter digitalmars.com> writes:

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
news:bb2hou$oas$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:bb1c8v$2e2l$1 digitaldaemon.com...
 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

 No, only characters that fall into certain unicode ranges.

 I haven't found that, but you are the expert, so I believe you. It makes
 sense too.

"Each universal character name in an identifier shall designate a character
whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
annex D." C99 6.4.2.1-3
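
As a rough C sketch of what that clause implies for a lexer: a universal character name is accepted only if its code point falls inside a table of ranges. The three ranges below are a small illustrative subset of Latin letters, which I believe appear in the annex but which should be checked against the standard; they are not the full Annex D table. The names `cp_range` and `ucn_allowed_in_identifier` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cp_range {
    uint32_t lo;
    uint32_t hi;
};

/* Illustrative subset only, NOT the complete Annex D table. */
static const struct cp_range allowed_ranges[] = {
    { 0x00C0, 0x00D6 },
    { 0x00D8, 0x00F6 },
    { 0x00F8, 0x01F5 },
};

/* Returns true if code point cp lies in one of the allowed ranges. */
bool ucn_allowed_in_identifier(uint32_t cp)
{
    size_t n = sizeof allowed_ranges / sizeof allowed_ranges[0];
    for (size_t i = 0; i < n; i++) {
        if (cp >= allowed_ranges[i].lo && cp <= allowed_ranges[i].hi)
            return true;
    }
    return false;
}
```

With this subset, U+00E6 (æ) is accepted, while U+00D7 (the multiplication sign) and U+2003 (a Unicode space) are rejected.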

May 28 2003

Burton Radons <loth users.sourceforge.net> writes:

Walter wrote:
 "Each universal character name in an identifier shall designate a character
 whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
 annex D." C99 6.4.2.1-3

This could be more easily done by encoding into UTF-8 and assuming any
byte with the eighth bit set is part of an identifier.  It allows weird
obfuscations, yes, but why care about that?  I won't write code that
uses one of UNICODE's whitespace characters, and anyone whose code would
be worth my using would also not abuse it.  At worst it'd be one of
those features that kids get into abusing before they smarten up.
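
A minimal C sketch of that "eighth bit" rule, assuming UTF-8 input and hypothetical function names: every byte of 0x80 or above is treated as an identifier character, so the scanner never has to decode code points or consult a range table.

```c
#include <stddef.h>

/* Identifier byte under the "eighth bit" rule: an ASCII letter, '_',
   a digit (not in first position), or any byte >= 0x80, i.e. any byte
   of a multi-byte UTF-8 sequence. */
static int is_ident_byte(unsigned char c, int first)
{
    if (c >= 0x80)
        return 1;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_')
        return 1;
    if (!first && c >= '0' && c <= '9')
        return 1;
    return 0;
}

/* Returns the length in bytes of the identifier at the start of s,
   or 0 if s does not start with an identifier. */
size_t scan_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && is_ident_byte((unsigned char)s[n], n == 0))
        n++;
    return n;
}
```

The obfuscation mentioned above falls out directly: a no-break space (U+00A0, bytes C2 A0) between "a" and "b" is swallowed into a single four-byte identifier rather than separating two tokens.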

C99's decision itself looks pretty bad.  I'd use \u escapes for codes 
which I don't WANT rendered because either they have no rendering 
(whitespaces), because they would screw up rendering (controls), don't 
have a rendering in my code-writing font, or have special numeric 
significance.

Whether this feature is implemented by any compilers and editors is 
certainly important to Martin's stated requirements.  If his clients 
can't read the code he's written, he hasn't fulfilled his contract. 
Much more successful would be to use an encoding like UTF-8 or one of 
the BOM'd encodings D supports; all programs developed for Finns will 
surely render that.  If it develops that C gets a link standard for 
UNICODE identifiers, then that can be emulated when mangling extern (C). 
There's no cause for following C99 exactly in the code itself.

May 28 2003

"Walter" <walter digitalmars.com> writes:

"Burton Radons" <loth users.sourceforge.net> wrote in message
news:bb3s9f$29qv$1 digitaldaemon.com...
 Walter wrote:
 "Each universal character name in an identifier shall designate a character
 whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
 annex D." C99 6.4.2.1-3

 This could be more easily done by encoding into UTF-8 and assuming any
 byte with the eighth bit set is an identifier.  It allows weird
 obfuscations, yes, but why care about that?  I won't write code that
 uses one of UNICODE's whitespace characters, and anyone whose code would
 be worth use by me would also not abuse it.  At worst it'd be one of
 those features that kids get into abusing before they smarten up.

 C99's decision itself looks pretty bad.  I'd use \u escapes for codes
 which I don't WANT rendered because either they have no rendering
 (whitespaces), because they would screw up rendering (controls), don't
 have a rendering in my code-writing font, or have special numeric
 significance.

 Whether this feature is implemented by any compilers and editors is
 certainly important to Martin's stated requirements.  If his clients
 can't read the code he's written, he hasn't fulfilled his contract.
 Much more successful would be to use an encoding like UTF-8 or one of
 the BOM'd encodings D supports; all programs developed for Finns will
 surely render that.  If it develops that C gets a link standard for
 UNICODE identifiers, then that can be emulated when mangling extern (C).
 There's no cause for following C99 exactly in the code itself.

This is C's third attempt at internationalizing C source code. In 15 years I
have yet to see any C source outside of a test suite that used trigraphs or
digraphs. I'm skeptical the \u scheme will catch on, either. I think the
best way is to simply declare that the source text is UTF-8, UTF-16, or
UTF-32. D already recognizes and automatically handles all three. Then, it
is simply a matter of deciding which unicode characters to allow as
identifiers and whitespace.

The advantage of that is you can edit the source in any text editor that
supports unicode if you want to use more than ascii. There is no need for
any special editors that recognize trigraphs, digraphs, or on-the-fly \u
translation.
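
One common way to recognize the three encodings is to look for a byte order mark at the start of the file. The C sketch below shows that technique only; it is an assumption that a compiler would do it this way, not a description of how DMD actually detects the encoding, and the enum and function names are hypothetical.

```c
#include <stddef.h>

enum src_encoding {
    ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
};

/* Guesses the encoding from a leading byte order mark. Stores the BOM
   length in *bom_len (0 if none). Falls back to UTF-8 without a BOM. */
enum src_encoding detect_bom(const unsigned char *buf, size_t len,
                             size_t *bom_len)
{
    *bom_len = 0;
    /* UTF-32LE must be tested before UTF-16LE: both start with FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    return ENC_UTF8;
}
```

The ordering has one known blind spot: a UTF-16LE file whose first character after the BOM is U+0000 would be taken for UTF-32LE. For source code that case should not arise in practice.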

May 28 2003

Bill Cox <bill viasic.com> writes:

I'll put in a vote for UTF-8 support.  It seems to have the best chance 
of getting support from Linux IDEs and debuggers.

Bill

May 29 2003

Benji Smith <Benji_member pathlink.com> writes:

I agree. Source should be UTF-8.

--Benji


May 29 2003

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:

"Walter" <walter digitalmars.com> wrote in message
news:bb1c8v$2e2l$1 digitaldaemon.com...
 DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].
 I find this unfortunate, and in contrast to one of the main goals of D:
 link compatibility with C.

 It's a good idea to change it to match C for the reasons you state.

Another way of resolving this would be to give the programmer control of the
external identifier. Something like this:

extern (C) {
      extern("foo\u4444") void foo() { bar(); }
      extern("bar\u4444") void bar();
}

That would also allow us to access mangled C++ identifiers, and identifiers
containing '$'. It would not be easy, but that is not what I ask for. I only
want it to be possible.
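
The extern("...") syntax above is a proposal, not existing D. For comparison, GCC and Clang already offer something similar on the C side as a compiler extension (it is not standard C): an asm label on a declaration sets the symbol name the linker sees, independently of the name used in the source. A minimal sketch, with hypothetical names:

```c
/* GCC/Clang extension, not ISO C: the asm label picks the external
   symbol name. The label must go on a declaration that precedes the
   definition, not on the definition itself. */
int answer(void) __asm__("answer_external");

/* In the source the function is still called "answer"; the object file
   exports it under the name given in the asm label. */
int answer(void)
{
    return 42;
}
```

Callers in the same translation unit keep using the source-level name; only the linker-level name changes, which is the kind of control the post asks for.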

Regards,
Martin M. Pedersen

May 29 2003

Ilya Minkov <Ilya_member pathlink.com> writes:

In article <bb0sqs$1t1k$1 digitaldaemon.com>, Martin M. Pedersen says...

 It has previously been argued that only English should be used for
 identifiers in order to support reuse better across language boundaries. But
 that argument isn't always valid. For example, half a decade ago, I was
 involved in building the IT infrastructure for a nationwide real estate
 network. One of the requirements was that *everything* was in Danish. It
 involved lots of developers nationwide, but no one outside Denmark. Of
 course, identifiers couldn't be fully Danish, and that introduced
 inconsistency in how things were named. But that was only a limitation of C
 back then, which might not be an issue a few years from now. If D has this
 limitation, it might be a valid reason to deselect D in favor of other
 languages. After all, English is the native language of only a minority.

Hello, I believe there was a flamewar on this topic a few months ago, starting
from an old April 1st joke article from Bjarne Stroustrup about adding Unicode
identifiers to C++.

I believe that most people on this newsgroup are not native English speakers.
Nonetheless, the idea has found very little support, since:
- for almost any language, a transliteration scheme exists which approximates
the language in terms of the Latin alphabet;
- keywords are English anyway, and in D there is no preprocessor to un-English
them. :) Using any language other than English would lead to inconsistency
anyway.
- I know quite a number of languages, but I have tremendous problems switching
between them. It may take minutes every time. Having seen a single English
keyword, I start thinking in English, and you can be sure that all my subsequent
comments will be in English. I also can't read both code and comments
simultaneously, so I have to translate the comments into English to get going. I
even refuse to use any code with comments in my native language. I believe there
are plenty of people experiencing the same problem.

So, if you *really* want to mix your native language into a project, why don't
you write a scanner, which would:
- translate keywords from your language into D;
- transliterate all other identifiers into Latin letters.

This would basically be an extended version of a lexer, and lexing D is really
simple. Besides, there's a good ready-made lexer to borrow. :)
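
The transliteration half of such a scanner can be sketched in a few lines of C. The example assumes UTF-8 input and uses the conventional Danish replacements (æ to ae, ø to oe, å to aa); the table, the function name, and the choice of Danish are all illustrative assumptions. Keyword translation is left out.

```c
#include <stddef.h>
#include <string.h>

struct translit {
    const char *utf8;   /* UTF-8 bytes of the source letter */
    const char *ascii;  /* Latin replacement */
};

/* Hypothetical table: conventional Danish transliterations. */
static const struct translit danish_map[] = {
    { "\xC3\xA6", "ae" }, /* U+00E6 */
    { "\xC3\xB8", "oe" }, /* U+00F8 */
    { "\xC3\xA5", "aa" }, /* U+00E5 */
    { "\xC3\x86", "Ae" }, /* U+00C6 */
    { "\xC3\x98", "Oe" }, /* U+00D8 */
    { "\xC3\x85", "Aa" }, /* U+00C5 */
};

/* Copies src to dst, replacing every mapped letter. Bytes that are not
   in the table are copied unchanged. Returns 0 on success, or -1 if
   dst_size is too small for the result and its terminating NUL. */
int transliterate(const char *src, char *dst, size_t dst_size)
{
    const size_t n = sizeof danish_map / sizeof danish_map[0];
    size_t out = 0;

    if (dst_size == 0)
        return -1;
    while (*src != '\0') {
        const char *rep = NULL;
        size_t in_len = 1;
        for (size_t i = 0; i < n; i++) {
            size_t k = strlen(danish_map[i].utf8);
            if (strncmp(src, danish_map[i].utf8, k) == 0) {
                rep = danish_map[i].ascii;
                in_len = k;
                break;
            }
        }
        size_t rep_len = rep ? strlen(rep) : 1;
        if (out + rep_len + 1 > dst_size)
            return -1;
        if (rep)
            memcpy(dst + out, rep, rep_len);
        else
            dst[out] = *src;
        out += rep_len;
        src += in_len;
    }
    dst[out] = '\0';
    return 0;
}
```

One limitation of any such scheme is that two distinct source identifiers can collide after transliteration, so a real tool would need to detect that.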

 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u. The D specification limits characters in identifiers
 to letters, digits, and '_', but does not even define what a letter is. The
 DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].

It is defined in the library. :>

 I find this unfortunate, and in contrast to one of the main goals of D:
 link compatibility with C.

I have not seen a single piece of code using this silly feature. Is there any
programmer's editor which has \u Unicode support as of yet? And any IDE?

I would also like to see how many compilers implement that, and in what manner.
Even if some do, each implementation would probably be incompatible with those
of other compilers. So would you say C violates the requirement of link
compatibility with itself as well? :>

-i.

May 28 2003

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:

"Ilya Minkov" <Ilya_member pathlink.com> wrote in message
news:bb2cup$in4$1 digitaldaemon.com...
 Hello, i believe there was a flamewar to this topic a few months ago, starting
 from an old 1st april joke article from Bjarne Stroustrup about adding unicode
 identifiers to C++.

I don't want to get into a flamewar, and I don't want to argue against your
preferences for using english. My point is simply that sometimes it is not a
choice one can make. For example, if you are supplied with libraries using
unicode identifiers, that you are required to use. If it is necessary to
wrap such functions in other C code, D cannot be said to be link compatible
with C (C99). Likewise, you might also be required to implement an interface
using such identifiers.


 I have not seen a single piece of code using this silly feature.

That is not really an argument. The feature exists, and will get support from
compilers as time goes by. Silly or not, compilers cannot be said to be C99
compliant if they do not support it. Any serious compiler vendor will go in
that direction. And some will use this feature; there must have been a
reason for its introduction.


 Is there any programmer's editor which has \u unicode support as of yet?
 And any IDE?

They don't have to, as I read the document. They only need to support
editing unicode. Translation phase 1 is:

 "Physical source file multibyte characters are mapped to the source
character set (introducing new-line characters for end-of-line indicators)
if necessary. Trigraph sequences are replaced by corresponding
single-character internal representation."

I believe this is also how DMD does things (except the trigraph stuff):
it maps Unicode characters to \u-sequences, that is.


 I would also like to see how many compilers implement that - and in what
 manner.

I don't know if the ABI is completely standardized, but the translation
limits chapter gives me a clue how it is to be done:

"31 significant initial characters in an external identifier (each universal
character name specifying a character short identifier of 0000FFFF or less
is considered 6 characters, each universal character name specifying a
character short identifier of 00010000 or more is considered 10 characters,
and each extended source character is considered the same number of
characters as the corresponding universal character name, if any)"

The numbers 6 and 10 indicate to me that they will be encoded using
"\uXXXX" and "\UXXXXXXXX" or something very similar. But that is only a
guess.
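
The counting rule in the quoted limit can be sketched in C as follows. The assumptions are mine: code points below 0x80 stand in for basic characters and count as 1, and a character that does not fit entirely within the limit is treated as not significant (the quoted text does not say what happens in that case). Function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Cost of one character toward the significance limit, following the
   quoted rule: 6 for code points up to 0xFFFF, 10 above that.
   Assumption: code points below 0x80 are basic characters costing 1. */
static unsigned ident_char_cost(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    if (cp <= 0xFFFF)
        return 6;
    return 10;
}

/* Returns how many leading code points of an identifier fit within
   `limit` counted characters (31 for external identifiers in the
   quoted text). A character that only partly fits is excluded. */
size_t significant_prefix(const uint32_t *cps, size_t count, unsigned limit)
{
    unsigned used = 0;
    size_t i = 0;
    while (i < count) {
        unsigned cost = ident_char_cost(cps[i]);
        if (cost > limit - used)
            break;
        used += cost;
        i++;
    }
    return i;
}
```

Under these assumptions an external identifier written entirely in letters such as 'æ' is only guaranteed five significant characters (5 x 6 = 30), which shows how tight the minimum limit becomes for non-ASCII names.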


Regards,
Martin M. Pedersen

May 28 2003

Bill Cox <bill viasic.com> writes:

Hi, Ilya.

 I have not seen a single piece of code using this silly feature. Is there any
 programmer's editor which has \u unicode support as of yet? And any IDE?

The latest version of Vim supports UTF-8.  However, it requires a kernel
patch that isn't in RedHat 7.3.  It is supposed to be in 8.0 onward.  It also
doesn't work in the last version of Cygwin I installed.  Anyone know how
UTF support is coming along in emacs?

Bill

May 28 2003

Bill Cox <bill viasic.com> writes:

Err.... I read your post a little more carefully...  I don't know of any 
programming editors directly supporting the \u and \U features of C.

May 28 2003

Georg Wrede <Georg_member pathlink.com> writes:

In article <bb0sqs$1t1k$1 digitaldaemon.com>, Martin M. Pedersen says...
 It has previously been argued that only English should be used for
 identifiers in order to support reuse better across language boundaries. But
 that argument isn't always valid. For example, half a decade ago, I was
 involved in building the IT infrastructure for a nationwide real estate
 network. One of the requirements was that *everything* was in Danish. It
 involved lots of developers nationwide, but no one outside Denmark. Of
 course, identifiers couldn't be fully Danish, and that introduced
 inconsistency in how things were named.

Back in the bad old days, before MSDOS, we all used CP/M.
There was this Nationalist project in Finland, with the
goal of translating all operating system commands to 
Finnish, or Finnish abbreviations. Ostensibly this would
be easier on people.

Turned out nobody wanted to use or learn the Finnish 
version. Their explanation: since these commands are 
"new words" to you anyway, the least of your troubles is
the spelling. Compared with trying to grasp the meaning
of these new concepts the spelling is a non-issue. And
if you then have to use a non Finnish version, you're
totally lost.

Sure, D code written in Chinese would be more compact,
maybe even more legible (in an absolute sense), with
its one character variable names and method names.
Maybe even parentheses and plus signs could be in
Chinese equivalents. But I don't believe they'd want it.

Most Finnish companies have a policy where all program
code and comments have to be in English. Even in those 
companies where the programmers and staff speak hardly
any English at all.

May 28 2003

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:

"Georg Wrede" <Georg_member pathlink.com> wrote in message
 Most Finnish companies have a policy where all program
 code and comments have to be in English. Even in those
 companies where the programmers and staff speak hardly
 any English at all.

So do we. Yet there are exceptions. If the customer pays us to develop and
deliver source code, it is his requirements that count, not our policy.

Regards,
Martin M. Pedersen

May 28 2003

"Walter" <walter digitalmars.com> writes:

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
news:bb2l7u$s2i$1 digitaldaemon.com...
 So do we. Yet there are exceptions. If the customer pays us to develop and
 deliver source code, it is his requirements that count, not our policy.

Yup. Listen to the customers, not the marketing department <g>.

May 28 2003

Mark Evans <Mark_member pathlink.com> writes:

I agree that D is too English-centric (even ASCII-centric).

Concern about C99 link compatibility leads me to reflect on C99's boolean type:

http://www.uic.edu/classes/mcs/mcs494/f01/transparencies/sec8.4.pdf

Mark

May 28 2003

Mark Evans <Mark_member pathlink.com> writes:

Actually I still think that link compatibility with Digital Mars C++ would be a
huge win for D.  C++ also has a bool type.

Mark

May 28 2003

Mark T <Mark_member pathlink.com> writes:

 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u. The D specification limits characters in identifiers
 to letters, digits, and '_', but does not even define what a letter is. The
 DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].

I don't think there is a full implementation of C99 yet. It was adopted in late
1999.  Maybe some of this stuff will disappear due to lack of use. Did ISO sack
the trigraph crap from C89/C90?

May 29 2003

"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:

"Mark T" <Mark_member pathlink.com> wrote in message
news:bb6710$1v5d$1 digitaldaemon.com...
 I don't think there is a full implementation of C99 yet. It was adopted in late
 1999.  Maybe some of this stuff will disappear due to lack of use. Did ISO sack
 the trigraph crap from C89/C90?

No, trigraphs are still there.


Regards,
Martin M. Pedersen

May 30 2003
