digitalmars.D - suggestion: clean white space / end of line definition
- Thomas Kuehne (20/68) Oct 28 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Walter Bright (3/57) Oct 30 2006 Is it really worth doing all that?
- Thomas Kuehne (49/94) Oct 31 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Walter Bright (4/4) Oct 31 2006 There is a problem though with replacing it all with a function - lexing...
- Thomas Kuehne (140/144) Nov 01 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Georg Wrede (40/41) Nov 02 2006 (Apologies in advance, and totally ignoring the good code, standards
- Thomas Kuehne (25/48) Nov 03 2006 -----BEGIN PGP SIGNED MESSAGE-----
Current definition (http://www.digitalmars.com/d/lex.html):

EndOfLine:
    \u000D
    \u000A
    \u000D \u000A
    EndOfFile

WhiteSpace:
    Space
    Space WhiteSpace

Space:
    \u0020
    \u0009
    \u000B
    \u000C

DMD's frontend however doesn't strictly conform to those definitions.

doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
html.c:351: \u000D and \u000A are treated as spaces too
html.c:683: \u00A0 is treated as a space only if it was encountered via an HTML entity
inifile.c:264: \u000D and \u000A are treated as spaces too
lexer.c:2360: \u000B and \u000C aren't treated as spaces
lexer.c: treats \u2028 and \u2029 as line separators too

The oddest case is entity.c:577: "\ " is treated as "\u0020" instead of "\u00A0".

suggested definition:

EndOfLine:
    Unicode(all non-tailorable Line Breaking Classes causing a line break)
    EndOfFile

WhiteSpace:
    Space
    Space WhiteSpace

Space:
    ( Unicode(General_Category == Space_Separator)
      || Unicode(Bidi_Class == Segment_Separator)
      || Unicode(Bidi_Class == Whitespace) )
    && !EndOfLine

this expands to:

EndOfLine:
    000A        // LINE FEED
    000B        // LINE TABULATION
    000C        // FORM FEED
    000D        // CARRIAGE RETURN
    000D 000A   // CARRIAGE RETURN followed by LINE FEED
    0085        // NEXT LINE
    2028        // LINE SEPARATOR
    2029        // PARAGRAPH SEPARATOR

Space:
    Unicode(General_Category == Space_Separator) && !EndOfLine
        0020        // SPACE
        00A0        // NO-BREAK SPACE
        1680        // OGHAM SPACE MARK
        180E        // MONGOLIAN VOWEL SEPARATOR
        2000..200A  // EN QUAD..HAIR SPACE
        202F        // NARROW NO-BREAK SPACE
        205F        // MEDIUM MATHEMATICAL SPACE
        3000        // IDEOGRAPHIC SPACE
    Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
        0009        // CHARACTER TABULATION
        001F        // INFORMATION SEPARATOR ONE
    Unicode(Bidi_Class == Whitespace) && !EndOfLine
        <all part of the Space_Separator listing>

Thomas
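(For illustration only: the expanded tables above written as two C predicates over an already decoded code point. The names isEndOfLine and isSpace are illustrative, not identifiers from the DMD sources.)

    /* End of line per the suggested definition; the CR LF pair is one
       logical end of line and is handled by whoever consumes the stream. */
    static int isEndOfLine(unsigned int c)
    {
        switch (c)
        {
            case 0x000A:    /* LINE FEED */
            case 0x000B:    /* LINE TABULATION */
            case 0x000C:    /* FORM FEED */
            case 0x000D:    /* CARRIAGE RETURN */
            case 0x0085:    /* NEXT LINE */
            case 0x2028:    /* LINE SEPARATOR */
            case 0x2029:    /* PARAGRAPH SEPARATOR */
                return 1;
            default:
                return 0;
        }
    }

    /* Space per the suggested definition: Space_Separator plus the two
       Segment_Separator characters, with anything in EndOfLine excluded. */
    static int isSpace(unsigned int c)
    {
        if (isEndOfLine(c))
            return 0;
        if (c >= 0x2000 && c <= 0x200A)     /* EN QUAD .. HAIR SPACE */
            return 1;
        switch (c)
        {
            case 0x0020:    /* SPACE */
            case 0x00A0:    /* NO-BREAK SPACE */
            case 0x1680:    /* OGHAM SPACE MARK */
            case 0x180E:    /* MONGOLIAN VOWEL SEPARATOR */
            case 0x202F:    /* NARROW NO-BREAK SPACE */
            case 0x205F:    /* MEDIUM MATHEMATICAL SPACE */
            case 0x3000:    /* IDEOGRAPHIC SPACE */
            case 0x0009:    /* CHARACTER TABULATION */
            case 0x001F:    /* INFORMATION SEPARATOR ONE */
                return 1;
            default:
                return 0;
        }
    }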
Oct 28 2006
Thomas Kuehne wrote:
> DMD's frontend however doesn't strictly conform to those definitions.
>
> doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
> html.c:351: \u000D and \u000A are treated as spaces too
> html.c:683: \u00A0 is treated as a space only if it was encountered via an HTML entity
> inifile.c:264: \u000D and \u000A are treated as spaces too
> lexer.c:2360: \u000B and \u000C aren't treated as spaces
> lexer.c: treats \u2028 and \u2029 as line separators too
>
> The oddest case is entity.c:577: "\ " is treated as "\u0020" instead of "\u00A0".

Thanks, I'll try to get those fixed.

> suggested definition:
> <snip>

Is it really worth doing all that?
Oct 30 2006
Walter Bright wrote on 2006-10-31:
> Thomas Kuehne wrote:
>> suggested definition:
>> <snip>
> Is it really worth doing all that?

What is actually changing for EndOfLine?

000A  new
000B  formerly white space
000C  formerly white space
0085  new
2028  implemented but undocumented
2029  implemented but undocumented

\v and \f were probably defined as white space due to C's isspace. Please note, however, that \r and \n are recognised by isspace too. Implementing 2028 and 2029 seems implicit due to the use of UTF encodings.

All the different line endings can be converted to '\n' for non-UTF-8 D files in Module::parse. UTF-8 encoded HTML sources can use a similar approach in html.c (GDC currently uses an isLineSeperator there). UTF-8 encoded D files would require support at lexer.c: 303, 709, 763, 835, 1113, 1301, 1375, 1457, 1520, 1520, 2258, 2272, 2386.

The alternative and more robust solution would be a 'new line cleanup' at module.c:485 and a goto from module.c:523. This way, all the '\r', LS and PS tests sprinkled around lexer.c and html.c could be removed.

In my opinion the EndOfLine change is well worth it.

The SPACE change was prompted by the broken 00A0 (NO-BREAK SPACE) kludges in html.c and entity.c. The issue isn't that the idea was bad but that the reasons weren't laid out properly. If 00A0 is to be considered a SPACE, then why 00A0 and not character foo-bar? At least the 2000..200A range will become the same problem that 00A0 originally was.

Using the Unicode standard as the reference would direct all further debates about whether a character is a space to the Unicode consortium and keep D out of potentially lengthy debates. Changes would be required somewhere around lexer.c: 490, 1331, 2218, 2368, 2375, 2404.

Using a function like

    // returns NULL or end of white space
    char* isUniSpace(char*)

would also clean up white space parsing. lexer.c currently tests for '\t' on 6 occasions, 7 times for ' ', and only 3 times each for '\f' and '\v'.

Thomas
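(A minimal sketch of the 'new line cleanup' idea, assuming the source has already been decoded to UTF-32; cleanupNewlines is an illustrative name, not the actual routine at module.c:485.)

    #include <stddef.h>

    /* Normalise every line ending in the proposed EndOfLine set to a single
       '\n'.  Returns the new length; the buffer can only shrink, because
       CR LF collapses into one character. */
    static size_t cleanupNewlines(unsigned int *buf, size_t len)
    {
        size_t src = 0;
        size_t dst = 0;

        while (src < len)
        {
            unsigned int c = buf[src++];
            switch (c)
            {
                case 0x000D:                    /* CR, maybe followed by LF */
                    if (src < len && buf[src] == 0x000A)
                        src++;
                    /* fall through */
                case 0x000B:                    /* LINE TABULATION */
                case 0x000C:                    /* FORM FEED */
                case 0x0085:                    /* NEXT LINE */
                case 0x2028:                    /* LINE SEPARATOR */
                case 0x2029:                    /* PARAGRAPH SEPARATOR */
                    buf[dst++] = 0x000A;
                    break;
                default:
                    buf[dst++] = c;
                    break;
            }
        }
        return dst;
    }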
Oct 31 2006
There is a problem though with replacing it all with a function - lexing speed. Lexing speed is critically dependent on being able to consume whitespace fast, hence all the inline code to do it. Running the source through two passes makes it half as fast.
Oct 31 2006
Walter Bright wrote on 2006-11-01:
> There is a problem though with replacing it all with a function - lexing speed. Lexing speed is critically dependent on being able to consume whitespace fast, hence all the inline code to do it. Running the source through two passes makes it half as fast.

Here is a faster mock-up (untested!) using functions. Use of macros is certainly possible too.

Thomas
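(The mock-up itself is not reproduced here. What follows is only a rough sketch of the approach being discussed: an ASCII fast path for the common blanks, with a function-based slow path for multi-byte UTF-8 white space. All names and details are illustrative assumptions, not the code from the original post.)

    #include <stddef.h>

    /* Decode one UTF-8 sequence (2- and 3-byte forms cover all the
       white-space characters involved); returns the code point and sets
       *len to the number of bytes consumed, or 0 if the sequence is invalid. */
    static unsigned int decodeUtf8(const unsigned char *p, size_t *len)
    {
        if (p[0] < 0x80)
        {
            *len = 1;
            return p[0];
        }
        if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80)
        {
            *len = 2;
            return ((unsigned int)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        }
        if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80)
        {
            *len = 3;
            return ((unsigned int)(p[0] & 0x0F) << 12)
                 | ((unsigned int)(p[1] & 0x3F) << 6)
                 | (p[2] & 0x3F);
        }
        *len = 0;
        return 0;
    }

    /* returns NULL or end of white space (the prototype suggested earlier);
       only the non-ASCII Space characters from the proposed table are tested */
    static char *isUniSpace(char *p)
    {
        size_t len;
        unsigned int c = decodeUtf8((const unsigned char *)p, &len);

        if (len == 0)
            return NULL;
        if (c == 0x00A0 || c == 0x1680 || c == 0x180E
            || (c >= 0x2000 && c <= 0x200A)
            || c == 0x202F || c == 0x205F || c == 0x3000)
            return p + len;
        return NULL;
    }

    /* Skip a run of white space: a cheap inline-style test handles the
       common ASCII blanks, and only bytes >= 0x80 pay for the UTF-8 decode.
       Line-ending characters are deliberately left to the caller, which has
       to count lines anyway. */
    static char *skipWhitespace(char *p)
    {
        for (;;)
        {
            unsigned char c = (unsigned char)*p;

            if (c == ' ' || c == '\t' || c == 0x1F)
            {
                p++;                        /* fast path: single-byte blanks */
            }
            else if (c >= 0x80)
            {
                char *end = isUniSpace(p);  /* slow path: multi-byte UTF-8 */
                if (!end)
                    return p;
                p = end;
            }
            else
            {
                return p;                   /* anything else ends the run */
            }
        }
    }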
Nov 01 2006
Thomas Kuehne wrote:
> Here is a faster mock-up

(Apologies in advance, and totally ignoring the good code, standards compliance and some other good things.) I have to ask: is this a Good Thing?

Admittedly, not having thought through this issue myself, all I have is a gut feeling. But that gut feeling says that source code (especially in a systems language in the C family) should strive to hinder all kinds of Funny Stuff from entering the toolchain. Accepting "foreign" characters within strings (and possibly even in comments) is OK, but having them in the source code itself, that's my issue here.

We can already have variable names in D written in Afghan and Negro-Potamian, which I definitely don't consider a good idea. If we were to follow this line of thought, then the next thing we know somebody might demand to have all D keywords translated to every single language in the bushes, and to have the compiler accept them as Equal synonyms to the Original Keywords. (This actually happened with the CP/M operating system in Finland in the early eighties! You don't want to hear the whole story.)

What will this do to cross-cultural study, reuse, and copying of example code? Won't it eventually compartmentalize most all code written outside of the Anglo-Centric world? That is, alienate it from us, but also from each of the other cultures too.

And who says parentheses and operators should only be the ones you need a Western keyboard to type? I bet there are cultures that use (or will insist on using, once the rumour is it's possible) some preposterous ink blots instead, for example.

And the next thing of course would be the idiot Humanists who'd demand that a non-breaking space really has to be equal to the underscore, "for people think in words, and subjecting humans to CamelCase or under_scored names constitutes deplorable Oppression". And this kind of people refuse to see the [to us] obvious horrible ramifications of it.

And this I wrote in spite of my mother tongue needing non-ASCII characters in every single sentence. But, as I said at the outset, this is just a gut feeling, so I'm not pressing the issue as if it were something I'd analyzed through-and-through.

---

Now, what is obvious, however, is that the current compiler *should* be consistent with whitespace and the like, instead of haphazardly enumerating some of them each time. No argument there.
Nov 02 2006
Georg Wrede wrote on 2006-11-02:
> <snip>
> If we were to follow this line of thought, then the next thing we know somebody might demand to have all D keywords translated to every single language in the bushes, and to have the compiler accept them as Equal synonyms to the Original Keywords.
> <snip>

Keywords are a few "magic" words; teaching those doesn't require any knowledge of the natural language they were taken from. I definitely agree with your view on keywords.

The rest however ... sounds like a typical culture-centric view. Forcing everyone - especially beginners and non-IT people - to use English isn't a viable solution. No, transliteration doesn't cut it:

    mama ma ma ma

Hint: this is a variant of a Chinese-language joke and involves 4 different characters.

In addition, there are quite a few words and concepts that have no English equivalent. For simplicity's sake let's use an ASCII-representable German word: "Heimat"

* home - too narrow
* native country - quite often wrong

> What will this do to cross-cultural study, reuse, and copying of example code? Won't it eventually compartmentalize most all code written outside of the Anglo-Centric world? That is, alienate it from us, but also from each of the other cultures too.

That's what coding standards are for. The same reuse issue exists for C/C++ and the preprocessor, and it seems to work reasonably well.

Thomas
Nov 03 2006