digitalmars.D - Newline character set in the D lexer

digitalmars.D - Newline character set in the D lexer - NEL

Cecil Ward <cecil cecilward.com> writes:

Would there be any benefit from the following suggestion? Add the 
Unicode character NEL (U+0085) to the set of EndOfLine characters 
in the lexer?

Cecil Ward.

Aug 30 2020

Dominikus Dittes Scherkl <dominikus scherkl.de> writes:

On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
 Would there be any benefit from the following suggestion? Add 
 the character Unicode NEL U+0085 into the set of EndOfLine 
 characters in the lexer ?

 Cecil Ward.

I personally think we should have these definitions:

              /*  NUL    EM    SUB */
EndOfFile   = { 0x00 | 0x19 | 0x1A | PhysicalEndOfFile };

              /*  LF     FF     CR      CR LF     NEL     LSEP     PSEP  */
EndOfLine   = { 0x0A | 0x0C | 0x0D | 0x0D 0x0A | 0x85 | 0x2028 | 0x2029 | EndOfFile };

              /*  HT     VT     SP    NBSP    NQSP     MQSP     ENSP     EMSP     3/MSP */
WhiteSpace  = { 0x09 | 0x0B | 0x20 | 0xA0 | 0x2000 | 0x2001 | 0x2002 | 0x2003 | 0x2004
              /*  4/MSP    6/MSP     FSP      PSP     THSP      HSP     ZWSP     NNBSP */
               | 0x2005 | 0x2006 | 0x2007 | 0x2008 | 0x2009 | 0x200A | 0x200B | 0x202F
              /*  MMSP      WJ      IDSP    ZWNBSP */
               | 0x205F | 0x2060 | 0x3000 | 0xFEFF | EndOfLine };

The definition of D source files misses quite a lot of them :-(

EM = end of medium (what, if not this, should end a file?!?)
NEL = Next Line
LSEP = Line Separator
PSEP = Paragraph Separator

NBSP = non-breaking space
NQSP = ENSP = N-wide space
MQSP = EMSP = M-wide space
3/MSP = 1/3 M-wide space (three spaces together are as wide as an M)
4/MSP = 1/4 M-wide space
6/MSP = 1/6 M-wide space
FSP = figure space
PSP = punctuation space
THSP = thin space
HSP = hair space
ZWSP = zero width space
NNBSP = narrow non-breaking space
MMSP = medium mathematical space
WJ = word joiner (zero-width character that prevents a line break 
between the characters on either side of it)
IDSP = ideographic space (same width as a Chinese character)
ZWNBSP = zero-width non-breaking space
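
To make the proposed sets concrete, here is a minimal sketch of a line splitter built on them. It is written in Python for brevity, not in the D of the actual lexer, and it is not how dmd does it. The names END_OF_FILE, END_OF_LINE, WHITE_SPACE and split_lines are made up for illustration; only the code points come from the definitions above.

```python
# Hypothetical names; the code points are the ones proposed in the post above.
END_OF_FILE = frozenset({0x00, 0x19, 0x1A})                      # NUL, EM, SUB
END_OF_LINE = frozenset({0x0A, 0x0C, 0x0D, 0x85, 0x2028, 0x2029})  # LF FF CR NEL LSEP PSEP
WHITE_SPACE = frozenset({
    0x09, 0x0B, 0x20, 0xA0,
    *range(0x2000, 0x200C),          # U+2000 .. U+200B inclusive
    0x202F, 0x205F, 0x2060, 0x3000, 0xFEFF,
})


def split_lines(source: str) -> list[str]:
    """Split source into lines using the proposed EndOfLine/EndOfFile sets."""
    lines = []
    start = 0
    i = 0
    n = len(source)
    while i < n:
        cp = ord(source[i])
        if cp in END_OF_FILE:
            break                    # everything after an EndOfFile character is ignored
        if cp in END_OF_LINE:
            lines.append(source[start:i])
            # CR LF counts as a single line terminator
            if cp == 0x0D and i + 1 < n and source[i + 1] == "\n":
                i += 1
            start = i + 1
        i += 1
    # EndOfFile (explicit or physical) also terminates the last line
    lines.append(source[start:i])
    return lines
```

For example, split_lines("a\r\nb\u0085c") treats both CR LF and NEL as one terminator each and yields three lines.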

Aug 31 2020

Cecil Ward <cecil cecilward.com> writes:

I agree with Dominikus

Note to earlier poster: NEL was used, and just possibly may still 
be used, by IBM mainframe users; XML 1.1 understands NEL, IIRC.

see https://www.w3.org/TR/newline/   and

        https://www.w3.org/International/questions/qa-controls

Sep 03 2020

NilsLankila <NilsLankila gmx.us> writes:

On Friday, 4 September 2020 at 00:48:59 UTC, Cecil Ward wrote:
 I agree with Dominikus

 Note to earlier poster: NEL was used and just possibly may 
 still be used by IBM mainframe users; XML 1.1 understands NEL 
 iirc;

 see https://www.w3.org/TR/newline/   and

        https://www.w3.org/International/questions/qa-controls

Given the lack of answers, I would suggest going ahead with a PR 
or at least opening an issue. Lexing is not a big deal, but if nobody 
cares this will never be done.

Sep 03 2020

Cecil Ward <cecil cecilward.com> writes:

On Friday, 4 September 2020 at 05:28:47 UTC, NilsLankila wrote:
 Given the lack of answers I would suggest to go ahead with a PR 
 or at least open an issue. Lexing is not a big deal but if 
 nobody cares this will never be done.

Agreed, Nils. Mind you someone cared enough to include U+2028 and 
U+2029 in the lexer spec.

I have no idea how to initiate a "PR". Perhaps someone could help 
me with this?

Sep 07 2020

James Blachly <james.blachly gmail.com> writes:

PR = "Pull Request".

The easy way is to fork the project on GitHub, clone your fork of the 
project, make your changes, and push them back. This could be in ~master 
on your own fork, or ideally in a separate branch.

Then on GitHub, go to the original project and start a new pull request. 
It should automagically detect that you've made changes (again ideally 
in a branch of your fork), and offer to make a pull request with your 
changes against ~master (or whatever is set as the default branch for 
the project).

James

On 9/8/20 2:42 AM, Cecil Ward wrote:
 On Friday, 4 September 2020 at 05:28:47 UTC, NilsLankila wrote:
 Given the lack of answers I would suggest to go ahead with a PR or at 
 least open an issue. Lexing is not a big deal but if nobody cares this 
 will never be done.

 
 Agreed, Nils. Mind you someone cared enough to include U+2028 and U+2029 
 in the lexer spec.
 
 I have no idea how to initiate a "PR". Perhaps someone could help me 
 with this?

Sep 08 2020

Nils Lankila <NilsLankila gmx.us> writes:

On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
 Would there be any benefit from the following suggestion? Add 
 the character Unicode NEL U+0085 into the set of EndOfLine 
 characters in the lexer ?

 Cecil Ward.

Pardon me, but why bother when ASCII already gives us all we need for 
spaces and new lines, with fast decoding (< 80h)?

Aug 31 2020

Paul Backus <snarwin gmail.com> writes:

On Monday, 31 August 2020 at 09:39:12 UTC, Nils Lankila wrote:
 On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
 Would there be any benefit from the following suggestion? Add 
 the character Unicode NEL U+0085 into the set of EndOfLine 
 characters in the lexer ?

 Cecil Ward.

 Pardon me but why bother while ascii gives already all we need 
 to put spaces and new lines with fast decode (< 80h) ?

D already recognizes some non-ascii characters as spaces and line 
separators [1], so the decision to "bother" has already been made.

[1] https://dlang.org/spec/lex.html#character_set
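
The cost concern can be made concrete with a small sketch. It assumes valid UTF-8 input and is written in Python for illustration only; the function name count_line_terminators and the constants are hypothetical, and this is not taken from dmd. Bytes below 80h stay on the fast path, and the multi-byte terminators (NEL, plus the U+2028 and U+2029 already mentioned in this thread) are only checked when a non-ASCII byte is seen.

```python
# UTF-8 encodings of the non-ASCII line terminators discussed in the thread.
NEL = "\u0085".encode("utf-8")    # two bytes:   C2 85
LSEP = "\u2028".encode("utf-8")   # three bytes: E2 80 A8
PSEP = "\u2029".encode("utf-8")   # three bytes: E2 80 A9


def count_line_terminators(data: bytes) -> int:
    """Count LF, CR, CR LF, NEL, LSEP and PSEP in UTF-8 encoded source."""
    count = 0
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Fast path: plain ASCII, no decoding needed.
            if b == 0x0A:
                count += 1
            elif b == 0x0D:
                count += 1
                if i + 1 < n and data[i + 1] == 0x0A:
                    i += 1        # CR LF is a single terminator
            i += 1
        elif data.startswith(NEL, i):
            count += 1
            i += len(NEL)
        elif data.startswith(LSEP, i) or data.startswith(PSEP, i):
            count += 1
            i += len(LSEP)
        else:
            i += 1                # some other non-ASCII byte, not a terminator
    return count
```

In this sketch, supporting NEL adds one extra comparison on the non-ASCII branch only; pure-ASCII source takes exactly the same path as before.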

Aug 31 2020
