digitalmars.D.learn - Why is BOM required to use unicode in tokens?


James Blachly <james.blachly gmail.com> writes:

I wish to write a function including ∂x and ∂y (these are trivial to 
type with appropriate keyboard shortcuts - alt+d on Mac), but without a 
unicode byte order mark at the beginning of the file, the lexer rejects 
the tokens.

It is not apparently easy to insert such marks (AFAICT no common tool 
does this specifically), while other languages work fine (i.e., accept 
unicode in their source) without it.

Is there a downside to at least presuming UTF-8?

Sep 14 2020

Paul Backus <snarwin gmail.com> writes:

On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
wrote:
 I wish to write a function including ∂x and ∂y (these are 
 trivial to type with appropriate keyboard shortcuts - alt+d on 
 Mac), but without a unicode byte order mark at the beginning of 
 the file, the lexer rejects the tokens.

 It is not apparently easy to insert such marks (AFAICT no 
 common tool does this specifically), while other languages work 
 fine (i.e., accept unicode in their source) without it.

 Is there a downside to at least presuming UTF-8?

According to the spec [1] this should Just Work. I'd recommend 
filing a bug.

[1] https://dlang.org/spec/lex.html#source_text

Sep 14 2020

Jon Degenhardt <jond noreply.com> writes:

On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
 On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
 wrote:
 I wish to write a function including ∂x and ∂y (these are 
 trivial to type with appropriate keyboard shortcuts - alt+d on 
 Mac), but without a unicode byte order mark at the beginning 
 of the file, the lexer rejects the tokens.

 It is not apparently easy to insert such marks (AFAICT no 
 common tool does this specifically), while other languages 
 work fine (i.e., accept unicode in their source) without it.

 Is there a downside to at least presuming UTF-8?

 According to the spec [1] this should Just Work. I'd recommend 
 filing a bug.

 [1] https://dlang.org/spec/lex.html#source_text

Under the identifiers section 
(https://dlang.org/spec/lex.html#identifiers) it describes 
identifiers as:

 Identifiers start with a letter, _, or universal alpha, and are 
 followed by any number of letters, _, digits, or universal 
 alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) 
 Appendix D of the C99 Standard.

I was unable to find the definition of a "universal alpha", or 
whether that includes non-ascii alphabetic characters.

Sep 14 2020

Dominikus Dittes Scherkl <dominikus scherkl.de> writes:

On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt 
wrote:
 On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus 
 wrote:
 Identifiers start with a letter, _, or universal alpha, and 
 are followed by any number of letters, _, digits, or universal 
 alphas. Universal alphas are as defined in ISO/IEC 
 9899:1999(E) Appendix D of the C99 Standard.

 I was unable to find the definition of a "universal alpha", or 
 whether that includes non-ascii alphabetic characters.

ISO/IEC 9899:1999 (E)
Annex D

Universal character names for identifiers
-----------------------------------------

Latin: 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 
0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 
03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 
1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 
1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 
1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 
04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 
05F0-05F2
Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE, 
06D0-06DC, 06E5-06E8, 06EA-06ED
Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 
09B2, 09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 
09DF-09E3, 09F0-09F1
Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 
0A32-0A33, 0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 
0A59-0A5C, 0A5E, 0A74
Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 
0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 
0AD0, 0AE0
Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 
0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D, 
0B5F-0B61
Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 
0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 
0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 
0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 
0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 
0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
Thai: 0E01-0E3A, 0E40-0E5B
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 
0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 
0EB0-0EB9, 0EBB-0EBD, 0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69, 
0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093, 309B-309C
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
CJK Unified Ideographs: 4E00-9FA5
Hangul: AC00-D7A3
Digits: 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 
0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 
0E50-0E59, 0ED0-0ED9, 0F20-0F33
Special characters: 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 
02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 
2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 
212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029

-----------------------

This is outdated to the brim. Also it doesn't allow for 
letter-like symbols (which is debatable, but especially the 
mathematical ones like double-struck letters are intended for 
such use).
Instead of some old C-Standard, D should better rely directly on 
the properties from UnicodeData.txt, which is updated with every 
new unicode version.
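
To make the table above concrete, here is a minimal D sketch that tests a few code points against a handful of the quoted Annex D ranges. The `Range` struct, the `annexDExcerpt` table and the `inExcerpt` helper are names made up for illustration, and the table is only an excerpt, not the full list. U+2202 (∂) does not fall in any range of the quoted table, so it is not a "universal alpha" under this definition.

```d
import std.stdio : writefln;

// Hypothetical helper type: an inclusive code point range.
struct Range { uint lo, hi; }

// A small excerpt of the Annex D table quoted above (assumption: these
// five ranges are enough for the demonstration; the full list is longer).
immutable Range[] annexDExcerpt = [
    Range(0x00D8, 0x00F6), // Latin
    Range(0x03A3, 0x03CE), // Greek
    Range(0x0401, 0x040C), // Cyrillic
    Range(0x040E, 0x044F), // Cyrillic
    Range(0x2118, 0x211D), // Special characters
];

// True if c falls inside any range of the excerpt.
bool inExcerpt(dchar c)
{
    foreach (r; annexDExcerpt)
        if (c >= r.lo && c <= r.hi)
            return true;
    return false;
}

void main()
{
    // Prints the code point, the character, and whether the excerpt covers it.
    foreach (dchar c; "äШδ∂"d)
        writefln("U+%04X %s: %s", cast(uint) c, c, inExcerpt(c));
}
```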

Sep 15 2020

James Blachly <james.blachly gmail.com> writes:

On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote:
 On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
 On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
 Identifiers start with a letter, _, or universal alpha, and are 
 followed by any number of letters, _, digits, or universal alphas. 
 Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of 
 the C99 Standard.

 I was unable to find the definition of a "universal alpha", or whether 
 that includes non-ascii alphabetic characters.

 
 ISO/IEC 9899:1999 (E)
 Annex D
 
 Universal character names for identifiers
 -----------------------------------------

...
 -----------------------
 
 This is outdated to the brim. Also it doesn't allow for letter-like 
 symbols (which is debatable, but especially the mathematical ones like 
 double-struck letters are intended for such use).
 Instead of some old C-Standard, D should better rely directly on the 
 properties from UnicodeData.txt, which is updated with every new unicode 
 version.
 

Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.

What will it take (i.e. order of difficulty) to get this fixed -- will 
merely a bug report (and PR, not sure if I can tackle or not) do it, or 
will this require more in-depth discussion with compiler maintainers?

James

Sep 15 2020

Steven Schveighoffer <schveiguy gmail.com> writes:

On 9/15/20 10:18 AM, James Blachly wrote:
 On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote:
 On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
 On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
 Identifiers start with a letter, _, or universal alpha, and are 
 followed by any number of letters, _, digits, or universal alphas. 
 Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D 
 of the C99 Standard.

 I was unable to find the definition of a "universal alpha", or 
 whether that includes non-ascii alphabetic characters.

 ISO/IEC 9899:1999 (E)
 Annex D

 Universal character names for identifiers
 -----------------------------------------

 ....
 -----------------------

 This is outdated to the brim. Also it doesn't allow for letter-like 
 symbols (which is debatable, but especially the mathematical ones like 
 double-struck letters are intended for such use).
 Instead of some old C-Standard, D should better rely directly on the 
 properties from UnicodeData.txt, which is updated with every new 
 unicode version.

 
 Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.
 
 What will it take (i.e. order of difficulty) to get this fixed -- will 
 merely a bug report (and PR, not sure if I can tackle or not) do it, or 
 will this require more in-depth discussion with compiler maintainers?

I'm thinking your issue will not be fixed (just like we don't allow $abc 
to be an identifier). But the spec can be fixed to refer to the correct 
standards.

-Steve

Sep 15 2020

Jon Degenhardt <jond noreply.com> writes:

On Tuesday, 15 September 2020 at 14:59:03 UTC, Steven 
Schveighoffer wrote:
 On 9/15/20 10:18 AM, James Blachly wrote:
 What will it take (i.e. order of difficulty) to get this fixed 
 -- will merely a bug report (and PR, not sure if I can tackle 
 or not) do it, or will this require more in-depth discussion 
 with compiler maintainers?

 I'm thinking your issue will not be fixed (just like we don't 
 allow $abc to be an identifier). But the spec can be fixed to 
 refer to the correct standards.

Looks like it has to do with the '∂' character. But non-ascii 
alphabetic characters work generally.


$ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } 
void main() { Шä(); }' | dmd -run -
Hello World!


$ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } 
void main() { x∂(); }' | dmd -run -
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token

'Ш' and 'ä' satisfy the definition of a Unicode letter; '∂' does 
not (using D's current Unicode definitions). I'll use tsv-filter 
(from tsv-utils) to show this rather than writing out the full D 
code. tsv-filter uses std.regex.matchFirst() for this check.


$ echo $'x\n∂\nШ\nä'
x
∂
Ш
ä

The input filtered by Unicode letter '\p{L}'
$ echo $'x\n∂\nШ\nä' | tsv-filter --regex 1:'^\p{L}$'
x
Ш
ä

The spec can be made more clear and correct. But if a "universal 
alpha" is essentially about Unicode letters you might be looking 
for a change in the spec to use the symbol chosen.
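
For reference, roughly the same check can be written directly in D with std.regex.matchFirst, the function mentioned above. This is a minimal sketch and not how tsv-filter is implemented; the helper name `isSingleLetter` is made up for illustration. It is intended to mirror the `^\p{L}$` filter shown above.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// Hypothetical helper: true if s consists of exactly one Unicode letter,
// i.e. it matches the same pattern as the tsv-filter command above.
bool isSingleLetter(string s)
{
    auto letter = regex(`^\p{L}$`);
    return !matchFirst(s, letter).empty;
}

void main()
{
    // Print only the inputs that pass the \p{L} filter.
    foreach (s; ["x", "∂", "Ш", "ä"])
        if (isSingleLetter(s))
            writeln(s);
}
```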

--Jon

Sep 15 2020

GK <gleb.tsk gmail.com> writes:

On Tuesday, 15 September 2020 at 16:23:01 UTC, Jon Degenhardt 
wrote:


 $ echo $'import std.stdio; void Шä() { writeln("Hello World!"); 
 } void main() { Шä(); }' | dmd -run -
 Hello World!


 $ echo $'import std.stdio; void x∂() { writeln("Hello World!"); 
 } void main() { x∂(); }' | dmd -run -
 __stdin.d(1): Error: char 0x2202 not allowed in identifier

Yes. The same troubles for widely used Greek symbols (Sigma, 
alpha and some other). Unfortunately...

Sep 18 2020

James Blachly <james.blachly gmail.com> writes:

On 9/15/20 10:59 AM, Steven Schveighoffer wrote:
 Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.

 What will it take (i.e. order of difficulty) to get this fixed -- will 
 merely a bug report (and PR, not sure if I can tackle or not) do it, 
 or will this require more in-depth discussion with compiler maintainers?

 
 I'm thinking your issue will not be fixed (just like we don't allow $abc 
 to be an identifier). But the spec can be fixed to refer to the correct 
 standards.
 
 -Steve

Steve: It sounds as if the spec is correct but the glyph (codepoint?) 
range is outdated. If this is the case, it would be a worthwhile update. 
Do you really think it would be rejected out of hand?

Sep 15 2020

Steven Schveighoffer <schveiguy gmail.com> writes:

On 9/15/20 8:10 PM, James Blachly wrote:
 On 9/15/20 10:59 AM, Steven Schveighoffer wrote:
 Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.

 What will it take (i.e. order of difficulty) to get this fixed -- 
 will merely a bug report (and PR, not sure if I can tackle or not) do 
 it, or will this require more in-depth discussion with compiler 
 maintainers?

 I'm thinking your issue will not be fixed (just like we don't allow 
 $abc to be an identifier). But the spec can be fixed to refer to the 
 correct standards.

 
 Steve: It sounds as if the spec is correct but the glyph (codepoint?) 
 range is outdated. If this is the case, it would be a worthwhile update. 
 Do you really think it would be rejected out of hand?
 

I don't really know the answer, as I'm not a unicode expert.

Someone should verify that the character you want to use for a symbol 
name is actually considered a letter or not. Using phobos to prove this 
is kind of self-defeating, as I'm pretty sure it would be in league with 
DMD if there is a bug.

But if it's not a letter, then it would take more than just updating the 
range. It would be a change in the philosophy of what constitutes an 
identifier name.

-Steve

Sep 15 2020

Dominikus Dittes Scherkl <dominikus scherkl.de> writes:

On Wednesday, 16 September 2020 at 00:22:15 UTC, Steven 
Schveighoffer wrote:

 Someone should verify that the character you want to use for a 
 symbol name is actually considered a letter or not. Using 
 phobos to prove this is kind of self-defeating, as I'm pretty 
 sure it would be in league with DMD if there is a bug.

UnicodeData.txt (a data file provided by the unicode organization 
itself since version 1)

contains exactly the necessary properties (in an easily parsable 
format), so we don't need to hard-code the list of allowed 
identifier characters, but can instead use the latest version 
provided by unicode (changing every year!). We only need to 
define which properties a character needs in order to be allowed 
in an identifier.

Sep 16 2020

Dominikus Dittes Scherkl <dominikus scherkl.de> writes:

On Wednesday, 16 September 2020 at 07:38:26 UTC, Dominikus Dittes 
Scherkl wrote:
 We only need to define which properties a character needs in 
 order to be allowed in an identifier.

I think the following change in the grammar would be sufficient:

Identifier:
     IdentifierStart
     IdentifierStart IdentifierChars

IdentifierChars:
     IdentifierChar
     IdentifierChar IdentifierChars

IdentifierStart:
     _
     Any Unicode codepoint with general category Lu, Ll, Lt, Lo, 
Nl or No

IdentifierChar:
     IdentifierStart
     Any Unicode codepoint with general category Lm, Mn, Me, Mc or 
Nd
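
A rough sketch of how the proposed rules could be checked with Phobos, assuming the general-category sets are reachable through std.uni's `unicode` lookup (e.g. `unicode.Lu`). The `IdentifierRules` struct and its member names are hypothetical, and this is not how DMD's lexer works. Note that ∂ (U+2202) has general category Sm in UnicodeData.txt, which is not among the categories listed above, so it would still not qualify under this grammar.

```d
import std.stdio : writefln;
import std.uni : CodepointSet, unicode;

// Hypothetical holder for the two code point sets of the proposed grammar.
struct IdentifierRules
{
    CodepointSet start; // categories Lu, Ll, Lt, Lo, Nl, No
    CodepointSet rest;  // categories Lm, Mn, Me, Mc, Nd

    // Build both sets once from the general categories named in the proposal.
    static IdentifierRules create()
    {
        IdentifierRules r;
        r.start = unicode.Lu | unicode.Ll | unicode.Lt | unicode.Lo
                | unicode.Nl | unicode.No;
        r.rest = unicode.Lm | unicode.Mn | unicode.Me | unicode.Mc
               | unicode.Nd;
        return r;
    }

    // IdentifierStart: '_' or a code point from the start set.
    bool isStart(dchar c) { return c == '_' || start[c]; }

    // IdentifierChar: IdentifierStart or a code point from the rest set.
    bool isChar(dchar c) { return isStart(c) || rest[c]; }

    // Identifier: one IdentifierStart followed by any number of IdentifierChars.
    bool isIdentifier(string s)
    {
        bool first = true;
        foreach (dchar c; s) // decodes UTF-8 into code points
        {
            if (first ? !isStart(c) : !isChar(c))
                return false;
            first = false;
        }
        return !first; // the empty string is not an identifier
    }
}

void main()
{
    auto rules = IdentifierRules.create();
    foreach (name; ["Шä", "x∂", "_tmp1", "1abc"])
        writefln("%s: %s", name, rules.isIdentifier(name));
}
```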

Sep 16 2020

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 16 September 2020 at 00:22:15 UTC, Steven 
Schveighoffer wrote:
 On 9/15/20 8:10 PM, James Blachly wrote:
 On 9/15/20 10:59 AM, Steven Schveighoffer wrote:
[...]

 
 Steve: It sounds as if the spec is correct but the glyph 
 (codepoint?) range is outdated. If this is the case, it would 
 be a worthwhile update. Do you really think it would be 
 rejected out of hand?
 

 I don't really know the answer, as I'm not a unicode expert.

 Someone should verify that the character you want to use for a 
 symbol name is actually considered a letter or not. Using 
 phobos to prove this is kind of self-defeating, as I'm pretty 
 sure it would be in league with DMD if there is a bug.

I checked, it's not a letter. None of the math symbols are.

 But if it's not a letter, then it would take more than just 
 updating the range. It would be a change in the philosophy of 
 what constitutes an identifier name.

Sep 18 2020

James Blachly <james.blachly gmail.com> writes:

On 9/15/20 8:10 PM, James Blachly wrote:
 Steve: It sounds as if the spec is correct but the glyph (codepoint?) 
 range is outdated. If this is the case, it would be a worthwhile update. 
 Do you really think it would be rejected out of hand?

OK, interestingly, this code point 0x2202 falls within the range 
"mathematical operators" [0], and I could see why in general a range 
called "operators" (which includes e.g. set membership, relations, 
operators you would see in abstract algebra, etc.) would be excluded 
from identifiers. However, the first 8 codepoints in the range are 
"Miscellaneous mathematical symbols" and include several that would be 
appropriately included as/in token names.

Indeed, chapter 22, page 823 of the Unicode standard groups ∂ U+2202 
(the partial differential symbol in question) along with "Basic Set of 
Alphanumeric Characters" that includes Latin 0-9, [a-z,A-Z], uppercase 
Greek A-Ω, nabla and variant theta, the lowercase Greek letters, and 
besides U+2202 ∂, six additional glyph variants.

Because code points are de-duplicated, a character that could rightly 
appear in multiple ranges (like U+2202 ∂) is assigned to only one, and 
that, I think, is the fate that befell this variant delta.

Sep 15 2020

James Blachly <james.blachly gmail.com> writes:

On 9/15/20 8:24 PM, James Blachly wrote:

Again with the self-reply :/

Forgot the reference: 
https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf

Sep 15 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Sep 14, 2020 at 09:49:13PM -0400, James Blachly via Digitalmars-d-learn
wrote:
 I wish to write a function including ∂x and ∂y (these are trivial to
 type with appropriate keyboard shortcuts - alt+d on Mac), but without
 a unicode byte order mark at the beginning of the file, the lexer
 rejects the tokens.
 
 It is not apparently easy to insert such marks (AFAICT no common tool
 does this specifically), while other languages work fine (i.e., accept
 unicode in their source) without it.
 
 Is there a downside to at least presuming UTF-8?

Tested it locally, with and without BOM; the lexer rejects ∂ as a valid
token. I suspect the reason has nothing to do with BOMs, but with the
fact that ∂ is not classified as an alphanumeric (see std.uni.isAlpha,
which returns false for ∂).  The following code, which contains Cyrillic
letters, compiles just fine without BOM (std.uni.isAlpha('Ш') returns
true):

	import std.stdio;

	void main() {
		int Ш = 1;
		writeln(Ш);
	}

As the docs for std.uni.isAlpha state, it tests for general Unicode
category 'Alphabetic'.  Probably identifiers are restricted to
characters of this category plus the numerics and '_' (and maybe one or
two others, perhaps '$'? Don't remember now).
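
The std.uni.isAlpha check mentioned above can be reproduced with a few lines of D. This is only a sketch of the library-level test, not of what the lexer does internally; std.uni.isSymbol is added here as an extra, on the assumption that it helps show how Phobos classifies ∂.

```d
import std.stdio : writefln;
import std.uni : isAlpha, isSymbol;

void main()
{
    // For each character, print its code point and how std.uni classifies it.
    foreach (dchar c; "xШä∂"d)
        writefln("U+%04X %s  isAlpha=%s  isSymbol=%s",
                 cast(uint) c, c, isAlpha(c), isSymbol(c));
}
```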


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG

Sep 14 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
wrote:
 I wish to write a function including ∂x and ∂y (these are

You can use the greek letter delta instead:

δ

Sep 15 2020

starcanopy <starcanopy protonmail.com> writes:

On Tuesday, 15 September 2020 at 21:27:25 UTC, Ola Fosheim 
Grøstad wrote:
 On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
 wrote:
 I wish to write a function including ∂x and ∂y (these are

 You can use the greek letter delta instead:

 δ

Wouldn't that imply a normal differential?

Sep 15 2020

wjoe <invalid example.com> writes:

On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
wrote:
 I wish to write a function including ∂x and ∂y (these are 
 trivial to type with appropriate keyboard shortcuts - alt+d on 
 Mac), but without a unicode byte order mark at the beginning of 
 the file, the lexer rejects the tokens.

 It is not apparently easy to insert such marks (AFAICT no 
 common tool does this specifically), while other languages work 
 fine (i.e., accept unicode in their source) without it.

 Is there a downside to at least presuming UTF-8?

As you probably already know, BOM means byte order mark, so it is 
only relevant for multi-byte encodings (UTF-16, UTF-32). A BOM 
for UTF-8 isn't required, and in fact it's discouraged.

Your editor should automatically insert a BOM if appropriate when 
you save your file. Probably you need to select the appropriate 
encoding for your file. Typically this is available in the 'Save 
as..' dialog, or the settings.
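
To see whether an editor actually wrote a BOM, the first bytes of the file can be inspected. A minimal sketch, assuming a UTF-8 source file: the UTF-8 encoding of the byte order mark U+FEFF is the three bytes EF BB BF. The helper name `hasUtf8Bom` and the command-line handling are made up for illustration.

```d
import std.file : read;
import std.stdio : writeln;

// The UTF-8 encoding of U+FEFF, the byte order mark.
static immutable ubyte[3] utf8Bom = [0xEF, 0xBB, 0xBF];

// Hypothetical helper: true if the raw bytes start with a UTF-8 BOM.
bool hasUtf8Bom(const(ubyte)[] data)
{
    return data.length >= 3 && data[0 .. 3] == utf8Bom[];
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: bomcheck <file>");
        return;
    }
    // Read the whole file as raw bytes, without any decoding.
    auto bytes = cast(const(ubyte)[]) read(args[1]);
    writeln(hasUtf8Bom(bytes) ? "UTF-8 BOM present" : "no UTF-8 BOM");
}
```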

Sep 16 2020
