digitalmars.D - Parsing D files with non-unicode characters
- Arun Chandrasekaran (8/8) Nov 05 2018 I'm converting a large amount of header files from C to D using
- Roland Hadinger (5/11) Nov 05 2018 Just an idea: if you have 'iconv' available, you could always
- Arun Chandrasekaran (5/17) Nov 05 2018 Thanks! Can't we preserve the comments? Comments are invaluable,
- Roland Hadinger (4/7) Nov 05 2018 If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is',
- Jonathan M Davis (9/16) Nov 05 2018 If I understand correctly, non-UTF isn't legal in D source files, so tha...
- Arun Chandrasekaran (17/25) Nov 06 2018 This did the trick. It uses https://github.com/BYVoid/uchardet to
- Arun Chandrasekaran (3/16) Nov 06 2018 s/dstep $file/dstep */g
- Bastiaan Veelo (6/12) Nov 05 2018 I may be missing something, but isn’t it possible to open these
- Roland Hadinger (9/13) Nov 06 2018 Yes. Better text editors are capable of automatically inferring
- Jonathan Marler (5/13) Nov 06 2018 So you have code that has characters that are neither ascii nor
- H. S. Teoh (7/15) Nov 06 2018 [...]
- Arun Chandrasekaran (5/21) Nov 06 2018 I was not able to find the character encoding. file -i said
- Jonathan Marler (10/32) Nov 07 2018 I hadn't seen that you provided a link to the file. After I
- Neia Neutuladh (6/13) Nov 07 2018 It can handle multibyte UTF8 characters without a byte order mark. Shoul...
- Jonathan Marler (6/20) Nov 07 2018 Ok, that matches what I saw in the code. It looks like my editor
- Stanislav Blinov (3/6) Nov 07 2018 https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d#L652
I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215

https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this? Is this a bug in D to reject non-Unicode chars in comments?

Arun
Nov 05 2018
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

Just an idea: if you have 'iconv' available, you could always strip out non-UTF-8 characters beforehand. See:
https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
Nov 05 2018
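The approach in the linked answer boils down to running iconv with UTF-8 as both the source and the target encoding, plus the `-c` flag, which drops anything that cannot be converted. A minimal sketch follows; the function name `strip_invalid_utf8` is made up for illustration and is not part of any tool mentioned in the thread.

```shell
# Hypothetical helper: copy stdin to stdout, silently dropping byte
# sequences that are not valid UTF-8.
strip_invalid_utf8() {
    # -c discards input that cannot be converted. Some iconv builds still
    # exit non-zero when they dropped something, so the status is ignored.
    iconv -c -f UTF-8 -t UTF-8 || true
}
```

It would be used as `strip_invalid_utf8 < header.h > header.clean.h`. Note that this throws the offending bytes away rather than converting them, so comments written in another encoding are lost.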
On Tuesday, 6 November 2018 at 00:38:03 UTC, Roland Hadinger wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>
>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>
> Just an idea: if you have 'iconv' available, you could always strip out non-UTF-8 characters beforehand. See:
> https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using doxygen.
Nov 05 2018
On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
> Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using doxygen.

If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
Nov 05 2018
On Monday, November 5, 2018 6:19:17 PM MST Roland Hadinger via Digitalmars-d wrote:
> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>> Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using doxygen.
>
> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

If I understand correctly, non-UTF isn't legal in D source files, so that's just plain not possible, period. The files will have to be converted to Unicode in order to be D source files, and that applies even to comments. If the characters are legal in some other encoding, then the encoding will need to be correctly detected and converted to Unicode somehow. If they're just invalid, then arguably there really isn't anything to preserve anyway.

- Jonathan M Davis
Nov 05 2018
On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger wrote:
> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>> Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using doxygen.
>
> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

This did the trick. It uses https://github.com/BYVoid/uchardet to determine the character set.

    for dir in $(find <DIR> -name include -type d); do
        pushd $dir
        for file in $(ls); do
            iconv -f $(uchardet $file) -t UTF-8 $file > t
            /bin/mv t $file
        done
        # [...] it back.
        sed -i 's,¥,\\,g' *
        dstep $file
        popd
    done
Nov 06 2018
On Wednesday, 7 November 2018 at 04:43:19 UTC, Arun Chandrasekaran wrote:
> On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger wrote:
>> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>>> Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using doxygen.
>>
>> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
>
> dstep $file

s/dstep $file/dstep */g
Nov 06 2018
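With that correction folded in, the conversion loop could look roughly like the sketch below. This is not the poster's exact script: the function names `reencode_in_place` and `convert_tree` are made up for illustration, bash is assumed, `iconv`, `uchardet`, GNU `sed`, and `dstep` are assumed to be on the PATH, and each `include` directory is assumed to contain only header files. The yen-to-backslash step is kept from the original post, presumably because some iconv builds map the Shift-JIS byte 0x5C to ¥.

```shell
#!/usr/bin/env bash

# Hypothetical helper: re-encode one file to UTF-8 in place.
# $1 = file, $2 = source encoding name as understood by iconv.
reencode_in_place() {
    local file=$1 from=$2 tmp
    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        # Overwrite the contents rather than moving the temp file,
        # so the original file's permissions are kept.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        # Conversion failed: leave the original file untouched.
        rm -f "$tmp"
        return 1
    fi
}

# Hypothetical driver: convert every file in every "include" directory
# below $1, then run dstep on the converted headers.
convert_tree() {
    local root=$1 dir file enc
    find "$root" -type d -name include -print0 |
    while IFS= read -r -d '' dir; do
        (
            cd "$dir" || exit 1
            for file in *; do
                [ -f "$file" ] || continue
                enc=$(uchardet "$file")
                reencode_in_place "$file" "$enc" ||
                    echo "skipped: $dir/$file ($enc)" >&2
            done
            # Turn yen signs back into backslashes, as in the original script.
            sed -i 's,¥,\\,g' ./*
            dstep ./*
        )
    done
}
```

Calling `convert_tree /path/to/project` would process the whole tree. Files whose detected encoding iconv does not recognise are reported on stderr and left unchanged, so it is worth running this on a copy of the code base first.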
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

I may be missing something, but isn't it possible to open these files one by one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?
Nov 05 2018
On Tuesday, 6 November 2018 at 07:25:09 UTC, Bastiaan Veelo wrote:
> I may be missing something, but isn't it possible to open these files one by one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?

Yes. Better text editors are capable of automatically inferring (guessing) the source encoding, although not always correctly. Guessing the source encoding is something 'iconv' cannot do.

I forgot to mention that 'iconv' can also convert text between different encodings, but only when the source encoding is known. When it isn't known, or when a file contains a mix of different encodings, iconv can only be used to filter out byte sequences that are invalid in the target encoding.
Nov 06 2018
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this? Is this a bug in D to reject non-Unicode chars in comments?
>
> Arun

So you have code that has characters that are neither ASCII nor Unicode? What encoding is it using? And what characters does it contain that can't be represented with Unicode?
Nov 06 2018
On Tue, Nov 06, 2018 at 05:21:13PM +0000, Jonathan Marler via Digitalmars-d wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>
>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
> [...]

If you're on Linux, you could use the recode utility:
https://github.com/rrthomas/recode/

T

--
The easy way is the wrong way, and the hard way is the stupid way. Pick one.
Nov 06 2018
On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>
>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this? Is this a bug in D to reject non-Unicode chars in comments?
>>
>> Arun
>
> So you have code that has characters that are neither ASCII nor Unicode? What encoding is it using? And what characters does it contain that can't be represented with Unicode?

At first I was not able to find the character encoding: file -i only said unknown-8bit. Ultimately https://github.com/BYVoid/uchardet helped me to determine the charset. It was SHIFT-JIS.
Nov 06 2018
On Wednesday, 7 November 2018 at 05:33:20 UTC, Arun Chandrasekaran wrote:
> On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler wrote:
>> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>>> I'm converting a large number of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>>
>>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this? Is this a bug in D to reject non-Unicode chars in comments?
>>>
>>> Arun
>>
>> So you have code that has characters that are neither ASCII nor Unicode? What encoding is it using? And what characters does it contain that can't be represented with Unicode?
>
> At first I was not able to find the character encoding: file -i only said unknown-8bit. Ultimately https://github.com/BYVoid/uchardet helped me to determine the charset. It was SHIFT-JIS.

I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM at the beginning, then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.
Nov 07 2018
On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
> I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM at the beginning, then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.

It can handle multibyte UTF-8 characters without a byte order mark. It should be straightforward to test this:

    echo '/* ſ™🅻 */' > file.d
    dmd -c file.d

On dmd 2.081.1, the byte order mark changes nothing for me.
Nov 07 2018
On Wednesday, 7 November 2018 at 17:37:11 UTC, Neia Neutuladh wrote:
> On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
>> I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM at the beginning, then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.
>
> It can handle multibyte UTF-8 characters without a byte order mark. It should be straightforward to test this:
>
>     echo '/* ſ™🅻 */' > file.d
>     dmd -c file.d
>
> On dmd 2.081.1, the byte order mark changes nothing for me.

OK, that matches what I saw in the code. It looks like my editor was actually changing the file when I added the BOM. The encoding used in that file is not valid UTF-8, so the solution is to re-encode the files in UTF-8.
Nov 07 2018
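Both questions that came up here (is a file valid UTF-8, and does it start with a BOM) can be checked from the shell without involving an editor that might silently rewrite the file. A minimal sketch, assuming `iconv`, `head -c`, `od`, and `tr` are available; the function names `is_valid_utf8` and `has_utf8_bom` are made up for illustration.

```shell
# Hypothetical helper: succeed if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: succeed if the file starts with the UTF-8
# byte order mark (bytes EF BB BF).
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

For example, `is_valid_utf8 header.h || echo "needs re-encoding"` flags a file that still has to be converted, and running `has_utf8_bom` before and after saving a file in an editor shows whether the editor added a BOM.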
On Wednesday, 7 November 2018 at 15:51:13 UTC, Jonathan Marler wrote:
> However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does?

https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d#L652
Nov 07 2018