digitalmars.D - Parsing D files with non-unicode characters

reply Arun Chandrasekaran <aruncxy gmail.com> writes:
I'm converting a large number of header files from C to D using 
DStep and I'm stuck at 
https://github.com/jacob-carlborg/dstep/issues/215

https://dlang.org/spec/intro.html shows that ASCII and UTF char 
formats are accepted. How do I go about converting a large code 
base like this?

Is this a bug in D to reject non-unicode chars in comments?

Arun
Nov 05 2018
next sibling parent reply Roland Hadinger <rolandh.dlangforum maildrop.cc> writes:
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
wrote:
 I'm converting a large amount of header files from C to D using 
 DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF char 
 formats are accepted. How do I go about converting a large code 
 base like this?
Just an idea: if you have 'iconv' available, you could always strip out non-utf-8 characters beforehand. See: https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
Nov 05 2018
parent reply Arun Chandrasekaran <aruncxy gmail.com> writes:
On Tuesday, 6 November 2018 at 00:38:03 UTC, Roland Hadinger 
wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
 wrote:
 I'm converting a large amount of header files from C to D 
 using DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF 
 char formats are accepted. How do I go about converting a 
 large code base like this?
Just an idea: if you have 'iconv' available, you could always strip out non-utf-8 characters beforehand. See: https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation from them using Doxygen.
Nov 05 2018
parent reply Roland Hadinger <rolandh.dlangforum maildrop.cc> writes:
On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran 
wrote:
 Thanks! Can't we preserve the comments? Comments are 
 invaluable, especially on the headerfiles. We generate 
 documentation using doxygen.
If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
Nov 05 2018
next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, November 5, 2018 6:19:17 PM MST Roland Hadinger via Digitalmars-d 
wrote:
 On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran

 wrote:
 Thanks! Can't we preserve the comments? Comments are
 invaluable, especially on the headerfiles. We generate
 documentation using doxygen.
If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
If I understand correctly, non-UTF text isn't legal in D source files, so that's just plain not possible, period. The text will have to be converted to Unicode in order to be in a D source file, even in comments. If the characters are legal in some other encoding, then that encoding will need to be correctly detected and the text converted to Unicode somehow. If they're just invalid, then arguably, there really isn't anything to preserve anyway.

- Jonathan M Davis
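As a rough illustration of the "is this file valid UTF-8 at all?" question, here is a minimal Python sketch. It is not part of DMD or DStep; the helper name `first_invalid_utf8_offset` and the sample bytes are made up for the example.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


# Hypothetical sample: a C comment whose text is encoded as Shift-JIS.
sample = "/* 日本語 */".encode("shift_jis")
print(first_invalid_utf8_offset(sample))       # offset of the first bad byte
print(first_invalid_utf8_offset(b"/* ok */"))  # None: pure ASCII is valid UTF-8
```

A check like this only says that the bytes are not UTF-8; it says nothing about which encoding they actually are.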
Nov 05 2018
prev sibling parent reply Arun Chandrasekaran <aruncxy gmail.com> writes:
On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger 
wrote:
 On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun 
 Chandrasekaran wrote:
 Thanks! Can't we preserve the comments? Comments are 
 invaluable, especially on the headerfiles. We generate 
 documentation using doxygen.
If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
This did the trick. It uses https://github.com/BYVoid/uchardet to determine the character set.

    for dir in $(find <DIR> -name include -type d); do
        pushd $dir
        for file in $(ls); do
            iconv -f $(uchardet $file) -t UTF-8 $file > t
            /bin/mv t $file
        done
        # [...] it back.
        sed -i 's,¥,\\,g' *
        dstep $file
        popd
    done
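For anyone without iconv at hand, the re-encoding step can be sketched in Python as well. This is only an illustration under stated assumptions: the source encoding is already known (for example from uchardet), and the helper name `reencode_to_utf8` and the file name `example.h` are hypothetical. Whether byte 0x5C ends up as a backslash or a yen sign depends on the converter, so check the output before running DStep.

```python
import os
import tempfile


def reencode_to_utf8(path: str, source_encoding: str) -> None:
    """Rewrite `path` in place as UTF-8, decoding it from `source_encoding`."""
    with open(path, "rb") as f:
        raw = f.read()
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Write to a temporary file in the same directory, then swap it in,
    # so a failure never leaves a half-written header behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "wb") as f:
        f.write(text.encode("utf-8"))
    os.replace(tmp, path)


# Usage on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as d:
    header = os.path.join(d, "example.h")
    with open(header, "wb") as f:
        f.write("/* 日本語 */\nint x;\n".encode("shift_jis"))
    reencode_to_utf8(header, "shift_jis")
    with open(header, encoding="utf-8") as f:
        print(f.read())
```

The comments survive the conversion; only their byte representation changes.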
Nov 06 2018
parent Arun Chandrasekaran <aruncxy gmail.com> writes:
On Wednesday, 7 November 2018 at 04:43:19 UTC, Arun 
Chandrasekaran wrote:
 On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger 
 wrote:
 On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun 
 Chandrasekaran wrote:
 Thanks! Can't we preserve the comments? Comments are 
 invaluable, especially on the headerfiles. We generate 
 documentation using doxygen.
If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
dstep $file
s/dstep $file/dstep */g
Nov 06 2018
prev sibling next sibling parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
wrote:
 I'm converting a large amount of header files from C to D using 
 DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF char 
 formats are accepted. How do I go about converting a large code 
 base like this?
I may be missing something, but isn’t it possible to open these files one by one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?
Nov 05 2018
parent Roland Hadinger <rolandh.dlangforum maildrop.cc> writes:
On Tuesday, 6 November 2018 at 07:25:09 UTC, Bastiaan Veelo wrote:
 I may be missing something, but isn’t it possible to open these 
 files one for one in their current encoding and save them in 
 UTF-8 encoding using an editor that supports that, e.g., 
 Sublime Text or Kate?
Yes. Better text editors are capable of automatically inferring (guessing) the source encoding, although not always correctly. Guessing the source encoding is something 'iconv' cannot do. I forgot to mention that 'iconv' can also convert text between different encodings, but only when the source encoding is known. When it isn't known, or when a file contains a mix of different encodings, iconv can only be used to filter out byte sequences that are invalid in the target encoding.
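To make the "guessing, although not always correctly" point concrete, here is a minimal Python sketch of guessing by trial decoding. It is an assumption-laden toy, not how any particular editor or uchardet works: the helper name `guess_encoding` and the candidate list are made up, and the first candidate that happens to decode wins even if it is the wrong one.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return the first candidate encoding that decodes `data`, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical samples.
print(guess_encoding("/* 日本語 */".encode("shift_jis")))
print(guess_encoding(b"/* plain ASCII */"))
```

Because many byte sequences are valid in more than one legacy encoding, the order of the candidates matters, and a statistical detector is usually a better choice for real data.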
Nov 06 2018
prev sibling parent reply Jonathan Marler <johnnymarler gmail.com> writes:
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
wrote:
 I'm converting a large amount of header files from C to D using 
 DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF char 
 formats are accepted. How do I go about converting a large code 
 base like this?

 Is this a bug in D to reject non-unicode chars in comments?

 Arun
So you have code that has characters that are neither ascii nor unicode? What encoding is it using? And what characters does it contain that can't be represented with unicode?
Nov 06 2018
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Nov 06, 2018 at 05:21:13PM +0000, Jonathan Marler via Digitalmars-d
wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
 I'm converting a large amount of header files from C to D using
 DStep and I'm stuck at
 https://github.com/jacob-carlborg/dstep/issues/215
 
 https://dlang.org/spec/intro.html shows that ASCII and UTF char
 formats are accepted. How do I go about converting a large code base
 like this?
[...]
If you're on Linux, you could use the recode utility: https://github.com/rrthomas/recode/

T

-- 
The easy way is the wrong way, and the hard way is the stupid way. Pick one.
Nov 06 2018
prev sibling parent reply Arun Chandrasekaran <aruncxy gmail.com> writes:
On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler 
wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
 wrote:
 I'm converting a large amount of header files from C to D 
 using DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF 
 char formats are accepted. How do I go about converting a 
 large code base like this?

 Is this a bug in D to reject non-unicode chars in comments?

 Arun
So you have code that has characters that are neither ascii nor unicode? What encoding is it using? And what characters does it contain that can't be represented with unicode?
At first I was not able to find the character encoding: 'file -i' reported unknown-8bit. Ultimately https://github.com/BYVoid/uchardet helped me determine the charset: it was SHIFT-JIS.
Nov 06 2018
parent reply Jonathan Marler <johnnymarler gmail.com> writes:
On Wednesday, 7 November 2018 at 05:33:20 UTC, Arun 
Chandrasekaran wrote:
 On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler 
 wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun 
 Chandrasekaran wrote:
 I'm converting a large amount of header files from C to D 
 using DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF 
 char formats are accepted. How do I go about converting a 
 large code base like this?

 Is this a bug in D to reject non-unicode chars in comments?

 Arun
So you have code that has characters that are neither ascii nor unicode? What encoding is it using? And what characters does it contain that can't be represented with unicode?
I was not able to find the character encoding. file -i said unknown-8bit. Ultimately https://github.com/BYVoid/uchardet helped me to determine the charset, it was SHIFT-JIS.
I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM at the beginning, then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.
Nov 07 2018
next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
 I hadn't seen that you provided a link to the file.  After I found it, I
 played with it a bit.  It looks like if you add a UTF-8 BOM in the
 beginning then DMD successfully parses it. However, from my quick scan
 of lexer.d, I didn't see anywhere in the code that actually changes how
 it decodes the file based on the presence of the BOM.  Does anyone
 know if it does?  Is DMD supposed to allow multi-byte UTF-8 characters
 if there is no BOM?  If so, then this is a bug.
It can handle multibyte UTF8 characters without a byte order mark. Should be straightforward to test this:

    echo '/* ſ™🅻 */' > file.d
    dmd -c file.d

On dmd 2.081.1, the byte order mark changes nothing for me.
Nov 07 2018
parent Jonathan Marler <johnnymarler gmail.com> writes:
On Wednesday, 7 November 2018 at 17:37:11 UTC, Neia Neutuladh 
wrote:
 On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
 I hadn't seen that you provided a link to the file.  After I 
 found it, I played with it a bit.  It looks like if you add a 
 UTF-8 BOM in the beginning then DMD successfully parses it. 
 However, from my quick scan of lexer.d, I didn't see anywhere 
 in the code that actually changes how it decodes the file 
 based on the presence of the BOM.  Does anyone know if it 
 does?  Is DMD supposed to allow multi-byte UTF-8 characters if 
 there is no BOM?  If so, then this is a bug.
It can handle multibyte UTF8 characters without a byte order mark. Should be straightforward to test this: echo '/* ſ™🅻 */' > file.d dmd -c file.d On dmd 2.081.1, the byte order mark changes nothing for me.
Ok, that matches what I saw in the code. It looks like my editor was actually changing the file. The encoding being used in that file is not valid UTF-8, so the solution is to re-encode the files in UTF-8.
Nov 07 2018
prev sibling parent Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Wednesday, 7 November 2018 at 15:51:13 UTC, Jonathan Marler 
wrote:

 However, from my quick scan of lexer.d, I didn't see anywhere 
 in the code that actually changes how it decodes the file based 
 on the presence of the BOM.  Does anyone know if it does?
https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d#L652
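For readers who don't want to dig through the compiler source, the general idea of picking an encoding from a byte order mark can be sketched in a few lines of Python. This is not a transcription of what dmodule.d does; the helper name `sniff_bom` and the returned labels are assumptions made for the example.

```python
import codecs

# Check the UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the longer match has to win.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Hypothetical samples: the same comment with and without a UTF-8 BOM.
comment = "/* ſ™🅻 */".encode("utf-8")
print(sniff_bom(codecs.BOM_UTF8 + comment))
print(sniff_bom(comment))
```

A file without a BOM simply falls through to the default decoding, which is consistent with the observation above that the BOM makes no difference for a file that is already valid UTF-8.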
Nov 07 2018