digitalmars.D - Parsing D files with non-unicode characters

Arun Chandrasekaran <aruncxy gmail.com> writes:

I'm converting a large number of header files from C to D using 
DStep and I'm stuck at 
https://github.com/jacob-carlborg/dstep/issues/215

https://dlang.org/spec/intro.html shows that ASCII and UTF char 
formats are accepted. How do I go about converting a large code 
base like this?

Is it a bug in D to reject non-Unicode characters in comments?

Arun

Nov 05 2018

Roland Hadinger <rolandh.dlangforum maildrop.cc> writes:

On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
wrote:
 I'm converting a large amount of header files from C to D using 
 DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF char 
 formats are accepted. How do I go about converting a large code 
 base like this?

Just an idea: if you have 'iconv' available, you could always 
strip out non-UTF-8 characters beforehand. See:

https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

Nov 05 2018

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Tuesday, 6 November 2018 at 00:38:03 UTC, Roland Hadinger 
wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
 wrote:
 I'm converting a large amount of header files from C to D 
 using DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF 
 char formats are accepted. How do I go about converting a 
 large code base like this?

 Just an idea: if you have 'iconv' available, you could always 
 strip out non-utf-8 characters beforehand. See:

 https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

Thanks! Can't we preserve the comments? Comments are invaluable, 
especially in the header files. We generate documentation using 
Doxygen.

Nov 05 2018

Roland Hadinger <rolandh.dlangforum maildrop.cc> writes:

On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran 
wrote:
 Thanks! Can't we preserve the comments? Comments are 
 invaluable, especially on the headerfiles. We generate 
 documentation using doxygen.

If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', 
then no, what I suggested wouldn't work.

Nov 05 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, November 5, 2018 6:19:17 PM MST Roland Hadinger via Digitalmars-d 
wrote:
 On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
 Thanks! Can't we preserve the comments? Comments are
 invaluable, especially on the headerfiles. We generate
 documentation using doxygen.

 If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is',
 then no, what I suggested wouldn't work.

If I understand correctly, non-UTF isn't legal in D source files, so that's
just plain not possible, period. The files will have to be converted to
Unicode in order to be D source files, even in comments. If the characters
are legal in some other encoding, then the encoding will need to be correctly
detected and converted to Unicode somehow. If they're just invalid, then
arguably, there really isn't anything to preserve anyway.

- Jonathan M Davis

Nov 05 2018

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger 
wrote:
 On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun 
 Chandrasekaran wrote:
 Thanks! Can't we preserve the comments? Comments are 
 invaluable, especially on the headerfiles. We generate 
 documentation using doxygen.

 If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', 
 then no, what I suggested wouldn't work.

This did the trick. It uses https://github.com/BYVoid/uchardet to 
determine the character set.

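for dir in $(find <DIR> -name include -type d); do
     pushd "$dir"

     # Detect each file's encoding with uchardet and re-encode it as UTF-8.
     for file in *; do
         iconv -f "$(uchardet "$file")" -t UTF-8 "$file" > t
         /bin/mv t "$file"
     done

     # The conversion turned backslashes into yen signs (¥); change them back.
     sed -i 's,¥,\\,g' *

     dstep $file

     popd
done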
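
A slightly more defensive variant of the per-file step can be sketched as follows. This is only an illustration, not what was actually run above: `convert_in_place` is a hypothetical helper name, the source encoding is passed in as an argument (for example the output of uchardet), and files that already decode as UTF-8 are skipped so that a second run does not convert them twice.

```shell
# convert_in_place is a hypothetical helper, not part of DStep or uchardet.
# $1 = source encoding (e.g. SHIFT-JIS), $2 = file to convert in place
convert_in_place() {
    enc=$1
    file=$2

    # A file that already decodes as UTF-8 is left untouched.
    if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
        return 0
    fi

    tmp=$(mktemp) || return 1

    # Overwrite the original only if the conversion succeeded.
    if iconv -f "$enc" -t UTF-8 "$file" > "$tmp"; then
        cat "$tmp" > "$file"   # keeps the original file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Inside the loop it could be called as `convert_in_place "$(uchardet "$file")" "$file"`, assuming uchardet guesses the encoding correctly.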

Nov 06 2018

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Wednesday, 7 November 2018 at 04:43:19 UTC, Arun 
Chandrasekaran wrote:
 On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger 
 wrote:
 On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun 
 Chandrasekaran wrote:
 Thanks! Can't we preserve the comments? Comments are 
 invaluable, especially on the headerfiles. We generate 
 documentation using doxygen.

 If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', 
 then no, what I suggested wouldn't work.


     dstep $file

s/dstep $file/dstep */g

Nov 06 2018

Bastiaan Veelo <Bastiaan Veelo.net> writes:

On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
wrote:
 I'm converting a large amount of header files from C to D using 
 DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF char 
 formats are accepted. How do I go about converting a large code 
 base like this?

I may be missing something, but isn’t it possible to open these 
files one by one in their current encoding and save them in 
UTF-8 encoding using an editor that supports that, e.g., Sublime 
Text or Kate?

Nov 05 2018

Roland Hadinger <rolandh.dlangforum maildrop.cc> writes:

On Tuesday, 6 November 2018 at 07:25:09 UTC, Bastiaan Veelo wrote:
 I may be missing something, but isn’t it possible to open these 
 files one for one in their current encoding and save them in 
 UTF-8 encoding using an editor that supports that, e.g., 
 Sublime Text or Kate?

Yes. Better text editors are capable of automatically inferring 
(guessing) the source encoding, although not always correctly. 
Guessing the source encoding is something 'iconv' cannot do.

I forgot to mention that 'iconv' can also convert text between 
different encodings, but only when the source encoding is known. 
When it isn't known or when a file contains a mix of different 
encodings, iconv can only be used to filter out byte sequences 
that are invalid in the target encoding.
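
The two uses of iconv described here can be sketched roughly as below. The function names `to_utf8` and `strip_invalid` are made up for illustration; only iconv and its `-f`, `-t` and `-c` options are real, and the exit status handling is an assumption about how some iconv builds behave.

```shell
# Hypothetical helper names; only iconv and its -f, -t and -c options are real.

# Convert a file from a known source encoding to UTF-8 (the text survives).
# $1 = source encoding (e.g. SHIFT-JIS), $2 = input file
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Drop every byte sequence that is not valid UTF-8 (non-UTF-8 text is lost).
# $1 = input file
strip_invalid() {
    # -c omits input that cannot be converted. Some iconv builds exit
    # non-zero after dropping input, so the status is ignored here.
    iconv -c -f UTF-8 -t UTF-8 "$1" || true
}
```

Both functions write to standard output, so the result can be inspected before the original file is overwritten.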

Nov 06 2018

Jonathan Marler <johnnymarler gmail.com> writes:

On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
wrote:
 I'm converting a large amount of header files from C to D using 
 DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF char 
 formats are accepted. How do I go about converting a large code 
 base like this?

 Is this a bug in D to reject non-unicode chars in comments?

 Arun

So you have code that has characters that are neither ASCII nor 
Unicode?  What encoding is it using?  And what characters does it 
contain that can't be represented in Unicode?

Nov 06 2018

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Nov 06, 2018 at 05:21:13PM +0000, Jonathan Marler via Digitalmars-d
wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
 I'm converting a large amount of header files from C to D using
 DStep and I'm stuck at
 https://github.com/jacob-carlborg/dstep/issues/215
 
 https://dlang.org/spec/intro.html shows that ASCII and UTF char
 formats are accepted. How do I go about converting a large code base
 like this?


[...]

If you're on Linux, you could use the recode utility:

	https://github.com/rrthomas/recode/


T

-- 
The easy way is the wrong way, and the hard way is the stupid way. Pick one.

Nov 06 2018

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler 
wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran 
 wrote:
 I'm converting a large amount of header files from C to D 
 using DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF 
 char formats are accepted. How do I go about converting a 
 large code base like this?

 Is this a bug in D to reject non-unicode chars in comments?

 Arun

 So you have code that has characters that are neither ascii nor 
 unicode?  What encoding is it using?  And what characters does 
 it contain that can't be represented with unicode?

I was not able to find the character encoding. file -i said 
unknown-8bit. Ultimately https://github.com/BYVoid/uchardet 
helped me determine the charset: it was SHIFT-JIS.

Nov 06 2018

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 7 November 2018 at 05:33:20 UTC, Arun 
Chandrasekaran wrote:
 On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler 
 wrote:
 On Monday, 5 November 2018 at 23:50:46 UTC, Arun 
 Chandrasekaran wrote:
 I'm converting a large amount of header files from C to D 
 using DStep and I'm stuck at 
 https://github.com/jacob-carlborg/dstep/issues/215

 https://dlang.org/spec/intro.html shows that ASCII and UTF 
 char formats are accepted. How do I go about converting a 
 large code base like this?

 Is this a bug in D to reject non-unicode chars in comments?

 Arun

 So you have code that has characters that are neither ascii 
 nor unicode?  What encoding is it using?  And what characters 
 does it contain that can't be represented with unicode?

 I was not able to find the character encoding. file -i said 
 unknown-8bit. Ultimately https://github.com/BYVoid/uchardet 
 helped me to determine the charset, it was SHIFT-JIS.

I hadn't seen that you provided a link to the file.  After I 
found it, I played with it a bit.  It looks like if you add a 
UTF-8 BOM at the beginning, then DMD successfully parses it. 
However, from my quick scan of lexer.d, I didn't see anywhere in 
the code that actually changes how it decodes the file based on 
the presence of the BOM.  Does anyone know if it does?  Is 
DMD supposed to allow multi-byte UTF-8 characters if there is no 
BOM?  If so, then this is a bug.

Nov 07 2018

Neia Neutuladh <neia ikeran.org> writes:

On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
 I hadn't seen that you provided a link to the file.  After I found it, I
 played with it a bit.  It looks like if you add a UTF-8 BOM in the
 beginning then DMD successfully parses it. However, from my quick scan
 of lexer.d, I didn't see anywhere in the code that actually changes how
 it decodes the file based on the presence of the BOM.  Does anyone
 know if it does?  Is DMD supposed to allow multi-byte UTF-8 characters
 if there is no BOM?  If so, then this is a bug.

It can handle multi-byte UTF-8 characters without a byte order mark. Should 
be straightforward to test this:

  echo '/* ſ™🅻 */' > file.d
  dmd -c file.d

On dmd 2.081.1, the byte order mark changes nothing for me.
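
To repeat the comparison with and without a byte order mark, the two input files could be produced along these lines. This is a sketch only: `has_bom` is a hypothetical helper, the file names are arbitrary, and the dmd step itself is left out.

```shell
# has_bom is a hypothetical helper; it only inspects the first three bytes.
has_bom() {
    # A UTF-8 byte order mark is the byte sequence EF BB BF.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Write the same comment once without and once with a leading BOM.
dir=$(mktemp -d)
printf '/* ſ™🅻 */\n' > "$dir/plain.d"
{ printf '\357\273\277'; cat "$dir/plain.d"; } > "$dir/bom.d"

has_bom "$dir/plain.d" || echo "plain.d: no BOM"
has_bom "$dir/bom.d" && echo "bom.d: BOM present"
```

Running `dmd -c` on each of the two files should then show whether the BOM makes any difference on a given compiler version.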

Nov 07 2018

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 7 November 2018 at 17:37:11 UTC, Neia Neutuladh 
wrote:
 On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
 I hadn't seen that you provided a link to the file.  After I 
 found it, I played with it a bit.  It looks like if you add a 
 UTF-8 BOM in the beginning then DMD successfully parses it. 
 However, from my quick scan of lexer.d, I didn't see anywhere 
 in the code that actually changes how it decodes the file 
 based on the presence of the BOM.  Does anyone know if it 
 does?  Is DMD supposed to allow multi-byte UTF-8 characters if 
 there is no BOM?  If so, then this is a bug.

 It can handle multibyte UTF8 characters without a byte order 
 mark. Should be straightforward to test this:

   echo '/* ſ™🅻 */' > file.d
   dmd -c file.d

 On dmd 2.081.1, the byte order mark changes nothing for me.

Ok, that matches what I saw in the code.  It looks like my editor 
was actually changing the file.  The encoding being used in that 
file is not valid UTF-8, so the solution is to re-encode the files 
in UTF-8.

Nov 07 2018

Stanislav Blinov <stanislav.blinov gmail.com> writes:

On Wednesday, 7 November 2018 at 15:51:13 UTC, Jonathan Marler 
wrote:

 However, from my quick scan of lexer.d, I didn't see anywhere 
 in the code that actually changes how it decodes the file based 
 on the presence of the BOM.  Does anyone know if it does?

https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d#L652

Nov 07 2018
