digitalmars.D.learn - Strange behavior in console with UTF-8
- Jonathan Villa (66/66) Mar 24 2016 I prefer to post this thing here because it could that I'm doing
- =?UTF-8?Q?Ali_=c3=87ehreli?= (5/8) Mar 24 2016 Try char:
- Jonathan Villa (3/8) Mar 24 2016 Thanks for the quick reply.
- Jonathan Villa (2/6) Mar 24 2016 Also tried with dchar ... there's no changes.
- Steven Schveighoffer (10/74) Mar 25 2016 D's File i/o uses C's FILE * i/o system. At least on Windows, this has
- Jonathan Villa (7/19) Mar 25 2016 It's the same Ali suggested (if I get it right) and the behaviour
- Steven Schveighoffer (5/25) Mar 26 2016 At this point, I think knowing exactly what input you are sending would
- Jonathan Villa (8/26) Mar 27 2016 OK, the following inputs I've tested: á, é, í, ó, ú, ñ, à, è, ì...
- Steven Schveighoffer (13/34) Mar 28 2016 I can reproduce your issue on windows.
- Jonathan Villa (4/18) Mar 28 2016 Ok, I'm gonna register it with your data. Thanks.
- Jonathan Villa (11/17) Mar 27 2016 I've tested on Debian 4.2 x64 using CHAR type, and it behaves
I prefer to post this here because it could be that I'm doing something wrong.

I'm using std.stdio -> readln() to read whatever I type in the console. BUT, if the line contains some UTF-8 characters, the data obtained is EMPTY:

<code>
module runnable;

import std.stdio;
import std.string : chomp;
import std.experimental.logger;

void doSomethingElse(wchar[] data)
{
    writeln("hello!");
}

int main(string[] args)
{
    /* Some fix I found for UTF-8 related problems; I'm using Windows 10 */
    version(Windows)
    {
        import core.sys.windows.windows;
        if (SetConsoleCP(65001) == 0)
            throw new Exception("failure");
        if (SetConsoleOutputCP(65001) == 0)
            throw new Exception("failure");
    }

    FileLogger fl = new FileLogger("log.log");
    wchar[] readerBuffer;
    readln(readerBuffer);
    readerBuffer = chomp(readerBuffer);
    fl.info(readerBuffer.length); /* <- if the string read contains at least one UTF-8 char this prints 0, else it prints its length */

    if (readerBuffer != "exit"w)
        doSomethingElse(readerBuffer);

    /* Also, none of the following code runs as expected: the program doesn't wait for you, it executes readln() even without pressing/sending a key */
    readln(readerBuffer);
    fl.info(readerBuffer.length);
    readln(readerBuffer);
    fl.info(readerBuffer.length);
    readln(readerBuffer);
    fl.info(readerBuffer.length);
    readln(readerBuffer);
    fl.info(readerBuffer.length);
    readln(readerBuffer);
    fl.info(readerBuffer.length);

    return 0;
}
</code>

The real code is bigger, but this describes the bug. Also, printing UTF-8 works without problems. My main problem is that the line is going to be sent through a TCP socket and I want it to work with UTF-8. I'm using WCHAR instead of CHAR in the hope of having fewer problems in the future.

If you comment out the Windows fix code, the program crashes: http://prntscr.com/ajmy14

Also, I tried stdin.flush() right after the first readln(), but nothing seems to fix it.

Am I doing something wrong? Many thanks.
Mar 24 2016
On 03/24/2016 05:54 PM, Jonathan Villa wrote:
> I'm using WCHAR instead of CHAR with the hope to get less problems in the future.

Try char:

char[] readerBuffer;

> Also I tried stdin.flush()

flush() has no effect on input streams.

Ali
Mar 24 2016
On Friday, 25 March 2016 at 01:03:06 UTC, Ali Çehreli wrote:
> Try char: char[] readerBuffer;
> flush() has no effect on input streams.

Thanks for the quick reply. Unfortunately, it behaves exactly as before with wchar.
Mar 24 2016
On Friday, 25 March 2016 at 01:03:06 UTC, Ali Çehreli wrote:
> Try char: char[] readerBuffer;

I also tried with dchar; nothing changes.
Mar 24 2016
On 3/24/16 8:54 PM, Jonathan Villa wrote:
> [...] I'm using WCHAR instead of CHAR with the hope to get less problems in the future. [...]

D's File i/o uses C's FILE * i/o system. At least on Windows, this has literally zero support for wchar (you can set stream width, and the library just ignores it).

What is likely happening is that it is putting the char code units into the wchar buffer directly, which is not what you want. I am not certain of this cause, but I would steer clear of any i/o that is not char-based.

What you can do is read into a char buffer, and then re-encode using std.conv.to to get wchar strings if you need that.

-Steve
Mar 25 2016
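A minimal sketch of the approach Steve describes above: read into a char buffer, then re-encode with std.conv.to. The helper name toWide is hypothetical and only for illustration. This covers the re-encoding step only; it is not a fix for the Windows console behaviour, since the thread already reports that reading into char behaves the same there.

```d
import std.conv : to;
import std.stdio : readln, writeln;
import std.string : chomp;

/// Hypothetical helper: strip the line terminator from a UTF-8 line
/// and re-encode the result as UTF-16.
wstring toWide(const(char)[] line)
{
    return line.chomp.to!wstring;
}

void main()
{
    char[] buffer;                  // char-based buffer, as suggested
    readln(buffer);                 // read one line as UTF-8 code units
    wstring wide = toWide(buffer);  // re-encode only after reading
    writeln(wide.length);           // length in UTF-16 code units
}
```

Keeping the i/o char-based and converting afterwards means the C runtime only ever sees bytes, and the conversion to wchar is done by Phobos rather than by the stream.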
On Friday, 25 March 2016 at 13:58:44 UTC, Steven Schveighoffer wrote:
> [...] What you can do is read into a char buffer, and then re-encode using std.conv.to to get wchar strings if you need that.

It's the same thing Ali suggested (if I get it right), and the behaviour is the same. Sending a single UTF-8 character is enough to reproduce the mess, regardless of the char type used.

JV
Mar 25 2016
On 3/25/16 6:47 PM, Jonathan Villa wrote:
> It's the same Ali suggested (if I get it right) and the behaviour its the same. It just get to send a UTF8 char to reproduce the mess, independently of the char type you send.

At this point, I think knowing exactly what input you are sending would be helpful. Can you attach a file which has the input that causes the error? Or just paste the input into your post.

-Steve
Mar 26 2016
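One way to show exactly what input arrives is to dump the raw code units that readln() returns. The following is a rough diagnostic sketch, not something posted in the thread; the helper name hexBytes is hypothetical.

```d
import std.format : format;
import std.stdio : readln, writeln;
import std.string : representation;

/// Hypothetical helper: render the raw code units of a line
/// as space-separated hexadecimal bytes.
string hexBytes(const(char)[] line)
{
    return format("%(%02X %)", line.representation);
}

void main()
{
    char[] buffer;
    readln(buffer);  // read one line, terminator included
    writeln(buffer.length, " code units: ", hexBytes(buffer));
}
```

For a correctly read line containing only á, the dump should begin with the UTF-8 bytes C3 A1 and end with the line terminator; a count of 0 matches the empty read described in this thread.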
On Saturday, 26 March 2016 at 16:34:34 UTC, Steven Schveighoffer wrote:
> At this point, I think knowing exactly what input you are sending would be helpful. Can you attach a file which has the input that causes the error? Or just paste the input into your post.

OK, I've tested the following inputs: á, é, í, ó, ú, ñ, à, è, ì, ò, ù. Any one of those is enough to reproduce the behaviour.

JV
Mar 27 2016
On 3/27/16 12:04 PM, Jonathan Villa wrote:
> OK, the following inputs I've tested: á, é, í, ó, ú, ñ, à, è, ì, ò, ù. Just one input is enough to reproduce the behaviour.

I can reproduce your issue on Windows. It works on Mac OS X.

I see different behavior on 32-bit (DMC stdlib) vs. 64-bit (MSVC stdlib). On both, the line is not read properly (I get a length of 0). On 32-bit, the program exits immediately, indicating it cannot read any more data. On 64-bit, the program continues to allow input.

I don't think this is normal behavior, and it should be filed as a bug. I'm not a Windows developer normally, but I would guess this is an issue with the Windows flavors of readln.

Please file here: https://issues.dlang.org under the Phobos component.

-Steve
Mar 28 2016
On Monday, 28 March 2016 at 18:28:33 UTC, Steven Schveighoffer wrote:
> I can reproduce your issue on windows. [...] Please file here: https://issues.dlang.org under the Phobos component.

OK, I'm going to file it with your data. Thanks.

JV
Mar 28 2016
On Saturday, 26 March 2016 at 16:34:34 UTC, Steven Schveighoffer wrote:
> At this point, I think knowing exactly what input you are sending would be helpful. Can you attach a file which has the input that causes the error? Or just paste the input into your post.

I've tested on Debian 4.2 x64 using the CHAR type, and it behaves correctly without any problems. Clearly this bug must be related to the Windows console.

Here's the behaviour on Windows 10 x64: http://prntscr.com/akskt1

And here it is on Debian x64 4.2: http://prntscr.com/akskjw

JV
Mar 27 2016