digitalmars.D - Character set conversions
- Adam D. Ruppe (11/11) May 29 2011 I've encountered some problems with other charsets recently. Phobos has
- Jonathan M Davis (17/32) May 29 2011 Well, generally the idea is that you just use UTF-8, UTF-16, or UTF-32, ...
- Adam D. Ruppe (4/9) May 29 2011 Translation is all I want. Internally, everything is utf8 strings,
- Daniel Gibson (7/17) May 29 2011 Hmm on the one hand iconv already does this for a plethora of
- Jonathan M Davis (14/24) May 29 2011 Well, likely no one has done it yet because none of the Phobos developer...
- Kagamin (3/7) May 30 2011 May be, it's his cgi lib? :)
- Adam D. Ruppe (13/15) May 30 2011 In practice, that hasn't been a problem because browser tend to
- Jérôme M. Berger (11/29) May 30 2011 Fun fact about Excel generated CSV files: quite apart from encoding
- Nick Sabalausky (3/8) May 30 2011 Heh, that's just wonderful: localized file format specs...
- Simen Kjaeraas (11/16) May 30 2011 On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger
- Daniel Gibson (19/35) May 30 2011 CSV in Excel is totally misleading anyway. At least in the German
- Jérôme M. Berger (15/26) May 30 2011 il.
- Jonathan M Davis (6/23) May 30 2011 Well, knowing Microsoft, they probably did it with printf (or fprintf or...
- Kagamin (2/7) May 31 2011 Doesn't C standard specify the locale to be "C" until you set it explici...
- Daniel Gibson (6/15) May 31 2011 At least on Linux it is usually set to whatever you specified on
- Jonathan M Davis (11/27) May 30 2011 ed
- Adam D. Ruppe (13/17) May 30 2011 Yeah, I've seen the semicolon in the wild before too, though I didn't
- Jacob Carlborg (7/32) May 30 2011 Yeah, that is a nightmare. I tried SYLK, symbolic link as well, it's
- Sean Kelly (4/19) May 30 2011 I suggest looking into ICU if you're doing this stuff. I believe =
- Kagamin (2/4) May 31 2011 I suppose it's system ANSI encoding, which is locale-dependent, you can ...
- Kagamin (2/6) May 31 2011 The client usually send information about its locale, from this info you...
- Kagamin (7/23) May 31 2011 according to N1425
- Daniel Gibson (2/28) May 31 2011 So they break it deliberately in Excel? Smart.
- Kagamin (3/12) May 31 2011 Excel deliberately localizes data presented to the user. Wouldn't it be ...
- Daniel Gibson (5/19) May 31 2011 I'm not talking about representing the values on the screen - I'm
- Kagamin (2/7) May 31 2011
- Daniel Gibson (8/18) May 31 2011 It's natural to have an internal representation of the value of a field
- Kagamin (2/4) May 31 2011 I've checked excel 2007, seems like it stores (in xlsx) numbers and func...
- Daniel Gibson (3/9) May 31 2011 Ok. Everything else would be a really unusable mess, especially for
I've encountered some problems with other charsets recently. Phobos has a std.encoding that can do some useful stuff, but there are some encodings I've seen in the wild that it can't handle (indeed, the list it does support is fairly short).

I used GNU iconv for one of my projects and it works for me, but I wonder: is anyone planning to add more charset support to Phobos? (Alternatively, am I missing something already there?)

If not, maybe I'll do a few myself. I've never actually written code to do this, but it can't be rocket science. I suspect it's more tedious than anything else.
May 29 2011
On 2011-05-29 19:21, Adam D. Ruppe wrote:
> I've encountered some problems with other charsets recently. Phobos has a std.encoding that can do some useful stuff, but there are some encodings I've seen in the wild that it can't handle (indeed, the list it does support is fairly short).
>
> I used GNU iconv for one of my projects and it works for me, but I wonder: is anyone planning to add more charset support to Phobos? (Alternatively, am I missing something already there?)
>
> If not, maybe I'll do a few myself. I've never actually written code to do this, but it can't be rocket science. I suspect it's more tedious than anything else.

Well, generally the idea is that you just use UTF-8, UTF-16, or UTF-32, and for the most part, I wouldn't really expect people to be using UTF-16 except when they need to interface with Windows system functions which require it. By definition, char is supposed to be UTF-8, wchar is supposed to be UTF-16, and dchar is supposed to be UTF-32. I don't really think that it's expected that you be using any other encodings within your typical D program.

Sometimes it may be necessary to translate from another encoding to UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and sometimes it may be necessary to translate to another encoding from UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it certainly isn't the norm. It may be that we need better support for dealing with those cases, but they should really only be for converting on input or output.

So, if you want to improve std.encoding to handle more charsets, then feel free, but don't expect the rest of Phobos to work with anything beyond UTF-8, UTF-16, and UTF-32. It's going to be throwing UtfExceptions if you do.

- Jonathan M Davis
May 29 2011
Jonathan M Davis wrote:
> Sometimes it may be necessary to translate from another encoding to UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and sometimes it may be necessary to translate to another encoding from UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it certainly isn't the norm.

Translation is all I want. Internally, everything is utf8 strings, but sometimes the program is fed files in another encoding and it needs to handle them too.
May 29 2011
On 30.05.2011 05:03, Adam D. Ruppe wrote:
> Jonathan M Davis wrote:
>> Sometimes it may be necessary to translate from another encoding to UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and sometimes it may be necessary to translate to another encoding from UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it certainly isn't the norm.
>
> Translation is all I want. Internally, everything is utf8 strings, but sometimes the program is fed files in another encoding and it needs to handle them too.

Hmm, on the one hand iconv already does this for a plethora of encodings. On the other hand, AFAIK there is no iconv implementation that could be shipped with Phobos, so if a module for translating between encodings should become part of Phobos, there seems to be no other way than writing one from scratch :/
(And I think having this in Phobos would make sense.)
May 29 2011
On 2011-05-29 20:03, Adam D. Ruppe wrote:
> Jonathan M Davis wrote:
>> Sometimes it may be necessary to translate from another encoding to UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and sometimes it may be necessary to translate to another encoding from UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it certainly isn't the norm.
>
> Translation is all I want. Internally, everything is utf8 strings, but sometimes the program is fed files in another encoding and it needs to handle them too.

Well, likely no one has done it yet because none of the Phobos developers have needed it enough to implement it, and no one outside of them has taken the time to do so and tried to get it into Phobos. And with everything else there is to do, it's the sort of thing that's likely not to get done anytime soon, especially with no feature requests or bug reports on the matter. Personally, I wasn't even aware that it was an issue. Pure UTF-8 has always worked just fine for me. Presumably, you're running into issues with it because you're actually using D at work.

So, you can either implement it yourself and create a pull request for it, or you can create an enhancement request, and it'll probably get done eventually, but with everything else that needs doing, I don't know how quickly it'll get done.

- Jonathan M Davis
May 29 2011
Jonathan M Davis Wrote:
> especially with no feature requests or bug reports on the matter. Personally, I wasn't even aware that it was an issue. Pure UTF-8 has always worked just fine for me. Presumably, you're running into issues with it because you're actually using D at work.

Maybe it's his cgi lib? :)
Client is free to send requests in any encoding, I suppose.
May 30 2011
Kagamin wrote:
> Maybe it's his cgi lib? :)
> Client is free to send requests in any encoding, I suppose.

In practice, that hasn't been a problem because browsers tend to send requests in the same encoding as the html you served. Since the D program always outputs utf8, the browsers all send back utf8 too.

The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid UTF-8 sequence.

Converting that is easy enough though.

Second problem is now I want to fetch and process random websites on the internet, and they come in a variety of encodings... again, utf covers a big majority, but not all of them.
May 30 2011
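The Windows-1252 failure described in the post above is easy to reproduce. The following is a minimal Python sketch, not code from the thread; the sample bytes are made up for illustration, and `cp1252` is Python's codec name for Windows-1252. It shows the bytes failing UTF-8 validation and then being converted.

```python
# Hypothetical sample: bytes as Excel might save them on a Western European
# Windows setup. In Windows-1252, 0x93 and 0x94 are curly double quotes and
# 0x97 is an em dash. None of them is valid as a standalone UTF-8 byte.
raw = b"He said \x93hi\x94 \x97 twice"

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # This is the "invalid UTF-8 sequence" error from the post.
    valid_utf8 = False

# Converting is a two-step translation: decode from the source charset,
# then encode to UTF-8 for internal use.
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")
```

The conversion itself is the easy part; knowing that the input is Windows-1252 rather than some other single-byte charset is the hard part.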
Adam D. Ruppe wrote:
> Kagamin wrote:
>> Maybe it's his cgi lib? :)
>> Client is free to send requests in any encoding, I suppose.
>
> In practice, that hasn't been a problem because browsers tend to send requests in the same encoding as the html you served.
>
> Since the D program always outputs utf8, the browsers all send back utf8 too.
>
> The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid UTF-8 sequence.
>
> Converting that is easy enough though.

Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...

Just thought I'd point it out in case you did not know.

Jerome
--
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 30 2011
"Jérôme M. Berger" <jeberger free.fr> wrote in message news:is0m2h$1s32$1 digitalmars.com...
> Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...
>
> Just thought I'd point it out in case you did not know.

Heh, that's just wonderful: localized file format specs...
May 30 2011
On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger <jeberger free.fr> wrote:
> Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...
>
> Just thought I'd point it out in case you did not know.

Fun? Gods, it's the most horrible idea I've witnessed in computing. If only they'd call it something other than CSV, at least - Comma Separated Values separated by semicolons? WTF?

And the fantastic joy of opening one of those abominations in some other program... *shiver*

--
Simen
May 30 2011
On 30.05.2011 22:20, Simen Kjaeraas wrote:
> On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger <jeberger free.fr> wrote:
>> Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...
>>
>> Just thought I'd point it out in case you did not know.
>
> Fun? Gods, it's the most horrible idea I've witnessed in computing. If only they'd call it something other than CSV, at least - Comma Separated Values separated by semicolons? WTF?
>
> And the fantastic joy of opening one of those abominations in some other program... *shiver*

CSV in Excel is totally misleading anyway. At least in the German version, if you want to import a CSV file, the standard separator is tab, not comma. If you use File->Open this is all you can get; importing with custom separators is hidden somewhere else IIRC. (This refers to Office XP, dunno if newer versions are better in this regard.)

In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.

This was/is(?) a bug in GtkRadiant, a level editor for Quake-like games, which uses printf or something to write the map files. The map compiler will reject them if decimals use a , instead of a . and stuff like that. (The workaround is to always use the standard locale, i.e. "LC_ALL=C gtkradiant" to start it.)

Cheers,
- Daniel
May 30 2011
Daniel Gibson wrote:
> In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.
>
> This was/is(?) a bug in GtkRadiant, a level editor for Quake-like games, which uses printf or something to write the map files. The map compiler will reject them if decimals use a , instead of a . and stuff like that. (The workaround is to always use the standard locale, i.e. "LC_ALL=C gtkradiant" to start it.)

Actually, that is the same issue: Excel outputs numbers to CSV in a locale-dependent way (probably using printf), which means that in some locales the decimal point is a comma, which prevents using it as a field separator. Braindead of course, and a real pain when you want to interface with other software.

Jerome
--
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 30 2011
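To make the interaction between decimal comma and field separator concrete, here is a small Python sketch of reading such a file. It is an illustration only: the function name `parse_localized_rows` and the sample data are hypothetical, and it assumes plain numbers without thousands separators.

```python
import csv
import io

def parse_localized_rows(text, delimiter=";", decimal=","):
    """Parse CSV text that uses a locale-specific delimiter and decimal mark.

    Cells that look like numbers once the decimal mark is normalized to "."
    become floats; everything else is kept as the original string.
    """
    rows = []
    for row in csv.reader(io.StringIO(text, newline=""), delimiter=delimiter):
        parsed = []
        for cell in row:
            try:
                parsed.append(float(cell.replace(decimal, ".")))
            except ValueError:
                parsed.append(cell)
        rows.append(parsed)
    return rows

# Hypothetical file as a French-locale Excel might write it.
sample = "item;price\nwidget;3,14\n"
rows = parse_localized_rows(sample)
```

Had the file used a comma for both jobs, `3,14` would have been split into two cells, which is why the separator changes along with the decimal mark.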
On 2011-05-30 14:40, Jérôme M. Berger wrote:
> Daniel Gibson wrote:
>> In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.
>>
>> This was/is(?) a bug in GtkRadiant, a level editor for Quake-like games, which uses printf or something to write the map files. The map compiler will reject them if decimals use a , instead of a . and stuff like that. (The workaround is to always use the standard locale, i.e. "LC_ALL=C gtkradiant" to start it.)
>
> Actually, that is the same issue: Excel outputs numbers to CSV in a locale-dependent way (probably using printf), which means that in some locales the decimal point is a comma, which prevents using it as a field separator. Braindead of course, and a real pain when you want to interface with other software.

Well, knowing Microsoft, they probably did it with printf (or fprintf or whatever), not realizing that it had locale issues, but once they figured it out, they wouldn't fix it because that would break backwards compatibility.

- Jonathan M Davis
May 30 2011
Daniel Gibson Wrote:
> In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.

Doesn't the C standard specify the locale to be "C" until you set it explicitly?
May 31 2011
On 31.05.2011 09:02, Kagamin wrote:
> Daniel Gibson Wrote:
>> In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.
>
> Doesn't the C standard specify the locale to be "C" until you set it explicitly?

At least on Linux it is usually set to whatever you specified on installation (usually you just say "I want a german/english/whatever installation" and the installer then sets the locales to de_DE.UTF8 or whatever). Applications use these settings to decide the language of their menus etc.
May 31 2011
On 2011-05-30 13:20, Simen Kjaeraas wrote:
> On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger <jeberger free.fr> wrote:
>> Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...
>>
>> Just thought I'd point it out in case you did not know.
>
> Fun? Gods, it's the most horrible idea I've witnessed in computing. If only they'd call it something other than CSV, at least - Comma Separated Values separated by semicolons? WTF?
>
> And the fantastic joy of opening one of those abominations in some other program... *shiver*

Well, then it isn't really CSV anymore. They definitely screwed the French on that one. Oh, you wanted your supposedly universal format to work with other programs? Sorry, no can do. But you can keep using Excel! See, no reason to be unhappy about it. :P

- Jonathan M Davis
May 30 2011
> Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...

Yeah, I've seen the semicolon in the wild before too, though I didn't know it was a locale thing.

My program solves it by confirming with the user. When you upload a file, it tries to parse it with a few different assumptions. The one that looks best is presented back to the user. ("Looks best" means it has headings that roughly match what we expect and a number of columns that's more or less consistent.)

It does charset the same way, actually. First, guess UTF-8. If that doesn't validate, assume it's Windows-1252 unless told otherwise.

The user then confirms the guesses and organizes the final data import. It's worked out pretty well so far aside from unsupported charsets; the users seem to like it.
May 30 2011
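The guessing strategy described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the poster's actual code: the names `decode_upload` and `guess_delimiter` are hypothetical, the "looks best" score here is only column-count consistency (the heading check is left out), and the result is still meant to be confirmed by the user.

```python
import csv
import io

def decode_upload(raw, fallback="cp1252"):
    """Guess UTF-8 first; if the bytes do not validate, assume the fallback."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" because a few byte values are undefined in cp1252.
        return raw.decode(fallback, errors="replace"), fallback

def guess_delimiter(text, candidates=(",", ";", "\t")):
    """Return the candidate whose parse has the most consistent column count."""
    best, best_score = None, -1
    for delim in candidates:
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
        rows = [row for row in reader if row]
        if not rows or len(rows[0]) < 2:
            continue  # a single column means this delimiter did not split anything
        width = len(rows[0])
        consistent = sum(1 for row in rows if len(row) == width)
        score = consistent * width
        if score > best_score:
            best, best_score = delim, score
    return best

# Hypothetical upload: semicolon-separated, Windows-1252 encoded.
upload = "name;price\nCaf\u00e9;1,50\nTea;2,00\n".encode("cp1252")
text, charset = decode_upload(upload)
delimiter = guess_delimiter(text)
```

Trying UTF-8 first works because text in a legacy single-byte charset that contains non-ASCII characters rarely happens to be valid UTF-8, so a successful validation is strong evidence.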
On 2011-05-30 19:57, "Jérôme M. Berger" wrote:
> Adam D. Ruppe wrote:
>> Kagamin wrote:
>>> Maybe it's his cgi lib? :)
>>> Client is free to send requests in any encoding, I suppose.
>>
>> In practice, that hasn't been a problem because browsers tend to send requests in the same encoding as the html you served.
>>
>> Since the D program always outputs utf8, the browsers all send back utf8 too.
>>
>> The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid UTF-8 sequence.
>>
>> Converting that is easy enough though.
>
> Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...
>
> Just thought I'd point it out in case you did not know.
>
> Jerome

Yeah, that is a nightmare. I tried SYLK (symbolic link) as well; it's something like CSV but more advanced. It didn't work out that well either. I ended up using real Excel documents with the help of the rubygem "spreadsheet".

--
/Jacob Carlborg
May 30 2011
I suggest looking into ICU if you're doing this stuff. I believe there's even a wrapper somewhere in the Mango tree on DSource.

On May 29, 2011, at 7:21 PM, Adam D. Ruppe wrote:
> I've encountered some problems with other charsets recently. Phobos has a std.encoding that can do some useful stuff, but there are some encodings I've seen in the wild that it can't handle (indeed, the list it does support is fairly short).
>
> I used GNU iconv for one of my projects and it works for me, but I wonder: is anyone planning to add more charset support to Phobos? (Alternatively, am I missing something already there?)
>
> If not, maybe I'll do a few myself. I've never actually written code to do this, but it can't be rocket science. I suspect it's more tedious than anything else.
May 30 2011
Adam D. Ruppe Wrote:
> The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252.

I suppose it's the system ANSI encoding, which is locale-dependent; you can see the list of ANSI encodings for different locales somewhere in MSDN.
May 31 2011
Adam D. Ruppe Wrote:
> The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid UTF-8 sequence.

The client usually sends information about its locale; from this info you can infer the ANSI encoding.
May 31 2011
Daniel Gibson Wrote:
> On 31.05.2011 09:02, Kagamin wrote:
>> Daniel Gibson Wrote:
>>> In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.
>>
>> Doesn't the C standard specify the locale to be "C" until you set it explicitly?
>
> At least on Linux it is usually set to whatever you specified on installation (usually you just say "I want a german/english/whatever installation" and the installer then sets the locales to de_DE.UTF8 or whatever). Applications use these settings to decide the language of their menus etc.

According to N1425, 7.11.1.1, paragraph 4:

  At program startup, the equivalent of
  setlocale(LC_ALL, "C");
  is executed.

Fun fact is MS conforms with this specification.
May 31 2011
On 31.05.2011 09:12, Kagamin wrote:
> Daniel Gibson Wrote:
>> On 31.05.2011 09:02, Kagamin wrote:
>>> Daniel Gibson Wrote:
>>>> In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.
>>>
>>> Doesn't the C standard specify the locale to be "C" until you set it explicitly?
>>
>> At least on Linux it is usually set to whatever you specified on installation (usually you just say "I want a german/english/whatever installation" and the installer then sets the locales to de_DE.UTF8 or whatever). Applications use these settings to decide the language of their menus etc.
>
> According to N1425, 7.11.1.1, paragraph 4: At program startup, the equivalent of setlocale(LC_ALL, "C"); is executed. Fun fact is MS conforms with this specification.

So they break it deliberately in Excel? Smart.
May 31 2011
Daniel Gibson Wrote:
>> According to N1425, 7.11.1.1, paragraph 4: At program startup, the equivalent of setlocale(LC_ALL, "C"); is executed. Fun fact is MS conforms with this specification.
>
> So they break it deliberately in Excel? Smart.

Excel deliberately localizes data presented to the user. Wouldn't it be strange for a user to work with the C locale (Excel users aren't programmers)? It even translates builtin function names :)

As this presentation is quite customizable, I doubt it's done by the C runtime. I think it just gets string values from cells during CSV generation.
May 31 2011
On 31.05.2011 13:12, Kagamin wrote:
> Daniel Gibson Wrote:
>>> According to N1425, 7.11.1.1, paragraph 4: At program startup, the equivalent of setlocale(LC_ALL, "C"); is executed. Fun fact is MS conforms with this specification.
>>
>> So they break it deliberately in Excel? Smart.
>
> Excel deliberately localizes data presented to the user. Wouldn't it be strange for a user to work with the C locale (Excel users aren't programmers)? It even translates builtin function names :)
>
> As this presentation is quite customizable, I doubt it's done by the C runtime. I think it just gets string values from cells during CSV generation.

I'm not talking about representing the values on the screen - I'm talking about the format of CSV files.

And I find translated function names pretty strange. I'm wondering how well that works when opening a file with another locale etc.
May 31 2011
Daniel Gibson Wrote:
>> As this presentation is quite customizable, I doubt it's done by the C runtime. I think it just gets string values from cells during CSV generation.
>
> I'm not talking about representing the values on the screen - I'm talking about the format of CSV files. And I find translated function names pretty strange. I'm wondering how well that works when opening a file with another locale etc.

Isn't it natural to get a string from a cell and put it into the output CSV stream? The UI also gets a string from a cell and presents it.
May 31 2011
On 31.05.2011 14:54, Kagamin wrote:
> Daniel Gibson Wrote:
>>> As this presentation is quite customizable, I doubt it's done by the C runtime. I think it just gets string values from cells during CSV generation.
>>
>> I'm not talking about representing the values on the screen - I'm talking about the format of CSV files. And I find translated function names pretty strange. I'm wondering how well that works when opening a file with another locale etc.
>
> Isn't it natural to get a string from a cell and put it into the output CSV stream? The UI also gets a string from a cell and presents it.

It's natural to have an internal representation of the value of a field and different representations in the UI (this could be dependent on locale settings etc) and for saving it on the disk (this should really not depend on a locale, but be portable). Ideally the different representations can be converted to each other in a lossless way.

So the representation on the disk (as CSV, .xls, XML, whatever) doesn't have to match the screen representation.
May 31 2011
Daniel Gibson Wrote:
> And I find translated function names pretty strange. I'm wondering how well that works when opening a file with another locale etc.

I've checked Excel 2007; it seems to store numbers and function names (in xlsx) in locale-independent form. I don't know how it works in older versions.
May 31 2011
On 31.05.2011 15:02, Kagamin wrote:
> Daniel Gibson Wrote:
>> And I find translated function names pretty strange. I'm wondering how well that works when opening a file with another locale etc.
>
> I've checked Excel 2007; it seems to store numbers and function names (in xlsx) in locale-independent form. I don't know how it works in older versions.

Ok. Everything else would be a really unusable mess, especially for international companies.
May 31 2011