digitalmars.D - Auto-UTF-detection - Feature Request
- Arcane Jill (46/46) Jul 25 2004 In the source text analysis phase, the compiler does this (according to ...
- Arcane Jill (40/40) Jul 25 2004 Actually, it's just occurred to me that it's /really easy/ to tell the e...
- J C Calvarese (21/29) Jul 25 2004 All is not lost. I downloaded and installed TextPad. I observed the
- Walter (4/10) Jul 25 2004 Send them a bug report!
- Walter (9/15) Jul 25 2004 favorite
- Arcane Jill (18/20) Jul 26 2004 This is incorrect. UTF-16LE does not require a BOM.
- Walter (4/6) Jul 26 2004 source file
- Arcane Jill (14/38) Jul 26 2004 Come on, that's a /tiny/ function, and I've written it all for you. The
- Arcane Jill (4/5) Jul 26 2004 I should have written that in C, shouldn't I? Never mind - I think just
- Walter (10/21) Jul 26 2004 input, it
- Arcane Jill (5/8) Jul 26 2004 Fair enough. No hurry. I'm sure we're all in agreement that more importa...
- James McComb (8/18) Jul 26 2004 Okay, I agree that the overhead is non-existent.
- James McComb (11/16) Jul 25 2004 I like this rule. It says that D is not going to try and *guess* the
In the source text analysis phase, the compiler does this (according to the manual): "The source text is assumed to be in UTF-8, unless one of the following BOMs (Byte Order Marks) is present at the beginning of the source text". However, it is heuristically possible to distinguish between the various UTFs even /without/ a BOM. Okay, so it is /theoretically/ possible for an ambiguity to exist, but those edge cases are going to be almost infinitesimally rare for text files in general, and I'd say zero for D source files (which will consist mostly of ASCII characters). I say, try to auto-detect the difference. Here's how ya do it:

Since a D source file mostly consists of ASCII characters, excluding NULL, any 4-byte-aligned fragment of a D source file is likely to look like one of these, where xx stands for any non-zero byte and 00 stands for a zero byte:

    UTF-8:     xx xx xx xx
    UTF-16LE:  xx 00 xx 00
    UTF-16BE:  00 xx 00 xx
    UTF-32LE:  xx 00 00 00
    UTF-32BE:  00 00 00 xx

Simply by analysing a few such four-byte chunks (say, the first 1024 bytes of the file) and counting how many fit each pattern, you can easily determine the most likely encoding. This is a statistical test, obviously, since not /all/ bytes will be ASCII, but it will catch all but a few extreme edge cases. If you haven't made your mind up within the first 1024 bytes of the file, read the /next/ 1024 bytes and try again, and so on. (A sketch of this follows at the end of this post.)

Alternatively, if that sounds too hard, there's an even easier (but less efficient) algorithm:

1) Assume UTF-32LE. Validate.
2) Assume UTF-32BE. Validate.
3) Assume UTF-16BE. Validate.
4) Assume UTF-16LE. Validate.
5) Assume UTF-8. Validate.

If precisely one of these validations succeeds, you've sussed it. If more than one succeeds, it's still ambiguous, but the chances of this happening are microscopic. If none succeed, the source file is not UTF.

Arcane Jill
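A rough sketch of the chunk-counting heuristic, in D. This is only illustrative - the function name, the Guess enum, and the fixed 1024-byte window are my own choices here, not anything taken from DMD:

enum Guess { utf8, utf16le, utf16be, utf32le, utf32be }

// Vote on the encoding by matching each 4-byte-aligned chunk in the
// first 1024 bytes against the zero-byte patterns tabulated above.
Guess guessEncoding(const(ubyte)[] src)
{
    size_t[5] votes;
    size_t len = src.length < 1024 ? src.length : 1024;
    len &= ~cast(size_t) 3;                // whole 4-byte chunks only
    for (size_t i = 0; i < len; i += 4)
    {
        immutable a = src[i]   != 0, b = src[i+1] != 0,
                  c = src[i+2] != 0, d = src[i+3] != 0;
        if (a && b && c && d)         ++votes[Guess.utf8];    // xx xx xx xx
        else if (a && !b && c && !d)  ++votes[Guess.utf16le]; // xx 00 xx 00
        else if (!a && b && !c && d)  ++votes[Guess.utf16be]; // 00 xx 00 xx
        else if (a && !b && !c && !d) ++votes[Guess.utf32le]; // xx 00 00 00
        else if (!a && !b && !c && d) ++votes[Guess.utf32be]; // 00 00 00 xx
        // chunks matching no pattern (non-ASCII text) simply cast no vote
    }
    // The pattern with the most votes wins; a tie would call for reading
    // the next 1024 bytes, as suggested above.
    size_t best = 0;
    foreach (i; 1 .. votes.length)
        if (votes[i] > votes[best]) best = i;
    return cast(Guess) best;
}

Counting votes, rather than requiring every chunk to match, is what makes this tolerant of the occasional non-ASCII byte.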
Jul 25 2004
Actually, it's just occurred to me that it's /really easy/ to tell the encoding of a D source file, because of the fact that the very first character of a D source file MUST be either a UTF BOM or a non-NULL ASCII character. So all we have to do is to test for each of these contingencies. Here's a short function that does just that (see the sketch at the end of this post).

This is an important issue, because I just did a quick test. Using my favorite text editor (TextPad), I saved a text file in UTF-16LE. I then examined the saved file with a hex editor. I can confirm that the file was saved in UTF-16LE, but, critically, /without a BOM/. I don't know what other text editors do, but clearly, if there is even a remote chance that source files will get saved without a BOM, then we really ought to be able to compile those source files!

Arcane Jill
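A minimal sketch of such a function in D, assuming at least four bytes of source are on hand (shorter files are dealt with further down the thread); the Encoding enum and function name are illustrative only:

enum Encoding { utf8, utf16le, utf16be, utf32le, utf32be, invalid }

// Decide the encoding from the first four bytes: either we find a BOM,
// or the first character must be non-NULL ASCII and the placement of
// the zero bytes around it gives the game away.
Encoding detectEncoding(const(ubyte)[] s)
{
    assert(s.length >= 4);  // short files handled separately

    // BOM checks. FF FE 00 00 is unambiguous, since a NULL character
    // cannot appear in D source.
    if (s[0] == 0xFF && s[1] == 0xFE)
        return (s[2] == 0 && s[3] == 0) ? Encoding.utf32le   // FF FE 00 00
                                        : Encoding.utf16le;  // FF FE
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
        return Encoding.utf32be;                             // 00 00 FE FF
    if (s[0] == 0xFE && s[1] == 0xFF)
        return Encoding.utf16be;                             // FE FF
    if (s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return Encoding.utf8;                                // EF BB BF

    // No BOM: the first character is non-NULL ASCII.
    if (s[0] != 0)
    {
        if (s[1] != 0) return Encoding.utf8;                 // xx xx
        return (s[2] == 0 && s[3] == 0) ? Encoding.utf32le   // xx 00 00 00
                                        : Encoding.utf16le;  // xx 00 xx 00
    }
    if (s[1] != 0) return Encoding.utf16be;                  // 00 xx
    if (s[2] == 0 && s[3] != 0) return Encoding.utf32be;     // 00 00 00 xx
    return Encoding.invalid;  // cannot be a D source file in any UTF
}

The compiler would call something like this exactly once per source file, on just its first four bytes, so the overhead is effectively nil.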
Jul 25 2004
Arcane Jill wrote:

> This is an important issue, because I just did a quick test. Using my favorite text editor (TextPad), I saved a text file in UTF-16LE. I then examined the saved file with a hex editor. I can confirm that the file was saved in UTF-16LE, but, critically, /without a BOM/. I don't know what other text editors do, but clearly, if there is even a remote chance that source files will get saved without a BOM, then we really ought to be able to compile those source files!

All is not lost. I downloaded and installed TextPad. I observed the problem of no BOMs, but I also found a solution:

* Choose Configure/Preferences from the menubar.
* Find Document Classes/Default in the tree on the left (I guess you might have to choose something else here if you've set up a "D mode").
* Under "Document class options", tick [x] Write Unicode and UTF-8 BOM.

Now when you save again, it'll add the BOMs:

    Unicode:              FF FE     (UTF-16LE)
    Unicode (big endian): FE FF     (UTF-16BE)
    UTF-8:                EF BB BF  (UTF-8)

(A sketch of prepending one of these by hand follows at the end of this post.)

So there's no problem using TextPad. (I don't know why the BOMs wouldn't be enabled by default, but that's a whole other issue.) The BOMs are standard, right? If a supposedly Unicode-capable editor won't add BOMs, it might not really be considered Unicode-capable. If a person wants to use one of those editors, fine. But please stick to UTF-8.

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
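If you're stuck with an editor that won't write BOMs, a build step can add one. A hypothetical D helper using the UTF-8 BOM bytes from the table above (the function is made up for illustration; it's not part of Phobos or TextPad):

import std.file : write;

// Save text with the UTF-8 BOM prepended, so BOM-sniffing tools see it.
void saveUtf8WithBom(string path, const(char)[] text)
{
    // EF BB BF is the UTF-8 BOM from the table above.
    const(ubyte)[] data = [ubyte(0xEF), ubyte(0xBB), ubyte(0xBF)];
    data ~= cast(const(ubyte)[]) text;  // append the text bytes
    write(path, data);                  // std.file.write takes const void[]
}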
Jul 25 2004
"J C Calvarese" <jcc7 cox.net> wrote in message news:ce1d18$2sa5$1 digitaldaemon.com...So there's no problem using TextPad. (I don't know why the BOMs wouldn't be enabled by default, but that's a whole other issue.)Send them a bug report!The BOMs are standard, right?Yes.If a supposedly Unicode-capable won't add BOMs, it might not really be considered Unicode-capable. If a person want to use one of those editors, fine. But please stick to UTF-8.
Jul 25 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:ce1848$2p2j$1 digitaldaemon.com...This is an important issue, because I just did a quick test. Using myfavoritetext editor (TextPad), I saved a text file in UTF-16LE. I then examinedthesaved file with a hex editor. I can confirm that the file was saved inUTF-16LE,but, critically, /without a BOM/. I don't know what other text editors do,butclearly, if there is even a remote chance that source files will get saved without a BOM, then we really ought to be able to compile those sourcefiles! Ack. What are the Textpad programmers thinking? They need to fix Textpad to put out the BOM.
Jul 25 2004
In article <ce21g6$93r$1 digitaldaemon.com>, Walter says...

> Ack. What are the Textpad programmers thinking? They need to fix Textpad to put out the BOM.

This is incorrect. UTF-16LE does not require a BOM.

Almost all questions regarding UTFs and BOMs can be answered by heading over to the Unicode web site (www.unicode.org) and clicking on "FAQ". I cite from this FAQ here: "In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor /permitted/" (italics present in original FAQ). This doesn't apply to D source code, of course, since D source files are not "marked" in any way; however, read that FAQ. Everything about BOMs in that FAQ tells you that BOMs are "useful". Nowhere does it say they are "required" - and, as noted, in some cases they are even prohibited.

Fortunately, since D syntax requires that the first character of a D source file /must/ be an ASCII character, detecting the encoding is quick and easy. See my other posts on this thread for source code which works.

You *WILL* encounter BOM-less source files in the wild. Insisting that the BOM be there is in defiance of Unicode rules, and is just going to cripple DMD.

Arcane Jill
Jul 26 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:ce2onb$pug$1 digitaldaemon.com...Fortunately, since D syntax requires that the first character of a Dsource file/must/ be an ASCII character, detecting the encoding is quick and easy.Yes, and that's a key insight that I'd missed. Thanks!
Jul 26 2004
In article <ce1848$2p2j$1 digitaldaemon.com>, Arcane Jill says...

Revised source - now handles files which are less than four bytes long (a sketch follows at the end of this post).

Come on, that's a /tiny/ function, and I've written it all for you. The "overhead" is to call it /once/ during the source text stage (and that's /instead of/, not as well as, the current detection routine). As its input, it needs only the first four bytes of the source file (fewer if the file size is less than four bytes).

You are correct in that many applications which are out there are buggy when it comes to Unicode, and equally correct that we should complain about that. But I've just given you a new /feature/ which is almost trivially small and which you can boast about freely - and which will save you from receiving a few misdirected bug reports in the future. Not even worth thinking about?

Arcane Jill
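A sketch of what the revision might look like in D, layered over the four-byte detectEncoding() sketched earlier in the thread; the names are again illustrative:

// Handle sources shorter than four bytes, then fall through to the
// four-byte logic shown earlier.
Encoding detectEncodingAnyLength(const(ubyte)[] s)
{
    if (s.length >= 4) return detectEncoding(s);
    if (s.length == 0) return Encoding.utf8;  // empty file: call it UTF-8
    if (s.length >= 2)
    {
        // Only UTF-8 and UTF-16 can fit a whole character in 2-3 bytes.
        if (s[0] == 0xFF && s[1] == 0xFE) return Encoding.utf16le; // BOM
        if (s[0] == 0xFE && s[1] == 0xFF) return Encoding.utf16be; // BOM
        if (s[0] != 0 && s[1] == 0) return Encoding.utf16le;       // xx 00
        if (s[0] == 0 && s[1] != 0) return Encoding.utf16be;       // 00 xx
    }
    // 1-3 bytes with no UTF-16 pattern: UTF-8 (this also covers a
    // bare 3-byte UTF-8 BOM, EF BB BF).
    return s[0] != 0 ? Encoding.utf8 : Encoding.invalid;
}

Only the length guard is new; everything from four bytes up takes the path already shown.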
Jul 26 2004
In article <ce2due$jos$1 digitaldaemon.com>, Arcane Jill says...

> Revised source - now handles files which are less than four bytes long:

I should have written that in C, shouldn't I? Never mind - I think just replacing "ubyte" with "unsigned char" should do it.

Jill
Jul 26 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:ce2due$jos$1 digitaldaemon.com...Come on, that's a /tiny/ function, and I've written it all for you. The "overhead" is to call it /once/ during the source text stage (and that's /instead of/, not as well as, the current detection routine). As itsinput, itneeds only the first four bytes of the source file (fewer if the file sizeisless than four bytes). You are correct in that many applications which are out there are buggywhen itcomes to Unicode, and equally correct that we should complain about that.ButI've just given you a new /feature/ which is almost trivially small andwhichyou can boast about freely - and which will save you from receiving a few misdirected bug reports in the future. Not even worth thinking about?I'd already added it to my todo list, Jill <g>. But these things sometimes have hidden gotchas, so I wanted to let it simmer for a bit. I've put in stuff too quickly before, and had to back it out later :-(
Jul 26 2004
In article <ce2gk5$l00$1 digitaldaemon.com>, Walter says...

> I'd already added it to my todo list, Jill <g>. But these things sometimes have hidden gotchas, so I wanted to let it simmer for a bit. I've put in stuff too quickly before, and had to back it out later :-(

Fair enough. No hurry. I'm sure we're all in agreement that more important bug-fixes should come first. Keep up the good work. :)

Jill
Jul 26 2004
Walter wrote:

> "Arcane Jill" <Arcane_member pathlink.com> wrote in message news:ce2due$jos$1 digitaldaemon.com...
>
>> Come on, that's a /tiny/ function, and I've written it all for you. The "overhead" is to call it /once/ during the source text stage (and that's /instead of/, not as well as, the current detection routine). As its input, it needs only the first four bytes of the source file...
>
> I'd already added it to my todo list, Jill <g>. But these things sometimes have hidden gotchas, so I wanted to let it simmer for a bit. I've put in stuff too quickly before, and had to back it out later :-(

Okay, I agree that the overhead is non-existent.

This may be a potential "gotcha" (but I don't know how significant it is): Walter implements this detection routine, but D compilers from other vendors don't. Then there would be D files that only compile on DMD. To prevent this from happening, the new Unicode detection algorithm needs to be explicit in the D spec.

James McComb
Jul 26 2004
Arcane Jill wrote:

> In the source text analysis phase, the compiler does this (according to the manual): "The source text is assumed to be in UTF-8, unless one of the following BOMs (Byte Order Marks) is present at the beginning of the source text".

I like this rule. It says that D is not going to try and *guess* the character encoding of the file.

Okay, maybe Walter can write some code that can guess the encoding correctly 99% of the time, but I don't think that it is worth complicating the compiler and slightly increasing compile times just to handle missing BOMs. Here's why:

Almost all the time, code will be written in UTF-8. If someone has gone to the trouble of writing their code in UTF-16 or UTF-32, they can go to the trouble of including a BOM in their file. After all, people are advised to use a BOM in those circumstances anyway.

James McComb
Jul 25 2004