digitalmars.D - BOMs and std.stream
- Carlos Santander B. (9/9) Nov 20 2004 Currently std.stream doesn't recognize BOMs, and while it might not be a...
- J C Calvarese (14/24) Nov 20 2004 I think there is a need for something like this in std.stream. I ran
- Ben Hinkle (6/30) Nov 21 2004 I like the enum idea. It would be nice if the stream remembered the BOM ...
- Kris (48/48) Nov 21 2004 The ICU project provides this kind of thing: (from the documentation)
- Stewart Gordon (7/14) Nov 22 2004 The problem is that std.stream seems to be designed to work with binary
Currently std.stream doesn't recognize BOMs, and while it might not be a big thing, there're times where it could be important.

I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried to read it using Miguel Ferreira Simões' XML library, it complained about the file not being well-formed. Further testing made me discover that removing the BOM solved the problem. So it's a problem, ATM.

I think std.stream should change somehow, but I just don't know how.

-----------------------
Carlos Santander Bernal
Nov 20 2004
Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs, and while it might not be a big
> thing, there're times where it could be important.
> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
> to read it using Miguel Ferreira Simões' XML library, it complained about the
> file not being well-formed. Further testing made me discover that removing the
> BOM solved the problem. So it's a problem, ATM.
> I think std.stream should change somehow, but I just don't know how.
>
> -----------------------
> Carlos Santander Bernal

I think there is a need for something like this in std.stream. I ran into this challenge a while back, and I didn't really think of a good solution at the time. But I just came up with an idea for a fix (it's not complicated, but I think it'd work).

We could add a function called something like getBOM. If a BOM is present, it will return a string with the BOM and move the current location past the BOM. If there isn't a BOM, an empty string is returned and the current location doesn't change.

That's just one idea for a design. A similar idea is that an enum could be returned instead of a string.

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Nov 20 2004
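The getBOM idea above, in its enum-returning form, could be sketched roughly as follows. This is only an illustration under stated assumptions: the names BOM and detectBOM are hypothetical and not part of std.stream, and the sketch works on a plain byte buffer rather than a Stream so that it stays self-contained. The returned skip count plays the role of "move the current location past the BOM".

```d
import std.stdio : writeln;

// Hypothetical names throughout: neither BOM nor detectBOM exists in std.stream.
enum BOM { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// Reports which BOM starts `data` and sets `skip` to its length in bytes.
/// `skip` stays 0 when no BOM is found, so the caller's position is unchanged.
BOM detectBOM(const(ubyte)[] data, out size_t skip)
{
    // Test UTF-32 first: the UTF-32 LE signature FF FE 00 00 begins with
    // the UTF-16 LE signature FF FE. (A UTF-16 LE stream whose first
    // character is U+0000 would be misread; the sketch accepts that.)
    if (data.length >= 4)
    {
        if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        {
            skip = 4;
            return BOM.utf32le;
        }
        if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        {
            skip = 4;
            return BOM.utf32be;
        }
    }
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    {
        skip = 3;
        return BOM.utf8;
    }
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE) { skip = 2; return BOM.utf16le; }
        if (data[0] == 0xFE && data[1] == 0xFF) { skip = 2; return BOM.utf16be; }
    }
    return BOM.none; // `out` parameters start at .init, so skip is 0 here
}

void main()
{
    // A UTF-8 file with a BOM in front of the XML text "<a/>".
    ubyte[] file = [0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x2F, 0x3E];
    size_t skip;
    BOM bom = detectBOM(file, skip);
    auto text = cast(const(char)[]) file[skip .. $];
    writeln(bom, " ", skip, " ", text); // prints: utf8 3 <a/>
}
```

An XML parser handed file[skip .. $] instead of the whole buffer would no longer see the three signature bytes in front of the first "<".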
In article <cnp7hv$2tls$1 digitaldaemon.com>, J C Calvarese says...
> Carlos Santander B. wrote:
>> Currently std.stream doesn't recognize BOMs, and while it might not be a big
>> thing, there're times where it could be important.
>> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
>> to read it using Miguel Ferreira Simões' XML library, it complained about the
>> file not being well-formed. Further testing made me discover that removing the
>> BOM solved the problem. So it's a problem, ATM.
>> I think std.stream should change somehow, but I just don't know how.
>>
>> -----------------------
>> Carlos Santander Bernal
>
> I think there is a need for something like this in std.stream. I ran
> into this challenge a while back, and I didn't really think of a good
> solution at the time. But I just came up with an idea for a fix (it's
> not complicated, but I think it'd work).
>
> We could add a function called something like getBOM. If a BOM is
> present, it will return a string with the BOM and move the current
> location past the BOM. If there isn't a BOM, an empty string is returned
> and the current location doesn't change.
>
> That's just one idea for a design. A similar idea is that an enum could
> be returned instead of a string.
>
> --
> Justin (a/k/a jcc7)
> http://jcc_7.tripod.com/d/

I like the enum idea. It would be nice if the stream remembered the BOM in the UTF-16 case so that the code that reads strings can swap byte orders if needed. Otherwise the user is hosed if the stream is in the wrong byte-ordering. I sense another std.stream project in the next few days...

-Ben
Nov 21 2004
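The byte-order point above can be illustrated with a small sketch that reads UTF-16 code units according to the BOM it finds. The helper name decodeUTF16 is hypothetical, and the fallback to big-endian when no BOM is present is an assumption of this sketch, not something the thread specifies. It works on a byte buffer rather than a Stream to stay self-contained.

```d
import std.stdio : writeln;

/// Hypothetical helper: assembles UTF-16 code units from raw bytes,
/// choosing the byte order from a leading BOM and skipping that BOM.
/// Without a BOM, big-endian is assumed (an assumption of this sketch).
wstring decodeUTF16(const(ubyte)[] data)
{
    bool littleEndian = false;
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE)
        {
            littleEndian = true;
            data = data[2 .. $];
        }
        else if (data[0] == 0xFE && data[1] == 0xFF)
        {
            data = data[2 .. $];
        }
    }

    wchar[] units;
    // A trailing odd byte, if any, is ignored.
    for (size_t i = 0; i + 1 < data.length; i += 2)
    {
        ushort lo = littleEndian ? data[i] : data[i + 1];
        ushort hi = littleEndian ? data[i + 1] : data[i];
        units ~= cast(wchar)((hi << 8) | lo);
    }
    return units.idup;
}

void main()
{
    // The text "Hi" stored in both byte orders, each with its BOM.
    ubyte[] le = [0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00];
    ubyte[] be = [0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69];
    assert(decodeUTF16(le) == "Hi"w);
    assert(decodeUTF16(be) == "Hi"w);
    writeln(decodeUTF16(le));
}
```

Both buffers decode to the same string, which is the behaviour a stream would get by remembering the BOM and swapping bytes only when the file's order differs from the one it reads in.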
The ICU project provides this kind of thing: (from the documentation)

    static final char[] detectSignature (void[] input)

    Detects Unicode signature byte sequences at the start of the byte stream and returns the charset name of the indicated Unicode charset. A null is returned where no Unicode signature is recognized.

    A caller can create a UConverter using the charset name. The first code unit (wchar) from the start of the stream will be U+FEFF (the Unicode BOM/signature character) and can usually be ignored.

You might take a look at the breadth of that project; you'll find it covers pretty much anything you'll need for regular Unicode processing, and then some ...

http://www.dsource.org/forums/viewtopic.php?t=420

"J C Calvarese" <jcc7 cox.net> wrote in message news:cnp7hv$2tls$1 digitaldaemon.com...
| Carlos Santander B. wrote:
| > Currently std.stream doesn't recognize BOMs, and while it might not be a big
| > thing, there're times where it could be important.
| > I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
| > to read it using Miguel Ferreira Simões' XML library, it complained about the
| > file not being well-formed. Further testing made me discover that removing the
| > BOM solved the problem. So it's a problem, ATM.
| > I think std.stream should change somehow, but I just don't know how.
| >
| > -----------------------
| > Carlos Santander Bernal
|
| I think there is a need for something like this in std.stream. I ran
| into this challenge a while back, and I didn't really think of a good
| solution at the time. But I just came up with an idea for a fix (it's
| not complicated, but I think it'd work).
|
| We could add a function called something like getBOM. If a BOM is
| present, it will return a string with the BOM and move the current
| location past the BOM. If there isn't a BOM, an empty string is returned
| and the current location doesn't change.
|
| That's just one idea for a design. A similar idea is that an enum could
| be returned instead of a string.
|
| --
| Justin (a/k/a jcc7)
| http://jcc_7.tripod.com/d/
Nov 21 2004
Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs, and while it might not be a big
> thing, there're times where it could be important.
> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
> to read it using Miguel Ferreira Simões' XML library, it complained about the
> file not being well-formed. Further testing made me discover that removing the
> BOM solved the problem. So it's a problem, ATM.

The problem is that std.stream seems to be designed to work with binary files, with a few text capabilities thrown in but not to this level.

> I think std.stream should change somehow, but I just don't know how.

My thought is to develop a new set of classes for working with text files. I posted something on this a while back:

http://www.digitalmars.com/drn-bin?wwwnews?digitalmars.D/6089

Stewart.
Nov 22 2004