digitalmars.D.bugs - [Issue 3465] New: isIdeographic can be wrong in std.xml
- d-bugmail puremagic.com (117/117) Nov 01 2009 http://d.puremagic.com/issues/show_bug.cgi?id=3465
- d-bugmail puremagic.com (17/17) Nov 01 2009 http://d.puremagic.com/issues/show_bug.cgi?id=3465
- d-bugmail puremagic.com (10/10) May 23 2010 http://d.puremagic.com/issues/show_bug.cgi?id=3465
- d-bugmail puremagic.com (15/15) May 23 2010 http://d.puremagic.com/issues/show_bug.cgi?id=3465
http://d.puremagic.com/issues/show_bug.cgi?id=3465

           Summary: isIdeographic can be wrong in std.xml
           Product: D
           Version: 2.035
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: y0uf00bar gmail.com

The std.xml function isIdeographic failed my parser on one of the XML
conformance tests for the character 0x4E00.

// As implemented in XML Piece Parser Project, http://source.miryn.org/
// but I took it from std.xml

// WRONG in std.xml
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];

// RIGHT, because for the lookup function,
// the table data range pairs should be ordered!
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];

// PERFORMANCE SUGGESTION
// Also, lookup is best done for tables that are larger.
// For smaller tables, like this one, or a single character,
// surely a hard-coded search will be faster.
// Surely not much more code is generated for this,
// and it is faster, since there is no function call to lookup and no array slices are used.

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3007 && c <= 0x3029)
        return true;
    if (c >= 0x4E00 && c <= 0x9FA5)
        return true;
    return false;
}

// Only a suggestion here.
// isChar has to be called for every single character in the document, and
// it must be worth a bit of optimisation,
// especially for common cases.

/**
 * Returns true if the character is a character according to the XML standard.
 * Character references must refer to one of these.
 * Any Unicode character, excluding surrogate blocks, FFFE and FFFF.
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * Avoid [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
 * Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
 *
 * Params:
 *    c = the character to be tested
 * The standard ASCII case gets at most 3 value comparisons.
 */
bool isChar(dchar c)
{
    if (c <= 0xD7FF)
    {
        if (c >= 0x20)
        {
            if (c >= 0x7F)
            {
                if (c <= 0x84)
                    return false;
                if (c >= 0x86)
                {
                    if (c <= 0x9F)
                        return false;
                }
            }
            return true;
        }
        switch(c)
        {
        case 0xA:
        case 0x9:
        case 0xD:
            return true;
        default:
            return false;
        }
    }
    else if (c >= 0xE000)
    {
        if (c < 0xFFFE)
        {
            if (c >= 0xFDD0 && c <= 0xFDEF)
                return false;
            return true;
        }
        if (c >= 0x10000)
        {
            if (c <= 0x10FFFF)
            {
                /* some conformance tests have the 0x10FFFF
                if ((c & 0xFFFE) == 0xFFFE)
                {
                    return false;
                }
                */
                return true;
            }
        }
    }
    return false;
}

// Most digits are expected to be ASCII ones
bool isDigit(dchar c)
{
    if (c <= 0x0039 && c >= 0x0030)
        return true;
    else
        return lookup(DigitTable,c);
}

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
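
The report says the range pairs must be ordered for the lookup function. A minimal sketch of why, assuming the lookup is a binary search over a flat array of inclusive [low, high] pairs: `lookupRange` below is a hypothetical stand-in written for illustration, not a copy of the std.xml routine, and the two table names are likewise made up.

```d
import std.stdio : writeln;

// Hypothetical helper: binary search over a flat array of inclusive
// [low, high] pairs. It is only correct when the pairs are sorted
// in ascending order and do not overlap.
bool lookupRange(const(dchar)[] table, dchar c)
{
    while (table.length != 0)
    {
        // Index of the low bound of the middle pair (always even).
        size_t m = (table.length >> 1) & ~cast(size_t) 1;
        if (c < table[m])
            table = table[0 .. m];        // continue in the pairs below
        else if (c > table[m + 1])
            table = table[m + 2 .. $];    // continue in the pairs above
        else
            return true;                  // low <= c <= high
    }
    return false;
}

void main()
{
    // Same pairs as in the report, in the two orders it discusses.
    immutable dchar[] unordered = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
    immutable dchar[] ordered   = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

    // With the unordered table the search discards the half that
    // actually contains 0x4E00, so the character is reported as absent.
    assert(!lookupRange(unordered, 0x4E00));
    assert(lookupRange(ordered, 0x4E00));

    writeln("unordered finds 0x4E00: ", lookupRange(unordered, 0x4E00));
    writeln("ordered finds 0x4E00:   ", lookupRange(ordered, 0x4E00));
}
```

Under that assumption, the search first probes the middle pair (0x3007, 0x3007), sees that 0x4E00 is larger, and then only looks at the pairs to the right, which in the unordered table no longer include 0x4E00-0x9FA5.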
Nov 01 2009
http://d.puremagic.com/issues/show_bug.cgi?id=3465

// A check on my code indicates afternoon doziness, so here is the better version

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c <= 0x3029 && c >= 0x3021 )
        return true;
    if (c <= 0x9FA5 && c >= 0x4E00)
        return true;
    return false;
}
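
For readers comparing the two hard-coded versions in this thread, here is a small sketch of where they differ. The names `isIdeographicDraft` and `isIdeographicFixed` are hypothetical labels for the first and second posted versions, and the loop bound is an illustrative choice that covers all three ranges.

```d
import std.stdio : writeln;

// First posted version: the second test starts at 0x3007 instead of 0x3021.
bool isIdeographicDraft(dchar c)
{
    if (c == 0x3007) return true;
    if (c >= 0x3007 && c <= 0x3029) return true;
    if (c >= 0x4E00 && c <= 0x9FA5) return true;
    return false;
}

// Second posted version: 0x3007, 0x3021-0x3029, 0x4E00-0x9FA5.
bool isIdeographicFixed(dchar c)
{
    if (c == 0x3007) return true;
    if (c <= 0x3029 && c >= 0x3021) return true;
    if (c <= 0x9FA5 && c >= 0x4E00) return true;
    return false;
}

void main()
{
    // Range boundaries are accepted by both versions.
    immutable dchar[] boundaries = [0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];
    foreach (c; boundaries)
    {
        assert(isIdeographicDraft(c));
        assert(isIdeographicFixed(c));
    }

    // Count the code points on which the two versions disagree.
    size_t disagreements = 0;
    foreach (uint i; 0 .. 0x10000)
    {
        auto c = cast(dchar) i;
        if (isIdeographicDraft(c) != isIdeographicFixed(c))
            ++disagreements;
    }

    // The draft additionally accepts 0x3008 through 0x3020.
    assert(disagreements == 0x3020 - 0x3008 + 1);
    writeln("code points where the versions disagree: ", disagreements);
}
```

The only disagreement is the gap 0x3008-0x3020, which the first version accepts and the second rejects.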
Nov 01 2009
http://d.puremagic.com/issues/show_bug.cgi?id=3465

Shin Fujishiro <rsinfu gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |rsinfu gmail.com
         AssignedTo|nobody puremagic.com        |rsinfu gmail.com
May 23 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3465

Shin Fujishiro <rsinfu gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

Fixed in svn r1552. Thanks for your contribution!

Excuse me: I removed a certain part of your code from the actual commit. The
contributed code took care of newer Unicode standards. I like new things, but
as far as supporting XML 1.0 goes, we have to stick to Unicode 2.0.
May 23 2010