D - Unicode Character and String Intrinsics
- Mark Evans (53/55) Mar 31 2003 I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the ...
- Mark Evans (2/3) Mar 31 2003 Typo: that was "why 'ubyte' and 'ubyte[]' do not suffice."
- Walter (37/93) Mar 31 2003 contents
- Mark Evans (25/35) Mar 31 2003 But the only use for raw bytes is precisely such low-level format
- Walter (21/47) Mar 31 2003 That's only partially true - the downside comes from needing high
- Mark Evans (43/71) Mar 31 2003 If I understand correctly, the translation is that it's better to let
- Sean L. Palmer (12/20) Apr 01 2003 But there is still concern for there to be a separate type, for function
- Ilya Minkov (54/59) Apr 10 2003 Wait... won't language-supported iterators fix a need for accessing the
- Helmut Leitner (8/13) Apr 11 2003 Being used to Perl, I think that the current D regex module has to be
- Matthew Wilson (1/5) Mar 31 2003 Agree. Let's have more char types
- Mark Evans (17/24) Mar 31 2003 Hi again Bill
- Matthew Wilson (14/38) Mar 31 2003 I'm sold. Where can I sign up?
- Peter Hercek (18/64) Mar 31 2003 Well, I went through character and code page problems too about a year
- Ilya Minkov (20/37) Apr 10 2003 ply;
- Bill Cox (9/14) Mar 31 2003 A maximalist wants many built-in features, from functional programming s...
- Matthew Wilson (11/27) Mar 31 2003 And a pragmatist wants as much as is possible in libraries, but what he/...
- Mark Evans (11/11) Mar 31 2003 Bill the point is that trying to paint me this or that color, instead of
- Bill Cox (36/48) Apr 01 2003 Ok, I'll bite... Why do you feel I'm blowing smoke in your face?
- Mark Evans (8/8) Apr 01 2003 Please don't turn this into yet another thread about DataDraw or dubious
- Helmut Leitner (14/18) Apr 01 2003 When I read one of your postings a week ago, I googled for DataDraw
- Bill Cox (100/126) Apr 02 2003 I'm not trying to advertise DataDraw. In fact, I'd love to see D
- Helmut Leitner (34/134) Apr 02 2003 That means its dead outside of the heads of its few experts and
- Bill Cox (75/243) Apr 03 2003 I agree with all your comments.
- Mark Evans (26/28) Mar 31 2003 The compiler is open-source. Contributions are welcome. (Wasn't it you...
- Bill Cox (18/23) Apr 01 2003 I wrote a toy compiler to test out some ideas in a few days off, not a D...
- Mark Evans (25/28) Apr 01 2003 Unicode intrinsics make D a simple language. That is the point of havin...
- Matthew Wilson (23/51) Apr 01 2003 Mark
- Luna Kid (6/6) Apr 03 2003 Hmm... Mark, appreciating all your informedness and
- Walter (8/13) May 21 2003 I have
- J. Daniel Smith (7/20) May 22 2003 If you've got a UTF-32 string, UTF-16 is really only needed when calling
- Matthew Wilson (40/95) Mar 31 2003 One minor point:
- Mark Evans (49/49) Mar 31 2003 Walter -
- Matthew Wilson (42/91) Mar 31 2003 Qualifying this again with the stipulation that I am far from an expert ...
- Sean L. Palmer (9/12) Apr 01 2003 intrigued
- Walter (5/13) May 21 2003 flat
- Sean L. Palmer (6/20) May 22 2003 That lets you index sequentially pretty fast, but not randomly.
- Mark Evans (54/63) Apr 01 2003 No. There is one table per string.
- Sean L. Palmer (10/73) Apr 01 2003 The only problem with this idea is that passing this dual structure to a
- Mark Evans (11/13) Apr 01 2003 Serialization at choke points has a cost of (a) zero, because the string...
- Walter (10/15) May 21 2003 passed
- Sean L. Palmer (42/91) Apr 01 2003 That's so crazy it just might work! ;)
- Walter (7/9) May 21 2003 efficient
- Mark Evans (25/30) May 23 2003 I would need a specific implementation code example to understand your t...
Walter says (in response to my post)...

> > D needs a Unicode string primitive.
>
> It does already. In D, a char[] is really a utf-8 array.

I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the contents are 7-bit ASCII (a subset of UTF-8). That doesn't mean they support UTF-8. UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.html

UTF-8 has a maximum encoding length of 6 bytes for one character. If such a character appears at index 100 in char[] myString, what is the return value from myString[100]? The answer should be "one UTF-8 char with an internal 6-byte representation." I don't think D does that.

Besides which, my idea was a native string primitive, not a quasi-array. The confusion of strings with arrays was a basic, fundamental mistake of C. While some string semantics do resemble those of arrays, this resemblance should not mandate identical data types. Strings are important enough to merit their own intrinsic type. Icon is not the only language to recognize that fact.

D documents make no mention of any string primitive: http://www.digitalmars.com/d/type.html

D has two intrinsic character types, a dynamic array type, and _no_ intrinsic string type. Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide." The differing cross-platform widths of the 'wide' char is asking for trouble; poof goes data portability. D characters are not based on Unicode, but archaic MS Windows API and legacy C terminology spot-welded onto Linux. How about Unicode as a basis?

The ideal type system would offer as intrinsic/primitive/native language types:

- UTF-8 char
- UTF-16 char
- UTF-32 char
- UTF-8 string
- UTF-16 string
- UTF-32 string
- built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
- built-in conversions to/from UTF strings and C-style byte arrays

The preceding list will not seem very long when you consider how many numeric types D supports. Strings are as important as numbers.

The old C 'char' type is merely a byte; D already has 'ubyte.' The distinction between ubyte and char in D escapes me. Maybe the reasoning is that a char might be 'wide' so D needs a separate type? But that reason disappears once you have nice UTF characters. So even if the list is a bit long, it also eliminates two redundant types, char and wchar. I would not be against retention of char and char[] for C compatibility purposes if someone could point out why 'ubyte' and 'char[]' do not suffice. Otherwise I would just alias 'char' into 'ubyte' and be done with it. The wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a struct.

To the user, strings would act like dynamic arrays. Internally they are different animals. Each 'element' of the 'array' can have varying length per Unicode specifications. String primitives would hide Unicode complexity under the hood.

That's just the beginning. Now that you have string intrinsics, you can give them special behaviors pertaining to i/o streams and such. You can define 'streaming' conversions from other intrinsic types to strings for i/o purposes. And...permit me to dream!...you can define Icon-style string scanning expressions.

Mark
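To make the myString[100] question concrete, here is a minimal sketch, in present-day D syntax, of the variable-width lookup a Unicode-aware subscript would have to perform. The helper name decodeAt is purely illustrative (nothing in D provides it), and real code would also validate the trail bytes:

    // Decode the UTF-8 sequence starting at byte index i of s,
    // returning the code point and the byte count it occupies.
    uint decodeAt(const(char)[] s, size_t i, out size_t len)
    {
        uint b = s[i];
        if (b < 0x80) { len = 1; return b; }       // plain ASCII byte
        // Count the leading 1-bits of the lead byte to get the
        // sequence length (2..4 today; up to 6 in the original spec).
        len = 2;
        uint mask = 0x20;
        while (b & mask) { ++len; mask >>= 1; }
        uint cp = b & (mask - 1);                  // payload bits of the lead byte
        foreach (j; 1 .. len)
            cp = (cp << 6) | (s[i + j] & 0x3F);    // 6 payload bits per trail byte
        return cp;
    }
    // decodeAt("käse", 1, n) yields 0xE4 ('ä') with n == 2.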
Mar 31 2003
> if someone could point out why 'ubyte' and 'char[]' do not suffice.

Typo: that was "why 'ubyte' and 'ubyte[]' do not suffice."

- Mark
Mar 31 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6abjh$12m8$1 digitaldaemon.com...Walter says (in response to my post)...contentsI'm dubious about this claim. ANSI C char arrays are UTF-8 too, if theD needs a Unicode string primitive.It does already. In D, a char[] is really a utf-8 array.are 7-bit ACSII (a subset of UTF-8). That doesn't mean they supportUTF-8.UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.htmlIt is incompletely implemented, sure.UTF-8 has a maximum encoding length of 6 bytes for one character. If suchacharacter appears at index 100 in char[] myString, what is the returnvalue frommyString[100]? The answer should be "one UTF-8 char with an internal6-byterepresentation." I don't think D does that.No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.Besides which, my idea was a native string primitive, not a quasi-array.Theconfusion of strings with arrays was a basic, fundamental mistake of C.Whilesome string semantics do resemble those of arrays, this resemblance shouldnotmandate identical data types. Strings are important enough to merit theirownintrinsic type. Icon is not the only language to recognize that fact. D documents make no mention of any string primitive: http://www.digitalmars.com/d/type.html D has two intrinsic character types, a dynamic array type, and _no_intrinsicstring type.D does have an intrinsic string literal.Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide." The differing cross-platform widths of the 'wide' char is askingfortrouble; poof goes data portability. D characters are not based onUnicode, butarchaic MS Windows API and legacy C terminology spot-welded onto Linux.Howabout Unicode as a basis?Actually, this has changed. Wide chars are now fixed at 16 bits, i.e. UTF-16. For UTF-32, just use uint's.The ideal type system would offer as intrinsic/primitive/native languagetypes:- UTF-8 char - UTF-16 char - UTF-32 char - UTF-8 string - UTF-16 string - UTF-32 string - built-in conversions between all of the above (e.g. UTF-8 to UTF-16) - built-in conversions to/from UTF strings and C-style byte arrays The preceding list will not seem very long when you consider how manynumerictypes D supports. Strings are as important as numbers.That's actually pretty close to what D supports.The old C 'char' type is merely a byte; D already has 'ubyte.' Thedistinctionbetween ubyte and char in D escapes me. Maybe the reasoning is that acharmight be 'wide' so D needs a separate type? But that reason disappearsonce youhave nice UTF characters. So even if the list is a bit long it alsoeliminatestwo redundant types, char and wchar.The distinction is char is UTF-8, and byte is well, just a byte. The distinction comes in handy when dealing with overloaded functions.I would not be against retention of char and char[] for C compatibilitypurposesif someone could point out why 'ubyte' and 'char[]' do not suffice.Function overloading.Otherwise I would just alias 'char' into 'ubyte' and be done with it. The wchar couldbestored inside a UTF-16 or UTF-32 char, or be declared as a struct. To the user, strings would act like dynamic arrays. Internally they are different animals. Each 'element' of the 'array' can have varying lengthperUnicode specifications. String primitives would hide Unicode complexityunderthe hood. That's just the beginning. Now that you have string intrinsics, you cangivethem special behaviors pertaining to i/o streams and such. 
You can define 'streaming' conversions from other intrinsic types to strings for i/opurposes.And...permit me to dream!...you can define Icon-style string scanning expressions. Mark
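Walter's overloading point can be made concrete with a small sketch in present-day D syntax (the name emit is illustrative, not a library function):

    import std.stdio;

    // Distinct types let overloads treat text and raw bytes differently.
    void emit(char[] s)  { writeln("text: ", s); }      // UTF-8 text
    void emit(ubyte[] b) { writeln(b.length, " raw bytes"); }

    void main()
    {
        char[] s = "héllo".dup;    // 5 characters, 6 bytes
        emit(s);                   // picks the text overload
        emit(cast(ubyte[]) s);     // picks the byte overload: "6 raw bytes"
    }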
Mar 31 2003
> > The answer should be "one UTF-8 char with an internal 6-byte representation."
>
> No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.

But the only use for raw bytes is precisely such low-level format conversions as are proposed to go under the hood. String usage involves character analysis, not bit shuffling. There is a place for getting raw bytes, but a string subscript is not it. Maybe a typecast to ubyte[], and then an array subscript. The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.

> D does have an intrinsic string literal.

But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.

> Wide chars are now fixed at 16 bits, i.e. UTF-16.

Ditto. Wide chars are not UTF-16 chars since they are fixed width. UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be anywhere from 1 byte to 6 bytes wide.)

> For UTF-32, just use uint's.

Possible, but see my final point.

> That's actually pretty close to what D supports.

I don't see anything close. (a) There is no Unicode string primitive (char[] is not a string primitive, let alone Unicode; it's an array type). (b) There are no Unicode characters. There are merely types with similar 'average' sizes being touted as Unicode capable (they are not).

> > if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
>
> Function overloading.

This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.

Thanks for taking all our thoughts into consideration.

Mark
Mar 31 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6an4p$1bpj$1 digitaldaemon.com...The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.That's only partially true - the downside comes from needing high performance you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding. In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.No, in D, the intrinsic string literal is not just char or wchar. It's a unicode string - its internal format is not fixed until semantic processing, when it is adjusted to be UTF-8, -16, or -32 as needed.D does have an intrinsic string literal.But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.What I meant is they do not change size from implementation to implementation. They are 16 bits, and line up with the UTF-16 API's of Win32.Wide chars are now fixed at 16 bits, i.e. UTF-16.Ditto. Wide chars are not UTF-16 chars since they are fixed width.UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be anywhere from 1 byte to 6 bytes wide.)Yes.I think that's a matter of perspective.For UTF-32, just use uint's.Possible, but see my final point.That's actually pretty close to what D supports.I don't see anything close. (a) There is no Unicode string primitive (char[] is not a string primitive, let alone Unicode; it's an array type).(b) There are no Unicode characters. There are merely types with similar 'average' sizes being touted as Unicode capable (they are not).I believe they are unicode capable. Now, I have not written the I/O routines so they will print as unicode, and there are other gaps in the implementation, but the core concept is there.I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.Function overloading.Thanks for taking all our thoughts into consideration.You're welcome.
Mar 31 2003
> > The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.
>
> That's only partially true - the downside comes from needing high performance: you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding.

If I understand correctly, the translation is that it's better to let end users process bytes, so they can waste hours <g> tuning inner loops, than to offer language support, with pre-tuned inner loops. I don't see that. In fact native language support is better from a performance perspective (both in time of execution and in time of development).

> In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.

Manipulation is done with indices in C, because that is all C offers. It's one of the big problems with C vis-a-vis Unicode.

> No, in D, the intrinsic string literal is not just char or wchar. It's a unicode string - its internal format is not fixed until semantic processing, when it is adjusted to be UTF-8, -16, or -32 as needed.

I think your definition of "Unicode" is basically wrong. What you are calling UTF-8 and UTF-16 is really just fixed-width slots that the user must conglomerate, not true native Unicode characters. So we are talking past each other. For example when you say "internal format" I don't suppose you have in mind that 6-byte-wide UTF-8 character I mentioned.

When I say Unicode character, I mean an object that the language recognizes, intrinsically, as a variable-byte-width object, but which it presents to the user as an integrated (opaque) whole. I do not mean a user-defined conglomeration of fixed-width fields. That seems to be your working definition and it does not satisfy me.

> What I meant is they do not change size from implementation to implementation.

That's what I understood you to mean; and that much is good, as far as it goes, but doesn't address Unicode.

> They are 16 bits, and line up with the UTF-16 API's of Win32.

If Windows supports full UTF-16, then D does not support UTF-16 API's of Win32 with any native data type. The user still faces the same labor (more or less) as supporting Unicode in ANSI C.

> I think that's a matter of perspective. ... I believe they are unicode capable. Now, I have not written the I/O routines so they will print as unicode, and there are other gaps in the implementation, but the core concept is there.

I've tried to explain why there is no Unicode character in D, and on that basis alone, I could say there is no Unicode string in D. The syntax and semantics of char[] are identical across all types of arrays, not limited to strings. (What syntax or semantics are unique to strings?)

End users can create and manipulate almost any data structure -- any collection of bits -- in D, or for that matter C, or assembly language, or even machine language. What I'm talking about is intrinsic language support to save the labor (and mistakes). I could build Unicode strings with a Turing machine if I wanted to. That's not "language support" in my book. Saying that we already have 8-bit things, and 16-bit things, and 32-bit things, and that users can do Unicode by combining these things in various ways, is not a reasonable argument that the language supports Unicode. At best one might say, D does not prevent users from implementing Unicode, if they want to take the extra trouble.

> > > if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
> >
> > Function overloading.
>
> I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.

Then you are ignoring your own argument about function overloading! :-)

Mark
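The opaque whole-character string Mark is asking for can at least be sketched in present-day D, using the library's stride/decode helpers; UString is hypothetical, and the linear scan is exactly the cost a real intrinsic would be free to optimize away:

    import std.utf : decode, stride;

    // UTF-8 bytes inside, but indexed by character, never by byte.
    struct UString
    {
        private string data;               // variable-width storage, hidden

        dchar opIndex(size_t n) const
        {
            size_t i = 0;
            foreach (k; 0 .. n)
                i += stride(data, i);      // step over one whole character
            return decode(data, i);        // assemble it, 1..4 bytes
        }
    }
    // UString("naïve")[2] is 'ï' as one value, not half of its encoding.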
Mar 31 2003
"Walter" <walter digitalmars.com> wrote in message news:b6aoge$1cnp$1 digitaldaemon.com...But there is still concern for there to be a separate type, for function overloading. Otherwise, how shall we print a Unicode character higher than position 0xFFFF? Perhaps the basic char type would actually be 32 bits and capable of holding any Unicode character? And when used in array form, char[] would transmogrify into UTF-8? Would we then even need wchar? Obviously this Unicode thing is a whole can of worms. Too bad we can't get everyone to forget about enough characters that they all fit in 16 bits! ;) SeanI think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.Function overloading.
Apr 01 2003
Walter wrote:
> That's only partially true - the downside comes from needing high performance: you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding. In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.

Wait... won't language-supported iterators fix the need for accessing the underlying array indices directly? I *definitely* don't want to know anything about the underlying format, which can be really anything - UTF-8/16/32, or even an aggregate of 2 arrays like I or Mark have proposed.

Walter, you also don't: look what I found in this newsgroup. :) And you claim it to be better to work with pointers into a char[], pretending it was a UTF-8 string!!!

--- 8< ---
At one time I had written a lexer that handled utf-8 source. It turned out to cause a lot of problems because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble and I finally converted it to wchar's.
--- >8 ---

BTW, as to the possibilities that Mark wishes for himself, I've dug his message up, which was posted while I wasn't around yet. Here.

--- 8< ---
Short summaries here:

http://www.nmt.edu/tcc/help/lang/icon/positions.html
http://www.nmt.edu/tcc/help/lang/icon/substring.html
http://www.cs.arizona.edu/icon/docs/ipd266.htm
http://www.toolsofcomputing.com/IconHandbook/IconHandbook.pdf Sections 6.2 and following.

Icon is simply unsurpassed in string processing and is for that reason famous among linguists. There is more to the string processing than just character position indices. Icon supports special clauses called "string scanning environments" which work like file i/o in a vague analogy. (See third link above, section 3.) Icon also has nice built-in structures like sets (*character sets* turn out to be insanely useful), hash tables, and lists.

Somehow Icon never made it to the Big Leagues and that is a shame. It deserves to be up there with Perl. Icon is wicked fast when written correctly. The Unicon project is the next-generation Icon, and has added objects and other modern features to base Icon. It is on SourceForge.

(There was only one project in which I recall desiring a new Icon built-in. I wanted a two-way hash table which could index off of either data column. The workaround was to implement two mutually mirroring one-way hash tables.)

Icon has a very interesting 'success/failure' paradigm which might also be something to study, esp. in light of D's contract emphasis. The unique 'goal-directed' paradigm is quite interesting but may have no application to D.

I have for a very long time desired Icon's string scanning capabilities in my C/C++ programs. Even with std::string or string classes from various class libraries (I've used them all), there is just no comparison with Icon. I would become a total D convert if it could do strings like Icon.

Mark

http://www.cs.arizona.edu/icon/
http://unicon.sourceforge.net/index.html
--- >8 ---

-i.
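For what it's worth, Ilya's iterator point can be shown in the form later D compilers actually accept: the loop body sees whole characters while the variable-width walking stays hidden underneath.

    import std.stdio;

    void main()
    {
        char[] s = "Grüße".dup;
        // The loop sees assembled code points; the compiler steps
        // through the underlying UTF-8 bytes. No indices in sight.
        foreach (dchar c; s)
            writefln("U+%04X", cast(uint) c);
    }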
Apr 10 2003
Ilya Minkov wrote:
> I have for a very long time desired Icon's string scanning capabilities in my C/C++ programs. Even with std::string or string classes from various class libraries (I've used them all), there is just no comparison with Icon. I would become a total D convert if it could do strings like Icon.

Being used to Perl, I think that the current D regex module has to be extended. In what way does Icon differ (or have advantages) in string processing compared to Perl?

-- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 11 2003
> This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.

Agree. Let's have more char types
Mar 31 2003
> This is a rare occasion when I agree with Mark. The fact that a minimalist like me, and a maximalist like Mark, and a pragmatist like yourself seem to agree is something Walter should consider. I would want to hold built-in string support to just UTF-8. D could offer some support for the other formats through conversion routines in a standard library. Having a single string format would surely be simpler than supporting them all.
>
> Bill

Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...

Under my scheme we gain 3 character types and drop 2: net gain 1. We gain 3 string types and drop 1: net gain 2. Total net gain, 3 types. What does that buy us? Complete internationalization of D, complete freedom from ugly C string idioms, data portability across platforms, ease of interfacing with Win32 APIs and other software languages.

The idea of "just one" Unicode type holds little water. Why don't you make the same argument about numeric types, of which we have some twenty-odd? Or how about if D offered just one data type, the bit, and let you construct everything else from that? If D does Unicode then D should do it right. It's a poor, asymmetric design to have some Unicode built-in and the rest tacked on as library routines.

Mark
Mar 31 2003
I'm sold. Where can I sign up? I presume you'll be working on the libraries ... ;)

To suck up: I've been faffing around with this issue for years, and have been (unjustifiably, in my opinion) called on numerous times to expertly opine on it for clients. (My expertise is limited to the C/C++ char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full picture.) Your discussion here is the first time I even get a hint that I'm listening to someone that knows what they're talking about.

It's nasty, nasty stuff, and I hope that your promise can bear fruit for D. If it can, then it'll earn massive brownie points for D over its peer languages. There's a big market out there of peoples whose character sets don't fall into 7-bits ...

"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6al79$1ahd$1 digitaldaemon.com...
> > [...]
>
> Hi again Bill
>
> After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...
>
> Under my scheme we gain 3 character types and drop 2: net gain 1. We gain 3 string types and drop 1: net gain 2. Total net gain, 3 types. What does that buy us? Complete internationalization of D, complete freedom from ugly C string idioms, data portability across platforms, ease of interfacing with Win32 APIs and other software languages.
>
> The idea of "just one" Unicode type holds little water. Why don't you make the same argument about numeric types, of which we have some twenty-odd? Or how about if D offered just one data type, the bit, and let you construct everything else from that? If D does Unicode then D should do it right. It's a poor, asymmetric design to have some Unicode built-in and the rest tacked on as library routines.
>
> Mark
Mar 31 2003
Well, I went through character and code page problems too about a year ago. Very bad experience in C/C++ ... (I'm from a place where 7 bits is not enough). I have two points about this:

1) D should support characters and not bytes (8 bits) or words (16 bits); when I'm indexing a string I do so by characters and not by a byte multiply; if I wanted to index by e.g. bytes I would ask for the string's byte length and cast to a byte array

2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but not critical (can be solved by conversion functions); actually for one character only, UTF32 has the shortest representation; it may also be interesting not to be able to specify the exact encoding for a string (as opposed to an encoding for a character) - let the compiler decide what is the best representation (maybe some optimization can be achieved based on this later; e.g. the compiler can decide to store strings in partially balanced trees like STLPort does for ropes, but with possibly different encodings for different nodes ... whatever, just writing down my thoughts)

"Matthew Wilson" <dmd synesis.com.au> wrote in message news:b6aq84$1dn4$1 digitaldaemon.com...
> [...]
Mar 31 2003
Peter Hercek wrote:
> Well, I went through character and code page problems too about a year ago. Very bad experience in C/C++ ... (I'm from a place where 7 bits is not enough). I have two points about this:

Me too :)

> 1) D should support characters and not bytes (8 bits) or words (16 bits); when I'm indexing a string I do so by characters and not by a byte multiply; if I wanted to index by e.g. bytes I would ask for the string's byte length and cast to a byte array

Right.

> 2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but not critical (can be solved by conversion functions); actually for one character only, UTF32 has the shortest representation; [...]

UTF-32 doesn't have the shortest representation, since "in all 3 encodings [i.e. UTF-8/16/32] the maximum possible character representation length is 4 bytes", as the official description says. Though I agree that it's the most practical one, in part because working with an array of longs is nowadays faster than an array of shorts. This is an implementation detail and should not matter though, because whatever the string implementation is, it should hide the underlying complexity.

What matters though is that in UNICODE there are 2 kinds of characters - normal and modifiers. So an "ä" can be represented as well as "a" and a special accent symbol. I'm pretty much sure you want to access these as a whole, not separately.

-i.
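Ilya's point in miniature, sketched in present-day D: the two spellings of "ä" are different code-point sequences and different byte sequences, so only normalization- or grapheme-aware handling treats them as one character.

    import std.stdio;

    void main()
    {
        string composed   = "\u00E4";      // ä as one precomposed code point
        string decomposed = "a\u0308";     // a + COMBINING DIAERESIS

        writeln(composed == decomposed);   // false: different code points
        writeln(composed.length);          // 2 UTF-8 bytes
        writeln(decomposed.length);        // 3 UTF-8 bytes
    }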
Apr 10 2003
In article <b6al79$1ahd$1 digitaldaemon.com>, Mark Evans says...
> Hi again Bill
>
> After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...

A maximalist wants many built-in features, from functional programming support, to multimethods, to support of every character format known to man. Not in libraries, where we could all contribute, but built-in, where Walter has to write it.

As a minimalist, I'd settle for features that allow me to add the features I need to the language in libraries. The meta-programming stuff I'd mentioned leads in that direction.

Bill
Mar 31 2003
And a pragmatist wants as much as is possible in libraries, but what he/she feels must be in the compiler because of the likelihood of stuff-ups if left to the full spectrum of the developer community (such as meaningful ==, string types and my auto-stringise thingo with char null *)

"Bill Cox" <Bill_member pathlink.com> wrote in message news:b6b05r$1hsv$1 digitaldaemon.com...
> In article <b6al79$1ahd$1 digitaldaemon.com>, Mark Evans says...
> > [...]
>
> A maximalist wants many built-in features, from functional programming support, to multimethods, to support of every character format known to man. Not in libraries, where we could all contribute, but built-in, where Walter has to write it.
>
> As a minimalist, I'd settle for features that allow me to add the features I need to the language in libraries. The meta-programming stuff I'd mentioned leads in that direction.
>
> Bill
Mar 31 2003
Bill, the point is that trying to paint me this or that color, instead of focusing on something specific, is ad hominem. I find it patronizing. Especially since on this point you've already agreed with me explicitly. We can quibble on specifics. I want 3 char types, you want 2 (UTF8 + char) or maybe even 3 (UTF8 + char + wchar). I have much to say about those bizarre meta programming concepts. I have worked in EDA and know that domain - you can't blow smoke in my face, even if others are impressed. All I would say here is that by your own admission, you're trying to write code for 'average' or 'dumb' programmers, so please focus on doing just that. Mark
Mar 31 2003
Hi, Mark.

Mark Evans wrote:
> Bill, the point is that trying to paint me this or that color, instead of focusing on something specific, is ad hominem. I find it patronizing. Especially since on this point you've already agreed with me explicitly. We can quibble on specifics. I want 3 char types, you want 2 (UTF8 + char) or maybe even 3 (UTF8 + char + wchar).
>
> I have much to say about those bizarre meta programming concepts. I have worked in EDA and know that domain - you can't blow smoke in my face, even if others are impressed. All I would say here is that by your own admission, you're trying to write code for 'average' or 'dumb' programmers, so please focus on doing just that.

Ok, I'll bite... Why do you feel I'm blowing smoke in your face?

As for the meta-programming stuff, we use DataDraw today to do lots of it, and I find it very productive, particularly for our EDA work. In particular, we added dynamic class extensions, recursive destructors, array bounds checking, and pointer indirection checking to C. The code generators also give us much of the power of template frameworks. We also use a memory mapping model that works great on 64-bit machines, where EDA is headed fast (we use the Sheesh Kabob code generator). All of these have very specific benefits for EDA, which I've covered in previous posts.

Before calling it bizarre, why not look into it? A fairly recent version of DataDraw is available at: http://www.viasic.com/download/datadraw.tar.gz

Most GUI programmers use Class Wizard, which is much the same kind of thing. Should that capability be in the language? Possibly. The concept has been researched by other groups, and one way to do it is to add "compile-time reflection classes" to the language. OpenC++ is one example of this approach. XL does it, too.

Also, we don't hire average or dumb programmers. We hire brilliant programmers, and train them to code as if the target audience were stupid people. This really helps them work together, and helps the code last over time. It helps our business output a consistent product - the code looks much the same no matter who wrote it. There are good business reasons for this.

Putting a restrictive coding methodology in place doesn't restrict how an algorithm works, just how the implementation looks. So far, there have been exactly 0 algorithms that had to be changed in order to fit into our methodology. We encourage our programmers to be as creative as possible in algorithm development, and to come up with brilliant solutions. We enable them to implement those algorithms quickly and efficiently with a consistent, solid, and proven coding methodology. They spend less time thinking about how to write code, and more time writing it. It's one of our competitive tools for success.

Bill
Apr 01 2003
Please don't turn this into yet another thread about DataDraw or dubious management 'expertise.' (Put up a wiki board somewhere, OK? I could show you five different ways from Sunday to replace DataDraw with better code using standard languages/libraries/mixins/design patterns/tools of which you seem ignorant. Sorry you'll have to pay me though.) Thank you for supporting the idea that D needs some kind of native Unicode support. Mark
Apr 01 2003
Bill Cox wrote:
> Before calling it bizarre, why not look into it? A fairly recent version of DataDraw is available at: http://www.viasic.com/download/datadraw.tar.gz

When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird. I also didn't get the impression that you were connected to the project. Now I see in the About-Box that you are the lead developer...

There is no LICENSE. The documentation is so incomplete that I wouldn't even start trying to use it (although its date says 1993).

There are surely better ways to advertise your project. Why don't you set up an official OS project at sourceforge and complete the documentation?

-- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 01 2003
Hi, Helmut.

Helmut Leitner wrote:
> Bill Cox wrote:
> > Before calling it bizarre, why not look into it? A fairly recent version of DataDraw is available at: http://www.viasic.com/download/datadraw.tar.gz
>
> When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird. I also didn't get the impression that you were connected to the project. Now I see in the About-Box that you are the lead developer... There is no LICENSE. The documentation is so incomplete that I wouldn't even start trying to use it (although its date says 1993). There are surely better ways to advertise your project. Why don't you set up an official OS project at sourceforge and complete the documentation?

I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support. It's open-source, as the copyright file describes. It's a very weak copyright, meant to be weaker than the GNU GPL. The documentation sucks, and I think it will probably stay that way.

I did write the first version, and place it into the open-source domain. The guys who wrote the second one kept me listed in the about box, but I didn't write the code. So far as I know, DataDraw is only in use at ViASIC (my company), QuickLogic, and Synplicity. None of these companies has any reason to promote it.

It's specific insights I've gained in working with DataDraw that I've been trying to describe in this group, rather than trying to promote DataDraw. I only posted it because someone asked me to, and the license requires that I do.

Through using DataDraw for many years, however, I think I've had some fairly unique insights into language design. Adding features to a target language is what DataDraw is for, and I've been able to try out several features not found in C++ in a real industrial coding environment. Some of those features I've described in other posts. As I said, I was hoping D could be extended to make DataDraw obsolete. That turns out not to be the case. I'll describe some of my current thinking about this matter below.

DataDraw currently just models data structures, and allows me to write code generators. This is much like the old OM tool for UML (which DataDraw precedes). It gives me the power of compile-time reflection classes, like those in OpenC++. However, for each new language, or coding style, I have to write a new code generator, and these things get really complex. DataDraw currently has 5. That kind of sucks. Instead, DataDraw should allow me to write one awesome code generator that targets an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. The bulk of the work could then be shared. With a built-in language translator, DataDraw would be much simpler than it is now.

However, with a built-in language translator, DataDraw becomes a language in itself. What's unique about it? Simple. It's extendable by me and others I work with who are familiar with the DataDraw code base. I can generate code of any type, and add literally any feature I wish. However, I do that by directly editing the code generators, which are written in C and which link into DataDraw's database. That's not elegant, or usable by anyone not familiar with the DataDraw code base, although it does cover my needs.

So, I've been looking into what it takes to get the same power, but in a language that anyone could work with. In particular, I've been examining what it would take for D to cover DataDraw's functionality. That, it turns out, is hard (which is one reason the XL compiler isn't done). The more power you give the user, the more you open up the internals of the compiler, and the more complex you make the language.

For example, to do that in D, a natural way would be to make Walter's representation of D as data structures part of the language definition (thus greatly restricting how D compilers are built). Then, you could offer access to reflection classes at compile time (as OpenC++ does). A natural way to use these classes at compile time is to interpret D code. Now, you have to write a D interpreter as well as a compiler. This is the approach taken by VHDL for their generators, and it really complicated implementations of compilers. An alternative is to re-compile the compiler instead. This is a bit brain-bending, but I think getting rid of the interpreter is worth it. Besides, I already recompile DataDraw every time I fix or add a feature, and that's never been much of a problem.

Even if we added compile-time reflection classes, I still don't get all the power of DataDraw, which I can extend in any way, because I directly edit the source. What's still missing? For one thing, reflection classes can't be used to add syntax to the language. That's a serious limitation. XL's approach allows some syntax extension. Scheme also has a nice mechanism. However, both systems are limited, and complex, and slow.

I'm toying with another approach that is easy if you already allow users to compile custom versions of the compiler (which you do to get rid of the interpreter). Just provide a simple mechanism for generating a syntax description for use by bison. That nails the problem. Any new syntax can then be added by a user, so long as it's compatible with what's already there. A drawback is that bison now becomes part of the language, along with all its quirks and strong points. At least bison is pretty much available everywhere.

Just adding new syntax to the language doesn't get you all the way there. You still are stuck with those reflection classes used to model the language. If you have a new construct to implement, you can add the syntax, but what objects do you build to represent it? The reflection classes themselves need to be extendable. Really. At that point, nothing in the language is left as non-configurable. You're stuck with LALR(1) parsers, but that's no big deal.

However, adding reflection classes is tricky. Being C-derived, the language still needs to link with the C linker, including the compiler itself, especially if users are going to compile custom compilers for their applications. That means that new types can't be added to the compiler's database, since C libraries are limited that way. I'm currently toying with the age-old style of non-typed syntax trees rather than fully typed reflection classes. It looks like it will work out, but in the end, all this has done is provide a compiler that's easy to extend. It's easy to extend because its parser and internal data structures are simple, and extendable. Plug-ins should be easy to write. However, it's not really a standard language any more. It's just a customizable compiler that's fairly easy to work with.

I'm left with the conclusion that D can't be enhanced to be extendable the way XL wants to be, or the way I'd like D to be. I don't see how D can get there from here.

Bill
Apr 02 2003
Bill Cox wrote:
> I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support.

Ok, I think it's good to have this said.

> It's open-source, as the copyright file describes. The documentation sucks, and I think it will probably stay that way.

That means it's dead outside of the heads of its few experts and will remain so.

> It's specific insights I've gained in working with DataDraw that I've been trying to describe in this group, rather than trying to promote DataDraw. [...]

I'm very interested in your experiences and insights. I've been doing software projects since 1979 and feel very strongly about the way systems present themselves towards the programmer (APIs).

> Through using DataDraw for many years, however, I think I've had some fairly unique insights into language design. Adding features to a target language is what DataDraw is for, and I've been able to try out several features not found in C++ in a real industrial coding environment. Some of those features I've described in other posts.

I'll try to reread some of your postings and arguments. Can you give me some hints to find my way?

> As I said, I was hoping D could be extended to make DataDraw obsolete. [...] Instead, DataDraw should allow me to write one awesome code generator that targets an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. The bulk of the work could then be shared.

That's a natural idea that doesn't seem to work. I think that Charles Simonyi has put 10 years into Intentional Programming to follow similar ideas, and they burned millions of $.

> With a built-in language translator, DataDraw would be much simpler than it is now. However, with a built-in language translator, DataDraw becomes a language in itself. What's unique about it? Simple. It's extendable by me and others I work with who are familiar with the DataDraw code base. I can generate code of any type, and add literally any feature I wish. However, I do that by directly editing the code generators, which are written in C and which link into DataDraw's database. That's not elegant, or usable by anyone not familiar with the DataDraw code base, although it does cover my needs.

This is a certain way to solve problems but it may or may not be optimal. The fact that you have this tool at hand gives power but may mislead.

> So, I've been looking into what it takes to get the same power, but in a language that anyone could work with. In particular, I've been examining what it would take for D to cover DataDraw's functionality.

Analytically this is not a goal. The goal is to enable programmers to write great applications. What are their problems and how can they be solved?

> That, it turns out, is hard (which is one reason the XL compiler isn't done). The more power you give the user, the more you open up the internals of the compiler, and the more complex you make the language.

I agree. I think this is the problem of C++ itself. Too much complexity for too little gain.

> For example, to do that in D, a natural way would be to make Walter's representation of D as data structures part of the language definition (thus greatly restricting how D compilers are built). [...] At least bison is pretty much available everywhere.

I still don't know what problems you are trying to solve. A language that is able to extend its own syntax? Surely a fascinating idea, but 99.9 percent of programmers would not be able to make good use of it.

> Just adding new syntax to the language doesn't get you all the way there. [...] I'm left with the conclusion that D can't be enhanced to be extendable the way XL wants to be, or the way I'd like D to be.

As I see it D was never designed to have an extensible syntax.

> I don't see how D can get there from here.

For this reason it is unreasonable to think it could go there. Currently I don't understand why it should go there, other than it would allow you to carry your DataDraw methods of problem solving on to D. But, as I said, I'll try to read some of your threads.

-- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 02 2003
I agree with all your comments. At this point, I'm not advocating major changes to D, so this reply is more just to answer your questions that to give Walter any ideas. You'd asked about specific features I'd been advocating, so I'll re-summarize them below. 1) Compile-time reflection classes. I threw this out there as a possibility to be investigated. Now that I've done that, I'm dropping that request, for reasons described in the you replied to below. 2) I'd still like to see more powerful iterators that the ones discussed lately. You can look up my recomendations under "Cool iterators", or something like that. 3) Dynamic class extensions are also a great thing, and it's sad C++, databases have to emulate the extensions with cross-coupled void pointers. 4) A class framework inheritance mechnaism, such as Sather's "include" construct, virtual classes, or Dan's "Template Frameworks". All of these cover a gaping hole in C++, but I'm concerned about the complexity of the virtual class aproach Walter was considering. Embedded replies to a couple questions you posed are below. Helmut Leitner wrote:Bill Cox wrote:I believe it. The hard part isn't making a nice intermediate language I can work with. The hard part is making an extendable version that one anyone can work with.Ok, I think it's good to have this said.There are surely better ways to advertise you project.... I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that user's didn't start adopting DataDraw, as I don't have the time to do free support.It's open-source, as the copyright file describes. The documentation sucks, and I think it will probably stay that way.That means its dead outside of the heads of its few experts and will remain so.... It's specific insights I've gained in working with DataDraw that I've been trying to describe in this group, rather than trying to promote DataDraw. ...I'm very interested in your experiences and insights. I'm doing software projects since 1979 and feel very strong about the way systems present themselves towards the programmer (APIs).Through using DataDraw for many years, however, I think I've had some fairly unique insights into language design. Adding features to a target langauge is what DataDraw is for, and I've been able to try out several features not found in C++ in a real industrial coding environment. Some of those features I've described in other posts.I'll try to reread some of your postings and arguments. Can you give me some hints to find my way?That's a natural idea, that doesn't seem to work. I think that Charles Simonyi has put 10 years into Intentional Programming to follow similiar ideas and they burned millions of $.As I said, I was hoping D could be extended to make DataDraw obsolete.That turns out not to be the case. I'll describe some of my current thinking about this matter below. DataDraw currently just models data structures, and allows me to write code generators. This is much like the old OM tool for UML (which DataDraw preceeds). It gives me the power of compile-time reflection classes, like those in OpenC++. However, for each new language, or coding style, I have to write a new code generator, and these things get really complex. DataDraw currenly has 5. That kind of sucks. Instead, DataDraw should allow me to write one awesome code generator that targets in an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. 
The bulk of the work could then be shared.
You're right about that. You have to be extremely careful about adding features to a language using a custom pre-processor. In particular, every extension has to be carefully thought out, and agreed to by the whole group. If anyone could add a feature any time they wished, it'd result in mayhem.
With a built-in language translator, DataDraw would be much simpler than it is now. However, with a built-in language translator, DataDraw becomes a language in itself. What's unique about it? Simple. It's extendable by me and others I work with who are familiar with the DataDraw code base. I can generate code of any type, and add literally any feature I wish. However, I do that by directly editing the code generators, which are written in C and which link into DataDraw's database. That's not elegant, or usable by anyone not familiar with the DataDraw code base, although it does cover my needs.
This is a certain way to solve problems but it may or may not be optimal. The fact that you have this tool at hand gives power but may mislead.
Oh, there are lots of problems. Big stuff and little stuff. How about array bounds checking in debug mode? We added it to C. Need a few fields added to existing classes at run-time? We do that. The space of solutions to real problems programmers are facing out there is a lot bigger than what most languages address. I agree with your point, though. A good D design is a design that covers most people's most common needs, but not all of anybody's needs. IMO, D's basically on track.
So, I've been looking into what it takes to get the same power, but in a language that anyone could work with. In particular, I've been examining what it would take for D to cover DataDraw's functionality.
Analytically this is not a goal. The goal is to enable programmers to write great applications. What are their problems and how can they be solved?
You're right about how many programmers should use it. It's dangerous stuff, and extensions need to be carefully considered by a few and then adopted by many. Scheme has a nice mechanism for this kind of thing. Much of the syntax of Scheme can actually be written in Scheme. However, without an ability to add syntax, some new features can't cleanly be added to a language, and thus the language isn't fully extensible. For example, how could we add Sather-like "include" constructs to allow module level inheritance? There's no way in D or C++ without extending the parser a little. After that, it's a simple thing to implement with compile-time reflection classes. I'm not pushing for any syntax extension mechanism for D. It's pretty worthless without some way to tie it into reflection classes or an equivalent mechanism.
That, it turns out, is hard (which is one reason the XL compiler isn't done). The more power you give the user, the more you open up the internals of the compiler, and the more complex you make the language.
I agree. I think this is the problem of C++ itself. Too much complexity for too little gain.
For example, to do that in D, a natural way would be to make Walter's representation of D as data structures part of the language definition (thus greatly restricting how D compilers are built). Then, you could offer access to reflection classes at compile time (as OpenC++ does). A natural way to use these classes at compile time is to interpret D code. Now, you have to write a D interpreter as well as a compiler. This is the approach taken by VHDL for their generators, and it really complicated implementations of compilers.
An alternative is to re-compile the compiler instead. This is a bit brain-bending, but I think getting rid of the interpreter is worth it. Besides, I already recompile DataDraw every time I fix or add a feature, and that's never been much of a problem. Even if we added compile-time reflection classes, I still don't get all the power of DataDraw, which I can extend in any way, because I directly edit the source. What's still missing? For one thing, reflection classes can't be used to add syntax to the language. That's a serious limitation. XL's approach allows some syntax extension. Scheme also has a nice mechanism. However, both systems are limited, and complex, and slow. I'm toying with another approach that is easy if you already allow users to compile custom versions of the compiler (which you do to get rid of the interpreter). Just provide a simple mechanism for generating a syntax description for use by bison. That nails the problem. Any new syntax can then be added by a user, so long as it's compatible with what's already there. A drawback is that bison now becomes part of the language, along with all its quirks and strong points. At least bison is pretty much available everywhere.
I still don't know what problems you are trying to solve. A language that is able to extend its own syntax? Surely a fascinating idea, but 99.9 percent of programmers would not be able to make good use of it.
I agree. At this point, I've concluded that D should not try to solve the problems I solve with DataDraw. I've started working on a new system that should replace DataDraw when finished. It's already got the syntax extension mechanism I described that generates a bison file. It's got a simple list-based language parse tree that is capable of representing any feature I wish to support. These get used like compile-time reflection classes, allowing users to write code in the intermediate language in order to add features to the target language. The output can be in any language (as with DataDraw), and users can write new generators to target new languages or coding styles. I'm thinking of calling it Hack-C, since allowing me to hack in new features to C or other languages is its primary function, and because the whole system seems like one of the world's largest hacks. It's a translator that compiles application-specific versions of itself in order to add features to other languages. The opportunities for serious hacking in such a system are vast. If you think there might be interest in this system in the open-source community, I could try to finish its development that way. It might be fun enough for me to actually support an open-source effort, and if anyone else were to help, I could benefit from that. I haven't seen much interest in this kind of project out there in the past. Languages are always hot, but CASE tools never are. Do you think this could be successful as an open-source effort? Bill
Just adding new syntax to the language doesn't get you all the way there. You still are stuck with those reflection classes used to model the language. If you have a new construct to implement, you can add the syntax, but what objects do you build to represent it? The reflection classes themselves need to be extendable. Really. At that point, nothing in the language is left as non-configurable. You're stuck with LALR(1) parsers, but that's no big deal. However, adding reflection classes is tricky.
Being C-derived, the language still needs to link with the C linker, including the compiler itself, especially if users are going to compile custom compilers for their applications. That means that new types can't be added to the compiler's database, since C libraries are limited that way. I'm currently toying with the age-old style of non-typed syntax trees rather than fully typed reflection classes. It looks like it will work out, but in the end, all this has done is provide a compiler that's easy to extend. It's easy to extend because its parser and internal data structures are simple and extendable. Plug-ins should be easy to write. However, it's not really a standard language any more. It's just a customizable compiler that's fairly easy to work with. I'm left with the conclusion that D can't be enhanced to be extendable the way XL wants to be, or the way I'd like D to be.
As I see it D was never designed to have an extensible syntax.
I don't see how D can get there from here.
For this reason it is unreasonable to think it could go there. Currently I don't understand why it should go there, other than it would allow you to carry your DataDraw methods of problem solving on to D. But, as I said, I'll try to read some of your threads.
--
Helmut Leitner leitner hls.via.at
Graz, Austria www.hls-software.com
Apr 03 2003
Bill Cox wrote,
Not in libraries, where we could all contribute, but built-in, where Walter has to write it.
The compiler is open-source. Contributions are welcome. (Wasn't it you who said recently, 'I had a few days off and rewrote the D compiler' or words to that effect? Forgive me if memory fails, I think it was you.)

Whatever reasons you accept for UTF-8 as a native type hold equally well for UTF-16 and UTF-32. The only rationale advanced otherwise was a vague impression of unease (coupled with slurs on my design sense). Dividing type families is a war crime. It's more complex having one member in the compiler and the rest stranded in a library.

Think about slicing Unicode strings. Suppose the compiler includes code for slicing UTF-8 strings. Why do we want to duplicate that in a library for UTF-16? We have to write identical logic, in C for the compiler and in D for the library? Yuk! And what about the conversions between Unicode formats? They are easier with the strings all living in the same place. Either these strings belong in the language together, or they belong in a library together. I see no objective reason to divide them up.

Just think about what you're saying in terms of numeric types and the fallacy will jump out at you. C has trained people too well about what strings really are. Suppose for example that we put all floats in the compiler and all doubles in the library. Silly! <g>

Maybe it will mend fences to say in public that UTF-32 could be dropped. I have objective reasons for saying so, not vague unease: UTF-32 is rarely used and truly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely used, and equally fake-able 'ifloat' type.

Mark
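P.S. To see the duplication concretely, here is roughly the slicing-by-character logic that would have to be written twice -- once in C for the compiler and once in D for the library -- shown as a C++-ish sketch for UTF-16 (names are illustrative and error handling is omitted):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Slice [first, last) measured in characters, not code units;
    // a surrogate pair counts as one character.
    std::vector<uint16_t> utf16Slice(const std::vector<uint16_t>& s,
                                     std::size_t first, std::size_t last) {
        std::size_t i = 0, ch = 0;
        while (i < s.size() && ch < first) {        // find unit offset of 'first'
            i += ((s[i] & 0xFC00) == 0xD800) ? 2 : 1;
            ++ch;
        }
        std::size_t begin = i;
        while (i < s.size() && ch < last) {         // advance to 'last'
            i += ((s[i] & 0xFC00) == 0xD800) ? 2 : 1;
            ++ch;
        }
        return std::vector<uint16_t>(s.begin() + begin, s.begin() + i);
    }

The same walk, with different stride rules, is what UTF-8 slicing needs. Keeping the two copies in sync is exactly the maintenance burden I'm objecting to.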
Mar 31 2003
Hi, Mark.

Mark Evans wrote:
Bill Cox wrote, The compiler is open-source. Contributions are welcome. (Wasn't it you who said recently, 'I had a few days off and rewrote the D compiler' or words to that effect? Forgive me if memory fails, I think it was you.)
I wrote a toy compiler to test out some ideas in a few days off, not a D compiler. There's a huge difference between a week's effort, and what D has become. In fact C++ is so complex, the compilers out there still aren't complete. Keeping D simple is key to avoiding this fate.

The fact that D's front-end is open-source is an even greater reason for the language itself to be simple. The author of Linux has a lot to say about keeping open-source code simple. He blasted GNU's Hurd effort for its complexity. I agree with him. The fact that I'm writing this note using a Linux kernel instead of a GNU Hurd kernel supports his assertion.

Last I checked, the D front-end was 35K lines of hand-written code, which is impressively small given the functionality and commenting. However, that's still a lot to learn if you just want to contribute, but it's doable. When it reaches 100K lines, the language is in real trouble. Not many of us will be willing to work with a program that huge, unless we're getting paid.

Bill
Apr 01 2003
Keeping D simple is key to avoiding this fate.
Unicode intrinsics make D a simple language. That is the point of having them. I assume you are still with me that D needs them. The notion is to rid D of ugly 30-year-old C confusions about strings, and to bring their formats up to modern standards in the bargain. We can't help the extra work of Unicode; that is what the world wants.
The fact that D's front-end is open-source is an even greater reason for the language itself to be simple.
No one said otherwise. You keep propping up straw-men to tear down. They are purely your own creations. It's amusing to watch you rip them down, but little else beyond that. We all want the language to be as simple and orthogonal as possible. That's why I worry about D's rigid adherence to C++ as a design baseline.

Look Bill - my design sense is as good as yours, maybe better, and definitely more informed. You need not lecture me about simplicity. To be frank, your work betrays complicated over-engineering and reinvented wheels. From my viewpoint you are the one who needs simplicity lessons.

Furthermore I do not 'advocate' everything that I post. You halfway accused me of 'advocating' multimethods, and I don't recall once doing that. I merely linked to a short article showing how multimethods simplify code. I do advocate functional approaches, for this reason: they allow me to simplify my code. You see, I like simplicity.

There are software engineering concepts that C++ does not offer and it's important for a new language effort to know about them. That way, even if rejected, a decision about the concepts was made on facts, not ignorance.

If you agree with me about Unicode intrinsics, to whatever degree, then bite the bullet and be done with it. You really are going over the top on this.

Mark
Apr 01 2003
Mark

Not wishing to get in the middle of you two stags, but aren't you getting a bit over the top? I don't doubt that all your skills are as incomparable as you assert - though I note you did not add an entry to the "Introductions" thread, why was that? - but do we really need to be told all the time? Frankly it's beginning to taste a little like Boost, not to mention a waste of time in the lives of lots of busy people in reading through them to get to the technical points (which are very interesting, I must say) that you're making.

"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6du7v$jiv$1 digitaldaemon.com...
If you agree with me about Unicode intrinsics, to whatever degree, then bite the bullet and be done with it. You really are going over the top on this.
Apr 01 2003
Hmm... Mark, appreciating all your informedness and very welcome sharp and clear view on this matter (and others), how about improving your diplomatic skills a bit?

Sorry about the noise.

The Luna Kid
Apr 03 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6beep$1qom$1 digitaldaemon.com...Maybe it will mend fences to say in public that UTF-32 could be dropped.I haveobjective reasons for saying so, not vague unease: UTF-32 is rarely usedandtruly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless intrinsic UTF-32 is just as reasonable to support as, say, the equallyrarelyused, and equally fake-able 'ifloat' type.My understanding is that the linux wchar_t type is UTF-32, which puts it in common use. UTF-32 is also handy as an intermediate form when converting between UTF-8 and UTF-16.
May 21 2003
If you've got a UTF-32 string, UTF-16 is really only needed when calling things like Win32 APIs.

Dan

"Walter" <walter digitalmars.com> wrote in message news:bagjlo$308t$1 digitaldaemon.com...
My understanding is that the linux wchar_t type is UTF-32, which puts it in common use. UTF-32 is also handy as an intermediate form when converting between UTF-8 and UTF-16.
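P.S. A sketch of what that boundary could look like (C++-ish and Windows-specific, since it leans on wchar_t being 16 bits there; the helper name is hypothetical):

    #include <windows.h>
    #include <cstddef>
    #include <vector>

    // Keep text as UTF-32 internally; encode to UTF-16 only at the API call.
    static std::vector<wchar_t> utf32ToUtf16(const std::vector<unsigned int>& in) {
        std::vector<wchar_t> out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            unsigned int cp = in[i];
            if (cp < 0x10000) {
                out.push_back((wchar_t)cp);              // BMP: one unit
            } else {                                     // else: surrogate pair
                cp -= 0x10000;
                out.push_back((wchar_t)(0xD800 | (cp >> 10)));
                out.push_back((wchar_t)(0xDC00 | (cp & 0x3FF)));
            }
        }
        return out;
    }

    void showMessage(const std::vector<unsigned int>& utf32Text) {
        std::vector<wchar_t> buf = utf32ToUtf16(utf32Text);
        buf.push_back(L'\0');                            // Win32 wants NUL termination
        MessageBoxW(NULL, &buf[0], L"Note", MB_OK);
    }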
May 22 2003
One minor point:

We *must* have char/wchar and byte/ubyte/short/ushort as separate, and overloadable, entities. This is about the most egregious and toxic aspect of C/C++ that I can think of. Absolute nightmare when trying to write generic serialisation components, messing around with compiler discrimination pre-processor guff to work out whether the compiler "knows" about wchar_t, and crying oneself to sleep with char, signed char, unsigned char, etc. etc.

Following this logic, if D does evolve to support different character encoding schemes, it would be nice to have separate char types, although I know this will draw the succinctness crowd down on me like a pack of blood-thirsty vultures. Swoop away flying beasties, my gizzard is exposed.

"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6abjh$12m8$1 digitaldaemon.com...
Walter says (in response to my post)...
D needs a Unicode string primitive.
It does already. In D, a char[] is really a utf-8 array.
I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the contents are 7-bit ASCII (a subset of UTF-8). That doesn't mean they support UTF-8. UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.html
UTF-8 has a maximum encoding length of 6 bytes for one character. If such a character appears at index 100 in char[] myString, what is the return value from myString[100]? The answer should be "one UTF-8 char with an internal 6-byte representation." I don't think D does that.
Besides which, my idea was a native string primitive, not a quasi-array. The confusion of strings with arrays was a basic, fundamental mistake of C. While some string semantics do resemble those of arrays, this resemblance should not mandate identical data types. Strings are important enough to merit their own intrinsic type. Icon is not the only language to recognize that fact. D documents make no mention of any string primitive: http://www.digitalmars.com/d/type.html
D has two intrinsic character types, a dynamic array type, and _no_ intrinsic string type. Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide." The differing cross-platform widths of the 'wide' char is asking for trouble; poof goes data portability. D characters are not based on Unicode, but archaic MS Windows API and legacy C terminology spot-welded onto Linux. How about Unicode as a basis?
The ideal type system would offer as intrinsic/primitive/native language types:
- UTF-8 char
- UTF-16 char
- UTF-32 char
- UTF-8 string
- UTF-16 string
- UTF-32 string
- built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
- built-in conversions to/from UTF strings and C-style byte arrays
The preceding list will not seem very long when you consider how many numeric types D supports. Strings are as important as numbers.
The old C 'char' type is merely a byte; D already has 'ubyte.' The distinction between ubyte and char in D escapes me. Maybe the reasoning is that a char might be 'wide' so D needs a separate type? But that reason disappears once you have nice UTF characters. So even if the list is a bit long it also eliminates two redundant types, char and wchar.
I would not be against retention of char and char[] for C compatibility purposes if someone could point out why 'ubyte' and 'ubyte[]' do not suffice. Otherwise I would just alias 'char' into 'ubyte' and be done with it. The wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a struct.
To the user, strings would act like dynamic arrays. Internally they are different animals. Each 'element' of the 'array' can have varying length per Unicode specifications. String primitives would hide Unicode complexity under the hood.
That's just the beginning. Now that you have string intrinsics, you can give them special behaviors pertaining to i/o streams and such. You can define 'streaming' conversions from other intrinsic types to strings for i/o purposes. And...permit me to dream!...you can define Icon-style string scanning expressions.
Mark
Mar 31 2003
Walter -

On a positive and constructive note, an implementation concept might hold some interest. I'm just bringing it to attention, not advocating yet <g>. There's no hard requirement for serial bytewise storage of the proposed intrinsic Unicode strings. Other ways to build Unicode strings exist. The one offered here would do little or no damage to the current compiler. Really it's just a set of small additions.

Consider a Unicode string made of two data structures: a C-style array, and a lookup table. The C-style array holds the first code word for each character. The table holds all second, third, and additional code words. (A 'code word' meaning 8/16/32 bits for UTF 8/16/32 respectively.) The keys to the table are accessed via some function like table_access(100).

This setup unifies C array indices with Unicode character indices. So D can employ straight pointer arithmetic to find any character in the string. Character index = array index. String length (in chars) = implementation array size (in elements). These features may address your hesitation over implementation issues that are complex in the serial case.

Having found the character, D need only check the high bit(s) which flag additional code words. Unicode requires such a test in any case; it's unavoidable. If flagged, D performs a table lookup. This table lookup is the only serious runtime cost. The table could take whatever form is most efficient.

* UTF-32 has no extended codes, so UTF-32 strings don't need tables.
* UTF-16 characters involve only a few percent with extended codes. Ergo - the table is small, and the runtime cost is, say, 2-3%.
* UTF-8 needs the biggest and most table entries, but manageably so.

A downside might be file and network serialization - but we might skate by. D could supply streams on demand, without an intermediate serialized format. If I tell D "write(myFile, myString)" no intermediate format is required. D can just empty the internal array and table to disk in proper byte sequence. The disk or network won't care how D gets the bytes from memory. The only hard serialization requirement would be actual user conversion to byte arrays. (If the user is doing that, let him suffer!)

This scheme supports 7-bit ASCII. An optimization could yield raw C speed. Put an extra boolean flag inside each string structure. This flag is the logical OR of all contained Unicode bit flags. If the string has no extended chars, the flag is FALSE, and D can use alternate string code on that basis. (No bit tests, no table lookups.) That works for UTF-32, 7-bit ASCII, and the majority of UTF-16 strings.

The idea can be nitpicked to death, but it's a concept. Unicode strings and characters will never enjoy the simplicity or speed of 7-bit ASCII. That's a fact of life, meaning that implementation concepts cannot be faulted on such a basis. What would be nice is to make Unicode maximally simple and maximally efficient for D users.

Thanks again Walter,

Best- Mark
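P.S. To sketch just the lookup path for UTF-16 (C++-ish; the struct layout and names are placeholders, not a proposed implementation):

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct Utf16String {
        std::vector<uint16_t> firstUnits;        // one entry per character
        std::map<std::size_t, uint16_t> extra;   // trailing units; few entries

        // Character index == array index: O(1) access plus, for the rare
        // extended character, one table probe flagged by the surrogate bits.
        std::pair<uint16_t, uint16_t> charAt(std::size_t i) const {
            uint16_t lead = firstUnits[i];
            if ((lead & 0xFC00) == 0xD800)       // high surrogate: extended char
                return std::make_pair(lead, extra.find(i)->second);
            return std::make_pair(lead, (uint16_t)0);
        }
    };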
Mar 31 2003
Qualifying this again with the stipulation that I am far from an expert on this issue (aside from having a fair amount of experience in a negative sense):

This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued as to the nature of the lookup table. Is this a constant, process-wide, entity?

If I had time when it was introduced I'd be keen to participate in the serialisation stuff, on which I have firmer footing.

It's not clear now whether you've dropped the suggestion for a separate string class, or just that arrays of "char" types would be dealt with in the fashion that you've outlined.

Finally, I'm troubled by your comments "on a positive and constructive note" and "maybe it will mend fences to " (other post). Have I missed some animus that everyone else has perceived? If so, I don't know which side to be on. Seriously, though, I don't think anyone's getting shirty, so chill, baby. :)

Keep those great comments coming. I'm learning heaps.
Mar 31 2003
"Matthew Wilson" <dmd synesis.com.au> wrote in message news:b6bgt5$1sai$1 digitaldaemon.com...This sounds like a nice idea - array of 1st-byte plus lookups. I'mintriguedas to the nature of the lookup table. Is this a constant, process-wide, entity?No, because the map is indexed by the same index used to index into the flat array. Unless I'm misunderstanding something. Perhaps these could be grouped into separate maps by the total size of the char, which I think is determinable from the first char? May speed lookups a tad, or slow them down, not sure. Sean
Apr 01 2003
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:b6bjg5$1ut5$1 digitaldaemon.com..."Matthew Wilson" <dmd synesis.com.au> wrote in message news:b6bgt5$1sai$1 digitaldaemon.com...flatThis sounds like a nice idea - array of 1st-byte plus lookups. I'mintriguedas to the nature of the lookup table. Is this a constant, process-wide, entity?No, because the map is indexed by the same index used to index into thearray. Unless I'm misunderstanding something.You could use a static 256 byte lookup table to give you the 'stride' to the next char.
May 21 2003
That lets you index sequentially pretty fast, but not randomly.

Sean

"Walter" <walter digitalmars.com> wrote in message news:bagk8l$30ti$2 digitaldaemon.com...
You could use a static 256 byte lookup table to give you the 'stride' to the next char.
May 22 2003
This sounds like a nice idea - array of 1st-byte plus lookups.
Thanks. Correction, "array of first code words." Only in UTF-8 are they byte-sized.
I'm intrigued as to the nature of the lookup table. Is this a constant, process-wide, entity?
No. There is one table per string.
I'd be keen to participate in the serialisation stuff
No need for serialization. Even the compiler can do serialization with no memory footprint. Only something like an explicit conversion to ubyte[] would mandate that.
It's not clear now whether you've dropped the suggestion for a separate string class, or just that arrays of "char" types would be dealt with in the fashion that you've outlined.
I never suggested a string 'class,' just Unicode string and char intrinsic types. My list of proposed intrinsics has already been supplied. Think int, float, string8, string16, char8, etc. C made a huge mistake in confusing arrays with strings. Strings deserve intrinsic status and a type all their own. The ugly char/wchar gimmick has also seen its day and needs replacement.

Mark

The internal implementation might read like this in C++-ish, heavy on the "ish" -- this is the ideal; it's just a communication vehicle for the concept:

    // code word storage types (D's ubyte/ushort/uint, spelled out for C++)
    typedef unsigned char  UTF8_CODE;
    typedef unsigned short UTF16_CODE;
    typedef unsigned int   UTF32_CODE;

    // max code words per Unicode character
    const unsigned short UTF8_CODE_MAX  = 6;
    const unsigned short UTF16_CODE_MAX = 2;
    const unsigned short UTF32_CODE_MAX = 1;

    template <typename UTF_CODE, unsigned short UTF_CODE_MAX>
    class ExtensionTableEntry
    {
    public:
        int myStringPositionIndex;
        UTF_CODE myStorage[UTF_CODE_MAX + 1]; // null terminated?
    };

    // a partially defined Unicode String class concept
    template <typename UTF_CODE, unsigned short UTF_CODE_MAX>
    class UnicodeString
    {
    public:
        long length;
        UTF_CODE* operator[](long index);
    private:
        UTF_CODE* firstWordsArray;
        std::hash_map< int, ExtensionTableEntry<UTF_CODE, UTF_CODE_MAX> > myLookup;
    };

    typedef UnicodeString<UTF8_CODE,  UTF8_CODE_MAX>  String8;
    typedef UnicodeString<UTF16_CODE, UTF16_CODE_MAX> String16;
    typedef UnicodeString<UTF32_CODE, UTF32_CODE_MAX> String32;

/* Walter - each table entry should hold the full Unicode char not just its extension codes. This tactic would create some redundancy, but not much. Having the whole character in contiguous memory could be advantageous for passing pointers around. So the C++ operator[] either returns a pointer into the firstWordsArray, or a pointer to the table entry's myStorage field. In all cases the firstWordsArray always holds the first code word of the char, whether it's an extended one or not. */
Apr 01 2003
The only problem with this idea is that passing this dual structure to a piece of code that expects a linear string of data won't work. Typecasting to ubyte[] or ushort[] should solve that, right? You would probably need to know the length of such a string both in bytes and in chars.

Sean
Apr 01 2003
Sean L. Palmer says...
The only problem with this idea is that passing this dual structure to a piece of code that expects a linear string of data won't work.
Serialization at choke points has a cost of (a) zero, because the string has no extended codes (say typ. 95%+ of UTF-16 and by definition 100% of UTF-32), or (b) an alloc plus copy equivalent, which is acceptable for small to medium strings (another statistically large class in software programs).

You run into problems only with large UTF-8 strings that are frequently passed to/from Unicode APIs. Windows uses UTF-16 so it's no problem. Where you find UTF-8 happening is on the web, but that has inherent delays of its own, so the cost might go unnoticed. Consider for example that plenty of web sites are driven with UTF-8 by languages far slower than D.

Mark
Apr 01 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6dolr$di3$1 digitaldaemon.com...You run into problems only with large UTF-8 strings that are frequentlypassedto/from Unicode APIs. Windows uses UTF-16 so it's no problem. Where youfindUTF-8 happening is on the web, but that has inherent delays of its own, sothecost might go unnoticed. Consider for example that plenty of web sitesaredriven with UTF-8 by languages far slower than D.I've been looking at some books for programming CGI apps in C. I see the dreaded buffer overflow errors in the sample code even in highly regarded books. No wonder security is such a mess! Doing CGI in D would eliminate those problems.
May 21 2003
That's so crazy it just might work! ;)

I think it's a fine concept. One point I'd like to add is that when straight iterating over the string, the library function can iterate over both the main array and the secondary map at the same time, in sync, with no map lookups, only iteration. This would be an interesting bit to actually implement. But no harder than the many other possible solutions, and easier and more efficient than most, especially for random-access indexing, which seems to be what D is leaning toward in general.

I'd prefer iteration to be the normal way of using D arrays, rather than explicit loops and indexing. Those are, for obvious reasons, difficult to optimize. But Walter has not decided on a good foreach construct, and newsgroup discussion on the topic has died down. Anyone have any good proposals? I haven't used any language that has good iterators, except if you count C++ STL.

Sean
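P.S. To make the synced walk concrete, a rough sketch (C++-ish; it assumes the side entries are kept sorted by character index, and all names are illustrative rather than a real proposal):

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    typedef std::pair<std::size_t, uint16_t> ExtraUnit; // (char index, trail unit)

    // Walk the flat array once while a cursor advances through the side
    // entries: no per-character table lookup, just two synced iterations.
    template <typename Visit>
    void forEachChar(const std::vector<uint16_t>& firstUnits,
                     const std::vector<ExtraUnit>& extras,
                     Visit visit) {
        std::size_t e = 0;
        for (std::size_t i = 0; i < firstUnits.size(); ++i) {
            uint16_t lead = firstUnits[i];
            if (e < extras.size() && extras[e].first == i) {
                // extended character: fold the surrogate pair into a code point
                uint32_t cp = 0x10000
                    + ((uint32_t)(lead - 0xD800) << 10)
                    + (extras[e].second - 0xDC00);
                ++e;
                visit(cp);
            } else {
                visit((uint32_t)lead);
            }
        }
    }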
Apr 01 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b6bb6i$1ont$1 digitaldaemon.com...What would be nice is to make Unicode maximally simple and maximallyefficientfor D users.I appreciate the thought, but carrying around an extra array for each string seems difficult to make work, especially in view of slicing, etc. I don't think there's any way to design the language so it is both efficient at dealing with ordinary ascii, and transparently able to do multibytes.
May 21 2003
Walter wrote:
I appreciate the thought, but carrying around an extra array for each string seems difficult to make work, especially in view of slicing, etc.
I would need a specific implementation code example to understand your thinking. (Clarification: I did not propose an extra array per string, but a lookup table -- something considerably smaller and often empty.) My gut says it would be easy.
I don't think there's any way to design the language so it is both efficient at dealing with ordinary ascii, and transparently able to do multibytes.
The problem here is either/or thinking. Both are possible. People who desperately want C byte arrays can declare them, irrespective of Unicode strings. If the idea is that an intrinsic string type must simultaneously support Unicode and ASCII at equal performance levels, then I think the problem is one of definition. In the first place D lacks an honest string intrinsic, so a new one could be defined just for Unicode, leaving the current whatever-it-is in place. If people don't care for Unicode, then they can use whatever-it-is D offers currently.

However my gut says that a Unicode string intrinsic holding just ASCII vs. an ASCII string as currently implemented would be neck and neck in terms of performance. Remember that you don't necessarily need a bit test on every character every time. The table object can flag callers when it's totally empty and they can proceed with manipulations on that basis. In that sense the Unicode concept is really just a superset of what you already have.

Considering the number of languages now being retrofitted for Unicode, I think it would be a mistake not to build it into D when the chance to do it cleanly exists, one that will be regretted later.

Best, Mark
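P.S. Here is that fast path in the serialization case (a C++-ish sketch; the names are hypothetical, and hasExtended is assumed to be maintained as the OR of the per-character flags whenever the string is built or modified):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Str16 {
        std::vector<uint16_t> firstUnits;
        std::map<std::size_t, uint16_t> extra;  // the per-string lookup table
        bool hasExtended;                       // false whenever extra is empty
    };

    // Write to disk with no intermediate serialized buffer.
    void writeString(std::FILE* f, const Str16& s) {
        if (s.firstUnits.empty()) return;
        if (!s.hasExtended) {
            // fast path: the flat array already is the UTF-16 stream,
            // so no bit tests and no table lookups
            std::fwrite(&s.firstUnits[0], sizeof(uint16_t), s.firstUnits.size(), f);
            return;
        }
        // general path: splice trailing units from the table into the stream
        std::map<std::size_t, uint16_t>::const_iterator e = s.extra.begin();
        for (std::size_t i = 0; i < s.firstUnits.size(); ++i) {
            std::fwrite(&s.firstUnits[i], sizeof(uint16_t), 1, f);
            if (e != s.extra.end() && e->first == i) {
                std::fwrite(&e->second, sizeof(uint16_t), 1, f);
                ++e;
            }
        }
    }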
May 23 2003