
digitalmars.D.learn - string-ish range/stream from curl ubyte[] chunks?

reply "Vlad" <b100dian gmail.com> writes:
Hello D programmers,

I am toying with writing my own HTML parser as a pet project, and 
I strive to have a range API for the tokenizer and the parser 
output itself.

However it occurs to me that in real-life browsers the advantage 
of this type of 'streaming' parsing would be given by also having 
the string that plays as input to the tokenizer treated as a 
'stream'/'range'.

While D's string types (string, wstring, dstring) do work as 
ranges, what I want to write is a 'ChunkDecoder' range that would 
take curl 'byChunk' output and make it consumable by the tokenizer.
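
For illustration only, here is a minimal sketch of what such a pull-style range could look like. It assumes the input is UTF-8 and arrives as a range of ubyte[] chunks; the names `ChunkDecoder` and `chunkDecoder` are hypothetical, not Phobos types. The main point is that a multi-byte sequence may be split across two chunks, so bytes are gathered one code point at a time before decoding. It yields dchar, so it does not address the per-dchar cost discussed next.

```d
import std.range.primitives;
import std.utf : decode, stride, UTFException;

/// Hypothetical sketch: presents a range of UTF-8 byte chunks
/// as an input range of dchar.
struct ChunkDecoder(R)
    if (isInputRange!R && is(ElementType!R : const(ubyte)[]))
{
    private R chunks;
    private const(ubyte)[] cur;   // unread part of the current chunk
    private dchar frontChar;
    private bool done;

    this(R chunks)
    {
        this.chunks = chunks;
        if (!this.chunks.empty)
            cur = this.chunks.front;
        popFront();               // prime the first element
    }

    @property bool empty() const { return done; }
    @property dchar front() const { assert(!done); return frontChar; }

    void popFront()
    {
        ubyte b;
        if (!nextByte(b)) { done = true; return; }

        char[4] buf;
        buf[0] = cast(char) b;
        // stride() derives the sequence length from the lead byte
        // and throws UTFException if it is not a valid lead byte.
        immutable len = stride(buf[], 0);
        foreach (i; 1 .. len)
        {
            // The remaining bytes may live in the next chunk.
            if (!nextByte(b))
                throw new UTFException("truncated UTF-8 sequence at end of input");
            buf[i] = cast(char) b;
        }
        size_t idx = 0;
        const(char)[] seq = buf[0 .. len];
        frontChar = decode(seq, idx);
    }

    // Fetches the next byte, advancing to the next non-empty chunk if needed.
    private bool nextByte(out ubyte b)
    {
        while (cur.length == 0)
        {
            if (chunks.empty) return false;
            chunks.popFront();
            if (chunks.empty) return false;
            cur = chunks.front;
        }
        b = cur[0];
        cur = cur[1 .. $];
        return true;
    }
}

/// Convenience constructor so the chunk range type is inferred.
auto chunkDecoder(R)(R chunks) { return ChunkDecoder!R(chunks); }

void main()
{
    import std.algorithm.comparison : equal;

    // U+00E9 is C3 A9 in UTF-8; here it is split across two chunks.
    ubyte[] first  = [0x68, 0xC3];
    ubyte[] second = [0xA9, 0x21];
    assert(chunkDecoder([first, second]).equal("h\u00E9!"d));
}
```
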

Now, the problem: string itself has ElementType!string == dchar. 
Consuming a string a dchar at a time looks like a wasteful 
operation if e.g. your string is UTF-8 or UTF-16.

So, naturally, I would like to use indexOf() - instead of 
countUntil() - and opSlice (without opDollar?) on my ChunkDecoder 
(forward) range.
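
To make the distinction concrete, here is a small sketch for plain in-memory strings. It assumes the delimiter being searched for is ASCII; `indexOfByte` is a hypothetical helper name. The range primitives decode to dchar, but indexing, slicing and std.string.indexOf operate on code-unit offsets, so the result can be used directly for slicing.

```d
import std.algorithm.searching : countUntil;
import std.range.primitives : ElementType;
import std.string : indexOf, representation;

// Hypothetical helper: finds the code-unit offset of an ASCII
// delimiter without decoding the string to dchar.
ptrdiff_t indexOfByte(string s, char delim)
{
    // representation() views the string as immutable(ubyte)[], so
    // countUntil compares raw code units rather than code points.
    return s.representation.countUntil(cast(ubyte) delim);
}

void main()
{
    // As a range, a narrow string has dchar elements...
    static assert(is(ElementType!string == dchar));
    // ...but indexing still yields individual code units.
    static assert(is(typeof("abc"[0]) == immutable(char)));

    string html = "h\u00E9llo<b>";

    // U+00E9 takes two UTF-8 code units, so '<' sits at offset 6.
    auto i = indexOfByte(html, '<');
    assert(i == 6);
    assert(html.indexOf('<') == i);

    // Code-unit offsets are valid slice bounds.
    assert(html[0 .. i] == "h\u00E9llo");
    assert(html[i .. $] == "<b>");
}
```
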

Q: Is anything like this already in use somewhere in the standard 
library or a project you know?
Q2: Or do you have any pointers for what the smallest API would 
be for a string-like range class?

And bonus:
Q3: any uses of such a string-ish range in other standard library 
methods that you can think of and could be contributed to? e.g. 
suppose this doesn't exist and I / we come up with a proposal of 
minimal API to consume a string from left to right.

Thanks for your time and your suggestions!
May 16 2014
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 16 May 2014 16:57:41 -0400, Vlad <b100dian gmail.com> wrote:

 Q: Is anything like this already in use somewhere in the standard  
 library or a project you know?
There is an effort by myself and Dmitry Olshansky to create a stream API
that looks like a range. I am way behind on getting it to work, but I have
something that compiles. The goal is to (optionally) replace the underlying
mechanism for std.stdio, and to replace std.stream.

 Q2: Or do you have any pointers for what the smallest API would be for a  
 string-like range class?
I think Dmitry has a pretty good API. I will hopefully be posting my
prototype soon. I hate to say wait for it, because I have been very lousy at
getting things finished lately. But I want to have something to show before
the conference.

The code I have will support all encodings, and provide a range API that
works with dchar-like ranges. The idea is to be able to make code that works
with both arrays and streams seamlessly.

 And bonus:
 Q3: any uses of such a string-ish range in other standard library  
 methods that you can think of and could be contributed to? e.g. suppose  
 this doesn't exist and I / we come up with a proposal of minimal API to  
 consume a string from left to right.
I would hate for you to duplicate efforts, so hold off until we get
something workable. Then we can discuss the API.

Dmitry's message is here:
http://forum.dlang.org/post/l9q66g$2he3$1@digitalmars.com

My updates have not been posted to github yet; I don't want to post
half-baked code. Stay tuned.

-Steve
May 16 2014
parent reply "Vlad" <b100dian gmail.com> writes:
Thanks Steve for your prompt reply.

This is exactly why I asked on the forums, since it was hard for me to 
believe I was the only one thinking of this. I would also hate to 
duplicate the effort, so I'll just code my parser against string and 
wait to see how your proposal and Dmitry's turn out. (I did check his 
post, and it sounds EXACTLY like the problem I was facing with my toy 
parser!)

Just to make one thing clear: would this future module work with e.g. 
the ubyte[] chunks I receive from curl?

Thanks!

p.s.
Is this the talk? http://dconf.org/2014/talks/olshansky.html
May 16 2014
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 16 May 2014 18:36:02 -0400, Vlad <b100dian gmail.com> wrote:

 Just to make one thing clear: would this future module work with e.g.  
 the ubyte[] chunks I receive from curl?
Most likely. I would expect a curl-based stream to fit right in, it's just passing in bytes. One piece that I haven't quite fleshed out is how to drive the process. In some cases, you are pulling data from the source (traditional stream-based I/O), in other cases, something else is pushing the data (CURL). We need to handle both seamlessly. I admit I have never looked at D's curl package, just used it via C/C++.
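
As a rough illustration of the "push" case, the sketch below assumes UTF-8 input and a producer that calls put() with each byte chunk as it arrives, for example from a curl receive callback. `PushDecoder` and its members are hypothetical names; this is not the prototype API mentioned in this thread. Bytes of an incomplete trailing sequence are held back until the next chunk completes them.

```d
import std.utf : decode, stride;

/// Hypothetical push-mode decoder: complete code points are
/// forwarded to a sink delegate as chunks are pushed in.
struct PushDecoder
{
    private void delegate(dchar) sink;
    private char[4] pending;      // bytes of a not-yet-complete sequence
    private size_t pendingLen;

    this(void delegate(dchar) sink) { this.sink = sink; }

    void put(const(ubyte)[] chunk)
    {
        foreach (b; chunk)
        {
            pending[pendingLen++] = cast(char) b;
            // stride() derives the sequence length from the lead byte
            // (and throws UTFException on an invalid lead byte).
            immutable need = stride(pending[], 0);
            if (pendingLen < need)
                continue;         // rest may arrive in a later chunk
            size_t idx = 0;
            const(char)[] seq = pending[0 .. pendingLen];
            sink(decode(seq, idx));
            pendingLen = 0;
        }
    }

    /// True if the data pushed so far ends in the middle of a sequence.
    @property bool incomplete() const { return pendingLen != 0; }
}

void main()
{
    dchar[] got;
    auto dec = PushDecoder((dchar c) { got ~= c; });

    // U+00E9 (C3 A9) is split across the two pushed chunks.
    ubyte[] first  = [0x68, 0xC3];
    ubyte[] second = [0xA9, 0x21];

    dec.put(first);
    assert(dec.incomplete);
    dec.put(second);
    assert(!dec.incomplete);
    assert(got == "h\u00E9!"d);
}
```

A pull-style range over the same data would instead ask the source for the next chunk when it runs out of bytes; the decoding step in the middle is the same in both cases.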
 p.s.
 Is this the talk? http://dconf.org/2014/talks/olshansky.html
That is Dmitry's talk (same guy), but I think it is not about his I/O
ideas; it is about his excellent std.regex package.

-Steve
May 16 2014