digitalmars.D - [RFC] I/O and Buffer Range
- Dmitry Olshansky (155/155) Dec 29 2013 Didn't mean this to be kind of under the tree .. Planned to post long
- Vladimir Panteleev (27/28) Dec 29 2013 Hmm, just yesterday I was rewriting a parser to use a buffer
- Marco Leise (20/27) Dec 29 2013 Yeah, been there too. I guess after X years of programming
- Dmitry Olshansky (5/20) Dec 30 2013 Yeah, that's a good idea.
- Marco Leise (5/29) Dec 30 2013 One day it'll catch on :D
- Dmitry Olshansky (27/52) Dec 30 2013 It's expected that the window is increased. The exact implementation may...
- Brad Anderson (66/67) Dec 30 2013 Having never written any parser I'm not really qualified to
- Joseph Cassman (11/43) Dec 30 2013 I find the same to be the case with API's used to process files.
- Dmitry Olshansky (15/53) Dec 31 2013 Exactly, the situation is simply not good enough. I can assure you that
- Joseph Cassman (34/35) Dec 30 2013 Interesting idea. Seems to fill a need I have been facing with
- Dmitry Olshansky (25/58) Dec 31 2013 The trick is that the name Forward range is misleading actually (I was
- Joseph Cassman (24/41) Dec 31 2013 I think I now understand a bit better what you were thinking when
- Dmitry Olshansky (52/86) Jan 04 2014 I've created a fork where I've implemented just that.
- Jason White (12/18) Jan 04 2014 I agree. I wrote a (mostly complete) file stream implementation
- Dmitry Olshansky (7/23) Jan 05 2014 As an advice I'd suggest to drop the 'Data' part in writeData/readData.
- Jason White (10/14) Jan 05 2014 You're right, but it avoids a name clash if it's composed with
- Dmitry Olshansky (15/29) Jan 05 2014 I my view text implies something like:
- Jason White (23/39) Jan 05 2014 Those would do the same thing for either text or binary data.
- Dmitry Olshansky (38/81) Jan 06 2014 Ok, now I see. In my eye though serialization completely hides raw
- Jason White (15/70) Jan 06 2014 I was thinking it should also have "alias stream this;", but
- Dmitry Olshansky (10/42) Jan 07 2014 Then our goals are aligned. Be sure to take a peek at (if you haven't
- Steven Schveighoffer (68/109) Jan 16 2014 no
- Dmitry Olshansky (40/91) Jan 16 2014 The only interesting thing I'd add here s that some buffer may work
- Steven Schveighoffer (80/169) Jan 16 2014 :
- Dmitry Olshansky (16/104) Jan 16 2014 The other way around :) 4 code units - 1 code point.
- Steven Schveighoffer (19/35) Jan 16 2014 I knew that was going to happen :)
- Dmitry Olshansky (44/82) Jan 17 2014 Aye. BTW I haven't thought of writing into the buffer, but it works
- Steven Schveighoffer (12/44) Jan 17 2014 In my experience, the code/process for a write buffer is significantly
- Dmitry Olshansky (5/18) Jan 17 2014 Agreed, read/write buffer is a bad idea. As for write-only buffer
- Steven Schveighoffer (28/47) Jan 17 2014 :
- Dmitry Olshansky (10/46) Jan 17 2014 I'm doubting that actually. Just use the freaking MM-file directly you'd...
- Steven Schveighoffer (10/17) Jan 17 2014 :
- Kagamin (4/10) Jan 17 2014 If you have a struct-based buffer, how would you enlarge the
- Dmitry Olshansky (8/18) Jan 17 2014 What's the problem? I don't see how struct/class can change there
- Kagamin (4/7) Jan 17 2014 Ah, I thought one can copy the buffer, now I see you pass it by
- Jakob Ovrum (3/14) Jan 17 2014 It is the lazily initialized nature of AAs that causes the
- Brian Schott (9/9) Jan 04 2014 I've been rewriting a bit of the lexer in DScanner.
- Dmitry Olshansky (13/21) Jan 04 2014 startsWith & peek could be implemented with proposed the lookahead, so
- Brian Schott (5/5) Jan 09 2014 My experimental lexer generator now uses the buffer range.
- Dmitry Olshansky (19/24) Jan 12 2014 Cool!
- Walter Bright (6/8) Jan 16 2014 I am confused because there are 4 terms conflated here:
- Dmitry Olshansky (6/16) Jan 16 2014 One and the same for the moment.
- Walter Bright (5/21) Jan 16 2014 I know. But using two names for the same thing makes for confusing
Didn't mean this to be kind of under the tree... Planned to post long ago but I barely had the time to flesh it out.

TL;DR: Solve the input side of the I/O problem with a new kind of range. I'm taking on the buffering primitive on top of low-level unbuffered I/O sources specifically.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka@steves-laptop?page=4
http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmars-d@puremagic.com?page=1
http://forum.dlang.org/thread/vnpriguleebpbzhkpdwi@forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design, found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text & binary alike) with D soon finds out that while the slice-em-up approach with random access ranges works very nicely, any attempt to extend it beyond arrays and in-memory stuff gets really awkward. It adds up to implementing no small amount of bookkeeping (buffering), all the while having to use primitives that just don't map well to the underlying "hardware" - a buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)? Save the range, do 3 front/popFront with empty checks, then copy the saved range back. Painful, especially when you know full well that it could be just a length check plus 3 direct reads of an array.

Even if the underlying buffer allows you to slice up the contents and do array-wise operations on them, the dumbed-down forward range interface just won't let you. And once all of the hard (and useless) work to support forward ranges is done - you find there is a 2nd parser that should support forward ranges as well. It leads to the conclusion that parsers simply don't want to work with forward ranges; they need something bigger, and all the extra work you did on buffering really belongs to that bigger primitive.

To put the last nail in the coffin - most of the interesting sources can't even be forward ranges! A stream over a TCP socket - how do you 'save' that, Sherlock? Keeping in mind that it must return the same type - we'd better have called it "fork" back then.

To sum it up - the current selection of ranges is not good enough to _efficiently_ work with I/O. Yet ranges are very neat, and we'd better retain their strengths and capitalize on the existing work done in this area (e.g. a slew of good stuff in std.algorithm and std.range). Got to extend the notion then.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side of the I/O problem. Output IMHO might need some tweaks, but with OutputRange it's in much better shape than input.

The said new input primitive (from now on "buffer range") must naturally and efficiently support the following:
1) The zero-copy, insanely fast & popular slice-em-up approach for in-memory use
2) "Over the wire" sources such as pipes, sockets and the like
3) Memory-mapped files, esp. the ones that don't fit into memory
4) Primitives that (all kinds of) parsers need, including lookahead and lookbehind

It would be awesome if we can:
5) Retain backwards compatibility and integrate well with existing ranges
6) Avoid dependency on the C run-time and surpass it performance-wise
7) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and buffer ranges with input sources, we have grown something powerful indeed.

Spoiler

I have a proof of concept implemented.
It *works*. What I'm looking for is to simplify the set of primitives and/or better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with e.g. a bit trimmed-down std.regex using this abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d
and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d
(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but with no more need to work line by line and/or load the whole file. In fact it's faster than the fastest line-by-line solution I had before: fgets + std.regex.matchFirst. Note that this (old) solution is cheating twice - it seeks only 1 match per line, and it knows the line length ahead of time. For me this proves that (6) is well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives, but my goal was sufficient first, minimal second. Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break a good thing (and foreach). This gets us somewhat close to (5). :)

Accept that given requirements (1)-(3) we are working with a "sliding window" over whatever is the true source of data. Thus a sliding window can be the whole array, a mapped area of a file or the buffer of a network stream. A sliding window may be moved across the input stream (~ re-loading the buffer) or extended to reach further. Now what we need is to properly exploit the capabilities of the sliding window model and match them with the primitives a parser would "like".

2. Slicing "after the fact". This means the ability to mark relevant parts of the buffer as the start of something "useful", and to require the underlying implementation, when the time comes to move the window, to keep the data starting from there. Typically later, "down the stream", when the boundaries of the slice are established, it's extracted, examined and (rarely!) copied over. The ability to avoid a copy unless absolutely necessary is _crucial_.

Motivating (pseudo-)code I've seen in lexers (e.g. DScanner) looks like this with my primitives:

{
    //pin this position so that buffering won't lose it
    auto m = input.mark();
    while(!input.empty && isAlpha(input.front)){
        input.popFront();
    }
    //get a slice from 'm' to current position
    auto word = input.slice(m);
    //grab a copy if haven't seen before
    if(word !in identifiers)
        identifiers.insert(word.idup);
} //here m's destructor unpins position in input

To address slicing (a parser requirement) I had to introduce the marking and pinning concept explicitly. See mark/slice in the docs.

3. Cheap save/restore. Copying the whole range to save iteration state was a bad idea. It's not just wasteful memory-wise, it's _semantically_ costly. In the case of a buffer range it would also imply some "re-reading" of the I/O source, as the copy has to be a completely independent view of the same input(!). Hence "save" of forward range is an inadequate primitive that must be dropped. However, when the time comes to backtrack (and many parsers do that, quite often) the state has to be restored. To minimize the primitive count and not break requirement (2), reuse 'mark' to save the state and add 'seek' to restore the position to the marked point. As it was pinned it's always in the buffer and readily accessible, keeping the ability to work over streams.
4. Even cheaper save/restore. Something discovered the hard way in a practical setting is that setting up the boundaries of stuff to slice (captures in regex) must be dirt cheap, like integer-assignment cheap. This means always using marking for every sub-slice is prohibitively costly, as it has to communicate with the buffer (currently bump a counter in some array). The idea that worked out with std.regex is to use a single "vantage point" mark, take a big slice off that, and then make sub-slices of it as the plain array it is. Then from time to time the "vantage point" should be updated when there are no sub-slices in mid-air.

This leads to a 'tell' primitive that gives the offset from a given mark to the current position, plus 'seek' overloads that work with relative offsets (positive and negative). Another use of relative seek is skipping over data without looking at it, potentially dropping whole buffers away (there is overlap with popFrontN though). Also, fixed-width backtracking is easily had with relative seek if we can instruct the buffer range to keep at least the K last bytes in the buffer (require minimal history).

5. Cheap lookahead/lookbehind. This exploits the fact that the underlying buffer is nothing but an array, hence one may easily take a peek at some portion of it as a plain ubyte[]. The implementation makes sure it buffers up as much data as required and/or returns an empty range if that's not possible. This supports things like LL(k) lookahead and the fixed-length lookbehind that is common in regex zero-width assertions. These 2 can be implemented with relative 'seek' + front/popFront; the question remains how effective that would be.

-- 
Dmitry Olshansky
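[For contrast with the forward-range dance from the motivation section, a hedged sketch of an LL(3) peek in terms of the proposed primitives; the token being disambiguated is invented, and per the docs lookahead(n) is assumed to return an empty slice when n bytes can't be buffered.]

// distinguish "1.." (a range operator) from "1.5" (a float literal)
auto peek = input.lookahead(3);   // plain ubyte[]: a length check + 3 direct reads
if (peek.length == 3 && peek[1] == '.' && peek[2] == '.')
{
    input.seek(3);                // consume the token via relative seek
    // ... emit the range-operator token
}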
Dec 29 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
> [snip]

Hmm, just yesterday I was rewriting a parser to use a buffer instead of loading the whole file in memory, so this is quite timely for me. Questions:

1. What happens when the distance between the pinned and current position exceeds the size of the buffer (sliding window)? Is the buffer size increased, or is the stream rewound if possible and the range returned by slice does the seeking?

2. I don't understand the rationale behind the current semantics of lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the first whitespace char, you don't know how many chars to request. Wouldn't it be better to make these functions return the ENTIRE available buffer in O(1)? I guess I see the point when applied to regular expressions, where the user explicitly specifies how many characters to look ahead/behind. However, I think in most use cases the amount is not known beforehand (without imposing arbitrary limitations on users like "Thou shalt not have variable identifiers longer than 32 characters"), so the pattern would be "try a cheap lookahead/behind, and if that fails, do an expensive one".

3. I think ideally the final design would use something like what std.allocator does with "unbounded" and "chooseAtRuntime" - some uses might not need lookahead or lookbehind or other features at all, so having a way to disable the relevant code would benefit those cases.
Dec 29 2013
On Sun, 29 Dec 2013 22:45:35 +0000, "Vladimir Panteleev" <vladimir@thecybershadow.net> wrote:
> [snip]
> 2. I don't understand the rationale behind the current semantics of lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the first whitespace char, you don't know how many chars to request. Wouldn't it be better to make these functions return the ENTIRE available buffer in O(1)?

Yeah, been there too. I guess after X years of programming chances are good you implemented some buffer with look-ahead. :)

I use those primitives:

ubyte[] mapAvailable() pure nothrow
ubyte[] mapAtLeast(in ℕ count)

See:
https://github.com/mleise/piped/blob/master/src/piped/circularbuffer.d#L400

Both return a slice over all the available buffered data. mapAtLeast() also waits till enough data is available from the producer thread. The circular buffer is automatically extended if the producer wants to write a larger chunk or the consumer needs a larger window. (I was mostly focused on lock-free operation in not-starved situations and minimal bounds checking, so the code is a bit more sophisticated than the average circular buffer.)

-- 
Marco
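[A hedged usage sketch for the two primitives above: scanning for the next newline and growing the mapped window on demand. countUntil is from std.algorithm; how mapAtLeast behaves at end-of-stream is an assumption here.]

import std.algorithm : countUntil;

ubyte[] window = buf.mapAvailable();
auto nl = window.countUntil('\n');
while (nl < 0)
{
    // ask for more than we currently see; waits on the producer thread
    window = buf.mapAtLeast(window.length + 1);
    nl = window.countUntil('\n');
}
auto line = window[0 .. nl];   // a slice over the buffer, no copy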
Dec 29 2013
On 30-Dec-2013 09:29, Marco Leise wrote:
> [snip]
>
> I use those primitives:
>
> ubyte[] mapAvailable() pure nothrow
> ubyte[] mapAtLeast(in ℕ count)

Yeah, that's a good idea. I mean the primitives, not Unicode in type names ;)

-- 
Dmitry Olshansky
Dec 30 2013
On Mon, 30 Dec 2013 12:08:18 +0400, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> Yeah, that's a good idea. I mean the primitives, not Unicode in type names ;)

One day it'll catch on :D

-- 
Marco
Dec 30 2013
On 30-Dec-2013 02:45, Vladimir Panteleev wrote:
> 1. What happens when the distance between the pinned and current position exceeds the size of the buffer (sliding window)? Is the buffer size increased, or is the stream rewound if possible and the range returned by slice does the seeking?

It's expected that the window is increased. The exact implementation may play any dirty tricks it sees fit as long as it can provide a slice over the pinned area. In short - maintain the illusion that the window has increased. I would be against a seeking range and would most likely opt for memory-mapped files instead, but it all depends on the exact numbers.

> 2. I don't understand the rationale behind the current semantics of lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the first whitespace char, you don't know how many chars to request.

If you want to 'find', just do front/popFront, no? Or do you specifically want to do array-wise operations?

> Wouldn't it be better to make these functions return the ENTIRE available buffer in O(1)?

Indeed, now I think that 2 overloads would be better:

auto lookahead(size_t n); // exactly n bytes, re-buffering as needed
auto lookahead();         // all that is available in the window, no re-buffering

Similar for lookbehind.

> I guess I see the point when applied to regular expressions, where the user explicitly specifies how many characters to look ahead/behind.

Actually the user doesn't - our lookahead/lookbehind is variable length. One thing I would have to drop is unbounded lookbehind, not that it's so critical.

> However, I think in most use cases the amount is not known beforehand (without imposing arbitrary limitations on users like "Thou shalt not have variable identifiers longer than 32 characters"), so the pattern would be "try a cheap lookahead/behind, and if that fails, do an expensive one".

I would say that in the case where you need arbitrary-length lookahead, m = mark, seek + popFront x N, seek(m) should fit the bill. Or, as is the case in regex at the moment - mark once, and use seek back to some position relative to it. In one word - backtracking. An example of where fixed lookahead rocks:
https://github.com/blackwhale/datapicked/blob/master/dpick/buffer/buffer.d#L421

> 3. I think ideally the final design would use something like what std.allocator does with "unbounded" and "chooseAtRuntime" - some uses might not need lookahead or lookbehind or other features at all, so having a way to disable the relevant code would benefit those cases.

It makes sense to make lookahead and lookbehind optional. As for the code - for the moment it doesn't add much and builds on stuff already there. Though I suspect some other implementations would be able to "cut corners" more efficiently.

-- 
Dmitry Olshansky
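[A hedged sketch of the arbitrary-length lookahead pattern named above - "m = mark, seek + popFront x N, seek(m)" - with std.ascii.isWhite assumed for the predicate:]

auto m = input.mark();        // pin the current position
size_t dist = 0;
while (!input.empty && !isWhite(input.front))
{
    input.popFront();         // scan as far as the data demands
    ++dist;
}
input.seek(m);                // snap back to the pinned mark
// 'dist' is now the offset of the first whitespace; input is unchanged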
Dec 30 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
> Proposal

Having never written any parser I'm not really qualified to seriously give comments or review it, but it all looks very nice to me.

Speaking as just an end user of these things: whenever I use ranges over files or from, say, std.net.curl, the byLine/byChunk interface always feels terribly awkward to use, which often leads to me just giving up and loading the entire file/resource into an array. It's the boundaries that I stumble over. byLine never fits when I want to extract something multiline, but byChunk doesn't fit either because if what I'm searching for lands on a chunk boundary I'll miss it. Being able to just do a matchAll() on a file, std.net.curl, etc. without sacrificing performance and memory would be such a massive gain for usability.

Just a simple example of where I couldn't figure out how to utilize either byLine or byChunk without adding some clunky homegrown buffering solution. This is code that scrapes website titles from the pages of URLs in IRC messages.

---
auto scrapeTitles(M)(in M message)
{
    static url_re   = ctRegex!(`...`, "i");   // URL pattern (garbled in the archive)
    static title_re = ctRegex!(r"<title.*?>(.*?)<", "si");
    static ws_re    = ctRegex!(r"(\s{2,}|\n|\t)");
    auto utf8 = new EncodingSchemeUtf8;
    auto titles = matchAll(message, url_re)
        .map!(match => match.captures[0])
        .map!((url) => get(url).ifThrown([]))
        .map!(bytes => cast(string) utf8.sanitize(cast(immutable(ubyte)[])bytes))
        .map!(content => matchFirst(content, title_re))
        .filter!(captures => !captures.empty)
        .map!(capture => capture[1].idup) // dup so original is GCed
        .map!(title => title.entitiesToUni.replace(ws_re, " "))
        .array;
    return titles;
}
---

I really, really didn't want to use that std.net.curl.get(). It causes all sorts of problems if someone links to a huge resource. I just could not figure out how to utilize byLine (the title regex capture can be multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob Ovrum's help in IRC) was really a goal here as this is just a toy, so I went with get() for the time being, but it's always sad to sacrifice elegance for performance. I certainly didn't want to add some elaborate evergrowing buffer in the middle of this otherwise clean UFCS chain (and I'm not even sure how to incrementally regex search a growing buffer, or if that's even possible).

If I'm understanding your proposal correctly, that get(url) could be replaced with a hypothetical std.net.curl "buffer range" and that could be passed directly to matchFirst. It would only take up, at most, the size of the buffer in memory (which could grow if the capture grows to be larger than the buffer) and wouldn't read the unneeded portion of the resource at all. That would be such a huge win for everyone, so I'm very excited about this proposal. It addresses all of my current problems.

P.S. I love std.regex more and more every day. It made that entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d
Dec 30 2013
On Tuesday, 31 December 2013 at 01:51:23 UTC, Brad Anderson wrote:
> Speaking as just an end user of these things: whenever I use ranges over files or from, say, std.net.curl, the byLine/byChunk interface always feels terribly awkward to use, which often leads to me just giving up and loading the entire file/resource into an array.
> [...]

I find the same to be the case with APIs used to process files. It would be a real win to have a simple, efficient, and possibly non-blocking range interface over a file able to access a single byte or character at a time.

I remember a discussion on this forum that spoke positively about the async and await combination of keywords in .NET 4.5. If asynchronous and synchronous functionality could be combined with a flexible buffer in a UFCS call chain, I think that would hit a sweet spot for code composability and understandability.

Joseph
Dec 30 2013
On 31-Dec-2013 05:51, Brad Anderson wrote:
> It's the boundaries that I stumble over. byLine never fits when I want to extract something multiline, but byChunk doesn't fit either because if what I'm searching for lands on a chunk boundary I'll miss it.

Exactly, the situation is simply not good enough. I can assure you that on the side of parser writers it's even less appealing.

> Being able to just do a matchAll() on a file, std.net.curl, etc. without sacrificing performance and memory would be such a massive gain for usability.

... and performance ;)

> Just a simple example of where I couldn't figure out how to utilize either byLine or byChunk without adding some clunky homegrown buffering solution. This is code that scrapes website titles from the pages of URLs in IRC messages.
> [snip]
> I really, really didn't want to use that std.net.curl.get(). It causes all sorts of problems if someone links to a huge resource.

*Nods*

> I certainly didn't want to add some elaborate evergrowing buffer in the middle of this otherwise clean UFCS chain (and I'm not even sure how to incrementally regex search a growing buffer, or if that's even possible).

I thought to provide something like that: an incremental match that takes pieces of data slice by slice, with a not-yet-matched kind of object to mess with. But it was solving the wrong problem. And it shows that backtracking engines simply can't work like that - they would want to go back to the prior pieces.

> If I'm understanding your proposal correctly, that get(url) could be replaced with a hypothetical std.net.curl "buffer range" and that could be passed directly to matchFirst. It would only take up, at most, the size of the buffer in memory (which could grow if the capture grows to be larger than the buffer) and wouldn't read the unneeded portion of the resource at all. That would be such a huge win for everyone, so I'm very excited about this proposal. It addresses all of my current problems.

That's indeed what the proposal is all about. Glad it makes sense :)

> P.S. I love std.regex more and more every day. It made that entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d

Aye, replace with a functor rox!

-- 
Dmitry Olshansky
Dec 31 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
> [...]

Interesting idea. Seems to fill a need I have been facing with some parsing code.

Since I was unclear about how its functionality compares to ForwardRange, I took a look through std.algorithm. If typed versions of lookahead/lookbehind were added, it seems like ForwardRange could be replaced with this new range type, at least in certain places. For example, it seems to me that the following code (from the article "On Iteration" https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)

ForwardRange findAdjacent(ForwardRange r)
{
    if (!r.empty()) {
        auto s = r.save();
        s.popFront();
        for (; !s.empty(); r.popFront(), s.popFront()) {
            if (r.front() == s.front()) break;
        }
    }
    return r;
}

also could be written as

BufferedRange findAdjacent(BufferedRange r)
{
    for(; !r.empty; r.popFront()) {
        if (r.front == r.lookahead(1)[0]) break;
    }
    return r;
}

Perhaps the mark and/or slice functionality could be leveraged to do something similar but more performant. If anyone can see how the new range type compares to ForwardRange, that would be cool.

Joseph
Dec 30 2013
On 31-Dec-2013 05:53, Joseph Cassman wrote:
> Since I was unclear about how its functionality compares to ForwardRange, I took a look through std.algorithm. [...]

The trick is that the name "forward range" is actually misleading (I was surprised myself). A proper name would be ForkableRange. Indeed, look at this code:

auto s = r.save();
s.popFront();

s is a fork of r (an independent view that has a life of its own). The original algorithm even _reads_ easier if the correct term is used:

auto s = r.fork();
s.popFront();
...

Now it's clear that we fork a range and advance the forked copy one step ahead of the original.

> also could be written as
>
> BufferedRange findAdjacent(BufferedRange r)
> {
>     for(; !r.empty; r.popFront()) {
>         if (r.front == r.lookahead(1)[0]) break;
>     }
>     return r;
> }

Aye, just don't forget to test lookahead for empty instead of r.empty.

> Perhaps the mark and/or slice functionality could be leveraged to do something similar but more performant. If anyone can see how the new range type compares to ForwardRange, that would be cool.

I'm thinking there might be a way to bridge the new range type with ForwardRange, but not directly as defined at the moment. A possibility I'm considering is to separate out a Buffer object (not a range), and let it be shared among views - light-weight buffer-ranges. Then, if we imagine that these light-weight buffer-ranges work as marks do in the current proposal (i.e. they pin down the buffer), they could be forward ranges. I need to think on this, as the ability to integrate well with forwardish algorithms would be a great improvement.

-- 
Dmitry Olshansky
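[Folding that correction in, the buffered version might read as follows - a hedged sketch assuming lookahead(1) yields the element after front, or an empty slice when no more data can be buffered:]

BufferedRange findAdjacent(BufferedRange r)
{
    for (; !r.empty; r.popFront()) {
        auto ahead = r.lookahead(1);      // next element, if any
        if (ahead.length == 0) break;     // end of data: no adjacent pair found
        if (r.front == ahead[0]) break;   // two equal neighbours
    }
    return r;
}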
Dec 31 2013
On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky wrote:
> I'm thinking there might be a way to bridge the new range type with ForwardRange, but not directly as defined at the moment. [...] I need to think on this, as the ability to integrate well with forwardish algorithms would be a great improvement.

I think I now understand a bit better what you were thinking when you first posted:

> input-source <--> buffer range <--> parser/consumer
>
> Meaning that if we can mix and match parsers with buffer ranges, and buffer ranges with input sources, we have grown something powerful indeed.

Being able to wrap an already-in-use range object with the buffer interface, as you do in the sample code (https://github.com/blackwhale/datapicked/blob/master/dgrep.d), is good for composability. It also allows existing functionality in std.algorithm to be reused as-is.

I think the new range type could also be added directly to some new, or perhaps retrofitted into existing, code to add the new functionality without sacrificing performance. That way, the internal payload already used to get the data (say, by the input range) could be reused without having to allocate new memory to support the buffer API.

As one idea of using a buffer range from the start, a function template by(T) (where T is ubyte, char, wchar, or dchar) could be added to std.stdio. It would return a buffer range object providing more functionality than byChunk or byLine while adding access to the entire stream of data in a file in a contiguous and yet efficient manner. Seems to help with the issues faced in processing file data mentioned in previous comments in this thread.

Joseph
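[A hedged sketch of how such a by!T buffer range might be used. Everything here is hypothetical: the by!char name is the suggestion above, and feeding the result straight into std.regex is the thread's stated goal, not an existing API.]

import std.regex, std.stdio;

void main()
{
    // hypothetical: File.by!T returns a buffer range over the whole file
    auto text = File("server.log").by!char();
    foreach (m; matchAll(text, regex(`GET (\S+) HTTP`)))
        writeln(m[1]);
}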
Dec 31 2013
On 31-Dec-2013 22:46, Joseph Cassman wrote:
> On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky wrote:
>> A possibility I'm considering is to separate out a Buffer object (not a range), and let it be shared among views - light-weight buffer-ranges. Then, if we imagine that these light-weight buffer-ranges work as marks do in the current proposal (i.e. they pin down the buffer), they could be forward ranges.

I've created a fork where I've implemented just that. As a bonus I also tweaked the stream primitives, so it now works with pipes or whatever input stdin happens to be. Links stay the same:

Docs: http://blackwhale.github.io/datapicked/dpick.buffer.traits.html
Code: https://github.com/blackwhale/datapicked/tree/fwd-buffer-range/dpick/buffer

The description has simplified considerably and the primitive count is reduced.

1. A buffer range is a forward range. It has reference semantics.

2. A copy produced by _save_ is an independent view of the underlying buffer (or window).

3. No bytes that are seen in some existing view can be discarded. Thus each reference pins its position in the buffer.

4. The 3 new primitives are:

Range slice(BufferRange r);
    Returns a slice of the window between the current range position and r. It must be a random access range.

ptrdiff_t tell(BufferRange r);
    Returns the difference in window positions between the current range and r. Note that unlike slice(r).length this can be both positive and negative.

bool seek(ptrdiff_t ofs);
    Resets the buffer state to an offset from the current position. The return value indicates success of the operation. It may fail if there is not enough data, or (if ofs is negative) if that portion of data was already discarded.

5. Lookahead and lookbehind are extra primitives that were left intact for the moment. Where applicable a range may provide lookahead:

Range lookahead();         // as much as available in the window
Range lookahead(size_t n); // either n exactly or nothing if not

And lookbehind:

Range lookbehind();         // as much as available in the window
Range lookbehind(size_t n); // either n exactly or nothing if not

These should probably be tested as separate traits.

> Being able to wrap an already-in-use range object with the buffer interface, as you do in the sample code (https://github.com/blackwhale/datapicked/blob/master/dgrep.d), is good for composability. It also allows existing functionality in std.algorithm to be reused as-is.

It was more about wrapping an array, but it's got to integrate well with what we have. I could imagine a use case for buffering an input range. Then I think a buffer range of anything other than bytes would be in order.

> I think the new range type could also be added directly to some new, or perhaps retrofitted into existing, code to add the new functionality without sacrificing performance. That way, the internal payload already used to get the data (say, by the input range) could be reused without having to allocate new memory to support the buffer API.
> As one idea of using a buffer range from the start, a function template by(T) (where T is ubyte, char, wchar, or dchar) could be added to std.stdio.

IMHO the C run-time I/O has no use in D. The amount of work spent on special-casing the non-locking primitives of each C run-time, repeating legacy mistakes (like text mode, codepages and locales) and stumbling on portability problems (getc is a macro we can't have) would have been better spent elsewhere - designing our own I/O framework.

I've put together something pretty simple and fast for a buffer range directly on top of native I/O:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/stream.d

It needs a bit better error messages than naked enforce, and a few tweaks to memory management. It already runs circles around the existing std.stdio.

> It would return a buffer range object providing more functionality than byChunk or byLine while adding access to the entire stream of data in a file in a contiguous and yet efficient manner. Seems to help with the issues faced in processing file data mentioned in previous comments in this thread.

Drop 'efficient' if we talk interfacing with the C run-time. Otherwise, yes, absolutely.

-- 
Dmitry Olshansky
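[To make the reference semantics of rules (1)-(5) above concrete, a hedged backtracking sketch; isAlpha/isKeyword and the lexing details are placeholders:]

auto start = input.save;            // independent view: pins this position
while (!input.empty && isAlpha(input.front))
    input.popFront();
auto word = start.slice(input);     // random-access slice of the pinned window
if (!isKeyword(word))
    input = start;                  // backtrack: views are cheap copies, and
                                    // the pinned data is guaranteed to still be there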
Jan 04 2014
On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky wrote:
> IMHO the C run-time I/O has no use in D. The amount of work spent on special-casing the non-locking primitives of each C run-time, repeating legacy mistakes (like text mode, codepages and locales) and stumbling on portability problems (getc is a macro we can't have) would have been better spent elsewhere - designing our own I/O framework.

I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API:

https://github.com/jasonwhite/io/blob/master/src/io/file.d

It allows for more robust open-flags than the fopen-style flags (like "r+"). For seeking, I adapted it to use your concept of marks (which I quite like) instead of SEEK_CUR and friends. Please feel free to use this! (The Windows implementation hasn't been tested yet, so it probably doesn't work.)

BTW, I was also working on buffered streams, but couldn't figure out a good way to do it.
Jan 04 2014
On 05-Jan-2014 09:22, Jason White wrote:
> I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API:
> https://github.com/jasonwhite/io/blob/master/src/io/file.d

As a piece of advice, I'd suggest dropping the 'Data' part in writeData/readData. It's obvious and adds no extra value.

> It allows for more robust open-flags than the fopen-style flags (like "r+"). For seeking, I adapted it to use your concept of marks (which I quite like) instead of SEEK_CUR and friends. Please feel free to use this! (The Windows implementation hasn't been tested yet, so it probably doesn't work.)

Will poke around. I like this (I mean the composition):
https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

-- 
Dmitry Olshansky
Jan 05 2014
On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
> As a piece of advice, I'd suggest dropping the 'Data' part in writeData/readData. It's obvious and adds no extra value.

You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).

> Will poke around. I like this (I mean the composition):
> https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets). Currently, std.stdio has all three of those facets rolled into one.
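[A hedged sketch of what that kind of layering might look like at the use site; all of the wrapper names here are hypothetical, not the actual io package API:]

// each layer wraps the one below and can be mixed with any stream type
auto log = locked(textWriter(buffered(File("app.log"))));
log.write("processed ", 42, " records\n");  // text layer formats, the buffer
                                            // batches bytes, the lock guards
                                            // concurrent use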
Jan 05 2014
On 05-Jan-2014 15:08, Jason White wrote:
> You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).

In my view text implies something like:

void write(const(char)[]);
size_t read(char[]);

And binary would be:

void write(const(ubyte)[]);
size_t read(ubyte[]);

These should not clash.

> Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets).

An in-memory array IMHO had better not pretend to be a stream. This kind of wrapping goes in the wrong direction (losing capabilities). Instead, wrapping a stream and/or array as a buffer range proved to me to be more natural (extending capabilities).

> Currently, std.stdio has all three of those facets rolled into one.

Locking though is a province of shared and may need a bit more thought.

-- 
Dmitry Olshansky
Jan 05 2014
On Sunday, 5 January 2014 at 13:30:59 UTC, Dmitry Olshansky wrote:
> In my view text implies something like:
>
> void write(const(char)[]);
> size_t read(char[]);
>
> And binary would be:
>
> void write(const(ubyte)[]);
> size_t read(ubyte[]);
>
> These should not clash.

Those would do the same thing for either text or binary data. When I say text writing, I guess I mean the serialization of any type to text (like what std.stdio.write does):

void write(T)(T value);          // Text writing
void write(const(ubyte)[] buf);  // Binary writing

write([1, 2, 3]); // want to write "[1, 2, 3]"
                  // but writes "\x01\x02\x03"

This clashes. We need to be able to specify if we want to write/read a text representation or just the raw binary data. In the above case, the most specialized overload will be called.

> An in-memory array IMHO had better not pretend to be a stream. This kind of wrapping goes in the wrong direction (losing capabilities). Instead, wrapping a stream and/or array as a buffer range proved to me to be more natural (extending capabilities).

Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations? I don't see why it would conflict with a range interface. As I understand it, ranges act on a single element at a time while streams act on multiple elements at a time. For ArrayBuffer in datapicked, a stream-style read is just lookahead(n) and cur += n. What capabilities are lost?

If buffers/arrays provide a stream interface, then they can be used by code that doesn't directly need the buffering capabilities but would still benefit from them.

> Locking though is a province of shared and may need a bit more thought.

Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking, as thread locality makes no difference there.
Jan 05 2014
On 06-Jan-2014 09:41, Jason White wrote:
> Those would do the same thing for either text or binary data. When I say text writing, I guess I mean the serialization of any type to text (like what std.stdio.write does):
>
> void write(T)(T value);          // Text writing
> void write(const(ubyte)[] buf);  // Binary writing
>
> write([1, 2, 3]); // want to write "[1, 2, 3]"
>                   // but writes "\x01\x02\x03"
>
> This clashes.

Ok, now I see. In my eye though, serialization completely hides the raw stream write. So:

struct SomeStream{
    void write(const(ubyte)[] data...);
}

struct Serializer(Stream){
    void write(T)(T value); //calls stream.write inside of it
private:
    Stream stream;
}

> Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations?

I think it may be best not to. A buffer builds on top of an unbuffered stream. If there is a need to do large reads, it may as well use the naked stream and not worry about the extra copying happening in the buffer layer. I need to think on this, seeing as lookahead + seek could be labeled as a read even though it's not.

> I don't see why it would conflict with a range interface. As I understand it, ranges act on a single element at a time while streams act on multiple elements at a time. For ArrayBuffer in datapicked, a stream-style read is just lookahead(n) and cur += n. What capabilities are lost?

In short - lookahead is slicing, read would be copying. For me the prime capability of an array is slicing, which is dirt-cheap O(1). On the other hand, the stream interface is all about copying bytes into a user-provided array. In this setting it means that if you want to wrap an array as a stream, it must follow the generic stream interface. The latter cannot and should not think of slicing and the like. Then, while wrapping it in some adapter up the chain, it's no longer seen as an array (because the adapter is generic too and is made for streams). This is what I call capability loss.

> If buffers/arrays provide a stream interface, then they can be used by code that doesn't directly need the buffering capabilities but would still benefit from them.

See above - it would be better if the code was written for ranges, not streams. Then e.g. slicing a buffer range on top of an array works just as cheaply as it did for arrays. And zero copies are made (=performance).

> Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking, as thread locality makes no difference there.

Actually these objects do just fine, since the OS does the locking (or makes sure of something equivalent). If your stream is TLS there is no need for extra locking at all (no interleaving of I/O calls is possible), regardless of its kind. Shared instances would need locking, as 2 threads may request some operation, and since the OS locks only on a per-syscall basis, something cruel may happen in the code that deals with buffering etc.

-- 
Dmitry Olshansky
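[A minimal sketch of the capability-loss point above, under the interfaces discussed in the thread; the exact signatures are assumptions:]

// generic stream interface: always copies into the caller's array
ubyte[256] tmp;
size_t n = stream.read(tmp[]);     // O(n) copy even if the source is an array

// buffer range over the very same array: the "read" is just a slice
auto window = buf.lookahead(256);  // O(1) view into the underlying array
buf.seek(256);                     // consume it, still zero copies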
Jan 06 2014
On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
> Ok, now I see. In my eye though, serialization completely hides the raw stream write. So:
>
> struct SomeStream{
>     void write(const(ubyte)[] data...);
> }
>
> struct Serializer(Stream){
>     void write(T)(T value); //calls stream.write inside of it
> private:
>     Stream stream;
> }

I was thinking it should also have "alias stream this;", but maybe that's not the best thing to do for a serializer. I concede; I've done s/(read|write)Data/\1/g on https://github.com/jasonwhite/io/blob/master/src/io/file.d and it now works on Windows with useful exception messages.

> I think it may be best not to. A buffer builds on top of an unbuffered stream. If there is a need to do large reads, it may as well use the naked stream and not worry about the extra copying happening in the buffer layer. I need to think on this, seeing as lookahead + seek could be labeled as a read even though it's not.

Okay, I see. I'm just concerned about composability; I'll have to think more about how it's affected. (BTW, you can probably simplify lookahead/lookbehind with look(ptrdiff_t n), where the sign of n indicates ahead/behind.)

> Actually these objects do just fine, since the OS does the locking (or makes sure of something equivalent). If your stream is TLS there is no need for extra locking at all (no interleaving of I/O calls is possible), regardless of its kind. Shared instances would need locking, as 2 threads may request some operation, and since the OS locks only on a per-syscall basis, something cruel may happen in the code that deals with buffering etc.

Oh yeah, you're right.

As a side note: I would love to get a kick-ass I/O stream package into Phobos. It could replace std.stream as well as std.stdio. Stuff like serializers and lexers would be more robust and easier to write.
Jan 06 2014
On 07-Jan-2014 11:59, Jason White wrote:
> I was thinking it should also have "alias stream this;", but maybe that's not the best thing to do for a serializer. I concede; I've done s/(read|write)Data/\1/g on https://github.com/jasonwhite/io/blob/master/src/io/file.d and it now works on Windows with useful exception messages.

Cool, got to steal sysErrorString too! :)

> As a side note: I would love to get a kick-ass I/O stream package into Phobos. It could replace std.stream as well as std.stdio.

Then our goals are aligned. Be sure to take a peek at (if you haven't already):
https://github.com/schveiguy/phobos/blob/new-io/std/io.d

I have my share of criticisms for it, but it's a nice piece of work that addresses many questions of I/O I've yet to consider.

> Stuff like serializers and lexers would be more robust and easier to write.

Indeed, and even beyond that.

-- 
Dmitry Olshansky
Jan 07 2014
On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> Then our goals are aligned. Be sure to take a peek at (if you haven't already):
> https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Yes, I'm gearing up to revisit that after a long D hiatus, and I came across this thread.

At this point, I really really like the ideas that you have in this. It solves an issue that I struggled with, and my solution was quite clunky.

I am thinking of this layout for streams/buffers:

1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written).
2. Buffer like you have, based on a struct, with specific primitives. Its job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.
3. Filter that has access to transform the buffer data/copy it.
4. Ranges that use the buffer/filter to process/present the data.

The problem I struggled with is the presentation of UTF data of any format as char[], wchar[] or dchar[]. 2 things need to happen. First, the data needs to be post-processed to perform any necessary byte swapping. Second, the data needs to be transcoded into the correct width. In this way, you can process UTF data of any type (I even have code to detect the encoding and automatically process it), and then use it in a way that makes sense for your code.

My solution was to paste a "processing" delegate into the class hierarchy of buffered streams that allowed one read/write access to the buffer. But it's clunky, and difficult to deal with in a generalized fashion.

But the idea of using a buffer in between the stream and the range, and possibly bolting together multiple transformations in a clean way, makes this problem easy to solve, and I think it is closer to the vision Andrei/Walter have.

I also like the idea of "pinning" the data instead of my mechanism of using a delegate (which was similar but not as general). It also has better opportunities for optimization.

Other ideas that came to me that buffer filters could represent:

* compression/decompression
* encryption

I am going to study your code some more and see how I can update my code to use it. I still need to maintain the std.stdio.File interface, and Walter is insistent that the initial state of stdout/err/in must be synchronous with C (which kind of sucks, but I have plans on how to make it not so bad).

There is still a lot of work left to do, but I think one of the hard parts is done, namely dealing with UTF transcoding. The remaining sticky part is dealing with shared. But with structs, this should make things much easier.

One question: is there a reason a buffer type has to be a range at all? I can see where it's easy to make it a range, but I don't see higher-level code using the range primitives when dealing with chunks of a stream.

-Steve
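[As a sketch of what a layer-3 filter from the list above might do, here is a hedged byte-swapping pass for UTF-16 data of foreign endianness; the function name and how it plugs into a filter chain are invented, only the swap itself is shown:]

import std.bitmanip : swapEndian;

// Swap a pinned window of UTF-16 data to native endianness in place,
// so downstream ranges can treat it as wchar[].
void toNativeEndian(ubyte[] window)
{
    auto units = cast(ushort[]) window; // length must be even; a real filter
                                        // would hold back a trailing odd byte
    foreach (ref u; units)
        u = swapEndian(u);
}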
Jan 16 2014
16-Jan-2014 19:55, Steven Schveighoffer пишет:

On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

Then our goals are aligned. Be sure to take a peek at (if you haven't already):
https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Yes, I'm gearing up to revisit that after a long D hiatus, and I came across this thread. At this point, I really really like the ideas that you have in this. It solves an issue that I struggled with, and my solution was quite clunky.

I am thinking of this layout for streams/buffers:

1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written)
2. Buffer like you have, based on a struct, with specific primitives. Its job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.

The only interesting thing I'd add here is that some buffers may work without an underlying stream. The best examples are arrays and MM-files.

3. Filter that has access to transform the buffer data/copy it.
4. Ranges that use the buffer/filter to process/present the data.

Yes, yes and yes. I find it surprisingly good to see that our visions seem to match. I was half-expecting you'd come along and destroy it all ;)

The problem I struggled with is the presentation of UTF data of any format as char[], wchar[] or dchar[]. [snip] But the idea of using a buffer in between the stream and the range, and possibly bolting together multiple transformations in a clean way, makes this problem easy to solve, and I think it is closer to the vision Andrei/Walter have.

In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte and itself present a buffer interface (but of wchar). My own stuff currently deals only in ubyte, and the limited decoding is represented by a "decode" function that takes a buffer of ubyte and decodes UTF-8. I think typed buffers/filters are the way to go.

I am going to study your code some more and see how I can update my code to use it. I still need to maintain the std.stdio.File interface, and Walter is insistent that the initial state of stdout/err/in must be synchronous with C (which kind of sucks, but I have plans on how to make it not be so bad).

I'm seriously not seeing how interfacing with the C runtime could be fast enough.

There is still a lot of work left to do, but I think one of the hard parts is done, namely dealing with UTF transcoding. The remaining sticky part is dealing with shared. But with structs, this should make things much easier.

I'm thinking a generic locking wrapper is possible along the lines of:

shared Locked!(GenericBuffer!char) stdin;  //usage

struct Locked(T){
shared:
private:
    T _this;
    Mutex mut;
public:
    //forwarded methods
}

The wrapper will introduce a lock, and implement every method of the wrapped struct roughly like this:

mut.lock();
scope(exit) mut.unlock();
(cast(T*)&_this).method(args);

I'm sure it could be pretty automatic.

One question, is there a reason a buffer type has to be a range at all? I can see where it's easy to make it a range, but I don't see higher-level code using the range primitives when dealing with chunks of a stream.

Lexers/parsers enjoy it - i.e. they work pretty much as ranges, especially when skipping spaces and the like. As I said, the main reason was: if it fits as a range, why not? After all, it makes one-pass processing of data trivial as it rides on top of foreach:

foreach(octet; mybuffer)
{
    if(interesting(octet))
        do_cool_stuff();
}

Things like countUntil make perfect sense when called on a buffer (e.g. to find a matching sentinel).

-- Dmitry Olshansky
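A sketch of how the forwarding could be made "pretty automatic" with opDispatch. This is assumed code, not the thread's implementation; construction of the Mutex is omitted, and the casts away from shared are exactly the unsafe spot discussed in the replies below:

    // Sketch only: forward every method call under a lock via opDispatch.
    import core.sync.mutex : Mutex;

    struct Locked(T)
    {
        private shared T _this;
        private Mutex mut;   // assumed to be created before the value is shared

        auto opDispatch(string method, Args...)(Args args) shared
        {
            (cast(Mutex) mut).lock();
            scope(exit) (cast(Mutex) mut).unlock();
            // note &_this: we need a pointer to the wrapped value
            return mixin("(cast(T*) &_this)." ~ method ~ "(args)");
        }
    }

    // usage (hypothetical): shared Locked!(GenericBuffer!char) stdin;
    //                       stdin.refill();  // locks, forwards, unlocks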
Jan 16 2014
On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

16-Jan-2014 19:55, Steven Schveighoffer пишет:

[snip]

The only interesting thing I'd add here is that some buffers may work without an underlying stream. The best examples are arrays and MM-files.

Yes, but I would stress that for convenience, the buffer should forward some of the stream primitives (such as seeking) in cases where seeking is possible, at least in the case of a buffer that wraps a stream.

That actually is another point that would have sucked with my class-based solution -- allocating a class to use an array as backing.

Yes, yes and yes. I find it surprisingly good to see that our visions seem to match. I was half-expecting you'd come along and destroy it all ;)

:) I've been preaching for a while that ranges don't make good streams, and that streams should be classes, but I hadn't considered splitting out the buffer. I think it's the right balance.

In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte and itself present a buffer interface (but of wchar).

My intended interface allows you to specify the desired type per read. Think of the case of stdin, where the clients will be varied and written by many different people, and its interface is decided by Phobos.

But a transcoding buffer may make some optimizations. For instance, reading a UTF32 file as utf-8 can re-use the same buffer, as no code unit uses more than 4 code points (did I get that right?).

I'm seriously not seeing how interfacing with the C runtime could be fast enough.

It's not. But an important stipulation in order for this to all be accepted is that it doesn't break existing code that expects things like printf and writef to interleave properly.

However, I think we can have an opt-in scheme, and there are certain cases where we can proactively switch to a D-buffer scheme. For example, if you get a ByLine range, it expects to exhaust the data from the stream, and may not properly work with C printf.

The idea is that stdio.File can switch at runtime from FILE * to D streams as needed or directed.

I'm thinking a generic locking wrapper is possible along the lines of: [snip] I'm sure it could be pretty automatic.

This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically, because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for.

I've been expecting to have to do something like this, but not looking forward to it :(

Lexers/parsers enjoy it - i.e. they work pretty much as ranges, especially when skipping spaces and the like. [snip] Things like countUntil make perfect sense when called on a buffer (e.g. to find a matching sentinel).

I think I misstated my question. What I am curious about is why a type must be a forward range to pass isBuffer. Of course, if it makes sense for a buffer type to also be a range, it can certainly implement that interface as well. But I don't know that I would need those primitives in all cases. I don't have any specific use case for having a buffer that doesn't implement a range interface, but I am hesitant to necessarily couple the buffer interface to ranges just because we can't think of a counter-case :)

-Steve
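The per-code-point transcoding step itself is small. A sketch using the real std.utf.encode (the toUtf8 helper is invented for illustration); note Dmitry's correction in the next message: the bound is 4 UTF-8 code units per code point, hence the fixed char[4] scratch:

    // Sketch: re-encoding UTF-32 input as UTF-8. toUtf8 is a made-up helper;
    // std.utf.encode is the real Phobos primitive doing the per-code-point work.
    import std.utf : encode;

    char[] toUtf8(const(dchar)[] src)
    {
        char[] result;
        char[4] tmp;                    // one code point needs at most 4 code units
        foreach (dchar c; src)
        {
            auto n = encode(tmp, c);    // number of UTF-8 units written
            result ~= tmp[0 .. n];
        }
        return result;
    }

    unittest
    {
        assert(toUtf8("héllo"d) == "héllo");
    }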
Jan 16 2014
17-Jan-2014 00:00, Steven Schveighoffer пишет:

On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

[snip]

But a transcoding buffer may make some optimizations. For instance, reading a UTF32 file as utf-8 can re-use the same buffer, as no code unit uses more than 4 code points (did I get that right?).

The other way around :) 4 code units - 1 code point.

It's not. But an important stipulation in order for this to all be accepted is that it doesn't break existing code that expects things like printf and writef to interleave properly. [snip] The idea is that stdio.File can switch at runtime from FILE * to D streams as needed or directed.

This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically, because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for. I've been expecting to have to do something like this, but not looking forward to it :(

The requirement may be that it's pure, or should I say "well-contained". In other words, as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...

I think I misstated my question. What I am curious about is why a type must be a forward range to pass isBuffer. Of course, if it makes sense for a buffer type to also be a range, it can certainly implement that interface as well. But I don't know that I would need those primitives in all cases. I don't have any specific use case for having a buffer that doesn't implement a range interface, but I am hesitant to necessarily couple the buffer interface to ranges just because we can't think of a counter-case :)

Convenient to work with does ring good to me. I simply see no need to reinvent std.algorithm on buffers, especially the ones that just scan left-to-right. An example would be calculating a checksum of a stream (say the data comes from a pipe or socket, i.e. buffered). It's a trivial application of std.algorithm.reduce and there's no need to reinvent that wheel IMHO.

-- Dmitry Olshansky
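Dmitry's checksum example spelled out, assuming only that the buffer is an input range of ubyte; the additive checksum is a trivial stand-in for a real one:

    // Any input range of ubyte works, buffered or not; reduce needs no
    // buffer-specific code. The additive checksum is just for illustration.
    import std.algorithm : reduce;

    uint simpleChecksum(Range)(Range data)
    {
        return reduce!((sum, b) => sum + b)(0u, data);
    }

    unittest
    {
        ubyte[] chunk = [1, 2, 3, 4];
        assert(simpleChecksum(chunk) == 10);
    }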
Jan 16 2014
On Thu, 16 Jan 2014 17:28:31 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:The other way around :) 4 code units - 1 code point.I knew that was going to happen :)I think you meant *not* 100% :) But yeah, regardless of how generalized it is, this is likely the best path. I think this is the tack that std.stdio.File takes anyway, except it's using FILE *'s locking mechanism.This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for.The requirement may be that it's pure or should I say "well-contained". In other words as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is a always a good thing to do...Convenient to work with does ring good to me. I simply see no need to reinvent std.algorithm on buffers especially the ones that just scan left-to-right. Example would be calculating a checksum of a stream (say data comes from a pipe or socket i.e. buffered). It's a trivial application of std.algorithm.reduce and there no need to reinvent that wheel IMHO.I again think I am being misunderstood. Example might be appropriate: struct Buffer {...} // implements BOTH forward range and Buffer primitives struct OtherBuffer {...} // only implements Buffer primitives. If isBuffer is modified to not require isForwardRange, then both Buffer and OtherBuffer will work as buffers, but only Buffer works as a range. But as you have it now, isBuffer!OtherBuffer is false. Is this necessary? So we can implement buffers that are both ranges and buffers, and will work with std.algorithm without modification (and that's fine and expected by me), but do we need to *require* that? Are we over-specifying? Is there a possibility that someone can invent a buffer that satisfies the primitives of say a line-by-line reader, but cannot possibly be a forward range? -Steve
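The distinction being drawn could be expressed as two separate traits. A sketch using the lookahead/seek primitive names from Dmitry's docs; the trait bodies are illustrative, not the actual isBuffer:

    // Illustrative only: check buffer primitives alone, or primitives plus
    // the forward-range interface, as two separate traits.
    import std.range : isForwardRange;

    template hasBufferPrimitives(T)
    {
        enum hasBufferPrimitives = is(typeof({
            T t = T.init;
            auto s = t.lookahead(1);        // peek without consuming
            t.seek(1);                      // advance the window
        }));
    }

    template isBufferRange(T)
    {
        enum isBufferRange = hasBufferPrimitives!T && isForwardRange!T;
    }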
Jan 16 2014
17-Jan-2014 02:41, Steven Schveighoffer пишет:

On Thu, 16 Jan 2014 17:28:31 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

I knew that was going to happen :)

Aye. BTW I haven't thought of writing into the buffer, but it works exactly the same. It could even be read/write - "discarded" data is written to the underlying stream, freshly loaded data is read from the stream. For an input-only stream writes are nops; for an output-only one, reads are nops. With pinning it makes for cool multi-pass algorithms that actually output stuff into the file.

I think you meant *not* 100% :) But yeah, regardless of how generalized it is, this is likely the best path. I think this is the tack that std.stdio.File takes anyway, except it's using FILE *'s locking mechanism.

Aye.

I again think I am being misunderstood. An example might be appropriate: [snip] But as you have it now, isBuffer!OtherBuffer is false. Is this necessary?

Looking at the post by Walter I see I need to clarify things. And if you browse the thread you'd see my understanding also changed with time - I started with no stinkin' forward ranges for buffered I/O, only to later eat my words and make them happen.

First, let's call "buffer" a thing that packs an array and a few extras to support pinning, refilling of data from the underlying stream, and extending. It exposes the current "window" of the underlying stream.

Now we have pins in that buffer that move along. A pin not only enforces that all data "to the left" stays accessible, but is also a "view" of the buffer. It's even conceptually a range with extra primitives outlined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I should stick to naming it BufferRange. The "real buffer" is an internal construct and BufferRanges share ownership of it. This is exactly what I ended up with in my second branch.

"real" buffer:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/buffer.d#L417
vs buffer range over it:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/buffer.d#L152

Naming issues are apparently even worse than Walter implies. I think I should call it BufferRange from now on. And bring my docs in line.

It may make sense to provide the naked Buffer itself as a construct (it has a simpler interface) and have a generic BufferRange wrapper. I'm just not seeing user code using the former - too error prone and unwieldy.

So we can implement buffers that are both ranges and buffers, and they will work with std.algorithm without modification (and that's fine and expected by me), but do we need to *require* that? Are we over-specifying?

Random-access ranges had to require .front/.back; maybe that was over-specification too? Some stuff is easy to index but not to "drop an item off at either end". But now it really doesn't matter - if there are such things, they are not random-access ranges.

Is there a possibility that someone can invent a buffer that satisfies the primitives of, say, a line-by-line reader, but cannot possibly be a forward range?

I can hardly see that:

front --> lookahead(1)[0]
empty --> lookahead(1).length == 0
popFront --> seek(1) or enforce(seek(1))
save --> well, there's got to be a way to pin the data in the buffer?

And they surely could be better implemented inside of a specific buffer range.

-- Dmitry Olshansky
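That mapping, written out as a hypothetical shim (a real buffer range would implement the range interface natively, and save would pin the position rather than copy a pointer):

    // A forward-range facade over any type exposing lookahead/seek;
    // names follow the mapping above, the struct itself is invented.
    struct RangeShim(Buffer)
    {
        Buffer* buf;

        @property ubyte front() { return buf.lookahead(1)[0]; }
        @property bool empty()  { return buf.lookahead(1).length == 0; }
        void popFront()         { buf.seek(1); }
        @property RangeShim save() { return this; } // real impl: pin the position
    }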
Jan 17 2014
On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:BTW I haven't thought of writing into the buffer, but it works exactly the same. It could be even read/write -"discarded" data is written to underlying stream, freshly loaded is read from stream. Now in case of input stream only writes are nop, for output-only reads are nops. With pinning it makes for cool multi-pass algorithms that actually output stuff into the file.In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.In a sense, I agree. There aren't many things to do directly with a buffer (i.e. there will likely be few filters), and a buffer range provides enough primitives to make front/popFront/empty trivial.But as you have it now, isBuffer!OtherBuffer is false. Is this necessary?I think I should call it BufferRange from now on. And bring my docs in line. It may make sense to provide naked Buffer itself as a construct (it has simpler interface) and have generic BufferRange wrapper. I'm just not seeing user code using the former - too error prone and unwieldy.I'll have to take a closer look at your code to have a reasonable response. But I think this looks fine so far. -SteveSo we can implement buffers that are both ranges and buffers, and will work with std.algorithm without modification (and that's fine and expected by me), but do we need to *require* that? Are we over-specifying?Random-Access range had to require .front/.back maybe it was over-specification too? Some stuff is easy to index but not "drop off an item at either end". But now it really doesn't matter - if there are such things, they are not random-access ranges.Is there a possibility that someone can invent a buffer that satisfies the primitives of say a line-by-line reader, but cannot possibly be a forward range?I hardly can see that: front --> lookahead(1)[0] empty --> lookahead(1).length != 0 popFront --> seek(1) or enforce(seek(1)) save -> well there got to be a way to pin the data in the buffer? And they surely could be better implemented inside of a specific buffer range.
Jan 17 2014
17-Jan-2014 18:03, Steven Schveighoffer пишет:

On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

BTW I haven't thought of writing into the buffer, but it works exactly the same. [snip] With pinning it makes for cool multi-pass algorithms that actually output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.

Agreed, a read/write buffer is a bad idea. As for a write-only buffer implemented in the same style as the read-only one, well, I need to try it first.

-- Dmitry Olshansky
Jan 17 2014
On Fri, 17 Jan 2014 09:12:56 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

17-Jan-2014 18:03, Steven Schveighoffer пишет:

[snip]

Agreed, a read/write buffer is a bad idea. As for a write-only buffer implemented in the same style as the read-only one, well, I need to try it first.

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I think in my code, what I do is if someone wants to read, and is currently writing (on a read/write stream), I flush the buffer, then switch it to be in read mode, and then seek to the appropriate place, to fill the buffer.

When switching from read to write, it's much easier, since the buffer starts out empty.

So I think a buffer type that can be used for read *OR* write is good, just not simultaneously.

However, simultaneous read/write is probably something that could be useful for a niche use case (e.g. some transformative process), as long as you don't mind having 2 file descriptors open :)

-Steve
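In outline, the mode switch Steve describes might look like this; Stream and its read/write primitives are assumed, and error handling is omitted:

    // Sketch of a single buffer reused for reads and writes, never both at once.
    enum Mode { reading, writing }

    struct RWBuffer(Stream)
    {
        Stream stream;
        ubyte[] buf;
        size_t used;      // valid bytes (write mode) or unread bytes (read mode)
        Mode mode;

        void switchToRead()
        {
            if (mode == Mode.writing)
            {
                stream.write(buf[0 .. used]);   // flush pending output first
                used = 0;
                mode = Mode.reading;            // then seek/refill as needed
            }
        }

        void switchToWrite()
        {
            used = 0;                           // the easy direction: start empty
            mode = Mode.writing;
        }
    }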
Jan 17 2014
17-Jan-2014 18:52, Steven Schveighoffer пишет:

[snip]

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I'm doubting that actually. Just use the freaking MM-file directly, you'd be better off efficiency-wise. And there would be no need to switch back and forth. Modern OSes use the paging subsystem for files anyway (unless they're opened with O_DIRECT or similar).

I think in my code, what I do is if someone wants to read, and is currently writing (on a read/write stream), I flush the buffer, then switch it to be in read mode, and then seek to the appropriate place, to fill the buffer. [snip] So I think a buffer type that can be used for read *OR* write is good, just not simultaneously.

See above. I don't think there is a need to go through these paces with flushing stuff. And when it comes to pipes and sockets, just having 2 buffers fits the bill better.

However, simultaneous read/write is probably something that could be useful for a niche use case (e.g. some transformative process), as long as you don't mind having 2 file descriptors open :)

-- Dmitry Olshansky
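For reference, the direct MM-file route using the real std.mmfile module (the bumpFirstByte helper is invented): one mapping is readable and writable in place, with the OS pager doing the buffering:

    // Read and write through one memory-mapped view; no flush/seek dance.
    import std.mmfile : MmFile;

    void bumpFirstByte(string path)
    {
        auto mmf = new MmFile(path, MmFile.Mode.readWrite, 0, null);
        auto data = cast(ubyte[]) mmf[];   // the whole file as a slice
        if (data.length)
            ++data[0];                     // modified pages are written back
    }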
Jan 17 2014
On Fri, 17 Jan 2014 10:40:02 -0500, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

17-Jan-2014 18:52, Steven Schveighoffer пишет:

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I'm doubting that actually. Just use the freaking MM-file directly, you'd be better off efficiency-wise. And there would be no need to switch back and forth. Modern OSes use the paging subsystem for files anyway (unless they're opened with O_DIRECT or similar).

Remember we have to support whatever stdio.File does, and I think it does support this.

-Steve
Jan 17 2014
On Thursday, 16 January 2014 at 15:55:07 UTC, Steven Schveighoffer wrote:I am thinking of this layout for streams/buffers: 1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written) 2. Buffer like you have, based on a struct, with specific primitives. It's job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.If you have a struct-based buffer, how would you enlarge the buffer? Won't it suffer from AA syndrome?
Jan 17 2014
17-Jan-2014 13:19, Kagamin пишет:

On Thursday, 16 January 2014 at 15:55:07 UTC, Steven Schveighoffer wrote:

I am thinking of this layout for streams/buffers:

1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written)
2. Buffer like you have, based on a struct, with specific primitives. Its job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer?

What's the problem? I don't see how struct vs. class changes anything there; it's a member field that is an array, which we surely can expand.

Won't it suffer from AA syndrome?

A buffer is created with factory functions only. It's not like an AA that grows from an empty/null state. An empty buffer (.init) doesn't grow; it's simply empty.

-- Dmitry Olshansky
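A sketch of that difference (names invented): the factory always hands out a buffer with storage, so unlike an AA there is no lazily-created null state:

    // Growth just reassigns the member array; no hidden lazy initialization.
    struct GrowableBuffer
    {
        private ubyte[] data;
        private size_t filled;

        void ensure(size_t more)
        {
            if (filled + more > data.length)
                data.length = (filled + more) * 2;  // may reallocate the array
        }
    }

    GrowableBuffer makeBuffer(size_t initial = 4096)
    {
        return GrowableBuffer(new ubyte[initial]);  // never starts out null
    }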
Jan 17 2014
On Friday, 17 January 2014 at 09:33:41 UTC, Dmitry Olshansky wrote:What's the problem? I don't see how struct/class can change there anything, it's a member field that is an array that we surely can expand.Ah, I thought one can copy the buffer, now I see you pass it by ref.
Jan 17 2014
On Friday, 17 January 2014 at 09:19:12 UTC, Kagamin wrote:On Thursday, 16 January 2014 at 15:55:07 UTC, Steven Schveighoffer wrote:It is the lazily initialized nature of AAs that causes the problem. Structs can, but don't have to, replicate the behaviour.I am thinking of this layout for streams/buffers: 1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written) 2. Buffer like you have, based on a struct, with specific primitives. It's job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.If you have a struct-based buffer, how would you enlarge the buffer? Won't it suffer from AA syndrome?
Jan 17 2014
I've been rewriting a bit of the lexer in DScanner.

https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d

(Ignore the "invariant" block. I've been trying to hunt down some unrelated memory corruption issue.)

One thing that I've found to be very useful is the ability to increment column or index counters inside of the lexer range's popFront method.

I think that I'd end up using my own range type for arrays, but construct a lexer range on top of your buffer range for anything else.
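In miniature, such a counter-maintaining range might look like this (a made-up wrapper, not DScanner's actual type):

    // Hypothetical position-tracking wrapper over any character range.
    import std.range;

    struct TrackingRange(R)
    {
        R src;
        size_t line = 1, column = 1, index = 0;

        @property auto front() { return src.front; }
        @property bool empty() { return src.empty; }

        void popFront()
        {
            // update position info as a side effect of advancing
            if (src.front == '\n') { ++line; column = 1; }
            else ++column;
            ++index;
            src.popFront();
        }
    }

    unittest
    {
        auto r = TrackingRange!string("a\nb");
        r.popFront(); r.popFront();
        assert(r.line == 2 && r.column == 1 && r.index == 2);
    }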
Jan 04 2014
04-Jan-2014 13:39, Brian Schott пишет:

I've been rewriting a bit of the lexer in DScanner.

https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d

(Ignore the "invariant" block. I've been trying to hunt down some unrelated memory corruption issue.)

startsWith & peek could be implemented with the proposed lookahead, so overall the buffer range looks like a fit for your use case.

One thing that I've found to be very useful is the ability to increment column or index counters inside of the lexer range's popFront method.

I think that I'd end up using my own range type for arrays, but construct a lexer range on top of your buffer range for anything else.

I think it should be possible to wrap a buffer range, add the related accounting, and simplify the interface for your own use.

As an experiment I have a greatly simplified buffer range - it builds on top of the forward range concept, making it instantly compatible with most of std.algorithm. The only problem at the moment is slightly worse performance with DMD (who cares) and that it segfaults with LDC (and this is a problem). It would be nice to test with LDC first, but I think I'll publish it anyway; stay tuned.

-- Dmitry Olshansky
Jan 04 2014
My experimental lexer generator now uses the buffer range. https://github.com/Hackerpilot/Dscanner/tree/NewLexer/stdx The problem that I have with it right now is that range.lookbehind(1).length != range.lookahead(1).length. This was confusing.
Jan 09 2014
09-Jan-2014 13:23, Brian Schott пишет:

My experimental lexer generator now uses the buffer range.
https://github.com/Hackerpilot/Dscanner/tree/NewLexer/stdx

Cool! A minor note:
https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/d/lexer.d#L487

lookahead(n) should always give a slice of length n, or an empty one, so you may as well test for != 0. In general you should avoid marking too often, as it carries a bit of overhead. I'm currently in favor of my 2nd design, where marking is replaced by .save returning an independent view of the buffer, making Buffer a normal forward range that is cheap to copy.

https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/

Sadly it segfaults with LDC so I can't quite assess its performance :(

The problem that I have with it right now is that range.lookbehind(1).length != range.lookahead(1).length. This was confusing.

That indeed may look confusing at first, but keep in mind that range.front is in fact a lookahead of 1. Thus all algorithms that can work with a lookahead of 1 - LL(1), LALR(1), etc. - would easily work with any forward/input range (though any practical parser needs to slice or copy parts of the input aside).

-- Dmitry Olshansky
Jan 12 2014
On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:The BufferRange concept itself (for now called simply Buffer) is defined here: http://blackwhale.github.io/datapicked/dpick.buffer.traits.htmlI am confused because there are 4 terms conflated here: BufferRange Buffer InputStream Stream
Jan 16 2014
17-Jan-2014 00:18, Walter Bright пишет:

On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:

The BufferRange concept itself (for now called simply Buffer) is defined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:

BufferRange
Buffer

One and the same, for the moment. You even quoted it: "BufferRange .. (for now called simply Buffer)..."

InputStream
Stream

One and the same, as I dealt with the input side of the problem only.

-- Dmitry Olshansky
Jan 16 2014
On 1/16/2014 2:30 PM, Dmitry Olshansky wrote:17-Jan-2014 00:18, Walter Bright пишет:I know. But using two names for the same thing makes for confusing documentation. There also appears to be no rationale for "for the moment".On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:One and the same fr the moment. You even quoted it: >BufferRange .. (for now called simply Buffer)...The BufferRange concept itself (for now called simply Buffer) is defined here: http://blackwhale.github.io/datapicked/dpick.buffer.traits.htmlI am confused because there are 4 terms conflated here: BufferRange BufferSame problem with multiple names for the same thing, also there's no definition of either name.InputStream StreamOne and the same as I dealt with the input side of the problem only.
Jan 16 2014