digitalmars.D.announce - iopipe v0.0.4 - RingBuffers!
- Steven Schveighoffer (39/39) May 10 2018 OK, so at dconf I spoke with a few very smart guys about how I can use
- Dmitry Olshansky (19/48) May 10 2018 I’d start with something clinically synthetic.
- Steven Schveighoffer (39/79) May 11 2018 Hm.. this wouldn't work, because the idea is to keep some of the buffer
- Ali Çehreli (16/19) May 11 2018 There is the LMAX Disruptor, which was open sourced a few years ago along...
- Uknown (15/66) May 11 2018 I'm sure someone will find some good show off program.
- Steven Schveighoffer (38/64) May 11 2018 I would start here: https://en.wikipedia.org/wiki/Circular_buffer
- Dmitry Olshansky (24/97) May 11 2018 Then you cannot test it in such a way.
- Uknown (4/12) May 11 2018 You can always use GNU grep. The one that comes with macOS is
- Kagamin (5/10) May 11 2018 Depends on OS and hardware. I would expect mmap implementation to
- Dmitry Olshansky (4/14) May 11 2018 It doesn’t. Instead it has a buffer mmaped twice side by side.
- Steven Schveighoffer (12/20) May 11 2018 As Dmitry hinted at, there actually is no file involved. I'm mapping
- Steven Schveighoffer (6/12) May 11 2018 Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU
- Steven Schveighoffer (8/23) May 11 2018 More testing reveals that as I increase the context lines to print,
- Joakim (4/23) May 11 2018 What stops you from downloading a linux release from here?
- Steven Schveighoffer (11/35) May 12 2018 So I did that, it's not much faster, a few milliseconds. Still about
- Dmitry Olshansky (10/30) May 12 2018 I could offer a few tricks to fix that w/o getting too dirty. GNU
- Joakim (3/15) May 12 2018 If you're talking about writing a grep prototype in D, that's a
- Dmitry Olshansky (3/22) May 12 2018 For shaming others to beat us using some other language. Making
- Nick Sabalausky (Abscissa) (7/12) May 12 2018 I wonder if there are realistic real-world cases where you could beat it
- Jonathan M Davis (19/22) May 11 2018 Curiously, the grep on FreeBSD seems to be GNU's grep with some addition...
- Patrick Schluter (5/6) May 12 2018 Oh, there had been an epic forum thread about the use of GNU grep
- Jon Degenhardt (11/16) May 11 2018 Yeah, the MacOS default versions of the Unix text processing
- Arun Chandrasekaran (12/56) May 11 2018 Since mmap is involved, it would be interesting to see if this
- Patrick Schluter (8/12) May 12 2018 They can be problematic with some CPU's and OS's. For modern
- Steven Schveighoffer (7/19) May 12 2018 Thanks for the tip. The nice thing about iopipe is that the buffer type
- bioinfornatics (31/75) May 14 2018 Hi Steve,
- Steven Schveighoffer (17/80) May 14 2018 Yeah, I have been working on and off with Vang Le (biocyberman) on using...
- biocyberman (11/50) May 15 2018 Hi Steve
- Claude (13/18) May 14 2018 I can think of a good use-case:
OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.

However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend. If anyone has any good use cases for it, I'm open to suggestions. Something that is going to potentially increase performance is an application that needs to keep the buffer mostly full when extending (i.e. something like 75% full or more).

The buffer is selected by using `rbufd` instead of just `bufd`. Everything should be a drop-in replacement except for that.

Note: I have ONLY tested on macOS, so if you find bugs in other OSes let me know. This is still a POSIX-only library for now, but more on that later...

As a test for ring buffers, I implemented a simple "grep-like" search program that doesn't use regex, but Phobos' canFind to look for lines that match. It also prints some lines of context, configurable on the command line. I thought the lines of context would show better performance with the RingBuffer than with the standard buffer, since the program has to keep a bunch of lines in the buffer. But alas, it's roughly the same, even with a large number of lines of context (like 200).

However, this example *does* show the power of iopipe -- it handles all flavors of Unicode with one template function, and is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).

Enjoy!

https://github.com/schveiguy/iopipe
https://code.dlang.org/packages/iopipe
http://schveiguy.github.io/iopipe/

-Steve
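The context-printing logic described above (print each matching line with N lines of context before and after, without printing any line twice when context windows overlap) can be sketched in C as below. This is not iopipe_search's code: it assumes the input is already split into an in-memory array of lines rather than streamed through a buffer, it uses `strstr` as a stand-in for Phobos' canFind, and the function name `select_lines` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes to `out` the indices of the lines a grep-like
 * tool would print for `needle` with `ctx` lines of context on each side.
 * Overlapping context windows are merged, so every line appears at most once.
 * `out` must have room for `n` entries; returns the number of indices written. */
static size_t select_lines(const char *const *lines, size_t n,
                           const char *needle, size_t ctx, size_t *out)
{
    assert(needle != NULL && out != NULL);
    size_t count = 0;
    size_t next_unprinted = 0; /* lowest line index not yet emitted */
    size_t after_left = 0;     /* trailing-context lines still owed */

    for (size_t i = 0; i < n; i++) {
        if (strstr(lines[i], needle) != NULL) {   /* stand-in for canFind */
            /* Leading context, clipped so nothing is emitted twice. */
            size_t start = i >= ctx ? i - ctx : 0;
            if (start < next_unprinted)
                start = next_unprinted;
            for (size_t j = start; j <= i; j++)
                out[count++] = j;
            next_unprinted = i + 1;
            after_left = ctx;                     /* restart trailing context */
        } else if (after_left > 0) {
            out[count++] = i;
            next_unprinted = i + 1;
            after_left--;
        }
    }
    return count;
}
```

For example, with an 8-line input whose lines 1 and 6 (0-based) contain the needle and `ctx = 1`, `select_lines` returns the indices 0, 1, 2, 5, 6, 7. A streaming version would only need to remember the last `ctx` lines instead of the whole array, which is where the choice of buffer comes in.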
May 10 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

I’d start with something clinically synthetic. Say your record size is exactly half of buffer + 1 byte. If you were to extend the size of buffer, it would amortize.

Basically: 16 MB buffer fixed vs 16 MB mmap-ed ring, where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU cache there. Otherwise CPU cache is so fast that the occasional copy is zilch; once we hit primary memory it’s not. Adjust sizes for your CPU.

The amount of work done per byte though has to be minimal to actually see anything.

> [...] in the buffer. But alas, it's roughly the same, even with a large number of lines of context (like 200).
>
> However, this example *does* show the power of iopipe -- it handles all flavors of Unicode with one template function, and is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Should be mostly trivial in fact. In our first designs for iopipe, I wanted regex to work with it. Basically: if we started a match, extend the window until we get it or lose it. Then release up to the next point of potential start.
May 10 2018
On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home. However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
>
> I’d start with something clinically synthetic. Say your record size is exactly half of buffer + 1 byte. If you were to extend the size of buffer, it would amortize.

Hm.. this wouldn't work, because the idea is to keep some of the buffer full. What will happen here is that the buffer will extend to be able to accommodate the extra byte, and then you are back to having less of the buffer full at once. Iopipe is not afraid to increase the buffer :)

> Basically: 16 MB buffer fixed vs 16 MB mmap-ed ring, where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU cache there. Otherwise CPU cache is so fast that the occasional copy is zilch; once we hit primary memory it’s not. Adjust sizes for your CPU.

This isn't how it will work. The system looks at the buffer and says "oh, I can just read 8MB - 1 byte," which gives you 2 bytes less than you need. Then you need the extra 2 bytes, so it will increase the buffer to hold at least 2 records.

I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)

Update: nope, still pretty much the same.

> The amount of work done per byte though has to be minimal to actually see anything.

Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.

What I have learned here is:

1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
2. The use cases are much smaller than I thought
3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.

>> [...] Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
>
> Should be mostly trivial in fact. In our first designs for iopipe, I wanted regex to work with it. Basically: if we started a match, extend the window until we get it or lose it. Then release up to the next point of potential start.

I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to start up and run on an individual line.

One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).

-Steve
May 11 2018
On 05/11/2018 06:28 AM, Steven Schveighoffer wrote:
> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
> 2. The use cases are much smaller than I thought

There is the LMAX Disruptor, which was open sourced a few years ago along with a large number of articles describing its history and design in great detail. There are so many articles, like this one

https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html

that it's impossible to find the one that had left an impression on me at the time I read it. That article described their story from the beginning to their current design, starting from a simple std::map, through lock contention and other concurrency pitfalls. They finally settled on a multi-producer-single-consumer design where the consumer works on one thread. This was giving them the biggest CPU cache advantage. The producers and the consumer share a ring buffer for communication.

Perhaps the example you're looking for is in there somewhere. :)

Ali
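To make the "producers and consumer share a ring buffer" idea concrete, here is a minimal single-producer/single-consumer ring buffer in C11. It is deliberately much simpler than the Disruptor, which is multi-producer. One thread calls `spsc_push`, another calls `spsc_pop`. The names, the fixed capacity of 8 and the `int` payload are illustrative assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SPSC_CAPACITY 8u
static_assert((SPSC_CAPACITY & (SPSC_CAPACITY - 1)) == 0,
              "capacity must be a power of two");

/* Illustrative single-producer/single-consumer ring of ints. The indices are
 * free-running counters; masking with CAPACITY-1 maps them onto slots. */
typedef struct {
    int slots[SPSC_CAPACITY];
    atomic_size_t head; /* next index to read; advanced only by the consumer */
    atomic_size_t tail; /* next index to write; advanced only by the producer */
} spsc_ring;

static void spsc_init(spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the ring is full. */
static bool spsc_push(spsc_ring *r, int value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == SPSC_CAPACITY)
        return false;
    r->slots[tail & (SPSC_CAPACITY - 1)] = value;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool spsc_pop(spsc_ring *r, int *value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->slots[head & (SPSC_CAPACITY - 1)];
    /* Release: the slot is read before the producer may reuse it. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because only the producer writes `tail` and only the consumer writes `head`, no locks are needed. A production version would typically also pad `head` and `tail` onto separate cache lines to avoid false sharing between the two threads.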
May 11 2018
On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
> [...]
> I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)
>
> Update: nope, still pretty much the same.

I'm sure someone will find some good show-off program.

>> The amount of work done per byte though has to be minimal to actually see anything.
>
> Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.
>
> What I have learned here is:
>
> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
> 2. The use cases are much smaller than I thought
> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.

Now I need to learn all about ring-buffers. Do you have any good starting points?

> 4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

That satisfying feeling when the code works exactly the way you wanted it to!

> I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.

I'm sure someone will find a place where it's useful. iopipe is looking like a great library!

>> Should be mostly trivial in fact. [...]
>
> I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to start up and run on an individual line.
>
> One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.
>
> I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).
>
> -Steve

That reminds me of this great blog post detailing grep's performance:
http://ridiculousfish.com/blog/posts/old-age-and-treachery.html

Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
May 11 2018
On 5/11/18 10:04 AM, Uknown wrote:
> On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
>> What I have learned here is:
>>
>> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
>> 2. The use cases are much smaller than I thought
>> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
>
> Now I need to learn all about ring-buffers. Do you have any good starting points?

I would start here: https://en.wikipedia.org/wiki/Circular_buffer

The point of a circular buffer is to avoid having to copy any data anywhere -- things stay in place as long as they are in-buffer.

So originally, I had intended to make a ring buffer (or circular buffer) where I have a random access range between 2 separate segments. In other words, the range would abstract the fact that the buffer was 2 segments, and give you random access to each element by checking to see which segment it's in.

I never actually made it, because I realized quickly while optimizing e.g. line processing that the huge benefit of having sequential memory far outweighs the drawback of occasionally having to copy data. In other words, you are paying for every element access to avoid paying for a rare small copy.

Consider a byline range. If you have an 8k buffer, and your lines are approximately 80 bytes in length on average, when you reach the end of the buffer and have to move whatever existing partial line to the front of the buffer to continue reading, you are really only copying 1% of the buffer, 1% of the time. But while you are searching for line endings (99% of the time), you are using a simple indexed pointer dereference. Contrast that with a disjoint buffer, where every access to an element first requires a check to see which segment you are in before dereferencing. You have moved the payment from the 1% into the 99%.

BUT, when at dconf, Dmitry and Shachar let me know about a technique to map the same memory segment to 2 consecutive address ranges. This allows you to look at the ring buffer without it ever being disjoint. Simply put, you have a 2x buffer, whereby each half looks at the same memory. Whenever your buffer start gets to the halfway point, you simply move the pointers back by half a buffer. Other than that, the code is nearly identical to a straight allocated buffer, and the memory access is just as fast.

So I decided to implement it, hoping that I would magically just get a bit better performance. I should have known better :)

> iopipe is looking like a great library!

Thanks! I hope to get more utility out of it. I still need to finish/publish my JSON parser based on it, and I'm thinking we really need some parsing tools to go on top of it to make things easier to approach.

> That reminds me of this great blog post detailing grep's performance:
> http://ridiculousfish.com/blog/posts/old-age-and-treachery.html
>
> Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
> https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

Thanks, I'll take a look at those.

-Steve
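The double-mapping trick described above can be sketched with plain POSIX calls. This is an illustration, not iopipe's implementation: it assumes a system where `MAP_ANONYMOUS` is available (Linux and macOS both provide it), it backs the memory with an unlinked temporary file from `mkstemp` (`shm_open`, or `memfd_create` on Linux, are common alternatives), and the names `mirror_map`, `mirror_ring`, `ring_write` and `ring_consume` are hypothetical. The size must be a multiple of the page size.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps `size` bytes twice, back to back, so base[i] and base[i + size] are
 * the same byte. `size` must be a multiple of the page size. */
static unsigned char *mirror_map(size_t size)
{
    char path[] = "/tmp/mirror-ring-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path); /* the open descriptor keeps the file alive */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    /* Reserve 2*size of contiguous address space first... */
    unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    /* ...then replace each half with a shared view of the same file. */
    if (mmap(base, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
             fd, 0) == MAP_FAILED ||
        mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
             fd, 0) == MAP_FAILED) {
        munmap(base, 2 * size);
        close(fd);
        return NULL;
    }
    close(fd); /* the mappings keep the memory alive */
    return base;
}

typedef struct {
    unsigned char *base; /* from mirror_map(size) */
    size_t size;         /* capacity (one half of the mapping) */
    size_t start;        /* offset of the first valid byte, always < size */
    size_t len;          /* number of valid bytes, <= size */
} mirror_ring;

/* The valid bytes are always contiguous at base + start, even when they
 * wrap past the end of the first half. */
static unsigned char *ring_data(const mirror_ring *r)
{
    return r->base + r->start;
}

/* Appends up to n bytes; returns how many fit. No wrap-around logic needed. */
static size_t ring_write(mirror_ring *r, const void *src, size_t n)
{
    size_t room = r->size - r->len;
    if (n > room)
        n = room;
    memcpy(r->base + r->start + r->len, src, n);
    r->len += n;
    return n;
}

/* Drops n bytes from the front. Once start crosses into the second half,
 * move it back by one half: both halves show the same memory. */
static void ring_consume(mirror_ring *r, size_t n)
{
    assert(n <= r->len);
    r->start += n;
    r->len -= n;
    if (r->start >= r->size)
        r->start -= r->size;
}
```

Readers and writers always see one contiguous window at `base + start`, so the rest of the buffering code can treat it like a plain array. The cost is the OS-specific setup, plus the size being tied to page granularity.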
May 11 2018
On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
> On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
>> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>>> [...]
>>
>> I’d start with something clinically synthetic. Say your record size is exactly half of buffer + 1 byte. If you were to extend the size of buffer, it would amortize.
>
> Hm.. this wouldn't work, because the idea is to keep some of the buffer full. What will happen here is that the buffer will extend to be able to accommodate the extra byte, and then you are back to having less of the buffer full at once. Iopipe is not afraid to increase the buffer :)

Then you cannot test it in such a way.

>> Basically: 16 MB buffer fixed vs 16 MB mmap-ed ring, where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU cache there. Otherwise CPU cache is so fast that the occasional copy is zilch; once we hit primary memory it’s not. Adjust sizes for your CPU.
>
> This isn't how it will work. The system looks at the buffer and says "oh, I can just read 8MB - 1 byte," which gives you 2 bytes less than you need. Then you need the extra 2 bytes, so it will increase the buffer to hold at least 2 records.
>
> I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)
>
> Update: nope, still pretty much the same.

Nope. Consider reading binary records where you know the length in advance and skip over each one w/o the need to touch every byte. There it might help. If you touch every byte and do something with it, the cost of copying the tail is zilch.

One example is the netstring, e.g. `13:Hello, world!,`. Basically the length in ASCII digits, a ‘:’, then that many UTF-8 code units, then a trailing ‘,’. No decoding necessary. Torrent files use a similar length-prefixed format, I think, maybe other files too. It is a nice example that avoids scans to find delimiters.

>> The amount of work done per byte though has to be minimal to actually see anything.
>
> Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.
>
> What I have learned here is:
>
> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers

This is also good. Normal ring buffers usually suck in the speed department.

> 2. The use cases are much smaller than I thought
> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
> 4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!
>
> I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.
>
>> Should be mostly trivial in fact. [...]
>
> I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to start up and run on an individual line.

It is malloc/free/addRange/removeRange for each call. I optimized 2.080 to reuse the most recently used engine w/o these costs, but I’ll have to check if it covers all cases.

> One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

I believe integrating iopipe awareness in regex will easily make it 50% faster. A guesstimate though.

> I don't get why grep is so bad at this, since it is supposedly [...]

grep on Mac is a piece of sheat, sadly, and I don’t know why exactly (too old?). Use some 3rd-party thing like ‘sift’ written in Go.
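The netstring example is easy to sketch: the parser reads only the length digits, then jumps straight to the end of the payload instead of scanning it for a delimiter, so the payload bytes are never examined. This C version follows the usual netstring layout (`<length>:<payload>,`); the names `parse_netstring` and `ns_status` are hypothetical. An `NS_NEED_MORE` result is the point at which a buffered reader would have to extend its window and retry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { NS_OK, NS_NEED_MORE, NS_BAD } ns_status;

/* Parses one netstring ("<decimal length>:<payload>,") at the start of buf.
 * On NS_OK, sets *payload, *payload_len and *consumed (total bytes used).
 * NS_NEED_MORE means buf holds only a prefix; NS_BAD means malformed input.
 * The spec's leading-zero rule is not enforced in this sketch. */
static ns_status parse_netstring(const char *buf, size_t avail,
                                 const char **payload, size_t *payload_len,
                                 size_t *consumed)
{
    assert(buf != NULL || avail == 0);
    size_t i = 0, len = 0;
    while (i < avail && buf[i] >= '0' && buf[i] <= '9') {
        size_t digit = (size_t)(buf[i] - '0');
        if (len > (SIZE_MAX - digit) / 10)
            return NS_BAD;            /* length would overflow size_t */
        len = len * 10 + digit;
        i++;
    }
    if (i == avail)
        return NS_NEED_MORE;          /* still reading the length */
    if (i == 0 || buf[i] != ':')
        return NS_BAD;
    i++;                              /* skip ':' */
    if (avail - i <= len)
        return NS_NEED_MORE;          /* payload or trailing ',' not buffered */
    if (buf[i + len] != ',')
        return NS_BAD;
    *payload = buf + i;               /* payload bytes are never scanned */
    *payload_len = len;
    *consumed = i + len + 1;
    return NS_OK;
}
```

Parsing `13:Hello, world!,` consumes 17 bytes: 3 for the length prefix, 13 for the payload and 1 for the trailing comma, and only the first 3 and the last 1 are ever inspected.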
May 11 2018
On Friday, 11 May 2018 at 23:46:16 UTC, Dmitry Olshansky wrote:
> [...]
> grep on Mac is a piece of sheat, sadly, and I don’t know why exactly (too old?). Use some 3rd-party thing like ‘sift’ written in Go.

You can always use GNU grep. The one that comes with macOS is pretty old and slow. If you have MacPorts, it's just `port install grep`. I'm sure brew will have a similar package for GNU grep.
May 11 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

Depends on OS and hardware. I would expect the mmap implementation to be slower, as it reads the file in chunks of 4kb and relies on page faults.
May 11 2018
On Friday, 11 May 2018 at 09:55:10 UTC, Kagamin wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> [...]
>
> Depends on OS and hardware. I would expect the mmap implementation to be slower, as it reads the file in chunks of 4kb and relies on page faults.

It doesn’t. Instead it has a buffer mmapped twice, side by side. Therefore you can avoid the copy at the end when it wraps around. Otherwise it’s the same buffering as usual.
May 11 2018
On 5/11/18 5:55 AM, Kagamin wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> [...]
>
> Depends on OS and hardware. I would expect the mmap implementation to be slower, as it reads the file in chunks of 4kb and relies on page faults.

As Dmitry hinted at, there actually is no file involved. I'm mapping just straight memory to 2 segments. In fact, in my test application, I'm using stdin as the input, which may not even involve a file. It's just as fast as using memory; the only cool part is that you can write a buffer that wraps to the beginning as if it were a normal array.

What surprises me is that the copying for the normal buffer doesn't hurt performance that much. I suppose this should probably have been expected, as CPUs are really really good at processing consecutive memory, and the copying you end up having to do is generally small compared to the rest of your app.

-Steve
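For contrast, the ordinary linear buffer that the ring buffer is being compared against can be sketched like this. Line scanning runs over plain sequential memory (`memchr`), and only when the buffer fills up is the unconsumed partial line moved to the front. The 64-byte capacity, the chunked `memcpy` standing in for `read(2)`, and the names `linear_buf`, `compact` and `count_lines` are assumptions for illustration. Where a line does not fit, iopipe grows the buffer instead; this sketch just reports an error.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A plain linear buffer: valid bytes live in data[start, end). */
typedef struct {
    char data[64];
    size_t start, end;
} linear_buf;

/* Slides the unconsumed tail to the front to make room at the back.
 * This memmove is the occasional copy a mirror-mapped ring avoids. */
static size_t compact(linear_buf *b)
{
    assert(b->start <= b->end);
    size_t tail = b->end - b->start;
    memmove(b->data, b->data + b->start, tail);
    b->start = 0;
    b->end = tail;
    return tail; /* bytes copied */
}

/* Counts '\n'-terminated lines in src, feeding the buffer `chunk` bytes at a
 * time, and reports how many bytes compaction copied. Returns SIZE_MAX if a
 * line does not fit in the buffer (a real reader would grow the buffer). */
static size_t count_lines(const char *src, size_t n, size_t chunk,
                          size_t *copied)
{
    assert(chunk > 0);
    linear_buf b = { .start = 0, .end = 0 };
    size_t lines = 0, pos = 0;
    *copied = 0;
    while (pos < n) {
        if (b.end == sizeof b.data) {
            if (b.start == 0)
                return SIZE_MAX;
            *copied += compact(&b);
        }
        size_t take = n - pos < chunk ? n - pos : chunk;
        size_t room = sizeof b.data - b.end;
        if (take > room)
            take = room;
        memcpy(b.data + b.end, src + pos, take); /* stand-in for read(2) */
        b.end += take;
        pos += take;
        /* Hot path: scan contiguous memory for line endings. */
        const char *nl;
        while ((nl = memchr(b.data + b.start, '\n', b.end - b.start)) != NULL) {
            lines++;
            b.start = (size_t)(nl - b.data) + 1;
        }
        if (b.start == b.end)
            b.start = b.end = 0; /* everything consumed: reset without copying */
    }
    return lines;
}
```

With 11-byte lines and a 64-byte buffer, each compaction moves fewer than 11 bytes, while every input byte goes through the `memchr` scan. That ratio is the "pay for a rare small copy" trade-off: the copy is cheap next to the sequential scanning it keeps fast.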
May 11 2018
On 5/10/18 7:22 PM, Steven Schveighoffer wrote:
> However, this example *does* show the power of iopipe -- it handles all flavors of Unicode with one template function, and is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Shameful note: macOS grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).

So at least there is something to strive for :)

-Steve
May 11 2018
On 5/11/18 11:44 AM, Steven Schveighoffer wrote:
> On 5/10/18 7:22 PM, Steven Schveighoffer wrote:
>> However, this example *does* show the power of iopipe -- [...] I'm tempted to add regex processing to see if it still beats grep.
>
> Shameful note: macOS grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).
>
> So at least there is something to strive for :)

More testing reveals that as I increase the context lines to print, iopipe performs better than GNU grep. A shocking thing is that at 9 lines of context, grep's time goes up slightly, but all of a sudden at 10 lines of context, it doubles in the time taken (and is now slower than iopipe_search).

Also noting: my Linux VM does not have ldc, so these are dmd numbers.

-Steve
May 11 2018
On Friday, 11 May 2018 at 16:07:26 UTC, Steven Schveighoffer wrote:
> On 5/11/18 11:44 AM, Steven Schveighoffer wrote:
>> [...]
>
> More testing reveals that as I increase the context lines to print, iopipe performs better than GNU grep. [...]
>
> Also noting: my Linux VM does not have ldc, so these are dmd numbers.

What stops you from downloading a Linux release from here?

https://github.com/ldc-developers/ldc/releases
May 11 2018
On 5/11/18 5:42 PM, Joakim wrote:
> On Friday, 11 May 2018 at 16:07:26 UTC, Steven Schveighoffer wrote:
>> [...]
>> Also noting: my Linux VM does not have ldc, so these are dmd numbers.
>
> What stops you from downloading a Linux release from here?
> https://github.com/ldc-developers/ldc/releases

So I did that, it's not much faster, a few milliseconds. Still about half as fast as GNU grep.

But I am not expecting any miracles here. GNU grep does pretty much everything it can to achieve performance -- including eschewing the standard library buffering system as I am doing. I can probably match the performance at some point, but I doubt it's worth worrying about. It's still really really fast without trying to do anything crazy.

I hope at some point, however, to work with Dmitry to add an iopipe-based regex engine so we can see how much better we can make regex.

-Steve
May 12 2018
On Saturday, 12 May 2018 at 12:14:28 UTC, Steven Schveighoffer wrote:
> On 5/11/18 5:42 PM, Joakim wrote:
>> [...]
>
> So I did that, it's not much faster, a few milliseconds. Still about half as fast as GNU grep.
>
> But I am not expecting any miracles here. GNU grep does pretty much everything it can to achieve performance -- including eschewing the standard library buffering system as I am doing. I can probably match the performance at some point, but I doubt it's worth worrying about. It's still really really fast without trying to do anything crazy.

I could offer a few tricks to fix that w/o getting too dirty. GNU grep is fast, but std.regex is faster than that in raw speed on a significant class of quite common patterns. But I loaded the file at once.

> I hope at some point, however, to work with Dmitry to add an iopipe-based regex engine so we can see how much better we can make regex.

As such initiatives go, it’s either now or never. Please get in touch directly over Slack or something, let’s make it roll. I’ve wanted to do a grep-like utility since 2012. Now at long last we have all the building blocks.
May 12 2018
On Saturday, 12 May 2018 at 12:45:16 UTC, Dmitry Olshansky wrote:
> On Saturday, 12 May 2018 at 12:14:28 UTC, Steven Schveighoffer wrote:
>> [...]
>
> I could offer a few tricks to fix that w/o getting too dirty. GNU grep is fast, but std.regex is faster than that in raw speed on a significant class of quite common patterns. But I loaded the file at once.
>
>> [...]
>
> As such initiatives go, it’s either now or never. Please get in touch directly over Slack or something, let’s make it roll. I’ve wanted to do a grep-like utility since 2012. Now at long last we have all the building blocks.

If you're talking about writing a grep prototype in D, that's a great idea, especially for publicizing D. :)
May 12 2018
On Saturday, 12 May 2018 at 14:48:58 UTC, Joakim wrote:[...]If you're talking about writing a grep prototype in D, that's a great idea, especially for publicizing D. :)

For shaming others into beating us using some other language. Making life better for everyone. Taking a DMD to a gun fight ;)
May 12 2018
On 05/12/2018 08:14 AM, Steven Schveighoffer wrote:But I am not expecting any miracles here. GNU grep does pretty much everything it can to achieve performance -- including eschewing the standard library buffering system as I am doing. I can probably match the performance at some point, but I doubt it's worth worrying about.

I wonder if there are realistic real-world cases where you could beat it by being a library solution and skipping the cost of launching grep as a new process. Granted, outside of Windows, process launching is considered to be fairly cheap, but it still isn't free.

That would still be a nice feather in D's cap: comparable to grep for large data, faster than spawning a grep process for smaller data.
May 12 2018
On Friday, May 11, 2018 11:44:04 Steven Schveighoffer via Digitalmars-d-announce wrote:Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).

Curiously, the grep on FreeBSD seems to be GNU's grep with some additional patches, though I expect that it's a ways behind whatever GNU is releasing now, because while they were willing to put some GPLv2 stuff in FreeBSD, they have not been willing to have anything to do with GPLv3. FreeBSD's grep claims to be version 2.5.1-FreeBSD, whereas ports has the gnugrep package, which is version 2.27, so that implies a fairly large version difference between the two. I have no idea how they compare in terms of performance.

Either way, I would have expected FreeBSD to be using their own implementation, not something from GNU, especially since they seem to be trying to purge GPL stuff from FreeBSD. So, the fact that FreeBSD is using GNU's grep is a bit surprising. If I had to guess, I would guess that they switched to the GNU version at some point in the past, because it was easier to grab it than to make what they had faster, but I don't know. Either way, it sounds like Mac OS X either didn't take their grep from FreeBSD in this case, or they took it from an older version before FreeBSD switched to using GNU's grep.

- Jonathan M Davis
May 11 2018
On Friday, 11 May 2018 at 16:06:41 UTC, Jonathan M Davis wrote:[...]

Oh, there was an epic forum thread about the use of GNU grep in BSD. I don't remember the details, but it was long and heated (it was so epic that I even read it, although I normally don't care at all about BSD stuff).
May 12 2018
On Friday, 11 May 2018 at 15:44:04 UTC, Steven Schveighoffer wrote:[...]Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).

Yeah, the MacOS default versions of the Unix text processing tools are really slow. It's worth installing the GNU versions if you are doing performance comparisons on MacOS, or if you work with large files. Homebrew and MacPorts both have the GNU versions. Some relevant packages: coreutils, grep, gsed (sed), gawk (awk). Most tools are in coreutils. Many will be installed with a 'g' prefix by default, leaving the existing tools in place, e.g. 'cut' will be installed as 'gcut' unless specified otherwise.

--Jon
May 11 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home. [...]

Since mmap is involved, it would be interesting to see if this can be extended for interprocess communication, akin to boost::interprocess: https://www.boost.org/doc/libs/1_67_0/doc/html/interprocess.html

boost::interprocess uses mmap[1] together with shm_open[2] by default (unless specified to use SysV shm).

[1] https://github.com/boostorg/interprocess/blob/4f8459e868617f88ff105633a9aa82221d5e9bb1/include/boost/interprocess/mapped_region.hpp#L698
[2] https://github.com/boostorg/interprocess/blob/develop/include/boost/interprocess/shared_memory_object.hpp#L315
May 11 2018
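For readers new to the zero-copy trick being discussed (the same buffer mapped twice, side by side), here is a minimal POSIX C sketch. It is not iopipe's implementation: all function and type names are hypothetical, the backing memory is an unlinked temporary file chosen only for portability (a shm_open object, as in boost::interprocess, could presumably be substituted and shared between processes), and the size must be a multiple of the page size. Because both halves alias the same pages, data that wraps past the end of the ring can still be written or read as one contiguous slice.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANON, mkstemp, ftruncate on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps `size` bytes (a multiple of the page size)
 * twice, back to back, so base[i] and base[i + size] are the same byte.
 * Returns NULL on failure. */
static unsigned char *mirror_map(size_t size) {
    char path[] = "/tmp/mirror-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return NULL;
    unlink(path); /* the file has no name; it lives as long as the mappings */
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }

    /* Reserve 2*size bytes of contiguous address space. */
    void *res = mmap(NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
    if (res == MAP_FAILED) { close(fd); return NULL; }
    unsigned char *base = res;

    /* Map the same file pages over both halves of the reservation. */
    void *lo = mmap(base, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd); /* the mappings keep the pages alive */
    if (lo == MAP_FAILED || hi == MAP_FAILED) { munmap(base, 2 * size); return NULL; }
    return base;
}

static void mirror_unmap(unsigned char *base, size_t size) { munmap(base, 2 * size); }

/* Hypothetical ring over a mirrored region: `len` valid bytes start at `start`. */
typedef struct {
    unsigned char *base;
    size_t size, start, len;
} ring;

/* Appends n bytes with ONE memcpy, even if the write wraps past the end.
 * Returns 0 (and writes nothing) if there is not enough free space. */
static int ring_put(ring *r, const void *src, size_t n) {
    if (n > r->size - r->len) return 0;
    memcpy(r->base + (r->start + r->len) % r->size, src, n);
    r->len += n;
    return 1;
}

/* All valid bytes as one contiguous slice of length r->len. */
static const unsigned char *ring_data(const ring *r) { return r->base + r->start; }

/* Drops n bytes from the front (clamped to the valid length). */
static void ring_consume(ring *r, size_t n) {
    if (n > r->len) n = r->len;
    r->start = (r->start + n) % r->size;
    r->len -= n;
}
```

A caller would map `sysconf(_SC_PAGESIZE)` bytes (or a multiple), wrap them in `ring r = { m, size, 0, 0 };` and use `ring_put`, `ring_data` and `ring_consume`. The price is that two virtual addresses refer to one physical page, which is what the cache-aliasing concern raised in this thread is about.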
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home. [...]

Such buffers can be problematic with some CPUs and OSes. For modern CPUs there should be no problems, but on those with exotic caches and virtual memory configurations there can be some aliasing issues. Linus Torvalds talked a little about that case in this RealWorldTech thread: https://www.realworldtech.com/forum/?threadid=174426&curpostid=174731
May 12 2018
On 5/12/18 3:38 PM, Patrick Schluter wrote:[...]

Thanks for the tip. The nice thing about iopipe is that the buffer type is completely selectable, and nothing else changes, except possibly some performance. So on those architectures, I would expect people to select the normal AllocatedBuffer type.

-Steve
May 12 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home. [...]

Hi Steve,

It is exciting work that could help in the bioinformatics area. Indeed, in bioinformatics we are I/O bound and we process lots of big files; the amount of data can be in gigabytes, terabytes and sometimes even petabytes. So processing these amounts of data efficiently is critical.

Some years ago I got a request 'How to parse fastq file format in D?' and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed ). It could be interesting to show how fast iopipe is. You can grab a fastq file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/ and take a look at iopipe's performance.

The fastq format is plain text, and a file is usually a repetition of four lines:
1/ title and description: this line starts with @
2/ sequence line: this line usually contains DNA letters (ACGT)
3/ comment line: this line starts with +
4/ quality line: one quality character per sequence letter, so this line has the same length as the sequence line (n°2)

Rarely, the comment section spans multiple lines. Warning: the @ and + characters can be found inside the quality line, thus I search for the two-character patterns '\n@' and '\n+'. I never split the file by line, as it is a waste of time; instead I read the content as a stream.

I hope this showcase helps you. Good luck :-)
May 14 2018
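As an illustration of the four-line fastq layout described above, here is a minimal C sketch of a zero-copy record scanner over a byte buffer that is already in memory. It is neither monarch_dodra's parser nor an iopipe API, and all names are hypothetical. It assumes strict four-line records with '\n' line endings (no '\r', no sequence or quality wrapped over several lines), and instead of searching for '\n@' it reads lines by position and uses the equal-length rule, so an '@' or '+' at the start of a quality line is never mistaken for a header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type; every field points into the input buffer (no copies). */
typedef struct {
    const char *id;   size_t id_len;   /* title line without the leading '@' */
    const char *seq;  size_t seq_len;
    const char *qual; size_t qual_len;
} fastq_rec;

/* Returns the length of the line at p (without '\n') and sets *next past it.
 * Requires p < end. */
static size_t take_line(const char *p, const char *end, const char **next) {
    const char *nl = memchr(p, '\n', (size_t)(end - p));
    if (nl == NULL) { *next = end; return (size_t)(end - p); }
    *next = nl + 1;
    return (size_t)(nl - p);
}

/* Hypothetical parser: reads the record at *pos. Returns 1 on success (and
 * advances *pos), 0 at end of input, -1 if malformed or truncated. */
static int fastq_next(const char **pos, const char *end, fastq_rec *r) {
    const char *p = *pos;
    if (p >= end) return 0;
    if (*p != '@') return -1;
    r->id = p + 1;
    r->id_len = take_line(p, end, &p) - 1;

    if (p >= end) return -1;
    r->seq = p;
    r->seq_len = take_line(p, end, &p);

    if (p >= end || *p != '+') return -1;
    take_line(p, end, &p); /* '+' separator line; its content is ignored */

    if (p >= end) return -1;
    r->qual = p;
    r->qual_len = take_line(p, end, &p);
    /* The quality line may itself start with '@' or '+'; reading it by
     * position and checking its length avoids mistaking it for a header. */
    if (r->qual_len != r->seq_len) return -1;

    *pos = p;
    return 1;
}
```

In a streaming setting, a -1 near the end of the current window may only mean the last record is incomplete, so a caller would presumably extend the buffer and retry before treating it as a real format error.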
On 5/14/18 6:02 AM, bioinfornatics wrote:[...]Some years ago I got a request 'How to parse fastq file format in D?'[...]

Yeah, I have been working on and off with Vang Le (biocyberman) on using iopipe to parse such formats. He gave a good presentation at dconf this year on using D in bioinformatics, and I think it is a great fit for D!

At dconf, I threw together a crude fasta parser (with the intention of having it be the base for parsing fastq as well) to demonstrate how iopipe can perform while parsing such things. I have no idea how fast or slow it is, as I just barely got it to work (it passes unit tests I made up based on the Wikipedia entry for fasta), but IMO, the direct buffer access makes fast parsing much more pleasant than having to deal with your own buffering (using Phobos makes parsing a bit difficult, however; I still see a need for some parsing tools for iopipe). You can find that library here: https://github.com/schveiguy/fastaq

Not being in the field of bioinformatics, I can't really say that I am likely to continue development of it, but I'm certainly willing to help with iopipe for anyone who wants to use it in this field.

-Steve
May 14 2018
On Monday, 14 May 2018 at 14:23:43 UTC, Steven Schveighoffer wrote:[...]You can find that library here: https://github.com/schveiguy/fastaq[...]

Hi Steve

Great work continuing to improve iopipe. Thank you for the example implementation of a fasta/q parser with iopipe. I will definitely continue to work on this. It still requires some more time for me to get over the beginner barriers in D. I am currently trying out some work over here: https://github.com/bioslaD

Johnathan (bioinformatics): it would be great if you could join bioslaD and offer some help to make things move faster.

Vang
May 15 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

I can think of a good use case: audio streaming (in an embedded environment)!

If you have something like a Bluetooth audio source, and ALSA or any audio hardware API as an audio sink to speakers, you can use the ring buffer as a FIFO between the two. The Bluetooth source has its own pace (which you cannot control) and a variable bit rate, whereas the sink has a constant bit rate, so you need a buffer between them. And you want to reduce the CPU cost as much as possible due to embedded system constraints (or even real-time constraints, especially for audio).
May 14 2018
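The producer/consumer arrangement in the audio example above can be sketched as a conventional single-producer, single-consumer byte FIFO in C11. This is a minimal illustration under stated assumptions, not iopipe's RingBuffer: the names and the 4096-byte capacity are hypothetical, each side is assumed to run in its own thread (for example a Bluetooth receive callback and an audio output loop), and real audio code would also need underrun handling and period sizing. The counters only ever grow; because the capacity is a power of two, `head - tail` stays correct even when they wrap around.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define FIFO_CAP 4096u /* hypothetical capacity in bytes; must be a power of two */

typedef struct {
    unsigned char data[FIFO_CAP];
    atomic_size_t head; /* total bytes written; only the producer stores it */
    atomic_size_t tail; /* total bytes read; only the consumer stores it */
} fifo;

static void fifo_init(fifo *f) {
    atomic_init(&f->head, 0);
    atomic_init(&f->tail, 0);
}

/* Producer side: copies up to n bytes in, returns how many were accepted. */
static size_t fifo_write(fifo *f, const unsigned char *src, size_t n) {
    size_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    size_t space = FIFO_CAP - (head - tail);
    if (n > space) n = space;
    size_t off = head & (FIFO_CAP - 1);
    size_t first = n < FIFO_CAP - off ? n : FIFO_CAP - off;
    memcpy(f->data + off, src, first);
    if (n > first) memcpy(f->data, src + first, n - first); /* wrapped part */
    atomic_store_explicit(&f->head, head + n, memory_order_release); /* publish data */
    return n;
}

/* Consumer side: copies up to n bytes out, returns how many were available. */
static size_t fifo_read(fifo *f, unsigned char *dst, size_t n) {
    size_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    size_t avail = head - tail;
    if (n > avail) n = avail;
    size_t off = tail & (FIFO_CAP - 1);
    size_t first = n < FIFO_CAP - off ? n : FIFO_CAP - off;
    memcpy(dst, f->data + off, first);
    if (n > first) memcpy(dst + first, f->data, n - first); /* wrapped part */
    atomic_store_explicit(&f->tail, tail + n, memory_order_release); /* free space */
    return n;
}
```

Note the split into two memcpy calls whenever a copy crosses the end of the array; a double-mapped buffer like the one discussed in this thread exists precisely to make such a wrapped region contiguous, so each side could use a single copy (or read in place).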