digitalmars.D.announce - Faster Command Line Tools in D
- Mike Parker (10/10) May 24 2017 Some of you may remember Jon Degenhardt's talk from one of the
- cym13 (4/15) May 24 2017 A bit off topic but I really like that we still get quality
- Jon Degenhardt (6/11) May 24 2017 The compliment to the community is well deserved, thank you for
- Walter Bright (2/2) May 24 2017 It's now #4 on the front page of Hacker News:
- cym13 (12/14) May 24 2017 The comments on HN are useless though, everybody went for the
- Jon Degenhardt (13/29) May 24 2017 It's not easy writing an article that doesn't draw some form of
- Walter Bright (13/24) May 24 2017 Any time one writes an article comparing speed between languages X and Y...
- Jon Degenhardt (9/25) May 24 2017 Thanks Walter, I appreciate your comments. And correct, as
- Wulfklaue (45/52) May 25 2017 Maybe as a more casual observer the article did feel more like
- Steven Schveighoffer (26/55) May 25 2017 Because split allocates on every call. The key, in many cases in D, to
- Suliman (3/6) May 25 2017 Is there any plan to deprecate all splitters and make one single.
- Jonathan M Davis via Digitalmars-d-announce (22/28) May 25 2017 I wouldn't expect any of the split-related functions to be going away. W...
- cym13 (8/33) May 28 2017 I don't know if people coming from other languages would
- Jonathan M Davis via Digitalmars-d-announce (24/27) May 25 2017 Not only that, but over time, there has been a push to generalize functi...
- Jack Stouffer (3/7) May 24 2017 Wouldn't be the first time
- xtreak (9/20) May 25 2017 There are repeated references over usage of D at Netflix for
- Nick Sabalausky (Abscissa) (3/9) May 26 2017 I've used netflix. If its "suggestion" features are any indication, I'm
- =?UTF-8?Q?Ali_=c3=87ehreli?= (5/7) May 25 2017 Inspired Nim version, found on Reddit:
- Basile B. (2/9) May 25 2017 Wow, the D blog post opened Pandora's box.
- bachmeier (3/17) May 26 2017 I guess programmers will do comparisons of language speed
- John Colvin (20/31) May 26 2017 I spent some time fiddling with my own manual approaches to
- John Colvin (4/9) May 26 2017 This version also has the advantage of being (discounting any
- Steven Schveighoffer (8/17) May 30 2017 I worked a lot on making sure this works properly. However, it's
- Patrick Schluter (4/20) May 30 2017 If you want UCS-2 (aka UTF-16 without surrogates) data I can give
- Steven Schveighoffer (11/29) May 30 2017 The data I can (and have) generated from UTF-8 data. I have tested my
- Patrick Schluter (14/55) May 30 2017 In any case, you can download the dataset from [1] if you like.
- Steven Schveighoffer (3/14) May 31 2017 Thanks, I'll bookmark it for later use.
- Steven Schveighoffer (14/43) May 30 2017 nice! hm....
- Joakim (4/15) Aug 08 2017 Heh, happened to notice that this blog post now has 21 comments,
- bachmeier (2/19) Aug 08 2017 There was also a Haskell version on Reddit.
Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes a D version of a command-line tool written in Python and incrementally improves its performance. The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
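For readers who haven't opened the article yet, the task it optimizes has this general shape; the sketch below is illustrative only (the function name and the use of a simple running sum are mine, not the post's code):

```d
import std.algorithm : splitter;
import std.conv : to;

// Find the key with the largest summed value in tab-separated
// (key, value) lines -- the general shape of the task the post tunes.
string maxKey(string[] lines)
{
    long[string] sums;  // key -> running total
    foreach (line; lines)
    {
        auto fields = line.splitter('\t');
        string key = fields.front;
        fields.popFront();
        sums[key] += fields.front.to!long;
    }
    string best;
    long bestSum = long.min;
    foreach (key, total; sums)
        if (total > bestSum) { best = key; bestSum = total; }
    return best;
}

void main()
{
    assert(maxKey(["a\t1", "b\t5", "a\t2"]) == "b");
}
```

The article's successive versions work on this same kind of loop, mainly by cutting allocations out of the inner iteration.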
May 24 2017
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes a D version of a command-line tool written in Python and incrementally improves its performance. The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/A bit off topic but I really like that we still get quality content such as this post on this blog. Sustained quality is a hard job and I thank everyone involved for that.
May 24 2017
On Wednesday, 24 May 2017 at 17:36:29 UTC, cym13 wrote:On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote: [...snip...] A bit off topic but I really like that we still get quality content such as this post on this blog. Sustained quality is a hard job and I thank everyone involved for that.The compliment to the community is well deserved, thank you for including this post in the company. In this case, the post benefited from some really excellent review feedback and Mike made the publication side really easy. --Jon
May 24 2017
On Wednesday, 24 May 2017 at 21:34:08 UTC, Walter Bright wrote:https://news.ycombinator.com/newsThe comments on HN are useless though: everybody went for the "D versus Python" thing and seems to complain that it's doing a D/Python benchmark while only talking about D optimization... even though optimizing D is the whole point of the article. In the same way, they rant against the fact that many iterations of the D script are shown, while that is obviously meant to show different tricks while being clear about which trick gives what gain. I am disappointed because there are so many good things to say about this, so many good questions or remarks to make when not familiar with the language, and yet all we get is "Meh, this benchmark shows nothing of D's speed against Python".
May 24 2017
On Wednesday, 24 May 2017 at 21:46:10 UTC, cym13 wrote:On Wednesday, 24 May 2017 at 21:34:08 UTC, Walter Bright wrote:It's not easy writing an article that doesn't draw some form of criticism. FWIW, the reason I gave a Python example is because it is very commonly used for this type of problem and the language is well suited to it. A second reason is that I've seen several posts where someone has tried to rewrite a Python program like this in D, start with `split`, and wonder how to make it faster. My hope is that this will clarify how to achieve this. Another goal of the article was to describe how performance in the TSV Utilities had been achieved. The article is not about the TSV Utilities, but discussing both the benchmark results and how they had been achieved would be a very long article. --Jonhttps://news.ycombinator.com/newsThe comments on HN are useless though, everybody went for the "D versus Python" thing and seem to complain that it's doing a D/Python benchmark while only talking about D optimization...even though optimizing D is the whole point of the article. In the same way they rant against the fact that many iterations on the D script are shown while it is obviously to give different tricks while being clear on what trick gives what. I am disappointed because there are so many good things to say about this, so many good questions or remarks to make when not familiar with the language, and yet all we get is "Meh, this benchmark shows nothing of D's speed against Python".
May 24 2017
On 5/24/2017 3:56 PM, Jon Degenhardt wrote:Its not easy writing an article that doesn't draw some form of criticism. FWIW, the reason I gave a Python example is because it is very commonly used for this type of problem and the language is well suited to it. A second reason is that I've seen several posts where someone has tried to rewrite a Python program like this in D, start with `split`, and wonder how to make it faster. My hope is that this will clarify how to achieve this. Another goal of the article was to describe how performance in the TSV Utilities had been achieved. The article is not about the TSV Utilities, but discussing both the benchmark results and how they had been achieved would be a very long article.Any time one writes an article comparing speed between languages X and Y, someone gets their ox gored and will bitterly complain about how unfair the article is (though I noticed that none of the complainers wrote a faster Python version). Even if you tried to optimize the Python program, you'll be inevitably accused of deliberately not doing it right. The nadir of this for me was when I compared Digital Mars C++ code with DMD. Both share the same optimizer and back end, yet I was accused of "sabotaging" my own C++ compiler in order to make D look better !! Me, I just don't do public comparison benchmarking anymore. It's a waste of time arguing with people about it. I thought you wrote a fine article, and the criticism about the Python code was unwarranted (especially since nobody suggested better code), because the article was about optimizing D code, not optimizing Python.
May 24 2017
On Thursday, 25 May 2017 at 05:17:29 UTC, Walter Bright wrote:Any time one writes an article comparing speed between languages X and Y, someone gets their ox gored and will bitterly complain about how unfair the article is (though I noticed that none of the complainers wrote a faster Python version). Even if you tried to optimize the Python program, you'll be inevitably accused of deliberately not doing it right. The nadir of this for me was when I compared Digital Mars C++ code with DMD. Both share the same optimizer and back end, yet I was accused of "sabotaging" my own C++ compiler in order to make D look better !! Me, I just don't do public comparison benchmarking anymore. It's a waste of time arguing with people about it. I thought you wrote a fine article, and the criticism about the Python code was unwarranted (especially since nobody suggested better code), because the article was about optimizing D code, not optimizing Python.Thanks Walter, I appreciate your comments. And correct, as multiple people noted, a speed comparison with other languages was not at all a goal of the article. The real intent was to tell a story of how several of D's features play together to enable optimizations like this, without having to write low-level code or step outside the core language features and standard library. --Jon
May 24 2017
On Thursday, 25 May 2017 at 06:22:28 UTC, Jon Degenhardt wrote:Thanks Walter, I appreciate your comments. And correct, as multiple people noted, a speed comparison with other languages was not at all a goal of the article. The real intent was to tell a story of how several of D's features play together to enable optimizations like this, without having to write low-level code or step outside the core language features and standard library.Maybe as a more casual observer the article did feel more like Python vs D. I have not yet read the ycombinator comments, just my personal observations after reading the article. My thinking was: - Python's PyPy is surprisingly fast. - Surprised that D was slower in version 1. - Kind of surprised again that it took so many versions to figure out the best approach. - Also wondering why one needed std.algorithm splitter, when you expect string split to be the fastest. Even the fact that you need to import std.array to split a string simply felt strange. - So much effort for relatively little gain (after v2 splitter). The time spent on finding a faster solution is, in a business sense, not worth it. But not finding a faster way is simply wasting performance, just on this simple function. - Started to wonder if Python's PyPy is so optimized that, without any effort, you're even faster than D. What other D idiomatic functions are slow? I am not criticizing your article Jon, just mentioning how I felt when reading it yesterday. It felt like the solution was overly complex to find and required too much deep D knowledge. Going to read the ycombinator comments now. Off-topic: Yesterday I was struggling with split but for a whole different reason. Take into account that I am new at D. Needed to split a string. Simple, right? Search Google for "split string dlang". Get to the https://dlang.org/phobos/std_string.html page. After seeing splitLines, I started experimenting with it.
Half an hour later I realized that the wrong function was used and I needed to import the std.array split function. Call it an issue with the documentation or my own stupidity, but the fact that split was only listed as an imported function, in this mass of text, totally sent me in the wrong direction. As stated above, I expected split to be part of std.string, because I am manipulating a string, not that I needed to import std.array, which is the end result. I simply find the documentation confusing with the wall of text. When you search for string split, you expect to arrive at the string.split page. Not only that, the split examples use split as a standalone function, when I was looking for variable.split(). Veteran D programmers are probably going to laugh at me for this, but one does feel a bit salty after that.
May 25 2017
On 5/25/17 6:27 AM, Wulfklaue wrote:- Also wondering why one needed std.algorithm splitter, when you expect string split to be the fastest. Even the fact that you need to import std.array to split a string simply felt strange.Because split allocates on every call. The key, in many cases in D, to increasing performance is avoiding allocations. It has been that way for as long as I can remember. Another possibility to "fix" this problem is to simply use an allocator with split that allocates on some predefined stack space. This is very similar to what v3 does with the Appender. Unfortunately, allocator is still experimental, and so split doesn't support using it.- So much effort for relatively little gain (after v2 splitter). The time spent on finding a faster solution is, in a business sense, not worth it. But not finding a faster way is simply wasting performance, just on this simple function.The answer is always "it depends". If you're processing hundreds of these files in tight loops, it probably makes sense to optimize the code. If not, then it may make sense to focus efforts elsewhere. The point of the article is, this is how to do it if you need performance there.- Started to wonder if Python's PyPy is so optimized that, without any effort, you're even faster than D. What other D idiomatic functions are slow?split didn't actually seem that slow. I'll note that you could opt for just the AA optimization (converting char[] to string only when storing a new hash entry is a big win, and not that cumbersome) and leave the code for split alone, and you probably still could beat the Python code.Off-topic: Yesterday I was struggling with split but for a whole different reason. Take into account that I am new at D. Needed to split a string. Simple, right? Search Google for "split string dlang". Get to the https://dlang.org/phobos/std_string.html page. After seeing splitLines, I started experimenting with it.
Half an hour later I realized that the wrong function was used and I needed to import the std.array split function. Call it an issue with the documentation or my own stupidity, but the fact that split was only listed as an imported function, in this mass of text, totally sent me in the wrong direction. As stated above, I expected split to be part of std.string, because I am manipulating a string, not that I needed to import std.array, which is the end result.std.string, std.array, and std.algorithm all have cross-pollination when it comes to array operations. It has to do with the history of when the modules were introduced.I simply find the documentation confusing with the wall of text. When you search for string split, you expect to arrive at the string.split page. Not only that, the split examples use split as a standalone function, when I was looking for variable.split().There is a search field at the top, which helps to narrow down what choices are available.Veteran D programmers are probably going to laugh at me for this, but one does feel a bit salty after that.I understand your pain. I work with Swift often, and sometimes it's very frustrating trying to find the right tool for the job, as I'm not thoroughly immersed in Apple's SDK on a day-to-day basis. I don't know that any programming language gets this perfect. -Steve
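Steve's point about allocation is easy to see in miniature; this is a hedged sketch (the names are mine, not from the thread): split returns a freshly allocated array of slices on every call, while splitter walks the input lazily.

```d
import std.algorithm : splitter;
import std.array : split;
import std.range : walkLength;

// Count tab-separated fields lazily: the input is sliced on the fly,
// no array is allocated per call.
size_t countFields(string line)
{
    return line.splitter('\t').walkLength;
}

void main()
{
    string line = "red\t10\tblue\t20";

    // Eager split: allocates a new string[] each time it is called.
    assert(line.split('\t') == ["red", "10", "blue", "20"]);

    // Lazy splitter: same fields, no per-call array allocation.
    assert(countFields(line) == 4);
}
```

In a tight loop over millions of lines, that per-call array allocation is exactly the kind of GC pressure the article works to eliminate.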
May 25 2017
std.string, std.array, and std.algorithm all have cross-pollination when it comes to array operations. It has to do with the history of when the modules were introduced.Is there any plan to deprecate all the splitters and make one single one? Because, as I understand it, we now have 4 functions that perform the same task.
May 25 2017
On Thursday, May 25, 2017 14:17:27 Suliman via Digitalmars-d-announce wrote:I wouldn't expect any of the split-related functions to be going away. We often have a function that operates on arrays or strings and another which operates on more general ranges. It may mainly be for historical reasons, but removing the array-based functions would break existing code, and we'd get a whole other set of complaints about folks not understanding that you need to slap array() on the end of a call to splitter to get the split that they were looking for (especially if they're coming from another language and don't understand ranges yet). And ultimately, the array-based functions continue to serve as a way to have simpler code when you don't care about (or you actually need) the additional memory allocations. Also, splitLines/lineSplitter can't actually be written in terms of split/splitter, because split/splitter does not have a way to provide multiple delimiters (let alone multiple delimiters where one includes the other, which is what you get with "\n" and "\r\n"). So, that distinction isn't going away. It's also a common enough operation that having a function for it rather than having to pass all of the delimiters to a more general function is arguably worth it, just like having the overload of split/splitter which takes no delimiter and then splits on whitespace is arguably worth it over having a more general function where you have to feed it every variation of whitespace. - Jonathan M Davisstd.string, std.array, and std.algorithm all have cross-pollination when it comes to array operations. It has to do with the history of when the modules were introduced.Is there any plan to deprecate all the splitters and make one single one? Because, as I understand it, we now have 4 functions that perform the same task.
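The relationship Jonathan describes can be sketched in a few lines (an illustrative example, not Phobos internals; the helper name is mine): appending .array to splitter reproduces split, while splitLines copes with both "\n" and "\r\n".

```d
import std.algorithm : splitter;
import std.array : array, split;
import std.string : splitLines;

// The lazy splitter plus .array gives the same result as eager split.
string[] splitViaRange(string s, char sep)
{
    return s.splitter(sep).array;
}

void main()
{
    assert(splitViaRange("a,b,c", ',') == "a,b,c".split(','));

    // splitLines recognizes several terminators, including "\r\n",
    // which a single-delimiter splitter cannot express.
    assert("one\r\ntwo\nthree".splitLines == ["one", "two", "three"]);
}
```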
May 25 2017
On Thursday, 25 May 2017 at 16:19:16 UTC, Jonathan M Davis wrote:I wouldn't expect any of the split-related functions to be going away. We often have a function that operates on arrays or strings and another which operates on more general ranges. It may mainly be for historical reasons, but removing the array-based functions would break existing code, and we'd get a whole other set of complaints about folks not understanding that you need to slap array() on the end of a call to splitter to get the split that they were looking for (especially if they're coming from another language and don't understand ranges yet). And ultimately, the array-based functions continue to serve as a way to have simpler code when you don't care about (or you actually need) the additional memory allocations.I don't know if people coming from other languages would really mind. Of course it would have to be taught once, as everything does, but many languages (and I have Python especially in mind) have been lazifying their standard libraries for years now. I think consistency is what brings fewer questions, not diversity where one of the possibilities corresponds to what the programmer wants. He'll ask about the difference anyway.Also, splitLines/lineSplitter can't actually be written in terms of split/splitter, because split/splitter does not have a way to provide multiple delimiters (let alone multiple delimiters where one includes the other, which is what you get with "\n" and "\r\n"). So, that distinction isn't going away. It's also a common enough operation that having a function for it rather than having to pass all of the delimiters to a more general function is arguably worth it, just like having the overload of split/splitter which takes no delimiter and then splits on whitespace is arguably worth it over having a more general function where you have to feed it every variation of whitespace. - Jonathan M Davis
May 28 2017
On Thursday, May 25, 2017 08:46:17 Steven Schveighoffer via Digitalmars-d- announce wrote:std.string, std.array, and std.algorithm all have cross-polination when it comes to array operations. It has to do with the history of when the modules were introduced.Not only that, but over time, there has been a push to generalize functions. So, something that might have originally gotten put in std.string (because you'd normally think of it as a string function) got moved to std.array, because it could easily be generalized to work on arrays in general and not just string operations (I believe that split is an example of this). And something which was in std.array or std.string might have been generalized for ranges in general, in which case, we ended up with a new function in std.algorithm (hence, we have splitter in std.algorithm but split in std.array). The end result tends to make sense if you understand that functions that only operate on strings go in std.string, functions that operate on dynamic arrays in general (but not ranges) go in std.array, and functions which could have gone in std.string or std.array except that they operate on ranges in general go in std.algorithm. But if you don't understand that, it tends to be quite confusing, and even if you do, it's often the case that when you want to find a function to operate on a string, you're going to need to look in std.string, std.array, and std.algorithm. So, in part, it's an evolution thing, and in part, it's often just plain hard to find stuff when you're focused on a specific use case, and the library writer is focused on making the function that you need as general as possible. - Jonathan M Davis
May 25 2017
On Wednesday, 24 May 2017 at 21:46:10 UTC, cym13 wrote:I am disappointed because there are so many good things to say about this, so many good questions or remarks to make when not familiar with the language, and yet all we get is "Meh, this benchmark shows nothing of D's speed against Python".Wouldn't be the first time https://news.ycombinator.com/item?id=10828450
May 24 2017
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes a D version of a command-line tool written in Python and incrementally improves its performance. The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/There are repeated references to the usage of D at Netflix for machine learning. It would be a very helpful boost if someone came up with a reference or a post about how D is used at Netflix, and the addition of Netflix to https://dlang.org/orgs-using-d.html would be amazing. References: https://news.ycombinator.com/item?id=14064012 https://news.ycombinator.com/item?id=14413546
May 25 2017
On 05/25/2017 08:30 AM, xtreak wrote:There are repeated references over usage of D at Netflix for machine learning. It will be a very helpful boost if someone comes up with any reference or a post regarding how D is used at Netflix and addition of Netflix to https://dlang.org/orgs-using-d.html will be amazing.I've used netflix. If its "suggestion" features are any indication, I'm not sure such a thing would be a feather in D's cap ;)
May 26 2017
On 05/24/2017 06:39 AM, Mike Parker wrote:Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/Inspired Nim version, found on Reddit: https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/ Ali
May 25 2017
On Thursday, 25 May 2017 at 22:04:36 UTC, Ali Çehreli wrote:On 05/24/2017 06:39 AM, Mike Parker wrote:Wow, the D blog post opened Pandora's box.Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/Inspired Nim version, found on Reddit: https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/ Ali
May 25 2017
On Friday, 26 May 2017 at 06:05:11 UTC, Basile B. wrote:On Thursday, 25 May 2017 at 22:04:36 UTC, Ali Çehreli wrote:I guess programmers will do comparisons of language speed independent of whether it makes sense for that problem.On 05/24/2017 06:39 AM, Mike Parker wrote:Wow, the D blog post opened Pandora's box.Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/Inspired Nim version, found on Reddit: https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/ Ali
May 26 2017
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes D version of a command-line tool written in Python and incrementally improves its performance. The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/I spent some time fiddling with my own manual approaches to making this as fast, wasn't satisfied and so decided to try using Steven's iopipe (https://github.com/schveiguy/iopipe) instead. Results were excellent. https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242 On my machine: python takes a little over 20s, pypy wobbles around 3.5s, v1 from the blog takes about 3.9s, v4b took 1.45s, a version of my own that is hideous* manages 0.78s at best, the above version with iopipe hits below 0.67s most runs. Not bad for a process that most people would call "IO-bound" (code for "I don't want to have to write fast code & it's all the disk's fault"). Obviously this version is a bit more code than is ideal, iopipe is currently quite "barebones", but I don't see why with some clever abstractions and wrappers it couldn't be the default thing that one does even for small scripts. *using byChunk and manually managing linesplits over chunks, very nasty.
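For the curious, the byChunk pattern John mentions looks roughly like this in its simplest form. This is a hedged sketch only (the chunk size, file name, and line-counting task are mine, not John's code, and a real tool must also stitch together lines that straddle chunk boundaries, which is where the nastiness he mentions comes from):

```d
import std.algorithm : count;
import std.stdio : File;

// Count newline bytes by reading fixed-size chunks: the basic shape of
// manual chunked processing, without the cross-chunk line handling a
// full tool would need.
size_t countLines(string path)
{
    size_t lines;
    foreach (ubyte[] chunk; File(path, "rb").byChunk(64 * 1024))
        lines += chunk.count(cast(ubyte) '\n');
    return lines;
}

void main()
{
    import std.file : remove, write;
    enum tmp = "chunk_demo.txt";  // hypothetical scratch file
    write(tmp, "a\tb\nc\td\ne\tf\n");
    scope (exit) remove(tmp);
    assert(countLines(tmp) == 3);
}
```

Libraries like iopipe exist precisely to take the boundary-handling burden off the user while keeping this chunked, low-allocation style.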
May 26 2017
On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:I spent some time fiddling with my own manual approaches to making this as fast, wasn't satisfied and so decided to try using Steven's iopipe (https://github.com/schveiguy/iopipe) instead. Results were excellent. https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary unicode in all common UTF encodings.
May 26 2017
On 5/26/17 11:20 AM, John Colvin wrote:On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:I worked a lot on making sure this works properly. However, it's possible that there are some lingering issues. I also did not spend much time optimizing these paths (whereas I spent a ton of time getting the utf8 line parsing as fast as it could be). Partly because finding things other than utf8 in the wild is rare, and partly because I have nothing to compare it with to know what is possible :) -SteveI spent some time fiddling with my own manual approaches to making this as fast, wasn't satisfied and so decided to try using Steven's iopipe (https://github.com/schveiguy/iopipe) instead. Results were excellent. https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary unicode in all common UTF encodings.
May 30 2017
On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer wrote:On 5/26/17 11:20 AM, John Colvin wrote:If you want UCS-2 (aka UTF-16 without surrogates) data I can give you gigabytes of files in tmx format.On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:I worked a lot on making sure this works properly. However, it's possible that there are some lingering issues. I also did not spend much time optimizing these paths (whereas I spent a ton of time getting the utf8 line parsing as fast as it could be). Partly because finding things other than utf8 in the wild is rare, and partly because I have nothing to compare it with to know what is possible :) -Steve[...]This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary unicode in all common UTF encodings.
May 30 2017
On 5/30/17 5:57 PM, Patrick Schluter wrote:On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer wrote:The data I can (and have) generated from UTF-8 data. I have tested my byLine parser to make sure it properly splits on "interesting" code points in all widths. UTF-16 data without surrogates should probably work fine. I haven't tuned it though like I tuned the UTF-8 version. Is there a memchr for wide characters? ;) What I really haven't done is compared my line parsing code with multi-code-unit delimiters against one that can do the same thing. I know Phobos and C FILE * really can't do it. I haven't really looked at all in C++, so I should probably look there before giving up. -SteveOn 5/26/17 11:20 AM, John Colvin wrote:If you want UCS-2 (aka UTF-16 without surrogates) data I can give you gigabytes of files in tmx format.On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:I worked a lot on making sure this works properly. However, it's possible that there are some lingering issues. I also did not spend much time optimizing these paths (whereas I spent a ton of time getting the utf8 line parsing as fast as it could be). Partly because finding things other than utf8 in the wild is rare, and partly because I have nothing to compare it with to know what is possible :)[...]This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary unicode in all common UTF encodings.
May 30 2017
On Tuesday, 30 May 2017 at 22:31:50 UTC, Steven Schveighoffer wrote:
> On 5/30/17 5:57 PM, Patrick Schluter wrote:
>> On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer wrote:
>>> On 5/26/17 11:20 AM, John Colvin wrote:
>>>> On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
>>>>> [...]
>>>> This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary unicode in all common UTF encodings.
>>> I worked a lot on making sure this works properly. However, it's possible that there are some lingering issues. I also did not spend much time optimizing these paths (whereas I spent a ton of time getting the utf8 line parsing as fast as it could be). Partly because finding things other than utf8 in the wild is rare, and partly because I have nothing to compare it with to know what is possible :)
>> If you want UCS-2 (aka UTF-16 without surrogates) data I can give you gigabytes of files in tmx format.
> The data I can (and have) generated from UTF-8 data. I have tested my byLine parser to make sure it properly splits on "interesting" code points in all widths.
>
> UTF-16 data without surrogates should probably work fine. I haven't tuned it though like I tuned the UTF-8 version. Is there a memchr for wide characters? ;)
>
> What I really haven't done is compared my line parsing code with multi-code-unit delimiters against one that can do the same thing. I know Phobos and C FILE * really can't do it. I haven't really looked at all in C++, so I should probably look there before giving up.
>
> -Steve

In any case, you can download the dataset from [1] if you like. There are several zip files of about 100 MB each, containing a collection of tmx (translation memory exchange) files with European legislation. The files contain multi-alignment texts in up to 24 languages and are encoded in UCS-2 little-endian. I know for a fact (because I compiled the data) that they don't contain characters outside of the BMP. The data is public and can be used freely (as in beer). When I get some time, I will try to port the Java app that is distributed with it to D (partially done already).

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
May 30 2017
On 5/31/17 1:09 AM, Patrick Schluter wrote:
> In any case, you can download the dataset from [1] if you like. There are several zip files of about 100 MB each, containing a collection of tmx (translation memory exchange) files with European legislation. The files contain multi-alignment texts in up to 24 languages and are encoded in UCS-2 little-endian. I know for a fact (because I compiled the data) that they don't contain characters outside of the BMP. The data is public and can be used freely (as in beer). When I get some time, I will try to port the Java app that is distributed with it to D (partially done already).
>
> [1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Thanks, I'll bookmark it for later use.

-Steve
May 31 2017
On 5/26/17 10:41 AM, John Colvin wrote:
> On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
>> Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes the D version of a command-line tool written in Python and incrementally improves its performance.
>>
>> The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/
>> Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
> I spent some time fiddling with my own manual approaches to making this as fast as possible, wasn't satisfied, and so decided to try using Steven's iopipe (https://github.com/schveiguy/iopipe) instead. Results were excellent.
>
> https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242

nice!

> /** something vaguely like this should be in iopipe, users shouldn't need to write it */
> auto ref runWithEncoding(alias process, FileT, Args...)(FileT file, auto ref Args args)

hm.... stealing for iopipe, thanks :) I'll need to dedicate another slide to you...

> On my machine: python takes a little over 20s, pypy wobbles around 3.5s, v1 from the blog takes about 3.9s, v4b took 1.45s, a version of my own that is hideous* manages 0.78s at best, the above version with iopipe hits below 0.67s most runs. Not bad for a process that most people would call "IO-bound" (code for "I don't want to have to write fast code & it's all the disk's fault").
>
> Obviously this version is a bit more code than is ideal, iopipe is currently quite "barebones", but I don't see why with some clever abstractions and wrappers it couldn't be the default thing that one does even for small scripts.

The idea behind iopipe is to give you the building blocks to create exactly the pipeline you need, without a lot of effort. Once you have those blocks, then you make higher-level functions out of them. Like you have above :)

BTW, there is a byLineRange function inside iopipe.textpipe that handles slicing off the newline character.

-Steve
May 30 2017
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
> Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes the D version of a command-line tool written in Python and incrementally improves its performance.
>
> The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/
> Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

Heh, happened to notice that this blog post now has 21 comments, with people posting links to versions in Go, C++, and Kotlin up till this week, months after the post went up! :D
Aug 08 2017
On Tuesday, 8 August 2017 at 21:51:30 UTC, Joakim wrote:
> On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
>> Some of you may remember Jon Degenhardt's talk from one of the Silicon Valley D meetups, where he described the performance improvements he saw when he rewrote some of eBay's command line tools in D. He has now put the effort into crafting a blog post on the same topic, where he takes the D version of a command-line tool written in Python and incrementally improves its performance.
>>
>> The blog: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/
>> Reddit: https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
> Heh, happened to notice that this blog post now has 21 comments, with people posting links to versions in Go, C++, and Kotlin up till this week, months after the post went up! :D

There was also a Haskell version on Reddit.
Aug 08 2017