digitalmars.D.announce - Updates to the tsv-utils toolkit
- Jon Degenhardt (23/23) Feb 22 2017 It's not quite a year since the open-sourcing of eBay's tsv
- Jack Stouffer (4/13) Feb 22 2017 Great news!
- Jon Degenhardt (4/17) Feb 22 2017 Agreed, an outstanding result. I had not anticipated the deltas.
- bpr (9/13) Feb 22 2017 On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt
- Jon Degenhardt (4/16) Feb 22 2017 Thanks! Both for the feedback and for any evaluation you might
- Joakim (4/13) Feb 22 2017 Nice writeup, somebody posting this to reddit or will that be
- Jon Degenhardt (7/30) Mar 04 2017 One more update: Schveiguy helped identify the performance
It's not quite a year since the open-sourcing of eBay's tsv utilities. Since then there have been a number of additions and updates, and the tools form a more complete package.

The tools assist with manipulation of tabular data files common in machine learning and data mining environments. They work alongside traditional Unix command line tools like 'cut', and 'sort'. They also fit well with data mining and stats packages like R and Pandas. The tools include filtering, slicing, joins and other manipulation, sampling, and statistical calculations. If you find yourself working with large data files from a unix shell, you may like these tools.

Speed matters when processing large data files, and these tools are fast. I've published new benchmarks comparing the tools to similar tools written in several native compiled programming languages. The tools are the fastest on five of the six benchmarks run, generally by significant margins. It's a good result for the D programming language. The benchmarks may be of interest regardless of your interest in the tools themselves.

Repository: https://github.com/eBay/tsv-utils-dlang
Performance benchmarks: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md

--Jon
Feb 22 2017
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt wrote:
> Speed matters when processing large data files, and these tools are fast. I've published new benchmarks comparing the tools to similar tools written in several native compiled programming languages. The tools are the fastest on five of the six benchmarks run, generally by significant margins. It's a good result for the D programming language.

Great news!

> The specialty toolkits have been anonymized in the tables below. The purpose of these benchmarks is to gauge performance of the D tools, not make comparisons between other toolkits.

You're no fun ;)
Feb 22 2017
On Wednesday, 22 February 2017 at 18:43:57 UTC, Jack Stouffer wrote:
> On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt wrote:
>> Speed matters when processing large data files, and these tools are fast. I've published new benchmarks comparing the tools to similar tools written in several native compiled programming languages. The tools are the fastest on five of the six benchmarks run, generally by significant margins. It's a good result for the D programming language.
> Great news!

Agreed, an outstanding result. I had not anticipated the deltas.

>> The specialty toolkits have been anonymized in the tables below. The purpose of these benchmarks is to gauge performance of the D tools, not make comparisons between other toolkits.
> You're no fun ;)

Yeah, I know. Not my style.
Feb 22 2017
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt wrote:
...snip...
> Repository: https://github.com/eBay/tsv-utils-dlang
> Performance benchmarks: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md
>
> --Jon

This is very nice code, and a good result for D. I'll study this carefully. So much of data analysis is reading/transforming files...

I wish you didn't anonymize the specialty toolkits. I think I understand why you chose to do so, but it makes the comparison less valuable.

Still, great work! Looking forward to a blogpost.
Feb 22 2017
On Wednesday, 22 February 2017 at 21:07:43 UTC, bpr wrote:
> On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt wrote:
> ...snip...
>> Repository: https://github.com/eBay/tsv-utils-dlang
>> Performance benchmarks: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md
>>
>> --Jon
> This is very nice code, and a good result for D. I'll study this carefully. So much of data analysis is reading/transforming files...
> ...snip...

Thanks! Both for the feedback and for any evaluation you might do. Any insights or thoughts you may have would be quite welcome.

--Jon
Feb 22 2017
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt wrote:
> It's not quite a year since the open-sourcing of eBay's tsv utilities. Since then there have been a number of additions and updates, and the tools form a more complete package. The tools assist with manipulation of tabular data files common in machine learning and data mining environments. They work alongside traditional Unix command line tools like 'cut', and 'sort'. They also fit well with data mining and stats packages like R and Pandas. [...]

Nice writeup, somebody posting this to reddit or will that be done with a future blog post?
Feb 22 2017
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt wrote:
> It's not quite a year since the open-sourcing of eBay's tsv utilities. Since then there have been a number of additions and updates, and the tools form a more complete package.
>
> The tools assist with manipulation of tabular data files common in machine learning and data mining environments. They work alongside traditional Unix command line tools like 'cut', and 'sort'. They also fit well with data mining and stats packages like R and Pandas. The tools include filtering, slicing, joins and other manipulation, sampling, and statistical calculations. If you find yourself working with large data files from a unix shell, you may like these tools.
>
> Speed matters when processing large data files, and these tools are fast. I've published new benchmarks comparing the tools to similar tools written in several native compiled programming languages. The tools are the fastest on five of the six benchmarks run, generally by significant margins. It's a good result for the D programming language. The benchmarks may be of interest regardless of your interest in the tools themselves.
>
> Repository: https://github.com/eBay/tsv-utils-dlang
> Performance benchmarks: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md
>
> --Jon

One more update: Schveiguy helped identify the performance bottleneck in the csv2tsv tool; the tools are now the fastest on all six benchmarks. Benchmarks have been updated (and reformatted a bit). Summary table here: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md#top-four-in-each-benchmark
Mar 04 2017