digitalmars.D - tolf and detab
- Walter Bright (104/104) Aug 06 2010 I wrote these two trivial utilities for the purpose of canonicalizing so...
- Andrei Alexandrescu (6/14) Aug 06 2010 [snip]
- Andrej Mitrovic (14/32) Aug 06 2010 Or improve your google-fu by finding some existing tools that do the job
- Walter Bright (2/4) Aug 06 2010 Sure, but I suspect it's faster to write the utility! After all, they ar...
- Walter Bright (2/4) Aug 06 2010 Some D2-fu would be cool. Any takers?
- Yao G. (5/22) Aug 06 2010 What does idiomatic D means?
- Andrei Alexandrescu (4/5) Aug 06 2010 At a quick glance - I'm thinking two elements would be using string and
- Nick Sabalausky (3/4) Aug 06 2010 "idiomatic D" -> "In typical D style"
- Jonathan M Davis (68/90) Aug 07 2010 I didn't try and worry about multiline string literals, but here are my ...
- bearophile (7/11) Aug 07 2010 Your code looks better.
- Andrei Alexandrescu (5/24) Aug 07 2010 I think it's worth targeting D2 to tasks that are usually handled by
- Andrei Alexandrescu (3/7) Aug 07 2010 That would be great so we can tune our approach. Thanks!
- bearophile (61/67) Aug 08 2010 In Python there is a helper module:
- Walter Bright (5/8) Aug 08 2010 So it is with byLine, too. You've burdened D with double the amount of a...
- Nick Sabalausky (3/11) Aug 08 2010 I thought byLine just re-uses the same buffer each time?
- bearophile (28/37) Aug 08 2010 I think you are wrong two times:
- Walter Bright (5/29) Aug 08 2010 If you want to conclude that Python is better at processing files, you n...
- bearophile (6/9) Aug 08 2010 byLine() yields a char[], so if you want to do most kinds of strings pro...
- Andrej Mitrovic (3/25) Aug 08 2010 Andrei used to!string() in an early example in TDPL for some line-by-lin...
- Andrei Alexandrescu (3/6) Aug 08 2010 For example, to!string(someString) does not duplicate the string.
- bearophile (3/4) Aug 08 2010 I don't know where the performance bug is, maybe it's a matter of GC, no...
- Yao G. (4/17) Aug 08 2010 What's next? Will you demand attribution like the time Andrei ...
- bearophile (4/6) Aug 08 2010 Of course. In the end all D will be mine ...
- Yao G. (5/12) Aug 08 2010 :D That was a good comeback.
- Andrei Alexandrescu (8/28) Aug 08 2010 Well I understand his frustration. I asked him for a comparison and he
- Andrei Alexandrescu (4/11) Aug 08 2010 Thanks for your analysis. Where does xio derive its performance
- bearophile (7/8) Aug 08 2010 I'd like to give you a good answer, but I can't. dlibs1 (that you can fo...
- Kagamin (2/6) Aug 08 2010 Don't you minimize heap allocation etc by reading whole file in one io c...
- bearophile (4/5) Aug 09 2010 The whole thread was about lazy read of file lines. If the file is very ...
- Michel Fortin (13/19) Aug 09 2010 For non-huge files that can fit in the memory space, I'd just
- Jonathan M Davis (5/22) Aug 09 2010 Well, you can just read the whole file in as a string with readText(), a...
- Andrei Alexandrescu (5/42) Aug 08 2010 I think at the end of the day, regardless the relative possibilities of
- bearophile (5/8) Aug 08 2010 For now I suggest you to aim to be just about as fast as Python in this ...
- Andrei Alexandrescu (3/10) Aug 08 2010 Why?
- bearophile (4/9) Aug 09 2010 Because it's a core functionality for Python so devs probably have optim...
- Andrei Alexandrescu (8/18) Aug 09 2010 Then we can do whatever they've done. It's not like they're using APIs
- dsimcha (22/33) Aug 08 2010 because reading the lines of a _normal_ text file is faster in Python co...
- Bruno Medeiros (17/23) Sep 30 2010 dsimcha wrote:
- bearophile (12/21) Sep 30 2010 This is an interesting topic of practical language design, it's a wide p...
- Bruno Medeiros (38/59) Oct 01 2010 I'm not so sure about that. Probably backwards-incompatible changes will...
- bearophile (16/29) Oct 01 2010 Ada has essentially died for several reasons, but in my opinion one of t...
- Pelle (20/49) Oct 01 2010 No, dynamic scoping is the crazy thing. Perl code:
- Bruno Medeiros (19/30) Oct 05 2010 There are a lot of things in a language that, if they make it harder to
- Nick Sabalausky (5/17) Aug 08 2010 I can respect that. Personally, though, I find a lot of value in not nee...
- Jonathan M Davis (5/34) Aug 07 2010 Actually, looking at the code again, that while loop really should be
- Andrei Alexandrescu (50/122) Aug 07 2010 Very nice. Here's how I'd improve removeTabs:
- Jonathan M Davis (29/86) Aug 07 2010 Ah. I needed to close the file. I pretty much always just use readText()...
- Walter Bright (3/5) Aug 07 2010 Because of asynchronous I/O, being able to start processing and start wr...
- Nick Sabalausky (4/7) Aug 08 2010 I'm fairly sure SVN doesn't commit touched files unless there are actual...
- Andrei Alexandrescu (3/12) Aug 08 2010 It doesn't, but it still shows them as changed etc.
- Leandro Lucarella (33/46) Aug 08 2010 Nope, not really:
- Norbert Nemec (3/107) Aug 08 2010 I usually do the same thing with a shell pipe
- Walter Bright (2/4) Aug 08 2010
- Nick Sabalausky (3/5) Aug 08 2010 Filed under "Why I don't like regex for non-trivial things" ;)
- Leandro Lucarella (17/24) Aug 08 2010 Those regex are non-trivial?
- Nick Sabalausky (13/24) Aug 08 2010 IMHO, A task has to be REALLY trivial to be trivial in regex ;)
- Walter Bright (3/5) Aug 08 2010 Regexes are like flying airplanes. You have to do them often or you get ...
- bearophile (21/23) Aug 08 2010 I have modified the code:
- Andrej Mitrovic (4/28) Aug 08 2010 What are you using to time the app? I'm using timeit (from the Windows
- bearophile (5/7) Aug 08 2010 If you run the benchmarks two times, the second time if you have enough ...
- Walter Bright (2/4) Aug 08 2010 Just run it several times until the times stop going down.
I wrote these two trivial utilities for the purpose of canonicalizing source code before checkins and to deal with FreeBSD's inability to deal with CRLF line endings, and because I can never figure out the right settings for git to make it do the canonicalization.

tolf - converts LF, CR, and CRLF line endings to LF.

detab - converts all tabs to the correct number of spaces. Assumes tabs are 8 column tabs. Removes trailing whitespace from lines.

Posted here just in case someone wonders what they are.
---------------------------------------------------------
/* Replace tabs with spaces, and remove trailing whitespace from lines. */

import std.file;
import std.path;

int main(string[] args)
{
    foreach (f; args[1 .. $])
    {
        auto input = cast(char[]) std.file.read(f);
        auto output = filter(input);
        if (output != input)
            std.file.write(f, output);
    }
    return 0;
}

char[] filter(char[] input)
{
    char[] output;
    size_t j;
    int column;

    for (size_t i = 0; i < input.length; i++)
    {
        auto c = input[i];
        switch (c)
        {
            case '\t':
                while ((column & 7) != 7)
                {
                    output ~= ' ';
                    j++;
                    column++;
                }
                c = ' ';
                column++;
                break;

            case '\r':
            case '\n':
                while (j && output[j - 1] == ' ')
                    j--;
                output = output[0 .. j];
                column = 0;
                break;

            default:
                column++;
                break;
        }
        output ~= c;
        j++;
    }
    while (j && output[j - 1] == ' ')
        j--;
    return output[0 .. j];
}
-----------------------------------------------------
/* Replace line endings with LF */

import std.file;
import std.path;

int main(string[] args)
{
    foreach (f; args[1 .. $])
    {
        auto input = cast(char[]) std.file.read(f);
        auto output = filter(input);
        if (output != input)
            std.file.write(f, output);
    }
    return 0;
}

char[] filter(char[] input)
{
    char[] output;
    size_t j;

    for (size_t i = 0; i < input.length; i++)
    {
        auto c = input[i];
        switch (c)
        {
            case '\r':
                c = '\n';
                break;

            case '\n':
                if (i && input[i - 1] == '\r')
                    continue;
                break;

            case 0:
                continue;

            default:
                break;
        }
        output ~= c;
        j++;
    }
    return output[0 .. j];
}
------------------------------------------
Aug 06 2010
On 08/06/2010 08:34 PM, Walter Bright wrote:

    I wrote these two trivial utilities for the purpose of canonicalizing source code before checkins and to deal with FreeBSD's inability to deal with CRLF line endings, and because I can never figure out the right settings for git to make it do the canonicalization.

    tolf - converts LF, CR, and CRLF line endings to LF.

    detab - converts all tabs to the correct number of spaces. Assumes tabs are 8 column tabs. Removes trailing whitespace from lines.

    Posted here just in case someone wonders what they are.

[snip]

Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences.

Andrei
Aug 06 2010
Or improve your google-fu by finding some existing tools that do the job right. :)

I'm pretty sure Uncrustify is good at most of these issues, not to mention it's a very nice source-code "prettifier/indenter". There's a front-end called UniversalIndentGUI, which has about a dozen integrated versions of source-code prettifiers (including uncrustify, and for many languages). It has various settings on the left, and a togglable *Live* preview mode which you can view on the right. I invite you guys to try it out sometime:
http://universalindent.sourceforge.net/

(+ you can save different settings, which is neat when you're coding for different projects that have different "code design & look" standards)

On Sat, Aug 7, 2010 at 3:50 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

    [snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences.
Aug 06 2010
Andrej Mitrovic wrote:

    Or improve your google-fu by finding some existing tools that do the job right. :)

Sure, but I suspect it's faster to write the utility! After all, they are trivial.
Aug 06 2010
Andrei Alexandrescu wrote:

    A good exercise would be rewriting these tools in idiomatic D2 and assess the differences.

Some D2-fu would be cool. Any takers?
Aug 06 2010
What does idiomatic D mean?

On Fri, 06 Aug 2010 20:50:52 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

    [snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 06 2010
On 08/06/2010 09:33 PM, Yao G. wrote:

    What does idiomatic D mean?

At a quick glance - I'm thinking two elements would be using string and possibly byLine.

Andrei
Aug 06 2010
"Yao G." <nospamyao gmail.com> wrote in message news:op.vg1qpcjfxeuu2f miroslava.gateway.2wire.net...What does idiomatic D means?"idiomatic D" -> "In typical D style"
Aug 06 2010
On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:

    [snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences.

I didn't try and worry about multiline string literals, but here are my more idiomatic solutions:

detab:

/* Replace tabs with spaces, and remove trailing whitespace from lines. */

import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);

    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;

            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];

            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}

-------------------------------------------

The three differences between mine and Walter's are that mine takes the tab size as the first argument, it doesn't put a newline at the end of the file, and it writes the file even if nothing changed (you could test for that, but when using byLine(), it's a bit harder). Interestingly enough, from the few tests that I ran, mine seems to be somewhat faster. I also happen to think that the code is clearer (it's certainly shorter), though that might be up for debate.

-------------------------------------------

tolf:

/* Replace line endings with LF */

import std.file;
import std.string;

void main(string[] args)
{
    foreach(f; args[1 .. $])
        fixEndLines(f);
}

void fixEndLines(string fileName)
{
    auto fileStr = std.file.readText(fileName);
    auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");
    std.file.write(fileName, result);
}

-------------------------------------------

This version is ludicrously simple. And it was also faster than Walter's in the few tests that I ran. In either case, I think that it is definitely clearer code.

I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.

- Jonathan M Davis
Aug 07 2010
Jonathan M Davis:

    I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.

Your code looks better.

My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, such as Python :-) In this case a Python version is more readable, shorter and probably faster too, because reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request).

On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.

Bye, bearophile
Aug 07 2010
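For concreteness, a minimal Python sketch of the two tools bearophile has in mind might look like this. This is an illustration, not code from the thread; the function names are invented, and the eight-column tab default mirrors Walter's detab.

```python
import sys

def tolf(text):
    # Normalize CRLF and lone CR line endings to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def detab(text, tab_size=8):
    # Expand tabs to spaces and strip trailing whitespace from each line.
    lines = [line.expandtabs(tab_size).rstrip() for line in text.split("\n")]
    return "\n".join(lines)

if __name__ == "__main__":
    for name in sys.argv[1:]:
        # newline="" keeps the raw CR/CRLF endings so tolf can see them.
        with open(name, newline="") as f:
            original = f.read()
        fixed = detab(tolf(original))
        if fixed != original:
            with open(name, "w", newline="\n") as f:
                f.write(fixed)
```

Like Walter's versions, this only rewrites a file when something actually changed, which keeps version-control tools from seeing untouched files as modified.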
On 08/07/2010 11:16 PM, bearophile wrote:

    My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-) [snip]

I think it's worth targeting D2 to tasks that are usually handled by scripting languages. I've done a lot of that and it beats the hell out of rewriting in D a script that's grown out of control.

Andrei
Aug 07 2010
On 08/07/2010 11:16 PM, bearophile wrote:

    In this case a Python version is more readable, shorter and probably faster too because reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request).

That would be great so we can tune our approach. Thanks!

Andrei
Aug 07 2010
Andrei Alexandrescu:

    This makes me think we should have a range that detects and replaces patterns lazily and on the fly.

In Python there is a helper module:
http://docs.python.org/library/fileinput.html

    I think it's worth targeting D2 to tasks that are usually handled by scripting languages. I've done a lot of that and it beats the hell out of rewriting in D a script that's grown out of control

Dynamic languages are handy but they require some rigour when you program. Python is probably unfit for writing one-million-line programs, but if you train yourself a little and keep your code clean, you usually become able to write clean largish programs in Python.

    That would be great so we can tune our approach. Thanks!

In my dlibs I have the xio module that reads by lines efficiently; it was faster than iterating on the lines of BufferedFile. There are tons of different benchmarks that you may use, but a simple one to start is better, one that just iterates the file lines. See below.

Related: experiments have shown that the (oldish) Java GC improves its performance if it is able to keep strings (that are immutable) in a separate memory pool, _and_ be able to recognize duplicated strings, of course keeping only one string for each equality set.
It's positive to do a similar experiment with the D GC, but first you need applications that use the GC to test if this idea is an improvement :-) So I have used a minimal benchmark:

--------------------------
from sys import argv

def process(file_name):
    total = 0
    for line in open(file_name):
        total += len(line)
    return total

print "Total:", process(argv[1])
--------------------------
// D2 code
import std.stdio: File, writeln;

int process(string fileName) {
    int total = 0;
    auto file = File(fileName);

    foreach (rawLine; file.byLine()) {
        string line = rawLine.idup;
        total += line.length;
    }

    file.close();
    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}
--------------------------

In the D code I have added an idup to make the comparison more fair, because in the Python code the "line" is a true newly allocated line, you can safely use it as dictionary key. I have used Python 2.7 with no Psyco JIT (http://psyco.sourceforge.net/ ) to speed up the Python code because it's not available yet for Python 2.7. D code compiled with dmd 2.047, optimized build.

As test text data I have used a concatenation of all text files here (they are copyrighted, but freely usable):
http://gnosis.cx/TPiP/

The result on Windows is a file of 1_116_552 bytes. I have attached the file to itself, duplicating its length, some times; the result is a file of 71_459_328 bytes (this is not a fully realistic case because you often have many small files to read instead of a very large one).

The timings are taken with warm disk cache, so the data is essentially read from RAM. This is not fully realistic, but if you want to write a benchmark you have to do this, because for me it's very hard on Windows to make sure that the disk cache is fully empty. So it's better to do the opposite and benchmark a warm file.

The output of the Python code is:
Total: 69789888
Found in 0.88 seconds (best of 6, the variance is minimal).
The output of the D code is:
Total: 69789888
Found in 1.28 seconds (best of 6, minimal variance).

If in the D2 code I comment out the idup like this:

foreach (rawLine; file.byLine()) {
    total += rawLine.length;
}

the output of the D code without idup is:
Total: 69789888
Found in 0.75 seconds (best of 6, minimal variance).

As you see it's a matter of GC efficiency too. Beside the GC, the cause of the higher performance of the Python code comes from a tuned design; you can see the function getline_via_fgets here:
http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=81275&view=markup

It uses a "stack buffer" (char buf[MAXBUFSIZE]; where MAXBUFSIZE is 300) too.

Bye, bearophile
Aug 08 2010
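A self-contained sketch of the Python side of such a benchmark might look like the following. It uses a small generated file instead of the ~70 MB input described above, so the file, its contents, and the timing are only illustrative.

```python
import os
import tempfile
import time

def process(file_name):
    # Sum the lengths of all lines (newline included), reading lazily.
    total = 0
    with open(file_name) as f:
        for line in f:
            total += len(line)
    return total

# Build a small throwaway input file; bearophile's run used a ~70 MB file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\nworld\n" * 1000)
    path = tmp.name

start = time.perf_counter()
total = process(path)
elapsed = time.perf_counter() - start
os.remove(path)
print(total)  # 12000 for this input
```

As bearophile notes, a warm-cache run like this mostly measures line iteration and allocation, not disk speed, which is exactly what the byLine comparison is about.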
bearophile wrote:

    In the D code I have added an idup to make the comparison more fair, because in the Python code the "line" is a true newly allocated line, you can safely use it as dictionary key.

So it is with byLine, too. You've burdened D with double the amount of allocations.

Also, I object in general to this method of making things "more fair". Using a less efficient approach in X because Y cannot use such an approach is not a legitimate comparison.
Aug 08 2010
"Walter Bright" <newshound2 digitalmars.com> wrote in message news:i3mpnb$2hcf$1 digitalmars.com...bearophile wrote:I thought byLine just re-uses the same buffer each time?In the D code I have added an idup to make the comparison more fair, because in the Python code the "line" is a true newly allocated line, you can safely use it as dictionary key.So it is with byLine, too. You've burdened D with double the amount of allocations.
Aug 08 2010
Walter Bright:

    bearophile wrote:

        In the D code I have added an idup to make the comparison more fair, because in the Python code the "line" is a true newly allocated line, you can safely use it as dictionary key.

    So it is with byLine, too. You've burdened D with double the amount of allocations.

I think you are wrong two times:

1) byLine() doesn't return a newly allocated line, you can see it with this small program:

import std.stdio: File, writeln;

void main(string[] args) {
    char[][] lines;
    auto file = File(args[1]);

    foreach (rawLine; file.byLine()) {
        writeln(rawLine.ptr);
        lines ~= rawLine;
    }

    file.close();
}

Its output shows that all "strings" (char[]) share the same pointer:

14E5E00
14E5E00
14E5E00
14E5E00
...

2) You can't use the result of byLine() as string key for an associative array, as I have said you can in Python. Currently you can, but according to Andrei this is a bug. And if it's not a bug then I'll reopen this closed bug 4474:
http://d.puremagic.com/issues/show_bug.cgi?id=4474

    Also, I object in general to this method of making things "more fair". Using a less efficient approach in X because Y cannot use such an approach is not a legitimate comparison.

I generally agree, but this is not the case. In some situations you indeed don't need a newly allocated string for each loop, because for example you just want to read them and process them and not change/store them. You can't do this in Python, but this is not what I want to test. As I have explained in bug 4474, this behaviour is useful but it is acceptable only if explicitly requested by the programmer, and not as the default one. The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as key for an associative array (if your associative array is declared as int[char[]] then it can accept such byLine() lines as keys, but you can clearly see those aren't strings. This is why I have closed bug 4474).

Bye, bearophile
Aug 08 2010
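Python's text lines are always fresh objects, but the buffer-reuse hazard bearophile demonstrates has a rough Python analogue: reading into one preallocated bytearray with readinto. This is a hypothetical illustration, not code from the thread; storing the buffer object itself is the byLine-style alias, and copying it is the equivalent of idup.

```python
import io

def read_chunks(stream, size=4):
    # One preallocated buffer, reused for every read, like byLine's char[].
    buf = bytearray(size)
    aliased = []   # stores the SAME object every time (the aliasing bug)
    copied = []    # snapshots, the moral equivalent of idup in the D code
    while True:
        n = stream.readinto(buf)
        if n == 0:
            break
        aliased.append(buf)
        copied.append(bytes(buf[:n]))
    return aliased, copied

aliased, copied = read_chunks(io.BytesIO(b"abcdefgh"))
print(copied)   # [b'abcd', b'efgh']
print(aliased)  # both entries alias one buffer, which now holds b'efgh'
```

After the loop, every entry of `aliased` shows the buffer's final contents, which is exactly why storing byLine's char[] without idup is dangerous.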
bearophile wrote:

    Walter Bright:

        So it is with byLine, too. You've burdened D with double the amount of allocations.

    I think you are wrong two times: 1) byLine() doesn't return a newly allocated line [snip demonstration program] Its output shows that all "strings" (char[]) share the same pointer.

eh, you're right. the phobos documentation for byLine needs to be fixed.

    You can't do this in Python, but this is not what I want to test.

If you want to conclude that Python is better at processing files, you need to show it using each language doing it a way well suited to that language, rather than burdening one so it uses the same method as the less powerful one.
Aug 08 2010
Walter Bright:

    If you want to conclude that Python is better at processing files, you need to show it using each language doing it a way well suited to that language, rather than burdening one so it uses the same method as the less powerful one.

byLine() yields a char[], so if you want to do most kinds of string processing, or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. You can of course use the raw char[], but then you lose the advantages advertised when you have introduced the safer immutable D2 strings. And in many situations you have to dup the char[] anyway, otherwise you have all kinds of bugs that Python lacks. In D1, to avoid them, I used to use dup more often than necessary. I have explained this in bug 4474.

In this newsgroup my purpose is to show D faults, suggest improvements, etc. In this case my purpose was just to show that byLine()+idup is slow. And you have to be thankful for my benchmarks. In my dlibs1 for D1 I have a xio module that reads files by line that is faster than iterating on a BufferedFile, so it's not a limit of the language, it's Phobos that has a performance bug that can be improved.

Bye, bearophile
Aug 08 2010
Andrei used to!string() in an early example in TDPL for some line-by-line processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.

On Sun, Aug 8, 2010 at 11:44 PM, bearophile <bearophileHUGS lycos.com> wrote:

    byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. [snip]

    Bye, bearophile
Aug 08 2010
On 08/08/2010 04:48 PM, Andrej Mitrovic wrote:

    Andrei used to!string() in an early example in TDPL for some line-by-line processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.

For example, to!string(someString) does not duplicate the string.

Andrei
Aug 08 2010
    so it's not a limit of the language, it's Phobos that has a performance bug that can be improved.

I don't know where the performance bug is, maybe it's a matter of GC, not a Phobos performance bug.

Bye, bearophile
Aug 08 2010
On Sun, 08 Aug 2010 16:44:09 -0500, bearophile <bearophileHUGS lycos.com> wrote:

    byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. [snip] And you have to [be] thankful for my benchmarks. [snip]

<g> What's next? Will you demand attribution like the time Andrei presented the ranges design?
Aug 08 2010
Yao G.:

    <g> What's next? Will you demand attribution like the time Andrei presented the ranges design?

Of course. In the end all D will be mine <evil laugh with echo effects> :-)

Bye, bearophile
Aug 08 2010
On Sun, 08 Aug 2010 17:27:04 -0500, bearophile <bearophileHUGS lycos.com> wrote:

    Of course. In the end all D will be mine <evil laugh with echo effects> :-)

:D That was a good comeback.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 08 2010
On 08/08/2010 05:17 PM, Yao G. wrote:

    <g> What's next? Will you demand attribution like the time Andrei presented the ranges design?

Well I understand his frustration. I asked him for a comparison and he took the time to write one and play with it. I think the proper answer to that is to see what we can do to improve the situation, not defend the status quo. Whatever the weaknesses of the benchmark are they should be fixed, and then whatever weaknesses the library has they should be addressed.

Andrei
Aug 08 2010
On 08/08/2010 04:44 PM, bearophile wrote:

    In my dlibs1 for D1 I have a xio module that reads files by line that is faster than iterating on a BufferedFile, so it's not a limit of the language, it's Phobos that has a performance bug that can be improved.

Thanks for your analysis. Where does xio derive its performance advantage from?

Andrei
Aug 08 2010
Andrei:Where does xio derive its performance advantage from?

I'd like to give you a good answer, but I can't. dlibs1 (which you can still find online) has a Python License, so to create xio.xfile() I have just translated to D1 the C code of CPython's file object implementation that I have already linked here. I think it minimizes heap allocations, and the performance is tuned for a line length found to be the "average one" for normal files. So I presume if your text file has very short lines (like 5 chars each) or very long ones (like 1000 chars each) it becomes less efficient. So it's probably a matter of good usage of the C I/O functions and probably more efficient management by the GC. Phobos is Boost-licensed, but I don't think Python devs can get mad if you take a look at how Python reads lines lazily :-) Someone has tried to implement a Python-style associative array in a similar way. Bye, bearophile
Aug 08 2010
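As a reference point for the buffered approach bearophile describes (amortizing I/O over chunk-sized reads instead of one call per line), here is a minimal sketch in Python; the function name and chunk size are invented for illustration:

```python
def lines_chunked(path, chunk_size=8192):
    """Yield lines from `path` by reading fixed-size chunks and splitting
    them, so the number of read() calls grows with file size rather than
    with line count."""
    buf = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            parts = buf.split("\n")
            buf = parts.pop()  # keep the (possibly incomplete) last piece
            for line in parts:
                yield line + "\n"
    if buf:
        yield buf  # trailing line with no final newline
```

CPython's own readline does roughly this in C, which is presumably why a tuned D translation of it (as in xio) can beat a naive line-at-a-time loop.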
bearophile Wrote:I think it minimizes heap allocations, and the performance is tuned for a line length found to be the "average one" for normal files. So I presume if your text file has very short lines (like 5 chars each) or very long ones (like 1000 chars each) it becomes less efficient. So it's probably a matter of good usage of the C I/O functions and probably more efficient management by the GC.

Don't you minimize heap allocations etc. by reading the whole file in one I/O call?
Aug 08 2010
Kagamin:Don't you minimize heap allocation etc by reading whole file in one io call?The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once. Bye, bearophile
Aug 09 2010
On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS lycos.com> said:Kagamin:Don't you minimize heap allocations etc. by reading the whole file in one I/O call?The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.

For non-huge files that can fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice and keep the slices around (yeah!). The virtual memory system will take care of loading the file's contents as you read from its memory space, so the file isn't loaded all at once. But that's not compatible with the C file I/O functions. Does Python use C file I/O calls when reading from a file? If not, perhaps that's why it's faster. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Aug 09 2010
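Michel's memory-mapping idea can be sketched with Python's mmap module (the helper name is invented here; note that slicing a Python mmap copies bytes out, unlike D's aliasing array slices, so this only approximates the zero-copy scheme he describes):

```python
import mmap

def first_line_mapped(path):
    """Memory-map `path` read-only and slice out its first line.
    Pages are faulted in by the VM system on access, so the whole
    file is never explicitly read()."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            end = mm.find(b"\n")
            return mm[:end] if end != -1 else mm[:]
        finally:
            mm.close()
```

In D the analogous tool would be std.mmfile, where slices of the mapped region really do alias the file's pages.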
On Monday, August 09, 2010 05:30:33 Michel Fortin wrote:On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS lycos.com> said:Well, you can just read the whole file in as a string with readText(), and any slices to that could stick around, but presumably, that's using the C file I/O calls underneath. - Jonathan M DavisKagamin:For non-huge files that can fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice and keep the slices around (yeah!). The virtual memory system will take care of loading the file content's as you read from its memory space, so the file isn't loaded all at once. But that's not compatible with the C file IO functions. Does Python uses C file IO calls when reading from a file? If not, perhaps that's why it's faster.Don't you minimize heap allocation etc by reading whole file in one io call?The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.
Aug 09 2010
On 08/08/2010 02:32 PM, bearophile wrote:Walter Bright:I think at the end of the day, regardless of the relative possibilities of file reading in the two languages, we should be faster than Python when allocating one new string per line. Andrei

bearophile wrote:

I think you are wrong on two counts:

1) byLine() doesn't return a newly allocated line; you can see it with this small program:

import std.stdio: File, writeln;

void main(string[] args)
{
    char[][] lines;
    auto file = File(args[1]);

    foreach (rawLine; file.byLine())
    {
        writeln(rawLine.ptr);
        lines ~= rawLine;
    }

    file.close();
}

Its output shows that all "strings" (char[]) share the same pointer:

14E5E00
14E5E00
14E5E00
...

2) You can't use rawLine (the char[] yielded by byLine()) as a string key for an associative array, as I have said you can in Python. Currently you can, but according to Andrei this is a bug. And if it's not a bug then I'll reopen this closed bug 4474: http://d.puremagic.com/issues/show_bug.cgi?id=4474

In the D code I have added an idup to make the comparison more fair, because in the Python code the "line" is a true newly allocated line, you can safely use it as dictionary key.

So it is with byLine, too. You've burdened D with double the amount of allocations.

Also, I object in general to this method of making things "more fair". Using a less efficient approach in X because Y cannot use such an approach is not a legitimate comparison.

I generally agree, but this is not the case. In some situations you indeed don't need a newly allocated string for each loop, because for example you just want to read them and process them and not change/store them. You can't do this in Python, but this is not what I want to test. As I have explained in bug 4474 this behaviour is useful, but it is acceptable only if explicitly requested by the programmer, and not as the default one.
The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as key for an associative array (if your associative array is declared as int[char[]] then it can accept such rawLine() as keys, but you can clearly see those aren't strings. This is why I have closed bug 4474). Bye, bearophile
Aug 08 2010
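The contrast bearophile draws can be shown from the Python side too: iterating a Python file yields a freshly allocated str per line, so lines can be stored or used as dict keys with no idup-style copy (the helper name is invented for the demonstration):

```python
def collect_lines(path):
    """Iterate a file and keep every line plus its object identity.
    Each line is a new str object -- unlike D's byLine(), which reuses
    one char[] buffer across iterations."""
    ids, lines = [], []
    with open(path) as f:
        for line in f:
            ids.append(id(line))
            lines.append(line)
    return ids, lines
```

Since every id is distinct, a dict built from the lines is safe without copying, which is exactly the allocation the D version only pays for when it calls idup.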
Andrei Alexandrescu:I think at the end of the day, regardless of the relative possibilities of file reading in the two languages, we should be faster than Python when allocating one new string per line.For now I suggest you aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy. (Later someday I'd also like D AAs to become about as fast as Python dicts.) Bye, bearophile
Aug 08 2010
On 08/08/2010 10:29 PM, bearophile wrote:Andrei Alexandrescu:Why? AndreiI think at the end of the day, regardless the relative possibilities of file reading in the two languages, we should be faster than Python when allocating one new string per line.For now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy.
Aug 08 2010
Andrei Alexandrescu:Because it's a core functionality for Python so devs probably have optimized it well, it's written in C, and in this case there is very little interpreter overhead. Bye, bearophileFor now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy.Why?
Aug 09 2010
bearophile wrote:Andrei Alexandrescu:Then we can do whatever they've done. It's not like they're using APIs nobody heard of. It seems such a comparison of file I/O speed becomes in fact a comparison of garbage collectors. That's fine, but in that case the notion that D offers the possibility to avoid allocation should come back to the table. AndreiBecause it's a core functionality for Python so devs probably have optimized it well, it's written in C, and in this case there is very little interpreter overhead.For now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy.Why?
Aug 09 2010
== Quote from bearophile (bearophileHUGS lycos.com)'s articleJonathan M Davis:those text "scripts" is to use a scripting language, as Python :-)I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution forIn this case a Python version is more readable, shorter and probably faster toobecause reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request).On the other hand D2 is in its debugging phase, so it's good to use it even forpurposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.Bye, bearophileI disagree completely. D is clearly designed from the "simple things should be simple and complicated things should be possible" point of view. If it doesn't work well for these kinds of short scripts then we've failed at making simple things simple and we're just like every other crappy "large scale, industrial strength" language like Java and C++ that's great for megaprojects but makes simple things complicated. That said, I think D does a great job in this regard. I actually use Python as my language of second choice for things D isn't good at. Mostly this means needing Python's huge standard library, needing 64-bit support, or needing to share my code with people who don't know D. Needing to write a very short script tends not to be a reason for me to switch over. 
It's not that rare for me to start with a short script and then end up adding something that needs performance to it (like Monte Carlo simulation of a null probability distribution) and I don't find D substantially harder to use for these cases.
Aug 08 2010
On 08/08/2010 14:31, dsimcha wrote:I disagree completely. D is clearly designed from the "simple things should be simple and complicated things should be possible" point of view. If it doesn't work well for these kinds of short scripts then we've failed at making simple things simple and we're just like every other crappy "large scale, industrial strength" language like Java and C++ that's great for megaprojects but makes simple things complicated.dsimcha wrote: "I hate Java and every programming language where a readable hello world takes more than 3 SLOC" That may be your preference, but other people here in the community, me at least, very much want D to be a "large scale, industrial strength" language that's great for megaprojects. I think that medium and large scale projects are simply much more important and interesting than small scale ones. I am hoping this would become an *explicit* point of D design goals, if it isn't already. And I will campaign against (so to speak) people like you who think small scale is more important. No personal animosity intended though. Note: I am not stating that it is not possible to be good, even great, at both things (small and medium/large scale). -- Bruno Medeiros - Software Engineer
Sep 30 2010
Bruno Medeiros:I think that medium and large scale projects are simply much more important and interesting than small scale ones.I am hoping this would become an *explicit* point of D design goals, if it isn't already. And I will campaign against (so to speak) people like you who think small scale is more important. No personal animosity intended though.Note: I am not stating that it is not possible to be good, even great, at both things (small and medium/large scale).This is an interesting topic of practical language design; it's a wide problem and I can't have complete answers. D2 design is mostly done, only small parts may be changed now, so those campaigns probably can't change D2 design much. The name of the Scala language means that it is meant to be a scalable language; this means it is designed to be useful and usable for both large and quite small programs. A language like Ada is not a bad language. Programming practice shows that in many situations debug time is the larger percentage of the development of a program. So minimizing debug time is usually a very good thing. Ada tries hard to avoid many common bugs, much more than D (it has ranged integers, integer overflow checks, it defines portable floating point semantics (though there is a way to use the faster IEEE semantics), it forces you to use clear interfaces between modules (much more explicit ones than D's), it never silently changes variable types, its semantics is fully specified, there are very precise Ada semantics specs, all Ada compilers must pass a very large test suite, and so on and on). In practice the language is able to catch many bugs before they happen. So if you want to write a program in critical situations, like important control systems, Ada is a language better than Perl, and probably better than D too :-) Yet, writing programs in Ada is not handy; if you need to write small programs you need a lot of boilerplate code that is useful only in larger programs. 
And Ada is a Pascal-like language that many modern programmers don't know/like. Ada looks designed for larger, low-bug-count, costly (and often well planned out from the beginning, with no specs that change with time) programs, but it's not handy to write small programs. Probably Ada is not the best language to write web code that has to change all the time. Today Ada is not a dead language, but it smells funny, it's not commonly used. Andrei has expressed the desire to use D2 as a language to write script-like programs too. I think in most cases a language like Python is better than D2 to write small script-like programs, yet I agree with Andrei that it's good to try to make D2 language fit to write small script-like programs too, because to write such programs you need a very handy language, that catches/avoids many simple common bugs quickly, gives you excellent error messages/stack traces, and allows you to do common operations on web/text files/images/sounds/etc in few lines of code. My theory is that later those qualities turn out to be useful even in large programs. I think such qualities may help D avoid the Ada fate. The ability to write small programs with D is also useful to attract programmers to D, because if in your language you need to write 30 lines long programs to write a "hello world" on the screen then newcomers are likely to stop using that language after their first try. Designing a language that is both good for small and large programs is not easy, but it is a worth goal. D module system must be debugged & finished & improved to improve the usage of D for larger programs. Some features of the unittesting and design by contract currently missing are very useful if you want to use D to write large programs. If you want to write large programs reliability becomes an important concern, so integer overflow tests and some system to avoid null-related bugs (not-nullable types and more) become useful or very useful. Bye, bearophile
Sep 30 2010
On 30/09/2010 19:31, bearophile wrote:Bruno Medeiros:I think that medium and large scale projects are simply much more important and interesting than small scale ones.I am hoping this would become an *explicit* point of D design goals, if it isn't already. And I will campaign against (so to speak) people like you who think small scale is more important. No personal animosity intended though.Note: I am not stating that it is not possible to be good, even great, at both things (small and medium/large scale).This is an interesting topic of practical language design, it's a wide problem and I can't have complete answers. D2 design is mostly done, only small parts may be changed now, so those campaigns probably can't change D2 design much.I'm not so sure about that. Probably backwards-incompatible changes will be very few, if any. But there can be backwards-compatible changes, or changes to stuff that was not mentioned in TDPL. And there may be a D3 eventually (a long way down the road though) But my main worry is not language changes, I actually think it's very unlikely Walter and Andrei would do a language change that intentionally would adversely affect medium/large scale programs in favor of small scale programs. My main issue is with the time and thinking resources that are expended here in the NG when people argue for changes (or against other changes) with the intention of favoring small-scale programs. If this were explicit in the D design goals, it would help save us from these discussions (which affect NG readers, not just posters).The name of the Scala language means that it is meant to be a scalable language, this means it is designed to be useful and usable for both large and quite small programs.Whoa wait. From my understanding, Scala is a "scalable language" in the sense that it is easy to add new language features, or something similar to that. 
But let's be clear, that's not what I'm talking about, and neither is scalability of program data/inputs/performance. I'm talking about scalability of source code, software components, developers, teams, requirements, planning changes, project management issues, etc.A language like Ada is not a bad language. Programming practice shows that in many situations debug time is the larger percentage of the development of a program. So minimizing debug time is usually a very good thing. Ada tries hard to avoid many common bugs, much more than D (it has ranged integers, integer overflow checks, it defines portable floating point semantics (though there is a way to use the faster IEEE semantics), it forces you to use clear interfaces between modules (much more explicit ones than D's), it never silently changes variable types, its semantics is fully specified, there are very precise Ada semantics specs, all Ada compilers must pass a very large test suite, and so on and on). In practice the language is able to catch many bugs before they happen. So if you want to write a program in critical situations, like important control systems, Ada is a language better than Perl, and probably better than D too :-) Yet, writing programs in Ada is not handy, if you need to write small programs you need a lot of boilerplate code that is useful only in larger programs. And Ada is a Pascal-like language that many modern programmers don't know/like. Ada looks designed for larger, low-bug-count, costly (and often well planned out from the beginning, with no specs that change with time) programs, but it's not handy to write small programs. Probably Ada is not the best language to write web code that has to change all the time. Today Ada is not a dead language, but it smells funny, it's not commonly used.Certainly it's not just web code that can change all the time. 
But I'm missing your point here, what does Ada have to do with this?Andrei has expressed the desire to use D2 as a language to write script-like programs too. I think in most cases a language like Python is better than D2 to write small script-like programs, yet I agree with Andrei that it's good to try to make D2 language fit to write small script-like programs too, because to write such programs you need a very handy language, that catches/avoids many simple common bugs quickly, gives you excellent error messages/stack traces, and allows you to do common operations on web/text files/images/sounds/etc in few lines of code. My theory is that later those qualities turn out to be useful even in large programs. I think such qualities may help D avoid the Ada fate. The ability to write small programs with D is also useful to attract programmers to D, because if in your language you need to write 30 lines long programs to write a "hello world" on the screen then newcomers are likely to stop using that language after their first try. Designing a language that is both good for small and large programs is not easy, but it is a worth goal. D module system must be debugged& finished& improved to improve the usage of D for larger programs. Some features of the unittesting and design by contract currently missing are very useful if you want to use D to write large programs. If you want to write large programs reliability becomes an important concern, so integer overflow tests and some system to avoid null-related bugs (not-nullable types and more) become useful or very useful. Bye, bearophileYeah, I actually think D (or any other language under design) can be quite good at both things. Maybe something like 90% of features that are good for large-scale programs are also good for small-scale ones. One of the earliest useful programs I wrote in D, was a two-page bash shell script that I converted to D. 
Even though it was just about two pages, it was already hard to extend and debug. After converting it to D, with the right shortcut methods and abstractions, the code actually managed to be quite succinct and comparable, I suspect, to code in Python or Perl, or languages like that. (I say suspect because I don't actually know much about Python or Perl, but I simply didn't see many language changes that could have made my D more succinct, barring crazy stuff like dynamic scoping) -- Bruno Medeiros - Software Engineer
Oct 01 2010
Bruno Medeiros:From my understanding, Scala is a "scalable language" in the sense that it is easy to add new language features, or something similar to that.I see. You may be right.But I'm missing your point here, what does Ada have to do with this?Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.but I simply didn't see many language changes that could have made my D more succinct,Making a language more succinct is easy, you may take a look at the J or K languages. The hard thing is to design a succinct language that is also readable and not bug-prone. Python has some features that make the code longer, like the obligatory "self." before class instance names and the optional usage of argument names at the call site. Python's ternary operator is longer too, as is its "and" operator, etc. Such things improve readability, etc. Several Python features help shorten the code, like sequence unpacking syntax and multiple return values:

>>> def foo():
...     return 1, 2
...
>>> a, b = foo()
>>> a
1
>>> b
2

List comprehensions help shorten the code, but I think they also reduce bug count a bit and allow you to think about your code at a bit higher level:

>>> xs = [2,3,4,5,6,7,8,9,10,11,12,13]
>>> ps = [x * x for x in xs if x % 2]
>>> ps
[9, 25, 49, 81, 121, 169]

Python has some other features that help shorten the code, like the significant leading white space that avoids some bugs, avoids brace style wars, and removes both some noise and closing-brace code lines.barring crazy stuff like dynamic scoping)I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-) Bye, bearophile
Oct 01 2010
On 10/01/2010 01:54 PM, bearophile wrote:[snip]I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-)

No, dynamic scoping is the crazy thing. Perl code:

$x = 1;
sub p { print "$x\n" }
sub a { local $x = 2; p; }
p; a; p

results in:

pp ~/perl% perl wat.pl
1
2
1

Crazy. :-)
Oct 01 2010
On 01/10/2010 12:54, bearophile wrote:Bruno Medeiros:From my understanding, Scala is a "scalable language" in the sense that it is easy to add new language features, or something similar to that.I see. You may be right.But I'm missing your point here, what does Ada have to do with this?Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.There are a lot of things in a language that, if they make it harder to write small programs, they will also make it harder for larger programs. (sometimes even much harder) I'm no expert in Ada, and there are many things that will affect the success of the language, so I can't comment in detail. But from a cursory look at the language, it looks terribly verbose. That "begin" "end <name of block>" syntax is awful. I already think just "begin" "end" syntax is bad, but also having to repeat the name of block/function/procedure/loop at the "end", that's awful. Is it trying to compete with "XML" ? :pIndeed, I agree. And that was the spirit of that original comment: First of all, I meant succinct not only in character and line count but also in syntactic and semantic constructs. And succinct without changes that would greatly impact the readability or safety of the code. (as mentioned in "barring crazy stuff like dynamic scoping")Like Pete explained, it's indeed exactly "dynamic scoping" that I meant. -- Bruno Medeiros - Software Engineerbut I simply didn't see many language changes that could have made my D more succinct,Making a language more succinct is easy, you may take a look at the J or K languages. The hard thing is to design a succinct language that is also readable and not bug-prone.barring crazy stuff like dynamic scoping)I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-)
Oct 05 2010
"bearophile" <bearophileHUGS lycos.com> wrote in message news:i3lb30$26vf$1 digitalmars.com...Jonathan M Davis:I can respect that. Personally, though, I find a lot of value in not needing to switch languages for that sort of thing. Too much "context switch" for my brain ;)I would have thought that being more idomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-)
Aug 08 2010
Jonathan M Davis wrote:

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}

Actually, looking at the code again, that while loop really should be while(1) rather than while(lastTab != -1), but it will work the same regardless. - Jonathan M Davis
Aug 07 2010
On 08/07/2010 11:04 PM, Jonathan M Davis wrote:On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:

Very nice. Here's how I'd improve removeTabs:

import std.conv;
import std.file;
import std.getopt;
import std.stdio;
import std.string;

void main(string[] args)
{
    uint tabSize = 8;
    getopt(args, "tabsize|t", &tabSize);

    foreach(f; args[1 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string output;
    bool changed;

    foreach(line; file.byLine(File.KeepTerminator.yes))
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
            changed = true;
        }

        output ~= line;
    }

    file.close();

    if (changed)
        std.file.write(fileName, output);
}

A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences. Andrei

I didn't try to worry about multiline string literals, but here are my more idiomatic solutions:

detab:

/* Replace tabs with spaces, and remove trailing whitespace from lines. */

import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);

    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}

-------------------------------------------
The three differences between mine and Walter's are that mine takes the tab size as the first argument, it doesn't put a newline at the end of the file, and it writes the file even if it didn't change (you could test for that, but when using byLine(), it's a bit harder). Interestingly enough, from the few tests that I ran, mine seems to be somewhat faster. I also happen to think that the code is clearer (it's certainly shorter), though that might be up for debate.
-------------------------------------------
tolf:

/* Replace line endings with LF */

import std.file;
import std.string;

void main(string[] args)
{
    foreach(f; args[1 .. $])
        fixEndLines(f);
}

void fixEndLines(string fileName)
{
    auto fileStr = std.file.readText(fileName);
    auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");
    std.file.write(fileName, result);
}

-------------------------------------------
This version is ludicrously simple. And it was also faster than Walter's in the few tests that I ran. In either case, I think that it is definitely clearer code.

Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite not to touch files unless you are actually modifying them. This makes me think we should have a range that detects and replaces patterns lazily and on the fly. I've always thought that loading entire files in memory and working on them is "cheating" in some sense, and a range would help with replacing patterns in streams.

I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.

Looking very good, thanks. 
I think we should feature these and a few others as examples on the website.

Andrei
Aug 07 2010
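The lazy pattern-replacing range suggested above could look roughly like the following sketch. This is only an illustration of the idea, not a proposed Phobos API: the `Detab` struct and the `detab` helper are made-up names, and only tab expansion is handled.

```d
import std.range;
import std.stdio : writeln;

// Sketch of a lazy replacing range: expands tabs to spaces on the fly,
// one character at a time, without materializing the whole input.
struct Detab(R) if (isInputRange!R && is(ElementType!R : dchar))
{
    R src;
    int tabSize;
    int column;   // current output column
    int pending;  // spaces still owed for the last tab seen

    this(R src, int tabSize)
    {
        this.src = src;
        this.tabSize = tabSize;
        expandTab();
    }

    // If the next source element is a tab, turn it into owed spaces.
    private void expandTab()
    {
        if (pending == 0 && !src.empty && src.front == '\t')
        {
            pending = tabSize - column % tabSize;
            src.popFront();
        }
    }

    @property bool empty() { return pending == 0 && src.empty; }

    @property dchar front() { return pending > 0 ? ' ' : src.front; }

    void popFront()
    {
        if (pending > 0)
        {
            --pending;
            ++column;
        }
        else
        {
            column = (src.front == '\n') ? 0 : column + 1;
            src.popFront();
        }
        expandTab();
    }
}

auto detab(R)(R r, int tabSize = 8) { return Detab!R(r, tabSize); }

void main()
{
    import std.conv : to;
    // 'a' at column 0, then a tab owing seven spaces to reach column 8
    writeln(detab("a\tb").to!string);
}
```

Because the range pulls one element at a time, it composes with File.byChunk-style sources and never needs the whole file in memory, which is the point being argued for in the thread.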
On Saturday 07 August 2010 21:59:50 Andrei Alexandrescu wrote:

Very nice. Here's how I'd improve removeTabs:

import std.conv;
import std.file;
import std.getopt;
import std.stdio;
import std.string;

void main(string[] args)
{
    uint tabSize = 8;
    getopt(args, "tabsize|t", &tabSize);
    foreach (f; args[1 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string output;
    bool changed;

    foreach (line; file.byLine(File.KeepTerminator.yes))
    {
        int lastTab = 0;
        while (lastTab != -1)
        {
            const tab = line.indexOf('\t');
            if (tab == -1)
                break;
            const numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
            changed = true;
        }
        output ~= line;
    }
    file.close();

    if (changed)
        std.file.write(fileName, output);
}

Ah. I needed to close the file. I pretty much always just use readText(), so I didn't catch that. Also, it does look like detecting whether the file changed was a bit simpler than I thought that it would be. Quite simple really. Thanks.

Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite to not touch files unless you are actually modifying them.

Yes. That would be good. It's the kind of thing that I forget - probably because most of the code that I write generates new files rather than updating pre-existing ones.

This makes me think we should have a range that detects and replaces patterns lazily and on the fly. I've always thought that loading entire files in memory and working on them is "cheating" in some sense, and a range would help with replacing patterns in streams.

It would certainly be nice to have a way to reasonably process with ranges without having to load the whole thing into memory at once.
Most of the time, I wouldn't care too much, but if you start processing large files, having the whole thing in memory could be a problem (especially if you have multiple versions of it which were created along the way as you were manipulating it). Haskell does lazy loading of files by default and doesn't load the data until you read the appropriate part of the string. It shouldn't be all that hard to do something similar with D and ranges. The hard part would be trying to do all of it in a way that makes it so that all of the processing of the file's data doesn't have to load it all into memory (let alone load it multiple times). I'm not sure that you could do that without explicitly processing a file line by line, writing it to disk after each line is processed, since you could be doing an arbitrary set of operations on the data. It could be interesting to try and find a solution for that though.

Looking very good, thanks. I think we should feature these and a few others as examples on the website.

Well, I, for one, much prefer the ability to program in a manner that's closer to telling the computer to do what I want rather than having to tell it how to do what I want (the replace end-of-line character program being a prime example). It makes life much simpler. Ranges certainly help a lot in that regard too. And having good example code of how to program that way could help encourage people to program that way and use std.range and std.algorithm and their ilk rather than trying more low-level solutions which aren't as easy to understand.

- Jonathan M Davis
Aug 07 2010
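The line-at-a-time approach described above can be sketched in a few lines. This is a hedged illustration, not a finished tool: the function name and file names are placeholders, and the atomic temp-file rename a real tool would do is left out.

```d
import std.stdio : File;
import std.array : replace;

// Stream one line at a time: only a single line is ever held in memory,
// and output starts before the input is fully read.
void fixEndLinesStreaming(string inName, string outName)
{
    auto fin  = File(inName);
    auto fout = File(outName, "w");
    foreach (line; fin.byLine(File.KeepTerminator.yes))
        fout.write(line.idup.replace("\r\n", "\n").replace("\r", "\n"));
    // a real tool would then rename outName over inName,
    // and only if the contents actually changed
}
```

The trade-off versus readText() is exactly the one discussed in the thread: constant memory and earlier output against somewhat more bookkeeping (and no easy whole-file "did anything change?" comparison).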
Jonathan M Davis wrote:

It would certainly be nice to have a way to reasonably process with ranges without having to load the whole thing into memory at once.

Because of asynchronous I/O, being able to start processing and start writing the new file before the old one is finished reading should speed things up.
Aug 07 2010
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:i3ldk4$2ci0$1 digitalmars.com...

Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite to not touch files unless you are actually modifying them.

I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
Aug 08 2010
On 08/08/2010 12:28 PM, Nick Sabalausky wrote:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:i3ldk4$2ci0$1 digitalmars.com...

It doesn't, but it still shows them as changed etc.

Andrei

Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite to not touch files unless you are actually modifying them.

I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
Aug 08 2010
Andrei Alexandrescu, on August 8 at 14:44, wrote to me:

On 08/08/2010 12:28 PM, Nick Sabalausky wrote:

Nope, not really:

/tmp$ svnadmin create x
/tmp$ svn co file:///tmp/x xwc
Checked out revision 0.
/tmp$ cd xwc/
/tmp/xwc$ echo hello > hello
/tmp/xwc$ svn add hello
A         hello
/tmp/xwc$ svn commit -m 'test'
Adding         hello
Transmitting file data .
Committed revision 1.
/tmp/xwc$ touch hello
/tmp/xwc$ svn status
/tmp/xwc$ echo changed > hello
/tmp/xwc$ svn status
M       hello
/tmp/xwc$

You might want to set the mtime to the same as the original file for build purposes though (you know you're changing the file in a way that doesn't really change its semantics, so you might want to avoid unnecessary recompilation).

--
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
... which are susceptible to a growing variety of foreseeable attacks, such as buffer overflow, parameter falsification, ...
-- Stealth - ISS LLC - IT Security

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:i3ldk4$2ci0$1 digitalmars.com...

It doesn't, but it still shows them as changed etc.

Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite to not touch files unless you are actually modifying them.

I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
Aug 08 2010
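Preserving the mtime as suggested above can be done with std.file's getTimes/setTimes. A small hedged sketch, with a made-up function name and no error handling:

```d
import std.datetime : SysTime;
import std.file : getTimes, setTimes, write;

// Rewrite a file's contents but keep its original modification time,
// so build tools don't see it as changed.
void writePreservingMtime(string name, const(void)[] contents)
{
    SysTime accessTime, modificationTime;
    getTimes(name, accessTime, modificationTime);
    write(name, contents);
    setTimes(name, accessTime, modificationTime);
}
```

Whether this is actually desirable depends on the build system: some tools key off content hashes rather than timestamps, in which case the extra step buys nothing.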
I usually do the same thing with a shell pipe

expand | sed 's/ *$//;s/\r$//;s/\r/\n/'

On 07/08/10 02:34, Walter Bright wrote:

I wrote these two trivial utilities for the purpose of canonicalizing source code before checkins, to deal with FreeBSD's inability to handle CRLF line endings, and because I can never figure out the right settings for git to make it do the canonicalization.

tolf - converts LF, CR, and CRLF line endings to LF.
detab - converts all tabs to the correct number of spaces. Assumes tabs are 8 column tabs. Removes trailing whitespace from lines.

Posted here just in case someone wonders what they are.

---------------------------------------------------------

/* Replace tabs with spaces, and remove trailing whitespace from lines. */

import std.file;
import std.path;

int main(string[] args)
{
    foreach (f; args[1 .. $])
    {
        auto input = cast(char[]) std.file.read(f);
        auto output = filter(input);
        if (output != input)
            std.file.write(f, output);
    }
    return 0;
}

char[] filter(char[] input)
{
    char[] output;
    size_t j;
    int column;
    for (size_t i = 0; i < input.length; i++)
    {
        auto c = input[i];
        switch (c)
        {
            case '\t':
                while ((column & 7) != 7)
                {
                    output ~= ' ';
                    j++;
                    column++;
                }
                c = ' ';
                column++;
                break;
            case '\r':
            case '\n':
                while (j && output[j - 1] == ' ')
                    j--;
                output = output[0 .. j];
                column = 0;
                break;
            default:
                column++;
                break;
        }
        output ~= c;
        j++;
    }
    while (j && output[j - 1] == ' ')
        j--;
    return output[0 .. j];
}

-----------------------------------------------------

/* Replace line endings with LF */

import std.file;
import std.path;

int main(string[] args)
{
    foreach (f; args[1 .. $])
    {
        auto input = cast(char[]) std.file.read(f);
        auto output = filter(input);
        if (output != input)
            std.file.write(f, output);
    }
    return 0;
}

char[] filter(char[] input)
{
    char[] output;
    size_t j;
    for (size_t i = 0; i < input.length; i++)
    {
        auto c = input[i];
        switch (c)
        {
            case '\r':
                c = '\n';
                break;
            case '\n':
                if (i && input[i - 1] == '\r')
                    continue;
                break;
            case 0:
                continue;
            default:
                break;
        }
        output ~= c;
        j++;
    }
    return output[0 .. j];
}

------------------------------------------
Aug 08 2010
Norbert Nemec wrote:

I usually do the same thing with a shell pipe

expand | sed 's/ *$//;s/\r$//;s/\r/\n/'

<g>
Aug 08 2010
"Norbert Nemec" <Norbert Nemec-online.de> wrote in message news:i3lq17$99u$1 digitalmars.com...

I usually do the same thing with a shell pipe

expand | sed 's/ *$//;s/\r$//;s/\r/\n/'

Filed under "Why I don't like regex for non-trivial things" ;)
Aug 08 2010
Nick Sabalausky, on August 8 at 13:31, wrote to me:

"Norbert Nemec" <Norbert Nemec-online.de> wrote in message news:i3lq17$99u$1 digitalmars.com...

I usually do the same thing with a shell pipe

expand | sed 's/ *$//;s/\r$//;s/\r/\n/'

Filed under "Why I don't like regex for non-trivial things" ;)

Those regex are non-trivial? Maybe you're confusing sed statements with regex; in that sed program there are 3 trivial regex:

    regex    replace with
     *$      (nothing)
    \r$      (nothing)
    \r       \n

They are the most trivial regex you'd ever find! =)

--
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
Vaporeso held fast to the No-Water theory, which was his own and went as follows: "To turn the other cheek to the fire, it must be put out with barely damp espadrilles."
Aug 08 2010
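For comparison, the three sed substitutions map to D roughly as follows. This is a hedged sketch using the present-day std.regex API (the 2010 API differed); the `cleanup` name is made up, and the `expand` step (tab expansion) is not covered. One wrinkle: once every CR has become an LF, trailing spaces sit directly before the newline, so it is safest to normalize line endings first and strip trailing spaces second.

```d
import std.array : replace;
import std.regex : regex, replaceAll;

// The sed program's three substitutions, approximately:
//   s/\r$//  and  s/\r/\n/  -> normalize all line endings to LF
//   s/ *$//                 -> strip trailing spaces on each line
string cleanup(string s)
{
    s = s.replace("\r\n", "\n")   // CRLF -> LF
         .replace("\r", "\n");    // lone CR -> LF
    // "m" makes $ match at the end of every line, not just end of input
    return replaceAll(s, regex(" +$", "m"), "");
}
```

Plain substring replacement handles the two fixed patterns, so only the anchored trailing-space strip actually needs a regex.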
"Leandro Lucarella" <luca llucax.com.ar> wrote in message news:20100808212859.GL3360 llucax.com.ar...

Nick Sabalausky, on August 8 at 13:31, wrote to me:

IMHO, a task has to be REALLY trivial to be trivial in regex ;)

"Norbert Nemec" <Norbert Nemec-online.de> wrote in message news:i3lq17$99u$1 digitalmars.com...

Those regex are non-trivial?

I usually do the same thing with a shell pipe

expand | sed 's/ *$//;s/\r$//;s/\r/\n/'

Filed under "Why I don't like regex for non-trivial things" ;)

Maybe you're confusing sed statements with regex, in that sed program, there are 3 trivial regex:

Ahh, I see. I'm not familiar with sed, so my eyes got to the part after "sed" and began bleeding, so I figured it had to be one of three things:

- Encrypted data
- Hardware crash
- Regex

;)

Insert other joke about "read-only languages" or "languages that look the same before and after RSA encryption" here.

(I'm not genuinely complaining about regexes. They can be very useful. They just tend to get real ugly real fast.)
Aug 08 2010
Nick Sabalausky wrote:

(I'm not genuinely complaining about regexes. They can be very useful. They just tend to get real ugly real fast.)

Regexes are like flying airplanes. You have to do them often or you get "rusty" real fast. (Flying is not a natural behavior, it's not like riding a bike.)
Aug 08 2010
Andrej Mitrovic:

Andrei used to!string() in an early example in TDPL for some line-by-line processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.

I have modified the code:

import std.stdio: File, writeln;
import std.conv: to;

int process(string fileName)
{
    int total = 0;
    auto file = File(fileName);
    foreach (rawLine; file.byLine())
    {
        string line = to!string(rawLine);
        total += line.length;
    }
    file.close();
    return total;
}

void main(string[] args)
{
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

The run time is 1.29 seconds, showing this is equivalent to the idup.

Bye,
bearophile
Aug 08 2010
What are you using to time the app? I'm using timeit (from the Windows Server 2003 Resource Kit). I'm getting similar results to yours.

Btw, how do you use a warm disk cache? Is there a setting somewhere for that?

On Sun, Aug 8, 2010 at 11:54 PM, bearophile <bearophileHUGS lycos.com> wrote:

Andrej Mitrovic:

Andrei used to!string() in an early example in TDPL for some line-by-line processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.

I have modified the code:

import std.stdio: File, writeln;
import std.conv: to;

int process(string fileName)
{
    int total = 0;
    auto file = File(fileName);
    foreach (rawLine; file.byLine())
    {
        string line = to!string(rawLine);
        total += line.length;
    }
    file.close();
    return total;
}

void main(string[] args)
{
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

The run time is 1.29 seconds, showing this is equivalent to the idup.

Bye,
bearophile
Aug 08 2010
Andrej Mitrovic:

What are you using to time the app?

A buggy utility that is the Windows port of the GNU time command.

Btw, how do you use a warm disk cache? Is there a setting somewhere for that?

If you run the benchmark two times, then the second time, provided you have enough free RAM and your system isn't performing I/O on disk for other purposes, Windows keeps essentially the whole file in a cache in RAM. On Linux too there is a disk cache. The HD too has a cache, so the situation is not simple, and these caches are probably not fully under Windows' control.

Bye,
bearophile
Aug 08 2010
Andrej Mitrovic wrote:

Btw, how do you use a warm disk cache? Is there a setting somewhere for that?

Just run it several times until the times stop going down.
Aug 08 2010