
digitalmars.D.learn - For those ready to take the challenge

reply "eles" <eles eles.com> writes:
https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Jan 09 2015
next sibling parent reply Justin Whear <justin economicmodeling.com> writes:
On Fri, 09 Jan 2015 13:50:28 +0000, eles wrote:

 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Was excited to give it a try, then remembered... std.xml :(
Jan 09 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 9 January 2015 at 16:55:30 UTC, Justin Whear wrote:
 Was excited to give it a try, then remembered...std.xml  :(
Well, as the author of my dom.d, I think it counts as a first party library when I use it!

---
import arsd.dom;
import std.net.curl;
import std.stdio, std.algorithm;

void main() {
	auto document = new Document(cast(string) get("http://www.stroustrup.com/C++.html"));
	writeln(document.querySelectorAll("a[href]").map!(a=>a.href));
}
---

prints:

[snip ... "http://www.morganstanley.com/", "http://www.cs.columbia.edu/", "http://www.cse.tamu.edu", "index.html", "C++.html", "bs_faq.html", "bs_faq2.html", "C++11FAQ.html", "papers.html", "4th.html", "Tour.html", "programming.html", "dne.html", "bio.html", "interviews.html", "applications.html", "glossary.html", "compilers.html"]

Or perhaps better yet:

---
import arsd.dom;
import std.net.curl;
import std.stdio;

void main() {
	auto document = new Document(cast(string) get("http://www.stroustrup.com/C++.html"));
	foreach(a; document.querySelectorAll("a[href]"))
		writeln(a.href);
}
---

Which puts each one on a separate line.
Jan 09 2015
next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
Huh, looking at the answers on the website, they're mostly using 
regular expressions. Weaksauce. And wrong - they don't find ALL 
the links, they find the absolute HTTP urls!
Jan 09 2015
next sibling parent Justin Whear <justin economicmodeling.com> writes:
On Fri, 09 Jan 2015 17:18:42 +0000, Adam D. Ruppe wrote:

 Huh, looking at the answers on the website, they're mostly using regular
 expressions. Weaksauce. And wrong - they don't find ALL the links, they
 find the absolute HTTP urls!
Yes, I noticed that. `<script src="http://app.js"></script>` isn't a "hyperlink". Wake up sheeple!
Jan 09 2015
prev sibling next sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Friday, 9 January 2015 at 17:18:43 UTC, Adam D. Ruppe wrote:
 Huh, looking at the answers on the website, they're mostly 
 using regular expressions. Weaksauce. And wrong - they don't 
 find ALL the links, they find the absolute HTTP urls!
Yeah... Surprising, since languages like Python include an HTML parser in the standard library.

Besides, if you want all resource links you have to do a lot better, since the following attributes can contain resource addresses: href, src, data, cite, xlink:href… You also need to do entity expansion, since the links can contain HTML entities like "&amp;". Depressing.
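A rough sketch (not from the thread) of the wider net described above, using Adam's dom.d: query every attribute that can carry a resource address, not just a[href]. It assumes the selector engine accepts bare attribute selectors like "*[src]" the same way it accepts "a[href]"; xlink:href is left out since the colon would need escaping, and entity expansion in attribute values is handled at parse time per Adam's later reply.

import arsd.dom;
import std.net.curl;
import std.stdio;

void main() {
	auto document = new Document(cast(string) get("http://www.stroustrup.com/C++.html"));

	// attributes that can hold resource addresses (href, src, data, cite, ...)
	foreach(attrName; ["href", "src", "data", "cite"]) {
		foreach(e; document.querySelectorAll("*[" ~ attrName ~ "]"))
			writeln(e.getAttribute(attrName));
	}
}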
Jan 10 2015
prev sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Friday, 9 January 2015 at 17:18:43 UTC, Adam D. Ruppe wrote:
 Huh, looking at the answers on the website, they're mostly 
 using regular expressions. Weaksauce. And wrong - they don't 
 find ALL the links, they find the absolute HTTP urls!
Since it is a comparison of languages it's okay to match the original behaviour.
Jan 10 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 12:34:42 UTC, Tobias Pankrath 
wrote:
 Since it is a comparison of languages it's okay to match the 
 original behaviour.
I don't think this is really a great comparison of languages either though, because it is gluing together a couple of library tasks. Only a few bits of the actual language show through.

In the given regex solutions, C++ has an advantage over C in that the regex structure can be freed automatically in a destructor, and there's a raw string literal in there, but that's about all from the language itself. The original one is kinda long because he didn't use an HTTP GET library, not because the language couldn't have one.

There are bits where the language can make those libraries nicer too: dom.d uses operator overloading and opDispatch to support things like .attribute and also .attr.X and .style.foo and element["selector"].addClass("foo") and so on, implemented in very, very little code - I didn't have to manually list methods for the collection or properties for the attributes. But a library *could* do it the manual way and still get similar results for the end user, so the given posts wouldn't show that difference.
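For illustration, a minimal sketch (not dom.d's actual source) of the opDispatch idea described above: attribute reads become .attr.name properties without hand-writing a property per attribute. The type names here are made up for the example.

import std.stdio;

struct AttributeProxy {
	string[string]* attrs;

	// element.attr.href, element.attr.title, ... all route through here
	string opDispatch(string name)() {
		if(auto p = name in *attrs)
			return *p;
		return null;
	}
}

class DemoElement {
	string[string] attributes;

	AttributeProxy attr() {
		return AttributeProxy(&attributes);
	}
}

void main() {
	auto e = new DemoElement();
	e.attributes["href"] = "C++.html";
	writeln(e.attr.href);  // prints: C++.html
	writeln(e.attr.title); // attribute not set: prints an empty line
}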
Jan 10 2015
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Saturday, 10 January 2015 at 15:13:27 UTC, Adam D. Ruppe wrote:
 On Saturday, 10 January 2015 at 12:34:42 UTC, Tobias Pankrath 
 wrote:
 Since it is a comparison of languages it's okay to match the 
 original behaviour.
 I don't think this is really a great comparison of languages either though, because it is gluing together a couple of library tasks. Only a few bits of the actual language show through.

 In the given regex solutions, C++ has an advantage over C in that the regex structure can be freed automatically in a destructor, and there's a raw string literal in there, but that's about all from the language itself. The original one is kinda long because he didn't use an HTTP GET library, not because the language couldn't have one.

 There are bits where the language can make those libraries nicer too: dom.d uses operator overloading and opDispatch to support things like .attribute and also .attr.X and .style.foo and element["selector"].addClass("foo") and so on, implemented in very, very little code - I didn't have to manually list methods for the collection or properties for the attributes. But a library *could* do it the manual way and still get similar results for the end user, so the given posts wouldn't show that difference.
I agree and one of the answers says:
 I think the "no third-party" assumption is a fallacy. And is a 
 specific fallacy that afflicts C++ developers, since it's so 
 hard to make reusable code in C++. When you are developing 
 anything at all, even if it's a small script, you will always 
 make use of whatever pieces of reusable code are available to 
 you.
 The thing is, in languages like Perl, Python, Ruby (to name a 
 few), reusing
 someone else's code is not only easy, but it is how most people 
 actually write code most of the time.
I think he's wrong, because it spoils the comparison. Every answer should delegate those tasks to a library that Stroustrup used as well, e.g. regex matching, string-to-number conversion and some kind of TCP sockets. But it must do the same work that his solution does: create and parse the HTTP header and extract the HTML links, probably using regex, though I wouldn't mind another solution.

Otherwise everyone could put a libdo_the_stroustroup_thing on dub and then call do_the_stroustroup_thing() in main. Comparing what the standard libraries (and libraries easily obtained or quasi-standard) offer is another challenge.
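A rough sketch (not from the thread) of what an answer under that rule might look like in D: only the kind of library support Stroustrup's version uses - TCP sockets and regex - with the HTTP request written and the response header skipped by hand. The host, file name and the deliberately simple link regex are illustrative, in the spirit of the other posts.

import std.algorithm : findSplit;
import std.regex;
import std.socket;
import std.stdio;

void main() {
	auto server = "www.stroustrup.com";
	auto file = "C++.html";

	// open a TCP connection and write the HTTP request by hand
	auto sock = new TcpSocket(new InternetAddress(server, 80));
	sock.send("GET /" ~ file ~ " HTTP/1.0\r\nHost: " ~ server ~ "\r\n\r\n");

	// read the whole response
	string response;
	char[4096] buffer;
	ptrdiff_t got;
	while((got = sock.receive(buffer[])) > 0)
		response ~= buffer[0 .. got].idup;
	sock.close();

	// skip past the HTTP header, then extract links with a regex,
	// as imperfectly as the other answers do
	auto parts = response.findSplit("\r\n\r\n");
	foreach(m; matchAll(parts[2], regex(`<a.*?href="(.*?)"`)))
		writeln(m[1]);
}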
Jan 10 2015
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath 
wrote:
 But it must do the same work that he's solution does: Create 
 and parse HTML header and extract the html links, probably 
 using regex, but I wouldn't mind another solution.
Yeah, that would be best.

BTW, interesting line here:

   s << "GET " << "http://" + server + "/" + file << " HTTP/1.0\r\n";
   s << "Host: " << server << "\r\n";

Why + instead of <<? C++'s usage of << is totally blargh to me anyway, but seeing both is even stranger. Weird language, weird library.
 Everyone can put a libdo_the_stroustroup_thing on dub and then 
 call do_the_stroustroup_thing() in main. To compare what the 
 standard libraries (and libraries easily obtained or quasi 
 standard) offer is another challenge.
Yeah.
Jan 10 2015
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath 
wrote:
 ...

 The thing is, in languages like Perl, Python, Ruby (to name a 
 few), reusing
 someone else's code is not only easy, but it is how most 
 people actually write code most of the time.
 I think he's wrong, because it spoils the comparison. Every answer should delegate those tasks to a library that Stroustrup used as well, e.g. regex matching, string-to-number conversion and some kind of TCP sockets. But it must do the same work that his solution does: create and parse the HTTP header and extract the HTML links, probably using regex, though I wouldn't mind another solution.

 Otherwise everyone could put a libdo_the_stroustroup_thing on dub and then call do_the_stroustroup_thing() in main. Comparing what the standard libraries (and libraries easily obtained or quasi-standard) offer is another challenge.
I disagree. The great thing about "comes with batteries" runtimes is the guarantee that the desired features exist on all platforms supported by the language.

If the libraries are dumped into a repository, there is always the question of whether a library works across all the OSes supported by the language, or even whether the libraries work together at all - especially if they depend on common packages with incompatible versions.

This is the cause of so many string and vector types across C++ libraries, as most of those libraries were developed before C++98 was even done. It is also why the C runtime is little more than a light version of UNIX as it was back in 1989, without any worthwhile feature added since then, besides some extra support for numeric types and slightly more secure library functions.

Nowadays, unless I am doing something very OS-specific, I hardly care which OS I am using, thanks to such "comes with batteries" runtimes.

-- Paulo
Jan 10 2015
prev sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath 
wrote:
 I think he's wrong, because it spoils the comparison. Every 
 answer should delegate those tasks to a library that 
 Stroustrup used as well, e.g. regex matching, string-to-number 
 conversion and some kind of TCP sockets. But it must do the 
 same work that his solution does: create and parse the HTTP 
 header and extract the HTML links, probably using regex, 
 though I wouldn't mind another solution.
The challenge is completely pointless. Different languages have different ways of hacking together a compact, incorrect solution. Directly translating a C++ hack into another language is a task for people who are drunk.

For the challenge to make sense, it would entail parsing all legal HTML5 documents, extracting all resource links, converting them into absolute form and printing them one per line. With no hiccups.
Jan 10 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad 
wrote:
 For the challenge to make sense it would entail parsing all 
 legal HTML5 documents, extracting all resource links, 
 converting them into absolute form and printing them one per 
 line. With no hickups.
Though, that's still a library thing rather than a language thing.

dom.d and the Url struct in cgi.d should be able to do all that, in just a few lines even, but that's just because I've done a *lot* of web scraping with the libs before, so I made them work for that.

In fact... let me do it. I'll use my http2.d instead of cgi.d; it has a similar Url struct, just more focused on client requests.

import arsd.dom;
import arsd.http2;
import std.stdio;

void main() {
	auto base = Uri("http://www.stroustrup.com/C++.html");

	// http2 is a newish module of mine that aims to imitate
	// a browser in some ways (without depending on curl btw)
	auto client = new HttpClient();
	auto request = client.navigateTo(base);

	auto document = new Document();

	// and http2 provides an asynchronous api but you can
	// pretend it is sync by just calling waitForCompletion
	auto response = request.waitForCompletion();

	// parseGarbage uses a few tricks to fixup invalid/broken HTML
	// tag soup and auto-detect character encodings, including when
	// it lies about being UTF-8 but is actually Windows-1252
	document.parseGarbage(response.contentText);

	// Uri.basedOn returns a new absolute URI based on something else
	foreach(a; document.querySelectorAll("a[href]"))
		writeln(Uri(a.href).basedOn(base));
}

Snippet of the printouts:

[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]

The latter are relative links that it resolved against the base, and the first few are absolute. Seems to have worked.

There are other kinds of links than just a[href], but fetching them is as simple as adding them to the selector, or looping over them separately:

	foreach(a; document.querySelectorAll("script[src]"))
		writeln(Uri(a.src).basedOn(base));

There are none on that page, and no <link>s either, but it is easy enough to do with the lib.

Looking at the source of that page, I find some invalid HTML and lies about the character set. How did Document.parseGarbage do? Pretty well - outputting the parsed DOM tree shows it auto-corrected the problems I see by eye.
Jan 10 2015
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Saturday, 10 January 2015 at 17:39:17 UTC, Adam D. Ruppe wrote:
 Though, that's still a library thing rather than a language 
 thing.
It is a language-library-platform thing; things like how composable the ecosystem is would be interesting to compare. But it would be unfair to require a minimalistic language to not use third-party libraries. One should probably require that the library used is generic (not a spider framework), does not use FFI, and is mature and maintained?
 	document.parseGarbage(response.contentText);

         // Uri.basedOn returns a new absolute URI based on 
 something else
 	foreach(a; document.querySelectorAll("a[href]"))
 		writeln(Uri(a.href).basedOn(base));
 }
Nice and clean code; does it expand HTML entities ("&amp;")?

The HTML5 standard has improved on HTML4 by now being explicit about how incorrect documents shall be interpreted, in section 8.2. That ought to be sufficient, since it is what web browsers are supposed to do.

http://www.w3.org/TR/html5/syntax.html#html-parser
Jan 10 2015
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 19:17:22 UTC, Ola Fosheim Grøstad 
wrote:
 Nice and clean code; does it expand HTML entities ("&amp;")?
Of course. It does it both ways:

	<span>a &amp;</span>
	span.innerText == "a &"

	span.innerText = "a \" b";
	assert(span.innerHTML == "a &quot; b");

parseGarbage also tries to fix broken entities, so a & standing alone will be translated to &amp; for you. There's also parseStrict, which just throws an exception in cases like that.

That's one thing a lot of XML parsers don't do in the name of speed, but I do, since it is pretty rare that I don't want them translated. One thing I did for a speedup, though, was to scan the string for &: if it doesn't find one, return a slice of the original; if it does, return a new string with the entity translated. That gave a surprisingly big speed boost without costing anything in convenience.
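A rough sketch (not dom.d's actual code) of that fast path: return a slice when there is no '&' at all, and only build a new string when something needs decoding. Only a handful of named entities are handled here; a real decoder covers the full table plus numeric references.

string decodeEntities(string s) {
	import std.algorithm : startsWith;
	import std.array : appender;
	import std.string : indexOf;

	// fast path: no '&' means nothing to decode, so return the original slice
	if(s.indexOf('&') < 0)
		return s;

	// slow path: build a new string, translating entities as we go
	auto result = appender!string();
	size_t i = 0;
	while(i < s.length) {
		if(s[i] == '&') {
			if(s[i .. $].startsWith("&amp;"))  { result.put('&'); i += 5; continue; }
			if(s[i .. $].startsWith("&quot;")) { result.put('"'); i += 6; continue; }
			if(s[i .. $].startsWith("&lt;"))   { result.put('<'); i += 4; continue; }
			if(s[i .. $].startsWith("&gt;"))   { result.put('>'); i += 4; continue; }
			// a bare '&' falls through unchanged, much as parseGarbage would fix it up
		}
		result.put(s[i]);
		i++;
	}
	return result.data;
}

unittest {
	assert(decodeEntities("no entities here") is "no entities here"); // same slice back
	assert(decodeEntities("a &amp; b &lt;tag&gt;") == "a & b <tag>");
}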
 The HTML5 standard has improved on HTML4 by now being explicit 
 on how incorrect documents shall be interpreted in section 8.2. 
 That ought to be sufficient, since that is what web browsers 
 are supposed to do.

 http://www.w3.org/TR/html5/syntax.html#html-parser
Huh, I never read that, my thing just did what looked right to me over hundreds of test pages that were broken in various strange and bizarre ways.
Jan 10 2015
prev sibling parent reply "Nordlöw" <per.nordlow gmail.com> writes:
On Friday, 9 January 2015 at 17:15:43 UTC, Adam D. Ruppe wrote:
 import arsd.dom;
 import std.net.curl;
 import std.stdio, std.algorithm;

 void main() {
 	auto document = new Document(cast(string) 
 get("http://www.stroustrup.com/C++.html"));
 	writeln(document.querySelectorAll("a[href]").map!(a=>a.href));
 }

 Or perhaps better yet:

 import arsd.dom;
 import std.net.curl;
 import std.stdio;

 void main() {
 	auto document = new Document(cast(string) 
 get("http://www.stroustrup.com/C++.html"));
 	foreach(a; document.querySelectorAll("a[href]"))
 		writeln(a.href);
 }

 Which puts each one on a separate line.
Both of these code examples trigger the same assert in dmd:

dmd: expression.c:3761: size_t StringExp::length(int): Assertion `encSize == 1 || encSize == 2 || encSize == 4' failed.

on dmd git master. Ideas, anyone?
Jan 10 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 13:22:57 UTC, Nordlöw wrote:
 on dmd git master. Ideas anyone?
Don't use git master :P Definitely another regression. That line was just pushed to git like two weeks ago and the failing assertion is pretty obviously a pure dmd code bug, it doesn't know the length of char apparently.
Jan 10 2015
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:

 Don't use git master :P
Is the issue in Bugzilla? Bye, bearophile
Jan 10 2015
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 15:24:45 UTC, bearophile wrote:
 Is the issue in Bugzilla?
I don't know, bugzilla is extremely difficult to search. I guess I'll post it again and worst case it will be closed as a duplicate.
Jan 10 2015
prev sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 15:24:45 UTC, bearophile wrote:
 Is the issue in Bugzilla?
https://issues.dlang.org/show_bug.cgi?id=13966
Jan 10 2015
prev sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 10 January 2015 at 14:56:09 UTC, Adam D. Ruppe wrote:
 On Saturday, 10 January 2015 at 13:22:57 UTC, Nordlöw wrote:
 on dmd git master. Ideas anyone?
Don't use git master :P
Do use git master. The more people do, the fewer regressions will slip into the final release. You can use Dustmite to reduce the code to a simple example, and Digger to find the exact pull request which introduced the regression. (Yes, shameless plug, preaching to the choir, etc.)
Jan 10 2015
prev sibling next sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
Jan 09 2015
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/9/15 6:10 PM, Jesse Phillips wrote:
 On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
Nailed it. -- Andrei
Jan 09 2015
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 10 January 2015 at 02:10:04 UTC, Jesse Phillips 
wrote:
 On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
I think byLine is not necessary. By default, . will not match line breaks.

One-statement solution:

import std.net.curl, std.stdio;
import std.algorithm, std.regex;

void main() {
    get("http://www.stroustrup.com/C++.html")
        .matchAll(`<a.*?href="(.*)"`)
        .map!(m => m[1])
        .each!writeln();
}
Jan 09 2015
parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Vladimir Panteleev via Digitalmars-d-learn wrote on Sat, 10 Jan 2015 at
07:42 +0000:
 On Saturday, 10 January 2015 at 02:10:04 UTC, Jesse Phillips 
 wrote:
 On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
 I think byLine is not necessary. By default, . will not match line breaks.

 One-statement solution:

 import std.net.curl, std.stdio;
 import std.algorithm, std.regex;

 void main() {
     get("http://www.stroustrup.com/C++.html")
         .matchAll(`<a.*?href="(.*)"`)
         .map!(m => m[1])
         .each!writeln();
 }
Oh, here it is - I was looking for each. I thought it was already in Phobos but could not find it. Now I know why :D
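As a side note (not from the thread): on a Phobos release without each, the same pipeline can be written with a plain foreach, which was what released compilers offered at the time.

import std.net.curl, std.stdio;
import std.regex;

void main() {
	// foreach over matchAll gives the same output as .each!writeln
	foreach(m; get("http://www.stroustrup.com/C++.html")
	               .matchAll(`<a.*?href="(.*)"`))
		writeln(m[1]);
}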
Jan 10 2015
prev sibling parent "MattCoder" <stop spam.com> writes:
On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
From the link: "Let's show Stroustrup what small and readable program actually is."

Alright, there are a lot of examples in many languages, but shouldn't those examples handle exceptions like the original code does?

Matheus.
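For illustration, a hedged sketch (not from the thread) of Vladimir's one-statement version with the obvious failure mode handled: wrapping the download in a try/catch for the CurlException that std.net.curl throws. Whether that covers everything the original C++ code's error handling does is a separate question.

import std.net.curl, std.stdio;
import std.algorithm, std.regex;

void main() {
	try {
		get("http://www.stroustrup.com/C++.html")
			.matchAll(`<a.*?href="(.*)"`)
			.map!(m => m[1])
			.each!writeln();
	} catch(CurlException e) {
		// std.net.curl reports download failures by throwing CurlException
		stderr.writeln("download failed: ", e.msg);
	}
}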
Jan 10 2015