digitalmars.D.learn - Web crawler/scraping
- Carlos Cabral (5/5) Feb 17 2021 Hi,
- Ferhat Kurtulmuş (4/9) Feb 17 2021 I found this but it looks outdated:
- Carlos Cabral (11/22) Feb 17 2021 Thanks!
- Adam D. Ruppe (14/16) Feb 17 2021 Does the website need javascript?
- Carlos Cabral (9/25) Feb 17 2021 No, I don't think it needs JS.
- Carlos Cabral (37/53) Feb 17 2021 ...and it's working :)
Hi,

I'm trying to automatically collect some json data from a website/admin panel that is behind a login form. Is there a D library that can help me with this?

Thank you
Feb 17 2021
On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form. Is there a D library that can help me with this?

I found this but it looks outdated: https://github.com/gedaiu/selenium.d
Feb 17 2021
On Wednesday, 17 February 2021 at 12:27:16 UTC, Ferhat Kurtulmuş wrote:
> I found this but it looks outdated: https://github.com/gedaiu/selenium.d

Thanks! This seems to depend on Selenium; I was looking for something standalone, like

crawler.get(...)
crawler.post(...)
crawler.parse(...)

so that I can deploy it on the client's network as a single executable (the website I'm crawling is only available internally...).
Feb 17 2021
On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.

Does the website need javascript?

If not, my dom.d may be able to help. It can download some HTML, parse it, and fill in forms; then my http2.d submits it. (I never implemented Form.submit in dom.d, but it is pretty easy to make with the other functions that are implemented — heck, maybe I'll implement it now if it sounds like it might work.)

Or, if it is all json, you might be able to just craft some requests with my lib, or even Phobos' std.net.curl, that submit the login request, save a cookie, then fetch some json stuff.

I literally just rolled out of bed, but in an hour or two I can come back and make some example code for you if this sounds plausible.
Feb 17 2021
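[The login-cookie-then-fetch flow described above can also be sketched with std.net.curl's high-level get/post helpers plus HTTP.setCookieJar, rather than raw CurlOption settings. This is only an illustrative sketch, not code from the thread: the example.com URLs, the credentials, and the cookies.txt path are all placeholders.]

```d
import std.net.curl;

void main()
{
    // Placeholder endpoints and credentials -- substitute the real ones.
    auto http = HTTP();
    http.setCookieJar("cookies.txt"); // persists the session cookie

    // Submit the login request; the server's Set-Cookie response
    // is captured by the cookie jar on this HTTP handle.
    post("https://example.com/login",
         "username=user&password=pass", http);

    // Reusing the same handle sends the cookie with the json fetch.
    auto json = get("https://example.com/fetchjson", http);
}
```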
On Wednesday, 17 February 2021 at 13:13:00 UTC, Adam D. Ruppe wrote:
> Or if it is all json you might be able to just craft some requests with my lib or even phobos' std.net.curl that submits the login request, saves a cookie, then fetches some json stuff.

No, I don't think it needs JS. I think I can submit the login form and then just fetch/save the json request using the login cookie, as you suggest. A full crawler/scraping solution may be overkill...

I'll try with std.net.curl and come back to you in a couple of hours.

Thank you!!
Feb 17 2021
On Wednesday, 17 February 2021 at 13:13:00 UTC, Adam D. Ruppe wrote:
> Or if it is all json you might be able to just craft some requests with my lib or even phobos' std.net.curl that submits the login request, saves a cookie, then fetches some json stuff.

...and it's working :) Thank you Adam and Ferhat.

Leaving this here if anyone needs it:

```
import std.stdio;
import std.string;
import std.net.curl;
import core.thread;

void main()
{
    int waitTime = 5;
    auto domain = "https://example.com";
    auto cookiesFile = "cookies.txt";

    auto http = HTTP();
    http.handle.set(CurlOption.use_ssl, 1);
    http.handle.set(CurlOption.ssl_verifypeer, 0);       // internal site, skip cert check
    http.handle.set(CurlOption.cookiefile, cookiesFile); // read cookies from here...
    http.handle.set(CurlOption.cookiejar, cookiesFile);  // ...and write them back
    http.setUserAgent("...");
    http.onReceive = (ubyte[] chunk) {
        // handle the response body here
        return chunk.length;
    };

    // GET the login page first (picks up any initial session cookie)
    http.method = HTTP.Method.get;
    http.url = domain ~ "/login";
    http.perform();
    Thread.sleep(waitTime.seconds);

    // POST the credentials; the login cookie lands in the jar
    auto data = "username=user&password=pass";
    http.method = HTTP.Method.post;
    http.url = domain ~ "/login";
    http.setPostData(data, "application/x-www-form-urlencoded");
    http.perform();
    Thread.sleep(waitTime.seconds);

    // GET the json, now authenticated via the cookie
    http.method = HTTP.Method.get;
    http.url = domain ~ "/fetchjson";
    http.perform();
}
```
Feb 17 2021
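[Once the json is fetched as above, Phobos' std.json can pull typed values out of it. A minimal sketch — the payload below is made-up stand-in data, not the thread's actual response.]

```d
import std.json;
import std.stdio;

void main()
{
    // Stand-in payload; a real run would use the fetched response body.
    auto payload = `{"items":[{"id":1,"name":"first"},{"id":2,"name":"second"}]}`;

    auto j = parseJSON(payload);
    foreach (item; j["items"].array)
    {
        // .integer and .str extract typed values from a JSONValue
        writeln(item["id"].integer, ": ", item["name"].str);
    }
}
```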