
What's the reason you're not just getting the page content directly with an HTTP request? Does a headless browser provide some benefit in your case?


Yes: often JS does some kind of data fetching, API calls, or whatever else to render the full page (single-page apps, for instance). With GitHub being mostly plain HTML markup and not needing a JS runtime, we could definitely have gone that route. The rationale was that we wanted to use our product ourselves, to gain better insight into what our users do and become more empathetic to their cause.

In short: we wanted to dogfood the product at the cost of some time and machine resources.
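To make the trade-off concrete, here's a minimal sketch of what "needing a JS runtime" means in practice. This is illustrative Python with Selenium and headless Chrome, not our actual stack, and the URL is a placeholder:

  # Sketch: the same URL fetched two ways. Assumes Python with the
  # `requests` package and Selenium 4 + Chrome; example.com is a placeholder.
  import requests
  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options

  url = "https://example.com/spa-page"

  # Plain HTTP: for a single-page app this often returns just an
  # empty <div id="root"></div> shell plus a script tag.
  shell_html = requests.get(url, timeout=10).text

  # Headless browser: the JS runs, the API calls complete, and the
  # DOM you read back is the fully rendered page.
  opts = Options()
  opts.add_argument("--headless=new")
  driver = webdriver.Chrome(options=opts)
  try:
      driver.get(url)
      rendered_html = driver.page_source
  finally:
      driver.quit()

  print(len(shell_html), len(rendered_html))  # rendered is usually far larger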


IMO, for most data-gathering needs, running a browser (even a headless one) would be overkill. A browser is better suited for complex interactions, when you need to fully pretend to be a user, or just for testing purposes so your environments match.
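For comparison, the lighter-weight path is only a few lines; a sketch assuming the `requests` and `beautifulsoup4` packages, with the URL and selector as placeholders:

  # Sketch: scraping server-rendered HTML without a browser.
  # Assumes `pip install requests beautifulsoup4`.
  import requests
  from bs4 import BeautifulSoup

  resp = requests.get(
      "https://example.com/listing",
      headers={"User-Agent": "my-scraper/1.0"},  # identify yourself honestly
      timeout=10,
  )
  resp.raise_for_status()

  soup = BeautifulSoup(resp.text, "html.parser")
  for row in soup.select("table.results tr"):
      cells = [td.get_text(strip=True) for td in row.find_all("td")]
      if cells:
          print(cells)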


I've used the Selenium API driving Firefox in the past to scrape customers' data out of proprietary .NET WebForms systems that required a login and didn't offer any option to export the data.

Crawling the list pages and then each edit page in turn allowed for dumping the name and value from each input field to the log as key:value pairs for processing offline.

Navigating paging was probably the biggest challenge.
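The core of that dump loop looked roughly like this; a sketch from memory, with the login flow, URLs, and field names all placeholders:

  # Sketch of the name/value dump described above. Assumes Selenium 4
  # with Firefox; URLs and selectors are hypothetical.
  from selenium import webdriver
  from selenium.webdriver.common.by import By

  driver = webdriver.Firefox()
  try:
      # Log in (field names are hypothetical).
      driver.get("https://example.com/login")
      driver.find_element(By.NAME, "username").send_keys("user")
      driver.find_element(By.NAME, "password").send_keys("secret")
      driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()

      # Collect the edit links from a list page, then visit each one.
      driver.get("https://example.com/customers")
      edit_urls = [a.get_attribute("href")
                   for a in driver.find_elements(By.LINK_TEXT, "Edit")]

      for url in edit_urls:
          driver.get(url)
          # Dump every input's name and value as key:value for offline parsing.
          for field in driver.find_elements(By.TAG_NAME, "input"):
              print(f"{field.get_attribute('name')}:{field.get_attribute('value')}")
  finally:
      driver.quit()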


I have done the same, to "export" 10s of thousands of pages from a client's Sitecore website where they were in a very adversarial relationship with the incumbent Sitecore dev/hosts.

I totally don't recommend doing this. But it worked for this case.

"I hate to advocate drugs, alcohol, violence, or insanity to anyone, but they've always worked for me." -- Hunter S Thompson


Selenium is great, but seems to be easier to detect and block.


Selenium lets you add random delays between your actions, which could help you avoid triggering a firewall block.

Good practice anyway so you don't overload the site and find your logs empty or full of gaps.
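Nothing fancy is needed; a sketch, where the bounds are arbitrary and `driver`/`edit_urls` stand in for whatever your crawl loop uses:

  # Sketch: jittered pauses between page loads. The interval bounds
  # are arbitrary; tune them to the site you're scraping.
  import random
  import time

  def polite_pause(low=2.0, high=7.0):
      time.sleep(random.uniform(low, high))

  for url in edit_urls:  # placeholder: your list of pages to visit
      driver.get(url)
      # ... scrape the page ...
      polite_pause()  # look less like a tight loop, and go easy on the server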


Good approach, but advanced Selenium detection goes beyond heuristics. Selenium injects JavaScript into the page to function, and the presence of this is how Selenium is detected.
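To make that concrete: two well-known fingerprints a site's own script can check for. A sketch, run here from Selenium itself just to show they're visible (`driver` is a Selenium Chrome instance as above):

  # Sketch: well-known automation tells, checked from inside the page.
  # A site's own JS can run the same checks and flag the session.
  flags = driver.execute_script("""
      return {
          // navigator.webdriver is true in WebDriver-controlled browsers.
          webdriver: navigator.webdriver === true,
          // ChromeDriver has historically leaked $cdc_-prefixed document properties.
          cdcVars: Object.getOwnPropertyNames(window.document)
                       .some(k => /\$cdc_|\$wdc_/.test(k))
      };
  """)
  print(flags)  # e.g. {'webdriver': True, 'cdcVars': True}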


Interesting. I've worked on both sides of scraping and protecting content, but hadn't really considered checking for JavaScript frameworks as a trigger. I'm assuming this is something you could configure in an F5 that also injects its own JavaScript?

Randomising field names, seeding hidden bogus data, and messing with element order were more what I would look at once a persistent scraper was using enough IPs to get around rate limits.
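A sketch of what I mean, framework-agnostic and with all names hypothetical; wiring it into real request handling is left out:

  # Sketch: per-session field renaming plus a honeypot.
  # session_store stands in for whatever your session object is.
  import random
  import secrets

  def obfuscated_form(real_fields, session_store):
      mapping = {f: "f_" + secrets.token_hex(4) for f in real_fields}
      session_store["field_map"] = mapping        # needed to decode the submit
      honeypot = "email_" + secrets.token_hex(2)  # a human never fills this
      session_store["honeypot"] = honeypot

      inputs = [f'<input name="{alias}">' for alias in mapping.values()]
      inputs.append(f'<input name="{honeypot}" style="display:none">')
      random.shuffle(inputs)                      # mess with element order too
      return "\n".join(inputs)

  def looks_like_bot(form_data, session_store):
      # A filled honeypot, or use of the real field names, is a tell.
      return bool(form_data.get(session_store["honeypot"])) or any(
          f in form_data for f in session_store["field_map"])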


I believe you could recompile Selenium with different names to get around it, but since Puppeteer uses CDP, which is baked into the browser, no injection is necessary, bypassing a lot of this.
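Puppeteer itself is Node, but the same drive-it-over-CDP approach is available from Python via Playwright (my substitution, not necessarily what the parent uses); a sketch:

  # Sketch: automating Chromium over its built-in protocol with
  # Playwright's sync API (a Python stand-in for Puppeteer).
  # `pip install playwright`, then `playwright install chromium`.
  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=True)
      page = browser.new_page()
      page.goto("https://example.com/spa-page")  # placeholder URL
      page.wait_for_load_state("networkidle")    # let XHR/fetch settle
      html = page.content()
      browser.close()

  print(len(html))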


I agree with both you and the post you're replying to.

One comment though: once you've crossed that "I can't do this without a real browser" line in the sand a few times, you end up with a collection of snippets and skills that moves that line much closer. Sure, I'll load the page and watch in the browser dev tools to see what's in the HTML and what's coming back from XHR calls. But when I've got a directory full of previously used example code that fires up Python/Selenium and handles the "boilerplate" parts, it's a much easier decision to jump that way than it was the first time I stared at the BeautifulSoup documentation.
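For what it's worth, most of that boilerplate is explicit waits; a sketch of the kind of helper that accumulates in those directories (the selector is a placeholder):

  # Sketch: the sort of reusable snippet that builds up over time -- an
  # explicit wait so scraping only starts once the page has rendered.
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.support.ui import WebDriverWait

  def wait_for(driver, css, timeout=15):
      return WebDriverWait(driver, timeout).until(
          EC.presence_of_element_located((By.CSS_SELECTOR, css)))

  # e.g. wait_for(driver, "table.results") before reading driver.page_source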

(When the only tool you have is a nailgun, every problem looks like a messiah...)


Most sites these days are single-page apps. Unless Cheerio and PhantomJS work well with those (I haven't tried), I don't see any other option. A benefit of a browser is that it handles multiprocessing much better than you do; I only need to add some custom code to block non-JS requests to improve performance a bit (see the sketch below).

If you're doing ad-hoc web scraping, it's fine to spend time looking for the most efficient way, but if your web-scraping framework is part of a data pipeline that scrapes all sorts of websites, then a browser is the most development-time-saving route.
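A sketch of that request-blocking trick, assuming Selenium 4 with Chrome, which exposes raw CDP commands; the URL patterns are examples:

  # Sketch: cut page weight by blocking resources the scraper doesn't need.
  # Assumes Selenium 4 + Chrome (execute_cdp_cmd is Chromium-only).
  from selenium import webdriver

  driver = webdriver.Chrome()
  driver.execute_cdp_cmd("Network.enable", {})
  driver.execute_cdp_cmd("Network.setBlockedURLs", {
      "urls": ["*.png", "*.jpg", "*.gif", "*.woff*", "*.css"]
  })
  driver.get("https://example.com/spa-page")  # JS still runs; images/CSS don't load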


Have you made it to the end? There's also a quick PoC of how to do the same in the browser console.



