
Eh.

I tried to do Chromium/Puppeteer based scraping this way.

Building a Dockerfile took ages due to the low compute. (Rust was a non-starter).

I also had (foolishly) only bought the Pi with 2GB instead of 8GB so RAM was an issue.

Disk was super slow.

I'm not sure how viable this is, especially given how hard it currently is to source a Pi, let alone its compute and memory constraints.



> Building a Dockerfile took ages due to the low compute

Why compile anything on the Raspberry Pi? Cross-compile once on a machine with more compute (like your laptop, desktop, phone, or an EC2 instance), then transfer the compiled binaries or the built Docker image over.
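As a sketch of the "build elsewhere, run on the Pi" idea (image name, tag, and hostname are all illustrative, and this assumes Docker with buildx on the fast machine):

```shell
# Cross-build an arm64 image on an x86 laptop/desktop:
docker buildx build --platform linux/arm64 -t scraper:arm64 --load .

# Ship it to the Pi without needing a registry:
docker save scraper:arm64 | ssh pi@raspberrypi.local docker load
```

The same idea works without Docker: cross-compile the binary with your toolchain of choice and `scp` it over.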

> Pi with 2GB instead of 8GB

For headless Chrome, 2GB should be enough unless you're doing other stuff with it. Unless you mean for compiling, which, as above, can be done elsewhere.


You're describing the worst possible scenario.

First of all, the answer recommends using either a Pi or a VPS. If a Pi doesn't cut it for you, just switch to a VPS with sufficient specs for your requirements. Problem solved; now it's viable.

Besides, a significant part of the web can be scraped without resorting to a heavyweight browser such as Chromium. It should always be the last resort. Even if you have to evaluate JavaScript (as with SPAs), there are much cheaper solutions than Puppeteer (JSDOM, for example) that can get the job done most of the time.
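To illustrate how far you can get without any browser at all (the URL and the `<h1>` pattern are just examples; real pages usually want a proper HTML parser):

```shell
# Fetch a static page and pull a value out with plain text tools --
# no Chromium, no Puppeteer, a few MB of RAM.
curl -s https://example.com/ | grep -oP '(?<=<h1>).*(?=</h1>)'
```

If the data is rendered server-side, this class of one-liner is often the whole scraper.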

As to Docker, I fail to see why you would need Docker for this kind of job, unless you don't know how to do it without Docker.

When there is no portability requirement, the costs of Docker easily surpass its benefits.

...

So no, it's not that what the answer recommends isn't viable; you're doing it wrong. Either you're using a Pi when you need a VPS, or you're introducing unnecessary layers.


You really don't need docker for this.


It's true that you don't, but I can see the advantages.

I have a Raspberry Pi that is natively running a scraper using headless Chromium and cron. It works great, except....

I ended up needing a virtual framebuffer. I got it working on the Raspberry Pi, but then I got a new workstation and wanted to edit and test my script there. I hit cryptic errors that I had to debug just to understand they were framebuffer issues, then attempt to recreate the setup running on my Pi, then debug that...
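For what it's worth, the usual way to get a virtual framebuffer on a displayless box is `xvfb-run` (the script name and screen geometry below are just placeholders):

```shell
# Start a throwaway X virtual framebuffer and run the scraper inside it,
# so Chromium can open a "display" on a machine with no screen attached.
xvfb-run -a --server-args="-screen 0 1280x720x24" node scrape.js
```

Capturing that one invocation (or the Xvfb package install) in a README or Dockerfile is the part that saves the re-debugging later.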

My first mistake was not writing down what I did in my README, but a Docker image would have saved me a ton of time here.


Why couldn't you keep editing your script on your Pi?

Isn't proposing Docker as a solution to this going nuclear?

I think there are so many cheaper things you could do to solve this.

For me, this is absolutely no justification for using Docker in this scenario.


The script takes 60+ seconds to run on an RPi and fewer than 10 on an i9, allowing for significantly faster iteration.


That doesn't really answer the question of what Docker actually does to help here, other than making everything slower and more complex.


That's really deviating from the nature of the "cheapest, easiest way to host a cronjob" question. If the OP has that kind of requirement, he won't get good answers.


You can use the "browserless" Docker service, which ships a headless Chrome browser in a container. It also supports the Puppeteer and Playwright connect APIs. Works flawlessly! I use it in combination with n8n, all on a Raspberry Pi 4B (yes, I got one recently).
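The setup is roughly this (the image name/tag and port vary by browserless version, so treat these as illustrative):

```shell
# Run the browserless container, exposing its WebSocket endpoint:
docker run -d -p 3000:3000 browserless/chrome

# In the scraper, connect to it instead of launching Chrome locally
# (Node one-liner for illustration; requires puppeteer installed):
node -e "const p = require('puppeteer');
p.connect({ browserWSEndpoint: 'ws://localhost:3000' })
 .then(async b => { console.log(await b.version()); await b.disconnect(); });"
```

Keeping Chrome in its own container means the scraper itself stays a small, restartable process.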


> You can use the "browserless" docker service which contains a headless chrome browser in a docker container.

A lot of websites can detect the IPs setups like this run from and block them, which is basically like hitting a CAPTCHA wall.

I had other needs like Postgres, etc.

Scraping data is one thing; actually doing anything with it is another. I quickly hit the limits of a $50 Raspberry Pi 4, or whatever they're going for on Amazon these days with the gouging, etc.



