
Eh.

I tried to do Chromium/Puppeteer based scraping this way.

Building a Dockerfile took ages due to the low compute. (Rust was a non-starter).

I also had (foolishly) only bought the Pi with 2GB instead of 8GB so RAM was an issue.

Disk was super slow.

I'm not sure how viable this is, especially given how hard it currently is to source a Pi, let alone its compute and memory constraints.



> Building a Dockerfile took ages due to the low compute

Why compile anything on the Raspberry Pi? Cross-compile once on a machine with more compute (like your laptop, desktop, phone, or an EC2 instance), then transfer the compiled binaries or the built Docker image over.
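As a sketch of the "build elsewhere, run on the Pi" idea (image name, tag, and hostname are all illustrative, and this assumes Docker with buildx on the fast machine):

```shell
# Cross-build an arm64 image on an x86 laptop/desktop:
docker buildx build --platform linux/arm64 -t scraper:arm64 --load .

# Ship it to the Pi without needing a registry:
docker save scraper:arm64 | ssh pi@raspberrypi.local docker load
```

The same idea works without Docker: cross-compile the binary with your toolchain of choice and `scp` it over.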

> Pi with 2GB instead of 8GB

For headless Chrome, 2GB should be enough unless you're doing other stuff with it. Unless you mean for compiling, which, as above, can be done elsewhere.


You're describing the worst possible scenario.

First of all, the answer recommends using either a Pi or a VPS. If a Pi doesn't cut it for you, just switch to a VPS with sufficient specs for your requirements. Problem solved; now it's viable.

Besides, a significant part of the web can be scraped without resorting to a heavyweight browser such as Chromium. It should always be the last resort. Even if you have to evaluate JavaScript (as with SPAs), there are much cheaper solutions than Puppeteer (JSDOM, for example) that can get the job done most of the time.
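To illustrate how far you can get without any browser at all (the URL and the `<h1>` pattern are just examples; real pages usually want a proper HTML parser):

```shell
# Fetch a static page and pull a value out with plain text tools --
# no Chromium, no Puppeteer, a few MB of RAM.
curl -s https://example.com/ | grep -oP '(?<=<h1>).*(?=</h1>)'
```

If the data is rendered server-side, this class of one-liner is often the whole scraper.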

As to Docker, I fail to see why you would need Docker for this kind of job, unless you don't know how to do it without Docker.

When there is no portability requirement, the costs of Docker easily surpass its benefits.

...

So no, it's not that what the answer recommends isn't viable; you're doing it wrong. Either you're using a Pi when you need a VPS, or you're introducing unnecessary layers.


You really don't need docker for this.


It's true that you don't, but I can see the advantages.

I have a Raspberry Pi that is natively running a scraper using headless Chromium and cron. It works great, except....

I ended up needing a virtual framebuffer. I got it working on the Raspberry Pi, but then I got a new workstation and wanted to edit and test my script there. I hit cryptic errors that I had to debug just to understand they were framebuffer issues, then attempt to recreate the setup running on my Pi, then debug that...
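For what it's worth, the usual way to get a virtual framebuffer on a displayless box is `xvfb-run` (the script name and screen geometry below are just placeholders):

```shell
# Start a throwaway X virtual framebuffer and run the scraper inside it,
# so Chromium can open a "display" on a machine with no screen attached.
xvfb-run -a --server-args="-screen 0 1280x720x24" node scrape.js
```

Capturing that one invocation (or the Xvfb package install) in a README or Dockerfile is the part that saves the re-debugging later.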

My first mistake was not writing down what I did in my README, but a Docker image would have saved me a ton of time here.


Why couldn't you keep editing your script on your Pi?

Isn't proposing Docker as a solution to this going nuclear?

I think there are so many cheaper things you could do to solve this.

For me, this is absolutely no justification for using Docker in this scenario.


The script takes 60+ seconds to run on an RPi and fewer than 10 on an i9, allowing for significantly faster iteration.


That doesn't really answer the question of what Docker actually does to help here, other than making everything slower and more complex.


That's really deviating from the nature of the "cheapest, easiest way to host a cronjob" question. If the OP has that kind of requirement, he won't get good answers.


You can use the "browserless" Docker service, which ships a headless Chrome browser in a container. It also supports the Puppeteer and Playwright connect APIs. Works flawlessly! I use it in combination with n8n, all on a Raspberry Pi 4B (yes, I got one recently).
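The setup is roughly this (the image name/tag and port vary by browserless version, so treat these as illustrative):

```shell
# Run the browserless container, exposing its WebSocket endpoint:
docker run -d -p 3000:3000 browserless/chrome

# In the scraper, connect to it instead of launching Chrome locally
# (Node one-liner for illustration; requires puppeteer installed):
node -e "const p = require('puppeteer');
p.connect({ browserWSEndpoint: 'ws://localhost:3000' })
 .then(async b => { console.log(await b.version()); await b.disconnect(); });"
```

Keeping Chrome in its own container means the scraper itself stays a small, restartable process.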


> You can use the "browserless" docker service which contains a headless chrome browser in a docker container.

A lot of websites can detect the IPs setups like this run from and block them, which is basically like hitting a CAPTCHA wall.

I had other needs like Postgres, etc.

Scraping data is one thing; actually doing anything with it is another. I quickly hit the limits of a $50 Raspberry Pi 4, or whatever they're going for on Amazon these days with the gouging, etc.



