I needed a reader view library for a side project and decided to compare the most popular options (repo at https://github.com/awendland/readable-web-extractor-comparis...). Among cleanview, metascraper, @postlight/mercury-parser, and mozilla/readability I thought that mozilla/readability performed the best because of its consistent extraction of the primary content and minimal mangling of the semantic structure.
Interesting. Frankly, I didn't put much thought into the choice myself. I picked the Mozilla version because I use it on Firefox every day and it seems to work fine.
Hi, I'm the developer of rdrview. I posted the project as a "Show HN" a week ago or so but it got no traction, so I'm really surprised to find it on the front page today. I'll be happy to answer any questions you may have.
> but it got no traction, so I'm really surprised to find it on the front page today
There's a lot of randomness in what happens to get noticed or get traction on HN. We have various tricks to try to mitigate that. One is that we allow a small number of reposts if an article hasn't had significant attention yet (https://news.ycombinator.com/newsfaq.html).
> There's a lot of randomness in what happens to get noticed or get traction on HN.
I'm the kind of user that just lurks in the front page, so I don't really have any right to complain if my work doesn't get noticed by others. I'm glad that it did though.
I was working on a similar project recently and kept banging my head against GDPR walls; the most annoying was TechCrunch: you don't just get a modal you could bypass, they send you to another site entirely.
The core of my code is a line-by-line translation of the Firefox version. I know what it does, but not the exact motivation for everything, so it contains many hidden tricks I've never noticed. I'm not in Europe and I've never tested this, but it's possible that it removes some of those modals, as long as the actual content is in the page.
It won't help with the TechCrunch case you describe, though, because it only fetches the one webpage you point it to (plus any redirects).
I took a very quick look at the source code, and it seems you're using the curl default options for things like the user agent. Please correct me if I'm wrong!
Have you tried pretending to be a search engine crawler, just as an idea?
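For a quick test with plain curl, something like this shows what a site serves to a crawler user agent (in libcurl it would be the CURLOPT_USERAGENT option; the UA string below is Googlebot's published one, and the URL is a placeholder):

    curl -L -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/article

No guarantee it helps, of course; some sites verify that Googlebot requests actually come from Google's IP ranges.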
I would love to have a whole browser based around Reader View. Remove the zany hip web 3.0 layouts, the fancy fonts, remove the dickbars and cookie warnings, etc. Just show me the content _without_ having to endure the original site first.
(I understand that navigating some sites would be a challenge as Reader Mode doesn't always know exactly what is cruft, and what isn't. But I don't see it as insurmountable.)
I would like that too. You could even extract the menus from the webpage and display them in a "native" way. The big problem is that most websites will never cooperate with this, so it will always be a hack.
Turning reader view on by default on my iPhone transformed how I experienced being online. So worth it, especially as you can add sites to a non-reader (ignore) list so stuff like YouTube doesn't get caught up in it.
Although it used to fully load the page and then convert it to reader view, which was a bit annoying on an old and increasingly sluggish iPhone 6.
I've personally used mercury-parser (https://github.com/postlight/mercury-parser/) in the past. It can output an article as plain text, Markdown, or HTML. I go with Markdown in order to keep a permanent copy of the article within my notes.
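If I remember the CLI right, it's a one-liner (the URL here is just a placeholder):

    mercury-parser https://example.com/article --format=markdown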
Slightly related: I'm always confused when I stumble upon an article that I can view just fine in Firefox's reader view, but doesn't get recognised as an article by Mozilla's Pocket, making me unable to highlight stuff.
> Slightly related: I'm always confused when I stumble upon an article that I can view just fine in Firefox's reader view, but doesn't get recognised as an article by Mozilla's Pocket, making me unable to highlight stuff.
Granted, Mozilla owns Pocket now, but Pocket was its own thing long before the acquisition, and I doubt their codebases have really merged.
Yeah, I'm paying for it (mostly to give some money to Mozilla on a regular basis), but I can't say I'm happy with the slow development.
For example, the API has no way of extracting the highlights, I can't scrape them because their login form is behind Google's CAPTCHA, and at the same time I know of third-party services that somehow have access to the highlights (like readwise.io). Contacting them just resulted in "we're a small team and have no updates on when we'll expand our API".
Maybe, if you don't mind Node and all the dependencies. Not everybody is willing or able to put up with that, particularly when it comes to small "unix tools".
This is really useful. I remember a research project where we tried to crawl the internet for text, and cleaning it up was a hassle (I think we used jsoup to extract the text). Firefox's reader is not entirely flawless, but more tools are definitely better.
I remember that at some point I was using something similar to readability.js as a pre-processor in plucker distiller (a program to spider and store websites so that you could read offline on e.g. PalmOS devices).
Ohhhhh I'm adding that as a filter for newsboat tonight.
Instead of loading castrated headline + tagline stubs, I'll be able to read the whole article in the terminal without any of the ads and "also see" BS. <3
This is turning out to be a bit more involved, mostly because I have no experience generating RSS...
Newsboat filters operate on the whole feed, so the script needs to read the input RSS (cf. feedparser), then retrieve the HTML (cf. rdrview), and then generate RSS again. Errrr... feedparser doesn't do that last bit.
And then add a layer of on-disk caching, because otherwise it would keep re-downloading all the entries over and over again.
I'll get back to it tomorrow, I guess. Now for some shut-eye.
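Update: here's the rough shape I had in mind, as an untested sketch. It assumes the third-party Python feedparser library, and the rdrview flag for dumping the extracted HTML to stdout ("-H" below) is a guess on my part; check the man page.

    #!/usr/bin/env python3
    # Untested sketch of a newsboat filter: the feed comes in on stdin,
    # each entry's description gets replaced by rdrview's extraction of
    # the linked page, with an on-disk cache to avoid re-fetching.
    import hashlib, pathlib, subprocess, sys
    from xml.sax.saxutils import escape

    import feedparser  # third-party; only parses, doesn't generate

    CACHE = pathlib.Path.home() / ".cache" / "newsboat-rdrview"
    CACHE.mkdir(parents=True, exist_ok=True)

    def readable(url):
        # One cache file per entry, keyed by a hash of the URL.
        key = CACHE / hashlib.sha256(url.encode()).hexdigest()
        if key.exists():
            return key.read_text()
        # "-H" is my guess for "print extracted HTML"; check the man page.
        html = subprocess.run(["rdrview", "-H", url],
                              capture_output=True, text=True).stdout
        key.write_text(html)
        return html

    feed = feedparser.parse(sys.stdin.read())
    # feedparser only parses, so the output RSS is rebuilt by hand.
    print('<?xml version="1.0"?><rss version="2.0"><channel>')
    print("<title>%s</title>" % escape(feed.feed.get("title", "")))
    for e in feed.entries:
        link = e.get("link")
        if not link:
            continue
        print("<item>")
        print("<title>%s</title>" % escape(e.get("title", "")))
        print("<link>%s</link>" % escape(link))
        print("<description>%s</description>" % escape(readable(link)))
        print("</item>")
    print("</channel></rss>")

The cache is just one file per entry URL, so nothing gets fetched twice; good enough for a feed reader.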
This is great! I'm very impressed with the effort that must have gone into transpiling the original JS codebase into C by hand.
Any plans to make this available as a library? Would be nice to use this as a C-library in other languages, instead of current approaches where it gets re-written everywhere.
I haven't looked too deeply into the lynx codebase, but it might be tricky to do this without extra dependencies. Lynx already parses html obviously, but I don't know if it ever needs to make modifications to it.
Now, it would be great if there were a way to pipe the HTML of a webpage into another tool from inside lynx. Maybe that already exists and I'm just not aware of it.
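Come to think of it, lynx has an EXTERNAL mechanism in lynx.cfg for passing the current URL to an outside command. If I remember the syntax right, something like this would hand pages to rdrview (the trailing TRUE just enables the entry):

    EXTERNAL:http:rdrview -B lynx %s:TRUE
    EXTERNAL:https:rdrview -B lynx %s:TRUE

and then, if my memory of the keybindings is correct, "." runs the command on the link under the cursor and "," on the current page.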
Readability is a real boon for older and slower computers, too. I'm trying to make it more prominent in TenFourFox for that reason. We're using the latest version and it really works well on all kinds of pages.
I read websites in Lynx for a while... it really reduces the distractions. Some sites have been broken for years, but it's more usable than you'd think.
I use lynx for RSS. Most websites do work, but I find it annoying to have to scroll past all the menus and forms, which are sometimes several terminal pages long. Rdrview doesn't replace lynx; it just removes the clutter before you open the page in lynx.
Ditto, but for some pages I use a Gopher proxy such as gopher://codevoid/1/hn or gopher://gopherddit.com, as they keep the comment threading layout as-is, allowing you to follow a conversation perfectly.
Lynx is great, but I personally prefer w3m. Of course it's still quite limited without CSS/JS, but it renders websites way nicer (with shiny HTML features like frames, tables, etc.), and it even supports displaying images.
With the right configuration, it can also do things like automatically rewriting URLs, or youtube-dl'ing a page and piping the output to mpv.
My big complaint is that this means I get no value from my Tridactyl keybindings. I would use reader mode on almost everything by default if it weren't for this.
Sorry, I didn't pay enough attention when reading! That is actually pretty nice and one could use this to implement a reader-view browser extension. I like that it uses seccomp for the HTML handling as well.
You can block 99% of ads using just the hosts file as well. Also, reader mode blocks out everything other than static images and text, and static images as ads are almost non-existent nowadays (unfortunately).
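For anyone unfamiliar: the hosts trick is just pointing known ad domains at a null address in /etc/hosts (the domains below are placeholders; real ones come from published blocklists):

    0.0.0.0 ads.example.com
    0.0.0.0 tracker.example.net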
You're right and it all depends on what you consider acceptable or not. Still, reader mode will let some things slip through, like cookies, which could be especially undesirable on blogging sites like Medium.
Remember that in Europe, websites are obligated to get your consent before collecting your data, and they record when you click "accept".
Reader view is the exact solution you need to counter this.
Right-click -> "Block element" in uBlock Origin is also an impressively easy solution for those consent popups, since a click on "accept" does have value in court.
The AdGuard Annoyances list, listed by default in uBlock Origin, blocks many of those cookie warnings. Another list is here: https://www.i-dont-care-about-cookies.eu/
I've seen this problem with Fedora, the default mailcap settings are not sane. On Debian it worked fine by default. If you don't want to edit your mailcap, you can use the -B flag to pick a browser, like the sibling comment said.
Yes, the man page documents the option to specify the browser on command line with -B/--browser or in the environment as RDRVIEW_BROWSER. By default it parses your mailcap preferences for text/html.
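For example:

    $ rdrview -B lynx https://example.com/article
    $ RDRVIEW_BROWSER=w3m rdrview https://example.com/article

Or you can give mailcap a sane text/html entry, something like:

    text/html; lynx %s; needsterminal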
For a quick preview of each library on a random sample of 16 articles posted to HN, see https://github.com/awendland/readable-web-extractor-comparis... (you’ll need to expand a row to see its results).