Rdrview – Firefox Reader View as a Linux command line tool (github.com/eafer)
238 points by ashitlerferad on Oct 19, 2020 | 61 comments



I needed a reader view library for a side project and decided to compare the most popular options (repo at https://github.com/awendland/readable-web-extractor-comparis...). Among cleanview, metascraper, @postlight/mercury-parser, and mozilla/readability I thought that mozilla/readability performed the best because of its consistent extraction of the primary content and minimal mangling of the semantic structure.

For a quick preview of each library on a random sample of 16 articles posted to HN, see https://github.com/awendland/readable-web-extractor-comparis... (you’ll need to expand a row to see its results).


Interesting. Frankly, I didn't put much thought into the choice myself. I picked the Mozilla version because I use it on Firefox every day and it seems to work fine.


Hi, I'm the developer of rdrview. I posted the project as a "Show HN" a week ago or so but it got no traction, so I'm really surprised to find it on the front page today. I'll be happy to answer any questions you may have.


Just wanted to say that I love that you do the parsing in a sandbox. Very well written!


Thanks, I'm glad that you like it.


> but it got no traction, so I'm really surprised to find it on the front page today

There's a lot of randomness in what happens to get noticed or get traction on HN. We have various tricks to try to mitigate that. One is that we allow a small number of reposts if an article hasn't had significant attention yet (https://news.ycombinator.com/newsfaq.html).


> There's a lot of randomness in what happens to get noticed or get traction on HN.

I'm the kind of user that just lurks in the front page, so I don't really have any right to complain if my work doesn't get noticed by others. I'm glad that it did though.

> we allow a small number of reposts if an article hasn't had significant attention yet

Does that apply to Show HN too?


Yes.


By the way, in case anyone is wondering, I've been told that rdrview can be built on OpenBSD as well [1]. I intend to add support for the BSDs soon.

[1] https://old.reddit.com/r/commandline/comments/jaluzg/firefox...


I was recently working on a similar project and ran head-first into the GDPR walls, the most annoying being TechCrunch: you don't just get a modal that you could bypass, they send you to another site.

Did you find a way to deal with this hindrance?


The core of my code is a line by line translation of the Firefox version. I know what it does, but not the exact motivation for everything, so it has many hidden tricks I never noticed. I'm not in Europe and I never tested this, but it's possible that it does remove some of the modals, as long as the actual content is on the page.

It won't do anything for the TechCrunch case you describe, because it only fetches the one webpage you point it to (and any redirections).


I took a very quick look at the source code, and it seems you’re using the curl default options for things like the user agent. Please correct me if I'm wrong!

One idea: did you try pretending to be a search engine crawler?


I didn't; I guess I could let the user pick an agent. You think that would help with GDPR?

So far, I've never run into this problem myself. If I did, I think I would use tor.
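
As a rough illustration of the user-agent idea discussed above, here is a minimal Python sketch that fetches a page with a caller-chosen User-Agent header. It is not how rdrview fetches pages; `CRAWLER_UA`, `build_request` and `fetch_html` are hypothetical names, the agent string is made up, and whether any site treats such a request differently is untested here.

```python
import urllib.request

# Hypothetical crawler-style agent string, used only for illustration.
CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"


def build_request(url: str, user_agent: str = CRAWLER_UA) -> urllib.request.Request:
    """Create a GET request that announces the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, user_agent: str = CRAWLER_UA, timeout: float = 10.0) -> str:
    """Download the page body and decode it, following redirects by default."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The downloaded HTML could then be handed to whichever reader-view extractor is in use.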


I would love to have a whole browser based around Reader View. Remove the zany hip web 3.0 layouts, the fancy fonts, remove the dickbars and cookie warnings, etc. Just show me the content _without_ having to endure the original site first.

(I understand that navigating some sites would be a challenge as Reader Mode doesn't always know exactly what is cruft, and what isn't. But I don't see it as insurmountable.)


I would like that too. You could even extract the menus from the webpage and display them in a "native" way. The big problem is that most websites will never cooperate with this, so it will always be a hack.


You can set up Safari like that, where it will, when it can render a reader view, default to that presentation.


Turning reader view on by default on my iPhone transformed how I experienced being online. So worth it. Especially as you can add sites to a non-reader (ignore) list so stuff like YouTube doesn't get caught up in it.

Although it used to fully load the page and then convert it to reader view, which was a bit annoying on an old and increasingly sluggish iPhone 6.


I've personally used mercury-parser (https://github.com/postlight/mercury-parser/) in the past. It can display an article in clear text, Markdown, and HTML. I go with Markdown in order to have a permanent copy of the article within my notes.

Slightly related: I'm always confused when I stumble upon an article that I can view just fine in Firefox's reader view, but doesn't get recognised as an article by Mozilla's Pocket, making me unable to highlight stuff.


> Slightly related: I'm always confused when I stumble upon an article that I can view just fine in Firefox's reader view, but doesn't get recognised as an article by Mozilla's Pocket, making me unable to highlight stuff.

Granted, Mozilla owns Pocket now, but it was its own thing long before the Mozilla acquisition, and I doubt their codebases have really merged.


Yeah, I'm paying for it (mostly to give some money to Mozilla on a regular basis), but I can't say I'm happy with the slow development.

For example, the API has no way of extracting the highlights, I can't scrape them because their login form is behind Google's CAPTCHA, and at the same time I know of third-party services that somehow have access to the highlights (like readwise.io). Contacting them just resulted in "we're a small team and have no updates on when we'll expand our API".


I have a thin CLI wrapper for Mozilla Readability (https://github.com/NightMachinary/readability-cli). I use it to create ebooks from URLs to read stuff on my Kindle.


Yeah, it seems like readability-cli is preferable, as it doesn't need to re-implement the original Mozilla algorithm.


Maybe, if you don't mind Node and all the dependencies. Not everybody is willing or able to put up with that, particularly when it comes to small "unix tools".


This is really useful. I can remember that we had a research project once where we tried to crawl the internet for text stuff and cleaning that up was a hassle (I think we used jsoup for extracting the text). Firefox reader is not entirely flawless but more tools are definitely better.


I remember that at some point I was using something similar to readability.js as a pre-processor in plucker distiller (a program to spider and store websites so that you could read offline on e.g. PalmOS devices).


Have you tried Apache Tika?


Ohhhhh I'm adding that as a filter for newsboat tonight. Instead of loading castrated headline + tagline stubs, I'll be able to read the whole article in the terminal without any of the commercials and "also see" BS. <3


I use it with newsboat myself, but I didn't know about the filters. I should really look into that.

I was just setting the BROWSER environment variable to a script that makes me choose if I want to open the article on rdrview (with lynx) or firefox.
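
A minimal sketch of such a chooser script, assuming the program that honours BROWSER calls it with the article URL as its only argument. The `-B` flag is the browser option mentioned elsewhere in this thread; the function names and the one-letter prompt are hypothetical, not the commenter's actual script.

```python
import subprocess
import sys
from typing import List


def build_command(choice: str, url: str) -> List[str]:
    """Map a one-letter answer to the command line that opens `url`."""
    if choice.strip().lower() == "r":
        # -B tells rdrview which browser should display the cleaned page.
        return ["rdrview", "-B", "lynx", url]
    # Anything else falls back to the full browser.
    return ["firefox", url]


def main(argv: List[str]) -> int:
    """Prompt once, then run the chosen viewer and return its exit status."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} URL", file=sys.stderr)
        return 2
    choice = input("[r]drview or [f]irefox? ")
    return subprocess.call(build_command(choice, argv[1]))
```

Saved as an executable file that ends with a call to `main(sys.argv)`, it could be pointed to from the BROWSER variable.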


I use newsboat too. Would you mind sharing the filter?


once I finish the daily workywork and get around to building it, sure why not? :P


bump :)


This is turning out to be a bit more involved. Mostly because I have no experience in generating rss...

Newsboat filters operate on the whole feed. So the script needs to read the input rss (cf feedparse) then retrieve the HTML (cf rdrview) and then generate rss errrr.... Feedparse doesn't do that last bit. And then add a layer of on disk caching because otherwise it would keep re-downloading all entries over and over again.
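
The steps above (read the feed, fetch each article, write the feed back out, cache on disk) could be sketched roughly as follows. This is only an outline under stated assumptions: it handles plain RSS 2.0 `<item>` elements and not Atom, it uses the standard library's ElementTree in place of a feed-parsing library, and it assumes `rdrview -H URL` prints the cleaned article HTML to stdout. All function names are hypothetical.

```python
import hashlib
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable


def rdrview_html(url: str) -> str:
    """Return cleaned article HTML (assumes `rdrview -H URL` writes it to stdout)."""
    result = subprocess.run(
        ["rdrview", "-H", url], capture_output=True, text=True, check=True
    )
    return result.stdout


def cached(extract: Callable[[str], str], cache_dir: Path) -> Callable[[str], str]:
    """Wrap an extractor so each URL is fetched once and then read from disk."""
    def wrapper(url: str) -> str:
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Hash the URL to get a filesystem-safe cache file name.
        path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
        if path.exists():
            return path.read_text(encoding="utf-8")
        html = extract(url)
        path.write_text(html, encoding="utf-8")
        return html
    return wrapper


def expand_feed(rss_text: str, extract: Callable[[str], str]) -> str:
    """Replace every item's description with the full article and re-serialize."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # nothing to fetch for this entry
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        # ElementTree escapes the HTML when the feed is serialized.
        description.text = extract(link.strip())
    return ET.tostring(root, encoding="unicode")
```

A filter script would then write `expand_feed(sys.stdin.read(), cached(rdrview_html, some_cache_dir))` to stdout, with `some_cache_dir` being whatever directory you pick for the cache.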

I'll get back to that tomorrow I guess. Now for shut-eye


This is great! I'm very impressed with the effort that must have gone into transpiling the original JS codebase into C by hand.

Any plans to make this available as a library? Would be nice to use this as a C-library in other languages, instead of current approaches where it gets re-written everywhere.

Kudos!


> I'm very impressed with the effort that must have gone into transpiling the original JS codebase into C by hand.

Thanks! Yes, it was harder than I expected.

> Any plans to make this available as a library?

I've been thinking about it, though I've never written a library before. I think I would like to get the code to build on other platforms first.


It would be super helpful to have such a tool be part of a command line browser such as lynx or links.


I haven't looked too deeply into the lynx codebase, but it might be tricky to do this without extra dependencies. Lynx already parses html obviously, but I don't know if it ever needs to make modifications to it.

Now, it would be great if there was a way to pipe the html from a webpage into another tool from inside lynx. Maybe that exists already and I'm just not aware of it.


> Now, it would be great if there was a way to pipe the html from a webpage into another tool from inside lynx.

For future reference, it appears that this can be done with elinks.


I really hope it happens!


lynx has -dump, which prints a rendered page to stdout.


The motivation for rdrview is cleaning up all the menus and clutter from the dump. It still uses lynx (or any other text browser) for the render.


Likewise `elinks -dump` if you prefer elinks.


Readability is a real boon for older and slower computers, too. I'm trying to make it more prominent in TenFourFox for that reason. We're using the latest version and it really works well on all kinds of pages.



I was reading websites in Lynx for a while...really reduces the distractions. Some sites have been broken for years, but it’s more usable than you’d think.


I use lynx for RSS. Most websites do work, but I find it annoying to have to scroll past all the menus and forms, which are sometimes several pages in the terminal. Rdrview doesn't replace lynx, it just removes the clutter before you open the page on lynx.


Ditto, but for some pages I use the Gopher proxy as gopher://codevoid/1/hn or gopher://gopherddit.com as they keep the comment threading layout as is, allowing you to follow a conversation perfectly.


Lynx is great, but I personally prefer w3m[1]. Of course it's still quite limited without css/js, but it renders websites way nicer (with shiny html features like frames, tables, etc.), and even supports displaying images.

With the right configuration, it can also do stuff like automatically replacing urls, or youtube-dl'ing a page and piping its output to mpv.

[1] https://github.com/tats/w3m


It should be noted that reader mode bypasses browser extensions. So if you use something like uBlock Origin it will have no effect in reader mode.


My big complaint is that this means I get no value from my Tridactyl keybindings. I would use reader mode on almost everything by default if it weren't for this.


Rdrview doesn't actually use Firefox internally, it's just a rewrite of the Readability.js library.


Sorry, I didn't pay enough attention when reading! That is actually pretty nice and one could use this to implement a reader-view browser extension. I like that it uses seccomp for the HTML handling as well.


You can block 99% of ads using just the hosts file as well. Also, reader mode blocks out everything other than static images and text, and static images as ads are almost nonexistent nowadays (unfortunately).
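
For readers unfamiliar with the hosts-file approach, here is a small sketch of how a blocklist turns into hosts entries: each unwanted hostname is mapped to an unroutable address so lookups for it never reach the real server. The domain names and the `hosts_entries` helper are made up for illustration, and the 99% figure above is the commenter's estimate, not something this code measures.

```python
from typing import Iterable


def hosts_entries(domains: Iterable[str], sink: str = "0.0.0.0") -> str:
    """Build hosts-file lines that point each domain at a sink address."""
    seen = set()
    lines = []
    for domain in domains:
        name = domain.strip().lower()
        if not name or name in seen:
            continue  # skip blanks and duplicates
        seen.add(name)
        lines.append(f"{sink} {name}")
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical blocklist; real lists contain many thousands of names.
print(hosts_entries(["ads.example.com", "tracker.example.net"]), end="")
```

The resulting lines would be appended to the system hosts file (`/etc/hosts` on Linux), which requires administrator rights.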


You're right and it all depends on what you consider acceptable or not. Still, reader mode will let some things slip through, like cookies, which could be especially undesirable on blogging sites like Medium.


Sure, but it would probably end up filtering out much of the crap anyway.


Filtering isn't the painful part in many cases; the real pain to me is that I lose my vim keybindings.


Remember that in Europe, websites are obligated to get your consent before collecting your data, and will record when you click on "accept".

Reader view is the exact solution you need to counter this.

Right clicking -> block element with ublock origin is also an impressively easy solution for those consent popups, since such a click does have value in court.


The AdGuard Annoyances list, listed by default in uBlock Origin, blocks many of those cookie warnings. Another list is here: https://www.i-dont-care-about-cookies.eu/


Most of them seem to have some random CSS class names, so I'm not sure it works...


The output HTML is getting picked up by xdg-open and directed into Firefox. Is there any way of redirecting the output to lynx, e.g. `./rdrview '<site>' | lynx`?


I've seen this problem on Fedora, where the default mailcap settings are not sane. On Debian it worked fine by default. If you don't want to edit your mailcap, you can use the -B flag to pick a browser, like the sibling comment said.


Yes, the man page documents the option to specify the browser on command line with -B/--browser or in the environment as RDRVIEW_BROWSER. By default it parses your mailcap preferences for text/html.
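
To make the mailcap fallback more concrete, here is a simplified sketch of how a text/html viewer can be looked up in mailcap-style text. It is an approximation and not rdrview's actual lookup: it ignores backslash line continuations, escaped semicolons and `test=` clauses, and the function name is hypothetical.

```python
from typing import Optional


def html_viewer_from_mailcap(mailcap_text: str) -> Optional[str]:
    """Return the view command of the first entry that covers text/html."""
    for raw in mailcap_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Fields are separated by semicolons: type; command; optional flags...
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # malformed entry without a command
        mime_type, command = fields[0].lower(), fields[1]
        if mime_type in ("text/html", "text/*"):
            return command
    return None


sample = """\
# user mailcap
image/png; feh %s
text/html; lynx -force_html %s; needsterminal
"""
print(html_viewer_from_mailcap(sample))
```

If the first matching entry names a graphical browser, that is the one that gets picked, which is consistent with the behaviour described in the question above.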



