I needed a reader view library for a side project and decided to compare the most popular options (repo at https://github.com/awendland/readable-web-extractor-comparis...). Among cleanview, metascraper, @postlight/mercury-parser, and mozilla/readability I thought that mozilla/readability performed the best because of its consistent extraction of the primary content and minimal mangling of the semantic structure.
Interesting. Frankly, I didn't put much thought into the choice myself. I picked the Mozilla version because I use it on Firefox every day and it seems to work fine.
Hi, I'm the developer of rdrview. I posted the project as a "Show HN" a week ago or so but it got no traction, so I'm really surprised to find it on the front page today. I'll be happy to answer any questions you may have.
> but it got no traction, so I'm really surprised to find it on the front page today
There's a lot of randomness in what happens to get noticed or get traction on HN. We have various tricks to try to mitigate that. One is that we allow a small number of reposts if an article hasn't had significant attention yet (https://news.ycombinator.com/newsfaq.html).
> There's a lot of randomness in what happens to get noticed or get traction on HN.
I'm the kind of user that just lurks in the front page, so I don't really have any right to complain if my work doesn't get noticed by others. I'm glad that it did though.
I was working on a similar project recently and kept banging my head against GDPR walls; the most annoying was TechCrunch: you don't just get a modal you could bypass, they send you to another site entirely.
The core of my code is a line-by-line translation of the Firefox version. I know what it does, but not the exact motivation for everything, so it contains many hidden tricks I've never noticed. I'm not in Europe and I've never tested this, but it's possible that it removes some of those modals, as long as the actual content is in the page.
It won't help with the TechCrunch case you describe, though, because it only fetches the one webpage you point it to (plus any redirects).
I took a very quick look at the source code, and it seems you're using the curl default options for things like the user agent. Please correct me if I'm wrong!
Have you tried pretending to be a search engine crawler, just as an idea?
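For a quick test with plain curl, something like this shows what a site serves to a crawler user agent (in libcurl it would be the CURLOPT_USERAGENT option; the UA string below is Googlebot's published one, and the URL is a placeholder):

    curl -L -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/article

No guarantee it helps, of course; some sites verify that Googlebot requests actually come from Google's IP ranges.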
I would love to have a whole browser based around Reader View. Remove the zany hip web 3.0 layouts, the fancy fonts, remove the dickbars and cookie warnings, etc. Just show me the content _without_ having to endure the original site first.
(I understand that navigating some sites would be a challenge as Reader Mode doesn't always know exactly what is cruft, and what isn't. But I don't see it as insurmountable.)
I would like that too. You could even extract the menus from the webpage and display them in a "native" way. The big problem is that most websites will never cooperate with this, so it will always be a hack.
Turning reader view on by default on my iPhone transformed how I experienced being online. So worth it, especially as you can add sites to a non-reader (ignore) list so stuff like YouTube doesn't get caught up in it.
Although it used to fully load the page and then convert it to reader view, which was a bit annoying on an old and increasingly sluggish iPhone 6.
I've personally used mercury-parser (https://github.com/postlight/mercury-parser/) in the past. It can output an article as plain text, Markdown, or HTML. I go with Markdown in order to keep a permanent copy of the article within my notes.
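If I remember the CLI right, it's a one-liner (the URL here is just a placeholder):

    mercury-parser https://example.com/article --format=markdown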
Slightly related: I'm always confused when I stumble upon an article that I can view just fine in Firefox's reader view, but doesn't get recognised as an article by Mozilla's Pocket, making me unable to highlight stuff.
> Slightly related: I'm always confused when I stumble upon an article that I can view just fine in Firefox's reader view, but doesn't get recognised as an article by Mozilla's Pocket, making me unable to highlight stuff.
Granted, Mozilla owns Pocket now, but Pocket was its own thing long before the acquisition, and I doubt their codebases have really merged.
Yeah, I'm paying for it (mostly to give some money to Mozilla on a regular basis), but I can't say I'm happy with the slow development.
For example, the API has no way of extracting the highlights, I can't scrape them because their login form is behind Google's CAPTCHA, and at the same time I know of third-party services that somehow have access to the highlights (like readwise.io). Contacting them just resulted in "we're a small team and have no updates on when we'll expand our API".
Maybe, if you don't mind Node and all the dependencies. Not everybody is willing or able to put up with that, particularly when it comes to small "unix tools".
This is really useful. I remember a research project where we tried to crawl the internet for text, and cleaning it up was a hassle (I think we used jsoup to extract the text). Firefox's reader is not entirely flawless, but more tools are definitely better.
I remember that at some point I was using something similar to readability.js as a pre-processor in plucker distiller (a program to spider and store websites so that you could read offline on e.g. PalmOS devices).
Ohhhhh I'm adding that as a filter for newsboat tonight.
Instead of loading castrated headline + tagline stubs, I'll be able to read the whole article in the terminal without any of the ads and "also see" BS. <3
This is turning out to be a bit more involved, mostly because I have no experience generating RSS...
Newsboat filters operate on the whole feed, so the script needs to read the input RSS (cf. feedparser), then retrieve the HTML (cf. rdrview), and then generate RSS again. Errrr... feedparser doesn't do that last bit.
And then add a layer of on-disk caching, because otherwise it would keep re-downloading all the entries over and over again.
I'll get back to it tomorrow, I guess. Now for some shut-eye.
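Update: here's the rough shape I had in mind, as an untested sketch. It assumes the third-party Python feedparser library, and the rdrview flag for dumping the extracted HTML to stdout ("-H" below) is a guess on my part; check the man page.

    #!/usr/bin/env python3
    # Untested sketch of a newsboat filter: the feed comes in on stdin,
    # each entry's description gets replaced by rdrview's extraction of
    # the linked page, with an on-disk cache to avoid re-fetching.
    import hashlib, pathlib, subprocess, sys
    from xml.sax.saxutils import escape

    import feedparser  # third-party; only parses, doesn't generate

    CACHE = pathlib.Path.home() / ".cache" / "newsboat-rdrview"
    CACHE.mkdir(parents=True, exist_ok=True)

    def readable(url):
        # One cache file per entry, keyed by a hash of the URL.
        key = CACHE / hashlib.sha256(url.encode()).hexdigest()
        if key.exists():
            return key.read_text()
        # "-H" is my guess for "print extracted HTML"; check the man page.
        html = subprocess.run(["rdrview", "-H", url],
                              capture_output=True, text=True).stdout
        key.write_text(html)
        return html

    feed = feedparser.parse(sys.stdin.read())
    # feedparser only parses, so the output RSS is rebuilt by hand.
    print('<?xml version="1.0"?><rss version="2.0"><channel>')
    print("<title>%s</title>" % escape(feed.feed.get("title", "")))
    for e in feed.entries:
        link = e.get("link")
        if not link:
            continue
        print("<item>")
        print("<title>%s</title>" % escape(e.get("title", "")))
        print("<link>%s</link>" % escape(link))
        print("<description>%s</description>" % escape(readable(link)))
        print("</item>")
    print("</channel></rss>")

The cache is just one file per entry URL, so nothing gets fetched twice; good enough for a feed reader.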
This is great! I'm very impressed with the effort that must have gone into transpiling the original JS codebase into C by hand.
Any plans to make this available as a library? Would be nice to use this as a C-library in other languages, instead of current approaches where it gets re-written everywhere.
I haven't looked too deeply into the lynx codebase, but it might be tricky to do this without extra dependencies. Lynx already parses html obviously, but I don't know if it ever needs to make modifications to it.
Now, it would be great if there were a way to pipe the HTML of a webpage into another tool from inside lynx. Maybe that already exists and I'm just not aware of it.
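Come to think of it, lynx has an EXTERNAL mechanism in lynx.cfg for passing the current URL to an outside command. If I remember the syntax right, something like this would hand pages to rdrview (the trailing TRUE just enables the entry):

    EXTERNAL:http:rdrview -B lynx %s:TRUE
    EXTERNAL:https:rdrview -B lynx %s:TRUE

and then, if my memory of the keybindings is correct, "." runs the command on the link under the cursor and "," on the current page.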
Readability is a real boon for older and slower computers, too. I'm trying to make it more prominent in TenFourFox for that reason. We're using the latest version and it really works well on all kinds of pages.
I read websites in Lynx for a while... it really reduces the distractions. Some sites have been broken for years, but it's more usable than you'd think.
I use lynx for RSS. Most websites do work, but I find it annoying to have to scroll past all the menus and forms, which are sometimes several terminal pages long. Rdrview doesn't replace lynx; it just removes the clutter before you open the page in lynx.
Ditto, but for some pages I use a Gopher proxy such as gopher://codevoid/1/hn or gopher://gopherddit.com, as they keep the comment threading layout as-is, allowing you to follow a conversation perfectly.
Lynx is great, but I personally prefer w3m. Of course it's still quite limited without CSS/JS, but it renders websites way nicer (with shiny HTML features like frames, tables, etc.), and it even supports displaying images.
With the right configuration, it can also do things like automatically rewriting URLs, or youtube-dl'ing a page and piping the output to mpv.
My big complaint is that this means I get no value from my Tridactyl keybindings. I would use reader mode on almost everything by default if it weren't for this.
Sorry, I didn't pay enough attention when reading! That is actually pretty nice and one could use this to implement a reader-view browser extension. I like that it uses seccomp for the HTML handling as well.
You can block 99% of ads using just the hosts file as well. Also, reader mode blocks out everything other than static images and text, and static images as ads are almost non-existent nowadays (unfortunately).
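For anyone unfamiliar: the hosts trick is just pointing known ad domains at a null address in /etc/hosts (the domains below are placeholders; real ones come from published blocklists):

    0.0.0.0 ads.example.com
    0.0.0.0 tracker.example.net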
You're right and it all depends on what you consider acceptable or not. Still, reader mode will let some things slip through, like cookies, which could be especially undesirable on blogging sites like Medium.
Remember that in Europe, websites are obligated to get your consent before collecting your data, and they record when you click "accept".
Reader view is the exact solution you need to counter this.
Right-click -> "Block element" in uBlock Origin is also an impressively easy solution for those consent popups, since a click on "accept" does have value in court.
The AdGuard Annoyances list, listed by default in uBlock Origin, blocks many of those cookie warnings. Another list is here: https://www.i-dont-care-about-cookies.eu/
I've seen this problem with Fedora, the default mailcap settings are not sane. On Debian it worked fine by default. If you don't want to edit your mailcap, you can use the -B flag to pick a browser, like the sibling comment said.
Yes, the man page documents the option to specify the browser on command line with -B/--browser or in the environment as RDRVIEW_BROWSER. By default it parses your mailcap preferences for text/html.
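For example:

    $ rdrview -B lynx https://example.com/article
    $ RDRVIEW_BROWSER=w3m rdrview https://example.com/article

Or you can give mailcap a sane text/html entry, something like:

    text/html; lynx %s; needsterminal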
For a quick preview of each library on a random sample of 16 articles posted to HN, see https://github.com/awendland/readable-web-extractor-comparis... (you’ll need to expand a row to see its results).