A key problem with extracting article context is that there are so many distinct sources.
That said, power laws and Zipf functions apply, and a large fraction of HN front-page articles come from a relatively small set of domains. There's further aggregation possible when underlying publishing engines can be identified, e.g., WordPress, CMSes used by a large number of news organisations, Medium, Substack, GitHub, GitLab, Fediverse servers, and a number of static site generators (Hugo, Jekyll, Pelican, Gatsby, etc.).
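By way of illustration only (a sketch, not something from that project): one cheap signal for engine identification is the <meta name="generator"> tag that WordPress, Hugo, Jekyll, and friends emit, e.g.:

    # Guess the publishing engine from a page's <meta name="generator"> tag.
    # Quick-and-dirty: assumes the name attribute precedes content; a real
    # HTML parser would be sturdier.
    import re

    def guess_engine(html: str):
        m = re.search(
            r'<meta[^>]+name=["\']generator["\'][^>]*content=["\']([^"\']+)',
            html, re.IGNORECASE)
        return m.group(1) if m else None  # e.g. "WordPress 6.4", "Hugo 0.120.4"

Sites that strip the tag fall through to None, at which point domain patterns or manual tagging take over.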
I suspect you're aware of most of this.
I have a set of front-page sites from an earlier scraping project:
(For the life of me I cannot remember what the 3rd column represents, though it may be a miscalculated cumulative percentage. The "category" field was manually supplied by me; every site with > 17 appearances has one, as do several below that threshold which could be identified by other means, e.g., regexes on blogging engines, GitHub pages, etc.)
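That kind of "other means" categorisation amounts to domain-pattern rules along these lines (the patterns here are invented for illustration, not the originals):

    # Illustrative domain-to-category patterns (not the original rules).
    import re

    DOMAIN_CATEGORIES = [
        (re.compile(r'\.github\.io$'),        'github-pages'),
        (re.compile(r'(^|\.)substack\.com$'), 'substack'),
        (re.compile(r'(^|\.)medium\.com$'),   'medium'),
        (re.compile(r'\.wordpress\.com$'),    'wordpress-hosted'),
        (re.compile(r'\.blogspot\.com$'),     'blogger'),
    ]

    def categorise(domain):
        for pattern, category in DOMAIN_CATEGORIES:
            if pattern.search(domain):
                return category
        return None  # fall back to the manually supplied category, if any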
Further thoughts on article extraction: one idea that comes to mind is including extraction rules in the source selection metadata.
I'm using something along these lines right now to process sections within a given source, where I define the section-distinguishing element (taken from a headline URL), as well as the plaintext, position (within my generated page), lines of context, and maximum age (days) I'm interested in.
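Roughly the shape of such a per-section rule, with field names invented here for illustration:

    # Approximate shape of a per-section rule; field names are made up,
    # not the actual metadata format.
    from dataclasses import dataclass

    @dataclass
    class SectionRule:
        source: str          # e.g. "example-news.com" (hypothetical)
        url_element: str     # section-distinguishing element from the headline URL
        plaintext: str       # plaintext label for the section
        position: int        # where the section lands in the generated page
        context_lines: int   # lines of context to keep
        max_age_days: int    # ignore items older than this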
That could be extended or paired with a per-source rule that identifies the htmlq specifiers which pull out title, dateline, and byline elements from the source.
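Sketching that pairing (the selectors are invented examples, and the invocation assumes htmlq's --text flag for plain-text output):

    # Per-source extraction rules mapping fields to CSS selectors, applied by
    # shelling out to htmlq. Selectors here are placeholders, not real rules.
    import subprocess

    SOURCE_RULES = {
        "example-news.com": {            # hypothetical source
            "title":    "h1.headline",
            "dateline": "time.published",
            "byline":   "span.author",
        },
    }

    def extract(domain, html):
        fields = {}
        for field, selector in SOURCE_RULES.get(domain, {}).items():
            result = subprocess.run(
                ["htmlq", "--text", selector],
                input=html, capture_output=True, text=True)
            fields[field] = result.stdout.strip() or None
        return fields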
A further challenge is that such specifiers have a tendency to break as the publisher's back-end CMS changes, and finding ways to identify those breakages is ... difficult.
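One crude tripwire, offered only as a sketch: treat an empty extraction for required fields as probable selector drift and flag the source for review.

    # Crude drift tripwire: if the selectors for required fields return
    # nothing, assume the publisher's markup changed and flag the source.
    def probably_drifted(extracted, required=("title", "dateline", "byline")):
        missing = [field for field in required if not extracted.get(field)]
        return missing  # non-empty list => rules for this source need a look

It won't catch the nastier case where a selector still matches but now grabs the wrong element; re-running the rules against a small archive of known-good pages per source helps there.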
As for the site list itself: 17,782 sites in total, if I'm reading my past notes correctly. More on that project in an HN search: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...>
(Individual comments/posts seem presently unreachable due to an HN site bug.)