A key problem with extracting article content is that there are so many distinct sources.

That said, power laws and Zipf distributions apply, and a large fraction of HN front-page articles come from a relatively small set of domains. Further aggregation is possible when the underlying publishing engines can be identified, e.g., WordPress, the CMSes used by a large number of news organisations, Medium, Substack, GitHub, GitLab, Fediverse servers, and a number of static site generators (Hugo, Jekyll, Pelican, Gatsby, etc.).
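
As a quick illustration (not part of the original project), many of those engines announce themselves in a generator meta tag, which can be sniffed with a one-liner, assuming curl and htmlq are available:

  # Sketch: guess the publishing engine from <meta name="generator">.
  # Many engines (WordPress, Hugo, Jekyll, ...) emit it, though plenty
  # of sites strip it out.
  url="https://example.com/some-article"   # hypothetical URL
  curl -sL "$url" | htmlq -a content 'meta[name="generator"]'
  # Typical outputs look like: "WordPress 6.5", "Hugo 0.125.4", "Jekyll v4.3.3"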

I suspect you're aware of most of this.

I have a set of front-page sites from an earlier scraping project:

(For the life of me I cannot remember what the 3rd column represents, though it may be a miscalculated cumulative percentage. The "category" field was manually supplied by me; every site with > 17 appearances has one, as do several below that threshold that could be identified by other means, e.g., regexes matching blogging engines, GitHub Pages, etc.)

  Rank  Count    ???  Site :::: Category
  ------------------------------------------------------------- 
     1  7294   5.175  n/a :::: n/a
     2  3803   7.873  nytimes.com :::: general news
     3  3495  10.352  techcrunch.com :::: tech news
     4  1580  11.473  arstechnica.com :::: tech news
     5  1344  12.426  bloomberg.com :::: business news
     6  1288  13.340  wired.com :::: tech news
     7  1171  14.171  wsj.com :::: business news
     8  1099  14.951  youtube.com :::: video
     9  1026  15.678  wikipedia.org :::: general info (wiki)
    10   921  16.332  bbc.com :::: general news
    11   911  16.978  bbc.co.uk :::: general news
    12   893  17.612  theguardian.com :::: general news
    13   866  18.226  washingtonpost.com :::: general news
    14   846  18.826  reuters.com :::: general news
    15   829  19.414  economist.com :::: business news
    16   781  19.968  theatlantic.com :::: general interest
    17   631  20.416  arxiv.org :::: academic / science
    18   628  20.862  npr.org :::: general news
    19   622  21.303  nature.com :::: academic / science
    20   614  21.738  newyorker.com :::: general interest
    21   505  22.097  eff.org :::: law
    22   475  22.434  stanford.edu :::: academic / science
    23   471  22.768  ieee.org :::: technology
    24   456  23.091  reddit.com :::: general discussion
    25   448  23.409  amazon.com :::: corporate comm.
    26   445  23.725  microsoft.com :::: technology
    27   416  24.020  theverge.com :::: tech news
    28   410  24.311  venturebeat.com :::: business news
    29   408  24.600  quantamagazine.org :::: academic / science
    30   407  24.889  cnn.com :::: general news
17,782 sites in total, if I'm reading my past notes correctly.
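
If that third column is a cumulative percentage of all front-page appearances, it can be recomputed from the count column with a two-pass awk sketch (assuming a whitespace-separated rank/count file, which is my guess at the format):

  # Pass 1 sums the counts; pass 2 prints each rank's running share of the total.
  awk 'NR == FNR { total += $2; next }
       { cum += $2; printf "%6d %7d %7.3f\n", $1, $2, 100 * cum / total }' \
      counts.txt counts.txt

For what it's worth, the listed figures are consistent with exactly that: 7294 / 0.05175 ≈ 141,000 total appearances, and the later rows line up against the same total, so the column may not be miscalculated after all.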

More on that project in an HN search: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...>

(Individual comments/posts seem presently unreachable due to an HN site bug.)

Further thoughts on article extraction: one idea that comes to mind is including extraction rules in the source selection metadata.

I'm using something along these lines right now to process sections within a given source: from a headline URL I define the element that distinguishes the section, along with the plaintext, the position (within my generated page), the lines of context, and the maximum age (in days) I'm interested in.
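
For concreteness, a single section rule might look something like this (the field layout and names are my invention, not my actual format):

  # One pipe-separated rule per section: headline URL pattern, the element
  # that marks the section, plaintext label, position on the generated page,
  # lines of context, and maximum age in days.
  echo 'example.com/politics/*|div.story-politics|Politics|3|4|7' >> section-rules.psv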

That could be extended or paired with a per-source rule that identifies the htmlq selectors which pull out the title, dateline, and byline elements from the source.
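
A sketch of what that pairing could look like, assuming pipe-separated per-source rules and htmlq's -t/--text flag with CSS selectors (the selectors themselves are made up):

  #!/usr/bin/env bash
  # Hypothetical per-source rules: domain, then CSS selectors for the
  # title, dateline, and byline elements.
  cat > extract-rules.psv <<'EOF'
  example.com|h1.headline|time.published|span.byline
  example.org|h1#title|p.dateline|a.author
  EOF

  url="https://example.com/some-article"
  domain=${url#*//}; domain=${domain%%/*}; domain=${domain#www.}

  # Look up this domain's selectors, then pull each field out with htmlq.
  IFS='|' read -r title_sel date_sel byline_sel < <(
      awk -F'|' -v d="$domain" 'BEGIN { OFS = FS } $1 == d { print $2, $3, $4 }' \
          extract-rules.psv)

  page=$(curl -sL "$url")
  printf 'title:  %s\n'  "$(printf '%s' "$page" | htmlq -t "$title_sel")"
  printf 'date:   %s\n'  "$(printf '%s' "$page" | htmlq -t "$date_sel")"
  printf 'byline: %s\n'  "$(printf '%s' "$page" | htmlq -t "$byline_sel")"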

A further challenge is that such selectors tend to change as the publisher's back-end CMS evolves, and finding ways to detect those changes is ... difficult.
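
One cheap partial detector, sketched under the same assumptions as above: treat an empty extraction as a signal that a selector has gone stale.

  # If a selector that used to match now returns nothing, the publisher's
  # markup has probably changed and the rule needs human attention.
  title=$(printf '%s' "$page" | htmlq -t "$title_sel")
  if [ -z "$title" ]; then
      echo "WARN: title selector '$title_sel' matched nothing for $domain" >&2
  fi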

But grist for the mill, at any rate.
