
nice project, pieterma.

i'm curious about the decision to use Hellinger distance for the second round of UMAP - was that purely empirical or did you have some intuition about why it'd work well for this specific dataset?

also, out of curiosity, what's the most popular book on the map that doesn't have a clear genre cluster?




Thanks!

The cluster memberships that come out of the first round are distributions over the different clusters, e.g. a given book is weighted 0.8 for cluster A and 0.2 for cluster B. The Hellinger distance is well-suited to quantifying the difference between two distributions like that. Cosine similarity and Euclidean distance also worked, but Hellinger gave subjectively nicer results.
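
To make that concrete, here is a minimal sketch of the Hellinger distance on membership vectors like the one above. It is not the project's actual code: the `hellinger` helper and the three example books are made up for illustration, and it assumes each vector is non-negative and sums to 1.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumes p and q have the same length, are non-negative, and each sum to 1.
    Returns a value in [0, 1]: 0 for identical distributions, 1 when the
    distributions share no cluster at all.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Hypothetical membership weights over clusters (A, B).
book_x = [0.8, 0.2]  # mostly cluster A
book_y = [0.2, 0.8]  # mostly cluster B
book_z = [1.0, 0.0]  # entirely cluster A

d_xy = hellinger(book_x, book_y)
d_xz = hellinger(book_x, book_z)

# book_x should sit closer to book_z than to book_y.
print(f"x vs y: {d_xy:.3f}")
print(f"x vs z: {d_xz:.3f}")
```

Because the distance works on square roots of the weights, it is bounded between 0 and 1, which is part of why it is a natural fit for comparing membership distributions.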

Very interesting question, I'm not sure! While developing, I noticed that the systems thinking books were spread over different genres, which I found quite pleasing. However, I'm not sure if other books were even more diffuse. I'll have to dig back in and find out :)



