I think if you can make the vectorization methods more customizable and transparent, this could be a research accelerant too, since a lot of AI R&D on new domains or datasets has "make a good embedding" as a first step. It's not a very hard step right now, but I think you could make it faster to rapidly prototype something and then iterate on it, so long as you set it up to make the latter possible (i.e. inspectable, interoperable, etc.).
I think even though some people doubt the value of being able to compare disparate types via embedding, allowing it to be done more seamlessly makes a kind of "silly" (or, more charitably, "playful") research I happen to like a lot more feasible. In particular, artificially produced "synesthesia" that comes from tuning weird embedding comparisons could end up being really useful in some domains, because, as with human synesthetes, the underlying structure of one domain might provide counterintuitive insight or legibility into the other in some cases.
But all of this requires that the library allow fine-tuning and retraining of the underlying embeddings. It would also be useful to natively support coembeddings of different domains, given that things like CLIP drove the current wave of multimodal generative models.
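To make "coembedding" concrete, here's a minimal sketch of what I mean, using an off-the-shelf CLIP model via Hugging Face transformers. The model name, image path, and everything else here are my own illustration, not anything Radient provides:

```python
# Sketch: coembedding text and images into one shared space with CLIP.
# Illustrative only; model choice and the local image path are assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=Image.open("example.jpg"),  # hypothetical local image
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)

# Both modalities land in the same space, so cosine similarity across them is meaningful.
text_vecs = outputs.text_embeds    # shape: (2, 512)
image_vecs = outputs.image_embeds  # shape: (1, 512)
```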
Things are what people call them. Featurization/Feature Extraction used to refer to manual feature engineering, where you could tell what each numerical value is.
Vectorization, as colloquially used by developers in the AI space today, refers to the same thing being done via deep learning models, so it has less to do with ML features and more to do with generating a vector all at once, with each dimension not having a specific meaning.
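A rough sketch of the distinction (library choices here are purely for illustration): classic feature extraction gives you named, inspectable values, while an embedding model hands back one dense vector whose individual dimensions don't mean anything on their own.

```python
# Illustrative contrast only; sklearn / sentence-transformers chosen for brevity.
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Featurization / feature extraction: each dimension has a name you can read off.
cv = CountVectorizer()
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())  # ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']

# Vectorization via a deep model: one dense vector per doc, dimensions are opaque.
model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
vectors = model.encode(docs)                     # shape: (2, 384)
```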
Is it just me, or is vector search not particularly good?
It seems like magic at first, but then you start running into a bunch of issues:
1. Too much data within a single vector (often even just a few sentences) makes it so that most vectors are very close to each other due to many overlapping concepts.
2. Searching over a moderately sized corpus of documentation (e.g. a couple thousand pages of text) starts to degrade scoring (usually due to the above issue)
3. Every model I've tried fails pretty regularly on named entities (e.g. someone's name, a product, etc.) unless it's pretty well known
4. Getting granular enough to see useful variance requires generating a ton of embeddings, which start to cause performance bottlenecks really quickly
I've honestly had a lot more success with more traditional search methods.
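For what it's worth, here's a minimal sketch of the "traditional" side using BM25 via the rank_bm25 package (my library choice, purely for illustration); exact-term scoring like this handles rare named entities much better than most embedding models I've tried:

```python
# Minimal BM25 keyword-search sketch (rank_bm25 is just one convenient choice).
from rank_bm25 import BM25Okapi

docs = [
    "Radient turns audio, images, and molecules into vectors",
    "Acme FooWidget 3000 release notes and changelog",
    "General overview of vector databases and embeddings",
]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Exact-term matching keeps rare named entities like "FooWidget" findable.
query = "FooWidget 3000 changelog".lower().split()
print(bm25.get_top_n(query, docs, n=1))
```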
I’m not sure I get this. First of all, a perhaps-unnecessary question: who wants to search between molecules and audio files?
By the way, is this even supported? I noticed the audio example seems to return a vector of floating point numbers but the molecule vector is binary true/false values.
Anyway, what embedding model is used here? Can it be customized, or swapped out? And why is it binary only sometimes? It’s great that Radient is high-level and just provides “vectors” for things but I think a few details (and perhaps a small amount of customization) would go a long way.
We added a heterogeneous dataframe auto-vectorizer to our OSS lib last year for a few reasons. Imagine writing: `graphistry.nodes(cudf.read_parquet("logs/")).featurize(**optional_cfg).umap().plot()`
We like using UMAP, GNNs, etc. for understanding heterogeneous data like logs and other event & entity data, so we needed a way to easily handle date, string, JSON, etc. columns. Automatic feature engineering that we can tweak later is therefore important. Feature engineering is a bottleneck on bigger datasets, like working with 100K+ log lines or webpages, so we later added an optional GPU mode. The rest of our library can already run (opt-in) on GPUs, so that completed our flow of raw data => viz/AI/etc. end-to-end on GPUs.
To your point... most of our users need just numbers, dates, text, etc. We do occasionally hit the need for images... but it was easy to do externally and just append those columns. A one-size-fits-most approach is not obvious to me for embedding images when I think of our projects here. So this library is interesting to me if they can pick good encodings...
If automatic cpu/gpu feature engineering happens across heterogeneous dataframe columns, that's via pygraphistry's automation calls to our lower-level library cu_cat: https://github.com/graphistry/cu-cat
We've been meaning to write about cu_cat with the Nvidia RAPIDS team; it's a cool GPU fork of dirty_cat. We see anywhere from 2-100X speedups going from CPU to GPU.
It already has sentence_transformers built in. Due to our work with louie.ai <> various vector DBs, we're looking at revisiting how to make it even easier to plug in outside embeddings. Would be curious if any patterns would be useful there. Prior to this thread, we weren't even thinking folks would want images built-in as we find that so context-dependent...
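To sketch what "plug in outside embeddings" could look like from the user side (column naming, model choice, etc. are my assumptions, not pygraphistry's actual API): compute the embeddings externally and append them as ordinary dataframe columns, then run the usual featurize/umap flow on top.

```python
# Sketch: compute embeddings externally and append them as dataframe columns.
# Model choice and column naming are assumptions for illustration only.
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.DataFrame({"msg": ["login failed for admin", "disk quota exceeded"]})

model = SentenceTransformer("all-MiniLM-L6-v2")
embs = model.encode(df["msg"].tolist())  # shape: (2, 384)

# One column per embedding dimension, so downstream UMAP/GNN steps can use them.
emb_cols = pd.DataFrame(embs, columns=[f"emb_{i}" for i in range(embs.shape[1])])
df = pd.concat([df, emb_cols], axis=1)
```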
> I think a few details (and perhaps a small amount of customization) would go a long way.
I hear you and agree 100% - I unfortunately haven't gotten around to writing better documentation or solid code samples that use Radient yet.
Regarding molecule vectorization: that capability comes from RDKit (https://rdkit.org) - I just uploaded a sample to the /examples directory. You're right that molecule-to-audio and audio-to-molecule search is nonsensical from a semantic perspective, but I could see a third modality such as text or images that ties the two together, similar to what ImageBind is doing (https://arxiv.org/abs/2305.05665).
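That's also where the binary true/false values you noticed come from: RDKit's standard molecular fingerprints (e.g. Morgan/ECFP) are fixed-length bit vectors rather than float embeddings. A minimal sketch (whether Radient uses this exact fingerprint type is an assumption on my part):

```python
# Minimal RDKit Morgan fingerprint sketch; it shows why molecule vectors are binary.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin

# Morgan (ECFP-like) fingerprint: a fixed-length bit vector, not floats.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(list(fp)[:16])  # e.g. [0, 1, 0, 0, ...] -- one bit per hashed substructure
```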