I used to work on what your paper calls "unsupervised transport", that is, machine translation between two languages without alignment data. You note that this field has existed since ~2016 and you provide a number of references, but you only dedicate ~4 lines of text to this branch of research. There's no discussion of why your technique differs from this prior work or why the prior algorithms can't be applied to the output of modern LLMs.
Naively, I would expect off-the-shelf embedding alignment algorithms (like <https://github.com/artetxem/vecmap> and <https://github.com/facebookresearch/fastText/tree/main/align...>, neither of which are cited or compared against) to work quite well on this problem. So I'm curious if they don't or why they don't.
I can imagine there is lots of room for improvements around implicit regularization in the algorithms. Specifically, these algorithms were designed with word2vec output in mind (typically 300 dimensional vectors with 200000 observations), but your problem has higher dimensional vectors with fewer observations and so would likely require different hyperparameter tuning. IIRC, there's no explicit regularization in these methods, but hyperparameters like stepsize/stepcount can implicitly add L2 regularization, which you probably need for your application.
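To make the effect concrete, here's a minimal sketch (not any particular published method; the plain squared loss, zero initialization, and hyperparameter values are all stand-ins):

```python
import numpy as np

def align(X, Y, step_size=1e-3, n_steps=500):
    """Learn a linear map W with X @ W ~= Y by plain gradient descent.

    With a small step size and a capped step count, the iterates never
    move far from the W = 0 start, which behaves like implicit L2
    (ridge) shrinkage -- no explicit penalty term needed.
    """
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_steps):  # fewer steps => stronger implicit shrinkage
        grad = X.T @ (X @ W - Y) / len(X)
        W -= step_size * grad
    return W
```

In the word2vec regime (observations >> dimensions) that implicit shrinkage barely matters; with high-dimensional vectors and few observations it's likely doing a lot of the work, so the same step-size/step-count settings probably won't transfer.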
---
PS.
I *strongly dislike* your name of vec2vec. You aren't the first/only algorithm for taking vectors as input and getting vectors as output, and you have no right to claim such a general title.
---
PPS.
I believe there is a minor typo with footnote 1. The note is "Our code is available on GitHub." but it is attached to the sentence "In practice, it is unrealistic to expect that such a database be available."
Hey, I appreciate the perspective. We definitely should cite both those papers, and will do so in the next version of our draft. There are a lot of papers in this area, and they're all a few years old now, so you might understand how we missed two of them.
We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables. So some of this is covered. A lot of these methods also require a seed dictionary, which we don't have in our case. That said, you're welcome to take any number of these tools and plug them into our codebase; the results would definitely be interesting, although we'd expect the adversarial methods to still work best, as they do in the problem settings you mention.
As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
> We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables.
Sorry if I'm being obtuse, but I don't see any mention of the POT package in your paper or of what specific algorithms you used from it to compare against. My best guess is that you used the linear map similar to the example at <https://pythonot.github.io/auto_examples/domain-adaptation/p...>. The methods I mentioned are also linear, but contain a number of additional tricks that result in much better performance than a standard L2 loss, and so I would expect those methods to outperform your OT baseline.
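For reference, the kind of baseline I have in mind is something like POT's closed-form linear (Gaussian) OT map. I'm guessing at which estimator the paper used; `ot.da.LinearTransport` and the placeholder data below are my assumptions, not anything from the paper:

```python
import numpy as np
import ot  # Python Optimal Transport, https://pythonot.github.io/

rng = np.random.default_rng(0)
Xs = rng.normal(size=(2000, 300))              # "source" embeddings (placeholder)
Xt = rng.normal(size=(2000, 300)) * 1.5 + 0.1  # "target" embeddings (placeholder)

# Closed-form linear (Gaussian) OT map between the two point clouds.
mapper = ot.da.LinearTransport()
mapper.fit(Xs=Xs, Xt=Xt)
Xs_mapped = mapper.transform(Xs=Xs)  # Xs pushed into Xt's geometry
```

The vecmap/fastText-style methods are also linear maps at heart, but layer on the tricks I mentioned (e.g. normalization, orthogonality constraints, iterative self-learning), which is why I'd expect them to beat a plain map like this.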
> As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
But both of those papers are about generic vector alignment, so the generality of the name makes sense. Your contribution here seems specifically about the LLM use case, and so a name that implies the LLM use case would be preferable.
I do agree though that in general naming is hard and I don't have a better name to suggest. I also agree that there's lots of related papers, and you can't cite/discuss them all reasonably.
And I don't mean to be overly critical... the application to LLMs is definitely cool. I wouldn't have read the paper and written up my critiques if I didn't overall like it :)
Naming things is hard. Noting that the two alternative approaches you referenced are called "vecmap" and "alignment", the complaint that they "aren't the first/only algorithm for ..." and "have no right to claim such a general title" could easily apply to them as well.
Except those papers are 8ish years old; they actually were among the first 2-3 algs for this task; and they studied the fully general vector space alignment problem. But I agree that naming things is hard and don't have a better name.
Imagine having more than a passing understanding of philosophy, and then reading much of any major computer science papers. By this "No right to claim" logic, I'd have you all on trial.
The problem solved in this paper is strictly harder than alignment. Alignment works with multiple, unmatched representations of the same inputs (e.g, different embeddings of the same words). The goal is to match them up.
The goal here is harder: given an embedding of an unknown text in one space, generate a vector in another space that's close to the embedding of the same text -- but, unlike in the word alignment problem, the texts are not known in advance.
Neither unsupervised transport, nor optimal alignment can solve this problem. Their input sets must be embeddings of the same texts. The input sets here are embeddings of different texts.
FWIW, this is all explained in the paper, including even the abstract. The comparisons with optimal assignment explicitly note that this is an idealized pseudo-baseline, and in reality OA cannot be used for embedding translation (as opposed to matching, alignment, correspondence, etc.)
It seems like you have some misconceptions about Strassen's alg:
1. It is a standard example of the divide and conquer approach to algorithm design, not the dynamic programming approach. (I'm not even sure how you'd squint at it to convert it into a dynamic programming problem.)
2. Strassen's does not require complex-valued matrices. Everything can be done in the real numbers, as the sketch below shows.
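Concretely, one level of the recursion over plain real floats (a sketch assuming square matrices of even dimension; the full algorithm applies this recursively to the seven half-size products):

```python
import numpy as np

def strassen_step(A, B):
    # Seven products of half-size blocks instead of the naive eight.
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.ones((4, 4))
assert np.allclose(strassen_step(A, B), A @ B)  # all real arithmetic
```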
I think the OP was pointing out that the reason Strassen's algorithm works is that it somehow uncovered a kind of repeated work that's not evident in a simple divide and conquer approach. It's by the clever definition of the various submatrices that this "overlapping" work can be avoided.
In other words, the power of Strassen's algorithm comes from a strategy that's similar to / reminiscent of dynamic programming.
Building off this question, it's not clear to me why Python should have both t-strings and f-strings. The difference between the two seems like a stumbling block to new programmers, and my "ideal python" would have only one of these mechanisms.
f-strings immediately become a string, and are indistinguishable at runtime from a normal string. t-strings introduce an object so that libraries can do custom logic/formatting on the template strings, such as deciding _how_ to format the string.
My main motivation as an author of 501 was to ensure user input is properly escaped when inserted into SQL, which you can't enforce with f-strings.
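To illustrate the kind of thing this enables (a minimal sketch; the `sql` helper is hypothetical, not part of the PEP or any shipping library, and it needs Python 3.14's `string.templatelib`):

```python
from string.templatelib import Template, Interpolation

def sql(template: Template) -> tuple[str, list]:
    """Walk the template's parts, emit '?' placeholders for every
    interpolation, and hand the values to the driver separately so
    user input is never spliced into the query text."""
    query_parts, params = [], []
    for part in template:  # yields str and Interpolation objects in order
        if isinstance(part, Interpolation):
            query_parts.append("?")
            params.append(part.value)
        else:
            query_parts.append(part)
    return "".join(query_parts), params

name = "Robert'); DROP TABLE students;--"
query, params = sql(t"SELECT * FROM students WHERE name = {name}")
# query  == "SELECT * FROM students WHERE name = ?"
# params == ["Robert'); DROP TABLE students;--"]
# cursor.execute(query, params)  # the driver does the escaping
```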
> ensure user input is properly escaped when inserting into sql
I used to wish for that and got it in JS with template strings and libs around it. For what it’s worth (you got a whole PEP done, you have more credibility than I do) I ended up changing my mind, I think it’s a mistake.
It’s _nice_ from a syntax perspective. But it obscures the reality of sql query/parameter segregation, it builds an abstraction on top of sql that’s leaky and doesn’t even look like an abstraction.
And more importantly, it looks _way too close_ to the wrong thing. If the difference between the safe way to do sql and the unsafe way is one character and a non-trivial understanding of string formatting in python… bad things will happen. In a one-person project it’s manageable, in a bigger one where people have different experiences and seniority it will go wrong.
It’s certainly cute. I don’t think it’s a good thing for sql queries.
I understand your concern, and I think the PEP addresses it. Quite bluntly, t"foo" is not a string, while f"foo" is. You'll get a typecheck error if you run a typechecker like any reasonable developer, and will get a runtime error if you ignore the type mismatch, because t"foo" even lacks a __str__() method.
One statement the PEP could put front and center in the abstract could be "t-strings are not strings".
For one thing, `f"something"` is of type `str`; `t"something"` is of type `string.templatelib.Template`. With t-strings, your code can know which parts of the string were dynamically substituted and which were not.
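Concretely (Python 3.14; the attribute names come from PEP 750):

```python
name = "world"

f = f"hello {name}!"
t = t"hello {name}!"

type(f)  # <class 'str'>
type(t)  # <class 'string.templatelib.Template'>

t.strings                       # ('hello ', '!')  -- the static parts
t.interpolations[0].value       # 'world'          -- what was substituted
t.interpolations[0].expression  # 'name'           -- the source expression
```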
The types aren't so important. __call__ or reference returns type string; an f and a t will be interchangeable from the consumer side.
For example, if you can go through (I'm not sure you can) and trivially replace all your f's with t's, and then do some minor fixups where the final product is used, I don't think a migration from one to the other would be terribly painful. Time-consuming, yes.
Give it a few years to when f-string usage has worn off to the point that a decision can be made to remove it without breaking a significant number of projects in the wild.
The linter is a big deal, actually. I've worked with Python off and on during the past few decades; I just recently moved onto a project that uses Python with a bunch of linters and autoformatters enabled. I was used to writing my strings ('foo %s' % bar), and the precommit linter told me to write f'foo {bar}'. Easy enough!
printf-style formatting ("foo %s" % "bar") feels the most ready to be retired (except insofar as it probably never will, because it's a nice shortcut).
The other ones at least are based on the same format string syntax.
"foo {}".format("bar") would be an obvious "just use f-string" case, except when the formatting happens far off. But in that case you could "just" use t-strings? Except in cases where you're (for example) reading a format string from a file. Remember, t- and f- strings are syntactic elements, so dynamism prevents usage of it!
So you have the following use cases:
- printf-style formatting: some C-style string formatting is needed
- .format: You can't use an f-string because of non-locality in the data to format, and you can't use a t-string due to dynamism in the format string itself
- f-string: you have the template and the data in the same spot lexically, and you just want string concatenation (very common!)
- t-string: you have the template and the data in the same spot lexically, but want to use special logic to actually build up your resulting value (which might not even be a string!)
The last two additions being syntax makes it hard to use them to cover all use cases of the first two.
But in a specific use case? It's very likely that there is an exact best answer amongst these 4.
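The four, side by side (the t-string again assumes Python 3.14):

```python
user, n = "alice", 3

s1 = "hello %s, you have %d messages" % (user, n)      # printf-style
s2 = "hello {}, you have {} messages".format(user, n)  # .format (template can live elsewhere)
s3 = f"hello {user}, you have {n} messages"            # f-string: already a str
t4 = t"hello {user}, you have {n} messages"            # t-string: a Template for a
                                                       # library to render later
```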
> printf-style formatting ("foo %s" % "bar") feels the most ready to be retired (except insofar as it probably never will, because it's a nice shortcut).
It’s also the only one which is anywhere near safe to be user-provided.
`str.format` allows the format string to navigate through indexes, entries, and attributes. If the result of the formatting is echoed back and any non-trivial object is passed in, it allows for all sorts of introspection.
printf-style... does not support any of that. It can only format the objects passed in.
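A toy illustration of the difference (the class and its fields are made up):

```python
class User:
    def __init__(self, name, password_hash):
        self.name = name
        self.password_hash = password_hash

u = User("alice", "5f4dcc3b...")

# str.format lets the *format string itself* navigate attributes/items,
# so echoing a user-controlled template leaks anything reachable:
print("Hi {0.name}".format(u))      # intended use
print("Hi {0.__dict__}".format(u))  # oops: dumps password_hash too

# printf-style has no such navigation; it can only stringify the
# arguments you explicitly pass:
print("Hi %s" % u.name)
```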
Very good point. While I think we could do away with the syntactic shorthand, definitely would want to keep some function/method around with the capabilities.
And if it's being used, and isn't considered problematic, then it should remain. I've found use for all the current ones:
(1) for text that naturally has curlies,
(2) for templating
(3) for immediate interpolation, and improved at-site readability
I see (4) being about the flexibility of (2) and readability of (3). Maybe it'll eventually grow to dominate one or both, but it's also fine if it doesn't. I don't see (1) going away at all since the curly collision still exists in (4).
Putting down my marker on the opposite. Once you're targeting a version of python that has t-strings, decent linters/libraries have an excuse to put almost all uses of f-strings in the ground.
If the usage of a feature is significantly close enough to 0 because there is a well used alternative, what need is there for backward compatibility? If anything, it can be pushed to a third party package on PyPI.
I have a minor nit to pick. I actually prefer when tutorials provide the prompts for all code snippets for two reasons:
1. Many tutorials reference many languages. (I frequently write tutorials for students that include bash, sql, and python.) Providing the prompts `$`, `sqlite>` and `>>>` makes it obvious which language a piece of code is being written in.
2. Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts forces the user to copy/paste line by line. A good example is a sequence of commands that involves `sudo dd` to format a hard drive. But for really intro-level stuff I want the student/reader to carefully think about all the commands, and forcing them to copy/paste line by line helps achieve that goal.
That said, this is an overall good introduction to writing that I will definitely be making required reading for some of my data science students. When the book is complete, I'll be happily buying a copy :)
> Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts forces the user to copy/paste line by line.
I hardcore oppose this kind of thing, for the same reason I oppose people putting obstacles in the way of curl-to-bash.
Adding the prompt character doesn’t make people think, it just makes people press backspace. Frequently I’m reading a tutorial because I’m trying to assemble headless scripts for setting up a VM and I really just need verbatim lines I can copy/paste so I know I’ve got the right arguments.
>Many tutorials reference many languages. (I frequently write tutorials for students that include bash, sql, and python.) Providing the prompts `$`, `sqlite>` and `>>>` makes it obvious which language a piece of code is being written in.
I think it's fine to show the prompt character, but I think it's the author's job to make sure that copy/paste still works. I've seen a lot of examples that use CSS to show the prompt or line number without it becoming part of copied text, and I'm highly in favor of that.
I think if I had to choose between breaking copy/paste and making the language obvious with the prompt character, I'd exclude the prompt, but I think that's a matter of taste.
>Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts forces the user to copy/paste line by line. A good example is a sequence of commands that involves `sudo dd` to format a hard drive. But for really intro-level stuff I want the student/reader to carefully think about all the commands, and forcing them to copy/paste line by line helps achieve that goal.
Yeah, I agree about preventing the reader from copy/pasting something dangerous.
In tutorials that require a reboot, I'll never include a reboot command bunched in with other commands because I don't want the user to do it by mistake. And I agree for something like `dd`, you'd want to present it in a way to make it hard for the reader to make mistakes or run it thoughtlessly.
I'm not sure about that. There are markdown rendering engines where you can specify the language of a codeblock and it will render with specific CSS based on the language. So you can do something like ```bash ... ``` and it will show the code with each line prefixed by "$"
AFAIK specifying a language only makes a difference for syntax highlighting. I have never seen a markdown processor that would add prompts to the code based on the specified language.
Syntax highlighting is just CSS. There's nothing stopping you from adding your own custom CSS to the code block which will prefix lines with the prompts.
> I can talk about concepts like "atoms" or "bacteria" or "black holes" with anyone, and they'll know what they are - even if their knowledge of those subjects isn't in depth.
I'm not convinced this is an unalloyed good. Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria are. See, for example, all of the people who want antibiotics whenever they're sick for any reason. We've just replaced one set of weird beliefs in the general populace with another and given it a veneer of science.
> Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria are.
This is a poor example. Even an incomplete image of the germ theory of disease is a massive improvement over thinking illness is caused by demons. An extremely superficial understanding of bacteria as "microscopic organisms which can make you sick" gives good justification why people should do things like wash their hands, cover their mouth when coughing, and not lick the railing on a subway.
Knowing the difference between bacteria being living organisms and viruses being not-quite-alive does not qualify as a "deep understanding" though.
Further, the presence of people misunderstanding something that most of the population knows pretty well in no way makes teaching that subject to the population bad. Your assertion would require that believing demons cause sickness actually has benefits we've lost.
But more people know what bacteria are at a baseline level and what they do with diseases than before when all we had were demons/bad humors/etc.
There are functionally illiterate people too in modern day and the average reading level is still elementary school level, but that's vastly better than before when the average person couldn't read at all.
Suicide does not have stable reporting rates. It was very stigmatized in the past, and so investigators would notoriously report suicides as "unknown cause of death" if they could.
Violent crime, on the other hand, is much more correlated with things like poverty than with mental health.
I think it's quite obviously the case that there are no clear indicators of what "mental health" looked like 100 years ago. Any projections into the past will involve a lot of extrapolation and have all sorts of biases.
They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:
> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).
It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task, even if there are "better" ways of solving the Harry Potter problem.
> Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task
Not really. The "Harry Potter Problem" as formulated is asking an LLM to solve a problem that they are architecturally unsuited for. They do poorly at counting and similar algorithms tasks no matter the size of the context provided. The correct approach to allowing an AI agent to solve a problem like this one would be (as OP indicates) to have it recognize that this is an algorithmic challenge that it needs to write code to solve, then have it write the code and execute it.
Asking specific questions about your insurance policy is a qualitatively different type of problem that algorithms are bad at, but it's the kind of problem that LLMs are already very good at in smaller context windows. Making progress on that type of problem requires only extending a model's capabilities to use the context, not simultaneously building out a framework for solving algorithmic problems.
So if anything it's the reverse: solving the insurance problem would be a prerequisite to solving the Harry Potter Problem.
LLMs can't count well. This is in large part a tokenization issue. That doesn't mean they couldn't answer all those kinds of questions. Maybe the current state of the art can't. But you won't find out by asking it to count.
The WHO list of essential medicines is not just over-the-counter drugs. It includes things like the chemotherapy drug cisplatin. I happened to need that for testicular cancer ~10 years ago, and the treatment cost was $50k (as "paid" by insurance). That overall seems pretty reasonable to me for the treatment I received, but definitely not something I'd expect the median American to be able to pay out of pocket.
The median American would not have to pay out of pocket, as nearly every American has health insurance (since the ACA, it is actually illegal not to have insurance).
I think it's accurate to say that the median American is insured, with only 8% of the population uninsured [1]. Although, to put that percentage in perspective, that's 26 million people and likely thousands in excess mortality relative to the insured population.
I believe you're referring to the ACA's "individual mandate", which imposed a federal tax penalty for being uninsured. I won't argue whether that makes it illegal or not, but I can say that the individual mandate was eliminated by the Tax Cuts and Jobs Act in 2019 [1]. There's no longer a federal tax penalty for being uninsured.
This is purely anecdotal, but of that 8% (26 million), I would posit that most of those people are uninsured by choice. e.g., probably mostly young, maybe part-time workers without chronic illnesses.
Your wording in this comment (and the twitter/comment video) gives off the same vibes as the google april 1st videos for things like gmail motion (https://www.youtube.com/playlist?list=PLAD8wFTLnQKeDsINWn8Wj...). I honestly thought this was full sarcasm at first.