I used to work on what your paper calls "unsupervised transport", that is, machine translation between two languages without alignment data. You note that this field has existed since ~2016 and you provide a number of references, but you only dedicate ~4 lines of text to this branch of research. There's no discussion of why your technique differs from this prior work or why the prior algorithms can't be applied to the output of modern LLMs.
Naively, I would expect off-the-shelf embedding alignment algorithms (like <https://github.com/artetxem/vecmap> and <https://github.com/facebookresearch/fastText/tree/main/align...>, neither of which are cited or compared against) to work quite well on this problem. So I'm curious if they don't or why they don't.
I can imagine there is lots of room for improvements around implicit regularization in the algorithms. Specifically, these algorithms were designed with word2vec output in mind (typically 300 dimensional vectors with 200000 observations), but your problem has higher dimensional vectors with fewer observations and so would likely require different hyperparameter tuning. IIRC, there's no explicit regularization in these methods, but hyperparameters like stepsize/stepcount can implicitly add L2 regularization, which you probably need for your application.
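To make the effect concrete, here's a minimal sketch (not any particular published method; the plain squared loss, zero initialization, and hyperparameter values are all stand-ins):

```python
import numpy as np

def align(X, Y, step_size=1e-3, n_steps=500):
    """Learn a linear map W with X @ W ~= Y by plain gradient descent.

    With a small step size and a capped step count, the iterates never
    move far from the W = 0 start, which behaves like implicit L2
    (ridge) shrinkage -- no explicit penalty term needed.
    """
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_steps):  # fewer steps => stronger implicit shrinkage
        grad = X.T @ (X @ W - Y) / len(X)
        W -= step_size * grad
    return W
```

In the word2vec regime (observations >> dimensions) that implicit shrinkage barely matters; with high-dimensional vectors and few observations it's likely doing a lot of the work, so the same step-size/step-count settings probably won't transfer.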
---
PS.
I *strongly dislike* your name of vec2vec. You aren't the first/only algorithm for taking vectors as input and getting vectors as output, and you have no right to claim such a general title.
---
PPS.
I believe there is a minor typo with footnote 1. The note is "Our code is available on GitHub." but it is attached to the sentence "In practice, it is unrealistic to expect that such a database be available."
Hey, I appreciate the perspective. We definitely should cite both those papers, and will do so in the next version of our draft. There are a lot of papers in this area, and they're all a few years old now, so you might understand how we missed two of them.
We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables. So some of this is covered. A lot of these methods also require a seed dictionary, which we don't have in our case. That said, you're welcome to take any number of these tools and plug them into our codebase; the results would definitely be interesting, although we'd expect the adversarial methods to still work best, as they do in the problem settings you mention.
As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
> We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables.
Sorry if I'm being obtuse, but I don't see any mention of the POT package in your paper or of what specific algorithms you used from it to compare against. My best guess is that you used the linear map similar to the example at <https://pythonot.github.io/auto_examples/domain-adaptation/p...>. The methods I mentioned are also linear, but contain a number of additional tricks that result in much better performance than a standard L2 loss, and so I would expect those methods to outperform your OT baseline.
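For reference, the kind of baseline I have in mind is something like POT's closed-form linear (Gaussian) OT map. I'm guessing at which estimator the paper used; `ot.da.LinearTransport` and the placeholder data below are my assumptions, not anything from the paper:

```python
import numpy as np
import ot  # Python Optimal Transport, https://pythonot.github.io/

rng = np.random.default_rng(0)
Xs = rng.normal(size=(2000, 300))              # "source" embeddings (placeholder)
Xt = rng.normal(size=(2000, 300)) * 1.5 + 0.1  # "target" embeddings (placeholder)

# Closed-form linear (Gaussian) OT map between the two point clouds.
mapper = ot.da.LinearTransport()
mapper.fit(Xs=Xs, Xt=Xt)
Xs_mapped = mapper.transform(Xs=Xs)  # Xs pushed into Xt's geometry
```

The vecmap/fastText-style methods are also linear maps at heart, but layer on the tricks I mentioned (e.g. normalization, orthogonality constraints, iterative self-learning), which is why I'd expect them to beat a plain map like this.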
> As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
But both of those papers are about generic vector alignment, so the generality of the name makes sense. Your contribution here seems specifically about the LLM use case, and so a name that implies the LLM use case would be preferable.
I do agree though that in general naming is hard and I don't have a better name to suggest. I also agree that there's lots of related papers, and you can't cite/discuss them all reasonably.
And I don't mean to be overly critical... the application to LLMs is definitely cool. I wouldn't have read the paper and written up my critiques if I didn't overall like it :)
Naming things is hard. Noting that the two alternative approaches you referenced are called "vecmap" and "alignment", the complaint that they "aren't the first/only algorithm for ..." and "have no right to claim such a general title" could easily apply to them as well.
Except those papers are 8ish years old; they actually were among the first 2-3 algs for this task; and they studied the fully general vector space alignment problem. But I agree that naming things is hard and don't have a better name.
Imagine having more than a passing understanding of philosophy, and then reading much of any major computer science papers. By this "No right to claim" logic, I'd have you all on trial.
The problem solved in this paper is strictly harder than alignment. Alignment works with multiple, unmatched representations of the same inputs (e.g, different embeddings of the same words). The goal is to match them up.
The goal here is harder: given an embedding of an unknown text in one space, generate a vector in another space that's close to the embedding of the same text -- but, unlike in the word alignment problem, the texts are not known in advance.
Neither unsupervised transport, nor optimal alignment can solve this problem. Their input sets must be embeddings of the same texts. The input sets here are embeddings of different texts.
FWIW, this is all explained in the paper, including even the abstract. The comparisons with optimal assignment explicitly note that this is an idealized pseudo-baseline, and in reality OA cannot be used for embedding translation (as opposed to matching, alignment, correspondence, etc.)
It seems like you have some misconceptions about Strassen's alg:
1. It is a standard example of the divide and conquer approach to algorithm design, not the dynamic programming approach. (I'm not even sure how you'd squint at it to convert it into a dynamic programming problem.)
2. Strassen's does not require complex-valued matrices. Everything can be done in the real numbers, as the sketch below shows.
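Concretely, one level of the recursion over plain real floats (a sketch assuming square matrices of even dimension; the full algorithm applies this recursively to the seven half-size products):

```python
import numpy as np

def strassen_step(A, B):
    # Seven products of half-size blocks instead of the naive eight.
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.ones((4, 4))
assert np.allclose(strassen_step(A, B), A @ B)  # all real arithmetic
```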
I think the OP was pointing out that the reason Strassen's algorithm works is that it somehow uncovered a kind of repeated work that's not evident in a simple divide and conquer approach. It's by the clever definition of the various submatrices that this "overlapping" work can be avoided.
In other words, the power of Strassen's algorithm comes from a strategy that's similar to / reminiscent of dynamic programming.
Building off this question, it's not clear to me why Python should have both t-strings and f-strings. The difference between the two seems like a stumbling block to new programmers, and my "ideal python" would have only one of these mechanisms.
f-strings immediately become a string, and are indistinguishable at runtime from a normal string. t-strings introduce an object so that libraries can do custom logic/formatting on the template strings, such as deciding _how_ to format the string.
My main motivation as an author of 501 was to ensure user input is properly escaped when inserted into SQL, which you can't enforce with f-strings.
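To illustrate the kind of thing this enables (a minimal sketch; the `sql` helper is hypothetical, not part of the PEP or any shipping library, and it needs Python 3.14's `string.templatelib`):

```python
from string.templatelib import Template, Interpolation

def sql(template: Template) -> tuple[str, list]:
    """Walk the template's parts, emit '?' placeholders for every
    interpolation, and hand the values to the driver separately so
    user input is never spliced into the query text."""
    query_parts, params = [], []
    for part in template:  # yields str and Interpolation objects in order
        if isinstance(part, Interpolation):
            query_parts.append("?")
            params.append(part.value)
        else:
            query_parts.append(part)
    return "".join(query_parts), params

name = "Robert'); DROP TABLE students;--"
query, params = sql(t"SELECT * FROM students WHERE name = {name}")
# query  == "SELECT * FROM students WHERE name = ?"
# params == ["Robert'); DROP TABLE students;--"]
# cursor.execute(query, params)  # the driver does the escaping
```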
> ensure user input is properly escaped when inserting into sql
I used to wish for that and got it in JS with template strings and libs around it. For what it’s worth (you got a whole PEP done, you have more credibility than I do) I ended up changing my mind, I think it’s a mistake.
It’s _nice_ from a syntax perspective. But it obscures the reality of sql query/parameter segregation, it builds an abstraction on top of sql that’s leaky and doesn’t even look like an abstraction.
And more importantly, it looks _way too close_ to the wrong thing. If the difference between the safe way to do sql and the unsafe way is one character and a non-trivial understanding of string formatting in python… bad things will happen. In a one-person project it’s manageable, in a bigger one where people have different experiences and seniority it will go wrong.
It’s certainly cute. I don’t think it’s a good thing for sql queries.
I understand your concern, and I think the PEP addresses it. Quite bluntly, t"foo" is not a string, while f"foo" is. You'll get a typecheck error if you run a typechecker like any reasonable developer, and will get a runtime error if you ignore the type mismatch, because t"foo" even lacks a __str__() method.
One statement the PEP could put front and center in the abstract could be "t-strings are not strings".
For one thing, `f"something"` is of type `str`; `t"something"` is of type `string.templatelib.Template`. With t-strings, your code can know which parts of the string were dynamically substituted and which were not.
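Concretely (Python 3.14; the attribute names come from PEP 750):

```python
name = "world"

f = f"hello {name}!"
t = t"hello {name}!"

type(f)  # <class 'str'>
type(t)  # <class 'string.templatelib.Template'>

t.strings                       # ('hello ', '!')  -- the static parts
t.interpolations[0].value       # 'world'          -- what was substituted
t.interpolations[0].expression  # 'name'           -- the source expression
```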
The types aren't so important. __call__ or reference returns type string; an f and a t will be interchangeable from the consumer side.
For example, if you can go through (I'm not sure you can) and trivially replace all your f's with t's, and then do some minor fixups where the final product is used, I don't think a migration from one to the other would be terribly painful. Time-consuming, yes.
Give it a few years to when f-string usage has worn off to the point that a decision can be made to remove it without breaking a significant number of projects in the wild.
The linter is a big deal, actually. I've worked with Python off and on during the past few decades; I just recently moved onto a project that uses Python with a bunch of linters and autoformatters enabled. I was used to writing my strings ('foo %s' % bar), and the precommit linter told me to write f'foo {bar}'. Easy enough!
printf-style formatting ("foo %s" % "bar") feels the most ready to be retired (except insofar as it probably never will, because it's a nice shortcut).
The other ones at least are based on the same format string syntax.
"foo {}".format("bar") would be an obvious "just use f-string" case, except when the formatting happens far off. But in that case you could "just" use t-strings? Except in cases where you're (for example) reading a format string from a file. Remember, t- and f- strings are syntactic elements, so dynamism prevents usage of it!
So you have the following use cases:
- printf-style formatting: some C-style string formatting is needed
- .format: You can't use an f-string because of non-locality in the data to format, and you can't use a t-string due to dynamism in the format string itself
- f-string: you have the template and the data in the same spot lexically, and you just want string concatenation (very common!)
- t-string: you have the template and the data in the same spot lexically, but want to use special logic to actually build up your resulting value (which might not even be a string!)
The last two additions being syntax makes it hard to use them to cover all use cases of the first two.
But in a specific use case? It's very likely that there is an exact best answer amongst these 4.
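The four, side by side (the t-string again assumes Python 3.14):

```python
user, n = "alice", 3

s1 = "hello %s, you have %d messages" % (user, n)      # printf-style
s2 = "hello {}, you have {} messages".format(user, n)  # .format (template can live elsewhere)
s3 = f"hello {user}, you have {n} messages"            # f-string: already a str
t4 = t"hello {user}, you have {n} messages"            # t-string: a Template for a
                                                       # library to render later
```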
> printf-style formatting ("foo %s" % "bar") feels the most ready to be retired (except insofar as it probably never will, because it's a nice shortcut).
It’s also the only one which is anywhere near safe to be user-provided.
`str.format` allows the format string to navigate through indexes, entries, and attributes. If the result of the formatting is echoed back and any non-trivial object is passed in, it allows for all sorts of introspection.
printf-style... does not support any of that. It can only format the objects passed in.
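A toy illustration of the difference (the class and its fields are made up):

```python
class User:
    def __init__(self, name, password_hash):
        self.name = name
        self.password_hash = password_hash

u = User("alice", "5f4dcc3b...")

# str.format lets the *format string itself* navigate attributes/items,
# so echoing a user-controlled template leaks anything reachable:
print("Hi {0.name}".format(u))      # intended use
print("Hi {0.__dict__}".format(u))  # oops: dumps password_hash too

# printf-style has no such navigation; it can only stringify the
# arguments you explicitly pass:
print("Hi %s" % u.name)
```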
Very good point. While I think we could do away with the syntactic shorthand, definitely would want to keep some function/method around with the capabilities.
And if it's being used, and isn't considered problematic, then it should remain. I've found use for all the current ones:
(1) for text that naturally has curlies,
(2) for templating
(3) for immediate interpolation, and improved at-site readability
I see (4) being about the flexibility of (2) and readability of (3). Maybe it'll eventually grow to dominate one or both, but it's also fine if it doesn't. I don't see (1) going away at all since the curly collision still exists in (4).
Putting down my marker on the opposite. Once you're targeting a version of python that has t-strings, decent linters/libraries have an excuse to put almost all uses of f-strings in the ground.
If the usage of a feature is significantly close enough to 0 because there is a well used alternative, what need is there for backward compatibility? If anything, it can be pushed to a third party package on PyPI.
I have a minor nit to pick. I actually prefer when tutorials provide the prompts for all code snippets for two reasons:
1. Many tutorials reference many languages. (I frequently write tutorials for students that include bash, sql, and python.) Providing the prompts `$`, `sqlite>` and `>>>` makes it obvious which language a piece of code is being written in.
2. Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts forces the user to copy/paste line by line. A good example is a sequence of commands that involves `sudo dd` to format a hard drive. But for really intro-level stuff I want the student/reader to carefully think about all the commands, and forcing them to copy/paste line by line helps achieve that goal.
That said, this is an overall good introduction to writing that I will definitely be making required reading for some of my data science students. When the book is complete, I'll be happily buying a copy :)
> Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts forces the user to copy/paste line by line.
I hardcore oppose this kind of thing, for the same reason I oppose people putting obstacles in the way of curl-to-bash.
Adding the prompt character doesn’t make people think, it just makes people press backspace. Frequently I’m reading a tutorial because I’m trying to assemble headless scripts for setting up a VM and I really just need verbatim lines I can copy/paste so I know I’ve got the right arguments.
>Many tutorials reference many languages. (I frequently write tutorials for students that include bash, sql, and python.) Providing the prompts `$`, `sqlite>` and `>>>` makes it obvious which language a piece of code is being written in.
I think it's fine to show the prompt character, but I think it's the author's job to make sure that copy/paste still works. I've seen a lot of examples that use CSS to show the prompt or line number without it becoming part of copied text, and I'm highly in favor of that.
I think if I had to choose between breaking copy/paste and making the language obvious with the prompt character, I'd exclude the prompt, but I think that's a matter of taste.
>Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts forces the user to copy/paste line by line. A good example is a sequence of commands that involves `sudo dd` to format a hard drive. But for really intro-level stuff I want the student/reader to carefully think about all the commands, and forcing them to copy/paste line by line helps achieve that goal.
Yeah, I agree about preventing the reader from copy/pasting something dangerous.
In tutorials that require a reboot, I'll never include a reboot command bunched in with other commands because I don't want the user to do it by mistake. And I agree for something like `dd`, you'd want to present it in a way to make it hard for the reader to make mistakes or run it thoughtlessly.
I'm not sure about that. There are markdown rendering engines where you can specify the language of a codeblock and it will render with specific CSS based on the language. So you can do something like ```bash ... ``` and it will show the code with each line prefixed by "$"
AFAIK specifying a language only makes a difference for syntax highlighting. I have never seen a markdown processor that would add prompts to the code based on the specified language.
Syntax highlighting is just CSS. There's nothing stopping you from adding your own custom CSS to the code block which will prefix lines with the prompts.
> I can talk about concepts like "atoms" or "bacteria" or "black holes" with anyone, and they'll know what they are - even if their knowledge of those subjects isn't in depth.
I'm not convinced this is an unalloyed good. Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria are. See, for example, all of the people who want antibiotics whenever they're sick for any reason. We've just replaced one set of weird beliefs in the general populace with another and given it a veneer of science.
> Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria are.
This is a poor example. Even an incomplete image of the germ theory of disease is a massive improvement over thinking illness is caused by demons. An extremely superficial understanding of bacteria as "microscopic organisms which can make you sick" gives good justification why people should do things like wash their hands, cover their mouth when coughing, and not lick the railing on a subway.
Knowing the difference between bacteria being living organisms and viruses being not-quite-alive does not qualify as a "deep understanding" though.
Further, the presence of people misunderstanding something that most of the population knows pretty well in no way makes teaching that subject to the population bad. Your assertion would require that believing demons cause sickness actually has benefits we've lost.
But more people know what bacteria are at a baseline level and what they do with diseases than before when all we had were demons/bad humors/etc.
There are functionally illiterate people too in modern day and the average reading level is still elementary school level, but that's vastly better than before when the average person couldn't read at all.
Suicide does not have stable reporting rates. It was very stigmatized in the past, and so investigators would notoriously report suicides as "unknown cause of death" if they could.
Violent crime, on the other hand, is much more correlated with things like poverty than with mental health.
I think it's quite obviously the case that there are no clear indicators of what "mental health" looked like 100 years ago. Any projections into the past will involve a lot of extrapolation and have all sorts of biases.
They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:
> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).
It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task, even if there are "better" ways of solving the Harry Potter problem.
> Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task
Not really. The "Harry Potter Problem" as formulated is asking an LLM to solve a problem that they are architecturally unsuited for. They do poorly at counting and similar algorithms tasks no matter the size of the context provided. The correct approach to allowing an AI agent to solve a problem like this one would be (as OP indicates) to have it recognize that this is an algorithmic challenge that it needs to write code to solve, then have it write the code and execute it.
Asking specific questions about your insurance policy is a qualitatively different type of problem that algorithms are bad at, but it's the kind of problem that LLMs are already very good at in smaller context windows. Making progress on that type of problem requires only extending a model's capabilities to use the context, not simultaneously building out a framework for solving algorithmic problems.
So if anything it's the reverse: solving the insurance problem would be a prerequisite to solving the Harry Potter Problem.
LLMs can't count well. This is in large part a tokenization issue. That doesn't mean they couldn't answer all those kinds of questions. Maybe the current state of the art can't. But you won't find out by asking it to count.
The WHO list of essential medicines is not just over-the-counter drugs. It includes things like the chemotherapy drug cisplatin. I happened to need that for testicular cancer ~10 years ago, and the treatment cost was $50k (as "paid" by insurance). That overall seems pretty reasonable to me for the treatment I received, but definitely not something I'd expect the median American to be able to pay out of pocket.
The median American would not have to pay out of pocket, as nearly every American has health insurance (since the ACA, it is actually illegal not to have insurance).
I think it's accurate to say that the median American is insured, with only 8% of the population uninsured [1]. Although, to put that percentage in perspective, that's 26 million people and likely thousands in excess mortality relative to the insured population.
I believe you're referring to the ACA's "individual mandate", which imposed a federal tax penalty for being uninsured. I won't argue whether that makes it illegal or not, but I can say that the individual mandate was eliminated by the Tax Cuts and Jobs Act in 2019 [1]. There's no longer a federal tax penalty for being uninsured.
This is purely anecdotal, but of that 8% (26 million), I would posit that most of those people are uninsured by choice. e.g., probably mostly young, maybe part-time workers without chronic illnesses.
Your wording in this comment (and the twitter/comment video) gives off the same vibes as the google april 1st videos for things like gmail motion (https://www.youtube.com/playlist?list=PLAD8wFTLnQKeDsINWn8Wj...). I honestly thought this was full sarcasm at first.