liuliu's comments

And 300ms for a DB call is slow, in any case. We really shouldn't accept that as a normal cost of doing business. 300ms is only acceptable if we are doing scrypt-type work.

> in any case.

In some cases. Are you looking up a single indexed row in a small K-V table? Yep, that's slow. Are you generating reports on the last 6 years of sales, grouped by division within larger companies? That might be pretty fast.

I'm not sure why you'd even generalize that so broadly.


To put that in perspective, 300ms is roughly enough time to loop over 30GiB of data in RAM, load 800MiB of data from an SSD, or execute about 1T floating-point operations on a single core.

In 300ms, a report query would be able to go through at least ~100M rows (on a single core).
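To make those figures concrete, here is a quick back-of-envelope sketch in C (mine, not anything measured; it just computes the sustained rates the numbers above imply):

    #include <stdio.h>

    int main(void) {
        const double window_s = 0.300;  /* the 300ms budget under discussion */

        /* Figures from the comment above; rough, order-of-magnitude inputs. */
        const double ram_bytes  = 30.0 * 1024 * 1024 * 1024;  /* 30 GiB scanned from RAM */
        const double ssd_bytes  = 800.0 * 1024 * 1024;        /* 800 MiB loaded from SSD */
        const double flop_count = 1e12;                       /* ~1T floating-point ops  */

        printf("implied RAM scan rate : %.1f GiB/s\n",
               ram_bytes / window_s / (1024.0 * 1024 * 1024));
        printf("implied SSD read rate : %.1f GiB/s\n",
               ssd_bytes / window_s / (1024.0 * 1024 * 1024));
        printf("implied compute rate  : %.1f TFLOP/s\n", flop_count / window_s / 1e12);
        return 0;
    }

That works out to roughly 100 GiB/s of memory bandwidth, ~2.6 GiB/s of SSD throughput, and ~3.3 TFLOP/s of sustained compute.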

And the implicit assumption in the comment I made earlier was, of course, not about a 100M-row scan. If there was any confusion, I am sorry.


That's all true, so long as you completely ignore doing any processing on the data: evaluating the rows and selectively appending some of them into a data structure, then sorting and serializing the results; let alone optimizing the query plan for the state of the system at that moment, deciding whether it makes more sense to hit the indexes or just slurp in the whole table given that N other queries are also executing right now, mapping a series of IO requests to their exact addresses on the underlying disks, and performing the parity checks as you read the data off the RAID and combine it into a single, coherent stream of not-block-aligned tuples.

There's a metric boatload of abstractions between sending a UTF-8 query string over a packet-switched network and receiving back a list of results. 300ms suddenly starts looking like a smaller window than it first appears.


No, the iPad Pro won't be faster than a 4090 or 4070 (it won't even reach 5% of a 4090's speed).

But newer chips might include Neural Accelerators to close the gap a little bit (to maybe 10%?).

(I maintain https://apps.apple.com/us/app/draw-things-ai-generation/id64...)


What improvements did the A19 Pro provide for Draw Things?


That's amazing! Curious how this will translate to the M5 Pro/Max Macs...

They have been an acquisition target since 2017 (per the OpenAI internal emails). So the lack of an acquisition is not due to a lack of interest. It makes you wonder what happened during due diligence.

Video generation is extremely exciting, e.g. https://video-zero-shot.github.io/

However, personalization (teleporting yourself into a video scene) is boring to me. At its core, it doesn't generate a new experience for me. My experience is not defined by the photos / videos I took on a trip.


Very interesting read. I first learned this method from a random reddit post a while ago, and I'm very happy to see a systematic study on it (I wish I had saved the original post somewhere to reference!).


- 454 instances of "Rc<" in Servo: https://github.com/search?q=repo%3Aservo%2Fservo+Rc%3C&type=...

- 6 instances of "Rc<" in AWS SDK for Rust: https://github.com/search?q=repo%3Arusoto%2Frusoto+Rc%3C&typ...

- 0 instances of "Rc<" in LOC: https://github.com/search?q=repo%3Acgag%2Floc+Rc%3C&type=cod...

(Disclaimer: I don't know what these repos are except Servo).


My understanding is that you cannot talk about warp specialization without talking about the alternative: multi-stage pipelining. And the final example code given is a multi-stage pipeline with double buffers.

And here is my understanding of where they differ:

1. A multi-stage pipeline requires careful hand-tuning, even at the PTX level, to make sure your async waits are interleaved properly to maximize overlap.

2. Since register files are now huge, a multi-stage pipeline is difficult to write at the intrinsics level while still making efficient use of them.

3. Warp specialization delegates most of this scheduling to be done dynamically, hence it is better adapted to the hardware (and has more information to make scheduling decisions at runtime). Although this is a bit moot because we write different code for different hardware anyway.

Anything else I am missing?
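For readers less familiar with the GPU terminology, here is a very loose CPU-side analogy, entirely my own sketch in plain C with pthreads (none of the real GPU machinery, cp.async/TMA, barriers, or register budgeting, maps 1:1 to this). The "specialized" arrangement splits the work into a load role and a compute role that hand double-buffered slots back and forth, instead of one loop that hand-interleaves prefetch-next / compute-current:

    /* CPU analogy only: producer/consumer threads standing in for "specialized warps".
     * On a real GPU the producer role would be async copies into shared memory and the
     * consumer role the math warps; this just illustrates the division of labor. */
    #include <pthread.h>
    #include <stdio.h>

    #define CHUNK   1024
    #define CHUNKS  64
    #define NBUF    2              /* double buffer */

    static float buf[NBUF][CHUNK];
    static int   ready[NBUF];      /* 1 = filled by producer, waiting for consumer */
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {          /* "load warp": only fills buffers */
        (void)arg;
        for (int c = 0; c < CHUNKS; c++) {
            int b = c % NBUF;
            pthread_mutex_lock(&mtx);
            while (ready[b]) pthread_cond_wait(&cv, &mtx);   /* wait until slot is free */
            for (int i = 0; i < CHUNK; i++) buf[b][i] = (float)(c + i);  /* fake "global load" */
            ready[b] = 1;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&mtx);
        }
        return NULL;
    }

    int main(void) {                             /* main thread = "compute warp" */
        pthread_t p;
        double sum = 0.0;
        pthread_create(&p, NULL, producer, NULL);
        for (int c = 0; c < CHUNKS; c++) {
            int b = c % NBUF;
            pthread_mutex_lock(&mtx);
            while (!ready[b]) pthread_cond_wait(&cv, &mtx);  /* wait for data */
            pthread_mutex_unlock(&mtx);
            for (int i = 0; i < CHUNK; i++) sum += buf[b][i];  /* "math" on the ready buffer */
            pthread_mutex_lock(&mtx);
            ready[b] = 0;                                     /* hand the slot back */
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&mtx);
        }
        pthread_join(p, NULL);
        printf("sum = %f\n", sum);
        return 0;
    }

The non-specialized multi-stage version would instead be a single loop that kicks off the copy for chunk c+1, does the math for chunk c, then waits on the copy, which is exactly the part that needs the careful hand-interleaving mentioned in point 1.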


Author here! I think that warp specialization is inherently related to multi-stage pipelining; they aren't really alternatives to each other. Warp specialization is a way to realize a multi-stage pipeline in the face of hazards that may cause the pipeline to spill out of the register file or not let parts of the pipeline run concurrently as desired.

The fact that we tend to need different warp specialization strategies for different hardware is a consequence of the capabilities of that hardware (i.e. different asynchronous instruction types), and contributes to the complexity of targeting that new hardware.


Or GluonCV by the MXNet folks (ancient! https://github.com/dmlc/gluon-cv)


The reason people go to such lengths to package PyTorch is that the skill of manually translating models between different frameworks is "easy" but not widely distributed in the developer community.

That's why people will go to stupid lengths to convert models from PyTorch / TensorFlow with onnxtools / coremltools, to avoid touching the models / weights themselves.

The only one that has escaped this is llama.cpp: weirdly, despite the difficulty of model conversion with ggml, people seem to do it anyway.


One thing I would call out, if you use SQLite as an application format:

The BLOB type is limited to 2GiB in size (int32). Depending on your use case, that might seem like plenty, or not.

People would argue that storing that much binary data in a SQLite database is not really appropriate. But an application format usually comes with exactly this requirement: bundle large binary data into one nice file, rather than many files that you have to copy together to make it work.


You can split your data up across multiple blobs
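A minimal sketch of what that could look like with the stock sqlite3 C API, using a hypothetical chunks table and a fixed 64 MiB piece size (both my own choices, not anything standard):

    /* Sketch only: splits an input file into fixed-size rows of a hypothetical
     * "chunks" table, so no single BLOB ever approaches the 2GiB limit. */
    #include <sqlite3.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_SIZE (64 * 1024 * 1024)   /* 64 MiB per row; pick what suits you */

    int store_chunked(sqlite3 *db, const char *name, FILE *in) {
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS chunks ("
            "  name TEXT NOT NULL, seq INTEGER NOT NULL, data BLOB NOT NULL,"
            "  PRIMARY KEY (name, seq))",
            NULL, NULL, NULL);

        sqlite3_stmt *stmt = NULL;
        if (sqlite3_prepare_v2(db, "INSERT INTO chunks (name, seq, data) VALUES (?1, ?2, ?3)",
                               -1, &stmt, NULL) != SQLITE_OK)
            return -1;

        char *buf = malloc(CHUNK_SIZE);
        if (!buf) { sqlite3_finalize(stmt); return -1; }
        size_t n;
        int seq = 0, rc = 0;
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);   /* one transaction for the whole file */
        while ((n = fread(buf, 1, CHUNK_SIZE, in)) > 0) {
            sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
            sqlite3_bind_int(stmt, 2, seq++);
            sqlite3_bind_blob(stmt, 3, buf, (int)n, SQLITE_STATIC);
            if (sqlite3_step(stmt) != SQLITE_DONE) { rc = -1; break; }
            sqlite3_reset(stmt);
            sqlite3_clear_bindings(stmt);
        }
        sqlite3_exec(db, rc == 0 ? "COMMIT" : "ROLLBACK", NULL, NULL, NULL);
        sqlite3_finalize(stmt);
        free(buf);
        return rc;
    }

Reassembly is then just SELECT data FROM chunks WHERE name = ?1 ORDER BY seq, consumed piece by piece.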


Also, you almost certainly want to do this anyway so you can stream the blobs into/out of the network/filesystem, well before you have GBs in a single blob.


Individual SQLite blobs are streamable too! But for streaming in, you need to know the size in advance.


That's right, but it is much easier to just use a single blob, without application logic having to worry about chunking. It is the same reason we use SQLite in the first place: a lot of transaction / rollback logic now lives in the SQLite layer, not the application layer.

Also, SQLite does provide good support for reading / writing blobs in a streaming fashion, see: https://www.sqlite.org/c3ref/blob_read.html
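For example, a read side could look like this (my sketch; "docs"/"body" are hypothetical table/column names), pulling the value out in 64 KiB pieces instead of materializing the whole thing:

    /* Sketch: stream a BLOB column out in fixed-size pieces via the incremental
     * blob I/O API, instead of pulling the whole value with sqlite3_column_blob(). */
    #include <sqlite3.h>
    #include <stdio.h>

    int stream_blob(sqlite3 *db, sqlite3_int64 rowid, FILE *out) {
        sqlite3_blob *blob = NULL;
        /* "main" database, hypothetical table "docs", column "body", read-only (flags = 0). */
        int rc = sqlite3_blob_open(db, "main", "docs", "body", rowid, 0, &blob);
        if (rc != SQLITE_OK) return rc;

        char buf[64 * 1024];
        int total = sqlite3_blob_bytes(blob);    /* size is fixed once the blob is written */
        for (int off = 0; off < total; off += (int)sizeof(buf)) {
            int n = total - off < (int)sizeof(buf) ? total - off : (int)sizeof(buf);
            rc = sqlite3_blob_read(blob, buf, n, off);
            if (rc != SQLITE_OK) break;
            fwrite(buf, 1, (size_t)n, out);
        }
        sqlite3_blob_close(blob);
        return rc;
    }

For streaming writes you reserve the space first with sqlite3_bind_zeroblob() and then fill it with sqlite3_blob_write(), which is where the "need to know the size in advance" caveat above comes from.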

So the limitation is really a structural issue that Dr. Hipp might resolve at some point (or not), but it pretty much has to be resolved by the SQLite core team, not outside contributors (of course you could resolve it by forking, but...).


This is essential if you want to have encryption/compression + range access at the same time.

I've been using chunk sizes of 128 megabytes for my media archive. This seems to be a reasonable tradeoff between range-retrieval delay and per-object overhead (e.g. S3 PUT/GET cost).
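For a rough sense of scale (my numbers, not the parent's): a 10 GiB object in 128 MiB chunks is 80 objects, so a range read touches only one or two chunks and over-fetches at most ~128 MiB, while an upload costs 80 PUTs; dropping to 1 MiB chunks shrinks the over-fetch but turns the same object into 10,240 requests.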

