And 300ms for a DB call is slow in any case. We really shouldn't accept that as the normal cost of doing business. 300ms is only acceptable if we are doing scrypt-type work.
In some cases. Are you looking up a single indexed row in a small K-V table? Yep, 300ms is slow. Are you generating reports on the last 6 years of sales, grouped by division within larger companies? Then 300ms might be pretty fast.
I'm not sure why you'd even generalize that broadly.
That's all true, so long as you completely ignore doing any processing on the data: evaluating the rows and selectively appending some of them into a data structure, then sorting and serializing the results. Let alone optimizing the query plan for the state of the system at that moment and deciding whether it makes more sense to hit the indexes or just slurp in the whole table given that N other queries are also executing right now, or mapping a series of IO requests to their exact addresses on the underlying disks, performing the parity checks as you read the data off the RAID, and combining it into a single, coherent stream of not-block-aligned tuples.
There's a metric boatload of abstractions between sending a UTF-8 query string over a packet-switched network and receiving back a list of results. 300ms suddenly starts looking like a smaller window than it first appears.
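For a sense of scale, here's a rough local sketch of the two workload shapes from upthread, using in-memory SQLite so the network and most of that machinery is out of the picture (table names and sizes are made up; the numbers are only indicative):

```python
# Illustrative only: an indexed point lookup vs. a report-style grouped
# aggregate, timed against an in-memory SQLite database.
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, division TEXT, amount REAL)")
db.executemany("INSERT INTO kv VALUES (?, ?)", ((i, f"v{i}") for i in range(100_000)))
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               ((i, f"div{i % 50}", i * 0.01) for i in range(1_000_000)))

t0 = time.perf_counter()
db.execute("SELECT v FROM kv WHERE k = ?", (42_000,)).fetchone()                    # indexed lookup
t1 = time.perf_counter()
db.execute("SELECT division, SUM(amount) FROM sales GROUP BY division").fetchall()  # full scan + group
t2 = time.perf_counter()

print(f"point lookup: {(t1 - t0) * 1e3:.3f} ms, grouped aggregate: {(t2 - t1) * 1e3:.1f} ms")
```

The point lookup is typically well under a millisecond; the grouped scan is where the time actually goes, which is roughly the distinction being argued over.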
They had been an acquisition target since 2017 (per the OpenAI internal emails). So the lack of an acquisition is not due to a lack of interest. It makes you wonder what happened during due diligence.
However, personalization (teleporting yourself into a video scene) is boring to me. At its core, it doesn't generate a new experience for me. My experience is not defined by the photos / videos I took on a trip.
Very interesting read. I first learned this method from a random reddit post a while ago and I'm very happy to see a systematic study of it (wish I had saved the original post somewhere to reference!).
My understanding is that you cannot talk about warp specialization without talking about the alternative: multi-stage pipelining. And the final example code given is a multi-stage pipeline with double buffers.
And here is my understanding of where they differ:
1. A multi-stage pipeline requires careful hand-tuning, sometimes down at the PTX level, to make sure your async waits are woven in properly to maximize overlap.
2. Since register files are now huge, a multi-stage pipeline is difficult to write at the intrinsics level in a way that makes efficient use of them.
3. Warp specialization delegates most of this scheduling to be done dynamically, hence it is better adapted to the hardware (and has more information to make scheduling decisions at runtime). Although this is a bit moot because we write different code for different hardware anyway.
Author here! I think that warp specialization is inherently related to multi-stage pipelining; they aren't really alternatives to each other. Warp specialization is a way to realize a multi-stage pipeline in the face of hazards that may cause the pipeline to spill out of the register file or keep parts of the pipeline from running concurrently as desired.
The fact that we tend to need different warp specialization strategies for different hardware is a consequence of the capabilities of that hardware (i.e. different asynchronous instruction types), and contributes to the complexity of targeting that new hardware.
The reason people go to such lengths to package PyTorch is that the skill of manually translating models between different frameworks is "easy" but not widely spread in the developer community.
That's why people will go to stupid lengths to convert models from PyTorch / TensorFlow with onnxtools / coremltools to avoid touching the model / weights themselves.
The only one that escaped this is llama.cpp: weirdly, despite the difficulty of model conversion with ggml, people seem to do it anyway.
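For context, the conversion route usually starts with something like a plain ONNX export (the model and file names below are made up; the Core ML route goes through coremltools instead):

```python
# Export a PyTorch model to ONNX so another runtime can load it without
# anyone re-implementing the architecture by hand. The model here is a
# stand-in; any nn.Module works the same way.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
dummy = torch.randn(1, 16)  # example input that fixes the traced graph's shapes

# torch.onnx.export traces the model with the example input and writes the graph to disk.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["x"], output_names=["y"])
```

The pain starts when the model uses ops the exporter or the target runtime doesn't cover, which is exactly when people end up touching the weights themselves.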
One thing I would call out, if you use SQLite as an application format:
The BLOB type is limited to 2 GiB in size (a signed 32-bit length). Depending on your use case, that might sound like plenty, or not.
People would argue that storing that much binary data in a SQLite database is not really appropriate. But an application format usually has the requirement of bundling large binary data into one nice file, rather than many files that you need to copy together to make it work.
Also, you almost certainly want to chunk the blobs anyway so you can stream them into/out of the network/filesystem, well before you have GBs in a single blob.
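A minimal sketch of what that chunking looks like (the schema, names, and 4 MiB chunk size are my own choices, not anything SQLite prescribes):

```python
# Store a large payload as fixed-size chunk rows so no single BLOB approaches
# the 2 GiB limit and reads/writes can be streamed piece by piece.
import sqlite3

CHUNK = 4 * 1024 * 1024  # 4 MiB per row; illustrative

db = sqlite3.connect("bundle.db")
db.execute("""CREATE TABLE IF NOT EXISTS blob_chunks (
                  name TEXT,
                  seq  INTEGER,
                  data BLOB,
                  PRIMARY KEY (name, seq))""")

def write_stream(name, fileobj):
    # One transaction, so a partially written object never becomes visible.
    with db:
        seq = 0
        while True:
            piece = fileobj.read(CHUNK)
            if not piece:
                break
            db.execute("INSERT INTO blob_chunks VALUES (?, ?, ?)", (name, seq, piece))
            seq += 1

def read_stream(name):
    # Yield chunks in order; the caller never holds the whole object in memory.
    cur = db.execute("SELECT data FROM blob_chunks WHERE name = ? ORDER BY seq", (name,))
    for (piece,) in cur:
        yield piece
```

A range read then only needs the rows whose seq values overlap the requested offsets, rather than one giant blob.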
That's right, but it is much easier to just use a blob without application logic to worry about chunking. It is the same reason we use SQLite in the first place: a lot of the transaction / rollback logic now lives in the SQLite layer, not the application layer.
So the limitation is really a structural issue that Dr. Hipp might resolve at some point (or not), but it pretty much has to be resolved by the SQLite core team, not outside contributors (of course you could resolve it by forking, but...).
This is essential if you want to have encryption/compression + range access at the same time.
I've been using chunk sizes of 128 megabytes for my media archive. This seems to be a reasonable tradeoff between range retrieval delay and per-object overhead (e.g. S3 PUT/GET cost).
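The arithmetic behind that tradeoff is simple: with fixed-size chunks, a byte range maps to a handful of objects, each of which can be compressed/encrypted on its own. A sketch of the offset math (128 MiB here to mirror the size above; the actual fetch would be whatever ranged GET your store exposes):

```python
# Map a logical byte range onto fixed-size chunk objects. Only the chunks
# that overlap the range need to be downloaded and decrypted/decompressed.
CHUNK = 128 * 1024 * 1024  # 128 MiB, mirroring the chunk size above

def chunks_for_range(offset, length):
    """Yield (chunk_index, start_in_chunk, end_in_chunk_exclusive)."""
    end = offset + length
    first, last = offset // CHUNK, (end - 1) // CHUNK
    for idx in range(first, last + 1):
        lo = max(offset, idx * CHUNK) - idx * CHUNK
        hi = min(end, (idx + 1) * CHUNK) - idx * CHUNK
        yield idx, lo, hi

# e.g. a 1 MiB read starting 200 MiB into the archive touches only chunk 1:
print(list(chunks_for_range(200 * 1024 * 1024, 1024 * 1024)))
```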