Several companies have implemented databases on GPUs but there is a good technical reason that the approach has never really caught on, and some of these companies even migrated to selling the same platform on CPUs only.
The weakness of GPU databases is that while they have fantastic internal bandwidth, their network to the rest of the hardware in a server system is over PCIe, which generally isn't going to be as good as what a CPU has and databases tend to be bandwidth bound. This is a real bottleneck and trying to work around it makes the entire software stack clunky.
I once asked the designer of a good GPU database what the "no bullshit" performance numbers were relative to CPU. He told me GPU was about 4x the throughput of CPU, which is very good, but after you added in the extra hardware costs, power costs, engineering complexity etc, the overall economics were about the same to a first approximation. And so he advised me to not waste my time considering GPU architectures for databases outside of some exotic, narrow use cases where the economics still made sense. Which was completely sensible.
For data intensive processing, like databases, you don't want your engine to live on a coprocessor. No matter how attractive the coprocessor, the connectivity to the rest of the system extracts a big enough price that it is rarely worth it.
You pretty much hit the nail on the head; a big limitation is feeding the GPUs in the first place.
Furthermore, all that memory bandwidth is calculated against all the cores. So you have to be VERY careful with usage patterns (it doesn't work like a giant CPU). Not to mention the cost involved per GB or TB!
I have some experience in this area, and where GPUs & databases really shine is building (and especially re-building) indexes.
Running a database on GPUs isn't going to replace all DBs overnight. But the hardware does have its uses, just like improved SIMD on CPUs and NVMe storage, plus developments in networking. Basically, look where Intel is going, since we're coming to the limits of silicon transistors.
I can definitely see value in offloading a bunch of number crunching on sufficiently large data sets to a graphics card (particularly parts that are trivially parallelizable). But to me, that seems like an optimization a database would make in its query planner, not one that you'd necessarily want to build an entire standalone product around.
> The weakness of GPU databases is that while they have fantastic internal bandwidth, their network to the rest of the hardware in a server system is over PCIe, which generally isn't going to be as good as what a CPU has and databases tend to be bandwidth bound. This is a real bottleneck and trying to work around it makes the entire software stack clunky.
How relevant is that when you're looking at multi-TB data sets that don't fit into computer RAM? Sure, the RAM <---> CPU bandwidth may be very wide, but the SSD connects to the computer over the same PCIe bus.
And also: when did you have this conversation? GPU performance has changed very much year by year, so what wouldn't have been worth it 2 years ago might be a huge gain now.
The difficulty with how most GPUs are connected to the rest of the system is that the data has to go RAM -> CPU -> GPU. If it could go directly RAM -> GPU, the trade-off would look better, but still not great, as PCIe is still lower bandwidth and higher latency than RAM -> CPU.
It's not about GPU performance, it's about the latency and bandwidth of getting that data to the GPU. If, once you ship data to the GPU, you reuse it many times for many calculations, that cost is amortized and doesn't matter as much. But if you ship data to the GPU and use it once, that cost will probably not be amortized. I think databases tend to fall in the latter category.
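A rough back-of-the-envelope sketch of that amortization (the numbers are illustrative: ~16 GB/s for a PCIe 3.0 x16 link, ~300 GB/s of on-card memory bandwidth):

    #include <cstdio>

    int main() {
        const double working_set_gb = 1.0;    // data shipped to the GPU, in GB
        const double pcie_gb_s      = 16.0;   // roughly a PCIe 3.0 x16 link
        const double gpu_mem_gb_s   = 300.0;  // roughly the on-card memory bandwidth
        double transfer_ms = working_set_gb / pcie_gb_s * 1000.0;     // ~62 ms to ship it over
        double scan_ms     = working_set_gb / gpu_mem_gb_s * 1000.0;  // ~3 ms per pass once resident
        printf("need ~%.0f on-GPU passes to amortize the transfer\n",
               transfer_ms / scan_ms);
        return 0;
    }

A query that touches the data once never gets close to that; a workload that re-scans a cached working set does.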
It would seem that rationalization would fall apart quickly with the new Power CPUs that have NVLink built right into the CPU. Getting data back and forth shouldn't be a problem anymore.
Outside of CPU >> GPU, I'm not sure what other data movement you could be talking about. A SAS HBA or Ethernet NIC or Infiniband HBA are almost always going to be operating over the same PCIe bus the GPU uses. In the rare instances they're built onto the CPU, the "network link" is likely going to still be significantly slower than the fastest PCIe slot.
> It would seem that rationalization would fall apart quickly with the new Power CPUs that have NVLink
NVLink has 80GB/s [1]. DDR4 quad channel (Xeon servers) has ~120GB/s [2]. So no, this rationalization doesn't fall apart. Furthermore, in the event NVLink gets faster than RAM, you'll still be bottlenecked by RAM access, as you'll buffer there.
This of course ignores weird systems where you attempt to maintain ACID coherence of tables between GPU, CPU, and disk memory. But then GPU memory size becomes inherently limiting, as even the biggest cards max out at ~32GB.
It's a faster link between future IBM POWER processors and NVIDIA GPUs, but it won't make waves in the database market, since those systems are niche HPC/supercomputing hardware.
I've also seen NVLink on some of the pre-Pascal roadmaps for NVIDIA's gaming-oriented graphics cards. Since the current generation of gaming consoles has HSA, I'm hoping that it gains in popularity and becomes less niche. The problem is that it'd be a pretty vital component to leave NVIDIA-proprietary.
CAPI is an "open" solution to do the same thing, though NVLink supposedly still offers more bandwidth. Unfortunately CAPI is only available on POWER8 hardware at the moment, and I'm not sure if IBM is open to licensing it beyond OpenPOWER - still, there's a decent number of CAPI-capable cards already available and more coming to the market in the future.
GPUs may not be great for every unit of work an RDBMS has to perform, but given their ability to rapidly compute hashes, they could help a lot with joins (as evidenced by PGStrom).
http://www.nvidia.com/object/nvlink.html - NVIDIA® NVLink™ is a high-bandwidth, energy-efficient interconnect that enables ultra-fast communication between the CPU and GPU, and between GPUs.
And it's a big item on the PowerPC roadmap: http://www.nextplatform.com/2016/04/07/ibm-unfolds-power-chi...
"With NVLink, multiple GPUs can be linked by 20 GB/sec links (bi-directional at that speed) to each other or to the Power8 processor so they can share data more rapidly than is possible over PCI-Express 3.0 peripheral links. (Those PCI-Express links top out at 16 GB/sec and, unlike NVLink, they cannot be aggregated to boost the bandwidth between two devices.)"
> you don't want your engine to live on a coprocessor
Well yes and no. Both IBM and Oracle have gotten impressive performance using database-specific co-processors, but these are not GPGPUs, they are dedicated hardware that sits on the storage path. Baidu are reinventing that wheel with FPGAs too.
I am the CTO of blazingDB. The data is not stored on the GPU or even in RAM. We operate from disk, though we will cache information in RAM when we have plenty of it available.
For some reason I am surprised that anyone would want to ship the data whole and as is, to the GPU. Wouldn't it make more sense to use a representative, transformed "GPU-ready" data set, both much smaller in size & designed specifically for the queries that are to be optimized?
We are not shipping all of the data as a whole to the GPU. We are going to be releasing some whitepapers that explain this in more detail, but let's get a few things clear. Data is sent to the GPU compressed, since it is compressed when it is stored. We can decompress VERY quickly on GPUs (30-50GB/s is easily achievable on a K80), and each of our columns is compressed using whichever of our cascading compression algorithms offers the best combination of compression and throughput. We are a column store and only send over the columns that are being used in processing. So for example:
select id, name, age, avg(income) from people group by gender
In this case only the income and gender columns would actually be sent to the GPU, and they would be sent compressed to increase the "effective" bandwidth of data over PCIe. Even more interesting is that id, name, and age would be pulled from our horizontal store instead of our compressed columnar store, in order to minimize the number of iops necessary to fill the result set.
The transformation to a GPU-ready form would not be as trivial (and effectively redundant) as pruning the data, of course. It would produce a secondary data structure, like an index on a column, though in this case destined to be processed within the math-oriented, high-branch-cost setting of a GPU.
This is basically the bread and butter of columnar compute systems, not just GPU ones. GPU ones just get to throw more compute at them, and thus do even better on these kinds of space/time trade-offs. Interestingly, most big data systems are increasingly columnar.
The data bandwidth between main memory and that which the GPU works with isn't speeding up by the same factors though. This is fine when working with a dataset that fits into the GPU's memory pool and your workload involves relatively few (or zero) changes because you can transfer it once and repeatedly ask the GPUs to analyse it in what-ever ways. As soon as the common dataset doesn't fit neatly into the GPU's RAM (leaving enough spare for scratch space) you end up thrashing the channel and it becomes the main bottleneck.
That's true. There are some workarounds though, like distributing the workload between multiple GPUs - an approach actively under development in machine learning, for instance.
This has been tried many times. Unless you are doing a computation with sufficient arithmetic intensity [0] the cost of shipping the data over PCIe and back dominates any gain you might get over a CPU.
Seems kind of like a blanket statement to make if you have not investigated this for yourself. So let me give you some example use cases.
Decompression for processing:
We can round-trip decompress 8-byte integers (RLE-Delta-RLE) about 4x faster on an AWS g2.2xlarge when we use the GPU. This includes sending the data TO the GPU and bringing it back. Our decompression segments on CPU were set up so that every thread was processing a segment to be decompressed, so we were using every available thread at about 100%.
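To give a sense of why that maps well to the GPU, here is a minimal sketch of just the Delta stage of a cascade like RLE-Delta-RLE (the RLE stages are omitted and the function is illustrative, not a production kernel): delta decoding is a prefix sum, which Thrust exposes directly.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdint>

    // Delta decoding as a prefix sum on the GPU: element 0 holds the base value,
    // the rest hold differences; an in-place inclusive scan restores the originals.
    void delta_decode(thrust::device_vector<int64_t>& v) {
        thrust::inclusive_scan(v.begin(), v.end(), v.begin());
    }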
Sorting data:
Here the difference can be startling. On an AWS g2.2xlarge we are able to sort orders of magnitude faster than you can on CPU. Check out Thrust to run some examples:
http://docs.nvidia.com/cuda/thrust/#axzz4K7CRY352
A few modifications there let you run this with both an NVIDIA backend and one that runs on CPU threads. It will run orders of magnitude faster on the GPU than on the CPU, even on a small GPU instance on Amazon. Even a laptop GPU would still outperform the CPU's sorting capacity by at least an order of magnitude.
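Something along these lines (a minimal sketch; the size and the timing harness are left out):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdlib>

    int main() {
        const int n = 1 << 24;                 // ~16M random 32-bit keys
        thrust::host_vector<int> h(n);
        for (int i = 0; i < n; ++i) h[i] = rand();

        thrust::device_vector<int> d = h;      // one copy over PCIe

        thrust::sort(d.begin(), d.end());      // CUDA backend, runs on the GPU
        thrust::sort(h.begin(), h.end());      // host backend, runs on the CPU
                                               // (serial by default; Thrust's OpenMP
                                               // backend is a compile-time switch)
        return 0;
    }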
Database workloads tend to be very branch heavy, e.g. string comparisons. If you've ever programmed a GPU you know that taking a data-dependent branch serializes the execution of the GPU stream processing units. So already the 70-100x benefit GPUs give on vectorized floating point workloads is greatly reduced. Add in the memory bandwidth penalty and it's a complete waste for IO intensive database workloads.
Save the GPUs for training neural nets and physics simulations. When the fundamental hardware capabilities of GPUs make them worth investing in for database workloads, I'll change my opinion. Until then, any budget for expensive DB hardware is much better spent on NVMe storage.
You can write code to compare strings without branches. Imagine comparing one string to every string in a database column. Pad to a fixed size first (if you find an ambiguous match later you can do a full string compare then). Subtract the strings, aka compare, and store the result. At the end of all the comparisons you have a memory structure of negative, zero, or positive values. Then you can do something with the matching values.
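A minimal CUDA sketch of that idea (the fixed width, the layout, and the kernel name are all made up for illustration): each thread compares one padded row against the padded query and stores a memcmp-style result instead of branching per character.

    #define FIXED_W 16

    __global__ void compare_rows(const unsigned char* col,   // n_rows * FIXED_W bytes, padded
                                 const unsigned char* query, // FIXED_W bytes, padded
                                 int* result, int n_rows) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;                 // the usual bounds guard
        int cmp = 0;
        for (int i = 0; i < FIXED_W; ++i) {        // fixed trip count, no data-dependent branch
            int d = (int)col[row * FIXED_W + i] - (int)query[i];
            cmp = (cmp == 0) ? d : cmp;            // keep the first nonzero difference (a select, not a branch)
        }
        result[row] = cmp;                         // <0, 0, >0 like memcmp; 0 means candidate match
    }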
There have been some very sweet algorithms for doing B+ tree searches on Itanium and with AVX, and those work just as well, if not better, on GPUs.
With some thinking, branch operations can be converted into math and applied to masses of data without checking for branch conditions. This wastes some work but is still faster than branching.
That sounds like an interesting technique - but unfortunately it does not apply to a majority of real-world data.
In particular I mean sorting non-English text, which any serious database needs to be able to do. Subtracting characters doesn't work when you're dealing with language-specific Unicode collation [0].
Yeah, well, you just can't do it that way and be fast at the same time. I mean seriously fast.
What you can do is decide how your comparison should be collated and preprocess your strings into a sort key form that is strictly big-endian binary (first byte has highest weight). Or little-endian I suppose, whichever works best for your hardware.
Your sort key doesn't have to be the actual string.
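A toy sketch of that preprocessing for plain ASCII, case-insensitive ordering (real collation keys would come from a library like ICU; the width and names here are illustrative): the key is fixed width and compares correctly byte by byte, so the branchless GPU compare above works on it unchanged.

    #include <cstring>
    #include <cctype>

    const int KEY_W = 16;                          // illustrative fixed key width

    // Build a byte-comparable sort key: case-folded, zero-padded so that a
    // shorter string sorts before any string it is a prefix of.
    void make_sort_key(const char* s, unsigned char* key /* KEY_W bytes */) {
        memset(key, 0, KEY_W);
        for (int i = 0; i < KEY_W && s[i]; ++i)
            key[i] = (unsigned char)toupper((unsigned char)s[i]);
    }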
Yes, a data-dependent branch serializes the execution of the GPU stream processing units (this is actually not true for AMD). Yes, it sometimes sucks to operate on strings on the GPU, and we don't always do it because of this. But we are ignoring certain aspects. Long strings are usually dictionary encoded in our database (no one picks this; an optimizer finds the best cascading compression scheme and imposes it). Dictionary encodings are sorted so that each dictionary value's key is in sorted order. So guess what: you can already do comparison and equality checks on much smaller data sizes. A long string can be encoded in 8, 4, 2, sometimes even 1 byte. Many times we have to do string comparisons on strings that we have not encoded, i.e.
select * from table where column1 = "some awesome text here"
In this case we actually do a comparison between hashes of the data. Hashing is cheap, fast, and makes comparisons on the GPU a breeze. So long story short, we have no data-dependent branching. We do this by never using certain statements inside of kernel code.
The use of "if" is expressly forbidden at Blazing for any GPU code, and its use is punished viciously (said individual usually has to be the one who captures meaningful input from one of the 80 log files of our 80-GPU cluster).
Editing to mention: the way you can encode a long string down to 1 byte is by doing dictionary compression and then bit-packing the keys. On the GPU, you do this by getting the max key (the min key is always 0), and then you can store this data in 1 byte (if max(key) < 255).
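A minimal sketch of that width selection (illustrative only, not production code): reduce over the keys to find the max, then pick the narrowest integer type that holds it.

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <cstdint>

    // Pick how many bytes the bit-packed dictionary keys need, from max(key).
    int key_width_bytes(const thrust::device_vector<uint32_t>& keys) {
        uint32_t max_key = thrust::reduce(keys.begin(), keys.end(),
                                          0u, thrust::maximum<uint32_t>());
        if (max_key < (1u << 8))  return 1;        // 256 distinct values fit in one byte
        if (max_key < (1u << 16)) return 2;
        return 4;
    }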
The whole point of making things fast is to remove all the if statements. People complain that their current if-heavy algorithm can't run on GPUs, therefore GPUs suck. Of course the algorithms then need to be reworked to have fewer branches.
Big joins are our best use case. Joins are hard for many databases to optimize when they have not seen them before or are not "expecting" them, like when you let Amazon know how to partition your different tables onto the same physical machines so that Redshift can return your query in a reasonable amount of time. But most SQL operations can be accelerated by the use of GPUs. Order by (holy smokes it helps), arithmetic or date transformations (20-30x versus comparable CPU code), predicates, group by. All of these operations happen over vectors of data, and SIMD rocks when it comes to running these kinds of loads. The only use case we actually think is very poorly suited to GPUs thus far (and this is a nut someone will one day probably crack) is wildcard string searches. Some of our competitors handle this by caching all the data in GPU RAM, but we consider that to be "cheating" since you would never be able to justify the PCIe transfer to do wildcard string searches.
We are an analytical database, not a transactional one. We would love to integrate with more tools like Spark; alas, we are a small team of 5 working on the engine, and making the engine itself has taken most of our time to date.
There was a paper a few months back in deep reinforcement learning that got record-setting RL results using only a CPU [0]. Previously, these algorithms would play fewer games and run a GPU over and over on the few games they had played. By using a CPU you can generate more samples that are up to date with your learning algorithm. It sounds obvious in hindsight, but you can't exactly run 60 Atari sims on a GPU.
> It sounds obvious in hindsight, but you can't exactly run 60 Atari sims on a GPU.
Why not?
It's certainly easier to run parallel Atari sims on CPU, because CPU programs are typically written as single-threaded or with parameterized number of threads.
Running parallel Atari on GPU is completely possible, by running an Atari game on each of the ~30 SMs or on each of the 32 * n_SMs ~= 1000 warps. However, because GPU code is typically written and delivered as kernels which utilize the full GPU, this type of embarrassing parallelism over SMs or warps typically can't be gained from using an existing library.
How many fewer cores, and how much less GPU RAM, and how much slower were GPUs in general when this was tried? They're changing quite rapidly—significantly faster than any other component in a modern PC. Any attempts more than 2 or 3 years ago aren't very relevant.
In my experiments in 2010, GPU speed was irrelevant. The GPU could have performed the calculations literally instantaneously, and it still would have lost to the CPU by several orders of magnitude. Quoting myself, in some contexts, "data movement efficacy trumps raw computational power": http://www.scott-a-s.com/files/debs2010.pdf
Detection of false positives: Run candidate solutions through the CPU
Detection of false negatives: Compare solution distribution and frequency to expected models; switch to debug kernels if outside tolerance.
However, this works because the mining problem space is stateless and follows strict mathematically predictable models.
A DB is stateful and the answers generally can't be verified without consulting a secondary copy, which is why I'm super curious how they would engineer correctness and reliability in a cost-effective way using GPUs.
That's very interesting. Have you collected statistical data on these bit errors? Is it always a single bit error?
I'm assuming you are using GeForce cards and not Tesla cards which have an ECC memory protection mode?
I've tried to collect some statistics on GPU memory errors rates but have found them to be normally extremely rare. The only time I've reproducibly seen them is due to faulty hardware, where the errors become highly reproducible and the GPU needs replacement. The other theoretical cause of bit flips is supposed to be random errors due to cosmic radiation but I've never been able to observe that using memory testing software (though I did only run the experiments in AWS).
Could it be that you have faulty or low grade GPUs? I assume these are all low-cost OEM parts, given your application? Or maybe there's something odd about your data center environment?
Regarding the GPU database application, I think the answer is to just use the Tesla grade GPU with ECC memory enabled.
Generally we prefer AMD cards as most (profitable) mining functions are memory bandwidth dominated. Usually it's a shader unit that gets unstable in the 70-80C range (note that most silicon is rated for higher ranges).
AMD's hardware specifications are more open too, which lets you build your own shader compilers and get direct access to the iron.
We've been working on an interposing library for guaranteeing GPU computation and it would be great to get your feedback. Any chance we sync up? My email is in my profile.
Not OP but a lot of this seems pretty believable and easy to detect to me?
Don't the GPU specs allow for a certain lossiness in the math? Or like, at least they don't conform to IEEE 754 float specs with regard to order of operations, precision, degradation, etc.
So like, do a shitload of math ops in a glsl shader with a deterministic outcome, render the result to a texture, take the texture back and make sure the RGBA values match bit for bit with the numbers you expected?
Or to detect single-bit errors in the GPU's local memory or caches, just attach textures, load data into them, read it back, render back, etc.
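Here is the same idea sketched in CUDA rather than GLSL (entirely illustrative): integer math is exact, so any mismatch against the CPU reference points at a compute or memory error rather than floating-point slack.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void churn(unsigned* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned x = i;
        for (int k = 0; k < 1000; ++k)             // deterministic integer mixing (an LCG step)
            x = x * 1664525u + 1013904223u;
        out[i] = x;
    }

    int main() {
        const int n = 1 << 20;
        unsigned* d;
        cudaMalloc(&d, n * sizeof(unsigned));
        churn<<<(n + 255) / 256, 256>>>(d, n);
        std::vector<unsigned> got(n);
        cudaMemcpy(got.data(), d, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) {              // recompute on the CPU, compare bit for bit
            unsigned x = i;
            for (int k = 0; k < 1000; ++k) x = x * 1664525u + 1013904223u;
            if (x != got[i]) printf("mismatch at %d\n", i);
        }
        cudaFree(d);
        return 0;
    }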
I am reminded of a Jack Vance quote, where one of the characters is "capable of performing complex calculations in his head and furnishing the results in an instant, whether they are right or wrong".
I cannot find any reference to handling of soft errors in their material. One rather banal approach is to do everything twice and check the results; effective, but I'm not sure that they do this. They may be simply putting their heads in the sand.
Do you use compute shaders / CUDA? I'm really surprised at the error rate. I've used fragment shaders in OpenGL ES 2.0 for compute on mobile platforms, and the "errors" turned out to be dithering.
The numbers for GPU databases look “good” because you can get pretty high cross-sectional bandwidth to a reasonably large memory from 8 GPUs in one box, and advertise blazing speed from that. But it’s just a trick.
The only thing that matters for them here is the aggregate, cross-sectional bandwidth to your data’s working set in memory. For databases, especially for the approach that many GPU databases take (light on indexes since GPUs aren’t great at data-dependent memory movement or pointer chasing, just brute-force scan much of the data), the working set size is something that will only fit in main memory.
Instead of using 8 GPUs with a peak global, cross-sectional memory b/w of 8 * 320 = 2560 GB/sec and brute-force scans, you can parallelize across ~40 CPU nodes each with ~60 GB/sec b/w to main memory. The cross-sectional bandwidth will be about the same, and the cost to split and join the results of the query is likely small in comparison to actually doing the work, assuming the intermediate results are reasonably small. You can use a broadcast and reduction tree; the added latency of the broadcast and reduction tree's depth likely won't add much, since there isn’t much data to broadcast in a query, and the data returned by each machine for the reduction is hopefully (!) tiny in proportion to the actual data scanned.
If you want to consider indices on data, then maybe the heads of that can remain resident in a CPU’s cache, and will make the individual CPU scans even faster. The GPU caches are tiny and mainly serve to patch up strided loads and other bad uses of memory.
Whether or not it’s worth it one way or another depends upon how large your database size is, the relative cost of GPUs versus CPU nodes to get the memory you want and the cross-sectional b/w you need, perf/W and other issues.
You’re probably nowhere near close to arithmetic throughput bounds on GPUs or CPUs since these workloads have very low op / byte loaded ratios compared to typical HPC workloads, so that aspect of GPUs doesn’t matter. If you’re doing expensive pre- or post-processing on GPUs as well, then that may push the balance more towards GPUs.
You bring up some really good points about bandwidth to the data and how many of our competitors have gotten ridiculously high benchmarks. We do NOT cache on the GPU, EVER. The reason is that most of our columns are much too large to fit on one or even 8 GPUs, let alone leave room for the rest of the space required for processing.
This is why we actually prefer to have only 1 GPU per server when we are making our own boxes. We find that our most optimal running environment is when we have smaller instances with only 1 GPU. This is due to the fact that two smaller rigs with a GPU each will benefit from increased CPU RAM throughput (basically double that of 1 rig), and you have 2x the PCIe bandwidth since you are splitting the work across two machines. While we don't have the enviable op/byte-loaded ratio that some machine learning toolsets might, we are able to greatly enhance these throughputs by using compression and by doing things like running multiple arithmetic operations in one kernel call.
Got pals at MapD. I'd like to see any of them benchmarked against Kx. Pretty much all database problems are IO bound, and not many get it as right as Art did.
Every time I read about offloading work to the GPU, good old times come to my mind.
I vividly remember the Intel 8087, a math co-processor to the Intel 8086 that came out in 1980-1981. All the floating point arithmetic was offloaded to it.
It ended up disappearing as a separate chip with the Intel 80486 in the late eighties.