Hacker News
BlazingDB uses GPUs to manipulate huge databases in no time (techcrunch.com)
121 points by rezist808 on Sept 12, 2016 | 69 comments



Several companies have implemented databases on GPUs but there is a good technical reason that the approach has never really caught on, and some of these companies even migrated to selling the same platform on CPUs only.

The weakness of GPU databases is that while they have fantastic internal bandwidth, their network to the rest of the hardware in a server system is over PCIe, which generally isn't going to be as good as what a CPU has and databases tend to be bandwidth bound. This is a real bottleneck and trying to work around it makes the entire software stack clunky.

I once asked the designer of a good GPU database what the "no bullshit" performance numbers were relative to CPU. He told me GPU was about 4x the throughput of CPU, which is very good, but after you added in the extra hardware costs, power costs, engineering complexity etc, the overall economics were about the same to a first approximation. And so he advised me to not waste my time considering GPU architectures for databases outside of some exotic, narrow use cases where the economics still made sense. Which was completely sensible.

For data intensive processing, like databases, you don't want your engine to live on a coprocessor. No matter how attractive the coprocessor, the connectivity to the rest of the system extracts a big enough price that it is rarely worth it.


You pretty much hit the nail on the head: a big limitation is feeding the GPUs in the first place.

Furthermore, all that memory bandwidth is calculated against all the cores. So you have to be VERY careful in usage patterns (it doesn't work like a giant CPU). Not to mention the cost involved per GB or TB!

I have some experience in this area, and where GPUs & databases really shine is building (and especially re-building) indexes.

Running a database on GPUs isn't going to replace all DBs overnight. But the hardware does have its uses, just like improved SIMD on CPUs and NVMe storage, plus developments in networking. Basically, look where Intel is going, since we're coming to the limits of silicon transistors.


I can definitely see value in offloading a bunch of number crunching on sufficiently large data sets to a graphics card (particularly parts that are trivially parallelizable). But to me, that seems like an optimization a database would make in its query planner, not one that you'd necessarily want to build an entire standalone product around.


> The weakness of GPU databases is that while they have fantastic internal bandwidth, their network to the rest of the hardware in a server system is over PCIe, which generally isn't going to be as good as what a CPU has and databases tend to be bandwidth bound. This is a real bottleneck and trying to work around it makes the entire software stack clunky.

How relevant is that when you're looking at multi-TB data sets that don't fit into computer RAM? Sure, the RAM <---> CPU bandwidth may be very wide, but the SSD connects to the computer over the same PCIe bus.

And also: when did you have this conversation? GPU performance has changed very much year by year, so what wouldn't have been worth it 2 years ago might be a huge gain now.


The difficulty with how most GPUs are connected to the rest of the system is that the data has to go RAM -> CPU -> GPU. If it could go directly RAM -> GPU, the calculus would be better, but still not great, as PCIe is still lower bandwidth and higher latency than RAM -> CPU.

It's not about GPU performance, it's about the latency and bandwidth of getting that data to the GPU. If, once you ship data to the GPU, you reuse it many times for many calculations, that cost is amortized and it doesn't matter as much. But if you ship data to the GPU and use it once, then that cost will probably not be amortized. I think databases tend to fit in the latter category.


It would seem that rationalization would fall apart quickly with the new Power CPUs that have NVLink built right into the CPU. Getting data back and forth shouldn't be a problem anymore.

Outside of CPU >> GPU, I'm not sure what other data movement you could be talking about. A SAS HBA or Ethernet NIC or Infiniband HBA are almost always going to be operating over the same PCIe bus the GPU uses. In the rare instances they're built onto the CPU, the "network link" is likely going to still be significantly slower than the fastest PCIe slot.


>It would seem that rationalization would fall apart quickly with the new Power CPUs that have NVLink

NVLink has 80GB/s [1]. DDR4 quad channel (Xeon servers) has ~120GB/s [2]. So no, this rationalization doesn't fall apart. Furthermore, in the event NVLink gets faster than RAM, you'll still be bottlenecked by RAM access, as you'll buffer there.

This of course is ignoring weird systems where you attempt to maintain ACID coherence of tables between GPU, CPU, and disk memory. But then GPU memory size becomes inherently limiting, as even the biggest max out at ~32GB.

[1] https://en.wikipedia.org/wiki/NVLink

[2] http://www.corsair.com/en-us/blog/2014/september/ddr3_vs_ddr... (2channel -> 4channel x2)


What is NVLink, and what does it mean in terms of data transfer?


It's a faster link between future IBM POWER processors and NVidia GPUs, but it won't make waves in the database market since those systems are niche HPC/supercomputing hardware.


I've also seen NVLink on some of the pre-Pascal roadmaps for nVidia's gaming-oriented graphics cards. Since the current generation of gaming consoles has HSA, I'm hoping that it gains in popularity and becomes less niche. The problem is that it'd be a pretty vital component to be nVidia proprietary.


CAPI is "open" solution to do the same thing, though NVLink supposedly still offers more bandwidth. Unfortunately CAPI is only available on POWER8 hardware at the moment and I'm not sure if IBM is open to license it beyond OpenPOWER - still, there's a decent number of CAPI-capable cards already available and more coming to the market in the future.

GPUs may not be great for every unit of work an RDBMS has to perform, but given their ability to rapidly compute hashes, they could help a lot with joins (as evidenced by PGStrom).


http://www.nvidia.com/object/nvlink.html - NVIDIA® NVLink™ is a high-bandwidth, energy-efficient interconnect that enables ultra-fast communication between the CPU and GPU, and between GPUs.

And it's a big item on the PowerPC roadmap http://www.nextplatform.com/2016/04/07/ibm-unfolds-power-chi... "With NVLink, multiple GPUs can be linked by 20 GB/sec links (bi-directional at that speed) to each other or to the Power8 processor so they can share data more rapidly than is possible over PCI-Express 3.0 peripheral links. (Those PCI-Express links top out at 16 GB/sec and, unlike NVLink, they cannot be aggregated to boost the bandwidth between two devices.)"


Fully agreed. I wrote a paper six years ago that came to the same conclusion: http://www.scott-a-s.com/files/debs2010.pdf


you don't want your engine to live on a coprocessor

Well yes and no. Both IBM and Oracle have gotten impressive performance using database-specific co-processors, but these are not GPGPUs, they are dedicated hardware that sits on the storage path. Baidu are reinventing that wheel with FPGAs too.


Is the data stored in GPU memory?


I am the CTO of blazingDB. The data is not stored in GPU memory or even RAM. We operate from disk, though we will cache information in RAM when we have plenty of it available.


Limiting your queries to ~12GB, the max RAM on one GPU (beyond which PCIe I/O becomes a bottleneck), will be a problem for business use, I'd think.


For some reason I am surprised that anyone would want to ship the data whole and as is, to the GPU. Wouldn't it make more sense to use a representative, transformed "GPU-ready" data set, both much smaller in size & designed specifically for the queries that are to be optimized?


We are not shipping all of the data as a whole to the GPU. We are going to be releasing some whitepapers that explain this in more detail, but let's get a few things clear. Data is sent to the GPU compressed, since it is compressed when it is stored. We can decompress VERY quickly on GPUs (30-50GB/s is easily achievable on a K80), because each of our columns is compressed using one of our cascading compression algorithms (whichever offers the best trade-off of compression and throughput). We are a column store and only send over the columns that are being used in processing. So for example

select id, name, age, avg(income) from people group by gender

In this case only the income and gender columns would actually be sent to the GPU, and they would be sent in a compressed fashion to increase the "effective" bandwidth of data over PCIe. Even more interesting is that id, name, and age would be pulled from our horizontal store instead of our compressed columnar store in order to minimize the number of IOPS necessary to fill the result set.


Once you read and prune out the dataset to only include the relevant data, then what's left for the GPU to do?


The transformation to GPU-ready would not be as trivial and effectively redundant as pruning the data, of course. It would produce a secondary data structure, like an index on a column, though in this case of course destined to be processed within the math-oriented, high-branch-cost setting of a GPU.


This is basically the bread and butter of columnar compute systems, not just GPU ones. GPU ones just get to throw more compute at them, and thus do even better on these kinds of space/time trade-offs. Interestingly, most big data systems are increasingly columnar.


When did you ask? GPUs have been getting a lot faster year on year.

Though, I would like to see Xeon Phi (Knights Landing) compared to GPUs for DB uses.


The data bandwidth between main memory and that which the GPU works with isn't speeding up by the same factors though. This is fine when working with a dataset that fits into the GPU's memory pool and your workload involves relatively few (or zero) changes, because you can transfer it once and repeatedly ask the GPUs to analyse it in whatever ways. As soon as the common dataset doesn't fit neatly into the GPU's RAM (leaving enough spare for scratch space) you end up thrashing the channel and it becomes the main bottleneck.


That's true. There are some workarounds though, like distributing the workload between multiple GPUs - actively under development in machine learning for instance


This has been tried many times. Unless you are doing a computation with sufficient arithmetic intensity [0] the cost of shipping the data over PCIe and back dominates any gain you might get over a CPU.

[0] - http://www.nersc.gov/users/application-performance/measuring...


Seems kind of like a blanket statement to make if you have not investigated this for yourself. So let me give you some example use cases.

Decompression for processing: We can round-trip decompress 8-byte integers (Rle-Delta-Rle cascaded) 4x faster on an AWS g2.2xlarge when we use the GPU. This includes sending the data TO the GPU and bringing it back. Our decompression segments on CPU were set up so that every thread was processing a segment to be decompressed, so we were using every available thread at about 100%.

Sorting data: Here the difference can be startling. On an AWS g2.2xlarge we are able to sort orders of magnitude faster than you can on CPU. Check out Thrust to run some examples: http://docs.nvidia.com/cuda/thrust/#axzz4K7CRY352

A few modifications there can let you run this with both an NVIDIA backend and one that runs on CPU threads. It will run orders of magnitude faster on GPU than CPU on a small GPU instance on Amazon. Even a laptop GPU would still outperform the CPU sorting capacity by at least an order of magnitude.
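
(In the spirit of the Thrust examples linked above, a minimal sketch that sorts the same keys with the CUDA device backend and with Thrust's OpenMP host backend, so the two can be timed side by side - the 32M-key size and the timing loop are just illustrative:)

    // sort_compare.cu -- compile with: nvcc -O3 -Xcompiler -fopenmp sort_compare.cu
    // Minimal sketch: time thrust::sort on the GPU vs. Thrust's OpenMP CPU backend.
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/system/omp/execution_policy.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int n = 32 << 20;                        // ~32M keys, illustrative only
        thrust::host_vector<int> h(n);
        for (int i = 0; i < n; ++i) h[i] = rand();

        thrust::device_vector<int> d = h;              // one copy over PCIe
        auto t0 = std::chrono::steady_clock::now();
        thrust::sort(d.begin(), d.end());              // CUDA device backend
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();

        thrust::host_vector<int> h2 = h;               // fresh unsorted copy
        auto t2 = std::chrono::steady_clock::now();
        thrust::sort(thrust::omp::par, h2.begin(), h2.end());   // CPU threads
        auto t3 = std::chrono::steady_clock::now();

        printf("GPU sort: %.3f s, CPU (OMP) sort: %.3f s\n",
               std::chrono::duration<double>(t1 - t0).count(),
               std::chrono::duration<double>(t3 - t2).count());
        return 0;
    }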


Seems like a great application of AMD's Fiji GPU w/onboard flash: http://www.anandtech.com/show/10518/amd-announces-radeon-pro...


You would need a GPU with an onboard SSD (exists). Or GPUs and drivers with the ability to talk directly to infiniband hardware (exists).

Or the ability to not be ridiculously wasteful of existing resources (also exists).

Your naysaying doesn't make you smart. Your naysaying makes you cut off from learning a different, better way of doing things.


Database workloads tend to be very branch heavy, e.g. string comparisons. If you've ever programmed a GPU you know that taking a data-dependent branch serializes the execution of the GPU stream processing units. So already the 70-100x benefit GPUs give on vectorized floating point workloads is greatly reduced. Add in the memory bandwidth penalty and it's a complete waste for IO intensive database workloads.

Save the GPUs for training neural nets and physics simulations. When the fundamental hardware capabilities of GPUs make them worth investing in for database workloads, I'll change my opinion. Until then, any budget for expensive DB hardware is much better spent on NVMe storage.


You can write code to compare strings without branches. Imagine comparing one string to every string in a database column. Pad to a fixed size first (if you find an ambiguous match later you can do a full string compare then). Subtract the strings, aka compare, and store the result. At the end of all the comparisons you have a memory structure of negative, zero, or positive values. Then you can do something with the matching values.

There have been some very sweet algorithms for doing B+ tree searches for Itanium and for AVX and those work just as well, if not better on GPU.

With some thinking, branch operations can be converted into math and applied to masses of data without checking for branch conditions. This wastes some work but is still faster than branching.
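
(As a concrete illustration of the subtract-and-store idea, here's a minimal CUDA sketch - the fixed-width, zero-padded column layout is an assumption, one thread per row, with no data-dependent branch in the comparison loop:)

    // Compare one zero-padded, fixed-width key against every row of a column
    // and store -1 / 0 / +1 per row -- no data-dependent branches.
    __global__ void compare_fixed_width(const unsigned char* col,  // n rows * width bytes
                                        const unsigned char* key,  // width bytes, zero-padded
                                        int* out, int n, int width)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;                         // bounds check only, not data-dependent

        const unsigned char* s = col + (size_t)row * width;
        int result = 0;
        for (int i = 0; i < width; ++i) {
            int diff = (int)s[i] - (int)key[i];       // per-byte difference
            result += (result == 0) * diff;           // keep the first non-zero difference
        }
        out[row] = (result > 0) - (result < 0);       // clamp to {-1, 0, +1}
    }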


That sounds like an interesting technique - but it unfortunately does not apply on a majority of real-world data.

In particular I mean sorting non-English text, which any serious database needs to be able to do. Subtracting characters doesn't work when you're dealing with language-specific Unicode collation [0].

[0] - http://www.unicode.org/reports/tr10/


Yeah, well, you just can't do it that way and be fast at the same time. I mean seriously fast.

What you can do is decide how your comparison should be collated and preprocess your strings into a sort key form that is strictly big-endian binary (first byte has highest weight). Or little-endian I suppose, whichever works best for your hardware.

Your sort key doesn't have to be the actual string.
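
(For reference, the C standard library already ships this primitive: strxfrm transforms a string so that a plain byte-wise strcmp on the transformed keys agrees with strcoll on the originals. A minimal host-side sketch, with an example locale; ICU collation sort keys play the same role for full Unicode collation:)

    // Build collation sort keys once on the CPU; all later comparisons are
    // plain byte-wise compares (memcmp/strcmp) -- GPU friendly.
    #include <clocale>
    #include <cstring>
    #include <string>
    #include <vector>

    std::vector<std::string> make_sort_keys(const std::vector<std::string>& rows) {
        std::setlocale(LC_COLLATE, "en_US.UTF-8");              // example locale
        std::vector<std::string> keys;
        keys.reserve(rows.size());
        for (const auto& r : rows) {
            size_t need = std::strxfrm(nullptr, r.c_str(), 0) + 1;  // required buffer size
            std::string k(need, '\0');
            std::strxfrm(&k[0], r.c_str(), need);
            k.resize(need - 1);                                 // drop the trailing NUL
            keys.push_back(std::move(k));                       // strcmp(keys[i], keys[j])
        }                                                       // orders like strcoll(rows[i], rows[j])
        return keys;
    }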


Yes, a data-dependent branch serializes the execution of the GPU stream processing units (this is not true for AMD actually). Yes, it sucks operating on strings sometimes on GPU and we don't always do it because of this. But we are ignoring certain aspects. Long strings are usually dictionary encoded in our database (no one picks this; an optimizer finds the best cascading compression scheme and imposes it). Dictionary encodings are sorted so that each dictionary value's key is in sorted order. So guess what, you can already do comparison and equality checks on much smaller data sizes. A long string can be encoded in 8, 4, 2, or sometimes even 1 byte. Many times we have to do string comparisons on strings that we have not encoded, i.e.

select * from table where column1 = "some awesome text here"

In this case we actually do a comparison between hashes of the data. Hashing is cheap, fast, and makes comparisons on the GPU a breeze. So long story short, we have no data-dependent branching. We do this by never using certain statements inside of kernel code.

the use of "if" is expressly forbidden at blazing for any gpu code and it's use is punished viciously (said individual usually has to be the one that captures meaningful input from one the 80 log files of our 80 gpu cluster ).

Editing to mention: the way you can encode a long string down to 1 byte is by doing a dictionary compression and then bit packing the keys. On GPU the way you do this is by getting the max key (the min key is always 0), and then you can store this data in 1 byte (if max(key) < 255).
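
(Roughly what a predicate over such a dictionary-encoded, bit-packed column could look like - the kernel and names below are illustrative, not BlazingDB's actual code. The sorted dictionary is searched once on the host to turn the string literal into a key, and the GPU then only touches the 1-byte keys:)

    // Host side (illustrative): sorted dictionary -> translate the literal once,
    // after checking the literal actually exists in the dictionary.
    //   auto it = std::lower_bound(dict.begin(), dict.end(), "some awesome text here");
    //   unsigned char literal_key = (unsigned char)(it - dict.begin());
    //
    // Device side: equality predicate over 1-byte dictionary keys, no string compares.
    __global__ void eq_on_dictionary_keys(const unsigned char* keys,  // bit-packed column
                                          unsigned char literal_key,  // key of the literal
                                          unsigned char* match, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            match[i] = (keys[i] == literal_key);   // 1-byte compare per row
    }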


The whole point of making things fast is to remove all the if statements. People complain that their current if-heavy algorithm can't run on GPUs, therefore GPUs suck. Of course the algorithms then need to be reworked to have fewer branches.


BlazingDB is advertising as being run on AWS, Azure, Softlayer; do the GPUs on these systems have these hardware features?


Seems like a smarter response to me. Go ahead and use a GPU database if you want?? I have yet to see a good use case of one though.


We have seen many.

Big joins are our best use case. Joins are hard for many databases to optimize when they have not seen them before or are not "expecting" them, like when you let Amazon know how to partition your different tables onto the same physical machines so that Redshift can return your query in a reasonable amount of time. But most SQL operations can be accelerated by the use of GPUs. Order by (holy smokes it helps), arithmetic or date transformations (20-30x for comparable CPU code), predicates, group by. All of these operations are happening over vectors of data. SIMD rocks out when it comes to running these kinds of loads. The only use case that we actually think is very poorly suited to GPUs thus far (and this is a nut someone will one day probably crack) is wildcard string searches. Some of our competitors handle this by caching all the data in GPU RAM, but we consider that to be "cheating" since you would never be able to justify the PCIe transfer to do wildcard string searches.


This seems more like an analytics workload.. you are querying a bunch of things you can't index ahead of time.

Why not put GPUs on your analytics machines? Or a cluster of them with SPARK. Or heck, distribute the spark cluster on top of your database.


We are an analytical database, not a transactional one. We would love to integrate with more tools like Spark; alas, we are a small team of 5 working on the engine, and making the engine itself has taken most of our time to date.


There was a paper a few months back in deep reinforcement learning that got record setting RL results using only a CPU [0]. Previously, these algorithms would play fewer games and run a gpu over and over on the few games they had played. By using a CPU you can generate more samples that are up to date with your learning algorithm. It sounds obvious in hindsight, but you can't exactly run 60 atari sims on a gpu.

[0] - https://arxiv.org/abs/1602.01783


> It sounds obvious in hindsight, but you can't exactly run 60 atari sims on a gpu.

Why not?

It's certainly easier to run parallel Atari sims on CPU, because CPU programs are typically written as single-threaded or with parameterized number of threads.

Running parallel Atari on GPU is completely possible, with either running an Atari game on each of the ~30 SMs or on each of the 32 * n_SMs ~= 1000 warps. However, because GPU code is typically written and delivered as kernels which utilize the full GPU, this type of embarrassing parallelism over SMs or warps typically can't be gained from using an existing library.


> This has been tried many times.

How many fewer cores, and how much less GPU RAM, and how much slower were GPUs in general when this was tried? They're changing quite rapidly—significantly faster than any other component in a modern PC. Any attempts more than 2 or 3 years ago aren't very relevant.


In my experiments in 2010, GPU speed was irrelevant. The GPU could have performed the calculations literally instantaneously, and the GPU still would have lost to the CPU by several orders of magnitude. Quoting myself, in some contexts, "data movement efficacy trumps raw computational power": http://www.scott-a-s.com/files/debs2010.pdf


Video encoders like x264 are not GPU accelerated, for the same reason.


How do they guarantee the correctness of results? A major problem with GPUs is you see single bit errors with surprisingly high frequency.

For graphics this usually doesn't matter as a minor color or vertex deviation isn't noticeable, but for compute it can be devastating.

We do cryptocurrency mining on an industrial scale and constantly see single bit errors from hardware that is brand-new without modifications.


We do cryptocurrency mining on an industrial scale and constantly see single bit errors from hardware that is brand-new without modifications.

That's very surprising and interesting. How do you detect these single bit errors?


Detection of false positives: Run candidate solutions through the CPU

Detection of false negatives: Compare solution distribution and frequency to expected models; switch to debug kernels if outside tolerance.

However, this works because the mining problem space is stateless and follows strict mathematically predictable models.

A DB is stateful and the answers generally can't be verified without consulting a secondary copy, which is why I'm super curious how they would engineer correctness and reliability in a cost-effective way using GPUs.
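
(A minimal sketch of the false-positive half described above - the hash below is a stand-in, not a real mining hash; the point is only that every GPU-reported candidate gets an independent CPU recompute before it is trusted:)

    #include <cstdint>
    #include <vector>

    // Stand-in hash (FNV-1a over the nonce bytes); a real miner would recompute
    // the actual mining hash (e.g. SHA-256d over the block header) here.
    static uint64_t cpu_hash(uint64_t nonce) {
        uint64_t h = 0xcbf29ce484222325ull;
        for (int i = 0; i < 8; ++i) {
            h ^= (nonce >> (8 * i)) & 0xff;
            h *= 0x100000001b3ull;
        }
        return h;
    }

    // False-positive check: re-verify every candidate the GPU reported, so a
    // flipped bit in the GPU's result is caught before the share is submitted.
    std::vector<uint64_t> verify(const std::vector<uint64_t>& gpu_candidates,
                                 uint64_t target) {
        std::vector<uint64_t> confirmed;
        for (uint64_t nonce : gpu_candidates)
            if (cpu_hash(nonce) <= target)          // independent CPU recompute
                confirmed.push_back(nonce);
        return confirmed;
    }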


That's very interesting. Have you collected statistical data on these bit errors? Is it always a single bit error?

I'm assuming you are using GeForce cards and not Tesla cards which have an ECC memory protection mode?

I've tried to collect some statistics on GPU memory errors rates but have found them to be normally extremely rare. The only time I've reproducibly seen them is due to faulty hardware, where the errors become highly reproducible and the GPU needs replacement. The other theoretical cause of bit flips is supposed to be random errors due to cosmic radiation but I've never been able to observe that using memory testing software (though I did only run the experiments in AWS).

Could it be that you have faulty or low grade GPUs? I assume these are all low-cost OEM parts, given your application? Or maybe there's something odd about your data center environment?

Regarding the GPU database application, I think the answer is to just use the Tesla grade GPU with ECC memory enabled.


Generally we prefer AMD cards as most (profitable) mining functions are memory bandwidth dominated. Usually it's a shader unit that gets unstable in the 70-80C range (note that most silicon is rated for higher ranges).

AMD's hardware specifications are more open too, which lets you build your own shader compilers and get direct access to the iron.


We've been working on an interposing library for guaranteeing GPU computation and it would be great to get your feedback. Any chance we sync up? My email is in my profile.


Not OP but a lot of this seems pretty believable and easy to detect to me?

Don't the GPU specs allow for a certain lossiness in the math? Or like, at least they don't conform to the IEEE 754 float spec with regard to order of operations, precision, degradation, etc.

So like, do a shitload of math ops in a glsl shader with a deterministic outcome, render the result to a texture, take the texture back and make sure the RGBA values match bit for bit with the numbers you expected?

Or to detect single-bit errors in the GPU's local memory or caches, just attach textures, read data into them, read it back, render back, etc., etc.
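
(In CUDA terms, the same bit-exact check is just a deterministic integer kernel plus a CPU replay and compare - a minimal sketch:)

    // Deterministic self-test: GPU and CPU compute the same integer function;
    // any mismatch on readback indicates a flipped bit somewhere along the path.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    __host__ __device__ uint32_t mix(uint32_t x) {
        x ^= x >> 16;  x *= 0x7feb352d;    // integer math: bit-exact on both sides
        x ^= x >> 15;  x *= 0x846ca68b;
        x ^= x >> 16;
        return x;
    }

    __global__ void fill(uint32_t* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = mix(i);
    }

    int main() {
        const int n = 1 << 24;
        uint32_t* d;
        cudaMalloc((void**)&d, n * sizeof(uint32_t));
        fill<<<(n + 255) / 256, 256>>>(d, n);
        std::vector<uint32_t> back(n);
        cudaMemcpy(back.data(), d, n * sizeof(uint32_t), cudaMemcpyDeviceToHost);
        long mismatches = 0;
        for (int i = 0; i < n; ++i) mismatches += (back[i] != mix(i));   // CPU replay
        printf("%ld mismatching words\n", mismatches);
        cudaFree(d);
        return 0;
    }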


I am reminded of a Jack Vance quote, where one of the characters is "capable of performing complex calculations in his head and furnishing the results in an instant, whether they are right or wrong".

I cannot find any reference to handling of soft errors in their material. One rather banal approach is to do everything twice and check the results; effective, but I'm not sure that they do this. They may be simply putting their heads in the sand.


It's surprising you can still do cryptocurrency mining with GPUs. Not bitcoin, then?


Ethereum, Monero, and even Ethereum-Classic (difficulty/arbitrage/exchange rate permitting) are all good GPU candidates -

SHA-256d (Bitcoin) and Scrypt (Litecoin) have ASICs, X11 (Dash - formerly Darkcoin) has FPGAs.


The GP didn't say that their mining is done on GPUs.


It is implied that mining is how they are familiar with bit error problems in GPUs.


Do you use compute shaders / CUDA? I'm really surprised at the error rate. I've used fragment shaders in OpenGL ES 2.0 for compute on mobile platforms, and the "errors" turned out to be dithering.


Some Tesla GPUs support ECC memory, so that should help.


The numbers for GPU databases look “good” because you can get pretty high cross-sectional bandwidth to a reasonably large memory from 8 GPUs in one box, and advertise blazing speed from that. But it’s just a trick.

The only thing that matters for them here is the aggregate, cross-sectional bandwidth to your data’s working set in memory. For databases, especially for the approach that many GPU databases take (light on indexes since GPUs aren’t great at data-dependent memory movement or pointer chasing, just brute-force scan much of the data), the working set size is something that will only fit in main memory.

Instead of using 8 GPUs with a peak global, cross-sectional memory b/w of 8 * 320 = 2560 GB/sec and brute-force scans, you can parallelize across ~40 CPU nodes each with ~60 GB/sec b/w to main memory. The cross-sectional bandwidth will be about the same, and the cost to split and join the results of the query is likely small in comparison to actually doing the work, assuming the intermediate results are reasonably small. You can use a broadcast and reduction tree; the added latency of the broadcast and reduction tree's depth likely won't add much, since there isn’t much data to broadcast in a query, and the data returned by each machine for the reduction is hopefully (!) tiny in proportion to the actual data scanned.

If you want to consider indices on data, then maybe the heads of that can remain resident in a CPU’s cache, and will make the individual CPU scans even faster. The GPU caches are tiny and mainly serve to patch up strided loads and other bad uses of memory.

Whether or not it’s worth it one way or another depends upon how large your database size is, the relative cost of GPUs versus CPU nodes to get the memory you want and the cross-sectional b/w you need, perf/W and other issues.

You’re probably nowhere near close to arithmetic throughput bounds on GPUs or CPUs since these workloads have very low op / byte loaded ratios compared to typical HPC workloads, so that aspect of GPUs doesn’t matter. If you’re doing expensive pre- or post-processing on GPUs as well, then that may push the balance more towards GPUs.


You bring up some really good points about bandwidth to the data and how many of our competitors have gotten ridiculously high benchmarks. We do NOT cache on the GPU, EVER. The reason is that most of our columns are much too large to fit on one or even 8 GPUs, let alone the rest of the space required for processing on them.

That is why we actually prefer to have only 1 GPU per server when we are making our own boxes. We find that our most optimal running environment is when we have smaller instances with only 1 GPU. This is due to the fact that two smaller rigs with a GPU each will benefit from increased CPU RAM throughput (basically double 1 rig) and you have 2x the PCIe bandwidth since you are splitting it across two machines. While we don't have the enviable op / byte loaded ratio that some machine learning toolsets might enjoy, we are able to greatly enhance these throughputs by using compression and doing things like running multiple arithmetic operations in one kernel call.
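
(A minimal sketch of what "multiple arithmetic operations in one kernel call" could look like - the column names and expressions are invented for illustration; each column is read from GPU memory once and every expression that needs it is evaluated in the same pass, instead of one kernel launch and one full read per operator:)

    // Fused expression evaluation: read each input column once, write several
    // derived columns in the same pass (illustrative, not BlazingDB's code).
    __global__ void fused_arithmetic(const float* price, const float* qty,
                                     float* revenue, float* discounted, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r = price[i] * qty[i];   // expression 1
            revenue[i]    = r;
            discounted[i] = 0.9f * r;      // expression 2 reuses the same loads
        }
    }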


It would be interesting to benchmark this against https://wiki.postgresql.org/wiki/PGStrom .


And MapD, too.


Got pals at MapD. I'd like to see any of them benchmarked against Kx. Pretty much all database problems are IO bound, and not many get it as right as Art did.


.. and BlazeGraph, GunRock, etc. :)


Every time I read about offloading work to the GPU, good old times come to my mind.

I vividly remember the Intel 8087, a math co-processor to the Intel 8086 that came out in 1980-1981. All the floating point arithmetic was offloaded to it.

It ended up disappearing as a separate chip with the Intel 80486 in the late eighties.

[1] https://en.wikipedia.org/wiki/Intel_8087



Sounds similar to what Netezza did with FPGAs, but here with GPUs. So they may hit patents they say they are unaware of...


Yeah, they need to be unaware of them...



