Hacker News
Vectorized processing for Apache Arrow (github.com/dremio)
103 points by gizmodo59 on Aug 4, 2018 | hide | past | favorite | 8 comments



There is another component called Apache Arrow Flight, a new way for applications to interact with Arrow. Now that there is an established way of representing data in memory, Flight is an RPC protocol for exchanging that data between systems much more efficiently. You can think of this as a much faster alternative to ODBC/JDBC for in-memory analytics.


Arrow Flight is still being developed, and it may be a while before it reaches real user code. See ARROW-249. Indeed, having an open standard runtime memory format for tabular/columnar data sets is key to improved performance.


No mention anywhere of how it was ensured that the benchmark measurements were taken after the code had been compiled by the JIT (specifically, C2 on HotSpot)... seriously?
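To make the concern concrete, here is a minimal sketch of the kind of warmup a hand-rolled JVM microbenchmark needs before timing anything (class and method names are illustrative; a real benchmark should use JMH, which handles warmup, dead-code elimination, and forked JVMs properly):

```java
// Sketch: warm up the workload so HotSpot's JIT (ultimately C2)
// compiles it before any measurement is taken. All names here are
// illustrative, not from the Gandiva benchmarks.
public class WarmupSketch {
    // The workload under test: a simple columnar sum.
    static long sum(long[] column) {
        long total = 0;
        for (long v : column) total += v;
        return total;
    }

    public static void main(String[] args) {
        long[] column = new long[1_000_000];
        for (int i = 0; i < column.length; i++) column[i] = i;

        // Warmup: run enough iterations that the JIT compiles sum()
        // (thousands of invocations is typically enough). The sink
        // keeps the JIT from eliminating the loop as dead code.
        long sink = 0;
        for (int i = 0; i < 10_000; i++) sink += sum(column);

        // Only now take the measurement.
        long start = System.nanoTime();
        sink += sum(column);
        long elapsedNs = System.nanoTime() - start;
        System.out.println("sum=" + sink + " elapsedNs=" + elapsedNs);
    }
}
```

Without the warmup loop, the timed call may measure interpreted or C1-compiled code, which can be an order of magnitude slower than the steady state.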


I'd like to use it; could you guys add a license to this thing?


The technology behind this (and Apache Arrow in general!) is really cool.

The HN submission title is much more clickbaity than the actual content deserves. Any asymptotic improvement (in terms of big-O runtime complexity) can be scaled up to 100x by making the data set large enough.
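The parent's point can be shown with operation counts rather than noisy wall-clock timings (a toy sketch; the functions just model the asymptotic costs, they are not from any real benchmark):

```java
// Illustrative operation counts: an O(n^2) pairwise scan vs an
// O(n) single pass. The speedup ratio is n itself, so any target
// factor (10x, 100x, ...) is reached simply by growing the data set.
public class SpeedupScaling {
    static long quadraticOps(int n) { return (long) n * n; } // e.g. nested loops
    static long linearOps(int n)    { return n; }            // e.g. one pass

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000}) {
            long ratio = quadraticOps(n) / linearOps(n);
            System.out.println("n=" + n + " speedup=" + ratio + "x");
        }
        // n=100 already gives a 100x gap; n=1000 gives 1000x.
    }
}
```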


This is cool! Maybe a different perspective:

-- Numbers are a red herring to me: HPC libs plugged onto the columnar data should destroy LLVM-generated code, including SIMD (not sure if that's being done yet). HPC libs should cover the typical algorithm shapes anyway (e.g., everything in the article), and GPU code should generally beat whatever is here. For a fun time, look into SIMD polyhedral vs. skeleton & manual GPU codes.

The numbers we really care about here are expression compile times: will the system respond quickly to an analyst or to an adaptive system generating code? Is the overhead < 10ms, < 100ms, < 1s, < 5s, ...?

-- All about the UDFs: So why do we care? A big problem w/ the algorithm template libs is they stink at accepting friendly user-defined expressions (UDFs). Fast systems are increasingly adopting LLVM to solve this. Speaking of SIMD.. MapD does exactly this: columnar SQL expressions -- LLVM --> CUDA. Our team has been thinking about how to do stuff like js/python/pandas UDFs -> CUDA, and pragmatically, most roads came down to this (vs say Numba's internals). So Gandiva providing a modular bit here is quite cool, esp. for the growing Arrow community!

-- For framework devs: Likewise, there are some cool choices here like code caching. Users shouldn't directly interact with this, but framework devs will. We've eaten cycles here on similar projects, and most Gandiva consumers would have to as well if it weren't provided. So just as Apache Calcite is easing development of a lot of similar systems, it's cool to see the team not just building the modular system, but, seemingly, building it well!


In general we are talking about O(n) algorithms, and the gains are due to better CPU cache utilization and fewer instructions per value, which LLVM helps with.
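A minimal sketch of why the columnar layout yields those gains (all names are illustrative, not Arrow's API): a tight loop over a contiguous primitive array gives sequential, cache-friendly access and a trivially vectorizable body, whereas a row-wise layout chases a pointer per value.

```java
// Toy contrast between row-wise and columnar summation.
public class ColumnarSum {
    // Row-wise: each value sits inside an object, so the loop
    // dereferences a pointer per row and the data is scattered
    // across the heap (poor cache locality, more work per value).
    static final class Row { final long value; Row(long v) { value = v; } }

    static long sumRows(Row[] rows) {
        long total = 0;
        for (Row r : rows) total += r.value;
        return total;
    }

    // Columnar: one contiguous array, sequential access, and a loop
    // body that a JIT (or LLVM, in Gandiva's case) can SIMD-vectorize.
    static long sumColumn(long[] column) {
        long total = 0;
        for (long v : column) total += v;
        return total;
    }

    public static void main(String[] args) {
        int n = 4;
        Row[] rows = new Row[n];
        long[] column = new long[n];
        for (int i = 0; i < n; i++) { rows[i] = new Row(i); column[i] = i; }
        System.out.println(sumRows(rows) + " == " + sumColumn(column)); // both 6
    }
}
```

Both loops are O(n); the columnar one simply executes fewer instructions per value and misses cache far less often, which is exactly where the reported speedups come from.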


We changed the title from "Apache Arrow LLVM Compilation for 10-100 Performance Improvement".



