Chronon, Airbnb's ML feature platform, is now open source (medium.com/airbnb-engineering)
224 points by vquemener on April 9, 2024 | 112 comments



It's refreshing to read something about ML and inference and have it not be anything related to a transformer architecture sending up fruit growing from a huge heap of rotten, unknown, mostly irrelevant data. With traditional ML, it's useful to talk about the sources of bias and error, and even measure some of them. You can do things that improve them without starting over on everything else.

With LLMs, it's more like you buy a large pancake machine that you dump all of your compost into (and you suspect the installers might have hooked it up to your sewage line as an input too). It triples your electricity bill, it makes bizarre screeching noises as it runs, you haven't seen your cat in a week, but at the end out come some damn fine pancakes.

I apologize. I'm talking about the thing that I was saying was a relief to be not talking about.


I agree with you - about the sentiment around the GenAI megaphone.

FWIW, Chronon does serve context within prompts to personalize LLM responses. It is also used to time-travel new prompts for evaluation.


> time-travel new prompts for evaluation

What does this mean?


Imagine you are building a customer support bot for a food delivery app.

The user might say - I need a refund. The bot needs to know contextual information - order details, delivery tracking details etc.

Now you have written a prompt template that needs to be rendered with contextual information. This rendered prompt is what the model will use to decide whether to issue a refund or not.

Before you deploy this prompt to prod, you want to evaluate its performance - instances where it correctly decided to issue or decline a refund.

To evaluate, you can “replay” historical refund requests. The issue is that the information in the context changes with time. You want to instead simulate the value of the context at a historical point in time - or time-travel.
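
As an illustrative sketch (the three callables here are hypothetical stand-ins, not any particular API): fetch the context as of each historical request's timestamp, render the prompt, and compare the model's decision to what actually happened.

    # Hypothetical sketch of a time-travel evaluation loop; the callables
    # (feature fetcher, prompt renderer, LLM client) are stand-ins, not a real API.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Callable

    @dataclass
    class HistoricalRequest:
        order_id: str
        ts: datetime          # when the refund request originally happened
        actual_outcome: str   # "refund" or "decline", as decided back then

    def evaluate_prompt(template: str,
                        requests: list[HistoricalRequest],
                        fetch_context_as_of: Callable,   # (order_id, as_of) -> dict
                        render_prompt: Callable,         # (template, context) -> str
                        call_llm: Callable) -> float:    # (prompt) -> "refund" | "decline"
        correct = 0
        for req in requests:
            # Key step: context is reconstructed as of req.ts, not as of today.
            context = fetch_context_as_of(req.order_id, req.ts)
            decision = call_llm(render_prompt(template, context))
            correct += int(decision == req.actual_outcome)
        return correct / len(requests)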


Are you using function calling for the context info?


Time-travel evals, nice.


In what world is it appropriate or even legal to decide on refunds via LLM?

Can you give an example that's not ripe for abuse? This really doesn't sell LLMs as anything useful except insulation from the consequences of bad decisions.


Don't think of the LLM as completely replacing the support agent here; rather as augmenting them. A lot of customer service is setting/finding context: customer name, account, order, item, etc. If an LLM chatbot can do all of that, then hand off to a human support agent, there are real cost savings to be had without reducing the quality of service.


I'd love for others to think that way. I am a very vocal (in my own bubble) advocate for human-in-the-loop ML.


Have you requested a refund off Amazon lately? They have an automated system where, iirc, a wizard will ask you a few questions and then process it, presumably inspecting your customer history and so on. If the system thinks your request looks genuine and it's within whatever parameters they've set, it'll accept instantly, refund you, sometimes without even asking you to send the item back. If it's less sure, it will pass the request on to a human agent to be dealt with like it would have been in the Before Times.

I can see no reason why it would be illegal or inappropriate to use an LLM as part of the initial flow there. In fact I see no reason why it would be illegal for Amazon to simply flip a coin to decide whether to immediately accept your refund. (Appropriateness is another matter!)

I guess you're assuming the LLM would be the only point of contact with no recourse if it rejects you? Which strikes me as very pessimistic, unless you live in a very poorly regulated country.


"Imagine" is the operative word :-)


What do you think of the approach in DSPy[0]? It seems to give a more traditional ML feel to LLM optimization.

[0] https://dspy.ai/


What is the difference between a ML feature store and a low-latency OLAP DB platform/data warehouse? I see many similarities between both, like the possibility of performing aggregation of large data sets in a very short time.


There is none. The industry is being flooded with DS and "AI" majors (and other generally non-technical people) that have zero historical context on storage and database systems - and so everything needs to be reinvented (but in Python this time) and rebranded. At the end of the day you're simply looking at different mixtures of relational databases, key-value stores, graph databases, caches, time-series databases, column stores, etc. The same stuff we've had for 50+ years.


Two main differences: the ability to time-travel for training data generation, and the ability to push compute to the write side of the view rather than the read side for low-latency feature serving.


> "ability to time travel for training"

Nah, this is nothing new.

We've solved this for ages with "snapshots" or "archives", or fancy indexing strategies, or just a freaking "timestamp" column in your tables.


There's a lot more to it than snapshots or timestamped columns when it comes to ML training data generation. We often have windowed aggregations that need to be computed as of precise intra-day timestamps in order to achieve parity between training data (backfilled in batch) and the data that is being served online in realtime (with streaming aggregations computed in realtime).

Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge.

And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possible for efficiency.

All of this is well beyond the scope that is addressed by standard OLAP data solutions.

Not to mention the fact that the offline computation needs to translate seamlessly to power online serving (i.e. seeding feature values, and combining with streaming realtime aggregations), and the need for online/offline consistency measurement.

That's why a lot of teams don't even bother with this, and basically just log their feature values from online to offline. But this limits what kind of data they can use, and also how quickly they can iterate on new features (need to wait for enough log data to accumulate before you can train).
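
To make "accurate as-of each row's timestamp" concrete, here is a deliberately naive reference sketch of the semantics (illustration only, nothing like how this is computed at scale):

    # Naive reference semantics for point-in-time windowed aggregation.
    from datetime import datetime, timedelta

    def sum_as_of(events, key, as_of, window_days, value="value"):
        """Sum `value` over events for `key` in the window (as_of - window, as_of]."""
        start = as_of - timedelta(days=window_days)
        return sum(e[value] for e in events
                   if e["key"] == key and start < e["ts"] <= as_of)

    events = [
        {"key": "page_1", "ts": datetime(2024, 4, 1, 9, 30), "value": 3},
        {"key": "page_1", "ts": datetime(2024, 4, 5, 14, 0), "value": 2},
    ]

    # Two training rows with the same key but different intra-day timestamps
    # get different feature values:
    print(sum_as_of(events, "page_1", datetime(2024, 4, 5, 13, 0), 7))  # 3
    print(sum_as_of(events, "page_1", datetime(2024, 4, 5, 15, 0), 7))  # 5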


> Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge.

As long as your OLAP table/projection/materialized view is sorted/clustered by that timestamp, it will be able to efficiently pick only the data in that interval for your query, regardless of the precision you need.

> And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possible for efficiency.

> All of this is well beyond the scope that is addressed by standard OLAP data solutions.

I think the StarRocks open-source OLAP DB supports this as a query rewrite mechanism that optimizes performance by using data from materialized views. It can build UNION queries to handle date ranges [1]

[1] https://docs.starrocks.io/docs/using_starrocks/query_rewrite...


I’m still not seeing how this is a novel problem. You just apply a filter to your timestamp column and re-run the window function. It will give you the same value down to the resolution of the timestamp every time.


Let's try an example: `average page views in the last 1, 7, 30, 60, 180 days`

You need these values accurate as of ~500k timestamps for 10k different page ids, with significant skew for some page ids.

So you have a "left" table with 500k rows, each with a page id and timestamp. Then you have a `page_views` table with many millions/billions/whatever rows that need to be aggregated.

Sure, you could backfill this with SQL and fancy window functions. But let's just look at what you would need to do to actually make this work, assuming you want it served online with realtime updates (from a page_views kafka topic that is the source of the page_views table):

For online serving:
1. Decompose the batch computation to SUM and COUNT and seed the values in your KV store.
2. Write the streaming job that does realtime updates to your SUMs/COUNTs.
3. Have an API for fetching and finalizing the AVERAGE value.

For Backfilling:
1. Write your verbose query with windowed aggregations (I encourage you to actually try it).
2. Often you also want a daily front-fill job for scheduled retraining. Now you're also thinking about how to reuse previous values. Maybe you reuse your decomposed SUMs/COUNTs above, but if so you're now orchestrating these pipelines.

For making sure you didn't mess it up:
1. Compare logs of fetched features to backfilled values to make sure that they're temporally consistent.

For sharing:
1. Let's say other ML practitioners are also playing around with this feature, but with different timelines (i.e. different timestamps). Are they redoing all of the computation? Or are you orchestrating caching and reusing partial windows?

So you can do all that, or you can write a few lines of python in Chronon.
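
For a rough sense of what "a few lines of Python" means here, a sketch adapted from Chronon's public quickstart (import paths and argument names may differ slightly across versions, and the table/topic/column names below are made up for this page-views example):

    # Sketch of a Chronon GroupBy for the page-views example above.
    from ai.chronon.api.ttypes import Source, EventSource
    from ai.chronon.query import Query, select
    from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

    page_views = Source(events=EventSource(
        table="data.page_views",        # hypothetical warehouse table of page-view events
        topic="page_views_topic",       # hypothetical Kafka topic for realtime updates
        query=Query(selects=select("page_id", "views"), time_column="ts"),
    ))

    windows = [Window(length=d, timeUnit=TimeUnit.DAYS) for d in [1, 7, 30, 60, 180]]

    page_view_stats_v1 = GroupBy(
        sources=[page_views],
        keys=["page_id"],
        aggregations=[
            Aggregation(input_column="views", operation=Operation.AVERAGE, windows=windows),
        ],
        online=True,  # also spin up the serving (streaming + KV) side, not just backfills
    )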

Now let's say you want to add a window. Or say you want to change it so it's aggregated by `user_id` rather than `page_id`. Or say you want to add other aggregations other than AVERAGE. You can redo all of that again, or change a few lines of Python.


I admit this is a bit outside my wheelhouse so I’m probably still missing something.

Isn’t this just a table with 5bn rows of timestamp, page_type, page_views_t1d, page_views_t7d, page_views_t30d, page_views_t60d, and page_views_t180d? You can even compute this incrementally or in parallel by timestamp and/or page_type.

What’s the magic Chronon is doing?


For offline computation, the table with 5bn rows is okay. But for online serving, it would be really challenging to serve the features within a few milliseconds.

But even for offline computation, the same computation logic gets duplicated in lots of places - we have observed ML practitioners copying SQL queries all over. In the end, that makes debugging, feature interpretability, and lineage impossible.

Chronon abstracts all of that away so that ML practitioners can focus on the core problems they are dealing with, rather than spending time on MLOps.

For an extreme use case, one user defined 1000 features with 250 lines of code, which is definitely impossible with SQL queries, not to mention the extra work to serve those features.


How does Chronon do this faster than the precomputed table? And in a single docker container? Is it doing logically similar operations but just automating the creation and orchestration of the aggregation tasks? How does it work?


We utilize a lambda architecture, which incorporates precomputed tables as well. Those tables store intermediate representations of the final results and can provide snapshot (daily-accuracy) features. However, when it comes to real-time features that require point-in-time correctness, precomputed tables alone present challenges.

For the offline computations, we reuse those intermediate results to avoid calculating everything from scratch again, so the engine can actually scale sub-linearly.


Thanks. How does Chronon serve the real-time features without precomputed tables?


This is a good post. You had me until this part:

    > So you can do all that, or you can write a few lines of python in Chronon.
It all seems a bit hand-wavy here. Will Chronon work as well as the SQL version and be correct? I vote for an LLM tool to help you write those queries. Or is that effectively what Chronon is doing?


For correctness, yes, it works as well as the SQL version. And the aggregations are easily extensible with other operations. For example, we have a last operation, which is not even available in standard SQL.


I’ll stop short of calling comparisons to standard SQL disingenuous but it’s definitely unrealistic because no standard SQL implementation exists.

What does this “last” operation do? There’s definitely a LAST_VALUE() window function in the databases I use. It is available in Postgres, Redshift, EMR, Oracle, MySQL, MSSQL, Bigquery, and certainly others I am not aware of.


That's fair.

Actually, last is usually called last_k(n), so that you can specify the number of values in the result array. For example, if the input column is page_view_id and n = 300, it will return the last 300 page_view_ids as an array. If a window is used, for example 7d, it will truncate the results to the past 7d. LAST_VALUE() seems to return just the last value from an ordered set. Hope that helps. Thanks for your interest.
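
A plain-Python illustration of those semantics (just to pin down the behavior, not Chronon's implementation):

    # Illustrative semantics of last_k(n) with an optional trailing window.
    from datetime import datetime, timedelta

    def last_k(events, n, as_of, window=None, value="page_view_id"):
        """Return the values of the most recent n events at or before as_of,
        optionally restricted to the trailing window (as_of - window, as_of]."""
        eligible = [e for e in events
                    if e["ts"] <= as_of and (window is None or e["ts"] > as_of - window)]
        eligible.sort(key=lambda e: e["ts"])
        return [e[value] for e in eligible[-n:]]

    views = [{"page_view_id": f"v{i}", "ts": datetime(2024, 4, 1) + timedelta(days=i)}
             for i in range(10)]
    print(last_k(views, n=3, as_of=datetime(2024, 4, 9)))                            # ['v6', 'v7', 'v8']
    print(last_k(views, n=3, as_of=datetime(2024, 4, 9), window=timedelta(days=2)))  # ['v7', 'v8']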


In SQL we do that with a RANK window function then apply a filter to that rank. It can also be done with a correlated subquery.


What's with the dismissiveness? The author is a senior staff engineer at a huge company & has worked in this space for years. I'd suspect they've done their diligence...


Snapshots can't travel back with millisecond precision, or even minute-level precision. They are just full dumps at regular, fixed intervals in time.


https://en.wikipedia.org/wiki/Sixth_normal_form Basically we've had time travel (via triggers, built-in temporal tables, or just writing the data) for a long time; it's just expensive to have it all for an OLTP database.

We've also had slowly changing dimensions to solve this type of problem for a decent amount of time for the labels that sit on top of everything, though really these are just fact tables with a similar historical approach.


6NF works well for some temporal data, but I haven't seen it work well for windowed aggregations because the start/end time format of saving values doesn't handle events "falling out of the window" too well. At least the examples I've seen have values change due to explicit mutation events.


Agree, you don't really want to pre-aggregate your temporal data, or it will effectively only aggregate at each row-time boundary and the value is lower than just keeping the individual calculations.


Databases have had many forms of time travel for 30+ years now.


Not at the latency needed for feature serving and most databases struggle with column limits.

But please enlighten us on which databases to use so Airbnb (and the rest of us) can stop wasting time.


Shameless plug, but XTDB v2 is being built for low-latency bitemporal queries over columnar storage and might be applicable: https://docs.xtdb.com/quickstart/query-the-past.html

We've not been developing v2 with ML feature serving in mind so far, but I would love to speak with anyone interested in this use case and figure out where the gaps are.


Snapshots don’t have to be at regular intervals and can be at whatever resolution you choose. You could snapshot as the first step of training then keep that snapshot for the life of the resulting model. Or you could use some other time travel methodology. Snapshots are only one of many options.


These are reconstructions of features/columns that don't exist yet.


I don’t understand what this means. How can something be reconstructed without first existing? Is this not just a caching exercise?


Have you guys considered Rockset? What you mentioned are some classic real-time aggregation use cases and Rockset seems to support that well: https://docs.rockset.com/documentation/docs/ingestion-rollup...


> ability to time travel for training data generation

What now?


Pardon the jargon. But it is a necessary addition to the vocabulary.

To evaluate if a feature is valuable, you could attach the value of the feature to past inferences and retrain a new model to check for improvement in performance.

But this “attach”-ing needs the feature value to be as of the time of the past inference.


That’s not a new concept.


True. But it is not necessary to reinvent the wheel for engineers. :)


That’s the point of this subthread though. What’s the new thing Chronon is doing? It can’t just be point in time features because that’s already a thing.


You need the columnar store for both training data and batch inference data. If you have a batch ML system that works with time-series data, the feature store will help you create point-in-time-correct training data snapshots from the mutable feature data (no future data leakage), as well as batch inference data.

For real-time ML systems, it gives you row-oriented retrieval latencies for features.

Most importantly, it helps modularize your ML system into feature pipelines, training pipelines, and inference pipelines. No monolithic ML pipelines.


Feature stores are more for fast read and moderate write/update for ML training and inference flows. Good organization and fast query of relatively clean data.

Data warehouse is more for relatively unstructured or blobby data with moderate read access and capacity for massive files.

OLAP is mostly for feeding streaming and event-driven flows, including but not limited to ML.


The ability to generate training sets against historical inferences to back-test new features.

Another one is the focus on pushing as much compute to the write side as possible (within Chronon) - especially joins and aggregations.

OLAP databases and even graph databases don't scale well to high read traffic. Even when they do, the latencies are very high.


You may want to take a look at Starrocks [1]. It is an open-source DB [2] that competes with Clickhouse [3] and claims to scale well – even with joins – to handle use cases like real-time and user-facing analytics, where most queries should run in a fraction of a second.

[1] https://www.starrocks.io/ [2] https://github.com/StarRocks/starrocks [3] https://www.starrocks.io/blog/starrocks-vs-clickhouse-the-qu...


We did and gave up due to scalability limitations.

Fundamentally most of the computation needs to happen before the read request is sent.


Hey! I work on the ML Feature Infra at Netflix, operating a similar system to Chronon but with some crucial differences. What other alternatives aside from Starrocks did you evaluate as potential replacements prior to building Chronon? Curious if you got to try Tecton or Materialize.com.


We haven't tried Materialize - IIUC Materialize is pure kappa. Since we need to correct upstream data errors and forget selective data (GDPR) automatically, we need a lambda system.

Tecton, we evaluated, but decided that the time-travel strategy wasn’t scalable for our needs at the time.

A philosophical difference with tecton is that, we believe the compute primitives (aggregation and enrichment) need to be composable. We don’t have a FeatureSet or a TrainingSet for that reason - we instead have GroupBy and Join.

This enables chaining or composition to handle normalization (think 3NF) / star-schema in the warehouse.

A side benefit is that non-ML use-cases are able to leverage functionality within Chronon.


FeatureSets are mutable data and TrainingSets are consistent snapshots of feature data (from FeatureSets). I fail to see what that has to do with composability. Join is still available for FeatureSets to enable composable feature views - join is reuse of feature data. GroupBy is just an aggregation in a feature pipeline, not sure of your point here. You can still do star schema (and even snowflake schema if you have the right abstractions).


Normalization is a model-dependent transformation and happens after the feature store - needs to be consistent between training and inference pipelines.


Normalization is overloaded. I was referring to schema normalization (3NF etc) not feature normalization - like standard scaling etc.


Ok, but star schema is denormalized. Snowflake is normalized.


To be pedantic, even in star schema - the dim tables are denormalized, fact tables are not.

I agree that my statement would be much better if I had used snowflake schema instead.


What is the meaning of pure kappa?



Thank you for sharing!


That evaluation would be an amazing addendum or engineering blog post! I know it’s not as sexy as announcing a product, but from an engineering perspective the process matters as much as the outcome :)


Please can you expand? What limitations, computations?


Let's say you want to compute the avg transaction value of a user over the last 90 days. You could pull individual transactions and average them at request time - or you could pre-compute partial aggregates and re-aggregate on read.

OLAP systems are fundamentally designed to scale the read path - the former approach. Feature serving needs the latter.
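
A minimal sketch of the write-side approach, with a plain dict standing in for the KV store (in practice the partials would also be bucketed by time so transactions can age out of the 90 days):

    # Sketch: pre-aggregate on the write path, finalize on read.
    from collections import defaultdict

    partials = defaultdict(lambda: {"sum": 0.0, "count": 0})  # user_id -> partial aggregate

    def on_transaction(user_id, amount):
        """Write path: update the partial aggregate as each transaction arrives."""
        p = partials[user_id]
        p["sum"] += amount
        p["count"] += 1

    def avg_transaction_value(user_id):
        """Read path: O(1) finalization, no scan over raw transactions."""
        p = partials[user_id]
        return p["sum"] / p["count"] if p["count"] else None

    on_transaction("u1", 20.0)
    on_transaction("u1", 30.0)
    print(avg_transaction_value("u1"))  # 25.0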


Does Chronon automatically determine what intermediate calculations should be cached? Does it accept hints?


We don't accept hints yet - but we determine what to cache.


First of all, congrats on the release! Well done. A few questions:

- Since the platform is designed to scale, it would be nice to see scalability benchmarks

- Is the platform compatible with human-in-the-loop workflows? In my experience, those workflows tend to require vastly different needs than fully automated workflows (e.g. online advertising)


re: scalability benchmarks - we plan to publish more benchmark information against publicly available datasets in the near future.

re: human-in-the-loop workflows - do you mean labeling?


Author. Happy to answer any questions.


How do you/Airbnb handle deeply linked features (2-hop+?) that are also latency-sensitive? Maybe I'm missing something, but I can't imagine doing that with the transformation DSL described in Chronon.

For our org, those are by far the most complicated to handle. Graph DBs are kind of scaling poorly, while storing state in stream processing jobs is way too large/expensive. Those would also be built on top of API sources, which then leads us to the unfortunate "log & wait" approach for our most important features.


We call this chaining.

In the API itself - you could specify the chain links by specifying the source.

To be precise, a GroupBy (aggregation primitive) can have a Join (enrichment primitive) as a source. To rephrase: you can enrich first, then aggregate, and continue this chain indefinitely.
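
As a purely conceptual illustration of what a 2-hop, enrich-then-aggregate chain computes (plain Python over made-up listing/booking data, not the Chronon API):

    # Conceptual 2-hop feature: booking value aggregated by host, via the listing.
    from collections import defaultdict

    bookings = [{"listing_id": "l1", "price": 100}, {"listing_id": "l2", "price": 80}]
    listing_to_host = {"l1": "h1", "l2": "h1"}

    # Hop 1 (enrich, the "Join"): attach each booking's host_id via its listing.
    enriched = [{**b, "host_id": listing_to_host[b["listing_id"]]} for b in bookings]

    # Hop 2 (aggregate, the "GroupBy"): total booking value per host.
    revenue_by_host = defaultdict(float)
    for e in enriched:
        revenue_by_host[e["host_id"]] += e["price"]

    print(dict(revenue_by_host))  # {'h1': 180.0}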

> Graph DBs are kind of scaling poorly

That makes sense. Since you are scaling these on the read side, it is much, much harder than pre-computing on the write side. (That is what Chronon allows you to do.)


At what size of team, features or number of models would you say the break even point is for investing time into using this platform?


Offline is pretty easy to get started with. It should take less than a week to set it up for new use-cases across the company. (You can begin building training sets once offline is set up.)

Online is a bit more involved - you need a month or more to test that your KV store scales against the read and write traffic coming from Chronon.


How does this relate to Zipline and Bighead? Does it replace those projects or is it a continuation of them?


Bighead is the model training and inference platform.

Chronon is a full rewrite of Zipline, with 1) a different underlying algorithm for time-travel to address scalability concerns, and 2) a different serde and fetching strategy to address latency concerns.


I'd imagine a continuation... he is also the author of Zipline


I noticed airflow as the backing orchestration service. Was there any consideration for another orchestration tool? I know Airbnb has at least two internally, but also that airflow is the predominant one for the data org still.


Airflow is the current implementation since it is the paved path at Airbnb. But we are open to accepting contributions for other orchestrators.

Someone mentioned they wanted to add cadence support.


I'm also curious how you went from a non-platformized approach to adopting this platform; what were the important insights for strategizing, prioritizing, and motivating teams to lift existing pipelines onto the new thing? Open-ended question.


There were two main drivers -

- inability to back-test new real-time features. People were forced to log-and-wait to create training sets for months. Chronon reduces this to hours or days.

- the difficulty of creating the lambda system (batch pipeline, streaming pipeline, index, serving endpoint) for every feature group. In Chronon, you simply set a flag on your feature definition to spin up the lambda system.


How does Chronon handle mutable data when backfilling? Or does it make some assumptions on the underlying data?


By mutable data do you mean change data coming from OLTP databases? If yes, we do this via the EntitySource API.

https://www.chronon.ai/authoring_features/Source.html#stream...
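
For the curious, an entity-based source over OLTP change data looks roughly like the sketch below. Treat the field names and import paths as approximate and defer to the linked Source docs; the table/topic names are made up for a returns-style example.

    # Rough sketch of an EntitySource over captured OLTP change data (names approximate).
    from ai.chronon.api.ttypes import Source, EntitySource
    from ai.chronon.query import Query, select

    returns = Source(entities=EntitySource(
        snapshotTable="data.returns_snapshots",   # daily snapshots of the OLTP table
        mutationTable="data.returns_mutations",   # captured change data (CDC) for backfills
        mutationTopic="returns_mutations_topic",  # realtime CDC stream for online updates
        query=Query(selects=select("status"), time_column="ts"),
    ))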


> By mutable data do you mean - change data coming from OLTP databases?

Yes, exactly! I see there is some kind of support, but is it possible to use the OLTP database as an event source?

For example, say I had a table in my OLTP database, `data.returns`, that had columns `ts`, the event time, and `status`, which can be PENDING or COMPLETED. I'd like to generate point-in-time correct training data, where the feature is the count of completed returns. It seems like all the necessary information to calculate this is there


Looks very useful. I'm not aware of any open source alternative (although I could just be ignorant here!)


This is the biggest one: https://feast.dev/


This isn't really a drop-in replacement; they don't offer transforms out of the box.

Admittedly some of the transforms proposed in this article are a little simple & don't represent the full space of feature eng requirements for all large orgs


Actually, Feast does support transformations, depending on the source. It supports transforming data on demand and via streaming. It does not support batch transformations, only because technically that should just be an upload, but we can revisit that decision.


I think feast is sunsetted



Feathr from LinkedIn is the closest. But there doesn't seem to be much recent activity on the project.


Hopsworks


Great work! When it comes to batch computations, why not leverage intermediate state, much like streaming jobs do? For example, if we need to calculate a past-30-day sum for a value daily, it seems like this would be computed from scratch each day. Would it not make sense to model this as a sliding window that's updated daily?
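
For illustration, the kind of incremental update I have in mind (plain Python over per-day partial sums, not tied to Chronon internals):

    # Sketch: update a 30-day sum incrementally from per-day partial sums,
    # instead of rescanning 30 days of raw data each day.
    from collections import deque

    def make_sliding_sum(window_days):
        days = deque()   # per-day partial sums, oldest first
        total = 0.0

        def add_day(day_sum):
            nonlocal total
            days.append(day_sum)
            total += day_sum
            if len(days) > window_days:
                total -= days.popleft()   # drop the day that fell out of the window
            return total                  # current windowed sum

        return add_day

    add_day = make_sliding_sum(window_days=30)
    for day_sum in [10.0, 12.0, 7.0]:
        current = add_day(day_sum)
    print(current)  # 29.0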


We do this for training data generation already.

We have plans to implement this behavior for computing the batch arm of feature serving.


Can you point to documentation/examples which talk about it? I could not find this while exploring the pages (or maybe I missed it altogether).


What does Airbnb use ML for?


Almost every button click is either powered by a model or guarded by a model.


Paywalled for me


It opens for me in incognito mode - albeit with a large popup that I had to close.


The downside is after you use the platform for a week, you have to delete all the expired models yourself and clean up all the labels or face a hefty housekeeping surcharge.


Why do major sites still use Medium as a blog platform?


for others who also hate medium: https://scribe.rip/airbnb-engineering/chronon-airbnbs-ml-fea...

and probably the only link you care about: https://github.com/airbnb/chronon#readme (Apache 2)


It's tough to prioritize migrating to a new platform for the engineering blog, without a very good ROI. Airbnb's eng blog was set up on Medium a while ago, it's doing fine, they have no real reason to spend a lot of resources on switching.


Ugh yes. The first thing I see on clicking the link is an overwhelming login/join pop-over. I’m never visiting that blog again…


Wild that it's 2024 and you still don't have UBlock Origin.


Maybe they opened the link on Safari on iOS like me?


Disabling JavaScript helps with that (sometimes they don't show the full article if JS is disabled though).


Substack is the same


Don’t let having to tap “x” a single time ruin your day. You’re missing out on a lot of good stuff.


I am with you on this one.


Reach (sadly)


Free income?



