Have you thought about how you would handle much larger datasets? Or is the idea that since this is a spreadsheet, the 10M cell limit is plenty sufficient?
I find WASM really interesting, but I can't wrap my head around how this scales in the enterprise. But I figure it probably just comes down to the use cases and personas you're targeting.
I am also very deeply invested in this question. It seems like the go-to path for huge datasets is text-to-SQL (ClickHouse, Snowflake, etc.). But all these juicy Python data science libraries require code execution on the much smaller data payloads that come back from the SQL results. Feel free to reach out, what you are trying to achieve seems very similar to what I am trying to do in a completely different industry/use case.
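For what it's worth, the pattern I keep landing on is: push the heavy aggregation to the warehouse and only hand the small result set to Python. A minimal sketch, assuming the clickhouse-connect client and made-up table/column names:

    # Minimal sketch: the warehouse does the heavy lifting, pandas only sees
    # the small aggregated result. Assumes `clickhouse-connect` and `pandas`
    # are installed; table and column names are illustrative.
    import clickhouse_connect
    import pandas as pd

    client = clickhouse_connect.get_client(host="localhost")

    # Billions of rows stay in ClickHouse; only a daily rollup comes back.
    df: pd.DataFrame = client.query_df(
        """
        SELECT toDate(event_time) AS day, count() AS events
        FROM events
        GROUP BY day
        ORDER BY day
        """
    )

    # Now the payload is small enough for the usual in-memory data science.
    print(df.describe())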
Fabi.ai | https://www.fabi.ai/ | Senior front end engineer | Full-time | Hybrid SF or Remote (US)
We're looking for a senior front end engineer to join our mighty and growing team.
We're transforming the way data analysis is done in the enterprise and already have some amazing customers and are growing rapidly.
This person should have extensive React and TypeScript experience and be able to operate with minimal design supervision (we're a small team and we expect this person to have a sharp eye).
This feels like a good opportunity for a startup. I've seen a lot of startups crop up around Snowflake cost management, I wonder what's in the AWS space.
> A common pattern I’ve seen over the years have been folks in engineering leadership positions that are not super comfortable with extracting and interpreting data from stores
I think this extends beyond just engineering, and I wish more data teams made the raw data (or at least some clean subset) more readily available for folks across the organization to explore. I've been part of orgs where I had access to read-only replicas, and I quickly got comfortable querying and analyzing data on my own, and I've been part of other orgs where everything had to go through the data team and I had to be spoon-fed all the data and insights.
Totally agree. In my last job I was able to create my own ETL jobs as a PM to get data for my own analyses, and I figured out that a fairly minor configuration change could save us $10M per year. It came out of one of many random ETL jobs I created out of curiosity, and if I had been forced to rely on other people, I may never have created it.
If you’d just had a business controller, you’d have x*$10M saved and have more time for your PM-role.
Yes, calling BS on leadership running their own SQL. Bring strategy and tactics, find good people, create clear roles and expectations, and for sure don't get lost in running naive scripts you've written because you think you can do every role better than the people actually occupying those roles.
I know nothing about working in small firms, so that is probably very true. The smaller the firm, the more you do yourself. But ... if a company can save $10M ... it can afford a set of financials.
This is actually one of the more interesting LLM observability platforms I've seen. Beyond addressing scaling issues, where do you see yourself going next?
Positioning/roadmap differs between the different projects in the space.
We summarized what we strongly believe in here: https://langfuse.com/why
TL;DR: open APIs, self-hostable, LLM/cloud/model/framework-agnostic, API-first, unopinionated building blocks for sophisticated teams, simple yet scalable instrumentation that is incrementally adoptable (rough sketch below).
We work closely with the community, and the roadmap can change frequently based on feedback. GitHub Discussions is very active, so feel free to join the conversation if you want to suggest or contribute a feature: https://langfuse.com/ideas
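To make "incrementally adoptable" a bit more concrete, here is a minimal sketch using the Python decorator SDK (v2-style API; the exact import may differ between SDK versions, so please check the docs):

    # Minimal sketch of incremental instrumentation with the Langfuse Python
    # decorator SDK (v2-style API; check the docs for your SDK version).
    # Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY /
    # LANGFUSE_HOST environment variables.
    from langfuse.decorators import observe


    @observe()  # nested decorated calls show up as child observations
    def retrieve(query: str) -> list[str]:
        return ["doc about " + query]


    @observe()  # one decorator on the entry point creates a trace
    def answer(query: str) -> str:
        docs = retrieve(query)
        return f"Answer based on {len(docs)} docs"


    if __name__ == "__main__":
        print(answer("pricing"))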
Thanks for sharing your blog post. We had a similar journey: I installed and tried both Langfuse and Phoenix and ended up choosing Langfuse due to some versioning conflicts in the Python dependency. I'm curious whether your thoughts change after v3? I also liked that it only depended on Postgres, but the scalable version requires other dependencies.
The thing I liked about Phoenix is that it uses OpenTelemetry. In the end we’re building our Agents SDK in a way that the observability platform can be swapped (https://github.com/zetaalphavector/platform/tree/master/agen...) and the abstraction is OpenTelemetry-inspired.
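Roughly what that abstraction looks like in practice, using the vanilla OpenTelemetry Python SDK (ConsoleSpanExporter just to keep the sketch self-contained; in reality you'd point the exporter at whichever backend you pick):

    # Sketch of the OpenTelemetry-style abstraction: agent code only talks to
    # the OTel API, and the exporter decides where spans actually go, so the
    # observability backend stays swappable.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("agents-sdk")

    with tracer.start_as_current_span("llm-call") as span:
        span.set_attribute("llm.model", "gpt-4o-mini")
        span.set_attribute("llm.prompt_tokens", 123)
        # ... the actual model call would go here ...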
As you mentioned, this was a significant trade-off. We faced two choices:
(1) Stick with a single Docker container and Postgres. This option is simple to self-host, operate, and iterate on, but it suffers from poor performance at scale, especially for the analytical queries that become crucial as a project grows. Additionally, as more features emerged, we needed a queue and benefited from caching and asynchronous processing, which required splitting out a second container and adding Redis (rough sketch of that split below). These features would have been blocked if we had stuck with this setup.
(2) Switch to a scalable setup with a robust infrastructure that enables us to develop features that interest the majority of our community. We have chosen this path and prioritized templates and Helm charts to simplify self-hosting. Please let us know if you have any questions or feedback as we transition to v3. We aim to make this process as easy as possible.
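Purely as an illustration of the queue/worker split mentioned in (1), not our actual implementation: the web container enqueues ingestion events and returns immediately, while a separate worker drains the queue asynchronously.

    # Illustrative queue/worker split (not Langfuse's actual code): the web
    # container pushes ingestion events to Redis and returns immediately; a
    # separate worker container processes them asynchronously.
    import json

    import redis

    r = redis.Redis(host="localhost", port=6379)


    # --- web container: accept the event and return fast ---
    def enqueue_event(event: dict) -> None:
        r.rpush("ingestion-queue", json.dumps(event))


    # --- worker container: drain the queue and do the expensive work ---
    def process(event: dict) -> None:
        print("processed", event.get("id"))  # e.g. enrich, aggregate, persist


    def worker_loop() -> None:
        while True:
            _key, raw = r.blpop("ingestion-queue")
            process(json.loads(raw))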
Regarding OTel, we are considering adding a collector to Langfuse as the OTel semantic conventions are developing well. The needs of the Langfuse community are evolving rapidly, and starting with our own instrumentation allowed us to move quickly while the semantic conventions were not yet settled. We are tracking this here and would greatly appreciate your feedback, upvotes, or any comments on this thread: https://github.com/orgs/langfuse/discussions/2509
So we are still on v2.7 - it works pretty well for us. Haven't tried v3 yet, and not looking to upgrade. I think the next big feature set we are looking for is a prompt evaluation system.
But we are coming around to the view that it is a big enough problem to warrant a dedicated SaaS, rather than piggybacking on an observability SaaS. At NonBioS, we have very complex requirements - so we might just end up building it from the ground up.
"Langsmith appeared popular, but we had encountered challenges with Langchain from the same company, finding it overly complex for previous NonBioS tooling. We rewrote our systems to remove dependencies on Langchain and chose not to proceed with Langsmith as it seemed strongly coupled with Langchain."
I've never really used Langchain, but I set up Langsmith with my own project quite quickly. It's very similar to setting up Langfuse, activated with a wrapper around the OpenAI library (rough sketch below). (Though I haven't looked into the metadata and tracing yet.)
Functionally the two seem very similar. I'm looking at both and am having a hard time figuring out differences.
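That wrapper setup is basically the same shape for both; a sketch of what I mean (imports reflect the docs at the time of writing and may have moved between SDK versions):

    # Both tools hook in by wrapping the OpenAI client, so the call sites stay
    # identical. Imports may differ by SDK version; treat this as a sketch.

    # Langsmith: wrap an explicit client object.
    from langsmith.wrappers import wrap_openai
    from openai import OpenAI

    ls_client = wrap_openai(OpenAI())

    # Langfuse: drop-in replacement for the OpenAI SDK's client.
    from langfuse.openai import OpenAI as LangfuseOpenAI

    lf_client = LangfuseOpenAI()

    for client in (ls_client, lf_client):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "hello"}],
        )
        print(resp.choices[0].message.content)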
I'm a maintainer of Opik, an open source LLM evaluation and observability platform. We only launched a few months ago, but we're growing rapidly: https://github.com/comet-ml/opik
I'm curious to see how this plays out when it comes to deploying and maintaining production-grade apps. I know relatively little about infrastructure and DevOps, but that's the stuff that always seems complicated when going from MVP to production. This question feels particularly important if we're expecting PMs and designers to be primary users.
That said, I'm super excited about this space and love seeing smart folks putting energy into this. Even if it's still a bit aspirational, I think the idea of cutting down time spent debugging and refactoring and putting more power in the hands of less technical folks is awesome.
Are you looking to validate a market idea? If so, are you thinking more of a consumer use case? You mentioned Cursor, so it sounds like you're maybe thinking more enterprise, but embedded ads are basically not a thing in the enterprise. Most solutions offer freemium mostly as a loss-leader, but this isn't AI-specific IMO.
You're right, embedded ads don’t work in enterprise, and freemium often serves as a loss-leader there. We're looking to validate the market, possibly for consumer use cases, while testing if freemium can drive early adoption or loyalty. Do you think it has potential in consumer AI, or is premium-only the better approach?
I'm building in this space[1] and I'm intrigued. When I checked out the repo, this actually looked like possibly a really convenient way to fine-tune models, but I'm trying to understand the piece about "products simply don’t have datasets, and datasets can’t keep up with product evolution". What does this mean in practice and how does this relate to fine-tuning?
Datasets tend to be really rough proxies of product goals. The initial spec is "feature smiling faces", so a "smiling/no-smiling" dataset is built. But over the next year you realize people can be "smiling but ugly smiling", "neutral faced but pleasant", and a bunch more. There are bugs you need to fix (false positives/negatives), and lots of tweaks to the goals. Any design nuance is lost in the chain: explain the product concept to the data science team, who write a spec for data collectors, who collect samples, DS makes a model, eng integrates, and then folks (finally) try it in the product.
QA files one-off bugs, but not in a way that impacts datasets/training. Someone needs to analyze them in bulk and make calls about which areas to care about (which is slow and expensive).
However, if the time to data is tiny, you can iterate more like software: new models drop often (with fast evals), subjective feedback becomes synthetic data quickly, the issue gets fixed, and the results get evaluated.
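In code terms, the loop I mean looks roughly like this; every helper here is a hypothetical placeholder, not a real library:

    # Illustrative only: the "feedback -> synthetic data -> retrain -> eval"
    # loop described above. Every function is a hypothetical placeholder.
    from dataclasses import dataclass


    @dataclass
    class Feedback:
        example: str  # the input the product got wrong
        note: str     # subjective description, e.g. "smiling but ugly smiling"


    def expand_to_synthetic(fb: Feedback, n: int = 20) -> list[str]:
        # Placeholder: turn one piece of subjective feedback into n labeled samples.
        return [f"{fb.example} (variant {i}, label hint: {fb.note})" for i in range(n)]


    def finetune(base_model: str, samples: list[str]) -> str:
        # Placeholder: kick off a quick fine-tune, return the new model id.
        return f"{base_model}-ft-{len(samples)}"


    def evaluate(model_id: str, eval_set: list[str]) -> float:
        # Placeholder: fast automated eval, returns a score.
        return 0.0


    def iteration(base_model: str, feedback: list[Feedback], eval_set: list[str]) -> str:
        samples = [s for fb in feedback for s in expand_to_synthetic(fb)]
        candidate = finetune(base_model, samples)
        print("eval score:", evaluate(candidate, eval_set))
        return candidate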
Your product looks a bit more like analysis pipelines for new problems? I'm more looking at zero-shot quality and performance.
[1] https://www.fabi.ai/