Hacker News
Apache Hop 2.0 (apache.org)
85 points by CharlesW on June 8, 2022 | 53 comments


(Just some feedback for the project's tech writer.) I always include a "What problem does this solve?" section and screenshots in my docs. I think they help people understand a project better.

Here too, I only understood Hop's purpose after seeing the screenshots on secondary pages like https://hop.apache.org/manual/latest/getting-started/hop-gui.... Abstract statements like "aims to facilitate all aspects of data and metadata orchestration" on the front page, or even in the "What is Hop?" doc, didn't help.


I too am getting tired of this shite. This disease afflicts corporate product descriptions to the point that I just use Wikipedia to find out WTF an £££expensive product actually does, and it's now metastasizing to free software.

"Apache Hop, short for Hop Orchestration Platform, is a data orchestration and data engineering platform that aims to facillitate all aspects of data and metadata orchestration. Hop lets you focus on the problem you’re trying to solve without technology getting in the way"

What is 'data orchestration'? Ditto 'data engineering platform'?

'facilitate all aspects of data and metadata orchestration' What the hell does this even mean?

'Hop lets you focus on the problem you’re trying to solve' so what problem do you think I'm trying to solve?

It's just so bizarre; it's like meaning has separated from language itself, like layers of plywood left in the rain. And there is no Wikipedia page to help out.


Thanks for sharing the page. I was wondering what Hop was and whether it could potentially be a solution for a current problem I am facing.

Now at least I know a bit more.



> "VISUAL DESIGN AND METADATA

> Apache Hop, short for Hop Orchestration Platform, is a data orchestration and data engineering platform that aims to facilitate all aspects of data and metadata orchestration. Hop lets you focus on the problem you’re trying to solve without technology getting in the way. Simple tasks should be easy, complex tasks need to be possible.

> Hop allows data professionals to work visually, using metadata to describe how data should be processed. Visual design enables data developers to focus on what they want to do instead of how that task needs to be done. This focus on the task at hand lets Hop developers be more productive than they would be when writing code."


I don't understand how so many Apache-hosted projects P claim to let someone focus on X without Y getting in the way, while totally ignoring the complexity of introducing P and altering everything to align with P, thereby precluding any focus on X.

Are these really often useful?


As far as I can tell, the ASF is just where many companies send their failed Java projects for palliative care.


Thanks for expanding on that; it reads like it's an Airflow competitor. Would be curious how it handles all the authentication management for the various pipeline elements.


"Hop initially (late 2019) started as a fork of the Kettle (Pentaho Data Integration)."


Wow. PDI is one of the worst pieces of software I've ever used. Possibly only second to Pentaho Report Designer.

From looking at the Apache Hop docs, it doesn't look like they have changed the UI much (if at all). I wonder if they at least made it less buggy.


Aside from this platform, which I'd never heard of until now, I'm wondering what others are using in the workflow orchestration space?

I'd assume Airflow is the most prevalent, but there's also Argo getting quite a bit of momentum lately.


"I want to write my orchestration in Python and I'm comfortable hosting my own compute" -> Prefect (lightweight) or Dagster (heavier but featureful)

"My team already knows Airflow and/or I want to pay Astronomer a lot of money" -> Airflow

"I love YAML and everything is on k8s anyway" -> Argo

"I just want something that works out of the box and don't want to host my own compute" -> Shipyard, maybe Orchest

"I want a more flexible, generic workflow engine and don't care about writing orchestration in Python" -> Temporal/Cadence

"I am very nostalgic" -> Azkaban, Oozie, Luigi

"I love clunky Java solutions to data problems" -> Nifi et al

"I like to pay for half-managed solutions and late upgrades to a first-generation technology" -> AWS/GCP hosted Airflow options

"I am on AWS and it doesn't need to be complicated" -> AWS Step Functions


This was a really useful comment. Thank you.

I know there's a degree of oversimplification going on here, but there's something to be said for having a simple bullet-list breakdown of all the use cases, alongside the best tool for each.

It serves as a practical starting point for narrowing down the list of tools (of which there are so many) before one proceeds with a deeper dive into the best-fitting tool.

Would be great if there were a site that did this sort of thing for all the common architectural needs.


+1 to AWS Step Functions. In my last three companies I have built fairly complicated workflows with them, and once you get used to them they are very powerful, reliable, and cheap. I just wish there were a little more monitoring on top of them, but that's nothing you can't build yourself.
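As a rough sketch of the kind of DIY monitoring I mean (assuming boto3 with credentials configured; the state machine ARN below is a placeholder):

    # Hedged sketch: poll a state machine for recent failed executions.
    # The ARN is a placeholder; adjust filters/paging for real use.
    import boto3

    sfn = boto3.client("stepfunctions")
    ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow"

    resp = sfn.list_executions(
        stateMachineArn=ARN, statusFilter="FAILED", maxResults=20
    )
    for ex in resp["executions"]:
        print(ex["name"], ex["startDate"], ex["status"])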


I ended up choosing Node-RED at my current job. What does that make me? :)


https://temporal.io/

(disclaimer: I work for Temporal on the Go SDK and upcoming Python SDK)
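For a rough idea of the shape the Python SDK takes, here's a minimal hedged sketch based on the temporalio package (the greeting names are illustrative, and actually running it also needs a Temporal server plus a worker):

    # Hedged sketch of a Temporal workflow + activity in Python.
    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def greet(name: str) -> str:
        return f"Hello, {name}!"

    @workflow.defn
    class GreetingWorkflow:
        @workflow.run
        async def run(self, name: str) -> str:
            # Activities run through the workflow API so Temporal can
            # record, retry, and deterministically replay them.
            return await workflow.execute_activity(
                greet, name, start_to_close_timeout=timedelta(seconds=10)
            )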


Oooh, I hadn't realised there was a Python SDK under development; will take a look.


In my previous job I used Dagster. It served us well.

See my comment about it here: https://news.ycombinator.com/item?id=28803117
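For anyone curious, a minimal hedged sketch of what a Dagster job looks like (the op names are hypothetical; assumes the dagster package is installed):

    # Hedged sketch: two ops wired into a Dagster job.
    from dagster import job, op

    @op
    def fetch():
        return [1, 2, 3]

    @op
    def report(rows):
        print(f"{len(rows)} rows")

    @job
    def pipeline():
        # Dagster derives the dependency graph from these calls.
        report(fetch())

    if __name__ == "__main__":
        pipeline.execute_in_process()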



We’re building Orchest (https://github.com/orchest/orchest)

tl;dr: a GUI for Argo


Ooh, Drools plugins for rule-based event processing. Neat. Hope I can find some examples!

I haven't used Airflow, but my impression is that this fits a similar role. That it's built atop good tech like Apache Beam and can use things like Flink is, in my book, a nice win.
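To illustrate the Beam angle, a minimal hedged sketch (assumes the apache-beam package; the elements are made up). The draw is that the same pipeline can target the default local runner or, with a different runner option, something like Flink:

    # Hedged sketch: a tiny Beam pipeline on the default local runner.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.Create(["hop", "airflow", "hop"])
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )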


Workflow applications get reinvented all the time (there are hundreds of them out there, plus many “standards”). However, I have never seen a successful use of a workflow application in industry. So my question is: does anybody have good examples of workflow applications being more successful than “normal” business applications?


I built an automated credit decision support system for a major financial services institution that used workflows. We needed integration with masses of legacy data systems, rule-based decisions, and human task coordination.

This system was built with Microsoft's BizTalk around 2006. It performed very well in production, but BizTalk had quite a few gotchas that needed to be creatively worked around in development.


Power Automate, Airflow, etc.


I'd be curious how this contrasts with Apache NiFi.


potato, potahto


Slightly OT: what library do people use as the frontend web GUI for workflow orchestration?

E.g. a drag-and-drop web GUI workflow library like the one pictured below?

https://customer.io/wp-content/uploads/2021/06/use-case-onbo...


We’ve rolled our own, but https://reactflow.dev is pretty neat and OSS.


This is perfect. Thanks so much.

Anyone know of others?


Wow, I was expecting this to be the 463rd distributed computing streaming framework, since it's Apache. Shocked that it's not.


How does this compare with Apache DolphinScheduler? That seems to fit the orchestration / workflow scheduling role pretty well, and seems like the next iteration of Airflow... Not quite getting how this compares, and not finding anyone directly comparing them on Google (though that's been less reliable lately).


Tangent: what is it about Apache or big data that the associated software is mostly written in Java?


Java, like .NET, is just a solid application platform: statically typed, with good performance.

Java has a history in big systems going back almost 30 years.

Rust, Python, and Go are just not there yet. Rust is too low-level, Python is not statically typed and will always suffer performance-wise, and Go ... is a youngster :). And .NET is not always everyone's free choice.

And Apache, well, they just liked Java for their applications. They started with some C/C++ code but then quickly accumulated a lot of Java tech.


> And .NET is not always everyone's free choice.

Hey, sometimes it really is!!


I am a big fanboy, actually. The wording was a bit off: the "free" was as in (perceived) free/open/... software.


Performance, portability, stability, scalability, concurrency, ecosystem (libraries, etc.)... despite all the new languages around, there actually still aren't many alternatives that give you the same combination of all these to the same level as Java does.


Don’t forget observability and tooling!


MapReduce and HDFS were written in Java, and they paved the way for a lot of the other big data tools.


I get the impression that Google used C++ (their dialect) for the first MapReduce system, while open source Hadoop and HDFS came later in Java.


That is correct.


Something I've been asking for a long time as well. Java/JVM are great, but it would be great to see _some_ diversity in the Big Data ecosystem when it comes to implementations. :)


Why? The JVM is essentially perfectly suited for data workloads. Diversity of languages serves no technical purpose.


You have to deal with the JVM?

You have to deal with the 13m JVM config options?

You have to deal with the confusion and complexity that is “Java dependencies”?

Everything is OOP-abstraction-heavy APIs?

I’ve spent too much of my time and effort recently debugging Scala/Spark/JVM resource and dependency issues, and the more I have to deal with it, the less I want anything to do with JVM-based solutions. The closest I want to get to another awful JVM application is a Docker container.

I will rejoice the day that Spark alternatives progress enough for our team to replace our workloads and I can throw our Spark stuff into the literal bin.


Yeah, but you are basically asking for a fragmented world of implementations: some stuff that interoperates and is usable within Spark et al., and "the new stuff", which could either all be on the same platform or, much more likely, be spread across a number of platforms.

Once you have gotten the hang of running Spark (and gained working knowledge of the JVM itself), you have paid the cost and it's done. Adding a whole bunch of new stuff just means more costs to pay, less integration, and more fragmentation of knowledge.

Doesn't seem like a net win on any front to me.


If we were to settle on a framework, I don’t think Java/Spark is the right one. It’s notoriously fickle and fragile to run, it’s obscenely resource-heavy, often not actually that fast, and its construction is locked into a set of implementations that have aged poorly: you basically need a JVM wizard to tune everything correctly for you, and you need to run so much supporting infrastructure that I think it greatly erodes the benefits of Spark.

> "the new stuff" which could either all be on the same platform or much more likely spread across a number of platforms.

Yes. I’d like to see more variety in approaches, more variety in specialisations, and for interop between “platforms” to occur at a slightly “higher” level (i.e. maybe a common SQL dialect and a common set of messaging/interop protocols).

I think rather than pouring all of our effort into a singular platform, we should put our effort towards improving our tooling and languages so that it is more straightforward and viable to build these platforms.

Every language has its own web-server framework(s); I see Spark et al. as the web-server framework of the data space.


I feel like you're conflating Spark and the JVM.

I get your point on tuning Spark; been there, suffered the frustration of a job failing 4 hours in. But yeah, that's Spark for you.


https://arrow.apache.org/datafusion/ (aka Ballista) is the first serious contender that started on the JVM and quickly moved to Rust.


From like 2000 to 2010, or even 2015, either Java or .NET was the default choice for big enterprise companies. Nobody ever got fired for picking Microsoft (or Java, I would add), as they say. A lot of these Apache projects were donated from work at big enterprises, so I imagine it comes out of that enterprise background.


Even now, what options are a huge step up from Java/.NET for your standard tech business backend/web app? They integrate well with so many other languages, and both have great cross-platform stories...


Outside of the SV bubble, actually none. When you look from a conservative angle (static typing, developer availability, productivity, tool support, library support, performance, etc.), it quickly comes down to Java and .NET. The dynamic interpretation of Python and JavaScript is their core deal-breaker.


Java classloading is a natural fit for calling into a typesafe plugin fetched from a URL. Spark can even serialize lambdas and distribute their execution.

It seems to be more of a hassle to do this kind of thing over IPC with native binaries, and if ARM starts displacing a lot of x86-64 in datacenters it gets more complicated.
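As a Python-side illustration of the closure-shipping point (a hedged PySpark sketch, not the Java classloading mechanism itself; assumes the pyspark package is installed):

    # Hedged sketch: Spark serializes the lambda below (and the captured
    # `factor` variable) and ships it to executor processes for execution.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("closure-demo")
        .getOrCreate()
    )
    factor = 3
    rdd = spark.sparkContext.parallelize(range(5))
    print(rdd.map(lambda x: x * factor).collect())  # [0, 3, 6, 9, 12]
    spark.stop()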


Good question, but maybe due to Java's stability and portability.


So:

1 Corinthians 13: "Faith, Hop and Charity, and the greatest of these is Hop."

Does that sound about right?



