pablohoffman's comments | Hacker News

Hey folks, I'm the one interviewed, happy to answer any follow-up questions here.


Not surprised, I'm pretty low profile :)


Portia from Scrapinghub is 100% open source: http://scrapinghub.com/portia/


Earlier this year we released a similar open source tool for visual scraping called Portia: https://github.com/scrapinghub/portia

It's been getting quite a bit of traction and we're currently working on the integration with the Scrapinghub platform (disclaimer: I work there) for those who prefer a hosted version.


Not yet.


Yes, you just need to select a different field type ("text", instead of "html").
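Roughly speaking, the "text" type keeps only the visible text while "html" keeps the raw fragment. A quick illustration of the difference using w3lib (the helper library Scrapy ships with); this is just to show the effect, not how Portia implements it:

    from w3lib.html import remove_tags

    fragment = '<p>Price: <b>$9.99</b></p>'   # sample scraped fragment
    print(fragment)               # what an "html" field would keep
    print(remove_tags(fragment))  # roughly what a "text" field yields: 'Price: $9.99'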


We are not just complaining about MongoDB, nor saying it's useless. We are simply explaining why it's a poor choice for one specific use case: storing scraped data.

FWIW, we still use Mongo in other internal applications; it's just not the right choice for our crawl data storage backend.


One issue is that many of these points are design characteristics of MongoDB and should have been known beforehand. I am not criticising, but it's almost like you did zero research.

Transactions, for example, have never existed in MongoDB, and joins don't really make much sense there.


Perhaps they did their research on MongoDB and knew all the limitations, but thought to themselves "meh, I can solve all that in the application code", and eventually found out it wasn't so easy to handle transactions and joins in the code?
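For illustration, a pymongo sketch (database, collection and field names are made up) of what "solving it in the application code" tends to look like: every missing join becomes an extra round trip, and every missing transaction becomes consistency logic you maintain yourself.

    from pymongo import MongoClient

    client = MongoClient()    # assumes a local mongod
    db = client["crawl"]      # hypothetical database

    # Application-level "join": one extra query per page, with no
    # transactional guarantee that jobs and pages stay consistent.
    for page in db.pages.find({"status": 200}):
        job = db.jobs.find_one({"_id": page["job_id"]})
        print(job["spider"], page["url"])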

After all, developers are rather susceptible to the "don't tell me I can't do that" behavior.


What was the evaluation process that led to choosing MongoDB in the first place?

At some point you must have compared it to, say, Postgres, which is what the section before the summary hints at.


We went with HBase. Cassandra would have been suitable too, but we already use Hadoop for data processing, so it was a natural choice within our infrastructure ecosystem. We will write a follow-up post about that.
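As a rough sketch of what the write path can look like from Python, using the happybase client (the Thrift host, table, column family and row-key layout below are made up for illustration, not our actual schema):

    import happybase

    connection = happybase.Connection('hbase-thrift-host')   # Thrift gateway
    table = connection.table('crawled_pages')

    # Row key and columns are illustrative; HBase stores plain bytes.
    table.put(b'example.com/products/1', {
        b'page:url': b'http://example.com/products/1',
        b'page:body': b'<html>...</html>',
    })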


Clouderan here! Glad to hear you guys went with HBase, I'm looking forward to your follow-up post. Will you detail your key design and architectural setup?

Did you guys roll your own HBase environment or did you go with the CDH? If you're using the CDH version and have any questions, feel free to shoot an email to cdh-user.


We are using CDH4.2 and have had a very positive experience so far.

Cloudera has in fact been an inspiration for us to follow; you guys have really struck the right balance between open source and commercial support. We follow the same philosophy with Scrapy (an open source web crawling framework) as you do with Hadoop and its ecosystem.
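For anyone who hasn't seen it, a minimal Scrapy spider is only a few lines; this one just grabs page titles (the URL is a placeholder):

    import scrapy

    class TitleSpider(scrapy.Spider):
        name = 'titles'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Yield one item per page with the extracted <title> text
            yield {'title': response.css('title::text').get()}

You can run a standalone spider like this with "scrapy runspider title_spider.py -o titles.json" and the items land in a JSON feed.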


That's really awesome to hear, thanks for your kind words. I'm looking forward to the follow-up blog post; depending on your key design, you may be able to take advantage of Impala for ad-hoc queries using SQL.
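If that pans out, a rough sketch of what an ad-hoc query from Python could look like using the impyla client (host, port and table name here are placeholders):

    from impala.dbapi import connect

    conn = connect(host='impalad-host', port=21050)   # any impalad daemon
    cur = conn.cursor()
    cur.execute('SELECT domain, COUNT(*) FROM crawled_pages GROUP BY domain LIMIT 10')
    for row in cur.fetchall():
        print(row)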


The title of this post is misleading: it makes you think GitHub is actually broken, when the article is really a complaint about how their fork feature works.


I initially submitted this post, but then deleted it and submitted the original post on the Common Crawl blog instead: http://news.ycombinator.com/item?id=3208853

I now regret it, since this one got much more attention. I was under the impression that linking to the original post was more welcome here on HN, but it seems that's not always the case.

