UPSERTisms in Postgres

chrisbolt · on April 21, 2014

The graphs remind me of https://news.ycombinator.com/item?id=7588929 (specifically the truncated Y-axis)

jakobe · on April 21, 2014

Here, I fixed the graphs for you: http://imgur.com/xi2nhgx

lifthrasiir · on April 21, 2014

In this case, though, the goal of these plots is to determine the most performing solution for the particular hit rate range, which the plots make obvious.

inportb · on April 21, 2014

I don't find these plots misleading. A line plot emphasizes slope and changes in slope, rather than the absolute heights of the data points as a bar or area plot would; so it's more important to have consistent vertical and horizontal intervals.

joesb · on April 21, 2014

Supposed solution A changes from 10000 to 10001 and solution B changes from 10000 to 10000.5, emphasizing only the slopes doesn't tell you that this changes may be insignificant.

The given link above (How to lie with data visualization) gives axis truncation as a very first examples.

inportb · on April 21, 2014

Indeed, and I disagree with the article regarding axis truncation of line plots, in this context.

I'd also be careful with terms such as "insignificant" when we're talking about data. Suppose that tiny difference were significant. Depicting it as a bar would completely mask the difference, and truncating the axis would give the false impression of the bar's height representing the total magnitude. There'd be a similar effect with a line plot, but the line focuses on continuity and de-emphasizes height: it's not as big of a problem.

There are myriad ways to tell stories using data, and there are myriad ways to tell misleading stories using data. Each type of plot has distinctive properties that make it more or less appropriate to be used/manipulated in certain ways, so I'd hesitate to apply a static set of rules indiscriminately.

thwarted · on April 21, 2014

Just because axis truncation is in the lier's toolbox doesn't mean that every use of axis truncation does, or intends to, misrepresent the data or is inaccurate.

ScottBurson · on April 21, 2014

I have a different complaint. The Y axis should be logarithmic. What we really care about is the reciprocal of the value being plotted: not operations per second, but microseconds per operation. On a log scsle the curves of a function and its reciprocal are just mirror images, so it's easy to visualize one from the other.

As for the truncated Y-axis -- always, always, always read the axis labels before interpreting the graph. (Log scales can't reach zero, so there's no way to escape this when using them.)

ricardobeat · on April 21, 2014

Too fast. Truncating the Y-axis makes more of the relevant data visible, it's not misleading at all here.

Grue3 · on April 21, 2014

These are not bar charts, so truncating Y-axis is acceptable (although I still wouldn't recommend it).

muxxa · on April 21, 2014

The author dismisses the RULE based approach. I'd love to see it compared, as it was the only thing that I could get to complete for a multi-GB table merge (I can't recall what other approaches I tried). http://stackoverflow.com/a/6176044/6691

It would be a boon if upsert was supported as a declarative primitive in psql so that the query planner can do it's magic. Pity it didn't make it in 9.4.

johto · on April 21, 2014

Author here. I dismissed RULEs and TRIGGERs since they don't work correctly under concurrency (and most RULE-based approaches are broken even without concurrency). I felt it wasn't a fair comparison.

However, I guess it would be nice to know how much of a gain would be possible if one was willing to forfeit correct behaviour under concurrency. But that would probably require more work than I'm willing to put in, since the answer doesn't interest me that much personally at this moment.

ExpiredLink · on April 21, 2014

BTW, why is it called UPSERT (ugly) and not MERGE (less ugly) as in DB2?

ldng · on April 21, 2014

AFAIK, it is because UPSERT are only a subpart of MERGE. The devs plan to add real MERGE in a later release (9.5 or 9.6, I don't remember). Someone more knowledgeable should shim in for the details.

chaz · on April 21, 2014

I don't believe there actually is an UPSERT command and that it's just a nickname for the activity. MERGE is a part of the standard, so I think any implementation would use that as the actual command.

jeltz · on April 21, 2014

The reason the author of the blog does not call it MERGE is probably to avoid confusion with the MERGE statement in the the SQL standard. Thew current plan is to add syntax to the INSERT statement in PostgreSQL 9.5 to ignore or return unique key violations. This is a subset of MERGE since MERGE does not require unique constraints to work.

For most OLTP users the proposed new syntax is more useful than MERGE since (unlike MERGE) it makes it clear exactly which rows will be locked. You do not want your application to stall on locks or give unique violations when doing upserts.

tomjonesmi · on April 21, 2014

I think the term UPSERT really gained popularity in NoSQL databases, such as MongoDB and CouchDB. That, combined with missing implementation details of MERGE could explain the term usage.