Incuriosity Will Kill Your Infrastructure (yellerapp.com)
97 points by luu on March 16, 2015 | 29 comments



Embedded software medical device developer here.

I once worked with a hardware engineer that would verbalize his thought process very explicitly, as we worked in the lab.

He would say things like: "Ok, I'm about to let the target out of reset. I expect to see the I2C bus controller initiate a master read of address 0x80."

Then he would do it, and look at the oscope, and see if his expectation was confirmed.

If it wasn't, or was fishy in any way, he'd say something like "OK, I have a mystery. I expected to see <X>, but I saw <Y> instead. I'm investigating this before I go further."

So, you get the drift.

For this guy, the rule was "NO MYSTERIES."
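In code, that rule might look something like this minimal sketch (the expect() helper and the bus-event value are hypothetical stand-ins, not his actual lab setup):

    class Mystery(Exception):
        """Raised when an observation contradicts a stated expectation."""

    def expect(description, expected, observed):
        # State the expectation explicitly before judging the result.
        print(f"Expecting {description}: {expected!r}")
        if observed != expected:
            raise Mystery(f"expected {expected!r} but saw {observed!r}; "
                          "investigating before going further")
        print(f"Confirmed: {observed!r}")

    # Hypothetical usage: release the target from reset, then check the
    # first bus event against the stated expectation (0x80, as above).
    first_bus_event = ("master_read", 0x80)  # stand-in for a scope capture
    expect("first I2C bus event after reset",
           ("master_read", 0x80), first_bus_event)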

Working with him was a fantastic and valuable experience.


This is exactly my thought process as a software developer. I work remotely, so I don't speak out loud (most of the time), but I think it helps too. It's similar to when you've been debugging something for hours and you figure out the solution right away once you start explaining the problem to a coworker.


Pre-emptive rubber ducking. I'd like to try it, but I'm afraid it might be a nuisance for others.


Aside:

I'm a biomedical engineer by way of Hopkins and would love to be able to ask you questions about medical device development.

If this is something you'd be interested in, what's the best way to reach you? Conversely, my contact info is in my profile description.


This is really interesting to me.

Do you have other stories or experiences to share about programming on "strict" systems like medical devices?

Stuff like coding styles, rules, practices, and the like.


Not just your infrastructure, your code base too. I've seen a lot of developers practice what I call "debugging by superstition," where they make random changes until it appears to work. I prefer to keep digging until I understand. Sometimes I make a hypothesis and test it, which superficially resembles debugging by superstition but is different.

One benefit of experience is that you gain a better intuition about your hypotheses, and you know how to more quickly devise "experiments" to test them. Also if you know more things in depth (because of prior digging) you don't have so many rabbit holes to explore.

Another benefit of waiting until you understand is that you don't make bull-in-a-china-shop edits to unfamiliar code. As a freelancer, I have a strong bias towards adopting the style/patterns/architecture of whatever code base I'm working in. I wish more people did this! More often, programmers skim through some code and start making changes without trying to learn why the code is the way it is, or which other parts of the system need it that way.

Since this is the Internet I feel compelled to add: of course moderation in all things.


>Sometimes I make a hypothesis and test it, which superficially resembles debugging by superstition but is different.

As Adam Savage [1] says, "The only difference between science and screwing around is writing it down."

I don't follow this process for every problem I encounter, but when I have a really intractable issue, where nothing I've thought of seems to work, I start a "lab notebook" (usually a few sheets of printer paper). I write down all my assumptions, and start designing experiments to test each one in turn. It's a fair amount of overhead (which is why I don't use this approach for everything), but when all else fails, the scientific method powers through.
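A minimal sketch of that notebook kept in code instead of on paper (the assumption and experiment below are invented examples):

    from dataclasses import dataclass

    @dataclass
    class Experiment:
        assumption: str   # what I currently believe to be true
        test: str         # how I will check it
        prediction: str   # what I expect to see if the assumption holds
        result: str = ""  # what actually happened

    notebook = []
    notebook.append(Experiment(
        assumption="the cache is serving stale entries",
        test="bypass the cache and replay the failing request",
        prediction="the bug disappears when the cache is bypassed",
    ))

    # ...run the experiment, then record the outcome...
    notebook[-1].result = "bug still present -> assumption falsified"

    for e in notebook:
        print(f"{e.assumption}: {e.result or 'pending'}")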

[1] https://www.youtube.com/watch?v=BSUMBBFjxrY


>As Adam Savage [1] says, "The only difference between science and screwing around is writing it down."

Not true at all, though publishing is an important step. Understanding what is going on is important too. A startling revelation that the MythBusters guys have little understanding of statistics came when they invented their buttered-toast dropper. Doing a 'calibration' dry run (a literal dry run!) with toast marked with an X on one side instead of butter, 7 out of 10 trials came up the same way. Adam remarked, "This isn't random enough - it should be 5!"

The MythBusters guys get 11/10 for curiosity and the spirit of investigation, but 4/10 for scientific rigour :)


That's not how I interpreted it at all. To me "writing it down" has nothing to do with publishing. "Writing it down" means that you systematically track the results of your experiments and then you use those results to update your hypotheses. If you don't write anything down, you're debugging by superstition.


I would not categorise that rather complex set of activities as 'only'. It's not just 'writing down', but analysing, predicting, and designing new experiments.


I'd have to see the segment in question, but that sounds suspiciously like an ironic comment.


I saw this in a code base once:

  Line above commented out seems to fix Bug 34541
This makes for an impressive code land mine and creates some seriously unhealthy superstitions.


Yes, I get very annoyed when I see strange-looking code written because the right way "didn't work" or "there was a problem". Sometimes that's a legitimate reason with legacy/buggy/messy systems, when you just don't have the time or energy to do it right. But then at least document the hack as what it is, and what the general issue with the obvious approach was.
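For illustration, here's roughly what a documented hack might look like, as opposed to the land mine upthread (the gateway behavior is invented; the bug number borrows 34541 from the earlier comment):

    import requests

    def fetch_with_retry(url):
        # HACK: retry once on a dropped connection. The obvious approach
        # (a single requests.get) "didn't work" because the legacy gateway
        # drops idle keep-alive sockets without closing them cleanly, so
        # the first request after an idle period fails intermittently.
        # See Bug 34541. Remove once the gateway is upgraded.
        try:
            return requests.get(url, timeout=10)
        except requests.ConnectionError:
            return requests.get(url, timeout=10)  # one retry papers over it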


So what I want to know, per all the recent agile/scrum discussions, is this: how does the modern approach (do sprint planning, commit to a number of sprint points and tasks, be the product manager's monkeys) align with "you saw something that's probably representative of a major problem in your system, but stopping what you're doing to investigate it will kill your velocity and make your team's statistics look bad"?


I came here to make this point: engineering organizations applying scrum or agile make no allowance for something like this. Sure, you might dig deeper, find a problem that would indeed have been a major crisis, and in some cases (if you're lucky) get to add it as a product support ticket after the fact. But the reality is that a lot of the time you'll spend hours investigating, learn that much more about your infrastructure/codebase, and find a small (but not insignificant) problem that gets thrown on the backlog, while the PM seethes at you for screwing up team velocity and going "rogue" (working on a problem without telling him). Or, maybe even worse, you don't find anything, and now you've wasted a day on nothing.

Of course you could say something in standup about it, but you and I both know you'd probably be gently admonished for wasting time on it and asked to go back to what you were doing. Because... sprint goals, quarterly objectives, yada yada yada.

If you make a habit of doing this often enough, it might even show up in your one-on-ones and evaluations. There goes your big annual raise.

In essence, you have to do your assigned work AND figure out these sorts of things on your own time. That's how you find yourself putting in 50-60 hour weeks, but it's okay, because you're a "passionate" engineer.

Either way, the business wins.


Any time a sprint commitment must be broken, the decision rests with the team's product manager, who is best (or _should_ be best) suited to weighing the trade-off between sprint completion and a potential emergency bug situation. If the PM decides it's not worth it, then that's on them. It's worth noting that you may want to notify a wide enough distribution that it's known you noticed the issue but the PM didn't want you to work on it, to cover yourself.

As for how to account for that work in your velocity: I don't believe it's realistic to size a story for fixing a potential issue (not to mention the difficulties in sizing bugs anyway [1]). However, it is possible (or at least a bit easier) to size a research story whose acceptance criteria are answering a specific question. In this article's situation, the acceptance criteria would've been something like 1) answer "will latency continue to rise?", and 2) if so, and it's unacceptable, identify the cause and add an implementation story to the backlog.

Some may say, "Well, how do you know what is causing the issue?" If no one can figure out the answer to the research story's question and the bug manifests itself as an emergency, then sure, your velocity will tank as you break the sprint(s), but it's up to the team/business to understand how to remove outliers to keep velocity accurate.

[1]: http://www.agileforall.com/2010/05/agile-antipattern-sizing-...


Yeah, I figured someone would respond that way. I have a feeling that the more process there is around making such fixes, the more likely the system is to be underperforming (and/or broken) long term.


In Scrum, speed and efficiency are traded for predictability and consistency (whether or not those trade-offs are worth it is another conversation).

It's worth noting that I agree with your point and also don't believe Scrum is a panacea, but I am starting to understand its appeal from a business perspective.


Your sprint budget should be some percentage less than the average velocity over the past 3 sprints. That allows for bugs to be fixed mid-sprint without affecting velocity.
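A worked example of that budgeting rule (numbers invented):

    velocities = [30, 34, 32]                    # points done, last 3 sprints
    average = sum(velocities) / len(velocities)  # 32.0
    slack = 0.20                                 # reserve 20% for the unplanned
    budget = int(average * (1 - slack))          # commit to 25 points
    print(budget)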

Velocity shouldn't be used to measure how "good" the team is.


Not related to the post, but after going to the main Yeller page, I noticed that one of the features listed at the bottom is "HTTPS Everywhere (we don't even allow HTTP over our API or website)", yet the site is not served over SSL at all. In fact, manually entering HTTPS in the URL shows that the certificate is not valid.


Hi, author/founder of Yeller here.

You're totally right; I need to change the wording there. The marketing site doesn't run over HTTPS: I'm bootstrapping, with relatively limited funds, and can't really afford the SSL costs for a CDN (my current one wants to charge $600 or so a month for serving SSL requests).

The webapp and the API are HTTPS only.

I should change the wording on that page to reflect that.


I highly recommend CloudFlare. Their basic plan is completely free, and even comes with a free SSL cert.


"Not once have I regretted spending unbounded amounts of time investigating something fishy"

While I agree with the gist here (heed warning signs, proactively preempt failure), there are literally hundreds of "fishy" things, many/most of them low impact, that I could investigate on a given day, and my time is bounded.

At the semi-formal dance of distributed systems, meandering investigation should be chaperoned by ruthless prioritization.


Can't find any way to contact you guys, so hopefully you'll read it here. On OS X, Chrome v41, the 'Subscribe to your free one month course' button extends a decent amount beyond the pink box on the right side.


Post author/founder of Yeller here:

Huh, interesting. That's my setup as well. I'm not super great at CSS (yet), so I'm not too surprised by a few minor visual bugs like that. I'll fix it soon. Thanks so much.


If that's this xpath: /html/body/div[2]/div[2]/div

Being:

    <div class="span10 offset3">
Then swapping out the "margin-left" property for a "padding-left" for selector ".row-fluid .offset2:first-child" should fix the problem.


The bit about how Riak resolves concurrent writes sounds backwards. As far as I know, it's last-write-wins by default. You need to opt into storing all the writes via allow_mult.


allow_mult has been enabled by default since at least 2.0 - http://docs.basho.com/riak/latest/dev/using/conflict-resolut...
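For reference, a bucket's allow_mult flag can be read or set through Riak's HTTP bucket-properties endpoint. A sketch using Python's requests library, assuming a local node on the default port and a hypothetical bucket named "mybucket":

    import requests

    props_url = "http://localhost:8098/buckets/mybucket/props"

    # Keep siblings on concurrent writes instead of last-write-wins.
    requests.put(props_url,
                 json={"props": {"allow_mult": True}}).raise_for_status()

    # Verify the setting took effect.
    print(requests.get(props_url).json()["props"]["allow_mult"])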


Does anyone have a good article on a more generalized version of this motto? It seems to apply to many forms of design.



