My Python testing style guide (2017) (thea.codes)
201 points by rbanffy on March 24, 2021 | 84 comments



Personally, I've come to really dislike test names like "test_refresh_failure". They tell you what component is being tested, but not what kind of behavior is expected. Which can lead to a whole lot of unnecessary confusion (or bugs) when you're trying to maintain a test whose implementation is difficult to understand, or if you're not sure it's asserting the right things.

It also encourages tests that do too much. If the test is named "test_refresh", well, it says right there in the name that it's a test for any old generic refresh behavior. So why not just keep dumping assertions in there?

I'm much more happy with names like, "test_displays_error_message_when_refresh_times_out". Right there, you know exactly what's being verified, because it's been written in plain English. Which means you can recognize a buggy test implementation when you see it, and you know what behavior you're supposed to be restoring if the test breaks, and you are prepared to recognize an erroneously passing test, and all sorts of useful things like that.


Call me crazy, but I go even more verbose than that. I go full Rspec/BDD mode:

  class TestAuthenticateMethod:
      
      class TestWhenUserExists:

          class TestWhenPasswordIsValid:

              def test_it_returns_200(self, app, user):
                  assert ...

          class TestWhenPasswordIsInvalid:

              def test_it_returns_403(self, app, user):
                  assert ...

      class TestWhenUserDoesNotExist:

          def test_it_returns_403(self, app, user):
              assert ...
It is so worth it. I can look back at my tests 3 months later and know exactly what was intended without duplicating explanations in comments. Yes, some methods can get a bit long and indented, but it's worth it for me. The other downside is you have to start all classes/methods with "test", but I got used to that very quickly.

When you run pytest it's very clear what's being tested and what the expected outcome is since it gives you the full class hierarchy in the output.


This is neat, but pytest allows for similar granulation using the right fixtures:

E.g. in test_authenticate_method.py:

    def test_returns_200_on_valid_password(app, user_exists):
        ...

    def test_returns_403_on_invalid_password(app, user_exists):
        ...

    def test_returns_403(app, user_not_exist):
        ...
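For completeness, a rough sketch of what those fixtures might look like (the make_test_app and user-seeding helpers here are hypothetical, just to show the shape):

    import pytest

    @pytest.fixture
    def app():
        # hypothetical: build whatever test client/app object the suite uses
        return make_test_app()

    @pytest.fixture
    def user_exists(app):
        # hypothetical helper that seeds a known user so authentication can succeed
        return seed_user(app, name="alice", password="s3cret")

    @pytest.fixture
    def user_not_exist(app):
        # hypothetical helper that guarantees the user is absent
        return remove_user(app, name="alice")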


I’ve been working on a codebase like this, and my major complaint is that I’ll see a deeply-nested test and have to scroll waaaaay up to understand the context. I prefer ‘foo_bar_baz’, ‘foo_bar_qux’ to

    foo
        <snip 1000 lines>
        bar
            <snip 553 lines>
            baz
            <snip 243 lines>
            qux


This idea is both crazy and genius. Thanks for sharing.


We don't need to put this much responsibility on test names. If there is more to explain, write a few words in a comment.


I've just never seen the "this belongs in comments" approach work out in practice. Maybe it's something about human psychology. Perhaps things might seem obvious when you've just written the test, so you don't think to comment them. Perhaps code reviewers feel less comfortable saying, "I don't understand this, could you please comment it?" Perhaps it's the simple fact that optional means optional, which means, "You don't have to do it if you aren't in the mood." Regardless of the reason, though, it's a thing I've seen play out so many times that I've become convinced that asking people to do otherwise is spitting into the wind.


I really hate the pylint rule that complains about missing docstrings, though I can see its virtue.

In tests, however, I prefer not to have them as my favorite test runners replace the full name of the test method with its docstring, which makes it a lot harder to find the test in the code.


My only issue is that the rule doesn't give you a good way to annotate that the object is documented by its signature. Sometimes devs are lazy, but the rule doesn't make them not lazy so you get pointless docs like:

    def matriculate_flange(via: Worble):
        "Matriculates flange via a Worble."


That's more or less why I hate this feature - it encourages lazy docstrings. Right after you named something is probably the worst possible time for you to explain it.


I've been speculating that the right place for the "why" of tests to live is in documentation, where references should be hung like citations (which may or may not be included in any given rendering of the documentation in question) supporting claims of fact. Then we can surface that any time a test fails, and find tests that no longer check anything we care about.


All this is true, but doesn't it apply just as much to writing descriptive test function names?

Test function names are less sensitive than regular functions, since they're not explicitly called, but I still don't want to read a_sentence_with_spaces_replaced_by_underscores.


I don't want to either, but until Python gives us backtick symbols or we get something like Kotlin's kotest that lets the test name just be an actual string, that's sort of the choice we're left with. And I'm inclined to take the option that I've known to lead to more maintainable tests over the long run, rather than the option that I've known to engender problems, even if it is harder on the eyes. Form after function.

As far as whether or not people do a better job with descriptive test function names, what I've seen is that they do? I of course can't share any data because this is all in private codebases, so I guess this could quickly devolve into a game of dueling anecdotes. But what I've observed is that people tend to take function names - and, by extension, function naming conventions - seriously, and they are more likely to think of comments as expendable clutter. (Probably because they usually are.) Which means that they'll think harder about function names in the first place, and also means that code reviewers are more likely to mention if they think a function name isn't quite up to snuff.

And I just don't like to cut those sorts of human behavior factors out of the picture, even when they're annoying or hard to understand. Because, at the end of the day, it's all about human factors.


I don't disagree with much of this.

I was talking specifically about a `def test_displays_error_message_when_refresh_times_out()` function.

That's too big a name for me to keep in my head, so I'd look for other solutions.


Why would you want to keep test function names in your head? I would rather use that space for regular functions which are in use throughout my program. Test functions are usually single-use anyways so being greedy with characters makes much more sense elsewhere. The most important thing to remember about a test function is what property it tests (and how). Putting that in the function name as a reminder makes a lot of sense imo.


I would read that any day over a function called test() where I have to figure out what is actually being tested.


Me too!


You don’t see comments in a test report though — maybe have an optional description as part of the framework for more detail along with -v


I think this is one of those conventions where your team agrees on one convention, and you just try to follow it consistently. If that's looking at the test in code to find descriptive comments, great. If that's long test names, cool. Just do the same thing consistently.


True - I think it depends on who uses your test outputs. If PMs and BAs care, having a more verbose output (descriptions, pretty graphs, etc) helps a lot. If it’s just for devs then they can more happily go through the code base.


I know that the docstring in Unittest is part of the reported output, and I was pretty sure that it's the same in Pytest.


I would've thought so but no, pytest will show the docstring in `--collect-only -v` (badly), but it doesn't show any sort of description when running the tests, even in verbose mode (see issue #7005 which doesn't seem to be very popular)


`unittest` prints the docstring alongside the name of the test in verbose mode (e.g. failure).

pytest does not though.
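For reference, a minimal unittest example; running it with `python -m unittest -v` prints the first line of the docstring next to the test name:

    import unittest

    class TestRefresh(unittest.TestCase):
        def test_timeout(self):
            """Displays an error message when refresh times out."""
            self.assertTrue(True)  # placeholder assertion

    if __name__ == "__main__":
        unittest.main(verbosity=2)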


It's a choice of course, but I look at test functions like other functions. If you can name a function such that it doesn't need a comment (but also doesn't use 100 characters to do that), I'll gladly take that over a comment. Same as in all other code: if you need comments to explain what it does, it's likely not good enough and/or needs to be put in a function whose name tells what it does. Comments on why it does stuff the way it does are of course where it's at.


s/comment/docstring/


I have yet to understand the point of docstrings.

How are they, in practical reality, better than comments?


The difference is in tooling. Docstrings are collected and displayed in many contexts. The intended purpose is writing “documentation” in the same place as code. Comments are only seen if you open the source file.


OK, I can see how that's useful for a certain workflow.

The way we work, it's just a different comment syntax. I do like having a dedicated place for it.


Python's builtins, the standard library, and many popular community libraries and tools use docstrings to generate tooling and documentation websites. Describing it as "useful for a certain workflow" is so understated it borders on false. The reverse is more true: docstrings are not useful for a certain workflow, but most Python programmers will find them quite helpful.

All Python programmers already benefit from the docstrings written in the libraries they use, which generate tooltips and documentation websites.


You can access it from `object.__doc__`, and there is tooling in IDEs like PyCharm to quick-view them to see what a function does, plus auto-generated documentation.

Prior to mypy/type hints, it allowed you to document the types of a function.
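For example, the same docstring is reachable programmatically, which is what IDE tooltips and doc generators build on (function name reused from the thread above, purely illustrative):

    def matriculate_flange(via):
        """Matriculate the flange via a Worble.

        Before type hints, parameter and return types were often documented here.
        """
        ...

    print(matriculate_flange.__doc__)  # the raw docstring text
    help(matriculate_flange)           # formatted view, as IDEs/REPLs show it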


Sometimes there are practical reasons to avoid this approach and have fewer test methods that test multiple behaviors of a single thing. For example, a lot of our projects set up and tear down a database between test methods, so a few fatter test methods run a lot faster than a large number of small test methods. We rely on good commenting within the methods to understand what exactly is being checked.


In that case, why not use some auxiliary function to load the resources (in your case, the database) and decorate it with functools.cache[0] to keep the function from being executed multiple times? Sure, this means re-using resources between multiple tests (which is discouraged for good reasons), but your current test effectively does the same thing, the only difference being that everything is being tested inside one single test function.

PS: How come your setup and teardown operations are so expensive in the first place? Why do you even need to set up an entire database? Can't you mock out the database, set up only a few tables or use a lighter database layer?

[0]: https://docs.python.org/3/library/functools.html#functools.c...
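A minimal sketch of that suggestion, with hypothetical loader functions; functools.cache makes sure the expensive setup runs only once per test process:

    import functools

    @functools.cache
    def seeded_database():
        # hypothetical: expensive one-time creation of tables and reference rows
        db = connect_test_database()
        load_reference_data(db)
        return db

    def test_reporting_query():
        db = seeded_database()  # first call does the work, later calls reuse it
        assert run_report(db) is not None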


The difference is that it's very obvious in our current tests when the database is being reset. And to be clear it's not like we're testing the entire app in a single test function - our test functions just tend to cover more than one feature.

We've also found over the years that if you want to be sure everything works on version X of database system Y, you run your CI tests on that.


One of the most common test packages in R is called "testthat", and you invoke a test by calling a function (rather than defining one) called "test_that"; the first argument is a string containing a description of what you want to test and the second argument is the code you want to test.

That way, all your unit tests reads: “test that error message is displayed when refresh times out” etc.

I think it’s a really nice way to lay things out and it avoids all the “magic” of some functions being executed by virtue of their name.


This. Like probably most programmers, I tend to read code more than I write code and when I read tests, I want to know immediately what a test does and checks. I don't want to read the test itself, unless I already know it's of interest to me.

In my experience, when a unit test has no clear name, this is usually a sign that the test does too much (meaning that it's actually not a unit test anymore) or even comprises multiple tests, leading to all the known bad consequences this has (leaking state between tests, unclear test assumptions etc.)


I think that behave (https://behave.readthedocs.io/en/stable/) is more useful for testing these more real-life usecases. I tend to have a `test_my_function` in pytest tests and the more integration and functionality-related testing in the behave tests.


I usually put that info in the assert:

    assert foo(), "foo should bar"

So that each failure has a clear meaning.

Indeed, I may have several asserts in a test.


Have you considered using "test_refresh__failure" instead? It makes clear that "refresh" is the component being tested and "failure" is a description of the behavior.


The author doesn't like pytest fixtures, but personally they're one of my favorite features of pytest.

Here's an example use case: I have a test suite that tests my application's interactions with the DB. In my experience, the most tedious part of these kinds of tests is setting up the initial DB state. The initial DB state will generally consist of a few populated rows in a few different tables, many linked together through foreign keys. The initial DB state varies in each test.

My approach is to create a pytest fixture for each row of data I want in a test. (I'm using SQLAlchemy, so a row is 1-1 with a populated SQLAlchemy model.) If the row requires another row to exist through a foreign key constraint, the fixture for the child row will depend on the fixture for the parent. This way, if you add the child test fixture to insert the child row, pytest will automatically insert the parent row first. The fixtures ultimately form a dependency tree.

Finally in a test, creating initial DB state is simple: you just add fixtures corresponding to the rows you want to exist in the test. All dependencies will be created automatically behind the scenes by pytest using the fixtures graph. (In the end I have about ~40 fixtures which are used in ~240 tests.)
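A rough sketch of that dependency tree with hypothetical models (Account as the parent row, Order as a child linked by a foreign key; db_session is assumed to be a session-providing fixture):

    import pytest

    @pytest.fixture
    def account(db_session):
        # parent row
        row = Account(name="acme")
        db_session.add(row)
        db_session.flush()
        return row

    @pytest.fixture
    def order(db_session, account):
        # child row; requesting `account` makes pytest create the parent first
        row = Order(account_id=account.id, total=100)
        db_session.add(row)
        db_session.flush()
        return row

    def test_order_lookup(db_session, order):
        # listing `order` pulls in the whole chain of required rows
        assert db_session.query(Order).count() == 1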


One thing I've encountered with pytest fixtures is that they have a tendency to balloon in size.

We started out with like 50 fixtures, but now we have a conftest.py file that has `institution_1`, ..., `institution_10`.

My end conclusion is that fixtures are nice for some things, like managing mocks, and clearing the databases after tests, but for data it's better to write some functions to create stuff.

So instead of `def test_something(institution_with_some_flag_b)` you'd write in your test body:

    def test_something() -> None:
        institution = create_institution(some_flag="b")
Another benefit is that you can click into the function, whereas with fixtures you have to grep.
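A sketch of such a factory (the Institution model and session handling here are made up to show the shape):

    def create_institution(some_flag="a", name="Test University"):
        # all the variation is visible at the call site instead of hidden
        # away in a conftest.py fixture
        institution = Institution(name=name, some_flag=some_flag)
        session.add(institution)
        session.flush()
        return institution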


I rewrote a bunch of our tests to this factory pattern last week, too (the factory is a fixture though - FactoryBoy is worth a look).

I'd argue that too many global fixtures in conftest have a high risk of becoming "Mystery Guests" or overly general fixtures. For a test reader it's impossible to know the semantics of "institution_10".

I believe this to be rooted in DRY obsession leading to coupling of tests: “We need a second institution in two modules? Let’s lift it up to global!”


I'm the exact opposite, I absolutely hate pytest fixtures. They are effectively global state, so adding a fixture somewhere in your code base might affect the tests in a completely different location. This gets even worse with every fixture you add because, being global state, fixtures can interact with one another – often in unexpected ways. Finally, readers unfamiliar with your code won't know where the arguments for a given `test_xy()` function come from, i.e. the dependency injection is completely unclear and your IDE won't help you much.

There are so many other (better) ways to achieve the same goal, such as decorators or – as already mentioned by emptysea in their sibling comment – explicitly invoking some function from within the test to do the setup/teardown.


I'm mixed on fixtures.

On the one hand, I've been impressed with how they compose and have let me do some great things. For example, I had system tests that needed hardware identifiers. I had a `conftest.py` to add CLI args for them. I then made fixtures to wrap the lookup of these. In the fixture, I marked it as Skip if the arg was missing. This was then propagated to all of the tests, only running the ones the end-user had the hardware for.

On the other hand, when I need to vary the data between tests and that data is an input to something that I'd like to abstract the creation of, fixtures break down and I have to instead use a function call.
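Roughly what that hardware-ID pattern can look like (the option name and test here are made up for illustration):

    # conftest.py
    import pytest

    def pytest_addoption(parser):
        parser.addoption("--board-serial", default=None,
                         help="serial number of the attached test board")

    @pytest.fixture
    def board_serial(request):
        serial = request.config.getoption("--board-serial")
        if serial is None:
            pytest.skip("no board attached; pass --board-serial to run")
        return serial

    # any test that requests `board_serial` is skipped automatically
    # when the option wasn't passed
    def test_board_boots(board_serial):
        assert board_serial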


I've been looking for something like this for ages. I'm excited to try some of this stuff out, like spec'd Mocks.

I'm curious if anyone else who has been drawn in by the allure of the Mock has some strategies to avoid the footguns associated with them? (Python specifically)


Just don't use them unless you have to. The two main reasons to mock are for network requests and because something is too slow. Other than that, test things for real. Do not isolate parts of your code from other parts of your code by using mocks. If your code does side effects on your own machine, like writes to the file system, let it write to a temporary directory.

I linked this excellent talk in another thread recently. I'll put it here again: https://www.youtube.com/watch?v=EZ05e7EMOLM


Indeed. And if you're mocking eg. API calls in a client library, try to have tests for the real things as well. They don't have to be part of your normal test suite, they can run only if env vars are set with the API keys needed.


VCR.py (https://github.com/kevin1024/vcrpy) is a great utility for mocking APIs. It will run each request once, save the responses to YAML files, and then replay the responses every time you re-run the tests. It's also very useful for caching API responses (e.g. you have a trial account with a limited number of requests). Unfortunately, if used for testing, it will not cover the case where the original API changes its interface.
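Typical usage is just a decorator around the test (cassette path and URL here are arbitrary); the first run records the HTTP interaction to the YAML cassette and later runs replay it:

    import requests
    import vcr

    @vcr.use_cassette("fixtures/cassettes/example.yaml")
    def test_fetches_example_page():
        response = requests.get("https://example.com")
        assert response.status_code == 200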


That Ian Cooper talk is just fantastic. It's perhaps the best contribution to the subject of TDD anyone has produced since Kent Beck popularized the idea in the first place.


I'd really recommend this video. I had seen it years ago and just came back to it. It's about changing your architecture, one effect of which is changing how much you need mocks and how you use them. https://www.youtube.com/watch?v=DJtef410XaM&t


Assert on the call_count attribute of a mock instead of trying to use methods on it like .assert_called_once_with().

"a mock’s job is to say, “You got it, boss” whenever anyone calls it. It will do real work, like raising an exception, when one of its convenience methods is called, like assert_called_once_with. But it won’t do real work when you call a method that only resembles a convenience method, such as assert_called_once (no _with!)."

https://engineeringblog.yelp.com/2015/02/assert_called_once-...
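A minimal illustration of that suggestion (the call under test is hypothetical): call_count and call_args are plain recorded data, so asserting on them always does real work, with no convenience-method name to mistype:

    from unittest import mock

    api = mock.Mock()
    api.refresh(token="abc")  # hypothetical call under test

    assert api.refresh.call_count == 1
    assert api.refresh.call_args == mock.call(token="abc")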


This behaviour has changed in Python 3.5 [1], and it was also backported to the mock package.

When unsafe=False (the default), accessing an attribute that begins with assert will raise an error.

[1]: https://docs.python.org/3/library/unittest.mock.html#the-moc...


> accessing an attribute that begins with assert

Or any of the names in this surprising hardcoded list of typos https://github.com/python/cpython/blob/fdb9efce6ac211f973088...


do you happen to know if this is true on mock==2.0?


Yes it looks as if that functionality is in 2.0, but the list of typos isn’t as extensive as in later versions.

https://mock.readthedocs.io/en/latest/changelog.html#and-ear...


What footguns are you looking to avoid with mocks? I've been using them for about a year now and haven't run into many issues.


If the mock's model of the mocked-out component is inaccurate, it reduces the relevance of the test using the mock.


Check out the talk by Edwin Jung, "Mocking and Patching pitfalls": https://www.youtube.com/watch?v=Ldlz4V-UCFw


I've seen things like the mock returning null for an error while the real thing throws an exception.


I'd add to that that tests should be readable. Personally, I prefer to use GIVEN, WHEN, THEN as comments in the tests. Also, it's ok not to be DRY while writing tests.
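For example, a test laid out that way might read (names and numbers are made up):

    def test_discount_applied_to_large_orders():
        # GIVEN an order over the discount threshold
        order = Order(total=150)

        # WHEN the discount is applied
        apply_discount(order)

        # THEN the total is reduced by 10%
        assert order.total == 135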


> it's ok not to be DRY

Depending on context and implementation details, I'd say DRYing tests can be anywhere from indispensable to toxic.

I'm fine with creating libraries of shared functionality that tests can use, especially when it helps readability. If you've got several tests with the same precondition, having them all call a function named "givenTheUserHasLoggedIn()" in order to do the setup is a nice readability win. And, since it's a function call, it's not too difficult to pick apart if a test's preconditions diverge from the others' at a later date.

What I absolutely cannot stand is using inheritance to make tests DRY. If you've got an inheritance hierarchy for handling test setup, the cost of implementing a change to the test setup requirements is O(N) where N is the hierarchy depth, with constant factors on the order of, "Welp, there goes my afternoon."


I've gotten lured into the inheritance stuff and it's super nice at the very, very beginning but becomes a nightmare to maintain. Obviously a horrible tradeoff for software.

I've found that having a class/function as a parameter and explicitly listing the classes/functions that get tested is a small step back and way easier to maintain and read. It sets off some DRY alarms, because usually that whole list is just "subclasses of X". And it seems like a burden to update. "So if I make a new subclass, I have to add it everywhere?" Yes. Yes you do. Familiarity with the test suite is table stakes for development. You'll need to add your class name to like ten lists, and get 90% coverage for your work, then write a few tests about what's special about your class. When something breaks, you'll know exactly what's being tested. And you'll be able to opt a class out of that test with one keystroke.

That being said… I still have a dream of writing a library for generating tests for things that inherit from collections.abc. Something like “oh, you made a MutableSequence? let’s test it works like a list except where you opt-out.”


I'm an "it depends" fan myself.

It does annoy the many programmers who want clear and absolute rules for everything.

Then again they are always annoyed, living in a world where so many things "depend".


The given, when, then breakdown is interesting, though I've never seen language test utilities actually enforce that structure. Maybe an interesting potential experiment (regardless of language)?

I feel like your last point is especially important. Sooooo many times have I seen over-abstracted unit tests that are unreadable and are impossible to reason about, because somebody decided that they needed to be concise (which they don't).

I'd much rather tests be excessively verbose and obvious/straightforward than over abstracted. It also avoids gigantic test helper functions that have a million flags depending on small variations in desired test behaviour...


As always, there are tradeoffs.

Personally, I work with some incredibly long (100+ line) "unit" tests and they are a nightmare to work with.

Especially when the logic is repeated across multiple tests, and it's incorrect (or needs to be changed).

I really, really like shorter tests with longer names, but I'd imagine there are definitely pathologies at either end.


Are you reading my mind? I even blogged about it :) https://stribny.name/blog/2018/11/writing-test-cases-with-gi...


I've found pytest to encourage tests with really long method names; examples from the post:

  test_refresh_failure
  test_refresh_with_timeout
These get even longer, like test_refresh_with_timeout_when_username_is_not_found, for example.

pytest-describe allows for a much nicer testing syntax. There's a great comparison here: https://github.com/pytest-dev/pytest-describe#why-bother

TL;DR, this is nicer:

  def describe_my_function():
    def with_default_arguments():
      ...
    def with_some_other_arguments():
      ...
This isn't as nice:

  def test_my_function_with_default_arguments():
    ...
  def test_my_function_with_some_other_arguments():
    ...


I like the concept, but using the profiler to grab locally declared tests is a bit more magic than I'm comfortable with in my tests.

Something like this might be a good compromise:

    def describe_my_function(register):
        @register
        def with_this_thing():
            ...
I think most Python devs understand that "register" can have a side-effect.


Do you have an opinion on grouping pytest tests in classes?

    class Test_my_function:
        def test_with_default_arguments(self):
            ...
        def test_with_some_other_arguments(self):
            ...
If you can make your eye stop twitching after seeing snake-cased class names, this is at least another option for grouping tests for a single function (note that the methods still need the test_ prefix for pytest to collect them by default).


They're already grouped by module which normally provides enough granularity (in my experience, I've only scaled this up to 50k LOC apps though so YMMV).


I just saw the GitHub readme for this project. How is the describe variant different from just grouping the tests together into a module called test_describe_my_function.py and then having functions with shorter names inside?


This grouping convention reminds me a lot of Better Specs from the Ruby world:

https://www.betterspecs.org/

With rspec, you use the describe and context keywords.

At one level, yes, it's mainly syntactical sugar. As the test-writer, the two approaches may seem interchangeable.

Where I find it really helps is when I'm not the test-writer but rather I'm reviewing another developer's tests, say in a PR. I find this syntax and hierarchy produces a much more coherent test suite and makes it easier for me to twig different use cases and test quality generally.


The readme says:

> With pytest, it's possible to organize tests in a similar way with classes. However, I think classes are awkward. I don't think the convention of using camel-case names for classes fit very well when testing functions in different cases. In addition, every test function must take a "self" argument that is never used.

So there's no reason to do this, aside from aesthetics.

I'd recommend against doing un-Pythonic stuff like this; it makes your code harder to pick up for new engineers.


You could call it aesthetics, but it's also readability, and that's an important aspect of tests.


My suggestion is that adding a mini-DSL for parsing nested functions as nested tests is less readable for the average Python programmer who has not seen that plugin before.

And I've never had a problem with reading class names vs. function names; we do that all the time when reading Python code.

I think this is clearly in the realm of subjective preference for what "looks nicer", which I'd call aesthetics.

On the other hand, this probably breaks your IDE's pytest integration, which would be an objective material downside.

Whatever floats your boat though, definitely not a hill I'd die on.


I'm curious how others test code that operates on large datasets—e.g., transformations of a dataframe, parsing complicated responses, important implementations of analytics functions.

I've previously used serialized data—JSON, or joblib if there are complex types (e.g., numpy)—but these seem pretty brittle...


If you're in the serverless space, a useful addendum: https://towardsdatascience.com/how-i-write-meaningful-tests-...


What are people using in terms of testing frameworks now?


Most of the codebases I maintain use unittest, but pytest is much better and my preferred framework.


This is a good talk by the core developer Raymond Hettinger [1]. He prefers pytest too. I don't do any crazy testing, but I really like property-based testing with Hypothesis, which is also mentioned. This video isn't Python but it's a great intro to property-based testing [2].

1. https://www.youtube.com/watch?v=ARKbfWk4Xyw

2. https://www.youtube.com/watch?v=AfaNEebCDos
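For a flavour of property-based testing with Hypothesis, a classic round-trip property looks like this:

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_reversing_twice_is_identity(xs):
        # the property must hold for every list Hypothesis generates
        assert list(reversed(list(reversed(xs)))) == xs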


unittest is in the standard library; this counts for a lot.


I used to use unittest for this reason, but it's pretty silly. Having extra dependencies for the tests makes no difference for end users and these days it barely makes a difference to developers.


> Having extra dependencies for the tests

What do you mean, extra dependencies? The only difference between pytest and unittest in this regard is that tests using unittest declare their dependency explicitly, using an import[0]. Most pytest tests still implicitly require pytest as a dependency, though. (Think of fixtures etc. etc.)

I actually like unittest's approach here – in my book, explicit is better than implicit.


I believe they mean that you need to install the pytest package, whereas unittest comes with Python.


What makes pytest so much better in your opinion?



