Software Testing - The Meh of It All

I write this post at my own peril.  Not so much because it's a hot take, but because it's on a technical topic, which, surprisingly for a blog nominally dedicated to tech topics, tends to draw a smaller audience.  Granted, that's only a difference of about 18 views, but that's 50%, people!

Actually, I'm grateful that my readership hovers consistently in the high 30s right now.  It has improved steadily since I first started writing, and there's a noticeable bump when I announce new posts on LinkedIn, so thank you all!

But, enough with me whining about my analytics. On to the controversy.

Today I want to discuss the value of various types of software testing, namely unit testing, integration testing, end-to-end testing (E2E testing), and manual testing.  The following thoughts are aimed predominantly at unit tests, but apply to all automated software tests to one degree or another.

Historically (at least insofar as my career can be considered historical), management tends to insist that as many tests as possible be codified in a software repository and that they hit an arbitrary percentage of code coverage (the proportion of lines of working code exercised by tests) in order for the codebase to be considered stable.

I believe this thought process is a variation of the 'no one ever got fired for buying IBM' adage (except for me - I'd fire you for buying IBM).  Sure, there are bugs galore and the site is suffering instability, but the code coverage is at 92%, so no one can point fingers at me because I qualitied!

Let's start by exploring a few problems around code coverage metrics.

First, when faced with shipping a feature or spending an arbitrary amount of time ensuring the test coverage is sufficiently high, guess what's going to win out?  This decision is often rationalized by (a) stating that we'll add the tests for appropriate coverage later (we won't) and (b) browbeating an overworked QA analyst and complicit software engineers into claiming that the testing done to this point should suffice (i.e. "Feel free to speak honestly if there are issues, but only if it's exactly what I want to hear.")

If quality isn't going to be an absolute requirement, then why harp on hitting an arbitrary metric?  It's like screaming at someone about eating a balanced diet with two half-eaten donuts in your hand.

Second, the code that isn't covered is often legacy code.  Legacy code is often legacy because it (a) is the predominant revenue generator for a business, so nobody dares touch it, and (b) is so byzantine and fragile that it can't easily be extracted and tested.  So code coverage very likely doesn't cover your critical infrastructure.

Third, there's no guarantee that the code that is covered is thoroughly tested.  The tooling that measures code coverage simply verifies that some test executes the line of code in question, not that it exercises it properly.
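To make that concrete, here's a minimal sketch (the function and test names are hypothetical) of a test that drives line coverage to 100% while verifying nothing at all:

    # pricing.py - hypothetical module with a hidden bug
    def apply_discount(price, percent):
        # Bug: the discount is added instead of subtracted.
        return price + (price * percent / 100)

    # test_pricing.py
    def test_apply_discount():
        # Every line of apply_discount executes, so coverage reports 100%,
        # but with no assertion on the result, the bug sails right through.
        apply_discount(100, 10)

The coverage tool counts the lines that ran, not the quality of the assertions made against them.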

Fourth, even if the coverage above is perfect according to both the letter and the spirit of testing law, it only tests functionality, not performance or reliability (and I'm sure some people are clamoring that those tests need to be included as well; I generally agree, but we'll get to that).  Assuming that this is a sufficient qualifier to move to production is like assuming you'll always be fine with one jet engine.

So, considering these four counterpoints, I find it a bit short-sighted to fixate on code coverage.  As usual, corporate solutions try to find one metric to rule them all, or one process or automated bot to replace them all, but the stochastic world in which we live continues to add wrinkles to this foolish demand for simplicity.

But, wait, a few of you might say - didn't you extol the virtues of unit testing in previous posts and talk about using it to improve your own software development?  If you actually remembered that, then kudos!  Also, yes, I did say that, and I still believe unit testing is a valuable tool.  But, as with everything, perspectives change with time and (hopefully) experience.

I've never been a big advocate for fixed code coverage numbers for the reasons listed above, but let's look at relative numbers as an alternative.

Prior to the common usage of GenAI tools, if you saw your coverage double from a shockingly low 20% to a still-paltry 40%, that was an indicator that your code quality had improved, even without any other information to support that conclusion.  It meant someone was looking at the code and either testing the existing functionality for robustness or fixing bugs and leaving the code in better shape.

Even with the advent of GenAI tools that can spit tests out sight unseen, the tests are likely either so far off base that you'll notice multiple failures (the more likely scenario) or they'll match your code base well and unmask hidden bugs or prove that your code is actually meeting the requisite quality standards.  It's not quite as strong a signal as tracking manual improvements, but it's still worthwhile.

I'm not opposed to setting arbitrarily high coverage targets for new projects (potentially including older code that you need to modify as part of the project), but following the spirit of the law is more important than the letter here:

Unit tests shine most when you use them to design or write your code.  I'm not a Test-Driven Development zealot who insists all tests must fail before writing code to pass them, but TDD does have a point.  If you have to create the failure first, you stop yourself from gaming the system, and you force yourself to build things brick by brick, reducing the likelihood of errors.
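A rough sketch of that rhythm, using a hypothetical function and a pytest-style test:

    # Step 1: write the test before the function exists.  Running it now
    # fails with a NameError - and that failure is the point.
    def test_slugify_lowercases_and_hyphenates():
        assert slugify("Hello World") == "hello-world"

    # Step 2: write just enough code to make the test pass, then repeat.
    def slugify(title):
        return title.strip().lower().replace(" ", "-")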

In doing so, you make sure that functions, methods, or classes don't grow unwieldy in their size or scope.  Incremental change typically leads to better consideration of edge cases, whether or not you include code to handle those cases.  For instance, if I know a function takes an integer as a parameter, and I'm building the function incrementally, I'm more likely to progress through thinking about positive integers, then negative integers, then zero, then floats, then non-numbers, then nulls, and so on.

And, in doing so, maybe I realize I need code to handle all integers, but not anything else.  It may come back to bite me, but at least it was a design consideration, rather than an oversight, and any further correction will be much easier to implement.
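Here's a sketch of how that progression might end up looking, using a hypothetical parse_quantity function and pytest:

    import pytest

    # Hypothetical function, built up one case at a time.
    def parse_quantity(value):
        # Design decision: handle all integers, reject everything else.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("quantity must be an integer")
        return value

    def test_positive_integer():
        assert parse_quantity(3) == 3

    def test_negative_integer():
        assert parse_quantity(-2) == -2

    def test_zero():
        assert parse_quantity(0) == 0

    def test_float_rejected():
        with pytest.raises(TypeError):
            parse_quantity(2.5)

    def test_non_number_rejected():
        with pytest.raises(TypeError):
            parse_quantity("three")

    def test_none_rejected():
        with pytest.raises(TypeError):
            parse_quantity(None)

Each test arrived as the next edge case occurred to me, and the cases I chose not to handle became explicit rejections rather than silent oversights.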

Once you write a unit test, try the code out in its natural habitat.  Maybe you'll find that the function handles integers, but in practice they're passed as strings, which means you'll need to update your code.
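Continuing the hypothetical parse_quantity sketch above, the natural habitat might be a query string or JSON payload, where numbers routinely show up as strings:

    def parse_quantity(value):   # same hypothetical function as above
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("quantity must be an integer")
        return value

    # The unit test happily passes an int:
    parse_quantity(3)

    # The real caller hands you a string:
    incoming = {"quantity": "3"}   # hypothetical request payload
    try:
        parse_quantity(incoming["quantity"])
    except TypeError:
        print("surprise: real callers send strings")

    # So the design changes: convert at the boundary, or teach the
    # function to accept numeric strings.
    parse_quantity(int(incoming["quantity"]))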

Mock data or methods also present problems.  If you're not using the data or the actual API calls that your code will be interacting with, you run a real risk of coding to a theoretical solution that doesn't reflect actual use cases rather than a practical one that does.

When possible, use live systems to test against.  When that's not possible, and you can run nearly identical systems used exclusively for test, do that.

When those systems are too complicated to stand up for testing, only then should you consider mocking responses.  But make sure you've communicated that you're mocking the systems and flagged the risks that come with it.
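If you do end up mocking, say so loudly - right in the test if you can.  A minimal sketch using Python's unittest.mock (the client and endpoint here are hypothetical):

    from unittest.mock import Mock

    # Hypothetical code under test.
    def get_account_status(client, account_id):
        return client.get(f"/accounts/{account_id}")["status"]

    def test_get_account_status_with_mocked_client():
        # WARNING: mocked response.  This encodes our *guess* at the payload
        # shape, not proof that the real API returns it.  Risk documented.
        client = Mock()
        client.get.return_value = {"status": "active"}
        assert get_account_status(client, 42) == "active"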

Once you've finished writing these unit tests, checking the tests into your repository, and ensuring that all code in the new project meets a 90% coverage metric, you should promptly delete most, if not all, of the unit tests you just wrote.

Wait, WHAT?!?!

Yeah, delete 'em.  

As usual, I have way more to say on this topic than I expected, so I'll explain myself in the next post.

Until next time, my human and robot friends. 
