TDD and BDD are great. As development tools, they set a common language for the team, create living documentation and keep track of the project’s status. These advantages have made BDD/TDD an indispensable part of my process, but experience with them has made me realize there are still problems to tackle if the BDD/TDD practice is going to be viable for every type of project. Specifically, some nuisances related to TDD/BDD make it suboptimal for projects with great levels of uncertainty, and even worse for projects where the domain is to be discovered during and after the implementation.
In my experience, the BDD/TDD practice has often turned out anti-lean, making it harder to pivot and demanding a lot of extra energy to maintain. I narrowed down why:
Tests are hard to read
This is the most serious challenge with the BDD/TDD practice. Most tools (RSpec, Jasmine, JUnit) provide great (even beautiful) output formats for test results, and while that is great, it does not solve the underlying problem: tests are invariably hard to read. Several factors contribute to this:
Metaprogramming is not part of the languages’ design
Most testing frameworks rely at least partially on metaprogramming capabilities in order to capture method executions, extend objects dynamically or even create full classes at runtime. While cleverly crafted frameworks (such as RSpec) structure the API so that assertions and message expectations read like plain English, the gymnastics being done behind the curtain are revealed by the very unconventional use of the language’s features.
Incidentally, this makes BDD/TDD frameworks hard to design and hard to crack open, since making them work requires a ninja tinkering with the engine.
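To make this concrete, here is a minimal sketch (emphatically not RSpec’s real implementation) of how an English-looking assertion API can be assembled with metaprogramming; the `Expectation` class and the `be_*` convention are invented for the example:

```ruby
# Minimal sketch of how an English-like assertion API can be built
# with metaprogramming. Not how RSpec actually does it.
class Expectation
  def initialize(actual)
    @actual = actual
  end

  # method_missing turns `be_empty` into a call to `empty?` on the
  # wrapped object: readable call site, unconventional mechanics.
  def method_missing(name, *args)
    if name.to_s.start_with?("be_")
      predicate = "#{name.to_s.sub("be_", "")}?"
      unless @actual.public_send(predicate, *args)
        raise "expected #{@actual.inspect} to #{name}"
      end
      true
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("be_") || super
  end
end

def expect(actual)
  Expectation.new(actual)
end

expect([]).be_empty   # passes, since [].empty? is true
```

The call site reads almost like prose, but the reader only has to glance at `method_missing` to see how far from conventional method dispatch the implementation really is.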
External dependencies are hard to mock
Untestable services (such as a web API that provides no harmless alternative) and external libraries with extensive and expressive APIs (such as an ORM) are really bureaucratic to mock, and sometimes require a full-fledged copy of the API.
Faced with such a problem, some developers assume that the only way around is renouncing unit testing and leaping directly to BDD/integration testing against the live external service; that is, they resort to dangerous testing with the actual live tool. Testing live without an inert fallback is a desperate solution. It may be the only fit on some occasions, since mocking an entire service is not always attainable; but mission-critical services must have good code coverage to be maintainable, and this approach fails to prevent eventual nightmares when the external services are updated.
Fortunately, more and more services and libraries provide harmless or test-ready versions, such as rack-test for Ruby web services, or a test flag in the web service as provided by the good people of Mandrillapp. I regard this challenge, however, as a symptom of a different problem, implicit in the language and protocol design of the current generation of programming languages and communication protocols:
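When no test-ready version exists, the least painful middle ground I know is to inject a hand-rolled fake that covers only the slice of the API actually used. A sketch, with hypothetical names (`Billing`, `FakeGateway`) invented for the example:

```ruby
# Hypothetical example: instead of hitting a live payment API in tests,
# inject an inert fake that mimics only the slice of the API we use.
class Billing
  def initialize(gateway)
    @gateway = gateway   # dependency injection: live client or fake
  end

  def charge(amount)
    @gateway.charge(amount) ? :charged : :declined
  end
end

# A minimal hand-rolled fake: no network, records calls for assertions.
class FakeGateway
  attr_reader :charges

  def initialize(succeed: true)
    @succeed = succeed
    @charges = []
  end

  def charge(amount)
    @charges << amount
    @succeed
  end
end

gateway = FakeGateway.new(succeed: true)
billing = Billing.new(gateway)
result  = billing.charge(42)   # :charged, and no real API was touched
```

This keeps unit tests inert, but note it only postpones the problem the article describes: the fake must still be kept faithful to the real API by hand.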
Reflective programming is anti-paradigmatic
Both the assertions and the test preparations are, in a way, observing the actual code from the inside out. The imperative programming paradigm (and, in a way, much of our way of thinking) makes it hard for the brain to switch perspectives as drastically as many real-life testing scenarios require in order to effectively cover the tested code.
Recent years have been rich with usability improvements in development tools, including the languages and core frameworks themselves. Still, while widely supported, testing does not feel built in, but added later, as an extra that’s nice to have, and mostly in the form of assertions. I think a revision of this attitude is probably due.
Tests are repetitive
The scenario: a user input that should be sanitized. The test examples: many invalid strings, some valid strings, and assertions on the results.
How many strings are enough? How many cases should the tests cover?
In a white-box approach, the answer may be: as many cases as there are exceptions hard-coded in the algorithm (which can subsequently be optimized so the examples are as few as possible). While in this approach test examples are few and maintainable, it is not conducive to good design, and white-box testing is generally recognized as cheating.
In a black-box approach, the strategy is usually intuition or some heuristics for algorithm discovery (such as the still-incomplete yet quite enlightening Transformation Priority Premise). Like shooting darts, the number of test examples needed to cover each method can be quite large without the examples differing much between themselves. This is where repetition usually begins.
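The sanitization scenario above makes the repetition visible. A sketch with a hypothetical `sanitize` function invented for the example:

```ruby
# Hypothetical sanitizer: strips tags and rejects blank input.
def sanitize(input)
  raise ArgumentError, "blank input" if input.nil? || input.strip.empty?
  input.gsub(/<[^>]*>/, "").strip
end

# Black-box examples quickly pile up, each barely different
# from the last.
[
  ["hello",                   "hello"],
  ["  hello  ",               "hello"],
  ["<b>hello</b>",            "hello"],
  ["<script>x</script>hello", "xhello"],
].each do |raw, expected|
  unless sanitize(raw) == expected
    raise "sanitize(#{raw.inspect}) != #{expected.inspect}"
  end
end

# And the invalid side accumulates its own pile of near-duplicates.
["", "   ", nil].each do |raw|
  begin
    sanitize(raw)
    raise "expected ArgumentError for #{raw.inspect}"
  rescue ArgumentError
    # expected: blank input is rejected
  end
end
```

Four valid cases and three invalid ones already, and a real-world sanitizer would demand many more, each one a small variation on the same assertion.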
Clever tools such as Cucumber make it possible for certain data in the tests to be expressed in an easy-to-read format, but the unwanted consequence is that the actual test is now even more counterintuitive, since the data being passed lives in an entirely different file. This makes debugging quite confusing too, since individual cases are harder to tell apart and to intercept at runtime.
Data mocking is a newborn discipline
Granted, there are several data mocking tools out there, but most are adequate only for specific purposes, either too narrow (populate a MySQL database with a WordPress-like setup) or too general (generate random strings of a certain length). A good data mocking tool should have several features still missing from most tools around:
- Data used in tests should be easy to save in an intuitive and actionable format.
- The developer should be able to interact with the specific test cases for each datum.
- For specific scenarios, a cascading plugins approach might be used: for example, the WordPress MySQL setup should be performed with an SQL plugin extended by a SQL.WordPress plugin.
- Rules for data generation should not be arcane: providing an ABNF grammar of the possible inputs is precisely what a coder is not able to do in the early stages of development, when valid inputs are sometimes not even vaguely defined.
- It should be possible to gather data from existing tests. It is of crucial concern for code base maintainability that the developer be able not just to generate test cases automatically and interact with each of them manually, but also to gather data from existing test cases and regenerate test cases from that data. This can then be used in a different layer of the application, such as the browser instead of the server, or in a new implementation written in a different language.
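That last feature can be sketched in a few lines of today’s Ruby; the `CaseRecorder` class and its interface are hypothetical, invented here to illustrate the gather-and-regenerate idea:

```ruby
require "json"

# Hypothetical sketch of the "gather and regenerate" feature: record the
# inputs and outputs a test actually exercised, dump them in a portable
# format, and replay them elsewhere (another layer, another language).
class CaseRecorder
  def initialize
    @cases = []
  end

  # Wrap any unit under test; every call becomes a saved example.
  def record(input, &unit)
    output = unit.call(input)
    @cases << { "input" => input, "output" => output }
    output
  end

  def dump
    JSON.pretty_generate(@cases)   # actionable, human-readable format
  end

  # Regenerate: replay saved cases against a (possibly new) implementation.
  def replay(json, &unit)
    JSON.parse(json).all? { |c| unit.call(c["input"]) == c["output"] }
  end
end

rec = CaseRecorder.new
rec.record("Hello") { |s| s.downcase }
rec.record("WORLD") { |s| s.downcase }
saved = rec.dump

# A rewritten implementation can be checked against the gathered cases.
rec.replay(saved) { |s| s.downcase }   # => true
```

Because the saved cases are plain JSON, the same file could in principle drive tests for a JavaScript port of the same logic, which is exactly the cross-layer reuse argued for above.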
Tests are hard to change
Have you ever tried to refactor a code base by changing names and roles in an extensively tested set of classes? If not: may it never happen to you. It ranks as the hardest intellectual challenge I have ever faced. I sweated, I swore, I drew diagrams, I drank a lot of coffee. I lost a lot of sleep. Mid-size refactorings happen, and the timing is not welcome: they usually happen midway through the project, when the design is supposed to have crystallized, the team is not as motivated as in the early stages and the code base has reached a non-negligible size; the stage in which it can take as much as fifteen minutes of revisiting the code just to remember what that one method did. Most times I end up going out for a walk and a latte, only to decide to rebuild the new implementation mostly from scratch, usually in a new git branch with commits full of nihilistic messages.
The thing is, when faced with the prospect of refactoring, the tests made it all worse. Tests are the ultimate reflection of the class design, but they are not as malleable as the UML or the proverbial napkin; in fact, if it weren’t for the tests, partial redesigns would be more like a headache instead of the full-scale thunderstorms they actually are.
It’s the trivial things that get you. How many time-consuming, mind-numbing, pointless-looking tasks can you look forward to when doing such a refactoring? Some of them:
- Large amounts of file renaming
- Search and replace across the entire code base
- Case-sensitive search and replace of the same word in class names, method names and references
- Comment consistency correction (for example: if the class called Dog is now called Cat, the comments should not be talking about the Cat barking)
- Picky search and replace: search for a keyword representing an element of the architecture whose role has changed, and correct each instance to the new behavior, removing it where obsolete and adding it in new places.
The potential for mistakes is huge. Hell, even to write down a roadmap of all the required changes you need to revisit the entire code base several times; and having the time and the wisdom to build a roadmap at all is a rare blessing. If you plan on doing it by the book, each little refactoring (such as a method renaming) requires a dedicated commit into version control and a run of all the available tests. It’s immensely time consuming.
The bad news is that I have faced this exact scenario several times already, and it’s quite likely that I’ll face it again in the near future. And since BDD/TDD generates repetition and cognitive effort, my very enthusiasm for keeping the code adequately covered amplifies the effort required to adapt it to changing requirements.
BDD/TDD generating extra inertia and making code less malleable is quite a paradoxical situation, since one of testing’s main objectives is to enable maintainability. Fortunately, this issue lies in the current implementation of code base structure (the file hierarchy, the verbosity of the programming languages) and not in the testing practice itself, so I can envision a time when lean projects will be doable with limited time and resources without sacrificing such a useful practice as testing.
While I’m in the mood for describing problems revealed by the struggles of BDD/TDD practice:
Modern APIs are not reflected in a class hierarchy
This is a problem that touches the very core of BDD/TDD, and of software design in general. How would you describe an API consisting partially of methods embedded in the Object superclass (such as RSpec’s) in traditional class-hierarchy terms? While it is possible, it is quite contrived (as can be seen in the actual RSpec implementation). In the end, another set of tests may be required to provide examples from a perspective more similar to the actual intended usage. This separation creates a headache for implementers, who must keep in mind the quirky, user-friendly API, all the while building a counterintuitive architecture to support it.
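A toy illustration of the mismatch, in the spirit of the `should`-style syntax older RSpec versions added to every object (the `should_equal` method here is invented for the example, not RSpec’s API):

```ruby
# Minimal sketch of a `should`-style assertion grafted onto every
# object by reopening Object, in the spirit of older RSpec versions.
class Object
  def should_equal(expected)
    raise "expected #{expected.inspect}, got #{inspect}" unless self == expected
    true
  end
end

(1 + 1).should_equal(2)          # the call site reads naturally...
"abc".upcase.should_equal("ABC")
# ...but `should_equal` now lives on Object itself, so no per-class
# test file in a traditional hierarchy is its obvious home.
```

The API is pleasant to use, yet try to answer "which class does this belong to?" in JUnit-style one-test-class-per-class terms and the contrivance becomes apparent.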
One of the principles of TDD unit testing, as enforced in JUnit, is that each class has its matching test class. This has transpired into most modern test frameworks as the standard approach, even when the classical object-oriented approach is being challenged or submitted to different paradigms. There is an issue here, and it’s not a minor one.
Cucumber implicitly aims to tackle this problem with its plain-text testing scenarios, but as I said earlier in this article, it comes with the drawback of making the actual testing code more obscure and less automatable.
So far, the only effective solution to this challenge I have found is featured in my favorite testing framework, Vows.js, which reworks the test example structure: instead of before, test and after, you get the topic (the example) and the assertions. Topics can be nested so that examples can be built in several steps. It is not a complete solution, though, because the other problems (repetition, cognitive strain, inertia) remain unsolved.
What I want to see is a next generation language
I feel that the statement of a problem is incomplete unless I try to describe at least some workings towards a viable solution. What would a new-generation BDD/TDD tool look like?
My point is that the problem lies not with the tools but with the programming paradigm itself. I imagine a language in which reflection and metaprogramming are a natural part of the design, not a hack. I imagine a language in which callbacks are not regarded as second-class citizens. I imagine a language in which the code structure is versatile enough to reflect heterodox APIs such as the popular domain-specific languages. I imagine the core library of the language having useful assertion methods, including asynchronous message expectations.
I can imagine an environment that allows assertions to be embedded in the actual code for lightweight testing while coding. I imagine tools for encapsulating test examples generated in the REPL with automatically generated meaningful descriptions.
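A faint approximation of those embedded assertions can already be sketched with today’s Ruby; the `check` helper and the `LIGHT_ASSERTS` toggle are hypothetical names invented for the example:

```ruby
# Hypothetical sketch: lightweight assertions embedded in production
# code, skipped entirely unless a flag enables them while coding.
LIGHT_ASSERTS = ENV.fetch("LIGHT_ASSERTS", "1") == "1"

def check(message = "embedded assertion failed")
  return unless LIGHT_ASSERTS
  raise message unless yield
end

def average(numbers)
  check("average needs a non-empty list") { !numbers.empty? }
  sum = numbers.sum
  check("sum should be numeric") { sum.is_a?(Numeric) }
  sum.to_f / numbers.size
end

average([1, 2, 3])   # => 2.0
```

An environment with first-class support could go further and lift such checks into generated test cases automatically, which is the tooling imagined above.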
That would be nice.