
There’s a specific kind of dread that comes with inheriting a codebase that needs significant structural work but has almost no test coverage. You know a refactor is necessary. The architecture is fighting the roadmap, and every new feature requires navigating a tangle of assumptions that nobody documented. But you also know that touching the wrong thing could silently break an API that real users depend on.
The instinct is to pause everything and write tests before touching anything. That instinct is mostly right, but the execution matters a lot.
I was brought into a team that needed to do serious structural refactoring to unlock the next phases of their roadmap. The codebase had accrued real technical debt, not through negligence but through the completely normal process of a startup shipping fast. The problem was that there was very little test coverage and no CI/CD pipeline, which meant the risk profile of any significant change was high. A bad deploy could break live APIs with no automated safety net to catch it first.
The goal wasn’t just to refactor the code. It was to create conditions under which refactoring could happen safely, and to do it without stopping feature development entirely.
The rule I established before touching any production code: write tests that describe the current behavior first. Not the desired behavior, but the current behavior, even if it’s imperfect.
This distinction matters. If you write tests for what the system should do and then refactor toward that ideal, you’re conflating two different problems: fixing broken behavior and improving structure. That’s a recipe for confusion. Better to lock in what the system does today, refactor structure without changing behavior, then address any behavioral issues separately with their own tests.
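A minimal sketch of what such a characterization test looks like in Python. The function and its quirk are hypothetical stand-ins, not code from the actual project; the point is that the test pins down today's behavior, bug and all, so a refactor cannot change it silently.

```python
# Characterization tests: describe what the code DOES, not what it should do.
# legacy_discount is a hypothetical example of inherited code with a quirk.

def legacy_discount(total: float, user_type: str) -> float:
    """Imagine this is inherited pricing code nobody fully understands."""
    if user_type == "vip":
        return round(total * 0.9, 2)
    # Quirk: an empty user type falls through to no discount at all,
    # even though the original author may have intended a default rate.
    return round(total * 0.9, 2) if user_type else total

def test_documented_behavior():
    assert legacy_discount(100.0, "vip") == 90.0
    assert legacy_discount(50.0, "regular") == 45.0

def test_locked_in_quirk():
    # Arguably a bug, but we pin it anyway. Fixing it is a separate,
    # later change with its own test and its own commit.
    assert legacy_discount(100.0, "") == 100.0
```

If the quirk later turns out to be wrong, the test gets updated deliberately, in its own change, rather than disappearing as a side effect of the refactor.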
One decision that made a meaningful difference: instead of writing tests against synthetic fixtures, I anonymized production data and used that as the test baseline.
Toy fixtures lie. They represent the happy path, the case you thought to cover when you wrote the code. Real data contains all the edge cases that nobody anticipated: the weird user state, the legacy record format, the null where you assumed a value. Those are exactly the scenarios that break during refactors.
Anonymizing production data is a bit of extra setup work, but it pays back quickly. You get test coverage that actually reflects reality, and you stop being surprised by things in production that your tests never caught.
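A sketch of the anonymization step, assuming JSON-like records; the field names are hypothetical and would need adapting to the real schema. The key property is that structure, nulls, and legacy formats survive while identities do not, and pseudonymization is deterministic so relationships between records are preserved across the dataset.

```python
import hashlib

# Hypothetical set of fields that carry personal data in this schema.
PII_FIELDS = {"email", "name", "phone"}

def pseudonymize(value: str) -> str:
    # Deterministic: the same input always maps to the same token,
    # so joins and cross-references between records keep working.
    return "anon_" + hashlib.sha256(value.encode()).hexdigest()[:12]

def anonymize_record(record: dict) -> dict:
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS and isinstance(value, str):
            out[key] = pseudonymize(value)
        else:
            # Keep everything else exactly as-is: the nulls, the legacy
            # formats, and the weird states are the whole point.
            out[key] = value
    return out
```

Run once over an export, commit the output as fixtures, and the test suite inherits every edge case production has ever produced.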
I chose to focus on high-level integration tests rather than unit tests, and it’s worth explaining why.
Unit tests are more precise and faster to run. But they require you to understand the internal structure of the code well enough to mock dependencies correctly. In a codebase you are still mapping, writing unit tests often means making assumptions about how things are wired together, assumptions the refactor may prove wrong.
Integration tests describe what the system does from the outside. They test inputs and outputs, not internal mechanics. They survive structural changes as long as behavior is preserved, which is exactly the property you need when refactoring. The coverage per hour of effort is also significantly higher: a single integration test can exercise a large surface area that would take dozens of unit tests to cover.
The tradeoff is that integration tests are slower and give less precise failure signals. When one fails, you know something broke, but not necessarily what. I accepted that tradeoff because at that stage, knowing that something broke was the primary goal. Granular diagnosis could come later, once the codebase was in better shape.
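To make the contrast concrete, here is a sketch of an integration-style test, with a hypothetical `handle_request` standing in for the system's real entry point. The test speaks only the external contract (a JSON request in, a JSON response out) and never mentions the layers underneath, so it survives any restructuring that preserves behavior.

```python
import json

def handle_request(raw_body: str) -> dict:
    """Stand-in for the API entry point; imagine it dispatches
    through several internal layers we don't want tests coupled to."""
    payload = json.loads(raw_body)
    items = payload.get("items", [])
    total = sum(item.get("qty", 1) * item["price"] for item in items)
    return {"status": "ok", "total": round(total, 2)}

def test_order_total_from_the_outside():
    # Input shaped like a real (anonymized) production request.
    body = json.dumps({"items": [{"price": 9.99, "qty": 2},
                                 {"price": 5.00}]})
    response = handle_request(body)
    # Assert on observable output only; internals are free to change.
    assert response == {"status": "ok", "total": 24.98}
```

Nothing in the test would need to change if `handle_request` were split into five modules tomorrow, which is precisely the property a unit test with mocks would lose.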
The refactoring approach I prefer when possible: keep the old code running and build the new structure alongside it. Route a subset of traffic or internal requests through the new path, compare outputs, and only switch over once you have confidence.
This is slower than a single cutover, but the risk profile is completely different. You’re never in a position where the only option is to ship the refactor and hope. The legacy code acts as a continuously running reference implementation until you are ready to retire it.
It also means the “big switch” isn’t actually big. By the time you cut over entirely, the new code has already been handling real traffic and you have already seen how it behaves.
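The pattern above can be sketched as a shadow wrapper, assuming the old and new implementations share a signature; the sampling rate and names here are illustrative, not from the actual project. The old path stays authoritative; the new path runs on a sample of traffic, and mismatches are logged, never served.

```python
import logging
import random

logger = logging.getLogger("shadow")

def shadow(old_fn, new_fn, sample_rate=0.1):
    """Wrap old_fn so new_fn is exercised on a fraction of calls
    and compared against it, without ever affecting the response."""
    def wrapper(*args, **kwargs):
        result = old_fn(*args, **kwargs)  # always serve the legacy answer
        if random.random() < sample_rate:
            try:
                candidate = new_fn(*args, **kwargs)
                if candidate != result:
                    logger.warning("mismatch: old=%r new=%r args=%r",
                                   result, candidate, args)
            except Exception:
                logger.exception("new path raised; old path unaffected")
        return result
    return wrapper
```

Once the mismatch log stays quiet under real traffic, raise the sample rate toward 1.0, then swap which function is authoritative and retire the old one.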
The end result was not just a cleaner codebase. It was a team that could move confidently again. Once the integration test suite was in place and CI/CD was running, the anxiety around major changes dropped noticeably. Developers stopped hedging and started refactoring. That shift in culture is, in my experience, worth more than any individual technical improvement.
The path to full CI/CD was clear. The path to the next roadmap phase was clear. And nothing broke in production during the transition.
It sounds like a low bar. But in a codebase with no prior safety net, getting through a major structural change without a production incident isn’t a given. It’s the result of a specific approach, and it’s repeatable.