Most Terraform problems are not really caused by Terraform itself. They appear when the operating model around infrastructure changes makes divergence easier than controlled promotion.
I saw this in a regulated cloud environment where the Terraform setup had become harder to govern over time. Separate repositories, environment-specific branches, and fragmented workflows made it too easy for environments to drift apart.
This was not just a maintainability issue. It made changes harder to track, promotion across environments less reliable, and left too much knowledge and control with the infrastructure team. As a result, self-service became harder to expand without weakening governance.
The goal was therefore not just to clean up Terraform. It was to support self-service without giving up traceability, promotion discipline, or stronger review for higher-risk changes. Teams should be able to evolve resources within a controlled scope, while more sensitive areas such as IAM and access boundaries remain under stricter review.
Design principles ๐
Keep environments aligned by default ๐
If dev, QA, and prod evolve as separate structures, they stop being meaningful stages in a promotion path. They become separate systems.
That is why I moved away from a fragmented multi-repository setup toward a monorepo model where environment-specific values were handled explicitly while the Terraform resource structure stayed shared.
If something should exist in dev but not in prod, that had to be represented intentionally in code rather than by letting environments evolve independently. That is less convenient in the short term, but much cleaner for long-term promotion and drift control.
Make promotion easier than divergence ๐
A healthy Terraform model should make it natural to start in dev and move upward. If special cases are easier than promotion, the platform will drift.
A large part of the redesign was therefore about making the promotion path more natural and deviations more explicit, so that differences between environments remained deliberate rather than accidental.
Make self-service traceable ๐
Self-service only works when infrastructure changes remain traceable. Each change needed to be linked to a ticket with enough context to explain what was changing and why. Without that, reviews become harder and traceability degrades over time.
The same applied to pull requests. Changes were not meant to be approved by the same person who submitted them. Some teams could operate with more autonomy, while higher-risk changes still required infrastructure review depending on the project and resource scope.
Why the execution layer matters ๐
Treat the execution layer as a security boundary ๐
This is often underestimated. The system that runs Terraform across environments is not just a delivery convenience. It is a high-trust control point with broad privileges and, in some cases, access to sensitive internal networks.
That was one of the limits of the original Terraform Cloud model. Per-user licensing made broader self-service harder to support economically, and once Terraform needed to operate inside internal networks, keeping that setup would have required a much more expensive enterprise setup than seemed justified.
Before moving to Atlantis, I rebuilt the pipeline flow in Azure DevOps with those constraints in mind. Plans were exported, encrypted, and only applied later through a controlled approval flow. That way, the reviewed plan remained the one that was actually executed.
Atlantis came later to support a more collaborative workflow. But the main point was not the tool itself. It was to treat Terraform execution as a security boundary, not just as a delivery mechanism.
Keep environment boundaries meaningful ๐
If engineers can change multiple environments too easily, the separation between those environments starts to lose value. That weakens staged promotion and makes mistakes more costly.
To reduce that risk, guardrails were added to limit unsafe multi-environment changes and to make access rights clearer through team-based controls. The purpose was to keep environment boundaries meaningful while still allowing self-service within a defined scope.
Make structural change practical ๐
A better Terraform structure is not enough on its own. The platform also needs a practical way to move toward it when refactoring becomes necessary at scale.
That is why part of the redesign also involved tooling to make large Terraform state refactors more practical at scale. I published that tool here: terraform-move-helper.
What improved ๐
Infrastructure changes became easier to trace, review, and promote across environments. More teams could participate in infrastructure workflows without weakening control over higher-risk changes. The platform also moved away from a more restrictive and costly setup, while making long-term drift harder to accumulate.
Most importantly, Terraform changes now had a shared interface with the teams: the pull request. It became the place where changes could be reviewed, explained, and challenged, with closer infrastructure involvement when the risk was higher.