Episode 11 — Backout Plans vs. Fail Forward: Recovering from Bad Changes (1.2)
In this episode, we look at what happens when a change goes wrong and you have to decide how to recover. This is a very practical part of change management because even careful teams make mistakes, and even well-tested changes can behave differently in a live environment. A patch may break an application, a firewall rule may block the wrong traffic, a cloud setting may remove access, or a deployment may cause a service to fail. When that happens, you need more than hope and panic. You need a recovery decision. In security and operations, two common ideas are a backout plan and fail forward. A backout plan returns the environment to a known-good state. Failing forward means continuing ahead with a fix instead of reversing everything. Both can be valid, but they fit different situations.
Before we continue, a quick note. This audio course is part of our companion study series. The first book is a detailed study guide that explains the exam and helps you prepare for it with confidence. The second is a Kindle-only eBook with one thousand flashcards you can use on your mobile device or Kindle for quick review. You can find both at Cyber Author dot me in the Bare Metal Study Guides series.
A bad change does not always mean someone was careless. Sometimes the environment is complex, dependencies are hidden, test systems do not perfectly match production, or a small difference creates a large effect. You may have a change that worked during testing but fails when real users, real data, and real traffic are involved. You may also see a change that technically succeeds but creates a security problem, such as weakening access control or breaking monitoring. The key is not to pretend every failure can be avoided. The key is to plan for failure before it happens. That way, when something goes wrong, the team already knows the options, the decision points, and the people responsible for acting. Recovery is much safer when it is planned instead of improvised under pressure.
A backout plan is the planned path for undoing a change and returning to the previous stable condition. You can think of it as knowing how to get back to the last safe place. If a software update breaks an application, the backout plan may restore the previous version. If a firewall rule blocks important traffic, the plan may remove that rule and return to the last approved rule set. If a configuration change causes errors, the plan may restore the earlier configuration. The phrase known-good state matters because you are not just going backward randomly. You are returning to a condition that was already working and understood. A good backout plan reduces uncertainty because it gives the team a tested way to stop the damage and restore normal service.
A backout plan should exist before the change begins, not after the first signs of failure. This is where many weak change processes fall apart. Someone makes a change, it fails, and only then does the team start asking how to undo it. By that point, people may be stressed, users may be affected, and the available recovery options may be unclear. A strong plan explains what will be backed out, how it will be backed out, who will perform the action, how long it should take, and how success will be confirmed. It should also identify the point where the team must stop trying to make the change work and begin recovery. Without that decision point, people may keep troubleshooting too long and make the outage worse.
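To make those elements concrete, here is a minimal sketch of what a backout plan record might capture, including the decision point where troubleshooting must stop and recovery must begin. The field names, the change identifier, and the time limits are hypothetical placeholders, not a standard schema.

```python
# Hypothetical backout plan record. Field names and values are
# illustrative only; a real organization would use its own change
# management tooling and identifiers.
backout_plan = {
    "change_id": "CHG-1042",  # hypothetical change ticket number
    "what": "Remove the new firewall rule added by this change",
    "how": "Restore the last approved rule set from the configuration backup",
    "who": "Network on-call engineer",
    "expected_minutes": 15,
    "success_check": "Application reaches the database; no new denies logged",
    # Decision point: stop troubleshooting and begin backout once
    # this much time has passed without a working fix.
    "abort_after_minutes": 30,
}

def must_begin_backout(elapsed_minutes: int, plan: dict) -> bool:
    """Return True once troubleshooting has run past the plan's
    decision point, meaning the team should stop trying to make the
    change work and start recovery instead."""
    return elapsed_minutes >= plan["abort_after_minutes"]

print(must_begin_backout(10, backout_plan))  # False: keep troubleshooting
print(must_begin_backout(45, backout_plan))  # True: begin the backout
```

The point of writing the decision point down ahead of time is exactly what the paragraph above describes: without it, people tend to keep troubleshooting too long and make the outage worse.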
Fail forward is different. When you fail forward, you do not return to the old state. You continue ahead by applying a fix, adjustment, or follow-up change that makes the new state work. This approach can make sense when going backward would be slower, riskier, or impossible. For example, if an application deployment has already changed data in a way that cannot easily be reversed, rolling back the application alone may create more problems. If a security patch fixes a serious active vulnerability, removing the patch may reopen a dangerous weakness. In those cases, the better decision may be to fix the new problem while staying on the new version. Failing forward is not stubbornly pushing ahead no matter what. It is choosing the safer recovery path when forward repair is better than rollback.
A patch failure is a useful situation to compare the two choices. Suppose a patch is installed on a production server and an important application stops working. If the patch fixed a low-risk issue and the outage is affecting users, rolling back may be the best immediate choice. The team can restore the previous version, bring the application back, investigate what went wrong, and plan a better patch attempt later. But if the patch closed a critical vulnerability that attackers are actively exploiting, rolling back may expose the organization to serious risk. In that case, failing forward may be more appropriate. The team might adjust a setting, update a dependency, or apply a vendor fix while keeping the security patch in place. The right answer depends on both operational impact and security risk.
Firewall rule mistakes also show the difference clearly. Imagine a new firewall rule is added to block suspicious traffic, but it accidentally blocks a business application from communicating with a database. A backout plan might remove the new rule and restore the previous rules so the application can work again. That may be the right move if the outage is severe and the original risk can be managed temporarily another way. But sometimes failing forward is better. If the original firewall change was needed to stop a dangerous exposure, the team may choose to adjust the rule instead of removing it completely. They might narrow the rule, correct an address range, or allow only the required traffic. The goal is to fix the mistake without losing the security benefit that motivated the change.
Application deployments often make the decision more complicated. A deployment may include new code, database changes, configuration changes, and new dependencies. If the new version fails immediately and no data has changed, rollback may be straightforward. The team can return to the earlier version and restore service quickly. But if the deployment changes the database structure, processes new transactions, or modifies user data, rolling back may be dangerous. The old application version may not understand the new data format. Restoring an old database backup may erase legitimate work that happened after the deployment. In that kind of situation, failing forward may be the safer option. The team may need to patch the new version, correct the migration, or repair the configuration while preserving the new data state.
The decision between backout and fail forward should consider time, risk, data, security exposure, and business impact. Time matters because users may be affected during the recovery process. Risk matters because one option may reduce an outage but increase security exposure. Data matters because some changes can alter information in ways that are difficult to reverse. Security exposure matters because rolling back a protective change can reopen a vulnerability or weaken a control. Business impact matters because the organization may tolerate one kind of disruption better than another. You should not think of backout and fail forward as one always being safer than the other. The safer choice is the one that reduces overall harm in that specific situation. Good change planning helps you make that decision before emotions take over.
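The reasoning in this section can be sketched as a small decision helper. This is an illustrative simplification under stated assumptions, not a standard formula: it treats reopened security exposure and irreversible data changes as hard constraints that force forward repair, and otherwise compares estimated recovery times. The parameter names are hypothetical.

```python
# Illustrative decision sketch, not a standard algorithm. It encodes
# the two hard constraints discussed above (security exposure and
# irreversible data changes), then falls back to comparing estimated
# recovery times. All inputs are assumptions supplied by the team.
def choose_recovery(rollback_reopens_vulnerability: bool,
                    data_changed_irreversibly: bool,
                    rollback_minutes: int,
                    fix_forward_minutes: int) -> str:
    """Return 'backout' or 'fail_forward' for a failed change.

    If rolling back would reopen an actively dangerous vulnerability,
    or discard data that cannot be restored, forward repair is the
    only safe path. Otherwise prefer whichever route is expected to
    restore service sooner.
    """
    if rollback_reopens_vulnerability or data_changed_irreversibly:
        return "fail_forward"
    if rollback_minutes <= fix_forward_minutes:
        return "backout"
    return "fail_forward"

# A patch that closed a critical, actively exploited flaw but broke
# an application: keep the patch and repair forward.
print(choose_recovery(True, False, 10, 45))   # fail_forward

# A low-risk patch that broke an application and is quick to reverse:
# restore the previous version.
print(choose_recovery(False, False, 10, 45))  # backout
```

A real decision also weighs business impact and uncertainty that a boolean cannot capture, which is why the paragraph above stresses that neither path is always safer; the sketch only shows how the factors interact.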
Validation is required no matter which path is chosen. If you roll back, you need to confirm that the environment really returned to a working state. The application should function, users should be able to complete necessary tasks, security controls should still operate, logs should still be collected, and any temporary access should be removed. If you fail forward, you also need validation. The fix should solve the immediate problem without creating a new one. The team should confirm that the original security goal was preserved, that users can work, and that monitoring shows normal behavior. Recovery is not complete just because the first visible error disappears. You need evidence that the environment is stable enough to trust again.
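The validation pass described above can be sketched as running a set of named checks and reporting anything that failed. The check names here are hypothetical placeholders for real probes an environment would supply, such as health endpoints, log pipeline status, or control tests.

```python
# Illustrative post-recovery validation sketch. The check names are
# hypothetical stand-ins for real probes; each check is a callable
# that returns True when the environment passes that test.
def validate_recovery(checks: dict) -> list:
    """Run every named check and return the names that failed.

    Recovery is only complete when this list is empty: the first
    visible error disappearing is not enough evidence on its own.
    """
    return [name for name, check in checks.items() if not check()]

checks = {
    "application_responds": lambda: True,
    "security_controls_active": lambda: True,
    "logs_flowing": lambda: False,          # monitoring still broken
    "temporary_access_removed": lambda: True,
}

print(validate_recovery(checks))  # ['logs_flowing']
```

In this sketch the application works for users, but the failed logging check shows the environment is not yet stable enough to trust, which is exactly the trap the paragraph above warns about.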
Communication is also part of recovery. When a change goes wrong, people need clear information. Users may need to know that a service is unavailable or that access may behave differently. The help desk may need to know what symptoms users will report. Business owners may need updates because the outage or security risk affects their work. Security teams may need to watch for related alerts if controls were changed during recovery. Communication should be calm, accurate, and honest about what is known. You do not need to share every technical detail with every audience, but you do need to prevent confusion. Poor communication can make a bad change feel worse because people start guessing, duplicating work, or creating workarounds that increase risk.
Documentation matters after a rollback or a fail-forward fix. The change record should show what was attempted, what failed, what decision was made, who approved the recovery path, and what was done to restore stability. If the team rolled back, the record should explain what state was restored and what still needs to be investigated before the change is tried again. If the team failed forward, the record should explain what fix was applied and whether the environment now differs from the original plan. This is not just administrative cleanup. Documentation helps future troubleshooting, audits, security reviews, and lessons learned. If a similar problem appears later, the record can save time and prevent the same mistake from being repeated.
You should also pay attention to temporary measures during recovery. During a failed change, someone may open access, disable a control, pause monitoring, bypass a check, or grant extra permissions to get service restored quickly. Sometimes a temporary measure is necessary, but it should be tracked and removed when the emergency is over. Temporary security exceptions are risky because they are easy to forget once the visible problem is fixed. A backout plan should include restoring security settings, not only restoring application function. A fail-forward plan should include closing any temporary gaps created during troubleshooting. The system is not truly recovered if it works for users but quietly remains less secure than intended. Recovery should bring back both service and protection.
After the situation is stable, the organization should learn from what happened. That review should not be about blaming one person by default. It should ask what the team knew before the change, what the team missed, whether testing was realistic, whether impact analysis found the right dependencies, whether the maintenance window was long enough, and whether the recovery plan worked. If the backout plan failed, that is a serious lesson. If failing forward took longer than expected, that is also useful information. Maybe a dependency needs better documentation. Maybe test systems need to match production more closely. Maybe approval criteria need to include additional security checks. A failed change can improve future security if the organization turns it into better planning instead of just moving on.
The conclusion is that backout plans and fail-forward decisions are both about recovering safely when a change does not behave as expected. A backout plan returns you to a known-good state when reversing the change is the best way to reduce harm. Failing forward keeps the new direction and applies a fix when going backward would be riskier, slower, or impossible. Patch failures, firewall rule mistakes, and application deployments can all require different choices depending on security risk, downtime, data changes, and business impact. As you study this topic, do not memorize rollback as always good or fail forward as always better. Ask which path restores stability, protects data, preserves security, and reduces overall damage. That is the decision-making skill Security Plus wants you to build.