Navigating Turbulence in Production

It is not hard to understand why digital engineers get a little bit nervous when we start to talk about chaos engineering. ‘Shift Left’ has taught engineering teams the value of testing and fixing bugs as early as possible in the digital lifecycle, especially if they have ever spent significant time unraveling problems that were not discovered early.
So, when we discover faults earlier on, that must mean better quality software, and fewer late nights for hard-working development teams fixing issues, right? If only it worked that way.
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” (Leslie Lamport)
With the rise of more complex software, IoT, cloud, distributed systems, and microservices, a new approach to quality and resilience is required to account for the many permutations and interdependencies between all the constituent parts. This is where chaos engineering comes in. Traditional software testing verifies the code is doing what we want it to (and continues to be an essential part of digital engineering). Chaos engineering, meanwhile, is a way of testing that the entire system is doing what we want it to, and code is just one part of the mix. To do this effectively, we must test the system in production. This is because many other factors, like state, inputs, and how external systems behave, all play a part in the way a system runs.
This complexity has given rise to the idea of “dark debt,” referring to the unforeseen anomalies that happen in complex systems when different parts of the software and hardware interact with one another in ways we can’t predict. The term borrows from the concepts behind “technical debt” (IT) and “dark matter” (space) to suggest the inevitable, unseen complications that arise in complex systems. This is exactly what chaos engineering seeks to identify. How that turbulence in production is managed is a critical part of the planning that needs to go into every experiment. Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system.
No surprises
The approach Infostretch advocates is one we outlined in the first of our chaos engineering blogs. Talk to co-workers, explain your plans, and don’t do anything if you suspect it will fail. (In that case, fix the weakness). Chaos engineering is no substitute for resiliency planning and patterns. Instead, organizations embarking on chaos engineering should carefully create hypotheses they wish to prove, considering how to limit their blast radius. The meticulously planned reality of chaos engineering is a far cry from how it was once described by Amazon’s Werner Vogel, “Break everything to see how your systems respond.”
Small is beautiful
Start small and limit the blast radius of your experiments. That includes taking into consideration when the experiment runs, and which departments and resources are available after the experiment runs. By now, I hope it is clear that when we talk about chaos engineering, it’s never about cutting a cable or unplugging a machine randomly to see what happens. The goal is to prove a hypothesis. Even when fault tolerance is within acceptable margins, there are always insights to be gained from examining how the system responded.
The environment matters
If running experiments in a full production environment feels like a step too far into the abyss, that’s ok. For an organization’s baby steps in chaos engineering, production may be too risky. In this case, they should start in a different environment, but one that is as close to the production environment as possible. Quite simply, the findings will not be sufficiently relevant to shed light on potential failures of the system unless the environment is very similar.
Keep going
Software and systems are continuously being tweaked, so chaos engineering experiments should mirror this. It is not safe to assume that if a system responded to a fault injection test (FIT) in a particular way a month ago, the same holds true today. Many of these experiments can be automated, which enables engineers to focus on increasing the scope, intensity, and variety of tests.
Expanding efforts
Once you’ve tested the system for one type of fault, it’s time to adapt the hypothesis. It may also be time to try other hypotheses. Organizations that embark on chaos engineering sometimes get “stage fright” after the initial few tests, especially if these have been fairly minor. The thinking goes a little like this, “I don’t think there’s a problem in service X, but it’s too big a deal to risk.” Wrong!! Remember dark debt and the unforeseen anomalies inherent in complex systems? As Nora Jones from the original Netflix chaos engineering team has said, “Chaos engineering doesn’t cause problems. It reveals them.” Instead of getting cold feet when it matters most, organizations should absolutely tackle the big, important services, but do so in a careful, cautious way. When it comes to improving resiliency and confidence in systems, knowledge is power.
What’s your toughest digital challenge?

Place your order
(550 words)

Approximate price: $22

Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
$26
The price is based on these factors:
Academic level
Number of pages
Urgency
Basic features
  • Free title page and bibliography
  • Unlimited revisions
  • Plagiarism-free guarantee
  • Money-back guarantee
  • 24/7 support
On-demand options
  • Writer’s samples
  • Part-by-part delivery
  • Overnight delivery
  • Copies of used sources
  • Expert Proofreading
Paper format
  • 275 words per page
  • 12 pt Arial/Times New Roman
  • Double line spacing
  • Any citation style (APA, MLA, Chicago/Turabian, Harvard)

Our guarantees

Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.

Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

Read more

Zero-plagiarism guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

Read more

Free-revision policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.

Read more

Privacy policy

Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.

Read more

Fair-cooperation guarantee

By sending us your money, you buy the service we provide. Check out our terms and conditions if you prefer business talks to be laid out in official language.

Read more
Open chat
1
You can contact our live agent via WhatsApp! Via + 1 4129036714

Feel free to ask questions, clarifications, or discounts available when placing an order.

Order your essay today and save 20% with the discount code SOLVER