I read a lot of discussions about complexity in networking and IT today that include a large amount of FUD. Topics range from “we’ll all lose our jobs because abstraction” to “you can’t fix complexity” to “welcome robot overlords” ;-)
Complexity is something that may be easy to move, even easier to increase, and hard to remove.
I admit I still haven’t finished the book, so it is possible that parts of this post are already covered there. I’ll know soon.
An IT infrastructure has a complexity that is embedded in it and strictly correlated to technical choices, tech debt, available technology, constraints and more.
Those who manage the infrastructure fight every day with the Infrastructure Complexity:
Move complexity to… consultants
Hire a bunch of consultants, pay them $bigmoney to manage the complexity, get a bonus! It doesn’t work well because consultants have more than a few incentives to actually grow complexity.
Consultants with the right skills must be found, onboarding is not always fast, and they’re hard to replace when they leave. Somebody must also manage the consultants and be able to judge whether they’re staying within the scope of the project and providing actual value.
So the total system now has:
Total system complexity = $initialComplexity + $consultantsComplexity
In the most optimistic scenario the second addend is less than the first. No one should be surprised to know that this is often not the case.
Move complexity to… the Cloud!
The cloud looks like a black hole able to suck in all the complexity and make it disappear. Experience shows that moving to the cloud reduces some kinds of complexity but creates new ones. The cloud must be managed, costs must be monitored, moving data between clouds is a new problem, and a new kind of security challenge must be addressed with skills and products.
Total system complexity = $initialComplexity + $cloudManagementComplexity
$initialComplexity includes server and storage management and all the related physical requirements (space, power, cooling, physical security, etc.).
$cloudManagementComplexity requires some skills that may not be easily available today. Hiring consultants, at least initially, may be the only option.
Moving to the cloud has one big advantage: the process of moving part of IT outside may force a pre-migration process that often includes the two methods to remove complexity I’ll discuss later.
Move complexity to… the new abstraction layer!
Rule 6 of RFC 1925 is always valid:
(6) It is easier to move a problem around than it is to solve it.
Abstraction layers like Intent Based Networking or NAPALM create an additional layer with the promise of reducing complexity. What they actually do is create a black box around the existing complexity and expose a simpler interface.
In this case the complexity moves behind the abstraction product but, again, it is not removed.
Total system complexity = $initialComplexity + $abstractionToolComplexity
We can consider consultants and the cloud as just other kinds of abstraction layer used to move complexity to another place, keeping in mind that each of them creates new complexity on top of the initial one.
The immediate advantage is that the perceived complexity is now smaller than before.
But what actually happened is that the overall system complexity is now the sum of Infrastructure Complexity and Abstraction Complexity.
The final system has more entropy than it had before.
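To make the distinction between perceived and total complexity concrete, here is a toy model in Python. It is only a sketch: the class, the function names and the numbers are made up for illustration, and complexity is counted in arbitrary units, not measured in any real way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractionLayer:
    name: str
    own_complexity: int        # complexity of running the layer itself (arbitrary units)
    interface_complexity: int  # what the user of the layer still has to deal with


def total_complexity(infrastructure: int, layer: AbstractionLayer) -> int:
    # Nothing is removed: the layer's own complexity is added on top.
    return infrastructure + layer.own_complexity


def perceived_complexity(layer: AbstractionLayer) -> int:
    # Users only see the simpler interface exposed by the black box.
    return layer.interface_complexity


# Illustrative values only.
infrastructure = 100
layer = AbstractionLayer("abstraction tool", own_complexity=30, interface_complexity=20)

print("perceived:", perceived_complexity(layer))          # smaller than before
print("total:", total_complexity(infrastructure, layer))  # larger than before
```

The user sees 20 units where there used to be 100, but somebody still has to own all 130.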
Can complexity be moved at zero cost?
Complexity always increases
The behavior of entropy is well described by the Second Law of Thermodynamics. In many respects complexity works the same way.
Moving complexity from one place to another always increases the total complexity of the system. With enough money it is possible to reduce the amount of perceived complexity, this only demonstrated that
Can automation fix complexity?
Automation tools like Ansible move complexity from CLI to YAML but they don’t fix the complexity problem. They actually require a simple and standardized environment to perform well.
Are abstraction layers bad?
The answer in general is “no”. An abstraction layer, when properly implemented, reduces the complexity perceived by its users and lets more skilled engineers handle the actual complexity.
If the automation layer adds too much complexity there’s a point where it doesn’t make sense anymore.
Each abstraction layer should be analyzed and evaluated before implementation.
My personal example: after working with Ansible, I’m now working more and more with Python. I understand Ansible can be a great tool for some uses, but in my case I prefer to have more control and visibility. It’s a matter of use cases and of the skills you have or are willing to grow.
Is complexity bad?
Complexity isn’t good or bad per se. The process that introduces complexity to a system can be good or bad.
Complexity that is carefully understood and accepted can be introduced in a system by a designer. The same designer could remove previously introduced complexity when a simpler solution is available. This is good.
Complexity created as a temporary fix that will stay forever or, even worse, created without realizing it, is bad. The final system will be fragile and almost impossible to understand, and good luck to whoever has to manage it.
How to reduce Complexity
Possible solutions to reduce complexity are simple in concept but hard to apply. It’s always easier to have good ideas than to actually apply them.
Simplifying an infrastructure is a trade-off: make it too simple and you lose flexibility, while not simple enough is just another way to say complex. No general rule applies here, but an experienced designer can help.
A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. - Antoine de Saint-Exupery
Standardization works with simplification and has an immediate impact on reducing complexity.
I’m a big fan of the Pareto Principle, so I’d suggest starting with standardization to get the most results with little effort.
An example: a $manager says a new branch site will open in a month. People are already hired, the building is already rented, and IT infrastructure is required asap.
How to be prepared for that request? Create a standard template for a new site with a BOM, IP addressing, firewall rules, new domain user creation rules, WAN connectivity (MPLS, FlexVPN, etc.) and so on. The template should include the average procurement time for each item in the list, to set a realistic expectation.
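A site standard like this can be written down as data. The sketch below is only an illustration: the supernet, the VLAN roles, the BOM items and the lead times are all made-up values, and it relies on nothing beyond Python’s standard `ipaddress` module. It derives the addressing for site number N and the earliest realistic delivery time from the slowest item in the BOM.

```python
import ipaddress

# Hypothetical standard: every site gets one /22 out of a company supernet,
# split into fixed-purpose /24s. All names and ranges are illustrative.
SUPERNET = ipaddress.ip_network("10.64.0.0/14")
SITE_PREFIX = 22
VLAN_ROLES = ["users", "voice", "servers", "management"]

# Bill of materials with average procurement time in days (made-up values).
BOM = [
    {"item": "access switch", "qty": 2, "lead_days": 21},
    {"item": "firewall", "qty": 1, "lead_days": 30},
    {"item": "WAN circuit", "qty": 1, "lead_days": 45},
]


def site_plan(site_id: int) -> dict:
    """Return addressing and realistic lead time for site `site_id` (0-based)."""
    site_blocks = list(SUPERNET.subnets(new_prefix=SITE_PREFIX))
    block = site_blocks[site_id]
    # The four /24s inside the site block are assigned to roles in a fixed order.
    vlans = dict(zip(VLAN_ROLES, block.subnets(new_prefix=24)))
    return {
        "block": block,
        "vlans": vlans,
        # The site cannot be ready before its slowest item arrives.
        "ready_in_days": max(entry["lead_days"] for entry in BOM),
    }


plan = site_plan(3)
print(plan["block"])
for role, subnet in plan["vlans"].items():
    print(f"  {role}: {subnet}")
print("earliest delivery (days):", plan["ready_in_days"])
```

With these example values the answer to the $manager is immediate: the addressing is already decided, and the WAN circuit makes “a month” unrealistic.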
Once a standard is created it is simple to generate device configs via template (with Jinja2) and deploy them (with Ansible).
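As a rough idea of what generating a config from a template looks like, here is a minimal sketch. To keep it free of dependencies it uses `string.Template` from the Python standard library as a stand-in for Jinja2. The idea is the same, only the placeholder syntax differs (`$hostname` here, `{{ hostname }}` in Jinja2). The config lines, variable names and values are hypothetical.

```python
from string import Template

# Hypothetical standard config for a branch access switch.
SWITCH_TEMPLATE = Template(
    "hostname $hostname\n"
    "interface Vlan$mgmt_vlan\n"
    " ip address $mgmt_ip $mgmt_mask\n"
    "ip default-gateway $gateway\n"
)


def render_switch_config(site: dict) -> str:
    # substitute() raises KeyError on a missing value, so an incomplete
    # site definition fails loudly instead of producing a broken config.
    return SWITCH_TEMPLATE.substitute(site)


# Illustrative values for one site.
site = {
    "hostname": "branch42-sw01",
    "mgmt_vlan": 99,
    "mgmt_ip": "10.64.15.2",
    "mgmt_mask": "255.255.255.0",
    "gateway": "10.64.15.1",
}

config = render_switch_config(site)
print(config)
```

The per-site dictionary is the only thing that changes from one branch to the next, which is exactly what makes the deployment step easy to hand over to a tool.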
Examples like this often generate many objections because IT professionals still think they manage wonderful snowflakes.
Sometimes it is just a matter of mindset: simplicity and standardization of an IT infrastructure should be an element of design, included at each redesign or subsequent evolution.
The infrastructure reflects the mindset of its creator and is subject to many political forces. If the designer has to give up part of the simplicity or break a standard every time a feature request arrives, the whole system will soon become complex. Stand firm and do not lose sight of the long-term goal of keeping complexity low and improving infrastructure resiliency.
When complexity moves it generates more complexity, just as any movement of energy generates entropy. The trade-off should be carefully evaluated to make sure that moving complexity is worthwhile, and to understand how much additional effort and technical skill the move itself will require.