Lessons learned from the largest transformation projects at IBM
Enterprise software systems, regardless of whether they are fully home-grown, built on third-party products, or a mixture of both, inevitably age like anything else in this universe. This phenomenon is referred to as bit rot, code rot, software erosion, software rot or software [system] decay.
Software decay is either a slow deterioration of software quality over time or it is diminishing responsiveness that will eventually lead to software becoming faulty, unusable, or in need of upgrade. This is not a physical phenomenon: the software does not actually decay, but rather suffers from a lack of being responsive and updated with respect to the changing environment in which it resides.
[Software rot @ wikipedia]
About Thomas Reinecke: In my roles as Chief Architect on several key IBM-internal transformation projects over the last 5 years, I’ve had unique opportunities to co-lead and influence some of the most complex, comprehensive and impactful business transformations at IBM, for example the transformations of Support, Sales and the Business Partner Ecosystem. I have always been a very curious and reflective person, and so I have been trying to understand the mechanisms behind the decay of software for some time now. I’m still an IT engineer at heart, so I’m mostly interested in the practical side of things. However, we will start with a bit of theory before I share my own real-world experiences.
The most relevant questions related to software system decay seem to be:
- What are the mechanics of aging and the reasons for it?
- Can aging be influenced, and how?
- What are the consequences of aging and the costs related to it?
This article will answer these questions.
This chapter provides the theory and scientific concepts of software decay. I recommend the articles Software Entropy (wikipedia), Software Rot (wikipedia) and Dealing With The Inevitability of Software Entropy — Is Software Equilibrium Attainable (medium).
Maintaining order and structure is a costly business, and the laws of physics seem to be against us, in the form of software entropy. Based on [Lehman-Belady 1985]…
- a computer program that is used will be modified
- when a program is modified, its complexity will increase, provided that one does not actively work against it
- software entropy increases with the accumulation of technical debt
- increasing entropy leads to increasing disorder, increasing error rate and increasing costs and time for maintenance and improvements, eventually leading to the economic death of the system
You may also find a more philosophical perspective on entropy interesting…
Like our own decay, it’s safe to say that rotting of software systems is inevitable, whether we like it or not. What is important, however, is to know about accelerating and decelerating factors, and that some of them can be directly influenced by us as we plan, build or maintain software systems.
Take the following list (roughly sorted by relevance) as a collection of my personal lessons learned, drawn from real-world experience:
Level of functional complexity
In general: the lower, the better. This aspect is driven by the number of personas and use-cases that the system has to serve. Time is the largest influencer: we can assume that a system that is used will be modified, so the initial state usually has the lowest complexity of the whole lifecycle, especially when the system was started as an MVP. My top recommendation here is to always challenge upcoming business requirements, because the more a system takes on, the faster it ages. Not all requirements are business critical, and many are even invalid. Requests for deviations from standardized business processes for specific units of the business in particular should raise a red flag. The more one-offs are accepted, the faster the decay. Establishing an empowered Architecture Review Board (more on that later) with a strong business architect overseeing the process area can be valuable and slow down decay.
I’ve seen situations where personas and use-cases moved out of an existing system, but this mostly meant that a legacy system was drained and migrated to a new one. In the rare scenarios where only a limited reorganizing move happened, the risk was high that the code and config supporting the moved functionality was not properly cleaned up, usually for risk and cost reasons, which leaves tech debt behind.
We could talk about this for hours… Build vs. Buy, the latest cool shit vs. established top dogs, on-prem vs. hybrid or public cloud, a mix of both, development languages, frameworks, container solutions and so on. In theory, a very rich and exciting set of options, but the reality is different:
Foundational technology decisions are often driven by corporate strategy, large vendor deals and executive opinion. I have seen examples where such decisions were significantly influenced by a smaller senior technical leadership group. In general however, it should be assumed that the list of potential options is drastically condensed by “we have always done so” (Build) or by “we’ll radically shift and use a vendor” (Buy).
To slow down decay from the first moment, make sure the resources and skills available in your organization fit your technology choices. The aging rate will be much higher if you have to start a project on a technology stack that the technical people in your org are not familiar with, simply because avoidable mistakes are more likely to happen, especially in the very early days of a project.
Level of Standardization
In general, the more you invest in standardization, the more decay can be slowed down. However, this comes at a high price, especially for “try fast, fail fast” approaches. Since we focus on decay though, here are the areas worth looking at:
- segmentation: if you need to deploy multiple offerings onto the same platform (e.g. a Salesforce deployment with external and internal personas), make sure that code, config, object model, storage, API endpoints and so on are properly segmented by offering, using prefixes on the element names. This helps everyone who may ever be involved in the project to see where elements belong. This may not be that simple for shared elements. Segmentation can also be applied to larger features
- integrations: obviously, the fewer the better. But even if you need to integrate a large number of subsystems, as is typically the case in larger organizations, hub solutions and orchestrators can be used to structure a tangled spider web of integrations, from a communication flow and technology perspective. It should be noted, however, that integration hub systems are usually not very popular at the business or executive level, as they usually have only an indirect impact on the business.
- source code & development: the most obvious point where wrinkles show up early is the source code. At the time this article was written, source code was still mostly written by humans, and the human factor brings different skills, coding styles, and quality expectations to the code. The larger and more diverse the development organization is, the more the code deviates from a common style, which increases the aging rate over time and leads to earlier decay. I recommend using a standardized linting process (which can be customized to your needs), setting standards for code reviews, testing, and code documentation, and enforcing those standards through automation in the CI/CD pipeline. Another recommendation, very similar to hub solutions, is to invest in libraries and core services that can centralize functions that have broader usage patterns. Proper naming conventions for objects, classes, methods and exceptions can also help to drastically slow down decay.
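As a rough illustration of how the segmentation rule could be enforced through automation, here is a minimal Python sketch that flags element names lacking an offering prefix. The prefixes (`sup_`, `sal_`, `bpe_`, `shared_`) and the function name are purely hypothetical and not taken from any real project; a real setup would define its own convention and read the element names from its platform metadata.

```python
# Hypothetical prefixes, one per offering, plus one for shared elements.
OFFERING_PREFIXES = ("sup_", "sal_", "bpe_")
SHARED_PREFIX = "shared_"


def find_unsegmented(names, prefixes=OFFERING_PREFIXES, shared=SHARED_PREFIX):
    """Return the element names that carry neither an offering prefix nor the shared prefix."""
    allowed = tuple(prefixes) + (shared,)
    # str.startswith accepts a tuple and matches if any prefix fits.
    return [name for name in names if not name.lower().startswith(allowed)]


elements = ["sup_CaseRouter", "sal_QuoteBuilder", "shared_AuditLog", "InvoiceHelper"]
for offender in find_unsegmented(elements):
    print(f"missing offering prefix: {offender}")
```

A check like this could run as one step of the CI/CD pipeline and fail the build when the returned list is not empty.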
Architecture & Ownership
If we assume that business requirements are a major reason for the aging rate, we can consider architecture as its counterpart. However, this requires an organizational setup where the Architecture Review Board is empowered to review, challenge and even stop implementation proposals. In an ideal world, there is one product owner per offering (for smaller setups) or per capability (for larger projects) who translates between the business and the project team and has a development squad. There should also be multiple architects involved in the project who specialize in their respective areas, such as data architects, integration architects, business architects, and a lead architect who oversees the entire offering. Product owners would work with one or more assigned architects to drive the creation of proposed solutions to the requirements they receive through a process we call “elaboration”. In this process, requirements are translated into personas/user stories and changes to the existing architecture, object model, required configuration and implementation details are specified, keeping some goals in mind:
- The proposed solution must comply with the architectural principles
- Follow KISS (Keep it simple, stupid): Over-engineering is always much easier than reducing a proposal to the absolute minimum including reuse of existing capabilities, which requires a lot of knowledge about the implementation
- Take on no or very little debt and keep an eye on software entropy
Ownership is another key driver of the pace at which a software system rots. The above setup works great with stable ownership, where each PO and squad owns the code and configuration of their respective capabilities, ideally seasoned with some passion for quality, design and development principles. However, if it is assumed that POs and squads are easily replaceable and new requirements are simply thrown at whichever squad currently has spare capacity, rapid decay of the whole system is guaranteed. Who would be enthusiastic about “their baby” and its quality if every day someone else could step in and change it? Ownership could be a mixture of vertical (by capability, e.g. “checkout feature”) and horizontal (by technology, e.g. “notification framework” or “REST endpoint exposure” used across multiple capabilities). The biggest challenge is to make sure that exactly one squad owns each piece of code, not several and not none.
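The “exactly one owner” rule lends itself to a simple automated check. The following Python sketch assumes a hypothetical list of code areas and a list of (squad, area) ownership claims, and reports areas that nobody owns as well as areas claimed by more than one squad. All names are made up for illustration.

```python
from collections import defaultdict


def ownership_problems(areas, claims):
    """Check that every code area has exactly one owning squad.

    areas:  iterable of code area names
    claims: iterable of (squad, area) pairs
    Returns (unowned, contested): a sorted list of areas without an owner and
    a dict mapping each multiply-claimed area to its sorted list of squads.
    """
    areas = list(areas)
    owners = defaultdict(set)
    for squad, area in claims:
        owners[area].add(squad)
    unowned = sorted(a for a in areas if not owners.get(a))
    contested = {a: sorted(owners[a]) for a in areas if len(owners.get(a, ())) > 1}
    return unowned, contested


areas = ["checkout", "notification-framework", "rest-endpoints"]
claims = [
    ("squad-a", "checkout"),
    ("squad-b", "checkout"),
    ("squad-c", "notification-framework"),
]
unowned, contested = ownership_problems(areas, claims)
print("unowned:", unowned)
print("contested:", contested)
```

In this example, “rest-endpoints” has no owner and “checkout” is claimed by two squads, which are exactly the two situations the paragraph above warns about.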
Software Maintenance and Technical Debt
Over time, market requirements change and so do functional and non-functional requirements. As we have learned, this increases complexity, which inevitably leads to aging of your software. Let’s assume the storm and stress period of the initial phases of your project is over, you have all the personas and requested use-cases covered, and your software is pretty much in maintenance mode. Demand in this lifecycle phase is at least two-fold:
- defects get uncovered over time and your development org (now greatly reduced) tries to fix them. You can assume that the key players who led the project initially have moved on, so this phase is now mostly in the hands of people who rely on the quality others have left behind
- dependencies (e.g. libraries) in your software get updated, which quite often requires changes to your software that are more expensive than just raising version numbers in package.json. The nightmare scenario here is the deprecation of an important library which does not have a compatible successor
Both activities need to be done to maintain your software, regardless of whether you add new capabilities or change existing ones. These are the fixed costs you’ll have if you’d like to keep the lights on.
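To get an early view of the dependency side of these fixed costs, a small script can list the dependencies whose latest release is a new major version, since those are the updates most likely to need real code changes. The Python sketch below is a simplification under stated assumptions: it only understands simple version specs such as `4.17.21`, `^4.17.21` or `~4.17.21`, the package names are invented, and the latest versions are passed in as a plain dictionary rather than fetched from a registry.

```python
import json
import re


def major(version):
    """Extract the leading major number from a simple spec such as '^4.17.21'."""
    match = re.search(r"\d+", version)
    return int(match.group()) if match else None


def major_upgrades(package_json_text, latest_versions):
    """Return {name: (declared, latest)} for dependencies whose latest major is higher."""
    manifest = json.loads(package_json_text)
    declared = {
        **manifest.get("dependencies", {}),
        **manifest.get("devDependencies", {}),
    }
    pending = {}
    for name, spec in declared.items():
        latest = latest_versions.get(name)
        if latest is None:
            continue  # no information about this package
        current_major, latest_major = major(spec), major(latest)
        if None not in (current_major, latest_major) and latest_major > current_major:
            pending[name] = (spec, latest)
    return pending


# Hypothetical manifest and hypothetical "latest" versions.
manifest = '{"dependencies": {"left-lib": "^1.4.0", "right-lib": "~2.0.1"}}'
latest = {"left-lib": "2.0.0", "right-lib": "2.3.0"}
print(major_upgrades(manifest, latest))
```

Minor and patch updates are ignored on purpose here; the point is only to surface the upgrades that are likely to cost more than raising a version number.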
There is a third aspect, however, whose impact on decay is usually greatly underestimated: refactoring. IMHO, you should plan a refactoring sprint once every quarter in the lifecycle phases before a project reaches maintenance, and as a best practice a refactoring sprint should not focus on direct business value. The indirect value your development org produces in these sprints is to fight back debt, to slow down decay, to improve quality, performance, architecture and the object model, and to clean up all the tactical implementations that very likely had to be done in the usual rush of capability development.
The sad reality, though, is that teams are very rarely given the time for refactoring sprints. Market demands are just too important and too urgent, indirect business impact is rarely popular, and it is assumed that teams clean up as they go. It is just a fact that the time for debt reduction is rarely given, and if this is paired with strong pressure to deliver features and functions, the software rots almost faster than it grows. In this mode, a software system in development does not even need 3–4 years to be internally rotten, before it has even entered maintenance mode.
Software development is a story of resource scarcity and pressure on time and cost. All three aspects usually drive the desire to minimize investments in elements that do not carry direct business value, which often leads to insufficient documentation quantity, quality and maintenance. Overall you have multiple ways to document your project. I personally recommend a two-fold approach: source code and a playbook.
- source code documentation should be applied to all classes, methods, metadata and configuration, including the object model, right inside the source files. Proper terminology in class and method names is very valuable, but it should not be assumed that the names alone are sufficient to describe what a function does. This is the sort of heritage the initial development team can leave to the maintainers, and if done in the right quantity and quality it will help to reduce the aging rate
- we used the playbook approach on all of our recent large-scale projects during the last few years. It’s a deployment of a lightweight Wiki/CMS solution (like wiki.js, Confluence or something comparable) that is set up, promoted and lived as the one-stop shop for any project-relevant documentation (excluding source code, obviously). Ideally the solution comes with proper tagging and search capabilities and allows WYSIWYG authoring to make content creation as simple as possible. For ERM diagrams we normally used Lucidchart, Mural or other comparable vehicles. If taken seriously, creating proper documentation in the form of a playbook can have a very positive impact on the rate at which your system rots.
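The source code half of this approach can be backed by a simple automated check. As a minimal sketch, and assuming a Python codebase (other languages would need their own tooling), the following function uses the standard `ast` module to list classes and functions that have no docstring. The function name and the sample source are invented for illustration.

```python
import ast


def missing_docstrings(source):
    """Return the sorted names of classes and functions in `source` without a docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return sorted(missing)


sample = '''
class Checkout:
    """Handles the checkout capability."""

    def submit(self, order):
        return order


def notify(user):
    """Send a notification to the given user."""
    return user
'''
print(missing_docstrings(sample))
```

A check like this only proves that documentation exists, not that it is any good, so it complements code reviews rather than replacing them.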
A very lightweight documentation system can be built for free on GitHub Pages. You may find this article helpful:
Never treat your demand management system (regardless of whether you use Jira or something else) as your documentation system. Issues are not a reliable vehicle to store documentation, simply because they describe what the request was, not necessarily how it was implemented in the end 😉
We’ve learned that software ages like anything else in this universe and that decay is inevitable. Looking at a number of influencing factors should have given you an idea of what you can do to influence the rate at which your system ages. We’ve also learned that the biggest opponents of a sturdy and slowly aging system are requirements, time and money, which is not a surprise. To set the right priorities and make the right investment decisions, you should first answer the question of how long you expect the system to last and whether the pace of decay is consistent with the decisions you have made.
I do not receive any compensation or incentives for the brands, products and companies mentioned.