Shipping software has always been about balancing speed and quality control. In fact, many great technology companies built their empires by mastering this skill.
Their ideas and practices for striking that balance created whole new classes of tooling that are now mainstream in the software world: application containerization, CI/CD pipelines, and cloud computing are being adopted by software organizations worldwide. The modern software ecosystem is evolving into a set of intertwined, well-oiled, almost fully automated systems that produce the final piece of software delivered to our customers.
This “renaissance of infrastructure,” however, is not without its faults. In the middle of all this innovation one age-old problem is actually getting much harder for developers: figuring out what exactly is going on inside these shiny artifacts our users love so much. More importantly, it’s also getting harder to understand what the developers who build these pieces of software can do to fix them when they break in production.
A vast, unknowable world of bugs
Many classes of bugs used to be easier to predict and identify back when applications were mostly designed to run on a single machine. After all, development and test environments could closely resemble the target production environment, there were fewer potential third-party services (hence fewer potential sets of third-party configurations), and the number of dependencies was significantly lower.
Today’s applications are, by and large, built to run on distributed systems. More specifically, many organizations are adopting the practice of building cloud-native applications—applications that are specifically designed to run on modern, scalable cloud infrastructure. Think Kubernetes, service meshes, and microservices instead of bare-metal, single-tenant monoliths.
Cloud-native technologies make production environments dramatically more complex. One might say the problem is not with the technologies themselves, but that the increase in “moving parts” makes it harder for developers to grasp all of the potential flaws in the system. This complexity gives developers more opportunities to make logic mistakes and more edge cases to miss, opens the gates for more configuration mistakes, and makes it nearly impossible to discuss a complete system in any one conversation (without blocking out a full day to cover all the components).
We need better quality controls in order to fight this battle head-on. In practice, that means we need some sort of mechanism to alert us when our software is not living up to the standards we expect it to.
Debugging in production is a quality control
Quality controls like code reviews and proper testing are not going to disappear. They have their rightful place in the world.
However, there are too many subtleties in production for us to be able to anticipate every problem in advance, i.e., during development or while writing our tests. In fact, development and testing are the wrong places to spend those cycles. The most practical (and fastest) way to get to the root cause of a production problem is debugging in production. Where else can you see real users, real requests, real data, real infrastructure, real… everything?
Not every bug reproduces locally. Not every application can be easily spun up in an environment that replicates the production system state at the relevant point in time. Not all data is easily extractable and available for the debugging developer’s consumption. It’s easy in theory, but in practice there’s always one more thing we forgot to mimic along the way—a set of configurations, a certain load on the database, a specific outage in a third-party service.
It all comes back to balancing the speed of shipping software and quality control. Every company wants to deliver new features fast, iterate, and repeat. But in order to make “moving fast” work, teams need to push code to production as well as see how that code behaves with real usage. Developers need a safe and easy way to get a grip on their production services—from within the tools they already use.
Troubleshooting the pitch-black box
When an IT operator or a DevOps engineer gets that dreaded page that a service is down, they have a full suite of health metrics and a variety of buttons to press and knobs to turn in order to restore service health. They have mechanisms that allow them to get a 360-degree view of the situation.
Developers… not so much. We’re doomed to scroll through endless logs and make do with dashboards of infrastructure metrics and whatever information we remembered to instrument for. When a developer gets that dreaded notification that there is a bug, it’s like the early stages of a crime scene investigation with no suspects and only traces of clues.
Usually the first steps are asking questions like, “Did this happen for just this user or for all users?” Or, “Was it related to a recent feature rollout or a configuration change?” Thus begins a long process of elimination, with the developer sifting through application logs to try to draw correlations between their working theories and what actually happened. More often than we’d like to admit, the problem lurks inside a black box—an object or code path we simply never added logging for in the first place.
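To make that process of elimination concrete, here is a minimal sketch of the kind of offline correlation script developers end up writing by hand. It assumes hypothetical structured JSON-lines logs with `level`, `user_id`, and `release` fields; the field names and file format are illustrative, not from any particular system:

```python
import json
from collections import Counter

def error_breakdown(log_path):
    """Count ERROR events per user and per release in a JSON-lines log file,
    to answer "one user or all users?" and "which release?" after the fact."""
    by_user, by_release = Counter(), Counter()
    with open(log_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip unstructured or garbled lines
            if event.get("level") == "ERROR":
                by_user[event.get("user_id", "unknown")] += 1
                by_release[event.get("release", "unknown")] += 1
    return by_user, by_release
```

The catch, of course, is that a script like this can only count what was logged in the first place.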
This is often followed by a series of hotfixes that are applied during the investigation, resulting in many precious incident minutes spent waiting for the CI/CD pipeline to finish and for releases to roll—all to get a better peek at what’s happening inside the running production application. This is a time-consuming, expensive, and decidedly non-agile process that also uproots developers from the tooling they know and love and drops them into the world of IT monitoring, dashboards, and GUIs. Not good.
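The “hotfix to get a better peek” pattern usually amounts to something like the throwaway diagnostic line below. This is a hypothetical Python sketch; `apply_discount` and its parameters are invented for illustration. The single added `logger.warning` call is the entire payload of the hotfix, yet shipping it still costs a full pipeline run:

```python
import json
import logging

logger = logging.getLogger("checkout")

def apply_discount(total, coupon, discounts):
    # Temporary diagnostic line added mid-incident just to see the inputs;
    # even this one line costs a full build-review-deploy cycle to ship.
    logger.warning("HOTFIX-DEBUG total=%s coupon=%r known=%s",
                   total, coupon, json.dumps(sorted(discounts)))
    return total * (1.0 - discounts.get(coupon, 0.0))
```

Lines like this then tend to linger in the codebase long after the incident, because removing them requires yet another release.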
Developers deserve developer-native observability
Thinking back for a moment on the dramatic evolution currently underway in the software world, it’s amazing to see how many state-of-the-art systems were built solely for the purpose of allowing developers to get from code to a running application in production faster and with less manual work.
Troubleshooting production code should be just as fast and easy, with the same tooling that allows for an agile, real-time, and developer-native process. Developers—not operators—should own the reliability of application logic, as well as the discovery and remediation of bugs in production. Developers need observability tooling that is integrated into their development workflows, their IDEs, and their source control.
Unfortunately, we’re not there yet. Our tools for breaking down code-level issues in real time are still cumbersome. Although it’s easy to ship your code to a machine on another continent, it’s difficult to understand the structure of an object or its size in memory if you didn’t explicitly log for that from the get-go.
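As a sketch of the view that is missing, the hypothetical Python helper below summarizes an object’s type, shallow memory footprint, and field structure on demand. It stands in for what a debugger or dynamic-instrumentation tool might surface at runtime; it is not any particular product’s API, and `sys.getsizeof` reports only the shallow size, not the object graph it references:

```python
import sys

def snapshot(obj, max_fields=10):
    """Summarize an object's structure and shallow memory footprint:
    the kind of view you'd want at a breakpoint without prior logging."""
    fields = vars(obj) if hasattr(obj, "__dict__") else {}
    return {
        "type": type(obj).__name__,
        "shallow_bytes": sys.getsizeof(obj),  # shallow size only
        "fields": {k: type(v).__name__
                   for k, v in list(fields.items())[:max_fields]},
    }
```

Getting this view without redeploying is precisely what log-sifting cannot offer today.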
As a longtime developer and manager who has had the privilege of leading a team of excellent developers, I can say for a fact that we deserve better tooling. Tools that work where we work. Tools that are integrated into our workflow, not borrowed from other roles in the company. We need tools that allow us to “touch” the system (without affecting its actual state) and gain a better view of how our applications are behaving when they aren’t behaving as they should.
Leonid Blouvshtein is cofounder and CTO at Tel Aviv-based Lightrun.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to firstname.lastname@example.org.
Copyright © 2021 IDG Communications, Inc.