DevLife #5: Microservice Hell

Our technical debt is your debt

Feb 07, 2023

The way I see it, most SaaS applications are deployed as microservices because it is easy to structure an organization into teams by micros (there are other compelling reasons too). Perhaps your organization has 6 or so development teams. Each team gets one micro. Responsibilities and division of labor are well-defined. Here’s a possible architecture.

An architecture that works well with an organizational chart!

I like how this looks. It’s clean. Each component should be able to evolve and scale independently of the others.

Except that’s not really how it works.

In reality, each micro is heavily constrained by its callers (upstreams) or its dependencies (downstreams) or both. Let’s redraw the diagram to see what it really looks like.

Every single one of these micros likely has to interact with the user and tenant management service. Every request will need to be authenticated and authorized. The caller context has to be verified as it flows through every service. Each service will have to figure out how to ship their data to the Search micro. The various business logic services likely have dependencies on each other to avoid duplicated effort.
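To make the caller-context point concrete, here is a minimal sketch of the same check being repeated at every hop. Everything in it is an assumption for illustration: the shared secret, the function names, and the HMAC scheme are hypothetical, and a real system would more likely verify a token issued by the auth service.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret, used only to keep the sketch self-contained.
SHARED_SECRET = b"demo-secret"


def sign_context(context: dict) -> str:
    """Return a hex signature over the caller context (user and tenant)."""
    payload = json.dumps(context, sort_keys=True).encode()
    return hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()


def verify_context(context: dict, signature: str) -> bool:
    """Every service on the call path repeats this check before doing work."""
    return hmac.compare_digest(sign_context(context), signature)


def business_logic_1(context: dict, signature: str) -> str:
    if not verify_context(context, signature):
        raise PermissionError("caller context failed verification")
    # Forward the same context and signature to the next service downstream.
    return business_logic_2(context, signature)


def business_logic_2(context: dict, signature: str) -> str:
    if not verify_context(context, signature):
        raise PermissionError("caller context failed verification")
    return (
        f"handled request for user {context['user_id']} "
        f"in tenant {context['tenant_id']}"
    )
```

Note that both services carry the same verification code. That duplication is the coupling: if the context format changes, every micro on the path has to change with it.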

In general, these pains are worth it. More micros means a smaller blast radius if a single micro is breached, more fault tolerance in case one of the micros becomes unavailable (unless it’s auth or user management), better resource utilization for heavily used services (like auth) and rarely used micros.

However, the one alleged “benefit” that I find completely ridiculous is the idea that micros evolve independently. I have never found this to be the case. Let’s look at this from the perspective of callers (upstreams) and dependencies (downstreams).

The Upstream Problem

By upstreams I mean services that are early in the call flow. The client (frontend/Web browser) is the most upstream on this diagram (the client isn’t a micro though). API Gateway is second. Auth is likely third. User/tenant management is probably fourth. Everything else is after that.

The issue for upstream services is that downstream micros often do not yet provide sufficient APIs to enable your features. Why? Usually, this is because those services are owned by other teams which have differing priorities from your team. This creates overhead in that leadership and the two teams need to negotiate priority. However, if the downstream team is unable to provide the needed functionality in a timely manner, then the upstream might implement it themselves or find a clunky workaround.

This is often necessary, but it is problematic: micros end up with functionality built into them that logically belongs elsewhere, and that creates technical debt that is difficult to unwind.

For example, suppose the Users/Tenants team fails to deliver their data to the Search team, so the Search team cannot provide search functionality that covers user data. However, the Business Logic 1 service needs this functionality. Business Logic 1 finds a way to maintain its own list of users and searches that data instead of going through the Search service. To make matters worse, the Business Logic 2 service decides to invoke Business Logic 1 to search user data because the functionality is ready.

Months later, the Search team is finally able to get the data they need from the Users/Tenants team. Search builds the search functionality for user data. However, the Business Logic 2 team does not adopt it because they already have the functionality they need.
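One way to see why this debt is hard to unwind is to write down who calls whom for user search after the workaround. The sketch below is purely illustrative: the service names mirror the example above, and the call graph and helper function are hypothetical.

```python
# Hypothetical call graph after the workaround: each service maps to the
# services it calls in order to search user data.
USER_SEARCH_CALLS = {
    "business_logic_2": ["business_logic_1"],
    "business_logic_1": [],  # answers from its own copy of the user list
    "search": ["users_tenants"],
}


def callers_of(target: str, calls: dict) -> set:
    """Return every service that reaches `target`, directly or transitively."""
    found = set()
    changed = True
    while changed:
        changed = False
        for service, callees in calls.items():
            if service in found:
                continue
            # A service depends on the target if it calls the target itself
            # or calls something already known to depend on it.
            if target in callees or found.intersection(callees):
                found.add(service)
                changed = True
    return found


# Services that must migrate before Business Logic 1 can drop its user search.
blocked_on = callers_of("business_logic_1", USER_SEARCH_CALLS)
```

Here `blocked_on` contains Business Logic 2, so the Business Logic 1 team cannot delete its workaround until another team, with its own priorities, decides to move to Search.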

The Downstream Problem

By downstream I mean services that are late in the call flow. The User/Tenants service in the above diagrams serves as our example.

The issue for downstreams is that changing the interface is hard. For one, you MUST never break your callers/upstreams (except under rare circumstances like an active breach, in which case it might be necessary). For instance, suppose you own the user service and have an API for retrieving a user by social security number that is invoked over HTTP like the following:

HTTP GET /users?ssn=333-33-3333

A few years after your product launch, your leadership realizes that querying for user data by social security number is a bad idea. In fact, a compliance agency threatens to remove one of your certifications when they discover this API. The decision is made to remove it as quickly as possible.

But you can’t remove it immediately. The API is used by nearly every other micro. The best you can do is allow it to optionally take in a second field called id like the following:

HTTP GET /users?ssn=333-33-3333&id=abcde-fghi-jklmnopq

Your upstream micros can then invoke you either using ssn or id. This gives them a migration path off of the API. But you are stuck with ssn as a parameter until the last caller switches to using id. Removing ssn prior to that would break the upstream services and cause the product to lose service.
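A minimal sketch of what that dual-parameter lookup might look like inside the user service follows. The in-memory user list, the function names, and the choice to prefer id when both parameters are present are assumptions for illustration, not a description of any real service.

```python
from typing import Optional

# Hypothetical in-memory store standing in for the real user database.
USERS = [
    {"id": "abcde-fghi-jklmnopq", "ssn": "333-33-3333", "name": "Example User"},
]


def _find(field: str, value: str) -> Optional[dict]:
    """Return the first user whose `field` equals `value`, or None."""
    return next((user for user in USERS if user[field] == value), None)


def get_user(ssn: Optional[str] = None, user_id: Optional[str] = None) -> Optional[dict]:
    """Handle GET /users with either the new id or the deprecated ssn."""
    if user_id is not None:
        # Preferred path: callers that have migrated send id.
        return _find("id", user_id)
    if ssn is not None:
        # Deprecated path: kept only so existing callers do not break.
        return _find("ssn", ssn)
    raise ValueError("either id or ssn is required")
```

The deprecated branch cannot be deleted until no caller reaches it, which is exactly the constraint described above.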

This is the downstream problem. Fixing your technical debt often requires your callers to change. If your clients are external, then moving your customers onto the non-deprecated feature could take years. This is why companies like Amazon and Microsoft very rarely remove APIs and take the interface very seriously.
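Knowing when the last caller has switched usually means measuring it. As a rough sketch, assuming each request carries some identifier for the calling service (an assumption, as are all the names below), the user service could tally deprecated calls per reporting window:

```python
from collections import Counter

# Hypothetical per-window tally of calls that still pass the deprecated
# ssn parameter, keyed by the calling service's name.
deprecated_calls = Counter()


def record_deprecated_call(caller: str) -> None:
    """Call this from the handler whenever a request uses ssn instead of id."""
    deprecated_calls[caller] += 1


def remaining_callers() -> list:
    """Services that used the deprecated parameter, busiest first."""
    return [caller for caller, _ in deprecated_calls.most_common()]


def safe_to_remove() -> bool:
    """True only when nobody used the deprecated parameter in this window."""
    return not deprecated_calls
```

An empty window is evidence rather than proof, since a caller that runs once a quarter will not show up in a week of data. That is part of why removals take so long.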

Solving these Problems

The management and leads of each micro must periodically meet to align on their mission. Teams must commit to adhering to the principles of the architecture except as a last resort. Should the principles be compromised, teams must commit to resolving the compromise once a principled solution is ready. Teams must strive to rarely deprecate APIs or change them in ways that require a difficult migration.

Many of these problems show up less often when each team owns more micros, or when those micros are simply combined into a single service. This empowers teams to unblock themselves more often.

To sum it up, the benefits of microservices include

easy to divvy up the work among a lot of teams
better fault tolerance
low blast radius
independent scaling

The drawbacks include

technical debt is often shared across many micros
migrations involving multiple teams have to be carefully orchestrated
latency has to constantly be dealt with

Chris

Apr 20, 2023

"In general, these pains are worth it" - with a system as outlined in the article, I disagree. As you described it, there are a lot of dependencies between the teams, which results in a lot of technical dependencies between the services. Which makes them a distributed monolith rather than microservices.

That said, you're on the right track with your proposed solution: As described, the system is a "people problem" and thus needs solutions centered around people. I'd suggest closely examining how the teams are cut. Rather than more meetings and alignment, I'd suggest finding ways to make the teams more independent (which may include a small re-org).


Sheep Code
