What happens when Amazon rolls back from serverless?

Starting from a controversial article, we explore the reasons that could drive serverless adoption or represent a point of friction.

Unless you've been living under a rock recently, you'll have noticed that the entire serverless world was shaken when "even Amazon" apparently decided to move away from serverless.

Let's take a deep breath and try to understand things for what they are, not for what someone would like them to be. Long story short, more than a month ago the Amazon Prime Video team published a controversial article titled "Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%," in which they stated they had to shift to a monolithic architecture to comply with some challenging requirements. We'll dive into the article in a moment, but for now let's stay on the title and the preliminary understanding it gives us.

The article went unnoticed for almost five weeks until the well-established anti-cloud prophet DHH wrote a heartfelt article that ChatGPT could accurately summarize as "I told you so. I was right." I don't know whether Bezos's dog ever bit David's hand in the past, or whether being the father of the Ruby on Rails monolith was simply enough to make him resentful. Honestly, I would have expected better balance and deeper analysis from a highly skilled professional like him than just yelling at the cloud.

Unfortunately, DHH has been strongly opinionated on the subject since he moved Basecamp away from the cloud. I don't have the details that make this choice sensible for him, but his public arguments have been questionable at best. To begin with, he wrote a challenging article outlining the cost savings of moving away from the cloud while neglecting to mention housing, people, and maintenance costs.

Recently, he raised the bar of his overconfidence with this naive position on serverless, coming right after similar statements against microservices. Let's go one step further and analyze what the Amazon team explained.

What did the Amazon team say?

Reading the article carefully, we can outline some key points about their use case that are worth mentioning.

“Our Video Quality Analysis (VQA) team at Prime Video already owned a tool for audio/video quality inspection, but we never intended nor designed it to run at high scale (our target was to monitor thousands of concurrent streams and grow that number over time). While onboarding more streams to the service, we noticed that running the infrastructure at a high scale was very expensive. We also noticed scaling bottlenecks that prevented us from monitoring thousands of streams.”

This is clear: they had a tool designed to handle a few streams that ended up having to process thousands of streams in real time.

“We designed our initial solution as a distributed system using serverless components (for example, AWS Step Functions or AWS Lambda), which was a good choice for building the service quickly. In theory, this would allow us to scale each service component independently. However, the way we used some components caused us to hit a hard scaling limit at around 5% of the expected load. Also, the overall cost of all the building blocks was too high to accept the solution at a large scale.”

This part is quite telling: orchestration was a problem. This makes sense, because orchestrating services inevitably introduces friction, and doing it at scale means every part of the system has to keep up with the throughput of the fastest one. Another relevant aspect also emerges: all the functions mentioned belong to the same business domain. This is a fundamental aspect of the application: the team broke the architecture down into a distributed system with many components to achieve fast iterations while building the solution, probably because the requirements were unclear and many reworks occurred along the way. This was a wise choice that needed to be revisited later on, once the application reached feature stability and had to guarantee high scalability with real-time response times.

We would need to know what constraints they faced regarding response times and whether some of them could be relaxed in favor of an asynchronous choreography instead of Step Functions orchestration, thus avoiding the cost of Step Functions state transitions.
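To make the distinction concrete, here is a minimal sketch of what a choreographed hand-off could look like, assuming an EventBridge bus and a hypothetical "FramesExtracted" event; none of these names come from the Prime Video article. Each detector would subscribe through its own rule and Lambda, so the cost model becomes per published event rather than per state transition.

```python
# Hypothetical sketch: the frame-splitting step publishes an event to EventBridge
# and lets downstream detectors react asynchronously (choreography), instead of
# being invoked as Step Functions state transitions (orchestration).
# Source, detail type, and payload shape are illustrative assumptions.
import json
import boto3

events = boto3.client("events")

def publish_frames_extracted(stream_id: str, frame_keys: list[str]) -> None:
    """Emit one event per extracted batch; each defect detector subscribes via its own rule."""
    events.put_events(
        Entries=[
            {
                "Source": "vqa.frame-splitter",   # hypothetical event source
                "DetailType": "FramesExtracted",  # hypothetical detail type
                "Detail": json.dumps({"streamId": stream_id, "frameKeys": frame_keys}),
                "EventBusName": "default",
            }
        ]
    )
```

Whether this is acceptable depends entirely on those response-time constraints: choreography trades the central, billed state machine for eventual, event-driven coordination.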

“The second cost problem we discovered was about the way we were passing video frames (images) around different components. To reduce computationally expensive video conversion jobs, we built a microservice that splits videos into frames and temporarily uploads images to an Amazon Simple Storage Service (Amazon S3) bucket. Defect detectors (where each of them also runs as a separate microservice) then download images and processed it concurrently using AWS Lambda. However, the high number of Tier-1 calls to the S3 bucket was expensive.”

Here we would need more information about the requirements that prevented the team from saving video frames to something like EFS and pulling them into the Lambda execution context. I would have considered this approach because S3 is excellent for storing data but becomes very expensive when you have to scale the number of accesses. If your use case allows it, EFS can be preferable to S3 here, since its pay-per-use model is not bound to the number of requests.
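As a rough illustration of the difference, here is a minimal sketch of the two access patterns inside a Lambda function; the bucket, mount path, and key layout are assumptions for the example, not details from the article.

```python
# Hypothetical sketch: reading frames from an EFS file system mounted into the
# Lambda execution environment versus issuing one S3 GET per frame.
import os
import boto3

s3 = boto3.client("s3")
EFS_MOUNT = "/mnt/frames"  # assumed mount path configured on the function's file system settings

def load_frame_s3(bucket: str, key: str) -> bytes:
    # Each call is a billed S3 request; cost grows with the number of frames read.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

def load_frame_efs(key: str) -> bytes:
    # EFS is billed on storage and throughput rather than per request,
    # so reading thousands of small frames adds no per-call charge.
    with open(os.path.join(EFS_MOUNT, key), "rb") as f:
        return f.read()
```

The trade-off is operational rather than architectural: mounting EFS requires the function to run inside a VPC with mount targets, which may well be the kind of constraint the team wanted to avoid.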

One size doesn't fit all.

Microservices are characterized by their smaller size and greater ease of management. They can use technology stacks tailored to their business needs, resulting in shorter deployment times and faster developer onboarding. Additionally, microservices allow new components to be added without disrupting the entire system; if one microservice fails, the rest of the system can continue to function. This architecture is known for its evolvability: it can be easily adapted and modified, beginning with a simple structure and gradually increasing in complexity to align with the overall vision.

This is not a mandatory prescription but a collection of benefits that could and should be evaluated carefully. Monoliths may be a good choice for a small startup because of their ease of development. Alternatively, the team may go for a technology choice that increases communication friction but significantly lowers their cognitive load, sparing them from managing many aspects of their application, from instances to database scalability and infrastructure best practices.

Microservices and serverless components are tools that do work at high scale, but the decision to use them over a monolith has to be made on a case-by-case basis.

Every architect has several options, or analysis dimensions, that can shape the project outcome, ranging from monolith vs. distributed, to serverless vs. VMs, to a single language for everything vs. the right tool for the right job. In my career, I've seen developers using the same database over and over because it was the only one they were confident with, and teams changing language or framework every three months to stay cutting edge. All of these positions are deeply wrong because they consider neither the business value of the outcome nor the many constraints an architect has to face.

This is why I am not sold on DHH's "go-away-from-the-cloud" religion: not because I love serverless (which I consider a means to build event-driven evolutionary architectures), but because it is an absolute position, perfectly indistinguishable from someone yelling that you must use serverless for everything.

We should all have learned by now that only the Sith deal in absolutes.