Chapter 4. Rate Limiting
Much of what I’ve covered so far looks at the things you can do on the client side to deal with problems getting an appropriate response from a server. Often, when a client is struggling to get a response it’s because the server it’s talking to has too much work to do, and is overwhelmed.
In this chapter, I’ll be looking at approaches to reduce the amount of work being processed, a technique called rate limiting. You’ll see two main types of rate limiting, along with some practical tips for how to choose the right type of rate limiting.
Before I get into the details though, let’s step back a bit and consider what your options are when a server has too much work.
Ways To Handle Having Too Much Work
When a server has so much work to do that it is in danger of falling over as a result, you have five options to consider:
- Just fall over
- Throw away some of the work - load shedding
- Reduce the amount of work being sent - back pressure
- Queue up the work
- Provision more computing resources dynamically
I’m going to explore each of these ideas briefly, before drilling down into more detail on load shedding and back pressure.
Just Fall Over
If a server is being hammered with requests, you could just let the server fall over. Now, I suspect this is unlikely to be desirable behavior for you, dear reader. You’ve likely made the decision to read this book because you don’t want this sort of thing to happen.
The problem with just allowing a server to fall over is that it is uncontrolled. Requests will be terminated halfway through being processed, and clients will be left confused about what state their work is in. In the worst cases, an uncontrolled failure can result in data loss, and perhaps a lengthy time to recover.
In fact, if you were in a situation where you thought a server was about to die, it might actually be better to shut it down yourself in a controlled fashion rather than let it fall over in an uncontrolled state.
But, we can of course do better than this.
Throw Away Some Of The Work (Load Shedding)
If you have too much work coming in, one answer is to just throw away some of the work - this is known as load shedding. Some work will get rejected, but other work will be done, and the system remains up. The tradeoff here is about prioritizing the health of the system over attempting to serve every individual request. By throwing away some work, you reduce contention on the server, giving it a better chance of surviving some sort of onslaught.
I’ll explore the mechanics of load shedding later in this chapter.
Reduce The Work Being Sent (Back Pressure)
If the server is aware it has too much work, it would be ideal if it could tell the clients to reduce the work being sent to the server, and if possible redirect the work to an alternative deployment of the service. This concept, called back pressure, allows us to limit the work flowing through a system to keep the system stable. So if a server is being overwhelmed by a large number of requests, and these requests can be “pushed back”, the pressure on the server itself is reduced.
Back pressure can be implemented in a number of ways. Later in this chapter you’ll see how circuit breakers can be used on the client side to stop requests being sent, and I’ll also show how adding information to the client-server protocol can make back pressure even smarter.
Queue Up The Work
Rather than throwing work away, or reducing the amount of work you can accept, a server can decide to queue up the work, with the aim of working through the queue as quickly as possible whilst keeping the server stable.
The queue can be thought of as a throttling mechanism. Clients can keep sending work in, and it builds up in the queue, while the server limits how much work it processes at once. The tradeoff here is that as the queue builds up, the latency of the work increases.
I’ll come back to queuing in [Link to Come] where you’ll see different ways queueing can be implemented.
Dynamically Provision More Resources
Often the issue with having too much work to do is that the amount of computing resources available is constrained. Your server has run out of CPU, memory, IO, storage or something else. I’m sure many of you are running on infrastructure that allows for computing resources to be dynamically launched, which then opens the possibility of spinning up more computing resources to handle the increased amount of work.
I’ll share more detail around dynamic provisioning of resources, and the wider topic of scaling for improved resiliency, in [Link to Come].
Now that you’ve seen some of the options for dealing with too much work, let’s focus for the rest of this chapter on two of them - load shedding and back pressure.
Load Shedding
If a server attempts to handle everything, it can collapse under the load, and end up serving nothing. A sensible alternative is to throw away some of the work - this is known as load shedding.
If you can drop some of the work you are being asked to do, you may still be able to get the rest of the work done. If there is a choice between processing zero work items or processing some work items, the answer starts looking pretty clear.
There are a few moving parts to consider here. Firstly, how does a server know when to trigger load shedding? Secondly, how is the fact that a service is shedding load communicated to the client? Finally, should all work be considered equal when load shedding kicks in?
Triggering Load Shedding
With load shedding you will track one or more metrics, and set a “safe” threshold, above which load shedding kicks in. One of the simplest mechanisms is to track the number of work items the server is currently processing. For example, if your service exposed an HTTP API, you would track the number of concurrent requests. You would then set an upper boundary, and any requests received beyond that limit would be rejected.
Connections vs Requests
It can be easier to monitor the number of active connections rather than the number of requests. The problem is that some protocols will keep connections between client and server open even when they’re idle, and protocols like HTTP/2 can send multiple requests over a single connection, so a connection count can be a poor proxy for the work actually in flight.
Once you’ve decided what you are going to use to track the current amount of work being processed, the tricky part is knowing what the upper bound is. Is the limit for a server 100 concurrent requests? 50? 25? Typically, working out the acceptable level for your server will involve observing it under different load patterns. Gathering information from production can be helpful, but a load test for a service could be even more useful, allowing you to simulate different conditions and understand how your service behaves under load.
Inconsistent Load
Just tracking the amount of work, such as the number of concurrent requests, can be problematic if the load generated by each item of work varies significantly. Consider a situation where a client can send two different calls to a Customer service instance. One of these calls is to create a new customer, which represents a pretty key piece of functionality in the system. The Customer service also handles Subject Access Requests (SAR), which are required for systems that operate under the auspices of the General Data Protection Regulation (GDPR).
A SAR represents a request from a user for the information we hold on them. Creating a new customer doesn’t generate too much load, but processing a SAR might be more intensive, especially if it involves gathering information from lots of other data sources, such as other services. In such a situation, an instance of your Customer service might be quite happy handling 100 customer creations, but might struggle handling 50 SAR requests.
Server vs Service
In some situations you may mostly be concerned with clients speaking to a single server. I suspect many (if not most) of you though are working in a services environment, where a client talks to a logical service, which could in fact be deployed on to multiple servers. In Figure 4-1 you can see a service with multiple instances, with the client making calls to the service’s API via a load balancer.
This then raises the question of where you should be tracking your thresholds for load shedding. Do you set limits for the service as a whole? For example, you might say that the entire service can only handle 300 concurrent requests, at which point load shedding should commence. This creates the opportunity for load shedding to be handled by the mechanism that distributes load in the first place. Many load balancers allow you to set an upper limit on the number of requests they will allow through. If you were using a message broker for load distribution, setting a maximum queue size would have a similar effect.
The downside with this model is that if the work cannot be evenly distributed then a single instance could still be suffering even if the overall service-level work thresholds look fine. Imagine that in our three instances of the Customer service example, more SAR requests went to one node, as shown in Figure 4-2.
Here, we can see that although each instance is handling the same number of requests, one of the instances is suffering more as it just happens to have more SAR work items to process.
If you do try to manage load shedding at the level of a single instance of a service, then you need to understand the implications. For example, in a load balancer setup, if one of the instances starts load shedding, these errors will be detected by the load balancer itself, which will likely result in the instance being removed from the load balancer pool. This could result in a significant loss of capacity for the service as a whole, and could even lead to a cascading failure. I’ll explore this in more detail in [Link to Come].
Tracking Computing Resources
Arguably, tracking concurrent requests is in part a proxy measure for the underlying contention on resources. The thinking goes, as a server does more work, more computing resources are being tied up, so you get closer to the point where the server is overwhelmed.
But if the state of the computing resources is really what you care about, then why not track the underlying resources instead? This would help in situations where the load generated by different types of requests can vary, as in the case of the Customer service example.
So there may be value in tracking the underlying computing resources, like CPU and memory, instead, and setting acceptable thresholds for those.
The challenge of using more sophisticated metrics for triggering load shedding is that it can be difficult to obtain the underlying metrics of the server within the context of handling a request. Inside the process running your service, getting a connection count isn’t too difficult. But are you able to access the underlying machine’s CPU or memory use?
If you are configuring load shedding at a proxy layer like a load balancer, does the proxy have access to the load characteristics of the machines it is talking to?
Secondly, you go from having one simple measure to use as a trigger for your load shedding - number of items of work - to now having multiple potential metrics to track. Do you set a threshold for memory, CPU, network IO? If you change the type of server you run your code on (which is easy to do with virtualized compute), all of these thresholds would likely need to change. But with tracking the number of requests, switching to a bigger underlying machine would just require changing a single threshold.
So whilst I have seen the state of underlying computing resources used to trigger load shedding, it’s not common. It’s important to note that you will still want to capture metrics associated with the underlying computing resources, as this can be vital in understanding failure modes, capacity planning and more. I’ll explore this topic in more detail in [Link to Come].
Communicating Load Shedding To Clients
Once the server has decided to shed load, the question becomes what you should tell the client. As I discussed in Chapter 3, when a client gets an error from a server, it’s important that the client is able to distinguish between errors which should or should not be retried.
If using HTTP for example, you might be tempted to respond with a generic 503 Service Unavailable response. However, this is a somewhat generic error, and it may not be clear that you want the client to back off. It would be detrimental for us to reject a request because we are load shedding, only to have the client just retry the request.
In fact a dedicated HTTP status code, 429 Too Many Requests, was added explicitly as an indicator that the service was load shedding. The use of a 4XX error code for this is interesting - remember that these response codes are broadly considered client-side errors, and are used in situations when an HTTP server is telling a client that the client has done something wrong. In this situation, it seems to make sense - the client is being admonished that too many requests are being sent to the server. Additionally, as with other 4XX error codes, the server wouldn’t expect the client to retry the request (although actually you might - I’ll come back to the use of the 429 later in this chapter when we look at back pressure).
Of course, telling a client “this work was dropped - please don’t retry it!” is problematic for things that really do need to happen. Some work is important, and needs to be carried out. For example the GDPR-related Subject Access Request that the Customer service has to fulfill is actually a regulatory requirement! In situations where we just reject the work, we then leave the client with a puzzle about how to make forward progress.
If you can take the concept of better client communication one step further, and give more information to the client, then it’s possible to help in situations like this. Later in this chapter I’ll cover the concept of accord-based back pressure, which requires the client and server to work together.
Is All Work Equal?
So far, I’ve looked at load shedding as being pretty indiscriminate. A threshold is hit, above which all subsequent requests will be rejected. However, the impact of rejecting some types of work might be greater than for others.
Coming back to our scenario with the Customer service, I showed you that there are two types of requests that need to be dealt with - creating a new customer, and dealing with subject access requests (SAR). Creating a new customer is a key business process that is critical to the system. Getting a new customer created is important to grow the business, and you’d want to do this quickly, so that your new customer can actually start using your software (and maybe paying you money). Just rejecting new customer signups might not be a good first impression.
On the other hand, whilst you might have a regulatory requirement to process an SAR, you have a long timescale to do this in - in fact, you have up to a month. So in a situation where Customer is under load, having the service still allow new customer creations whilst throttling SARs may make a lot of sense, assuming you have a mechanism to ensure that those SARs get processed later.
In Figure 4-3 we see the Customer service with two REST resources exposed over HTTP for the different operations. To create a new customer, a client sends an HTTP POST to /customer/. When a SAR is required, they POST a request to /sar/. In this case, you can clearly trace the entry points of the two types of requests, and this could allow you to tune load shedding differently for each entry point.
In practice though, I see this as working around a more fundamental problem which may need to be addressed. Is it OK for one service to be doing two very different types of work?
Splitting Services?
When you have a service carrying out a mix of work with different types of load profile and complexity, this could be a sign to consider splitting the service apart.
For example, if you split out the SAR functionality from the Customer service, you’d end up with two services with different load profiles and differing levels of criticality. This could simplify a number of things - for example, the availability SLOs for the new Subject Access Request service might be much lower than those for the Customer service.
The existing Customer service could remain focused on dealing quickly with new customer requests over HTTP. The Subject Access Request service, on the other hand, could switch to a model where SARs are sent to a queue, which is then processed in order at a defined rate, as we see in Figure 4-4 (I’ll explore why queues might be a good fit for this type of problem space in [Link to Come]).
Once split, you also have the ability to run the different services on different infrastructure most appropriate to their requirements.
The decision to create a new service comes with a cost of course. This is another component that needs to be maintained, and it adds some further complexity to the system. This brings us back to the paradox we explored in Chapter 1 - sometimes, the solutions we use to improve the resiliency of a system can increase complexity, and increased complexity can lead to new sources of failure.
The decision about when to split a service isn’t always clear cut, but if you want to explore this in more detail I have a book on the topic1.
Back Pressure
If a service is overloaded to the point where it is shedding load, does it make sense for the clients to keep sending work? Now, a client with retry limits will stop trying to get the service to process a specific request, but even if the client eventually gives up on one request, it doesn’t mean it won’t try to send others.
In a situation where a client is constantly sending calls that fail, at a certain point it makes sense to just decide that the destination server is having a problem, and that perhaps the client should stop sending work for a period of time. This would give the server more of a chance to recover, and also allow the client to fail fast rather than failing slow.
Even better, rather than the client working this out for itself, wouldn’t it be better if the service could tell the client to back off?
Back pressure describes a client reducing or eliminating the calls it is sending to a server in reaction to the server being overloaded. A client can decide to trigger back pressure arbitrarily, in reaction to increased error rates - I’ll refer to this as client-only back pressure. It can also choose to implement back pressure as a result of information being received from the server itself - this can be more effective, but must be built into the protocol used for the client and server to communicate. I describe this as accord-based back pressure. Let’s look at client-only back pressure first.
Client-Only Back Pressure
With client-only back pressure, a client decides to apply back pressure based on its own information. This could be based on manual operator intervention, or on observation of the success (or failure) of calls the client has been sending to the server. A threshold is reached, and the client stops sending traffic. A circuit breaker is a common pattern used to implement client-only back pressure - I’ll explore that more shortly.
Aside from the generic benefits that back pressure brings in terms of improving system stability, client-only back pressure is often a popular choice because it can be somewhat easily retrofitted into existing inter-process communication. The protocol between client and server doesn’t need to change.
The main downside of client-only back pressure is that the decision to apply back pressure is being made locally. In a service-based architecture, it’s common for each server to have multiple clients. If one client decides to stop sending calls to apply back pressure, it doesn’t mean that another client will have reached the same decision - unless you have some mechanism for the clients to share this information. As a result, you may not end up relieving as much pressure on the server as you would have hoped.
Another issue with client-only back pressure is that you lack information on how much back pressure to apply. Decisions like how much to reduce calls by, or how long a circuit breaker should remain open, are ones the client is reaching by itself. The client is relying on local information, rather than any wider understanding about what is happening at the server. This of course is where accord-based back pressure comes in.
Accord-based Back Pressure
With accord-based back pressure, the server provides additional information about the back pressure that is required, which the client then acts on. Put another way, the server is agreeing to send some information asking the client to back off, and the client is agreeing to listen.
For accord-based back pressure to work, we need to build this information into the client-server communication protocol. A good example of a real-world use case would be the HTTP 429 Too Many Requests status code I touched on previously. When sending back a 429, the server can include a Retry-After header which tells the client how long it should wait before sending additional requests.
gRPC’s equivalent error code would be GRPC_STATUS_RESOURCE_EXHAUSTED. Unlike in the case of HTTP, there is no defined equivalent of the Retry-After field in the gRPC specification. As a result, you’d need to define for yourself what information needs to be relayed back to the client to trigger the back pressure. In addition, gRPC also supports custom back-end metrics - this does allow for more detailed information to be sent either within the context of the call, or out of band, and is primarily used to help gRPC clients make load balancing decisions. I’ll look at load balancing in more detail in [Link to Come].
In situations where multiple clients are involved, a server’s ability to provide back pressure guidance means that we are likely to reduce the pressure on the server more quickly - we’re not waiting for all the clients to reach their own thresholds.
Note that with accord-based back pressure, a server could decide to start applying back pressure for some types of work but not others. So in our previous example of the Customer service, if you were to prioritize creation of new customers over SAR-related work, Customer might start sending back 429s for SARs first, before it considers doing the same for customer creation work.
Circuit Breakers
In your own home, circuit breakers exist to protect your electrical devices from spikes in the power. If a spike occurs, the circuit breaker switches into an open state, breaking the circuit, and therefore protecting your expensive home appliances. You can also manually disable a circuit breaker to cut the power to part of your home, allowing you to work safely on the electrics. In a pattern I first learned about from Mike Nygard’s book Release It!2, we can implement a similar mechanism in our client-side software as a way of implementing client-only back pressure.
Implementation Overview
With a circuit breaker, after a certain number of requests to the downstream resource have failed (either due to error or timeout), the circuit breaker is switched into an “open” state. Any requests routed through a circuit breaker that is in an open state will fail fast, as shown in Figure 4-5. The terminology of an “open” breaker, meaning requests can’t flow, can be confusing, but it comes from electrical circuits. When the breaker is “open,” the circuit is broken and current cannot flow. Closing the breaker allows the circuit to be completed, and current to flow once again.
Once a circuit breaker has switched into an open state, we need a way to “close” it again, so that work can start flowing once more. This can be done in a variety of ways, but one common approach is to still allow a few requests through even if the circuit breaker is open, and close it if those requests succeed at an acceptable rate. Another option is to monitor some sort of health check endpoint on the remote server. When the server is considered healthy again, the circuit breaker is reset.
Getting the settings right can be a little tricky. You don’t want to open the circuit breaker too readily, but nor do you want it to take too long to open. Likewise, you really want to make sure that the service is healthy again before closing the breaker and letting traffic flow once more.
Once you have a circuit breaker mechanism in place (as with the circuit breakers in your home), you can use them manually to make it safer to carry out maintenance work. For example, if you had to shut down a service for a period of time, you could manually open all the circuit breakers of the service’s consumers so they fail fast while the service is offline. Once the service is back, you can close the circuit breakers and everything should go back to normal. This could all be automated as part of a deployment process - although in general moving to a model where services can be updated without causing a loss of availability is preferable.
Case Study: AdvertCorp
In Chapter 2, I introduced the real world example of AdvertCorp, which had suffered a major outage. As you may recall, on that project we had an issue with the Turnip system responding very slowly, before eventually returning an error. One of the issues we had was that the timeouts were too generous, leading to resource contention, so fixing those was a priority. Once we did that though, we realized we still had an issue. Even if we got the timeouts right, we’d be waiting for the timeout threshold to be reached before we received the error. During the failure mode we saw, the Turnip service had a fundamental issue - but we kept sending requests.
To deal with this issue, we decided to wrap calls to all the legacy systems with circuit breakers, as Figure 4-6 shows. When these circuit breakers blew into the open state, we programmatically updated the website to show that we couldn’t currently show adverts for, say, turnips. We kept the rest of the website working, and clearly communicated to the customers that there was an issue restricted to one part of our product, all in a fully automated way.
We were able to scope our circuit breakers so that we had one for each of the downstream legacy systems—this lined up well with the fact that we had decided to have different request worker pools for each service we were talking to.
For Client-only or Accord-based Back Pressure
A typical circuit breaker implementation would provide client-only back pressure. A client, on observing a certain number of call failures, would decide to open the breaker. This is the typical implementation of a circuit breaker you’ll see in connection libraries.
Theoretically though, you can put a circuit breaker into an open state based on information from the server. Rather than waiting for a certain number of failures, you could implement a circuit breaker to trigger on a single 429, using the Retry-After field to determine when the circuit breaker would reset to its “closed” state.
Issues With Circuit Breakers
On the face of it, the circuit breaker pattern seems like a straightforward and relatively simple way of implementing client-only back pressure. If you currently don’t have any back pressure in your system, it’s certainly a good place to start. However, there are some downsides that you need to be aware of.
Too Late To The Party
When relying entirely on locally available information to set a circuit breaker state, circuit breakers have the same downside as any client-only back pressure mechanism - by the time the back pressure is triggered and the circuit breaker is opened, the server we’re talking to is already having a problem. As such, our back pressure may not trigger quickly enough to maintain an acceptable degree of system stability.
Boom & Bust
Each circuit breaker is an all or nothing affair. It’s either letting requests flow, or it isn’t. This can create something of a boom and bust pattern. Consider Figure 4-7, which shows the number of requests being sent to a server over a period of time. As the requests build up, the server is struggling to deal with them, causing the failure rate of requests to increase (perhaps via client-side timeouts and/or load shedding). This then causes the circuit breakers in the clients to trigger, resulting in no requests being sent.
After a period of time, the server recovers, the circuit breakers close, and the calls come flooding back in. In this example, the peaks are too great for the server, but the server is then idle during the troughs - not a great use of our computing resources.
Circuit breakers give us a great way to fail fast, which is always preferable to failing slowly. Failing fast frees up computing resources more quickly, helps reduce load on potentially stressed components, and allows the system to carry out mitigating actions as soon as possible. But there is no nuance with the circuit breaker - a client either sends no traffic, or all the traffic.
Partial Failure
Another problem is how well circuit breakers deal with a server that is partially failing, resulting in a situation where some types of work seem to be processed without issue, but other types of work are getting rejected. Coming back to our example of the Customer service, imagine that you’ve decided to prioritize customer creation requests over SAR requests. The SAR limit is reached, so you start load shedding those requests.
In Figure 4-8 we see that a client sending calls to both customer creation and SAR requests is likely to have a single circuit breaker for the Customer service as a whole. If enough SAR requests fail, this could cause our circuit breaker to open–even if all the calls to create customers are working fine. In effect, the circuit breaker has made a partial failure worse.
There are a couple of ways to solve this. The first is to allow each initial request to go straight to the server, and only route retries through the circuit breaker. This means you are always sending some requests to the server, and you can use the success (or failure) of the initial attempt as a way to determine the state of the circuit breaker itself. This is more of a partial mitigation than a total fix - this approach ensures that at least some customer creation calls get through, but if one of them needs to be retried and the circuit breaker is open, then the retry will be rejected.
Coming back to the original analogy, in an electrical circuit a circuit breaker reacts to an issue in a specific circuit, and breaks that circuit. In a situation where some types of work are being processed successfully, but other types of work are not, it’s worth considering whether we actually have two circuits.
This leads to the second solution, which would be for your client to have two circuit breakers for talking to the same service - one for creation requests, the other for SAR requests, as we see in Figure 4-9.
If this solution looks appealing to you, I’d urge you to check what the underlying issues are here. Why is one set of requests failing while others work fine? It’s possible that the types of work are so divergent that splitting the service apart might be a more sensible approach.
Reducing vs Stopping Traffic
So far, we’ve mostly looked at traffic from the client being stopped entirely when the back pressure is applied - and this is exactly what a circuit breaker does for us.
Just stopping all work being sent from the client is a simple, and perhaps overly simplistic reaction to back pressure being applied. Let’s look at some of the downsides of this next, and also explore some alternatives.
Boom & Bust Cycle
If all clients simply stop all calls in response to back pressure, this will have an immediate impact, but we have the issue I discussed previously with respect to circuit breakers. In Figure 4-10 we revisit our previous example showing the load over time for a service. When the load hits its threshold, back pressure is applied, which in this case results in clients ceasing all work.
Once back pressure is applied, you can see the load plummet. The service is then idle until the clients decide to start sending calls again. At this point, it’s possible that there is still a lot of work the clients want to get done, so the calls come back, and the cycle is repeated.
In such a situation, throttling rather than stopping client-generated work could be beneficial. Rather than all work stopping, resulting in idle time for the server, if you can continue to get some work done, then there will be less work left to handle when the back pressure eases.
Prioritizing some work over other work can help resolve the issue of large peaks and troughs when back pressure is applied. For example stopping all SAR requests might reduce the load on the server enough that normal customer creation can continue, as we see in Figure 4-11.
A Delicate Balance
One of the problems with reducing vs stopping traffic is that if you still allow some work through, then in high load situations you may not be reducing traffic enough to make a difference. Imagine that you have dropped the SAR-related work, but are still allowing new customer creation. If at that point you are still over capacity, you’ll have to start rate limiting customer creation calls as well.
The issue then is that it takes longer for the service to reach equilibrium - rate limiting more aggressively earlier would have made the service healthy sooner.
As you can see, it’s not always clear which approach makes sense, so this is why understanding your traffic patterns is so key to working out which rate limiting technique is most appropriate.
Leaky And Token Bucket Rate Limiting
Rather than the boom and bust cycle we saw earlier in [Link to Come], it would have been preferable to reduce the service’s traffic to the point where it was within healthy bounds, whilst also reducing or eliminating periods when the server sat idle. A smarter client can help here, with the help of a bucket.
Picture a bucket with a hole in the bottom. At a regular, predictable rate, water drips out of the bottom of the bucket. From time to time, you need to add water to the bucket. If the bucket has room, you can add more water. If it doesn’t have room, you have to wait until enough water drains out of the bucket to make space for the new water you want to add.
If you increase the rate at which water is draining from the bucket, you increase the rate at which water can be added - and vice versa.
Let’s take this concept and apply it to our client talking to a service, as shown in Figure 4-12. When a client wants to make a request, it needs to add a token to the bucket. The bucket has a fixed amount of space for tokens, and these tokens leave the bucket at a steady rate. If there is room in the bucket, you can add the token. If there is no room, you have to either wait until there is, or just reject the request.
This ensures that requests from your client are made at a steady rate. By controlling the rate at which tokens drain from the bucket, you control the rate at which calls can be made.
A related algorithm is the token bucket, which is like the mirror image of the leaky bucket. To make a request, you have to remove a token from the bucket, and tokens are added at a regular rate. If a token isn’t available, you can’t make a call.
Both the standard leaky bucket and token bucket algorithms provide static rate limiting from the client. The net result of this approach is a predictable upper bound on the rate of calls that can be made, with a simple mechanism to dial that rate up or down as appropriate.
Variations & Implementations
There are a number of variations of the token/leaky bucket to be found. Resilience4j implements a version of the token bucket in its RateLimiter3, and .NET 7 provides an implementation in TokenBucketRateLimiter. Variations of this algorithm are also found in TCP and in message brokers4.
You could also consider adding a circuit breaker to a leaky bucket. If the server is healthy, water flows out of the bucket at the normal rate. If it is unhealthy, then the circuit breaker stops the flow. The circuit breaker here is being used in a different way - rather than stopping the calls, it’s plugging the hole in the bucket—although the end result is very similar to normal circuit breaker use.
Marc Brooker5 from AWS has proposed an adaptive retry mechanism using a token bucket. When a client makes the initial request, it is sent as normal. If the request succeeds, a partial token (e.g. 0.1 of a token) is added to the bucket. If a retry is required, that needs a full token from the bucket.
Marc’s simulations seem to show that this mechanism performs better at lower failure rates than a circuit breaker approach, with the potential downside of still creating some load when failure rates are high.
Conclusion
Everything starts with knowing what the limits of your server are. At what point do you start having issues? How many items of work can you try and handle at once before the quality of service starts degrading? Without this information, it becomes difficult to know where to set the thresholds for things like load shedding or back pressure.
Your goal is almost always going to be about protecting the system as a whole, rather than trying to handle every bit of work. With that in mind, load shedding is an excellent technique to protect servers. Even if you are more interested in back pressure, it makes sense to start with load shedding first.
In situations where you can’t control the client, or the clients cannot be given the smarts to apply back pressure, load shedding will be vital. Consider a public-facing website where you’re exposed to traffic from the wider internet, or a situation where your clients are actually IoT devices where rolling out changes to how they work is not practical.
If you already have simple server-side load shedding in place, and also have the ability to change the behavior of both client and server, a sensible next step is to implement accord-based back pressure, allowing clients to back off in a smarter fashion, in concert with information coming from the server.
On the other hand, if you are unable to change the server, but can change the client, then implementing client-only back pressure, either via a circuit breaker or a leaky/token bucket approach, would be the way to go.
This chapter has focused on how to reduce the amount of work being sent to a server. Sometimes though rate limiting may not be enough, and back pressure may not be possible. In the next chapter, I’ll take you through a variety of situations where spikes in load may threaten the resiliency of your system, and give you some more ideas on how to deal with them.
1 Newman, Sam. Monolith to Microservices. Sebastopol: O’Reilly, 2019.
2 Nygard, Michael. Release It!, 2nd Edition. Pragmatic Programmers, 2018.
3 https://resilience4j.readme.io/docs/ratelimiter
4 RabbitMQ calls this “Credit Based Flow Control”. It’s an interesting variation, as the “credits” actually propagate across multiple parties.