Chapter 4. Interfaces

You shouldn’t be uneasy about any parts of the architecture. It shouldn’t contain anything just to please the boss. It shouldn’t contain anything that’s hard for you to understand. You’re the one who’ll implement it; if it doesn’t make sense to you, how can you implement it?

Steve McConnell, Code Complete (Microsoft Press)

In Chapter 3, we discussed architecture. Traditionally, people think of the architecture of a system as the boxes in the diagram, but even more critical are the lines connecting the boxes. These lines can signify many things, and they are an abstraction of how these systems connect and communicate. In this chapter, we will study one major part of those lines: interfaces.

This chapter will discuss what an interface is, what to consider when creating or connecting to one, and the most common constructs the cloud providers offer for these lines.

Adopting modern application design composed of smaller, independent components or services enables you to focus on development and empowers you to make the best choices in how you want to solve each problem. You should be able to focus on the features and the business logic, and the infrastructure should give you lift. But this does not come for free. You have to mind every point where coupling can occur and minimize that coupling as much as possible. As previously discussed (in Chapter 3), you have to have rules and standards for your services’ interfaces and how they interact with other services. But you will also have to implement software using the interfaces provided by other services internally, or by ones outside of your company altogether, such as Stripe or Twilio. We will cover how to best handle interfacing with software you don’t control from both perspectives.

Your service relies on other services. Other services will rely on yours. The key to winning is to design your service with purpose and foresight. You will make trade-offs. Document, expose, and hold steadfast to these trade-offs until they are no longer necessary.

Interfaces: Some Assembly Required

In the scope of this chapter, an interface is the surface area between two components of the application. It is how they join together in order to serve a larger purpose. These components can be internal or external, proprietary or open source, self-hosted or managed. For our purposes we are concerned with the structure and schema of the messages being passed around, how they are passed around, and what happens when these actions do not behave as expected.

The Message

The message is what is being sent between components and how that message is packaged. This may include information about the requestor, headers, sessions, and/or information to validate that a given request or task is authorized. The most common encapsulation of these messages will be in JSON.

The Protocol

The most ubiquitous application-level protocol in the world today is HTTP. (And don’t forget about the S.) Remember, networks can’t be trusted, ever. They are not safe, they are not secure, and they are not reliable. HTTPS at least provides some assurances that messages are not being improperly modified.

Be mindful of abstractions when discussing or debugging interfaces. For example, when a developer says HTTP, they generally mean HTTP over TLS (HTTPS) over TCP over IP over Ethernet or fiber. Any one part of that stack may fail, cause issues or limitations, or otherwise drive your implementation details.

The API you utilize to issue commands to your cloud provider is implemented over HTTP, and HTTP is even used by cloud instances to get credentials to connect to those APIs. However, you are not limited to HTTP. Many providers have the option to communicate to clients over WebSockets. Your functions can utilize any type of outgoing network connection to talk to other systems. For example, SFTP is still commonly used to move data and even money around in nightly batch jobs, and you can use a periodic invocation to start such a task.

The Contract

Finally, your interface includes the contract, or expectation of what will happen as a result of a certain message. This is the functionality that you expose to software clients of your component, generally via documentation. For example, what should happen if a client tries to add the same email address to a mailing list twice? These are the decisions you will be left to make, and you must provide a human-readable artifact to convey promises and expectations to those integrating with your service.

Serverless Interfaces

Before we discuss designing interfaces, let’s examine the options and building blocks available in serverless, and some of the characteristics of serverless compute components in your systems.

When connecting the architectural boxes of our serverless functions, we can choose between two types of invocations: synchronous and asynchronous. Synchronous, or request/response, invocations are blocking operations, meaning the caller is waiting for the function to finish its request before returning a response. Asynchronous invocations, also known as events, are nonblocking and don’t require the request to complete before responding.

A good rule of thumb is that if the action or logic that invokes a function cares about the result of the function in order to fulfill its own objectives, it fits into the synchronous model. If it does not directly care about the result of the function (other than knowing it was triggered), it is best served by the asynchronous or event model. In the asynchronous model, the result or actions taken by a function will likely be important to the overall application, but not specifically to the action or logic that first triggered it.
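
To make the distinction concrete, here is a minimal sketch using boto3 (the AWS SDK for Python) that invokes the same function both ways; the function name and payload are hypothetical.

```python
import json

import boto3  # AWS SDK for Python

lambda_client = boto3.client("lambda")
payload = json.dumps({"order_id": "12345"}).encode()  # hypothetical payload

# Synchronous (request/response): the caller blocks until the function
# returns, and the response body carries the function's result.
sync_response = lambda_client.invoke(
    FunctionName="process-order",        # hypothetical function name
    InvocationType="RequestResponse",
    Payload=payload,
)
result = json.loads(sync_response["Payload"].read())

# Asynchronous (event): Lambda queues the event and returns immediately
# with a 202 status; the caller never sees the function's result.
async_response = lambda_client.invoke(
    FunctionName="process-order",
    InvocationType="Event",
    Payload=payload,
)
assert async_response["StatusCode"] == 202
```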

Some integrations offered by your cloud provider may surprise you with the type of invocation utilized. For example, processing a stream of data that has been written to the database is a very asynchronous action. It is literally a building block of an event-driven architecture. But since the stream is processed in order, at least when utilizing DynamoDB, the actual function invocations are synchronous. This is because the hidden component responsible for keeping track of its place in the stream and invoking your functions with your business code relies on the result of each invocation both to update its state and to fire the next invocation.

Automatic Retries and Dead Letter Queues

Sending failed function invocations automatically to a queue of failures, or a dead letter queue, is a fundamental building block of an effective serverless component.

In AWS, asynchronous invocations will be retried automatically up to three times, with no control on your part as to how. After that, you can have failed invocations sent to a failure queue. With Google Cloud Functions, you have the option to enable retries for background functions. However, Google cautions that invocations will retry repeatedly for up to seven days, so retries should be used only for retriable failures. Azure offers dead letter queues and retry behavior for certain types of integrations.
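
As a rough sketch of what this looks like on AWS, the following uses boto3 to cap retries for asynchronous invocations and route anything that still fails to a queue; the function name and queue ARN are assumptions.

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical ARN of an SQS queue acting as the failure (dead letter) queue.
FAILURE_QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:process-order-failures"

# For asynchronous invocations, cap the automatic retries and route anything
# that still fails to a queue you can inspect and replay later.
lambda_client.put_function_event_invoke_config(
    FunctionName="process-order",       # hypothetical function name
    MaximumRetryAttempts=2,             # retries after the initial attempt
    MaximumEventAgeInSeconds=3600,      # drop events older than an hour
    DestinationConfig={
        "OnFailure": {"Destination": FAILURE_QUEUE_ARN},
    },
)
```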

Concurrency

An important component of serverless compute is the ability to set a concurrency per function, as well as an overall maximum concurrency across all functions. The concurrency is the number of simultaneous invocations being processed at a given time. The biggest benefit of the granularity of deploying serverless functions is the ability to scale a function independently of others based on demand, and this setting is quite literally the scale of a given function.

Why not just set this to its maximum? First, you want to prevent unexpected behavior, so it is best to never leave any option unbounded. A runaway component can cause havoc on other parts of the system, not to mention your monthly bill. The unlimited scale of serverless is powerful and will break other components if not kept in check.

Also remember that your cloud provider will have default limits to the concurrency of your overall account that you will want to incorporate into planning for the future. If you do not have a support contract with your cloud provider, it may take a week for them to respond to an increase in a service limit.
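
A per-function concurrency cap is a one-line setting. The sketch below uses boto3 against a hypothetical function; the value of 50 is purely illustrative and should come from your own capacity planning.

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve (and cap) concurrency for a single function so a runaway surge
# can't starve the rest of the account or overwhelm downstream systems.
lambda_client.put_function_concurrency(
    FunctionName="process-order",        # hypothetical function name
    ReservedConcurrentExecutions=50,     # at most 50 simultaneous invocations
)
```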

Finite Versus Infinite Scale

Serverless as a paradigm will break other services that are not set up for massive and instant scale. What are you going to use for caching? How does it scale in relation to demand?

Benchmark your tools, or find others who have. Have a plan for handling that load; maybe you can even use a function.

Your customers will always be the least predictable point in your system. Will a surge in new user sign-ups cause an influx of invocations to your service? Sure, your serverless compute will be able to scale, but other nonserverless components of your application may not be able to handle the sudden increase in load. One interesting solution is to use functions to scale up or down other parts of your infrastructure based on demand.

Designing Your Interfaces

In the FIRST Robotics Competition,1 they limit the maximum width of robots to 36 inches because that is the width of a door. They could allow robots to exceed this width by being disassembled, or possibly even rotated, but enforcing this limitation greatly simplifies the transport of all robots developed for the competition. Keep commonsense ideas like this in mind when developing the standard operating procedure for your services.

Don’t, however, allow these standards to limit your technological choices. A commonly used but not perfect pseudostandard exists in JSON because at this point, it’s likely that even your light switch can encode and decode JSON.

Consistency doesn’t improve the reliability, resilience, and scalability of your system by magic; it does so by setting and communicating clear expectations of how components interact with each other, and by reducing the cognitive load required to develop, debug, and maintain your applications.

Because you are going to have many different independent components, such as functions, in your serverless system, having a strict design for how the services interface with each other will be critical to long-term stability.

Services are becoming increasingly distributed. With that distributed nature comes increased complication. As discussed in Chapter 1, a small service with a well-defined responsibility is simple. A constellation of those simple services is complex. The rest of this chapter will discuss best practices around how your service interacts and depends on other services, as well as how other services will interact and depend on yours.

Messages/Payloads

It is important to thoughtfully design both the input and output payloads of a system.

JSON

Most messages are passed around in JSON. JSON is not perfect, but it is omnipresent. As with any universally used tool, it does not handle every single use case with grace and perfection. For example, JSON’s number type may not always behave the way you expect, because JavaScript represents all numbers as 64-bit floats and therefore cannot faithfully represent 64-bit integers. This is a perfect example of how your components will have to adapt to their interface, and how interfaces will impact implementation details. While problems like this should be minimized, JSON is often not an intentional choice so much as the choice made by popular vote.

Thoughtful design of your payloads should also include creating a standard format for error messages when a unit of work runs into a problem. Remember, just because you might expect something to work and your code did not raise an exception or return with an error, does not mean it worked as expected.
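
One possible shape for such a standard error envelope is sketched below; the field names are illustrative rather than a prescribed format, and the point is only that every component should report failures in the same shape.

```python
import json
import uuid
from datetime import datetime, timezone

def error_payload(code, message, details=None):
    """Build a standard error envelope so every component reports
    failures in the same shape (fields here are illustrative)."""
    return {
        "ok": False,
        "error": {
            "code": code,              # machine-readable, e.g. "DUPLICATE_EMAIL"
            "message": message,        # human-readable summary
            "details": details or {},  # optional structured context
        },
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

print(json.dumps(error_payload("DUPLICATE_EMAIL", "Address already subscribed")))
```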

Securing messages at rest

HTTPS provides encryption in transit to keep messages secure from eavesdroppers. Encryption at rest is the principle of ensuring data is encrypted when it sits on a disk. The payload of a function invocation may be stored on disk, but not all payloads are stored securely. Keep this in mind when deciding what data to pass around in messages, and utilize proper encryption on any data that may touch a disk. Ensure that your failure queues utilize encryption at rest, if possible. Avoid logging sensitive data.
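
For example, if you are on AWS and using SQS for your failure queue, encryption at rest can be enabled when the queue is created; the sketch below uses boto3, and the queue name and key alias are assumptions.

```python
import boto3

sqs = boto3.client("sqs")

# Create a failure queue with server-side encryption so payloads that sit
# on disk inside the queue are encrypted at rest.
sqs.create_queue(
    QueueName="process-order-failures",      # hypothetical queue name
    Attributes={
        "KmsMasterKeyId": "alias/aws/sqs",    # or a customer-managed key
        "MessageRetentionPeriod": "1209600",  # keep failed messages up to 14 days
    },
)
```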

Sessions and Users/Auth

An important part of your interfaces to consider is authentication. Authentication is knowing that an entity is who it says it is. Depending on how a function is invoked, or a component processes a task, there is either an implicit or explicit authorization component that depends on that authentication. Authorization is ensuring that an identified entity is permitted to perform an action, or access certain data. Never trust a message payload on its own merit, as the network is never to be trusted. If the function was executed, you can generally assume the caller has some authority to do so. But some serverless patterns will rely on information about the user session, provided by an API gateway. Never take this data at face value: always validate it in some way. For some systems, this means utilizing JSON web tokens (JWTs); for others, it means validating the session information with another service.
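
As a sketch of the JWT route, the helper below uses the PyJWT library to verify a token’s signature, expiry, and audience before trusting its claims; the audience value and key handling are placeholders for whatever your identity provider specifies.

```python
import jwt  # PyJWT; one of several libraries that can validate JWTs

def validate_session(token, public_key):
    """Never take session claims at face value: verify the signature,
    expiry, and audience before trusting anything inside the token."""
    try:
        claims = jwt.decode(
            token,
            public_key,
            algorithms=["RS256"],     # reject tokens signed any other way
            audience="my-service",    # hypothetical audience value
        )
    except jwt.InvalidTokenError:
        return None                   # treat as unauthenticated
    return claims                     # e.g. {"sub": "user-123", ...}
```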

Avoid Unbounded Requests

Some requests are not bound by time and use time-outs to compensate for that. As you write your code, don’t write just for now; write for future scale, and incorporate consistency and standardization. One such standard would be to never allow an unbounded request by default. For example, a query against a SQL database should include a LIMIT clause by default, both to prevent it from growing in time complexity as your usage grows and to protect the precious resource that is the database.
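
A minimal illustration of that default, assuming a SQL database accessed through a standard Python DB-API cursor; the table, columns, and page size are hypothetical.

```python
DEFAULT_PAGE_SIZE = 100  # hypothetical default; unbounded reads are opt-in, never the default

def fetch_recent_orders(cursor, user_id, limit=DEFAULT_PAGE_SIZE):
    """Every query carries a LIMIT so result sets can't grow without bound
    as usage grows (table and column names are illustrative)."""
    cursor.execute(
        "SELECT id, total, created_at FROM orders"
        " WHERE user_id = %s ORDER BY created_at DESC LIMIT %s",
        (user_id, min(limit, DEFAULT_PAGE_SIZE)),
    )
    return cursor.fetchall()
```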

HTTP was widely adopted in part due to its versatility. It is powerful but not a perfect protocol, and developers struggle to utilize its full power and capability. One underused feature is headers, which are a great way to encapsulate metadata about a request; they can be extended using the X- namespace to indicate a nonstandard header. Most custom headers are implemented with an additional namespace as well, such as X-LEARNING-SERVERLESS.

Status codes are integral to success with HTTP as a transport mechanism, but your services should define the minutiae of what each status means. In addition, be mindful of how external services interpret and apply their own status codes. Generally speaking, statuses in the 200 or 2xx range are successful requests, statuses in the 4xx range indicate an issue with the validity of the request, and statuses in the 5xx range are reserved for server-side issues and errors. But not all statuses are implemented by the book. For example, if you visit a private GitHub repository while logged out, or while using an account that does not have access to that repository, you will get a 404, Not Found. The application is telling you it is not found, even though it exists. GitHub in fact found it, determined you were not able to see it, and instead of leaking data about its existence, lied and said it was not found. This is considered by many to be a best practice, and it is another reason why the implementation of your status codes should be standardized and well documented.

Another example of the power of granular status codes is sharing that a result was successful, but that the system already knew about it. You may want to return a success message regardless of the previous state because the end result is the same. You may also want to return a more specific status such as 208, Already Reported. But you may not want to provide such information externally, as it could be useful to hackers to know whether a user with a leaked password has an account on your system. Many times, a website with strict rate limiting and monitoring on incorrect login attempts will leak information about which emails are registered on another endpoint, such as password reset or sign-up. Never let your interfaces leak accidentally.

Interface Versus Implementation

Just as an interface should not dictate an implementation, an implementation should not dictate the interface. I was working on a system with a bunch of rules codified in a YAML file. While I was onboarding another engineer to the team, an error with that file caused part of the system to stop functioning. The engineer wanted to create a test case for the CI/CD pipeline that would prevent a bad configuration from being deployed. Sounds like a solid use case of best practices…right? Until I explained, “That’s not a file, it’s a database.” The file consisted of rules that were meant to operate independently of each other. A mistake in one entry should not prevent the whole system from running. The database happens to be a file because we don’t need a database. A bad entry in this file shouldn’t prevent a good entry from going out in the same commit or deployment. It is important that the file doesn’t have any syntax errors (corrupt database), and maybe that the data is in the correct layout (validating the data before saving it). In this example, the interface is not the implementation. For now, we care about how the rules were processed, not how they were stored.

Remember, your interface should not leak your implementation details, as then you become stuck on one way of doing things. You want to have flexibility in how you implement it.

Avoid hidden coupling and interfaces

What happens when you share a datastore such as Redis with another service? (Redis is an in-memory datastore commonly used for caching, or for storing temporary data such as user sessions.) Sometimes, even sharing something as benign seeming as S3 or bucket storage can break the interface of a service and cause issues for all involved. Instead, have requests come to your service, which can use a redirect code such as a 30x to point clients at the underlying resource as the current implementation; keeping that request flowing through your service will save a lot of trouble down the road if you ever want to modify the behavior of this component or even change the underlying storage.
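
One way to keep the storage behind your own interface is sketched below: a hypothetical Lambda handler behind an API gateway that answers with a short-lived redirect to a presigned S3 URL. The bucket, key layout, and event shape are assumptions.

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Hypothetical API Gateway handler: clients always call *your* endpoint,
    and the redirect to the current storage backend stays an implementation
    detail you are free to change later."""
    report_id = event["pathParameters"]["report_id"]
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "reports-bucket", "Key": f"reports/{report_id}.pdf"},
        ExpiresIn=300,  # short-lived link
    )
    return {"statusCode": 302, "headers": {"Location": url}}
```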

Lines with Logic

When we zoom in on an architectural diagram, we see that the lines are really more like boxes—and those boxes are spring-loaded. They absorb load, but when given too much load that is not released, they can fail. I introduced these components in the previous chapter, and we will now look at a couple of options for designing them.

Queues

Queues are a great way to decouple two components of a system. You can reliably pass a message or unit of work between systems without forcing them to interact directly, and you can store messages while a component is down. They are like voicemail for your systems! And just like voicemail, they have limits and automatic purging of stale messages. Be sure to understand the promises your queue makes, a part of its interface, when integrating it into your system.
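
A minimal producer/consumer sketch with SQS via boto3 is shown below; the queue URL and message contents are hypothetical, and the consumer deletes a message only after the work succeeds so that failures are redelivered.

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/email-tasks"  # hypothetical

# Producer: hand the unit of work to the queue instead of calling the
# other component directly.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"action": "send_welcome_email", "user_id": "123"}),
)

# Consumer: pull messages when ready, and delete them only after the work
# has actually succeeded so failures are redelivered.
response = sqs.receive_message(
    QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
)
for message in response.get("Messages", []):
    task = json.loads(message["Body"])
    # ... do the work ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```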

Streams/Event bus

A stream, or event bus, links two items together in a decoupled and scalable way. These components are a great way for actions in your system to have reactions without having to explicitly hardcode the reactions in the original source of the action. You also benefit by deferring tasks that don’t have to happen immediately as the result of an action but can happen in near-real time, instead of letting an inability to trigger a reaction cause the original action to fail.
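
For instance, on AWS an action can be published to EventBridge and any number of consumers can subscribe to it later; the sketch below is illustrative, and the bus, source, and detail-type names are made up.

```python
import json

import boto3

events = boto3.client("events")

# Emit an event describing what happened; any number of consumers can react
# later without the producer hardcoding (or even knowing about) them.
events.put_events(
    Entries=[
        {
            "EventBusName": "default",        # or a hypothetical custom bus
            "Source": "com.example.orders",   # hypothetical source name
            "DetailType": "OrderPlaced",
            "Detail": json.dumps({"order_id": "12345", "total_cents": 4200}),
        }
    ]
)
```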

Designing the Unhappy Path

Yes, it is time to talk about the author’s favorite topic, failure.

The surface area between services, or how their interfaces interact, is the most critical failure point and requires adequate design to be properly decoupled.

A cornerstone of being an effective engineer is being able to turn as much unexpected behavior as possible into expected behavior. We don’t have infinite time, so we can’t do this for all aspects, but sometimes it may be as simple as properly documenting something unexpected so that it’s expected.

Validating Input

Be sure to validate all input that flows into your components; do not even trust the metadata about the request itself. You never know when a request “authenticated” by your cloud provider is the result of inadvertently misrouted traffic, or of unauthenticated traffic being let through. That is why cloud providers recommend validating even that data to ensure it is authentic. Just because you can npm install a plug-in that gives you authentication, or click some button on your cloud provider’s console, that doesn’t mean your integration work is done. You must validate all your services. Remember that the nature of the network means you will receive events after the code that generated them has been replaced, and you will even receive messages intended for other services that may have previously occupied the same IP address.

Even webhooks (which we will discuss later in “Webhooks”) from service providers such as Stripe must be validated. There is no way to accurately validate the sender of the message using the network alone, so you must verify the signature they provide as authentic before taking any actions based on the message.
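
The exact scheme varies by provider, but the shape is usually the same: recompute an HMAC over the raw payload with a shared secret and compare it in constant time. A generic sketch (not any specific provider’s scheme) follows.

```python
import hashlib
import hmac

def is_authentic(payload_bytes, signature_header, shared_secret):
    """Generic HMAC check of a webhook signature. Real providers (Stripe,
    GitHub, etc.) each document their exact scheme; this sketch only shows
    the shape: recompute the signature and compare in constant time."""
    expected = hmac.new(
        shared_secret.encode(), payload_bytes, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```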

Failures

If interfaces are the surface area between components of your application, failures are cracks that wish to spread using these interfaces. Any place where two components are connected is a point of eventual failure. Thoughtful interface design can minimize failure, but its occurrences can never be reduced to zero, and therefore you must design for them in your systems for maximum resilience, and minimum wake-up calls to fix broken services.

Partial failures

A partial failure is a task execution that performed some work before it failed. It is a pain point of developing robust systems, as some steps of a task may be successful, and trying again can cause a failure due to that partial success. Earlier when discussing contracts in “The Contract”, we asked how you might handle trying to add a user to a mailing list that is already registered. If you have chosen to return a failure in this situation, it may prevent a retried task that depends on this step from being reprocessed successfully. In these cases, idempotence is your friend: that is, the same action performed multiple times with the same result every time. You may want to return a success message for the idempotent step regardless of the previous state because the end result is the same, and this may help you when dealing with partial failures so they can be retried successfully.

But this will not be the case with all actions, so you may need to take extra care when writing the application code for your functions to handle steps that may have already completed successfully. You may not think that this is part of your interface, but it definitely will be exposed and should be taken into consideration not just in the implementation, but also in the contract and communicated expectations of the component.
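
One common way to get idempotence is a conditional write. The sketch below uses a hypothetical DynamoDB table for the mailing-list example and reports an already-subscribed address as success, which keeps retries of a partially failed task safe.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("mailing-list")  # hypothetical table

def subscribe(email):
    """Idempotent subscribe: a repeat of an already-applied step is reported
    as success, so a retried task that partially completed can run again safely."""
    try:
        table.put_item(
            Item={"email": email},
            ConditionExpression="attribute_not_exists(email)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise  # a real failure: let the retry machinery handle it
        # Already subscribed: the end state is the same, so treat it as success.
    return {"ok": True, "email": email}
```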

Cascading failures

Cascading failures are when a failure in one part of the system or application spreads throughout the system. Want a quick idea of this? If you are running a classic “three-tier” app, imagine what would happen if you shut down the database. Depending on the implementation of your service, it would likely cause delays or time-outs and would take down your service. The failure has spread.

Now imagine instead, someone pushes a database migration that locks the user table in a way that prevents login from succeeding. Eventually, multiple users unintentionally hammering the login will use up all the connection pool resources (you are using a connection pool, right?), and all database connections will be taken by processes waiting for the table to unlock. The actions of users who were able to browse the site begin to slow down to the point of total failure, where all the available instances running the monolithic web app are tied up with requests waiting for the database, and any newly spun-up instances are waiting for database connections, which are fully exhausted.

To avoid this type of failure, you must isolate and decouple services, as well as section off failures.

The poison pill, or the importance of interface stability

For synchronous events, handling retries is up to the caller of the function. For managed integrations, such as our previous example with streams, where the invocations are synchronous but the overall appearance of the component to you is asynchronous, the implementation logic of the cloud provider will be responsible for retries. In the case of DynamoDB streams, there is a metric you can consume or alert on, called IteratorAge, that lets you see the status of the internal AWS logic handling the stream, or the iterator. This is how you know that your stream is blocked by what is commonly known as the poison pill. The poison pill is a great example of the importance of interfaces. If there is a message in a stream that cannot be processed, it will prevent the consumer of that stream from moving forward to the next message. One bad line of code here can hold up your entire system. One failing component can cause others to fail in a set of cascading failures.
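
If you are on AWS, one way to catch this early is a CloudWatch alarm on the IteratorAge metric; the sketch below uses boto3, and the function name, threshold, and alarm action are all assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the stream iterator falls behind, which is usually the first
# visible symptom of a poison pill blocking the consumer.
cloudwatch.put_metric_alarm(
    AlarmName="orders-stream-iterator-age",   # hypothetical names throughout
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "process-order-stream"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=300_000,                        # five minutes, in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:on-call"],
)
```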

Don’t fail silently

Do not let important failures drop silently on the floor unnoticed and unfixed. Other than the previously mentioned retry behavior of certain asynchronous function invocations, failures will go unnoticed by default. Not every failure needs to sound the alarms, but a good starting point is to use a dead letter queue when you can, and a platform for monitoring exceptions such as Sentry. Every task and message in your system has some level of importance, and should not be relegated to a data point on a chart of failures. Engineers may make jokes about only testing their code in production, but even when you have an exhaustive test suite, there is no better source of truth of what is currently broken than the errors being faced in the realities of production traffic.

Later, in Chapter 6, we will discuss monitoring so that your systems can alert you to their own health and to a potential degradation of service.

Strategies for Integrating with Other Services

Finally, as you pull all this together into your system design, there are several strategies to consider that can help make integration with other services seamless.

Time-Outs

Any operation can fail, but usually it’s one that relies on the network or any component of a computer that is not the CPU or RAM. If you are having issues with the CPU or RAM, you have much bigger problems to deal with; with functions or containers, the broken node should eventually fail and be brought back up. But if you are sending or receiving data over the network, or even reading a file from local storage, you will want to be mindful of time-outs.

Computers are very obedient. If you tell the computer to fetch a piece of data over the network from an unresponsive system, by default, the computer will wait forever! Imagine sending your dog outside to fetch the paper, but the newspaper goes out of business. Your dog will sit outside obediently, waiting forever. You would be surprised how bad the default settings for time-outs are in many popular languages and libraries, or even in the kernel-level networking implementation.

Luckily, serverless functions have an inherent time-out by default. If you have a function that is a discrete and retriable unit of work and it is OK for it to partially fail and be retried, boom, you now have time-outs! But when and where should you use time-outs? The short answer is: always and everywhere.

Luckily, in the world of functions, there is a shortcut. If your function does one thing but takes a couple of network connections to get it done, you can set a time-out on your function. In fact, you have to. A time-out that is applied only to the connection will not protect you against a very slow but active response trickling in over the network. But, let’s say you have a one-minute time-out on your function. If you want to get a lot of HTTP requests done in a function invocation, you want to set a reasonable time-out on each of those requests. Check with the library you are using and its defaults. Some libraries have no time-outs by default. Some have multiple time-outs you can set, and for good reason. There will likely be a time-out for a connection to be established and a time-out for the maximum time elapsed while waiting for packets from a server, as well as an overall time-out. A connection may be established quickly, and the server may consistently respond with additional information, but that may not be enough to prevent the request from taking too long.
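
As one example, the popular Python requests library accepts a (connect, read) time-out tuple and has no time-out at all unless you pass one; the endpoint and values below are illustrative. Note that the read time-out bounds the gap between bytes, not total elapsed time, which is one more reason the function-level time-out still matters.

```python
import requests

# requests has NO time-out by default; always pass one explicitly.
# The tuple form sets a connect time-out and a read time-out separately.
try:
    response = requests.get(
        "https://api.example.com/v1/things",  # hypothetical endpoint
        timeout=(3.05, 10),                   # 3.05s to connect, 10s between bytes
    )
    response.raise_for_status()
except requests.exceptions.Timeout:
    ...  # treat as a retriable failure
```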

Be mindful of service limits and time-outs when designing your own time-outs. Keep in mind that Amazon API Gateway, for example, has a maximum 29-second time-out. If your lambda takes 60 seconds, your users will get a 504 gateway timeout response. Your lambda will think everything went great, and your user will think it didn’t work at all. The user will retry, and you will get stuck performing the same work twice; convinced it doesn’t work, they will try again. Adjust your time-outs to coordinate with your services’ time-outs.

Retries

Retrying work has an inherent balance to it. Retry too soon, too often, or too many times, and your attempt to make sure one unit of work gets done can quickly prevent any work from being done throughout the whole system.

An incurable, or terminal, error is one that has no chance of a successful outcome if retried. In reality, it may just be a temporary condition where the chance of a successful outcome is close enough to zero to round down. Depending on the observer, or designer, of the system, an error that would likely succeed eventually if retried may still be considered terminal in the current situation. A simple example would be a lambda with a time-out limit of 60 seconds trying to access a crashed system that takes at least 5 minutes to recover. Sure, the error itself is not terminal, but given all the parameters available, it has a 0% chance of succeeding. But that does not mean the work should not be retried. Even if that unit of work gets retried until its natural exhaustion into a failure queue, by the time it gets there, the other system may be up and running, and the error may no longer be terminal. You should plan for how to inspect and/or retry failures from your failure queues. If you just open the floodgates and reprocess the entire failure queue against a service that is recovering to full health and handling the backlog of retries from other components, you can easily cause it to fail again. By coordinating your systems with those you work with, you’ll be better able to prevent bigger, scarier failures.

Exponential Backoff

Exponential backoff is the strategy of increasing the amount of time between retries exponentially. It prevents a component that is already struggling to perform a task from being overwhelmed with retries. Instead, by using an exponentially increasing delay, a number of distributed components can coalesce on a retry strategy without any coordination.

This is useful for any type of network-based request that can fail and should be retried. You can use it for connecting to your database, interacting with third-party APIs, or even retrying failures due to service or rate limits.
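
A minimal sketch of exponential backoff with full jitter is shown below; the attempt counts and delays are illustrative, and in practice you would catch only the exception types you consider retriable.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky operation with exponential backoff and full jitter
    (a common variant; the parameters here are illustrative)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # let it land in the failure queue
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # jitter spreads out the herd
```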

Webhooks

A webhook is an incoming HTTP request from a third-party API to an endpoint you register with that provider. REST APIs are not bidirectional, so a popular API such as Stripe will utilize webhooks to give you updates on changes so that you do not have to poll for them. The interface for the webhook, or the schema and behavior it is expected to implement, is defined by the third party.

An external service such as Stripe will send you very important webhooks, such as a failure to renew a subscription, or even a chargeback.

Now let’s think about this in the legacy world. Imagine your payment processor called you with the fact that a user’s payment bounced. Would you put them on hold while you go and figure out what you are supposed to do with that information? Or do you write it down, maybe verify that you have the information correct (and verify the identity/authenticity of the information), save it somewhere important, and tell them that you received it? They don’t care what you do with that information; that’s outside the scope of their job. Their job is just to tell you. Your job is to faithfully receive that information and make sure something happens as a result. Anytime you want to take a synchronous action and make it asynchronous, this works too.

Tight coupling in your applications can cause cascading failures. These can even happen across applications. You may operate a SaaS offering that delivers webhooks to other applications across the internet. If they tightly couple that HTTP request to their database, an influx of traffic can cause an outage. It’s more common than you would think. Decouple anything and everything you can.

In this case, take in an HTTP request through an API gateway to a function invocation. Validate that the payload is well formed and authentic, and then throw it into a queue, stream, or messaging bus. Return the appropriate HTTP status code for the payload to the sender of the webhook. This is very important because it helps you in other ways too…let’s say your database is down. The sender of the webhook may not care at all. If you give them a 5xx status code, they will faithfully retry, and those retries slowly build into a DoS attack on your systems, since they promised you delivery of these messages and retries. Instead, if some other service is down, you can just buffer up all the work and pick it back up when it matters.
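
Putting it together, a hypothetical webhook handler might look like the sketch below: verify the signature, drop the payload onto a queue, and acknowledge. The environment variable names, header, and signature scheme are assumptions.

```python
import hashlib
import hmac
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["WEBHOOK_QUEUE_URL"]   # hypothetical environment variables
SECRET = os.environ["WEBHOOK_SECRET"]

def handler(event, context):
    """Hypothetical API Gateway handler: verify, buffer, acknowledge.
    The heavy lifting happens later, in a consumer of the queue."""
    body = event["body"]
    provided = event["headers"].get("x-signature", "")
    expected = hmac.new(SECRET.encode(), body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, provided):
        return {"statusCode": 400, "body": "invalid signature"}

    # Durable buffer: if the database (or anything else downstream) is having
    # a bad day, the work waits here instead of failing the webhook.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
    return {"statusCode": 202, "body": json.dumps({"received": True})}
```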

Evaluating External Services

If you have the luxury of choosing or recommending services to integrate with, and you likely do if you are reading this book, search on the internet for other developers complaining about what that other service can’t do. What issues are they having? How many issues do they have open on their GitHub? What are they searching for on Stack Overflow about that system? How many migrated to a competitor after they hit some serious traffic or issue?

Choose great APIs

Choose a service with great APIs. Look for a clean abstraction around difficult processes you don’t want to manage. Then, if for some reason in the future they can no longer facilitate your use case, you can still use the API you integrated with and make your own implementation. You don’t have to be stuck with their service, but you’ll save time by sticking with their API.

Read their docs

Read (or scan) all of the docs before implementing or choosing a service. Look for the trade-offs they had to make. What are the limitations? Kick the tires; read about things even if you don’t yet know what you want to do with them. Maybe you will get inspired. Maybe you will uncover some hidden knowledge. Maybe you will find out that in order to get feature x to work, you really need to do action y. (We will talk about documenting your service with a runbook in Chapter 11.)

Rate Limits

The services you interface with likely have rate limits, so in addition to the consideration of using rate limits with your own interfaces, you should consider how to be a polite user of rate limits. Just because there are rate limits does not mean you have to brute force API requests until they are successful. Use concurrency limits for functions that talk to rate-limited services, and remember to allocate that rate limit across all the functions that interact with that service, and across regions, if you are using multiple regions. If you are allowed to perform 100 requests per second, and you are in 2 regions, you should limit concurrency to 50 in each region. Also, regardless of this safeguard, utilize retry mechanisms such as exponential backoff to safely retry when you do encounter a limit.

Conclusion

When designing your system, don’t just think about the boxes—think about the lines too, the interfaces. Ultimately, the choices you make for your interfaces will reflect the culture and norms of your engineering organization, but the encoding and transport will likely be some form of JSON over HTTP. Never trust any message based on the assumption that it must be valid if you were able to receive it. Just as you may push an error to production, so might the network team at your cloud provider. Last but not least, always plan for errors and failures, and plan how to minimize the impact of preventable issues.

Congratulations! You now have the basic system design information needed to get started with serverless.

1 FIRST: For Inspiration and Recognition of Science and Technology.
