Chapter 1. Losing Control
So, here we are in the world of the cloud, with ever-expanding elements of our websites being placed in the hands of others.
Advantages to Giving Up Control
There are many positive aspects to making this move (after all, why else would so many people be doing it?), so before going into the negatives, let’s remind ourselves of some of the advantages of cloud-based systems:
- Quick and easy access to enterprise-level solutions
- For example, building your own geographically distributed SQL server cluster with real-time failover would take lots of hardware, high-quality connectivity between data centers, a high degree of expertise in databases and networking, and a reasonable amount of time and ongoing maintenance. Services such as Amazon RDS make this achievable within an hour, and at a reasonable hourly rate.
- Flexibility and the ability to experiment and evolve systems easily
- The ability to create and throw away systems means that you can make mistakes and learn from experience what the best setup for your system is. Rather than spending time and effort on capacity estimates to determine the hardware needed, you can simply try different sizes, settle on the best one, and then change the setup if you reach capacity, or even vary it at different times of day.
- Access to data you could never create yourself
- Third-party data sources do create risks, but they also enhance the attractiveness of your system by providing data that you otherwise wouldn’t be able to provide but that your users rely upon—either because that data is about a third-party system (e.g., Twitter feeds), or because it would not be economical (e.g., mapping data).
- Improved performance and resilience
While they are out of your control, most cloud-based systems have higher levels of resilience built in than you would build into an equivalent system. Likewise, though there are potential issues created by how CDNs route traffic, CDNs will usually offer performance improvements over systems that do not use them.
Cloud-based systems are also built for high performance and throughput and designed to scale out of the box. Many services will scale automatically and invisibly to you as the consumer, and others will scale at the click of a button or an API call.
- Access to systems run by specialists in the area—not generalists
- In-house or using a general data center, you may have a small team dedicated to a task, or more likely a team of generalists who have a degree of expertise across a range of areas. Bringing in a range of specialist cloud providers allows you to work with entire companies that are dedicated to expertise in specific areas, such as security, DNS, or geolocation.
Despite these advantages, it’s important to be aware of the inherent performance risks, especially in this era where good website performance is key to user satisfaction. The next sections cover important considerations for performance and outline key performance risks, following the journey that a user must travel in order to take advantage of your website.
1. The Last Mile
Before any user can access your website, they need to connect from their device to your servers. The first stage of this connection, between the user’s device and the Internet backbone, is known as the last mile. For a desktop user, this is usually the connection to their ISP, whether that be by DSL, cable, or even dial-up. For a mobile user, it’s the connection via their mobile network.
This section of the connection between user and server is the most inefficient and variable, and it will add latency onto any connection.
To illustrate this, in 2013 the FCC released research showing that a top-speed fiber connection would add 18ms of latency, and that was the best-case scenario: a DSL connection would add 44ms, and dial-up was considerably slower. For mobile users, the story was even worse: a 4G connection had a latency overhead of 600ms on new connections, a 3G connection had a latency of over 2s on new connections, and even existing open connections had latencies as high as 500ms.
This is a high-impact area of the delivery of any website, and it’s the one area where there is genuinely little to be done about the issue. Nevertheless, it’s important to be aware of the variations that are possible and actually being experienced, and to ensure that your website’s functionality is not affected by them.
- Unreliable delivery of content
- The variability in connection speed over the last mile means that it's hard to determine how fast content will be delivered to users. This presents many of the same challenges that we'll explore in the next section; here, they're often amplified by the constraints of the last mile.
2. Backbone Connectivity
Traditionally, this is seen as the path that the data from your website takes after it leaves your data center until it arrives at the end user’s machine. However, in the Internet age, backbone connectivity can be seen more as the means by which a user reaches your data—you have little control over how or from where the user is coming to you to request it.
Users are now accessing data from an expanding range of devices, via many different means of connectivity, and from an ever-widening range of locations.
To understand the performance challenges caused by unknown means of connectivity, you need to look at three key factors:
- Bandwidth is the amount of traffic that can physically pass through the hardware en route to the end destination. Bandwidth can usually be increased on demand from your ISP.
- Contention is the amount of other traffic that is sharing your connectivity. This will often vary greatly depending on the time of day. Like bandwidth, contention is something that can be minimized on demand from your ISP.
- Latency is based on the distance that the data has to travel to get from end to end and any other associated delays involved in establishing and maintaining a connection.
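These three factors can be combined into a rough back-of-the-envelope model of fetch time. The function and figures below are illustrative, not measurements:

```python
# Rough model of the time to fetch a single resource: latency dominates
# small transfers, bandwidth dominates large ones. Figures illustrative.

def fetch_time_ms(size_kb, bandwidth_mbps, rtt_ms, round_trips=2):
    """Estimate fetch time: connection setup and request round trips,
    plus raw transfer time at the given bandwidth."""
    transfer_ms = (size_kb * 8) / (bandwidth_mbps * 1000) * 1000
    return round_trips * rtt_ms + transfer_ms

# A 50 KB resource over a high-bandwidth but high-latency link...
fast_far = fetch_time_ms(50, 100, 160)   # -> 324.0 ms
# ...versus a link with a tenth of the bandwidth but low latency:
slow_near = fetch_time_ms(50, 10, 20)    # -> 80.0 ms
```

For a small resource, the link with ten times the bandwidth still loses, which previews the point of the next section: latency, not bandwidth, is often the limiting factor.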
Which Is the Biggest Challenge to Performance?
Bandwidth is often discussed as a limiting factor, but in many cases, latency is the killer—bandwidth can be scaled up, but latency is not as easy to address.
There is a theoretical minimum latency that will exist based on the physical distance between two places. Signals in optimally configured fiber connections travel at roughly two-thirds of the speed of light in a vacuum; in other words, they take approximately 1.5 times as long as light would take to cover the same distance. The speed of light is very fast, but there is still a measurable delay when transmitting over long distances. For example, the theoretical minimum round-trip time between New York and London is 56ms; between New York and Sydney, it's 160ms.
This means that to serve data to a user in Sydney from your servers in New York, 160ms will pass to establish a connection, and another 160ms will pass before the first byte of data is returned. That means that 320ms is the fastest possible time, even in optimal conditions, that a single byte of data could be returned. Of course, most requests will involve multiple round trips for data and multiple connections.
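The arithmetic above can be sketched as follows; the distances are approximate great-circle figures, and the 1.5x fiber factor is the one quoted earlier:

```python
# Theoretical minimum round-trip times over fiber, assuming signals
# travel at ~2/3 the speed of light in a vacuum (i.e., take ~1.5x as
# long as light would). Distances are approximate great-circle values.

SPEED_OF_LIGHT_KM_S = 299_792
FIBER_FACTOR = 1.5  # fiber is roughly 1.5x slower than light in vacuum

def min_rtt_ms(distance_km):
    one_way_s = distance_km / SPEED_OF_LIGHT_KM_S * FIBER_FACTOR
    return one_way_s * 2 * 1000  # round trip, in milliseconds

rtt_london = min_rtt_ms(5_570)   # New York -> London, ~56 ms
rtt_sydney = min_rtt_ms(16_000)  # New York -> Sydney, ~160 ms
```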
However, data often doesn’t travel by an optimal route.
The BGP (Border Gateway Protocol) that manages most of the routing on the Internet is designed to find optimal routes between any two points. Like all other protocols, though, it can be prone to misconfiguration, which sometimes results in the selection of less-than-optimal routes.
More commonly, such suboptimal routes are chosen due to the peering arrangements of your network provider. Peering determines which other networks a network will agree to forward traffic to. You should not assume that the Internet is a non-partisan place where data moves freely from system to system; the reality is that peering is a commercial arrangement, and companies choose their peers based on financial, competitive, and other less-idealistic reasons. The upshot for your system: the peering arrangements your hosting company has in place (and those of the companies it has arrangements with) can affect your system's performance.
When choosing a data center, you can get information about these arrangements; however, cloud providers are not so open. Therefore, it’s important to monitor what’s happening to determine the best cloud provider for your end users.
The variability of connectivity across the backbone really boils down to a single performance risk, but it’s a fundamental one that you need to be aware of when building any web-based system.
- Unreliable delivery of content
If you cannot control how data is being sent to a user, you cannot control the speed at which it arrives. This makes it very difficult to determine exactly how a website should be developed. For example:
- Can data be updated in real time?
- Can activity be triggered in response to user activity (e.g., predictive search)?
- Which functionality should be executed client side and which server side?
- Can functionality be consistent across platforms?
3. Servers and Data Center Infrastructure
Traditionally, when hosting in a data center, you can make an informed choice about all aspects of the hardware and infrastructure you use. You can work with the data center provider to build the hardware and the network infrastructure to your specific requirements, including the connectivity into your systems. You can influence or at least be aware of the types of hardware and networking being used, the peering relationships, the physical location of your hardware, and even its location within the building.
The construction of your platform is a process of building something to last, and once built, it should remain relatively static, with any changes being non-trivial operations.
The migration of many data centers to virtualized platforms started a process of migration from static to throwaway platforms. However, it was with the growth of cloud-based Infrastructure as a Service (IaaS) platforms that systems became completely throwaway. An extension to IaaS is Platform as a Service (PaaS), where, rather than having any access to the infrastructure at all, you simply pass some code into the system; a platform is created, and the code is deployed upon it, ready to run.
With these systems, all details of the underlying hardware and infrastructure are hidden from view, and you’re asked to put your trust in the cloud providers to do what is best. This way of working is practical and can be beneficial; cloud providers are managing infrastructure across many users and have a constant process of upgrading and improving the underlying technology. The only way they can coordinate rolling out the new technology is to make it non-optional (and therefore hidden from end users).
Loss of control over the data center creates two key performance risks.
- Loss of ability to fine-tune hardware/networking
Cloud providers will provide machines based on a set of generic sizes, and they usually keep the underlying architecture deliberately vague, using measurements such as “compute units” rather than specifying the exact hardware being used.
Likewise, network connectivity is expressed in generic terms such as small, medium, large, etc., rather than specifying the actual values so that the exact nature of the networking is out of your control.
All of this means that you cannot benchmark your application and then specify the exact hardware you want your application to run on. You cannot make operating system modifications to suit that exact hardware, because at any point, your servers may restart on different hardware configurations.
- No guarantee of consistency
Every time you reboot a machine it can potentially (and usually, actually) come back up on completely different hardware, so there is no guarantee that you’ll get consistent performance. This is due in part to varying hardware, and also to the potential for noisy neighbors—that is, other users sharing your infrastructure and consequently affecting the performance of your infrastructure. In practice, these inconsistencies are much rarer than they used to be.
Some cloud vendors will offer higher-priced alternatives that will guarantee that certain pieces of hardware will be dedicated for your use.
4. Third-Party SaaS Tools
While you lose control over the hardware and the infrastructure with IaaS, you still have access to the underlying operating system; however, in the world of the cloud, systems are increasingly dependent on higher-level Software as a Service (SaaS) systems that deliver functionality rather than a platform on which you can execute your own functionality.
All access is provided via an API, and you have absolutely no control over how the service is run or configured.
Examples in this section
For consistency and to illustrate the range of services offered by single providers, all examples of services in this section are provided by Amazon Web Services (AWS); other providers offer similar ranges of services.
These SaaS systems can provide a wide range of functionality, including database (Amazon RDS or DynamoDB), file storage (Amazon S3), message queuing (Amazon SQS), data analysis (Amazon EMR), email sending (Amazon SES), authentication (AWS Directory Service), data warehousing (Amazon Redshift), and many others.
There are even cloud-based services now that will provide shared code-execution platforms (such as AWS Lambda). These services trigger small pieces of code in response to a schedule or an event (e.g., an image upload or button click) and execute them in an environment outside your control.
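As a sketch of what such event-triggered code looks like, here is a minimal Python handler in the shape AWS Lambda expects for an S3 upload notification; the bucket and key names are illustrative, and the actual processing is elided:

```python
# Minimal shape of an AWS Lambda handler in Python, triggered by an
# S3 upload event. The event follows the documented S3 notification
# structure; the object processing itself is elided.

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Process the uploaded object here (e.g., generate a thumbnail).
        print(f"New object uploaded: s3://{bucket}/{key}")
    return {"status": "ok", "processed": len(event["Records"])}
```

Note that you never see, choose, or tune the machine this runs on; the platform decides where and when the handler executes.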
As you start to introduce third-party SaaS services, there are two key performance risks that you must be aware of.
- Complete failure or performance degradation
- Although one of the selling points of third-party SaaS systems is that they are built on much more resilient platforms than you could build and manage on your own, the fact remains that if they do go down or start to run slowly, there is nothing you can do about it—you are entirely in the hands of the provider to resolve the issue.
- Loss of data
- Though the data storage systems are designed to be resilient (and in general, they are), there have been examples in the past of cloud providers losing data due to hardware failures or software issues.
5. CDNs and Other Cloud-Based Systems
Many systems now sit behind remote cloud-based services, meaning that any requests made to your server are routed via these systems before hitting it.
The most common example of these systems is the CDN (content delivery network): a system that sits outside your infrastructure, handling traffic before it hits your servers to provide globally distributed caching of content.
CDNs are part of any best-practice setup for a high-usage website, providing higher-speed distribution of data as well as lowering the overhead on your servers.
The way they work is conceptually simple: when a user requests a resource from your system, DNS resolves to the point of presence within the CDN infrastructure that has the least latency and load. The user then makes the request to that server. If the server has a cached copy of the resource, it returns it; if it doesn't, or if its cached copy has expired, it requests a copy from your server and caches it for future requests.
If the CDN has a cached copy, then the latency for that request is much lower; if not, then the connection between the CDN and the origin server is optimized so that the longer-distance part of the request is completed faster than if the request was made directly by the end user.
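That caching decision can be sketched conceptually; this is a toy model, not any vendor's actual implementation:

```python
# Conceptual sketch of a CDN edge node's caching decision: serve from
# cache while a fresh copy exists, otherwise fetch from the origin and
# cache the result for future requests.

import time

class EdgeCache:
    def __init__(self, fetch_from_origin, ttl_seconds=300):
        self.fetch_from_origin = fetch_from_origin  # slow, long-distance
        self.ttl = ttl_seconds
        self.store = {}  # url -> (body, expires_at)

    def get(self, url):
        cached = self.store.get(url)
        if cached and cached[1] > time.time():
            return cached[0], "HIT"    # fast: served from the edge
        body = self.fetch_from_origin(url)
        self.store[url] = (body, time.time() + self.ttl)
        return body, "MISS"            # slow: had to go to the origin
```

The first request for a resource pays the origin round trip; repeat requests within the TTL are served entirely from the edge.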
There are many other examples of systems that can sit in front of yours, including:
- DDoS protection
- Protects your system from being affected by a DDoS (distributed denial of service) attack.
- Web application firewall
- Provides protection against some standard security exploits, such as cross-site scripting or SQL injection.
- Traffic queuing
- Protects your site from being overrun with traffic by queuing excess demand until space becomes available.
- Translation services
- Translate content into the language of the locale of the user.
It is not uncommon to find that requests have been routed via multiple cloud-based services between the user and your server.
There are a number of performance risks associated with moving your website behind cloud-based services.
- Complete failure or performance degradation
As with third-party SaaS tools, if a cloud system you rely on goes down, so will your system. Likewise, if that cloud system starts to run slowly, so will your system.
Such failures could be caused by hardware or infrastructure issues, or by issues associated with software releases (SaaS providers will usually release often and unannounced). They could also be caused by malicious third-party activity such as hacking or DoS attacks; SaaS systems can be high profile and therefore attractive targets for such attacks.
- Increased overhead
- Any additional processing will add to the overall time taken to serve a request. When adding a system in front of your own, you're not only adding the time taken for that service to execute the functionality it provides, but also adding to the number of network hops the data must make to complete its journey.
- Increased latency
- All services will add additional hops onto the route taken by the request. Some services offer geolocation so that users will be routed to a locally based service, but others do not. It’s not uncommon to hear of systems where requests are routed back and forth across the Atlantic several times between the user and the server as they pass through cloud providers offering different functionality.
6. Third-Party Components
Websites are increasingly dependent on being consumers of data or functionality provided by third-party systems.
Client-side systems will commonly display data from third parties as part of their core content. This can include:
- Data from third-party advertising systems (e.g., Google AdWords)
- Social media content (e.g., Twitter feeds or Facebook “like” counts)
- News feeds provided by RSS feeds
- Location mapping and directions (e.g., Google Maps)
- Unseen third-party calls, such as analytics, affiliate tracking tags, or monitoring tools
Server-side content will often retrieve external data and combine it with your data to create a mashup of multiple data sources. These can include freely available and commercial data sources; for example, combining your branch locations with mapping data to determine the nearest branch to the user’s location.
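A minimal sketch of such a mashup, assuming hypothetical branch data and a user position supplied by some geolocation service (the branch coordinates are invented for illustration):

```python
# Hypothetical server-side mashup: given a user's coordinates, find the
# nearest branch from your own data using the haversine great-circle
# distance. Branch data is invented for illustration.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def nearest_branch(user_lat, user_lon, branches):
    return min(branches,
               key=lambda b: haversine_km(user_lat, user_lon,
                                          b["lat"], b["lon"]))

branches = [
    {"name": "Midtown", "lat": 40.754, "lon": -73.984},
    {"name": "Brooklyn", "lat": 40.678, "lon": -73.944},
]
# A user near Times Square matches the Midtown branch:
print(nearest_branch(40.758, -73.985, branches)["name"])  # Midtown
```

In a real system, the user's position and the map tiles shown alongside the result would both come from third-party services, which is exactly where the dependency risk enters.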
Dependence on these third-party components can create the following performance risks.
- Complete failure or inconsistent performance
- If your system depends on third-party data and that third party becomes unavailable, your system could fail completely. Likewise, poor performance by the third party can have a domino effect on your system’s performance.
- Unexpected results
- Third parties can sometimes change the data they return or the way their data feeds work, resulting in errors when you make requests or when the requests return unexpected data.
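One common mitigation for both risks can be sketched as follows; `fetch_with_fallback` is a hypothetical helper, not a library function, and the "like count" widget is an invented example:

```python
# Defensive pattern for third-party calls (sketch): enforce a timeout,
# validate the shape of the response, and fall back to cached or default
# data rather than letting a third-party failure take your page down.

def fetch_with_fallback(fetch, validate, fallback, timeout_s=2.0):
    """fetch: callable taking a timeout; validate: checks the response
    shape; fallback: value to serve if anything goes wrong."""
    try:
        data = fetch(timeout=timeout_s)
    except Exception:
        return fallback           # third party down or too slow
    if not validate(data):
        return fallback           # third party changed its data format
    return data

# Example: a hypothetical "like count" widget degrades gracefully:
count = fetch_with_fallback(
    fetch=lambda timeout: {"likes": 42},
    validate=lambda d: isinstance(d.get("likes"), int),
    fallback={"likes": None},
)
```

The key design choice is that the validation step guards against the second risk above: a feed that starts returning well-formed but differently shaped data is treated the same as a feed that is down.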