Building Scalable Web Sites
Building, scaling, and optimizing the next generation of web applications

By Cal Henderson

Table of Contents

Chapter 1: Introduction
Before we dive into any design or coding work, we need to step back and define our terms. What is it we're trying to do and how does it differ from what we've done before? If you've already built some web applications, you're welcome to skip ahead to the next chapter (where we'll start to get a bit nerdier), but if you're interested in getting some general context then keep on reading.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is a Web Application?
If you're reading this book, you probably have a good idea of what a web application is, but it's worth defining our terms because the label has been routinely misapplied. A web application is neither a web site nor an application in the usual desktop-ian sense. A web application sits somewhere between the two, with elements of both.
While a web site contains pages of data, a web application is comprised of data with a separate delivery mechanism. While web accessibility enthusiasts get excited about the separation of markup and style with CSS, web application designers get excited about real data separation: the data in a web application doesn't have to have anything to do with markup (although it can contain markup). We store the messages that comprise the discussion component of a web application separately from the markup. When the time comes to display data to the user, we extract the messages from our data store (typically a database) and deliver the data to the user in some format over some medium (typically HTML over HTTP). The important part is that we don't have to deliver the data using HTML; we could just as easily deliver it as a PDF by email.
Web applications don't have pages in the same way web sites do. While a web application may appear to have 10 pages, adding more data to the data store increases the page count without our having to add further markup or source code to our application. With a feature such as search, which is driven by user input, a web application can have a near infinite number of "pages," but we don't have to enter each of these as a blob of HTML. A small set of templates and logic allows us to generate pages on the fly based on input parameters such as URL or POST data.
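To make the separation of data and delivery concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the in-memory `MESSAGES` list stands in for a real data store, and `search`, `render_html`, and `render_text` are hypothetical names rather than anything from the book. It shows one small template producing a "page" for any search term, and the same data being delivered in a second format.

```python
from html import escape

# Hypothetical data store: in a real application this would be a database.
MESSAGES = [
    {"id": 1, "author": "alice", "body": "Trifle is a dessert."},
    {"id": 2, "author": "bob", "body": "Layers <matter>."},
    {"id": 3, "author": "alice", "body": "So does scale."},
]

def search(term):
    """Return every stored message whose body contains the term."""
    term = term.lower()
    return [m for m in MESSAGES if term in m["body"].lower()]

def render_html(messages):
    """One template serves any result set; the stored data carries no markup."""
    items = "".join(
        f"<li><b>{escape(m['author'])}</b>: {escape(m['body'])}</li>"
        for m in messages
    )
    return f"<ul>{items}</ul>"

def render_text(messages):
    """The same data delivered in a different format (plain text)."""
    return "\n".join(f"{m['author']}: {m['body']}" for m in messages)

print(render_html(search("dessert")))
print(render_text(search("scale")))
```

Adding a fourth message to `MESSAGES` changes what every search "page" can show without touching either rendering function, which is the property the text describes.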
To the average user, a web application can be indistinguishable from a web site. For a simple weblog, we can't tell by looking at the outputted markup whether the pages are being generated on the fly from a data store or written as static HTML documents. The file extension can give us a clue, but can be faked for good reason in either direction. A web application tends to appear to be an application only to those users who edit the application's data. This is often, although not always, accomplished via an HTML interface, but could just as easily be achieved using a desktop application that edits the data store directly or remotely.
How Do You Build Web Applications?
To build a web application, we need to create at least two major components: a hardware platform and a software platform. For small, simple applications, a hardware platform may comprise a single shared server running a web server and a database. At small scales we don't need to think about hardware as a component of our applications, but as we start to scale out, it becomes a more and more important part of the overall design. In this book we'll look extensively at both sides of application design and engineering, how they affect each other, and how we can tie the two together to create an effective architecture.
Developers who have worked at the small scale might be asking themselves why we need to bother with "platform design" when we could just use some kind of out-of-the-box solution. For small-scale applications, this can be a great idea. We save time and money up front and get a working and serviceable application. The problem comes at larger scales: there are no off-the-shelf kits that will allow you to build something like Amazon or Friendster. While building similar functionality might be fairly trivial, making that functionality work for millions of products and millions of users, without spending far too much on hardware, requires us to build something highly customized and optimized for our exact needs. There's a good reason why the largest applications on the Internet are all bespoke creations: no other approach can create massively scalable applications within a reasonable budget.
We've already said that at the core of web applications we have some set of data that can be accessed and perhaps modified. Within the software element of an application, we need to decide how we store that data (a schema), how we access and modify it (business logic), and how we present it to our users (interaction logic). In Chapter 2 we'll be looking at these different components, how they interact, and what comprises them. A good application design works down from the very top, defining software and hardware architecture, the components that comprise your platform, and the functionality implemented by those layers.
What Is Architecture?
We like to talk about architecting applications, but what does that really mean? When an architect designs a house, he has a fairly well-defined task: gather requirements, explore the options, and produce a blueprint. When the builders turn that blueprint into a building, we expect a few things: the building should stay standing, keep the rain and wind out, and let enough light in. Sorry to shatter the illusion, but architecting applications is not much like this.
For a start, if buildings were like software, the architect would be involved in the actual building process, from laying the foundations right through to installing the fixtures. When he designed and built the house, he would start with a couple of rooms and some basic amenities, and some people would then come and start living there before the building was complete. When it looked like the building work was about to finish, a whole bunch more people would turn up and start living there, too. But these new residents would need new features: more bedrooms to sleep in, a swimming pool, a basement, and on and on. The architect would design these new rooms and features, augmenting his original design. But when the time came to build them, the current residents wouldn't leave. They'd continue living in the house even while it was extended, all the time complaining about the noise and dust from the building work. In fact, against all reason, more people would move in while the extensions were being built. By the time the modifications were complete, more would be needed to house the newcomers and keep them happy.
The key to good application architecture is planning for these issues from the beginning. If the architect of our mythical house started out by building a huge, complex house, it would be overkill. By the time it was ready, the residents would have gone elsewhere to live in a smaller house built in a fraction of the time. If we build in such a way that extending our house takes too long, then our residents might move elsewhere. We need to know how to start at the right scale and allow our house to be extended as painlessly as possible.
How Do I Get Started?
To get started designing and building your first large-scale web application, you'll need four things. First, you'll need an idea. This is typically the hardest thing to come up with and not traditionally the role of engineers ;). While the techniques and technologies in this book can be applied to small projects, they are optimal for larger projects involving multiple developers and heavy usage. If you have an application that hasn't been launched or is small and needs scaling, then you've already done the hardest part and you can start designing for the large scale. If you already have a large-scale application, it's still a good idea to work your way through the book from front to back to check that you've covered your bases.
Once you have an idea of what you want to build, you'll need to find some people to build it. While small and medium applications are buildable by a single engineer, larger applications tend to need larger teams. As of December 2005, Flickr has over 100,000 lines of source code, 50,000 lines of template code, and 10,000 lines of JavaScript. This is too much code for a single engineer to maintain, so down the road, responsibility for different areas of the application needs to be delegated to different people. We'll look at some techniques for managing development with multiple developers in Chapter 3. To build an application with any size team, you'll need a development environment and a staging environment (assuming you actually want to release it). We'll talk more about development and staging environments as well as the accompanying build tools in Chapter 3, but at a basic level, you'll need a machine running your web server and database server software.
The most important thing you need is a method of discussing and recording the development process. Detailed spec documents can be tedious overkill, but not writing anything down can be similarly catastrophic. For very small teams, a good pad of paper can suffice, as can a good whiteboard (which you can then photograph to keep a persistent copy of your work). If you find you can't tear yourself away from a computer long enough to grasp a pen, a Wiki can fulfill a similar role. For larger teams, a Wiki is a good way to organize development specifications and notes, allowing all your developers to add and edit content and to see the work of others.
Chapter 2: Web Application Architecture
So you're ready to start coding. Crack open a text editor and follow along. . . .
Actually, hold on for a moment. Before we even get near a terminal, we're going to want to think about the general architecture of our application and do a fair bit of planning. So put away your PowerBook, find a big whiteboard and some markers, order some pizza, and get your engineers together.
In this chapter, we'll look at some general software design principles for web applications and how they apply to real world problems. We'll also take a look at the design, planning, and management of hardware platforms for web applications and the role they play in the design and development of software. By the end of this chapter, we should be ready to start getting our environment together and writing some code. But before we get ahead of ourselves, let me tell you a story. . . .
Layered Software Architecture
A good web application should look like a trifle, shown in Figure 2-1.
Figure 2-1: A well-layered trifle (photo by minky sue: http://flickr.com/photos/kukeit/8295137)
Bear with me here, because it gets worse before it gets better. It's important to note that I mean English trifle and not Canadian—there is only one layer of each kind. This will become clear shortly. If you have no idea what trifle is, then this will still make sense—just remember it's a dessert with layers.
At the bottom of our trifle, we have the solid layer of sponge. Everything else sits on top of the sponge. It wouldn't be trifle without it. The sponge supports everything above it and forms the real inner core of the dessert. The sponge is big, solid, and reliable. Everything above it is transient and whimsical.
In web applications, persistent storage is the sponge. The storage might be manifested as files on disk or records in a database, but it represents our most important asset—data. Before we can access, manipulate, or display our data, it has to have a place to reside. The data we store underpins the rest of the application.
Sitting on top of the sponge is the all-important layer of jelly (Jell-O, to our North American readers). While every trifle has the same layer of sponge—an important foundation but essentially the same thing everywhere—the personality of the trifle is defined by the jelly. Users/diners only interact/eat the sponge with the jelly. The jelly is the main distinguishing feature of our trifle's uniqueness and our sole access to the supporting sponge below. Together with the sponge, the jelly defines all that the trifle really is. Anything we add on top is about interaction and appearance.
In a web application, the jelly is represented by our business logic. The business logic defines what's different and unique about our application. The way we access and manipulate data defines the behavior of our system and the rules that govern it. The only way we access our data is through our business logic. If we added nothing but a persistent store and some business logic, we would still have the functioning soul of an application—it's just that nobody would want to eat/use it. In days of old, business logic was written in C or COBOL (seriously). These days, only the big (where performance
Additional content appearing in this section has been removed.
Layered Technologies
Let's start at the top, since it's the simplest there. If we're talking about web pages, our presentation layer is going to consist of CSS. We could also go for <font> tags with color attributes, but favor is definitely turning from that sort of thing. Besides, to keep our layers nicely separated, we'll want to use something that allows us to keep the presentation separate from the markup. CSS fits this role perfectly.
Under the presentation lies the markup. For web-based markup we have a pair of main choices and some options under each of those. We'll either be serving HTML or XHTML, with the various versions available of each. While the in-crowd might make a big thing of XHTML and standards-compliance, it's worth remembering that you can be standards-compliant while using HTML 4. It's just a different standard. As far as separating our markup from the logic layer below it, we have a couple of workable routes: templating and plain old code segregation. Keeping your code separate is a good first step if you're coming from a background of mixed code and markup. By putting all display logic into separate files and include()ing those into your logic, you can keep the two separate while still allowing the use of a powerful language in your markup sections. On this route, it can be all too easy to start merging the two unless you stay fairly rigorous and aware of the dangers. All too often, developers remember to keep display logic in separate files, but application logic starts to sneak into those files. For effective separation, the divide has to be maintained in both directions.
There are a number of downsides to code separation: as described, it's easy to fall prey to crossing the line in both directions—but a little rigor can help that. The real issue comes when a team has different developers working on the logic and markup. By using a templating system, you enforce the separation of logic and markup, require the needed data to be explicitly exported into the template's scope, and remove complex syntax from the markup. The explicit importing of data into templates forces application designers to consider what data needs to be presented to the markup layer, but also protects the logic layer from breaking the markup layer accidentally. If the logic layer has to explicitly name the data it's exporting to the templates, then template developers can't use data that the logic developer didn't intend to expose. This means that the logic developer can rewrite her layer in any way she sees fit, as long as she maintains the exported interface, without worry of breaking other layers. This is a key principle in the layered separation of software, and we'll be considering it further in the next section.
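The explicit export described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the templating system the book goes on to use: it borrows the standard library's `string.Template`, and `PROFILE_TEMPLATE`, `load_user`, `profile_page`, and all field names are hypothetical.

```python
from string import Template

# Hypothetical template owned by the markup developers. It can only refer to
# names that the logic layer chooses to export.
PROFILE_TEMPLATE = Template("<h1>$display_name</h1><p>$photo_count photos</p>")

def load_user(user_id):
    """Stand-in for the logic layer's data access; the fields are made up."""
    return {"id": user_id, "name": "cal", "password_hash": "x" * 8, "photos": 42}

def profile_page(user_id):
    """Logic layer: fetch the data, then export an explicit set of values."""
    user = load_user(user_id)
    exported = {
        "display_name": user["name"].title(),
        "photo_count": user["photos"],
    }
    # substitute() raises KeyError if the template uses an unexported name.
    return PROFILE_TEMPLATE.substitute(exported)

print(profile_page(1))
```

Because the template only ever sees the `exported` dictionary, `load_user` can be rewritten freely, and a template that reaches for `$password_hash` fails loudly instead of leaking a field the logic developer never meant to expose.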
Software Interface Design
Separating the layers of our software means a little additional work designing the interfaces between these layers. Where we previously had one big lump of code, we'll now have three distinct lumps (business logic, interaction logic, and markup), each of which needs to talk to the next. But have we really added any work for ourselves? The answer is probably not: we were already doing this when we had a single code layer, only the task of logic talking to markup wasn't an explicit segment, but rather intermeshed within the rest of the code.
Why bother separating the layers at all? The previous application style of sticking everything together worked, at least in some regards. However, there are several compelling reasons for layered separation, each of which becomes more important as your application grows in size. Separation of layers allows different engineers or engineering teams to work on different layers simultaneously without stepping on each other's toes. In addition to having physically separate files to work with, the teams don't need an intimate knowledge of the layers outside of their own. People working on markup don't need to understand how the data is sucked out of the data store and presented to the templating system, but only how to use that data once it's been presented to them. Similarly, an engineer working on interaction logic doesn't need to understand the application logic behind getting and setting a piece of data, only the function calls he needs to perform the task. In each of these cases, the only elements the engineers need concern themselves with are the contents of their own layer, and the interfaces to the layers above and below.
What are the interfaces of which we speak? When we talk about interfaces between software layers, we don't mean interfaces in the Java object-oriented sense. An interface in this case describes the set of features allowing one layer to exchange requests and responses with another. For the data and application logic layers, the interface would include storing and fetching raw data. For the interaction logic and application logic layers, they include modifying a particular kind of resource—the interface only defines how one layer asks another to perform a task, not how that task is performed.
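One way to picture these interfaces is the following Python sketch. The class and function names (`PhotoStore`, `PhotoLogic`, `handle_tag_request`) and the tagging rule are assumptions invented for illustration. Each layer calls only the layer directly beneath it, and each call says what should happen, not how.

```python
import copy

class PhotoStore:
    """Data layer: stores and fetches raw records, nothing more."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = copy.deepcopy(row)

    def get(self, key):
        return copy.deepcopy(self._rows[key])


class PhotoLogic:
    """Application logic: the only code that talks to the store."""

    def __init__(self, store):
        self._store = store

    def add_photo(self, photo_id, title):
        self._store.put(photo_id, {"title": title.strip(), "tags": []})

    def add_tag(self, photo_id, tag):
        row = self._store.get(photo_id)
        tag = tag.lower()
        if tag not in row["tags"]:  # the rule lives here, not in the caller
            row["tags"].append(tag)
        self._store.put(photo_id, row)

    def get_tags(self, photo_id):
        return self._store.get(photo_id)["tags"]


def handle_tag_request(logic, params):
    """Interaction logic: turns request parameters into logic-layer calls."""
    photo_id = int(params["photo"])
    logic.add_tag(photo_id, params["tag"])
    return ", ".join(logic.get_tags(photo_id))


logic = PhotoLogic(PhotoStore())
logic.add_photo(1, " Trifle ")
handle_tag_request(logic, {"photo": "1", "tag": "Dessert"})
print(handle_tag_request(logic, {"photo": "1", "tag": "dessert"}))
```

In this sketch, `PhotoStore` could be swapped for a database-backed class with the same `put` and `get` methods without any change to `handle_tag_request`, because the interaction logic never sees how the task is performed.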
Getting from A to B
While it's important to build large applications on good foundations, and equally important to avoid building small applications on skyscraper-scale foundations, we need some way to transition from one scale to another. We may already have an existing application of the "one giant function" variety, or we might be building a prototype to later scale into a large application. When building prototypes, we might skip the disciplined architectural design to get the product working as soon as we can. Once we start to scale these small applications and prototypes out, we need to transition from small to large foundations and impose some structure.
The first step in this process is usually to separate the presentation layer out, moving inline markup into separate template files. This process in itself can be further split down into three distinct tasks, which can be performed independently, avoiding a single large change. This approach allows you to continue main development as you split templates into their own layers, avoiding a developmental halt. The three steps are fairly straightforward:
Separate logic code from markup code
The first step is just a case of splitting the code that generates HTML markup into some new files and having the page-driving files include them at runtime. By the end of this step, logic and markup generation will live in different sets of files, although markup for multiple pages may be generated by a single file.
Split markup code into one file per page
In anticipation of switching to a templating system, we'll want to split out each page or reusable page component into its own file. At this point, we're still using regular old source code to generate the markup, but each logical markup segment (page or block) has a file of its own.
Additional content appearing in this section has been removed.
The Software/Hardware Divide
The role of software engineer is typically as the title suggests—engineering software. When building a desktop application or writing a mainframe system, you're more or less stuck with the hardware you have. For modern web applications, designing an application goes beyond the realm of simply designing and writing code. Hardware starts to come into play.
It's probably a mistake to think about hardware in too much isolation from the software you design, leaving the nuts and bolts of it to system administrators or site operations staff. From the start of your application design, you'll want to work closely with the person managing your hardware, or even take on that role yourself.
With that said, the level at which you get involved with the hardware side of things can vary greatly. As a software architect, you won't really need to decide which RAID card your file servers use (beyond checking that it has a battery backup, which we'll talk about in Chapter 8) or which particular network interface cards you're using (beyond the speed). In the rest of this chapter, we'll look at some general issues surrounding hardware platforms for web applications so that we can at least have a working knowledge of some of the issues involved, even if we avoid taking part ourselves.
Hardware Platforms
For large-scale web applications, software comprises an important but not complete piece of the puzzle. Hardware can be as significant as software, in the design as well as the implementation stages. The general architecture of a large application needs to be designed both in terms of the software components and the hardware platform they run on. The hardware platform, at least initially, tends to form a large portion of the overall cost of deploying a web application. The cost for software development, comprised of ongoing developer payroll, is usually bigger in the end, but hardware costs come early and all at once. Thus it's important to think carefully about designing your hardware platform in order to be in a position where initial cost is low and the track for expansion is clearly defined.
Donald Knuth said it best, in a quote that we'll be revisiting periodically:
We should forget about small efficiencies, about 97 percent of the time. Premature optimization is the root of all evil.
This applies directly to software development, but also works well applied as a rule for hardware platform design and the software process in general. By starting small and general, we can avoid wasting time on work that will ultimately be thrown away.
Out of this principle come a few good rules of thumb for initial design of your hardware platform:
Buy commodity hardware
Unless you've built a very similar application of the same scale before, buying commodity hardware, at least initially, is almost always a good idea. By buying off-the-shelf servers and networking gear, you'll reduce cost and maximize repurposability. If your application fails before it takes off, you've wasted less money. If your application does well, you've spent less money upfront and have more for expansion. Overestimating hardware needs for startup applications can dry up a lot of money that would have otherwise been available for more pressing causes, such as paying staff and expanding your hardware platform when needed.
Hardware Platform Growth
As if the problems involved with designing and implementing a hardware platform for a large application weren't already enough, growing your implemented platform from one scale to another brings a whole new set of issues to the table. The hardware platform for a large-scale application usually looks significantly different than its small-scale siblings. If a small application or prototype supporting 100,000 users is based on a single dedicated server, then assuming a linear scaling model (which is not often the case), the same application with a 10 million-person user base will require 100 machines. Managing a server platform with 100 boxes requires far more planning than a single box and adds some additional requirements.
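The arithmetic behind that estimate is simple enough to write down. The sketch below assumes the same linear model the text uses (and warns is often not the case); the function name `machines_needed` is hypothetical.

```python
def machines_needed(users, users_per_machine):
    """Servers required if capacity scales linearly with user count.

    Uses integer ceiling division so a partly used machine still counts.
    """
    return -(-users // users_per_machine)

# One server supports 100,000 users; how many for 10 million?
print(machines_needed(10_000_000, 100_000))
```

Real platforms rarely scale this cleanly, so a figure like this is better treated as a starting point for planning than as a prediction.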
The responsibility for specifying and purchasing hardware, especially on a small team, often falls within the domain of the architectural or engineering lead. In larger operations, these tasks fall into the realm of an operations manager, but having a dedicated coordinator for smaller platforms would be overkill.
When making the jump from a pure engineering role to hardware platform management responsibilities, there are a number of factors to consider that aren't immediately obvious. We'll look at a few of these in turn and try and cover the main pain points involved with organizing an initial build-out.
When choosing a hardware vendor, in addition to taking into account the specification and cost of the hardware you order, it's important to find out how difficult ordering more of the same will be. If you're planning to rely on a single type of hardware for some task (for instance, a specific RAID controller), then it's important to find out up front how easy it's going to be to order more of them. This includes finding out if your supplier keeps a stock of the component or, if not, how long the lead time from the manufacturer is. You don't want to get caught in a situation where you have to wait three months for parts to be manufactured, delaying your ability to build out.
Hardware Redundancy
Business continuity planning (BCP) lies at the heart of redundancy planning. BCP is a methodology for managing risk from a partial or complete interruption of critical services. Applied to web applications, this covers the continuity of business in the case of software and hardware malfunctions, attacks, and disasters. Most of the technical jargon can be ignored at the small scale, but BCP basically means having a solid plan for disaster recovery.
The various levels of BCP apply to the various grades of catastrophe that could occur. Being prepared to deal with a single hard disk failing is very basic, while redundant networking equipment falls into a middle tier. At the highest level of BCP compliance, a business will choose to host critical applications in multiple data centers (DCs) on multiple continents. While this reduces latency for international users, more importantly, the service can continue operating even if a whole DC is lost, and such things do happen from time to time.
For applications where dual DC failover is out of the question, a fairly acceptable level of redundancy is to have at least one spare of everything, or more where necessary (having one spare disk for a platform with over one hundred disks in use is, for instance, woefully inadequate). It's also very important to bear in mind that absolutely anything can fail, and eventually, everything will fail. This includes the usual suspects, such as hard disks, all the way through to components that are thought of as immutable: power cables, network cables, network switches, power supplies, processors, RAM, routers, and even rack posts—anything at all.
We'll be talking more about redundancy from a design point of view, rather than just in terms of raw hardware, when we cover scaling in Chapter 9.
Networking
The classic seven-layer OSI model has lately been replaced by the easier-to-understand four-layer TCP/IP model. To understand what role the network plays in the architecture of our application, we need to understand these four layers and how they interact with one another. The four layers form a simple stack, shown in Figure 2-2.
Figure 2-2: The TCP/IP stack (photo by minky sue: http://flickr.com/photos/kukeit/8295137)
The bottom layer represents both the physical medium over which the signal travels (radio waves, pulses in a twisted pair cable, etc.) and the format used for encoding data in the medium (such as Ethernet frames or ATM cells). At the Internetworking layer, we start to deliver messages between different networks, where frames or cells get wrapped up into the familiar IP packets. On top of the Internetwork sits the transport, which is responsible for ensuring messages get from one end of the connection to the other reliably (as with TCP) or not (as with UDP). The final layer is where all the application-level magic starts to happen, using the preceding layers to create a meaningful experience. Each layer sits on top of the layers below it, with data conceptually flowing only from one layer to the next. In a simple example where two computers are on the same network, a message being passed looks something like Figure 2-3.
A message starts at the top of the stack, an HTTP request in this example, and descends through the layers, being gradually wrapped in larger and larger encapsulations (the IP packet encapsulates the HTTP request and the Ethernet frame encapsulates the IP packet). When the message hits the bottom of the stack, it moves over the wire as electrical, light, microwave, or similar signals, until it hits the bottom of the next stack. At this point, the message travels up the stack, being unwrapped at each layer. When the message reaches the top of the second stack, it's presented as an HTTP request. The key to this architecture is that the layers don't need to know what's below them. You can perform an HTTP request without caring about how IP works. You can create an IP packet without worrying how that packet will be sent as electrical signals in a copper cable.
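The wrapping and unwrapping described above can be sketched in a few lines of Python. This is only a toy model: the text labels stand in for real binary headers, the transport layer is left out to match the simplified example in the text, and the names send_down and receive_up are invented for illustration.

```python
# Illustrative only: real headers are binary structures, not text labels.
LAYERS = ["IP", "Ethernet"]  # wrappers added on the way down the stack, in order


def send_down(payload: bytes) -> bytes:
    """Wrap an application message in one envelope per lower layer."""
    for layer in LAYERS:
        payload = f"[{layer}]".encode() + payload + f"[/{layer}]".encode()
    return payload


def receive_up(frame: bytes) -> bytes:
    """Strip the envelopes in reverse order, outermost first."""
    for layer in reversed(LAYERS):
        head, tail = f"[{layer}]".encode(), f"[/{layer}]".encode()
        if not (frame.startswith(head) and frame.endswith(tail)):
            raise ValueError(f"not a valid {layer} envelope")
        frame = frame[len(head):len(frame) - len(tail)]
    return frame


# The application only ever sees its own message, never the envelopes.
request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
on_the_wire = send_down(request)
assert receive_up(on_the_wire) == request
```

Neither function needs to understand the payload it is handed, which is the same property that lets an HTTP client ignore how IP works.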
Languages, Technologies, and Databases
If you've gotten this far, you probably already have a good idea of what technologies you're going to build your application with. Using the right tools for the job is important; I'm not going to recommend that you start using any particular technology over another. While this book covers some specific technologies, the general lessons and advice can be applied to all development models, from the open source LAMP architecture, right through to the full Microsoft application stack.
It's always a good idea to use base technologies that have already been proven to work well together. Although using the latest trendy language or framework might be all the rage, stacks that have been proven to work at large scales are going to save you time and effort. Being on the bleeding edge for every element of your application can soon become tiring. You don't want to get yourself into the position of having to wonder if the web server is at fault every time you hit a bug. The LAMP stack has been used for many large-scale applications over the last few years and is a stable and well-understood platform to build on.
The examples in this book focus mainly on PHP, with Perl alternatives shown where appropriate. When we talk about web servers, we generally mean Apache, although the underlying operating system is irrelevant. For the database portion of this book, we'll be focusing quite heavily on MySQL 4 (specifically the InnoDB storage engine), with lots of MySQL-specific advice and ideas. Whether you're planning to base your architecture on PostgreSQL, Oracle Grid, or SQL Server, it's still a good idea to read the database sections because much of the advice transfers well.
Chapter 3: Development Environments
Before you sit down and start writing your world-changing application, there are several things you're going to need to consider and plan for. Working on a team application with a large code base is a very different challenge to creating small personal web applications. How do you coordinate between multiple developers? How do you work on the same code at the same time? How do you keep track of what you're doing and what needs to be done? How do you make sure your site doesn't appear broken to users while you're working on it?
In this chapter, we'll look at each of these questions and try to answer them. Our solutions can then be brought together to create a development environment in which you can work with a team and make tangible progress, avoiding some of the common mistakes developers make when moving from small to large projects.
The Three Rules
Everybody has their own favorite rules and guidelines that they absolutely must follow to develop any kind of large-scale application. Depending on the particular brand of development methodology you happen to be following, some of the global set of "rules" may actually apply. But in the field of large-scale web applications, there are three rules that crop up again and again in successful development teams. Perhaps "rule" is too strong a term and "guideline" would be more apt—you can certainly ignore one or all of them and still turn out a working product. These three simple rules are designed to help you avoid common pitfalls when moving from the small- to large-scale applications, and to get you there faster:
  • Use source control.
  • Have a one-step build.
  • Track your bugs.
And we'll deal with each of them in turn.
Use Source Control
The first rule is hopefully the most obvious—all development teams and even individual developers should be using source control for all of their work. Unfortunately, this is often not the case, and the repercussions can be fairly dire.
This first rule is by far the most important. It's the key to creating a solid development environment. If you're already working on a web application that doesn't use source control, now is the time to halt development and get it into place. The importance of source control cannot be emphasized enough.
If you've not encountered source control (often called software configuration management or SCM) in your work before, you're going to kick yourself for missing out on it for so long. It could be summarized as "the ability to undo your mistakes."
If your code-editing software didn't have an undo function, you'd soon notice. It's one of the most basic features we expect. But typically, when you close a file you lose its undo history and are left with only the final saved state of the document. The exception to this rule is with "fat" file formats, such as Corel's WordPerfect, which save the undo history (or at least a segment of it) right into the file.
When you're dealing with source code files, you don't have the ability to store arbitrary data segments into the file, as with WordPerfect documents, and can only save the final data. Besides, who would want source files that kept growing and growing? Source control, at its most basic, allows you to keep a full undo history for a file, without storing it in the file itself. This feature, called versioning, isn't all that source control is good for, and we'll look at the main aspects in turn.
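To make the idea of an undo history kept outside the file concrete, here is a minimal Python sketch. It is not how any particular source-control product works: real systems usually store differences rather than whole copies and record who changed what and when. The VersionedFile class and its method names are invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VersionedFile:
    """Keeps every committed revision of one file, outside the file itself."""
    revisions: List[str] = field(default_factory=list)

    def commit(self, contents: str) -> int:
        """Store a new revision and return its 1-based revision number."""
        self.revisions.append(contents)
        return len(self.revisions)

    def checkout(self, revision: Optional[int] = None) -> str:
        """Return a given revision, or the latest one if none is specified."""
        if not self.revisions:
            raise LookupError("nothing has been committed yet")
        if revision is None:
            revision = len(self.revisions)
        if not 1 <= revision <= len(self.revisions):
            raise LookupError(f"no such revision: {revision}")
        return self.revisions[revision - 1]


history = VersionedFile()
history.commit("print('hello')")
history.commit("print('hello, world')")
previous = history.checkout(1)  # undo: fetch the earlier version
```

The working file on disk stays exactly as large as its current contents; everything older lives in the history.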

Section 3.2.1.1: Versioning

Versioning, the most basic feature of a source-control system, describes the ability to store many versions of the same source file. The typical usage sequence, once a file is established in source control, is as follows. A user
One-Step Build
As your web application progresses, you move from working on a single live version to two or more copies, in order to avoid breaking your production site while doing development work. At this point, a process arises for getting code into production. This process, referred to variously as building, releasing, deploying, or pushing (although these terms mean subtly different things), is something you'll perform over and over again.
As time goes on, this process also typically becomes more complex and involved. What was once a manual copy process grows to involve changing files, restarting services, reloading configuration files, purging caches, and so on.
The time at which you release a new feature or piece of code is often the time at which you're under the most pressure—you've been working like crazy to get it done and tested and want to push it out to your adoring public as soon as you can. A complex and arcane build process—in which making mistakes or missing steps causes damage to your application—invites disaster.
A good rule of thumb for each stage of the build process is to have a single button that performs all of the needed actions. As time goes by and your process becomes more complicated, the action required to perform these steps remains a simple button push. We'll explore that evolution now.
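As a rough illustration of the single-button idea, the Python sketch below runs a list of named steps in order and stops at the first failure. The step names echo the tasks mentioned earlier, but the steps themselves are placeholders and run_release is a hypothetical helper, not part of any real deployment tool.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_release(steps: List[Step]) -> List[str]:
    """Run every step in order, stopping at the first failure.

    Returns the names of the steps that completed.
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(f"release aborted at step '{name}': {exc}") from exc
        completed.append(name)
    return completed


# Hypothetical placeholder steps; real ones would copy files, restart services, etc.
log = []
steps = [
    ("copy files", lambda: log.append("copied")),
    ("reload configuration", lambda: log.append("reloaded")),
    ("purge caches", lambda: log.append("purged")),
]
completed = run_release(steps)
```

Adding a new step later means appending to the list; the person doing the release still performs exactly one action.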
When you first start development on a web application, it's probably installed on your desktop or laptop machine. You can edit the source files directly and see the results in your browser. It's a short hop from there to putting the application onto a shared production server and pointing the assembled masses at it. Hopefully, by this point your code is in source control, so each change you make is tracked and reversible.
The easiest way to make changes at this stage is to edit files directly. This might mean opening up a shell on your web server and using a command-line editor, mounting your web server's disk over NFS or Samba and using a desktop editor, or modifying files on your personal machine and copying them to the production server. Editing files this way is all very well, and often a sensible choice when you first start development and alpha release. You can very quickly see what you're doing, and your work is immediately available to your users.
Issue Tracking
Issue tracking, often called bug tracking (and sometimes request tracking), is the process of keeping track of your open development issues. Bug tracking is a misleading term in many ways and obviously depends on your definition of bug. "Issue" is a broad enough term to describe most of the kinds of tasks you might need to track when developing a web application, and so drives our choice of terminology here.
A good issue-tracking system is the key to smooth development of any large application. As time goes by and you pass the point where everything about the application easily fits into the head of one developer, you're going to need to keep track of everything that needs to be done. For a small application or a very small number of developers, keeping this sort of information on Post-it notes cluttered around your monitor is usually the natural way to handle keeping track of minor tasks (I'd be lying if I said my desk wasn't covered in scribbled notes). When you start to have more than 20 or so notes, or more than a couple of developers, this approach quickly breaks down. There's no way to get a good tactical view of everything that needs doing, swap tasks between developers, assign priorities, and track milestones.
Bringing in software to fill this role is a fairly logical step, and has clearly occurred to people before—hence the proliferation of available software. Since we're developing a web application and need some kind of multiuser tools, a web-based solution seems to fit the bill—multiple simultaneous users, a single central data store, and no special client software. This is especially useful when you have a development team who use different operating systems.
So what is it that we want to get out of our issue-tracking software? After we've defined what we want, we can look at a few available alternatives that fulfill our needs. We've already said that we're probably interested in a web-based tool.
The core feature of any issue-tracking system is, you guessed it, tracking issues. So the system has to handle entities (which we're calling issues here) and associate a few properties with them. At the very minimum, an issue needs a title and description. As an issue is worked on, we'll need to add comments or notes to keep track of our progress and ideas. When somebody creates an issue, he needs to retain ownership of it, but also be able to assign it to other developers. When an issue is assigned to a developer, he needs to be notified of this action somehow (email is an obvious mechanism), and in the same way, the issue's owner should be notified when the issue is resolved. To know when the issue has been resolved, we need some kind of status for the issue. So we have a small grab bag of properties: title, description, notes, owner, assignee, and status.
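The grab bag of properties above maps naturally onto a small record type. The Python sketch below is one possible shape, assuming just two status values and a notification callback standing in for email; the class, method, and parameter names are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stand-in for a real notification channel such as email: (recipient, message).
Notifier = Callable[[str, str], None]


@dataclass
class Issue:
    title: str
    description: str
    owner: str
    assignee: str = ""
    status: str = "open"  # assumed values: "open" or "resolved"
    notes: List[str] = field(default_factory=list)

    def add_note(self, text: str) -> None:
        """Record progress or ideas as the issue is worked on."""
        self.notes.append(text)

    def assign(self, developer: str, notify: Notifier) -> None:
        """Hand the issue to a developer; the owner stays the same."""
        self.assignee = developer
        notify(developer, f"You have been assigned: {self.title}")

    def resolve(self, notify: Notifier) -> None:
        """Mark the issue resolved and tell the owner."""
        self.status = "resolved"
        notify(self.owner, f"Resolved: {self.title}")


outbox = []
issue = Issue("Login fails", "Form rejects valid passwords", owner="alice")
issue.assign("bob", lambda to, msg: outbox.append((to, msg)))
issue.add_note("Reproduced on the staging server")
issue.resolve(lambda to, msg: outbox.append((to, msg)))
```

Keeping owner and assignee as separate fields is what lets the creator retain ownership while someone else does the work.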
Scaling the Development Model
Source control is as important for a single developer as it is for a small team, but the utility increases as the number of developers grows. Most source-control systems are designed to scale to many simultaneous developers, allowing many people to work on the same project or even the same files at once.
As your development team grows from a single engineer to a small group, little will need to change about your source-control usage. When moving from one to two developers, you'll need to start updating your checked out copy of the source code more often, to integrate changes from other developers back into your working copy. A good rule of thumb is to update your working copy at the start of every development session and additionally when another developer commits a large change set.
If two developers work on the same file at the same time, the changes will be automatically merged by the source-control system, as long as they don't overlap within the file. If you find yourself in a situation where you get merge conflicts, then this probably isn't an issue with your tools but with your processes. Although most source-control systems allow multiple people to work on the same file at once, you still need to coordinate your development between developers and avoid having two people working on the same task at once. No amount of locking, branching, and merging can make up for plain old communication between developers.
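To see why non-overlapping changes can be merged automatically while overlapping ones cannot, consider the deliberately simplified Python sketch below. It assumes that both developers only edit lines in place, never inserting or deleting them, so the three versions can be compared line by line. Real source-control systems use proper diff algorithms that handle insertions and deletions; merge_lines and MergeConflict are names made up for this example.

```python
from typing import List


class MergeConflict(Exception):
    """Raised when both developers changed the same line differently."""


def merge_lines(base: List[str], mine: List[str], theirs: List[str]) -> List[str]:
    """Three-way merge for edits that only change lines in place.

    Simplifying assumption: nobody inserts or deletes lines, so all three
    versions have the same length and can be compared line by line.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged = []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both agree (changed identically or not at all)
            merged.append(m)
        elif m == b:      # only the other developer changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # both changed it, differently: the edits overlap
            raise MergeConflict(f"line {number} was changed by both developers")
    return merged


base = ["title = 'Home'", "limit = 10", "debug = False"]
mine = ["title = 'Welcome'", "limit = 10", "debug = False"]
theirs = ["title = 'Home'", "limit = 10", "debug = True"]
merged = merge_lines(base, mine, theirs)
```

The conflict case is the one no tool can settle on its own, because deciding which edit should win requires knowing what each developer intended.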
As far as deployment tools go, a development team may need to give one or more developers deploy control or deploy-branch check-in rights. As your team grows, you may also wish to designate a single developer as the release manager, responsible for overseeing all code releases. For one developer, the role of checking in production code and deploying it is straightforward, but as you add more developers each release cycle has more dependencies. At some point you may want to limit which developers can commit code to a deployment branch, having other developers commit code to the source-control trunk. When code is ready for deployment, engineers with deploy-branch access can move code into production status and the release manager can release the code to production. This team would still be using one-button deployment tools, but limiting who has access to them. Within the Flickr team, several engineers can commit code to the deployment branch and make a release, while more developers can only release certain portions of the application, such as the configuration module.
Coding Standards
The structure of a software system exists not only at the macro level of general architecture, but also at the micro level of the actual code. As systems grow larger and more complex, having a defined coding standard starts to pay dividends. Every programmer has his favorite style of coding, whether related to indentation, naming, brace styles, or new lines, but there's a very simple rule to bear in mind when talking about coding standards on a project:
It's more important for people on a team to agree to a single coding style than it is to find the perfect style.
When you have more than one developer working on a project, it's easy to waste a lot of time having people reformat each other's code. This is clearly a waste of developer resources and time. Having a standard that everyone agrees on saves time spent reformatting and makes it much easier for one developer to pick up and understand another developer's code.
Since getting developers to agree on a coding standard is a difficult (if not impossible) task, the sensible way to achieve this goal is by creating a standard and forcing your developers to comply. At the start of your project, the lead developer or architect should sit down and document how code and files should be laid out. Any new developers to the project can then be presented with a copy of the document that, as well as coding standards, can act as a guide to reading the application source code.
What do we actually mean when we say that we need a coding standard? A coding standard is loosely defined as a set of rules that govern the structure of code from the smallest detail up to the entirety of the component or layer. A good coding standards document will contain guidance for naming of files, modules, functions, variables, and constants. In addition to describing the layout of the code at the small level, the document can also act as an official and easily referenceable guide to the overall component structure. When adding a feature to a component, the coding standards document should give a fairly clear idea of what file and module the feature belongs in, what its functions should be called, and how the calling system should look.
Testing
Before we get any further, it's worth relaying a fundamental fact: testing web applications is hard.
There are two main types of application testing. The first is automated testing, an important portion of which is called regression testing. The second is manual testing, whereby a human uses the application and tries to find bugs. Hopefully at some point you'll have millions of people testing for you, but it can also be helpful to have planned testing, often referred to as Quality Assurance (QA), to avoid exposing bugs to your wider audience.
Regression testing is designed to avoid regressing your application to a previous buggy state. When you have a bug to fix, you create a test case that currently fails, and then fix the bug so that the test case passes. Whenever you work on the same area of code, you can rerun the test after your changes to be sure that you haven't regressed to the original bug.
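A regression test for a small bug might look like the following Python sketch, which uses the standard unittest module. The slugify function and the bug it once had are both invented for illustration: we imagine an earlier version turned runs of spaces into repeated hyphens, and the test pins down the corrected behavior.

```python
import unittest


def slugify(title: str) -> str:
    """Hypothetical function under test: turn a title into a URL slug.

    The imagined original bug: runs of spaces produced repeated hyphens.
    Splitting on whitespace and rejoining is the fix.
    """
    return "-".join(title.lower().split())


class SlugifyRegressionTest(unittest.TestCase):
    def test_repeated_spaces_collapse_to_one_hyphen(self):
        # This assertion failed before the fix and must keep passing.
        self.assertEqual(slugify("Hello   World"), "hello-world")
```

Saved in a file, the test can be rerun with python -m unittest whenever this area of the code changes, so a reappearance of the old bug is caught right away.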
Automated regression testing requires a fairly closed system with defined outputs given a set of inputs. In a typical web application, the inputs and outputs of features as a whole are directed at the presentation and page logic layers. Any tests that rely on certain page logic have to be updated whenever the presentation or page logic layers are changed, even when the change in interaction has nothing to do with the bug itself. In a rapid development environment, the presentation and page logic layers can change so fast that keeping a test suite working can be a full-time job—in fact, you can easily spend more time maintaining a test suite than fixing real bugs or developing new features.
In a well-layered web application, automated testing belongs at the business logic layer, but as with the layers above, rapid changes can mean that you spend more time updating your tests than on regular development. Unless you have a very large development team and several people to dedicate to maintaining a full coverage test suite (that is, one that covers every line of code in your application), you're going to have to pick and choose areas to have test coverage.
Chapter 4: i18n, L10n, and Unicode
Internationalization, localization, and Unicode are all hot topics in the field of modern web application development. If you build and launch an application without support for multiple languages, you're going to be missing out on a huge portion of your possible user base. Current research suggests that there are about 510 million English-speaking people in the world. If your application only caters to English speakers, you've immediately blocked 92 percent of your potential global audience. These numbers are actually wildly inaccurate and generally used as a scare tactic; you have to consider how many of the world's six billion or so people are online to begin with. But even once we factor this in, we are still left with 64 percent of online users (around 680 million people) who don't speak English (these statistics come from the Global Reach web site: http://global-reach.biz/). That's still a huge number of potential users you're blocking from using your application.
Addressing this problem has historically been a huge deal. Developers would need advanced knowledge of character sets and text processing, language-dependent data would need to be stored separately, and data from one group of users could not be shared with another. But in a world where the Internet is becoming more globally ubiquitous, these problems needed solving. The solutions that were finally reached cut out a lot of the hard work for developers—it's now almost trivially easy to create a multilanguage application, with only a few simple bits of knowledge.
This chapter will get you quickly up to speed with the issues involved with internationalization and localization, and suggest simple ways to solve them. We'll then look at Unicode in detail, explaining what it is, how it works, and how you can implement full Unicode applications quickly and easily. We'll touch on the key areas of data manipulation in web applications where Unicode has a role to play, and identify the potential pitfalls associated with them.
Internationalization and Localization
Internationalization and localization are buzzwords in the web applications field—partly because they're nice long words you can dazzle people with, and partly because they're becoming more important in today's world. Internationalization and localization are often talked about as a pair, but they mean very distinct things, and it's important to understand the difference:
  • Internationalization is adding to an application the ability to input, process, and output international text.
  • Localization is the process of making a customized application available to a specific locale.
Internationalization is often shortened to i18n (the "18" representing the 18 removed letters) and localization to L10n (for the same reason, although an uppercase "L" is used for visual clarity) and we'll ref