Chapter 5. Servers

“The cloud” is a vague term, but it reflects a real change in how computing resources are made available. Purchasing or leasing a physical machine for the long term used to be the norm; now it’s much more common to rent computers that run as virtual instances. Virtualization makes it economical for the provider to offer very short-term rentals of flexible numbers of machines, which is ideal for many data processing applications. Being able to quickly fire up a large cluster makes it possible to tackle very big data problems on a small budget. Since many companies take different approaches to this sort of server rental, I’ll look at what they offer from the perspective of a data processing developer.
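To make the budget argument concrete, here is a rough sketch of the arithmetic in Python. It assumes a perfectly parallel job, billing rounded up to whole hours per machine, and a made-up price of $0.10 per machine-hour; the function name `cluster_cost` and all of the numbers are hypothetical and are not taken from any provider’s price list.

```python
import math


def cluster_cost(total_machine_hours, machines, hourly_rate):
    """Return (wall_clock_hours, cost) for a perfectly parallel job.

    Assumption: each machine's usage is billed in whole hours, rounded up.
    """
    hours_per_machine = math.ceil(total_machine_hours / machines)
    cost = hours_per_machine * machines * hourly_rate
    return hours_per_machine, cost


HOURLY_RATE = 0.10  # hypothetical price in dollars per machine-hour

# The same 100 machine-hours of work, spread over clusters of different sizes.
for machines in (1, 10, 100):
    hours, cost = cluster_cost(100, machines, HOURLY_RATE)
    print(f"{machines:>3} machines: {hours:>3} h wall clock, ${cost:.2f}")
```

Under these assumptions, the job costs the same whether it runs on one machine for 100 hours or on 100 machines for one hour, so the larger cluster buys a faster answer at no extra charge. Real jobs rarely parallelize perfectly, and rounding up to whole hours adds waste when the work doesn’t divide evenly across the machines.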

In simple terms, Amazon’s EC2 (Elastic Compute Cloud) lets you rent computers by the hour, with a choice of different memory and CPU configurations. You get network access to a complete Linux or Windows server that you can log into as root, allowing you to install software and configure the system however you like. Under the hood, these machines are virtual, with many hosted on each physical server in the data center, which keeps prices low. Many other companies offer virtualized servers, but EC2 stands out for data processing applications because of the ecosystem that has grown up around it. It has a rich set of third-party virtual machine snapshots to start from and easy integration with S3, both through a raw interface and through ...
