Introduction
Big data pushed the boundaries of tools, applications, and skill sets in 2016. It did so because it’s bigger, faster, more prevalent, and more prized than ever.
According to O’Reilly’s 2016 Data Science Salary Survey, the top tools used for data science continue to be SQL, Excel, R, and Python. A common theme in recent tool-related blog posts on oreilly.com is the need for powerful storage and compute tools that can process high-volume, often streaming, data. For example, Federico Castanedo’s blog post “Scalable Data Science with R” describes how scaling R with distributed frameworks, such as RHadoop and SparkR, can help when a data set is too large to hold in RAM.
On the storage side, more organizations are looking to migrate their data, along with their storage and compute operations, from warehouses built on proprietary software to managed services in the cloud. There is, and will continue to be, a lot to talk about on this topic: building a data pipeline in the cloud, security and governance of data in the cloud, cluster monitoring and tuning to optimize resources, and of course, the three providers that dominate this area: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
In terms of techniques, machine learning and deep learning continue to generate buzz in the industry. The algorithms behind natural language processing and image recognition, for example, are incredibly complex, and their utility, in the ...