
Big Data Now: 2016 Edition

Name: Big Data Now: 2016 Edition
Author: O'Reilly Media, Inc.
ISBN: 9781491977484


February 2017

Beginner to intermediate

160 pages

3h 43m

English

O'Reilly Media, Inc.


Introduction
1. Careers in Data
    Five Secrets for Writing the Perfect Data Science Resume
    There’s Nothing Magical About Learning Data Science
    Put Aside the Technology Stack
    Keep Data Lying Around
    Have a Strategy
    Hack
    Experiment
    Data Scientists: Generalists or Specialists?
    Early Days
    Later Stage
    Conclusion
2. Tools and Architecture for Big Data
    Apache Cassandra for Analytics: A Performance and Storage Analysis
    Wide Spectrum of Storage Costs and Query Speeds
    Summary of Methodology for Analysis
    Scan Speeds Are Dominated by Storage Format
    Storage Efficiency Generally Correlates with Scan Speed
    A Formula for Modeling Query Performance
    Can Caching Help? A Little Bit.
    The Future: Optimizing for CPU, Not I/O
    Filtering and Data Modeling
    Cassandra’s Secondary Indices Usually Not Worth It
    Predicting Your Own Data’s Query Performance
    Conclusions
    Scalable Data Science with R
    Data Science Gophers
    Go, a Cure for Common Data Science Pains
    The Go Data Science Ecosystem
    Data Gathering, Organization, and Parsing
    Arithmetic and Statistics
    Exploratory Analysis and Visualization
    Machine Learning
    Get Started with Go for Data Science
    Applying the Kappa Architecture to the Telco Industry
    What Is Kappa Architecture?
    Building the Analytics Pipeline
    Incorporating a Bayesian Model to Do Advanced Analytics
    Conclusion
3. Intelligent Real-Time Applications
    The World Beyond Batch Streaming
    Streaming 102
    Extend Structured Streaming for Spark ML
    Semi-Supervised, Unsupervised, and Adaptive Algorithms for Large-Scale Time Series
    Surfacing Anomalies
    Adaptive, Online, Unsupervised Algorithms at Scale
    Discovering Relationships Among KPIs and Semi-Supervised Learning
    Related Resources:
    Uber’s Case for Incremental Processing on Hadoop
    Near-Real-Time Use Cases
    Incremental Processing via “Mini” Batches
    Challenges of Incremental Processing
    Takeaways
4. Cloud Infrastructure
    Where Should You Manage a Cloud-Based Hadoop Cluster?
    High-Level Differentiators
    Cloud Ecosystem Integration
    Big Data Is More Than Just Hadoop
    Key Takeaways
    Spark Comparison: AWS Versus GCP
    Submitting Spark Jobs to the Cloud
    Configuring Cloud Services
    You Get What You Pay For
    Performance Comparison
    Conclusion
    Time-Series Analysis on Cloud Infrastructure Metrics
    Infrastructure Usage Data
    Scheduled Auto Scaling
    Dynamic Auto Scaling
    Assess Cost Savings First
5. Machine Learning: Models and Training
    What Is Hardcore Data Science—in Practice?
    Computing Recommendations
    Bringing Mathematical Approaches into Industry
    Understanding Data Science Versus Production
    Why Start Small?
    Distinguishing a Production System from Data Science
    Data Scientists and Developers: Modes of Collaboration
    Constantly Adapt and Improve
    Training and Serving NLP Models Using Spark MLlib
    Constructing Predictive Models with Spark
    The Process of Building a Machine-Learning Product
    Operationalization
    Spark’s Role
    Fitting It Into Our Existing Platform with IdiML
    Faster, Flexible Performant Systems
    Three Ideas to Add to Your Data Science Toolkit
    Use a Reusable Holdout Method to Avoid Overfitting During Interactive Data Analysis
    Use Random Search for Black-Box Parameter Tuning
    Explain Your Black-Box Models Using Local Approximations
    Related Resources
    Introduction to Local Interpretable Model-Agnostic Explanations (LIME)
    Intuition Behind LIME
    Examples
    Conclusion
6. Deep Learning and AI
    The Current State of Machine Intelligence 3.0
    Ready Player World
    Why Even Bot-Her?
    On to 11111000001
    Peter Pan’s Never-Never Land
    Inspirational Machine Intelligence
    Looking Forward
    Hello, TensorFlow!
    Names and Execution in Python and TensorFlow
    The Simplest TensorFlow Graph
    The Simplest TensorFlow Neuron
    See Your Graph in TensorBoard
    Making the Neuron Learn
    Flowing Onward
    Compressing and Regularizing Deep Neural Networks
    Current Training Methods Are Inadequate
    Deep Compression
    DSD Training
    Generating Image Descriptions
    Advantages of Sparsity

Content preview from Big Data Now: 2016 Edition

Introduction

Big data pushed the boundaries in 2016. It pushed the boundaries of tools, applications, and skill sets. And it did so because it’s bigger, faster, more prevalent, and more prized than ever.

According to O’Reilly’s 2016 Data Science Salary Survey, the top tools used for data science continue to be SQL, Excel, R, and Python. A common theme in recent tool-related blog posts on oreilly.com is the need for powerful storage and compute tools that can process high-volume, often streaming, data. For example, Federico Castanedo’s blog post “Scalable Data Science with R” describes how scaling R using distributed frameworks—such as RHadoop and SparkR—can help solve the problem of storing massive data sets in RAM.

On the storage side, more organizations are looking to migrate their data, along with their storage and compute operations, from warehouses running proprietary software to managed services in the cloud. There is, and will continue to be, a lot to talk about on this topic: building a data pipeline in the cloud, security and governance of data in the cloud, cluster monitoring and tuning to optimize resources, and, of course, the three providers that dominate this area, namely Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.

In terms of techniques, machine learning and deep learning continue to generate buzz in the industry. The algorithms behind natural language processing and image recognition, for example, are incredibly complex, and their utility, in the ...


Publisher Resources

ISBN: 9781492049197
