Big Data Now: 2016 Edition

Name: Big Data Now: 2016 Edition
Author: O'Reilly Media, Inc.
ISBN: 9781491977484

February 2017

Beginner to intermediate

160 pages

3h 43m

English

O'Reilly Media, Inc.

Introduction
1. Careers in Data
    Five Secrets for Writing the Perfect Data Science Resume
    There’s Nothing Magical About Learning Data Science
    Put Aside the Technology Stack
    Keep Data Lying Around
    Have a Strategy
    Hack
    Experiment
    Data Scientists: Generalists or Specialists?
    Early Days
    Later Stage
    Conclusion
2. Tools and Architecture for Big Data
    Apache Cassandra for Analytics: A Performance and Storage Analysis
    Wide Spectrum of Storage Costs and Query Speeds
    Summary of Methodology for Analysis
    Scan Speeds Are Dominated by Storage Format
    Storage Efficiency Generally Correlates with Scan Speed
    A Formula for Modeling Query Performance
    Can Caching Help? A Little Bit.
    The Future: Optimizing for CPU, Not I/O
    Filtering and Data Modeling
    Cassandra’s Secondary Indices Usually Not Worth It
    Predicting Your Own Data’s Query Performance
    Conclusions
    Scalable Data Science with R
    Data Science Gophers
    Go, a Cure for Common Data Science Pains
    The Go Data Science Ecosystem
    Data Gathering, Organization, and Parsing
    Arithmetic and Statistics
    Exploratory Analysis and Visualization
    Machine Learning
    Get Started with Go for Data Science
    Applying the Kappa Architecture to the Telco Industry
    What Is Kappa Architecture?
    Building the Analytics Pipeline
    Incorporating a Bayesian Model to Do Advanced Analytics
    Conclusion
3. Intelligent Real-Time Applications
    The World Beyond Batch Streaming
    Streaming 102
    Extend Structured Streaming for Spark ML
    Semi-Supervised, Unsupervised, and Adaptive Algorithms for Large-Scale Time Series
    Surfacing Anomalies
    Adaptive, Online, Unsupervised Algorithms at Scale
    Discovering Relationships Among KPIs and Semi-Supervised Learning
    Related Resources
    Uber’s Case for Incremental Processing on Hadoop
    Near-Real-Time Use Cases
    Incremental Processing via “Mini” Batches
    Challenges of Incremental Processing
    Takeaways
4. Cloud Infrastructure
    Where Should You Manage a Cloud-Based Hadoop Cluster?
    High-Level Differentiators
    Cloud Ecosystem Integration
    Big Data Is More Than Just Hadoop
    Key Takeaways
    Spark Comparison: AWS Versus GCP
    Submitting Spark Jobs to the Cloud
    Configuring Cloud Services
    You Get What You Pay For
    Performance Comparison
    Conclusion
    Time-Series Analysis on Cloud Infrastructure Metrics
    Infrastructure Usage Data
    Scheduled Auto Scaling
    Dynamic Auto Scaling
    Assess Cost Savings First
5. Machine Learning: Models and Training
    What Is Hardcore Data Science—in Practice?
    Computing Recommendations
    Bringing Mathematical Approaches into Industry
    Understanding Data Science Versus Production
    Why Start Small?
    Distinguishing a Production System from Data Science
    Data Scientists and Developers: Modes of Collaboration
    Constantly Adapt and Improve
    Training and Serving NLP Models Using Spark MLlib
    Constructing Predictive Models with Spark
    The Process of Building a Machine-Learning Product
    Operationalization
    Spark’s Role
    Fitting It Into Our Existing Platform with IdiML
    Faster, Flexible Performant Systems
    Three Ideas to Add to Your Data Science Toolkit
    Use a Reusable Holdout Method to Avoid Overfitting During Interactive Data Analysis
    Use Random Search for Black-Box Parameter Tuning
    Explain Your Black-Box Models Using Local Approximations
    Related Resources
    Introduction to Local Interpretable Model-Agnostic Explanations (LIME)
    Intuition Behind LIME
    Examples
    Conclusion
6. Deep Learning and AI
    The Current State of Machine Intelligence 3.0
    Ready Player World
    Why Even Bot-Her?
    On to 11111000001
    Peter Pan’s Never-Never Land
    Inspirational Machine Intelligence
    Looking Forward
    Hello, TensorFlow!
    Names and Execution in Python and TensorFlow
    The Simplest TensorFlow Graph
    The Simplest TensorFlow Neuron
    See Your Graph in TensorBoard
    Making the Neuron Learn
    Flowing Onward
    Compressing and Regularizing Deep Neural Networks
    Current Training Methods Are Inadequate
    Deep Compression
    DSD Training
    Generating Image Descriptions
    Advantages of Sparsity

Content preview from Big Data Now: 2016 Edition

Chapter 3. Intelligent Real-Time Applications

To begin the chapter, we include an excerpt from Tyler Akidau’s post on streaming engines for processing unbounded data. In this excerpt, Akidau describes the utility of watermarks and triggers to help determine when results are materialized during processing time. Holden Karau then explores how machine-learning algorithms, particularly Naive Bayes, may eventually be implemented on top of Spark’s Structured Streaming API. Next, we include highlights from Ben Lorica’s discussion with Anodot’s cofounder and chief data scientist Ira Cohen. They explored the challenges in building an advanced analytics system that requires scalable, adaptive, and unsupervised machine-learning algorithms. Finally, Uber’s Vinoth Chandar tells us about a variety of processing systems for near-real-time data, and how adding incremental processing primitives to existing technologies can solve a lot of problems.

The World Beyond Batch Streaming

By Tyler Akidau

This is an excerpt. You can read the full blog post on oreilly.com.

Streaming 102

We just observed the execution of a windowed pipeline on a batch engine. But ideally, we’d like to have lower latency for our results, and we’d also like to natively handle unbounded data sources. Switching to a streaming engine is a step in the right direction; but whereas the batch engine had a known point at which the input for each window was complete (i.e., once all of the data in the bounded input source had ...
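The contrast between a batch engine and a streaming engine can be made concrete with a small sketch. The Python below is not taken from the excerpt or from any streaming engine's API. It is a minimal illustration that assumes fixed event-time windows, a summing aggregation, and a simple heuristic watermark (the largest event time seen so far minus a fixed lag). The names `run_stream`, `WINDOW_SIZE`, and `WATERMARK_LAG` are hypothetical. A window's result is materialized only once the watermark passes the end of that window, which stands in for the "input is complete" signal that a batch engine gets for free from a bounded source.

```python
from collections import defaultdict

# Assumed parameters for illustration only; real pipelines choose their own.
WINDOW_SIZE = 120    # length of each fixed event-time window, in seconds
WATERMARK_LAG = 60   # assumed bound on how far out of order events may arrive


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def run_stream(events):
    """Sum values per fixed window, emitting a window when the watermark passes it.

    events: iterable of (event_time, value) pairs in arrival order.
    Returns (emitted, pending, dropped):
      emitted - list of (window_start, total) in the order they were materialized
      pending - dict of windows whose input is not yet believed complete
      dropped - events that arrived after their window had been closed
    """
    sums = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, value in events:
        start = window_start(event_time)
        if start + WINDOW_SIZE <= watermark:
            # The watermark already declared this window complete: late data.
            dropped.append((event_time, value))
            continue
        sums[start] += value

        # Heuristic watermark: it only moves forward.
        watermark = max(watermark, event_time - WATERMARK_LAG)

        # Materialize every open window that the watermark has passed.
        for s in sorted(sums):
            if s + WINDOW_SIZE <= watermark:
                emitted.append((s, sums.pop(s)))

    return emitted, dict(sums), dropped


if __name__ == "__main__":
    # Out-of-order arrivals: (50, 9) is tolerated, (30, 8) arrives too late.
    stream = [(10, 5), (70, 7), (130, 3), (50, 9), (200, 4), (30, 8), (310, 1)]
    emitted, pending, dropped = run_stream(stream)
    print(emitted)   # [(0, 21), (120, 7)]
    print(pending)   # {240: 1}
    print(dropped)   # [(30, 8)]
```

Because the watermark here is only a guess about completeness, the sketch can drop data that arrives later than the assumed lag. That trade-off between latency and correctness is the reason the chapter introduction pairs watermarks with triggers.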
