
Big Data Now: 2016 Edition

Name: Big Data Now: 2016 Edition
Author: O'Reilly Media, Inc.
ISBN: 9781491977484


February 2017

Beginner to intermediate

160 pages

3h 43m

English

O'Reilly Media, Inc.


Introduction
1. Careers in Data
    Five Secrets for Writing the Perfect Data Science Resume
    There’s Nothing Magical About Learning Data Science
    Put Aside the Technology Stack
    Keep Data Lying Around
    Have a Strategy
    Hack
    Experiment
    Data Scientists: Generalists or Specialists?
    Early Days
    Later Stage
    Conclusion
2. Tools and Architecture for Big Data
    Apache Cassandra for Analytics: A Performance and Storage Analysis
    Wide Spectrum of Storage Costs and Query Speeds
    Summary of Methodology for Analysis
    Scan Speeds Are Dominated by Storage Format
    Storage Efficiency Generally Correlates with Scan Speed
    A Formula for Modeling Query Performance
    Can Caching Help? A Little Bit.
    The Future: Optimizing for CPU, Not I/O
    Filtering and Data Modeling
    Cassandra’s Secondary Indices Usually Not Worth It
    Predicting Your Own Data’s Query Performance
    Conclusions
    Scalable Data Science with R
    Data Science Gophers
    Go, a Cure for Common Data Science Pains
    The Go Data Science Ecosystem
    Data Gathering, Organization, and Parsing
    Arithmetic and Statistics
    Exploratory Analysis and Visualization
    Machine Learning
    Get Started with Go for Data Science
    Applying the Kappa Architecture to the Telco Industry
    What Is Kappa Architecture?
    Building the Analytics Pipeline
    Incorporating a Bayesian Model to Do Advanced Analytics
    Conclusion
3. Intelligent Real-Time Applications
    The World Beyond Batch Streaming
    Streaming 102
    Extend Structured Streaming for Spark ML
    Semi-Supervised, Unsupervised, and Adaptive Algorithms for Large-Scale Time Series
    Surfacing Anomalies
    Adaptive, Online, Unsupervised Algorithms at Scale
    Discovering Relationships Among KPIs and Semi-Supervised Learning
    Related Resources
    Uber’s Case for Incremental Processing on Hadoop
    Near-Real-Time Use Cases
    Incremental Processing via “Mini” Batches
    Challenges of Incremental Processing
    Takeaways
4. Cloud Infrastructure
    Where Should You Manage a Cloud-Based Hadoop Cluster?
    High-Level Differentiators
    Cloud Ecosystem Integration
    Big Data Is More Than Just Hadoop
    Key Takeaways
    Spark Comparison: AWS Versus GCP
    Submitting Spark Jobs to the Cloud
    Configuring Cloud Services
    You Get What You Pay For
    Performance Comparison
    Conclusion
    Time-Series Analysis on Cloud Infrastructure Metrics
    Infrastructure Usage Data
    Scheduled Auto Scaling
    Dynamic Auto Scaling
    Assess Cost Savings First
5. Machine Learning: Models and Training
    What Is Hardcore Data Science—in Practice?
    Computing Recommendations
    Bringing Mathematical Approaches into Industry
    Understanding Data Science Versus Production
    Why Start Small?
    Distinguishing a Production System from Data Science
    Data Scientists and Developers: Modes of Collaboration
    Constantly Adapt and Improve
    Training and Serving NLP Models Using Spark MLlib
    Constructing Predictive Models with Spark
    The Process of Building a Machine-Learning Product
    Operationalization
    Spark’s Role
    Fitting It Into Our Existing Platform with IdiML
    Faster, Flexible Performant Systems
    Three Ideas to Add to Your Data Science Toolkit
    Use a Reusable Holdout Method to Avoid Overfitting During Interactive Data Analysis
    Use Random Search for Black-Box Parameter Tuning
    Explain Your Black-Box Models Using Local Approximations
    Related Resources
    Introduction to Local Interpretable Model-Agnostic Explanations (LIME)
    Intuition Behind LIME
    Examples
    Conclusion
6. Deep Learning and AI
    The Current State of Machine Intelligence 3.0
    Ready Player World
    Why Even Bot-Her?
    On to 11111000001
    Peter Pan’s Never-Never Land
    Inspirational Machine Intelligence
    Looking Forward
    Hello, TensorFlow!
    Names and Execution in Python and TensorFlow
    The Simplest TensorFlow Graph
    The Simplest TensorFlow Neuron
    See Your Graph in TensorBoard
    Making the Neuron Learn
    Flowing Onward
    Compressing and Regularizing Deep Neural Networks
    Current Training Methods Are Inadequate
    Deep Compression
    DSD Training
    Generating Image Descriptions
    Advantages of Sparsity

Content preview from Big Data Now: 2016 Edition

Chapter 2. Tools and Architecture for Big Data

In this chapter, Evan Chan performs a storage and query cost-analysis on various analytics applications, and describes how Apache Cassandra stacks up in terms of ad hoc, batch, and time-series analysis. Next, Federico Castanedo discusses how using distributed frameworks to scale R can help solve the problem of storing large and ever-growing data sets in RAM. Daniel Whitenack then explains how a new programming language from Google—Go—could help data science teams overcome common obstacles such as integrating data science in an engineering organization. Whitenack also details the many tools, packages, and resources that allow users to perform data cleansing, visualization, and even machine learning in Go. Finally, Nicolas Seyvet and Ignacio Mulas Viela describe how the telecom industry is navigating the current data analytics environment. In their use case, they apply both Kappa architecture and a Bayesian anomaly detection model to a high-volume data stream originating from a cloud monitoring system.

Apache Cassandra for Analytics: A Performance and Storage Analysis

By Evan Chan

You can read this post on oreilly.com here.

This post is about using Apache Cassandra for analytics. Think time series, IoT, data warehousing, writing, and querying large swaths of data—not so much transactions or shopping carts. Users thinking of Cassandra as an event store and source/sink for machine learning/modeling/classification would also benefit greatly ...




