Data Science and Engineering at Enterprise Scale

Book description

As enterprise-scale data science sharpens its focus on data-driven decision making and machine learning, new tools have emerged to facilitate these processes. This practical ebook shows data scientists and enterprise developers how the notebook interface, Apache Spark, and other collaboration tools are particularly well suited to bridging the communication gap between their teams.

Through a series of real-world examples, author Jerome Nilmeier demonstrates how to generate a model that enables data scientists and developers to share ideas and project code. You’ll learn how data scientists can approach real-world business problems with Spark and how developers can then implement the solution in a production environment.

  • Dive deep into data science technologies, including Spark, TensorFlow, and the Jupyter Notebook
  • Learn how Spark and Python notebooks enable data scientists and developers to work together
  • Explore how the notebook environment works with Spark SQL for structured data
  • Use notebooks and Spark as a launchpad to pursue supervised, unsupervised, and deep learning data models
  • Learn additional Spark functionality, including graph analysis and streaming
  • Explore the use of analytics in the production environment, particularly when creating data pipelines and deploying code

Table of contents

  1. Foreword
  2. Preface
    1. What This Book Will Cover and How It Will Help You with Your Daily Work
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  3. 1. Sharing Information Across Disciplines in the Enterprise
    1. The Overlap Between Data Scientist and Data Engineer
    2. How Notebooks Bridge the Gap
    3. Notebooks as a Medium of Communication
    4. Example: Validating Statistical Functions and Developing Unit Tests
      1. Evaluating a Validated Desktop-Scale Function
      2. Understanding the Logic to Be Used at Scale
      3. Generating a Unit Test with a Smaller Sample
      4. Writing the Scalable Code
    5. Summary
  4. 2. Setting Up Your Notebook Environment
    1. Quick Start with Watson Studio
      1. Creating a Project and Importing a Notebook
    2. Setting Up Your Own Environment
      1. Using Docker Images
      2. Installing Apache Spark, TensorFlow, and Notebooks
      3. Installing Spark
      4. Installing Java
      5. Installing Spark Binary
      6. Creating the Python Environment
      7. Installing Jupyter
      8. Installing Deep Learning Frameworks
    3. Summary
  5. 3. Data Science Technologies
    1. Apache Spark
      1. Spark Core: Executors, Cluster Configurations, and More
      2. RDDs, Datasets, and DataFrames: How to Use Them
      3. Example: Creating and Calculating with an RDD
      4. Caching Results
    2. Spark SQL and DataFrames
    3. Summary
  6. 4. Introduction to Machine Learning
    1. Linear Regression as a Machine Learning Model
    2. Defining the Loss Function
    3. Solving for Parameters
      1. A “trick” for linear models: The normal equation
    4. Numerical Optimization, the Workhorse of All Machine Learning
    5. Feature Scaling
    6. Letting the Libraries Do Their Job
    7. The Data Scientist Has a Job to Do Too
    8. Summary
  7. 5. Classic Machine Learning Examples and Applications
    1. Supervised Learning Models
      1. The Activation Function: From a Value to a Label
      2. Using Labeled Data for Training Your Model
    2. Making Predictions with the Trained Model
      1. Evaluating Model Performance and Deploying the Model
    3. Collaborative Filtering
      1. Understanding the Model as a Latent Feature Model
    4. Unsupervised Learning Models
      1. K Means Clustering
    5. From Clusters to Topics: Text Analytics with Unsupervised Learning
      1. K Means Clustering Using Word2Vec
      2. The Latent Dirichlet Allocation
    6. Summary
  8. 6. Advanced Machine Learning Examples and Applications
    1. Deep Learning Models with Spark and TensorFlow
      1. The Neural Network
      2. Training the Neural Network
    2. Graph Analytics
      1. What Is a Graph and Why Should We Care?
    3. Summary

Product information

  • Title: Data Science and Engineering at Enterprise Scale
  • Author(s): Jerome Nilmeier
  • Release date: April 2019
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492039334