Hands-On Big Data Modeling

Solve all big data problems by learning how to create efficient data models

Key Features

  • Create effective models that get the most out of big data
  • Apply your knowledge to Twitter and weather datasets to learn big data modeling
  • Tackle different data modeling challenges with expert techniques presented in this book

Book Description

Modeling and managing data is a central focus of all big data projects. In fact, a database is considered to be effective only if you have a logical and sophisticated data model. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements.

To start with, you'll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you'll work with structured and semi-structured data with the help of real-life examples. Once you've got to grips with the basics, you'll use the SQL Developer Data Modeler to create your own data models containing different file types such as CSV, XML, and JSON. You'll also learn to create graph data models and explore data modeling with streaming data using real-world datasets.

By the end of this book, you'll be able to design and develop efficient data models for data of varying sizes.

What you will learn

  • Get insights into big data and discover various data models
  • Explore conceptual, logical, and big data models
  • Understand how to model data containing different file types
  • Run through data modeling examples using Twitter, Bitcoin, IMDb, and weather data
  • Create graph data and vector space models
  • Model structured and unstructured data using Python and R

Who this book is for

This book is great for programmers, geologists, biologists, and every professional who deals with spatial data. If you want to learn how to handle GIS, GPS, and remote sensing data, then this book is for you. Basic knowledge of R and QGIS would be helpful.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Hands-On Big Data Modeling
  3. About Packt
    1. Why subscribe?
    2. Packt.com
  4. Contributors
    1. About the authors
    2. About the reviewers
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Introduction to Big Data and Data Management
    1. The concept of big data 
      1. Interesting insights regarding big data
      2. Characteristics of big data
    2. Sources and types of big data
      1. Challenges of big data
    3. Introduction to big data modeling
      1. Uses of models
    4. Introduction to managing big data
    5. Importance and implications of big data modeling and management
      1. Benefits of big data management
      2. Challenges in big data management 
    6. Setting up big data modeling platforms
      1. Getting started on Windows
      2. Getting started on macOS
    7. Summary
    8. Further reading
  7. Data Modeling and Management Platforms
    1. Big data management
      1. Data ingestion
      2. Data storage
      3. Data quality
      4. Data operations
      5. Data scalability and security
    2. Big data management services
      1. Data cleansing
      2. Data integration
    3. Big data management vendors
    4. Big data storage and data models
      1. Storage models
        1. Block-based storage
        2. File-based storage 
        3. Object-based storage
      2. Data models
        1. Relational stores (SQLs)
          1. Scalable relational systems
          2. Database as a Service (DaaS)
        2. NoSQL stores
          1. Document stores
          2. Key-value stores
          3. Extensible-record stores
    5. Big data programming models
      1. MapReduce
        1. MapReduce functionality
        2. Hadoop
          1. Features of Hadoop frameworks
        3. Yet Another Resource Negotiator 
      2. Functional programming
        1. Spark
          1. Reasons to choose Apache Spark
        2. Flink
          1. Advantages of Flink
      3. SQL data models
        1. Hive Query Language (HQL)
        2. Cassandra Query Language (CQL)
        3. Spark SQL
        4. Apache Drill
    6. Getting started with Python and R
      1. Python on macOS
      2. Python on Windows
      3. R on macOS
      4. R on Windows
    7. Summary
    8. Further reading
  8. Defining Data Models
    1. Data model structures
      1. Structured data
      2. Unstructured data
        1. Sources of unstructured data
      3. Comparing structured and unstructured data
    2. Data operations
      1. Subsetting
      2. Union
      3. Projection
      4. Join
    3. Data constraints
      1. Types of constraints
        1. Value constraints
        2. Uniqueness constraints
        3. Cardinality constraints
        4. Type constraints
        5. Domain constraints
        6. Structural constraints
    4. A unified approach to big data modeling and data management
    5. Summary
    6. Further reading
  9. Categorizing Data Models
    1. Levels of data modeling
      1. Conceptual data modeling
      2. Logical data modeling
        1. Benefits of constructing LDMs
      3. Physical data modeling
        1. Features of the physical data model
    2. Types of data model
      1. Hierarchical database models
      2. Relational models
        1. Advantages of the relational data model
      3. Network models
      4. Object-oriented database model
      5. Entity-relationship models
      6. Object-relational models
    3. Summary
    4. Further reading
  10. Structures of Data Models
    1. Semi-structured data models
      1. Exploring the semi-structured data model of JSON data
        1. Installing Python and the Tweepy library
        2. Getting authorization credentials to access the Twitter API
    2. VSM with Lucene
      1. Lucene
    3. Graph-data models
      1. Graph-data models with Gephi
    4. Summary 
    5. Further reading
  11. Modeling Structured Data
    1. Getting started with structured data
      1. NumPy
        1. Operations using NumPy
      2. Pandas
      3. Matplotlib
      4. Seaborn
      5. IPython
    2. Modeling structured data using Python
      1. Visualizing the location of houses based on latitude and longitude
      2. Factors that affect the price of houses
        1. Visualizing more than one parameter
      3. Gradient-boosting regression
    3. Summary
    4. Further reading
  12. Modeling with Unstructured Data
    1. Getting started with unstructured data
      1. Tools for intelligent analysis
      2. New methods of data processing
    2. Tools for analyzing unstructured data
      1. Weka
      2. KNIME
        1. Characteristics of KNIME
      3. The R language
      4. Unstructured text analysis using R
        1. Data ingestion
        2. Data cleaning and transformations
        3. Data visualization
        4. Improving the model
    3. Summary
    4. Further reading
  13. Modeling with Streaming Data
    1. Data stream and data model versus data format
    2. Why is streaming data different?
      1. Use cases of stream processing
      2. What is a data stream?
      3. Data streaming systems
      4. How streaming works
        1. Data harvesting
        2. Data processing
        3. Data analytics
    3. Importance and implications of streaming data
      1. Needs for stream processing
      2. Challenges with streaming data
      3. Streaming data solutions
    4. Exploring streaming sensor data from the Twitter API
      1. Analyzing the streaming data
    5. Summary
    6. Further reading
  14. Streaming Sensor Data
    1. Sensor data
    2. Data lakes
      1. Differences between data lakes and data warehouses
      2. How a data lake works
    3. Exploring streaming sensor data from a weather station
    4. Summary
    5. Further study
  15. Concept and Approaches of Big-Data Management
    1. Non-DBMS-based approach to big data
      1. Filesystems
        1. Problems with processing files
    2. DBMS-based approach to big data
      1. Advantages of the DBMS
        1. Declarative Query Language (DQL)
        2. Data independence
        3. Controlling data redundancy
        4. Centralized data management and concurrent access
        5. Data integrity
        6. Data availability
        7. Efficient access through optimization
    3. Parallel and distributed DBMS
      1. Parallel DBMS
        1. Motivations for parallel DBMS
        2. Architectures for parallel databases
      2. Distributed DBMS
        1. Features of a distributed DBMS
        2. Merits of a distributed DBMS
    4. DBMS and MapReduce-style systems
    5. Summary
    6. Further reading
  16. DBMS to BDMS
    1. Characteristics of BDMS
      1. BASE properties
    2. Exploring data management with Redis
      1. Getting started with Redis on macOS
      2. Advanced key-value stores
      3. Redis and Hadoop
    3. Aerospike
      1. Aerospike technology
    4. AsterixDB
      1. Data models
      2. The Asterix query language
      3. Getting started with AsterixDB
      4. Unstructured data in AsterixDB
      5. Inserting into datasets
      6. Querying in AsterixDB
    5. Summary
    6. Further reading
  17. Modeling Bitcoin Data Points with Python
    1. Introduction to Bitcoin data
    2. Theory
    3. Importing Bitcoin data into IPython
      1. Importing required libraries
    4. Preprocessing and model creation
    5. Predicting Bitcoin price using Recurrent Neural Network
      1. Importing packages
      2. Importing datasets
      3. Preprocessing
      4. Constructing the RNN model
      5. Prediction
    6. Summary
    7. Further reading
  18. Modeling Twitter Feeds Using Python
    1. Importing Twitter feed data
    2. Modeling Twitter feeds
      1. The frequency of the tweets
      2. Sentiment analysis
        1. Installing TextBlob
        2. Parts of speech
        3. Noun-phrase extraction
        4. Tokenization
        5. Bag of words
    3. Summary
    4. Further reading
  19. Modeling Weather Data Points with Python
    1. Introduction to weather data
    2. Importing data
      1. Forecasting Nepal's temperature change
    3. Modeling with data
      1. Persistence model forecast
      2. Weather statistics by country
      3. Linear regression to predict the temperature of a city
    4. Summary
    5. Further reading
  20. Modeling IMDb Data Points with Python
    1. Introduction to IMDb data
      1. Episode data
      2. Rating data
    2. Theory 
    3. Modeling with the IMDb dataset
      1. Starting the platform
      2. Importing the required libraries
      3. Importing a file
      4. Data cleansing
      5. Clustering
    4. Summary
    5. Further reading
  21. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think