7 Using SQL with SAS and R
7.1 What is SQL?
SQL (Structured Query Language) is a language for querying and modifying data in relational database management systems (RDBMSs). SQL is also used outside traditional databases, for example in Apache Hive, Python, and PySpark. The pandasql package lets you query pandas DataFrames using SQL syntax. In Spark 1.x, the entry point into all SQL functionality is the SQLContext class; since Spark 2.0 that role is filled by SparkSession. The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
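As a minimal sketch of what "querying and modifying data" looks like in practice, the following Python example uses the standard-library sqlite3 module with an in-memory database. The table name patients and its rows are hypothetical, invented purely for illustration; the same SQL statements would look much the same in other SQL environments, although dialects differ in detail.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, invented for illustration only.
cur.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.executemany(
    "INSERT INTO patients (id, name, age) VALUES (?, ?, ?)",
    [(1, "Ann", 34), (2, "Bob", 58), (3, "Cy", 41)],
)

# Modifying data: UPDATE changes matching rows in place.
cur.execute("UPDATE patients SET age = age + 1 WHERE name = ?", ("Bob",))

# Querying data: SELECT returns the rows that satisfy the WHERE clause.
cur.execute("SELECT name, age FROM patients WHERE age > 40 ORDER BY age")
older_patients = cur.fetchall()
print(older_patients)

conn.commit()
conn.close()
```

The `?` placeholders pass values separately from the SQL text, which avoids building queries by string concatenation.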
7.1.1 Basic Terminology
A database is a collection of information that is organized so that it can be easily accessed, managed and updated.
A relational database is a set of tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.
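To illustrate how the same tables can be reassembled in different ways without being reorganized, here is a small sketch using Python's sqlite3 module. The customers and orders tables and their contents are assumptions made up for this example. Two queries read the same two tables: one produces a row per order, the other a row per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables linked by customer_id (illustrative data only).
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# View 1: one row per order, with the customer name attached by a join.
order_rows = conn.execute("""
    SELECT o.order_id, c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# View 2: the same tables, reassembled as one row per customer with a total.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(order_rows)
print(totals)
conn.close()
```

Neither query changes how the data is stored; only the SELECT statement decides the shape of the result.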
7.1.2 CAP Theorem
The CAP theorem states that a distributed database system can guarantee at most two of the following three properties at the same time: consistency, availability, and partition tolerance.
- Consistency: Every read receives the most recent write or an error.
- Availability: Every request receives a (non‐error) response – without the guarantee that it contains the most recent write
- Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
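The trade-off among these three properties can be sketched with a toy model. The Python code below is not a real database; the classes Replica and TwoNodeStore and the "CP"/"AP" mode labels are hypothetical names introduced only for this illustration. It simulates two replicas and a network partition: in "CP" mode the store refuses requests during a partition (giving up availability), while in "AP" mode it keeps answering but may return stale data (giving up consistency).

```python
class Replica:
    """One node holding a single value."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP' (illustrative only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False  # True means the replicas cannot talk

    def _check_available(self):
        # A CP store prefers an error over a possibly inconsistent answer.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot reach the other replica")

    def write(self, node_index, value):
        self._check_available()
        self.nodes[node_index].value = value
        if not self.partitioned:
            # Replication succeeds only while the network is healthy.
            self.nodes[1 - node_index].value = value

    def read(self, node_index):
        self._check_available()
        return self.nodes[node_index].value


# AP: stays available during the partition, but node 1 serves a stale value.
ap = TwoNodeStore("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "v2")
ap_stale = ap.read(1)

# CP: stays consistent by refusing the write while partitioned.
cp = TwoNodeStore("CP")
cp.write(0, "v1")
cp.partitioned = True
try:
    cp.write(0, "v2")
    cp_refused = False
except RuntimeError:
    cp_refused = True

print(ap_stale, cp_refused)
```

Real systems use more refined strategies (for example, majority quorums), so this sketch only shows the shape of the choice a system faces once a partition occurs.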
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event ...
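Of the four ACID properties, atomicity is the easiest to demonstrate in a few lines. The following is a minimal sketch using Python's sqlite3 module; the accounts table, the transfer function, and the balances are all hypothetical. A transfer consists of two UPDATE statements, and if the second one violates a constraint, the first is rolled back as well, so the transfer happens entirely or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical accounts table; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as a single transaction."""
    try:
        # 'with conn' commits if the block succeeds and rolls back on error.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint; the credit was undone too.
        return False


first_ok = transfer(conn, "alice", "bob", 30)    # enough funds: commits
second_ok = transfer(conn, "alice", "bob", 500)  # overdraft: rolls back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(first_ok, second_ok, balances)
conn.close()
```

The credit is deliberately issued before the debit so that the failed transfer shows a partially applied change being undone.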