SQLite for Data Scientists
After years of NoSQL hype, data scientists are starting to love SQL again. Modern hardware can now easily and economically support the “traditional” relation approach without needing complex distributed systems. SQLite—a C-language library that implements a lightweight, full-featured SQL database engine—is one of the most popular databases in the industry and is used in everything from web applications, web browsers, and every Android smartphone to airplanes.
Join expert Florents Tselai to learn how SQLite can help data scientists iterate the data process as quickly as possible. You’ll explore some of the more advanced features of SQL and SQLite that will help you empower and streamline your workflow as you discover how to seamlessly integrate SQLite into a data pipeline and use it to produce reports simply and efficiently, analyze unstructured and semistructured data, and store machine learning models. Along the way, you’ll also get hands-on with features like triggers—all integrated into Python-based data science scenarios.
What you'll learn-and how you can apply it
By the end of this live online course, you’ll understand:
- How standard database concepts such as triggers fit into the modern data science practice
- Why SQL is still relevant
- How SQL can be more than enough for data cleaning and transformation
- How SQL and Python can work nicely together
- How to use SQLite to produce advanced reports simply and efficiently
- How to integrate SQLite seamlessly into your data pipeline
- How to use SQLite to analyse unstructured data (e.g., text) and semistructured data (e.g., JSON documents) without having to use complex systems like ElasticSearch or NLTK
- How to use SQLite to store machine learning models
And you’ll be able to:
- Replace multiple pandas one-liners with a few lines of SQL code
- Replace time-consuming Python-based data cleaning with SQL triggers
- Use SQLite to generate reports
- Avoid using complex and expensive big data systems and still achieve better results
- Use Python for true machine learning and experimentation, instead of “boring” data cleaning
- Reduce the financial cost of your machine learning models and analyses
This training course is for you because...
- You think SQL could solve many of your problems.
- You’re tired of having to book meetings with database administrators, data engineers, business analysts, and project managers to produce a useful report.
- Your ETL processes have become too complex, and you want to simplify them.
- You’re interested in automating report generation but don’t know how to go about it.
- You want to reduce the amount of time you spend writing ETL processes in Python.
- Familiarity with SQL concepts (e.g., simple queries, joins, and group-by clauses) and Python syntax
- A basic understanding of the Python data stack (e.g., tools like pandas and scikit-learn) (useful but not required)
- Read Python for Data Analysis, second edition (book)
About your instructor
Florents Tselai has research, engineering, and leadership experience across the entire data supply chain, from modeling and applying machine learning algorithms to engineering and managing production deployments of big data pipelines. He’s collaborated with partners from industries as varied as retail, advertising, energy, insurance, pharmaceuticals, and telecoms, mainly in the areas of predictive analytics, market analysis, and marketing budget allocation. Companies Florents has worked for include Deutsche Telekom, BASF, GfK, and XING.
The timeframes are only estimates and may vary according to how the class is progressing
Understanding SQLite (20 minutes)
- Presentation: How SQLite is different from pandas and other databases
- Demo: Install SQLite and load a dataset
SQL with SQLite: Part 1 (30 minutes)
- Presentation: Time-related queries with SQLite; views; common-table expressions (CTEs)
- Demo: Run SQL on the SQLite CLI; run SQLite queries via Python/pandas
- Jupyter Notebook exercise: Write an SQL view and CTE
Break (5 minutes)
SQL with SQLite: Part 2 (30 minutes)
- Presentation: Execution plan; database indices
- Jupyter Notebook exercise: Create an index to speed up execution time
- Demo: Storing Machine Learning Models
Querying Unstructured JSON Data (30 minutes)
- Presentation: Unstructured data; JSON and document stores; schemaless data and indices on expressions
- Demo: Query JSON with SQLite
- Jupyter notebook exercises: Query JSON data, create views, and fetch results into a pandas DataFrame; create indices to speed up JSON queries
Break (5 minutes)
Implementing Full-Text Search (FTS5) (30 minutes)
- Presentation: How full-text search (FTS) works
- Demo: Run FTS queries with SQLite
- Jupyter Notebook exercise: Run FTS queries from Python
Advanced SQL for data scientists (20 minutes)
- Presentation: ETL data with triggers
- Presentation: Window functions
- Jupyter Notebook exercise: Write a trigger function to clean input data
Wrap-up and Q&A (10 minutes)