Hacks, Leaks, and Revelations

Book description

Unlock the internet’s treasure trove of public interest data with Hacks, Leaks, and Revelations by Micah Lee, an investigative reporter and security engineer. This hands-on guide blends real-world techniques for researching large datasets with lessons on coding, data authentication, and digital security. All of this is spiced up with gripping stories from the front lines of investigative journalism.

Dive into exposed datasets from a wide array of sources: the FBI, the DHS, police intelligence agencies, extremist groups like the Oath Keepers, and even a Russian ransomware gang. Lee’s own in-depth case studies on disinformation-peddling pandemic profiteers and neo-Nazi chatrooms serve as blueprints for your research.

Gain practical skills in searching massive troves of data for keywords like “antifa” and pinpointing documents with newsworthy revelations. Get a crash course in Python to automate the analysis of millions of files.

You will also learn how to:

  • Master encrypted messaging to safely communicate with whistleblowers.
  • Secure datasets over encrypted channels using Signal, Tor Browser, OnionShare, and SecureDrop.
  • Harvest data from the BlueLeaks collection of internal memos, financial records, and more from over 200 state, local, and federal agencies.
  • Probe leaked email archives about offshore detention centers and the Heritage Foundation.
  • Analyze metadata from videos of the January 6 attack on the US Capitol, sourced from the Parler social network.

We live in an age where hacking and whistleblowing can unearth secrets that alter history. Hacks, Leaks, and Revelations is your toolkit for uncovering new stories and hidden truths. Crack open your laptop, plug in a hard drive, and get ready to change history.

Table of contents

  1. Praise for Hacks, Leaks, and Revelations
  2. Title Page
  3. Copyright
  4. Dedication
  5. About the Author and Technical Reviewer
  6. Acknowledgments
  7. Introduction
    1. Why I Wrote This Book
    2. What You’ll Learn
    3. What You’ll Need
  8. Part I: Sources and Datasets
    1. 1. Protecting Sources and Yourself
      1. Safely Communicating with Sources
        1. Working with Public Data
        2. Protecting Sensitive Information
        3. Minimizing the Digital Trail
        4. Working with Hackers and Whistleblowers
      2. Secure Storage for Datasets
        1. Low-Sensitivity Datasets
        2. Medium-Sensitivity Datasets
        3. High-Sensitivity Datasets
      3. Authenticating Datasets
        1. The AFLDS Dataset
        2. The WikiLeaks Twitter Group Chat
      4. Redaction
        1. What Data to Publish
        2. What to Redact
      5. Making Requests for Comment
      6. Password Managers
      7. Disk Encryption
      8. Exercise 1-1: Encrypt Your Internal Disk
        1. Windows
        2. macOS
        3. Linux
      9. Exercise 1-2: Encrypt a USB Disk
        1. Windows
        2. macOS
        3. Linux
      10. Protecting Yourself from Malicious Documents
      11. Exercise 1-3: Install and Use Dangerzone
      12. Summary
    2. 2. Acquiring Datasets
      1. The End of WikiLeaks
      2. Distributed Denial of Secrets
      3. Downloading Datasets with BitTorrent
      4. The Origins of BlueLeaks
      5. Exercise 2-1: Download the BlueLeaks Dataset
      6. Communicating with Encrypted Messaging Apps
      7. Exercise 2-2: Install and Practice Using Signal
      8. Encrypting Messages with PGP
      9. Staying Anonymous Online with Tor and OnionShare
      10. Exercise 2-3: Play with Tor and OnionShare
      11. Communicating with My Tea Party Patriots Source
      12. Other Options for Acquiring Datasets from Sources
        1. Encrypted USB Drives
        2. Virtual Private Servers
      13. Whistleblower Submission Systems
      14. Summary
  9. Part II: Tools of the Trade
    1. 3. The Command Line Interface
      1. Introducing the Command Line
        1. The Shell
        2. Users and Paths
        3. User Privileges
      2. Exercise 3-1: Install Ubuntu in Windows
      3. Basic Command Line Usage
        1. Opening a Terminal
        2. Clearing Your Screen and Exiting the Shell
        3. Exploring Files and Directories
        4. Navigating Relative and Absolute Paths
        5. Changing Directories
        6. Using the help Argument
        7. Accessing Man Pages
      4. Tips for Navigating the Terminal
        1. Entering Commands with Tab Completion
        2. Editing Commands
        3. Dealing with Spaces in Filenames
        4. Using Single Quotes Around Double Quotes
      5. Installing and Uninstalling Software with Package Managers
      6. Exercise 3-2: Manage Packages with Homebrew on macOS
      7. Exercise 3-3: Manage Packages with apt on Windows or Linux
      8. Exercise 3-4: Practice Using the Command Line with cURL
        1. Download a Web Page with cURL
        2. Save a Web Page to a File
      9. Text Files vs. Binary Files
      10. Exercise 3-5: Install the VS Code Text Editor
      11. Exercise 3-6: Write Your First Shell Script
        1. Navigate to Your USB Disk
        2. Create an Exercises Folder
        3. Open a VS Code Workspace
        4. Write the Shell Script
        5. Run the Shell Script
      12. Exercise 3-7: Clone the Book’s GitHub Repository
      13. Summary
    2. 4. Exploring Datasets in the Terminal
      1. Introducing for Loops
      2. Exercise 4-1: Unzip the BlueLeaks Dataset
        1. Unzip Files on macOS or Linux
        2. Unzip Files on Windows
        3. Organize Your Files
      3. How the Hacker Obtained the BlueLeaks Data
      4. Exercise 4-2: Explore BlueLeaks on the Command Line
        1. Calculate How Much Disk Space Folders Use
        2. Use Pipes and Sort Output
        3. Create an Inventory of Filenames in a Dataset
        4. Count the Files in a Dataset
      5. Exercise 4-3: Find Revelations in BlueLeaks with grep
        1. Filter for Documents Mentioning Antifa
        2. Filter for Certain Types of Files
        3. Use grep with Regular Expressions
        4. Search Files in Bulk with grep
      6. Encrypted Data in the BlueLeaks Dataset
      7. Data Analysis with Servers in the Cloud
      8. Exercise 4-4: Set Up a VPS
        1. Generate an SSH Key
        2. Add Your Public Key to the Cloud Provider
        3. Create a VPS
        4. SSH into Your Server
        5. Start a Byobu Session
        6. Install Updates
      9. Exercise 4-5: Explore the Oath Keepers Dataset Remotely
      10. Summary
    3. 5. Docker, Aleph, and Making Datasets Searchable
      1. Introducing Docker and Linux Containers
      2. Exercise 5-1: Initialize Docker Desktop on Windows and macOS
      3. Exercise 5-2: Initialize Docker Engine on Linux
      4. Running Containers with Docker
        1. Running an Ubuntu Container
        2. Listing and Killing Containers
        3. Mounting and Removing Volumes
        4. Passing Environment Variables
        5. Running Server Software
        6. Freeing Up Disk Space
      5. Exercise 5-3: Run a WordPress Site with Docker Compose
        1. Make a docker-compose.yaml File
        2. Start Your WordPress Site
      6. Introducing Aleph
      7. Exercise 5-4: Run Aleph Locally in Linux Containers
      8. Using Aleph’s Web and Command Line Interfaces
      9. Indexing Data in Aleph
      10. Exercise 5-5: Index a BlueLeaks Folder in Aleph
        1. Mount Your Datasets into the Aleph Shell
        2. Index the icefishx Folder
        3. Check Indexing Status
      11. Explore BlueLeaks with Aleph
      12. Additional Aleph Features
      13. Dedicated Aleph Servers
      14. Summary
    4. 6. Reading Other People’s Email
      1. The Email Protocol and Message Structure
      2. File Formats for Email Dumps
        1. EML Files
        2. MBOX Files
        3. PST Outlook Data Files
      3. Exercise 6-1: Download Email Dumps from Three Datasets
        1. The Nauru Police Force Dataset
        2. The Oath Keepers Dataset
        3. The Heritage Foundation Dataset
      4. Researching Email Dumps with Thunderbird
      5. Exercise 6-2: Configure Thunderbird for Email Dumps
      6. Reading Individual EML Files with Thunderbird
      7. Exercise 6-3: Import the Nauru Police Force EML Email Dump
      8. Searching Email in Thunderbird
        1. Quick Filter Searches
        2. The Search Messages Dialog
      9. Exercise 6-4: Import the Oath Keepers MBOX Email Dump
      10. Exercise 6-5: Import the Heritage Foundation PST Email Dump
      11. Other Tools for Researching Email Dumps
        1. Microsoft Outlook
        2. Aleph
      12. Summary
  10. Part III: Python Programming
    1. 7. An Introduction to Python
      1. Exercise 7-1: Install Python
        1. Windows
        2. Linux
        3. macOS
      2. Exercise 7-2: Write Your First Python Script
      3. Python Basics
        1. The Interactive Python Interpreter
        2. Comments
        3. Math with Python
        4. Strings
      4. Exercise 7-3: Write a Python Script with Variables, Math, and Strings
      5. Lists and Loops
        1. Defining and Printing Lists
        2. Running for Loops
      6. Control Flow
        1. Comparison Operators
        2. if Statements
        3. Nested Code Blocks
        4. Searching Lists
        5. Logical Operators
        6. Exception Handling
      7. Exercise 7-4: Practice Loops and Control Flow
      8. Functions
        1. The def Keyword
        2. Default Arguments
        3. Return Values
        4. Docstrings
      9. Exercise 7-5: Practice Writing Functions
      10. Summary
    2. 8. Working with Data in Python
      1. Modules
      2. Python Script Template
      3. Exercise 8-1: Traverse the Files in BlueLeaks
        1. List the Filenames in a Folder
        2. Count the Files and Folders in a Folder
      4. Traverse Folders with os.walk()
      5. Exercise 8-2: Find the Largest Files in BlueLeaks
      6. Third-Party Modules
      7. Exercise 8-3: Practice Command Line Arguments with Click
      8. Avoiding Hardcoding with Command Line Arguments
      9. Exercise 8-4: Find the Largest Files in Any Dataset
        1. Dictionaries
          1. Defining Dictionaries
          2. Getting and Setting Values
      10. Navigating Dictionaries and Lists in the Conti Chat Logs
        1. Exploring Dictionaries and Lists Full of Data in Python
        2. Selecting Values in Dictionaries and Lists
        3. Analyzing Data Stored in Dictionaries and Lists
      11. Exercise 8-5: Map Out the CSVs in BlueLeaks
        1. Accept a Command Line Argument
        2. Loop Through the BlueLeaks Folders
        3. Fill Up the Dictionary
        4. Display the Output
      12. Reading and Writing Files
        1. Opening Files
        2. Writing Lines to a File
        3. Reading Lines from a File
      13. Exercise 8-6: Practice Reading and Writing Files
      14. Summary
  11. Part IV: Structured Data
    1. 9. BlueLeaks, Black Lives Matter, and the CSV File Format
      1. Installing Spreadsheet Software
      2. Introducing the CSV File Format
      3. Exploring CSV Files with Spreadsheet Software and Text Editors
      4. My BlueLeaks Investigation
        1. Focusing on a Fusion Center
        2. Introducing NCRIC
        3. Investigating a SAR
      5. Reading and Writing CSV Files in Python
      6. Exercise 9-1: Make BlueLeaks CSVs More Readable
        1. Accept the CSV Path as an Argument
        2. Loop Through the CSV Rows
        3. Display CSV Fields on Separate Lines
      7. How to Read Bulk Email from Fusion Centers
        1. Lists of Black Lives Matter Demonstrations
        2. “Intelligence” Memos from the FBI and DHS
      8. A Brief HTML Primer
      9. Exercise 9-2: Make Bulk Email Readable
        1. Accept the Command Line Arguments
        2. Create the Output Folder
        3. Define the Filename for Each Row
        4. Write the HTML Version of Each Bulk Email
      10. Discovering the Names and URLs of BlueLeaks Sites
      11. Exercise 9-3: Make a CSV of BlueLeaks Sites
        1. Open a CSV for Writing
        2. Find All the Company.csv Files
        3. Add BlueLeaks Sites to the CSV
      12. Summary
    2. 10. BlueLeaks Explorer
      1. Undiscovered Revelations in BlueLeaks
      2. Exercise 10-1: Install BlueLeaks Explorer
        1. Create the Docker Compose Configuration File
        2. Bring Up the Containers
        3. Initialize the Databases
      3. The Structure of NCRIC
        1. Exploring Tables and Relationships
        2. Searching for Keywords
      4. Building Your Own BlueLeaks Structure
        1. Defining the JRIC Structure
        2. Showing Useful Fields
        3. Changing Field Types
        4. Adding JRIC’s Leads Table
        5. Building a Relationship
      5. Verifying BlueLeaks Data
      6. Exercise 10-2: Finish Building the Structure for JRIC
      7. The Technology Behind BlueLeaks Explorer
        1. The Backend
        2. The Frontend
      8. Summary
    3. 11. Parler, the January 6 Insurrection, and the JSON File Format
      1. The Origins of the Parler Dataset
        1. How the Parler Videos Were Archived
        2. The Dataset’s Impact on Trump’s Second Impeachment
      2. Exercise 11-1: Download and Extract Parler Video Metadata
        1. Download the Metadata
        2. Uncompress and Download Individual Parler Videos
        3. Extract Parler Metadata
      3. The JSON File Format
        1. Understanding JSON Syntax
        2. Parsing JSON with Python
        3. Handling Exceptions with JSON
      4. Tools for Exploring JSON Data
        1. Counting Videos with GPS Coordinates Using grep
        2. Formatting and Searching Data with the jq Command
      5. Exercise 11-2: Write a Script to Filter for Videos with GPS from January 6, 2021
        1. Accept the Parler Metadata Path as an Argument
        2. Loop Through Parler Metadata Files
        3. Filter for Videos with GPS Coordinates
        4. Filter for Videos from January 6, 2021
      6. Working with GPS Coordinates
        1. Searching by Latitude and Longitude
        2. Converting Between GPS Coordinate Formats
        3. Calculating GPS Distance in Python
        4. Finding the Center of Washington, DC
      7. Exercise 11-3: Update the Script to Filter for Insurrection Videos
      8. Plotting GPS Coordinates on a Map with simplekml
      9. Exercise 11-4: Create KML Files to Visualize Location Data
        1. Create a KML File for All Videos with GPS Coordinates
        2. Create KML Files for Videos from January 6, 2021
      10. Visualizing Location Data with Google Earth
      11. Viewing Metadata with ExifTool
      12. Summary
    4. 12. Epik Fail, Extremism Research, and SQL Databases
      1. The Structure of SQL Databases
        1. Relational Databases
        2. Clients and Servers
        3. Tables, Columns, and Types
      2. Exercise 12-1: Create and Test a MySQL Server Using Docker and Adminer
        1. Run the Server
        2. Connect to the Database with Adminer
        3. Create a Test Database
      3. Exercise 12-2: Query Your SQL Database
        1. INSERT Statements
        2. SELECT Statements
        3. JOIN Clauses
        4. UPDATE Statements
        5. DELETE Statements
      4. Introducing the MySQL Command Line Client
      5. Exercise 12-3: Install and Test the Command Line MySQL Client
      6. MySQL-Specific Queries
      7. The History of Epik
        1. The Epik Hack
        2. Epik’s WHOIS Data
      8. Exercise 12-4: Download and Extract Part of the Epik Dataset
      9. Exercise 12-5: Import Epik Data into MySQL
        1. Create a Database for api_system
        2. Import api_system Data
      10. Exploring Epik’s SQL Database
        1. The domain Table
        2. The privacy Table
        3. The hosting and hosting_server Tables
      11. Working with Epik Data in the Cloud
      12. Summary
  12. Part V: Case Studies
    1. 13. Pandemic Profiteers and Covid-19 Disinformation
      1. The Origins of AFLDS
      2. The Cadence Health and Ravkoo Datasets
        1. Extracting the Data into an Encrypted File Container
        2. Analyzing the Data with Command Line Tools
      3. Creating a Single Spreadsheet of Patients
      4. Calculating Revenue from Prescriptions Filled by Ravkoo
        1. Finding the Price and Quantity of Drugs Sold
        2. Categorizing Prescription Data by Drug
      5. A Deeper Look at the Cadence Health Patient Data
        1. Finding Cadence’s Partners
        2. Searching for Patients by City
        3. Searching for Patients by Age
      6. Authenticating the Data
      7. The Aftermath
        1. HIPAA’s Breach Notification Rule
        2. Congressional Investigation
        3. Simone Gold’s New Business Venture
        4. Scandal and Infighting at AFLDS
      8. Summary
    2. 14. Neo-Nazis and Their Chatrooms
      1. How Antifascists Infiltrated Neo-Nazi Discord Servers
      2. Analyzing Leaked Chat Logs
        1. Making JSON Files Readable
        2. Exploring Objects, Keys, and Values with jq
        3. Converting Timestamps
        4. Finding Usernames
      3. The Discord History Tracker
      4. A Script to Search the JSON Files
      5. My Discord Analysis Code
        1. Designing the SQL Database
        2. Importing Chat Logs into the SQL Database
        3. Building the Web Interface
        4. Using Discord Analysis to Find Revelations
      6. The Pony Power Discord Server
      7. The Launch of DiscordLeaks
      8. The Aftermath
        1. The Lawsuit Against Unite the Right
        2. The Patriot Front Chat Logs
      9. Summary
  13. Afterword
  14. A. Solutions to Common WSL Problems
    1. Understanding WSL’s Linux Filesystem
    2. The Disk Performance Problem
    3. Solving the Disk Performance Problem
      1. Storing Only Active Datasets in Linux
      2. Storing Your Linux Filesystem on a USB Disk
    4. Next Steps
  15. B. Scraping the Web
    1. Legal Considerations
    2. HTTP Requests
    3. Scraping Techniques
      1. Loading Pages with HTTPX
      2. Parsing HTML with Beautiful Soup
      3. Automating Web Browsers with Selenium
    4. Next Steps
  16. Index

Product information

  • Title: Hacks, Leaks, and Revelations
  • Author(s): Micah Lee
  • Release date: January 2024
  • Publisher(s): No Starch Press
  • ISBN: 9781718503120