Algorithms and Data Structures for Massive Datasets

ISBN: 9781617298035

by Dzejla Medjedovic, Emin Tahirovic, Ines Schweigert

July 2022

Intermediate to advanced

304 pages

9h 15m

English

Manning Publications

title
Copyright
contents
front matter
preface
acknowledgments
about this book
Who should read this book
How this book is organized: A road map
About the code
liveBook discussion forum
about the authors
about the cover illustration
1 Introduction
1.1 An example
1.1.1 An example: How to solve it
1.1.2 How to solve it, take two: A book walkthrough
1.2 The structure of this book
1.3 What makes this book different and whom it is for
1.4 Why is massive data so challenging for today’s systems?
1.4.1 The CPU memory performance gap
1.4.2 Memory hierarchy
1.4.3 Latency vs. bandwidth
1.4.4 What about distributed systems?
1.5 Designing algorithms with hardware in mind
Summary
Part 1 Hash-based sketches
2 Review of hash tables and modern hashing
2.1 Ubiquitous hashing
2.2 A crash course on data structures
2.3 Usage scenarios in modern systems
2.3.1 Deduplication in backup/storage solutions
2.3.2 Plagiarism detection with MOSS and Rabin-Karp fingerprinting
2.4 O(1): What's the big deal?
2.5 Collision resolution: Theory vs. practice
2.6 Usage scenario: How Python’s dict does it
2.7 MurmurHash
2.8 Hash tables for distributed systems: Consistent hashing
2.8.1 A typical hashing problem
2.8.2 Hashring
2.8.3 Lookup
2.8.4 Adding a new node/resource
2.8.5 Removing a node
2.8.6 Consistent hashing scenario: Chord
2.8.7 Consistent hashing: Programming exercises
Summary
3 Approximate membership: Bloom and quotient filters
3.1 How it works
3.1.1 Insert
3.1.2 Lookup
3.2 Use cases
3.2.1 Bloom filters in networks: Squid
3.2.2 Bitcoin mobile app
3.3 A simple implementation
3.4 Configuring a Bloom filter
3.4.1 Playing with Bloom filters: Mini experiments
3.5 A bit of theory
3.5.1 Can we do better?
3.6 Bloom filter adaptations and alternatives
3.7 Quotient filter
3.7.1 Quotienting
3.7.2 Understanding metadata bits
3.7.3 Inserting into a quotient filter: An example
3.7.4 Python code for lookup
3.7.5 Resizing and merging
3.7.6 False positive rate and space considerations
3.8 Comparison between Bloom filters and quotient filters
Summary
4 Frequency estimation and count-min sketch
4.1 Majority element
4.1.1 General heavy hitters
4.2 Count-min sketch: How it works
4.2.1 Update
4.2.2 Estimate
4.3 Use cases
4.3.1 Top-k restless sleepers
4.3.2 Scaling the distributional similarity of words
4.4 Error vs. space in count-min sketch
4.5 A simple implementation of count-min sketch
4.5.1 Exercises
4.5.2 Intuition behind the formula: Math bit
4.6 Range queries with count-min sketch
4.6.1 Dyadic intervals
4.6.2 Update phase
4.6.3 Estimate phase
4.6.4 Computing dyadic intervals
Summary
5 Cardinality estimation and HyperLogLog
5.1 Counting distinct items in databases
5.2 HyperLogLog incremental design
5.2.1 The first cut: Probabilistic counting
5.2.2 Stochastic averaging, or “when life gives you lemons”
5.2.3 LogLog
5.2.4 HyperLogLog: Stochastic averaging with harmonic mean
5.3 Use case: Catching worms with HLL
5.4 But how does it work? A mini experiment
5.4.1 The effect of the number of buckets (m)
5.5 Use case: Aggregation using HyperLogLog
Summary

Part 2 Real-time analytics
6 Streaming data: Bringing everything together
6.1 Streaming data system: A meta example
6.1.1 Bloom-join
6.1.2 Deduplication
6.1.3 Load balancing and tracking the network traffic
6.2 Practical constraints and concepts in data streams
6.2.1 In real time
6.2.2 Small time and small space
6.2.3 Concept shifts and concept drifts
6.2.4 Sliding window model
6.3 Math bit: Sampling and estimation
6.3.1 Biased sampling strategy
6.3.2 Estimation from a representative sample
Summary
7 Sampling from data streams
7.1 Sampling from a landmark stream
7.1.1 Bernoulli sampling
7.1.2 Reservoir sampling
7.1.3 Biased reservoir sampling
7.2 Sampling from a sliding window
7.2.1 Chain sampling
7.2.2 Priority sampling
7.3 Sampling algorithms comparison
7.3.1 Simulation setup: Algorithms and data
Summary
8 Approximate quantiles on data streams
8.1 Exact quantiles
8.2 Approximate quantiles
8.2.1 Additive error
8.2.2 Relative error
8.2.3 Relative error in the data domain
8.3 T-digest: How it works
8.3.1 Digest
8.3.2 Scale functions
8.3.3 Merging t-digests
8.3.4 Space bounds for t-digest
8.4 Q-digest
8.4.1 Constructing a q-digest from scratch
8.4.2 Merging q-digests
8.4.3 Error and space considerations in q-digests
8.4.4 Quantile queries with q-digests
8.5 Simulation code and results
Summary
Part 3 Data structures for databases and external memory algorithms
9 Introducing the external memory model
9.1 External memory model: The preliminaries
9.2 Example 1: Finding a minimum
9.2.1 Use case: Minimum median income
9.3 Example 2: Binary search
9.3.1 Bioinformatics use case
9.3.2 Runtime analysis
9.4 Optimal searching
9.5 Example 3: Merging K sorted lists
9.5.1 Merging time/date logs
9.5.2 External memory model: Simple or simplistic?
9.6 What’s next
Summary
10 Data structures for databases: B-trees, Bε-trees, and LSM-trees
10.1 How indexing works
10.2 Data structures in this chapter
10.3 B-trees
10.3.1 B-tree balancing
10.3.2 Lookup
10.3.3 Insert
10.3.4 Delete
10.3.5 B+-trees
10.3.6 How operations on a B+-tree are different
10.3.7 Use case: B-trees in MySQL (and many other places)
10.4 Math bit: Why are B-tree lookups optimal in external memory?
10.4.1 Why B-tree inserts/deletes are not optimal in external memory
10.5 Bε-trees
10.5.1 Bε-tree: How it works
10.5.2 Buffering mechanics
10.5.3 Inserts and deletes
10.5.4 Lookups
10.5.5 Cost analysis
10.5.6 Bε-tree: The spectrum of data structures
10.5.7 Use case: Bε-trees in TokuDB
10.5.8 Make haste slowly, the I/O way
10.6 Log-structured merge-trees (LSM-trees)
10.6.1 The LSM-tree: How it works
10.6.2 LSM-tree cost analysis
10.6.3 Use case: LSM-trees in Cassandra
Summary
11 External memory sorting
11.1 Sorting use cases
11.1.1 Robot motion planning
11.1.2 Cancer genomics
11.2 Challenges of sorting in external memory: An example
11.2.1 Two-way merge-sort in external memory
11.3 External memory merge-sort (M/B-way merge-sort)
11.3.1 Searching and sorting in RAM vs. external memory
11.4 What about external quick-sort?
11.4.1 External memory two-way quick-sort
11.4.2 Toward external memory multiway quick-sort
11.4.3 Finding enough pivots
11.4.4 Finding good enough pivots
11.4.5 Putting it all back together
11.5 Math bit: Why is external memory merge-sort optimal?
11.6 Wrapping up
Summary
references
index
inside back cover

Content preview from Algorithms and Data Structures for Massive Datasets

6 Streaming data: Bringing everything together

This chapter covers

Learning about the streaming data pipeline model and its distributed framework
Determining where streaming data applications and the data stream model meet
Identifying where algorithms and data structures fit in data streams
Setting up basic computing constraints and concepts inherent to data streams
Giving some probabilistic background for the next two chapters to follow

Previous chapters introduced several algorithms and data structures for sketching (that is, summarizing an important characteristic of) huge amounts of data, whether residing in a database or, as you saw in the application of HyperLogLog to network traffic surveillance, arriving and expiring at a lightning rate. In this chapter, we will ...
