Advanced Algorithms and Data Structures

Book description

As a software engineer, you’ll encounter countless programming challenges that initially seem confusing, difficult, or even impossible. Don’t despair! Many of these “new” problems already have well-established solutions. Advanced Algorithms and Data Structures teaches you powerful approaches to a wide range of tricky coding challenges that you can adapt and apply to your own applications. Providing a balanced blend of classic, advanced, and new algorithms, this practical guide upgrades your programming toolbox with new perspectives and hands-on techniques.

About the Technology
Can you improve the speed and efficiency of your applications without investing in new hardware? Well, yes, you can: Innovations in algorithms and data structures have led to huge advances in application performance. Pick up this book to discover a collection of advanced algorithms that will make you a more effective developer.

About the Book
Advanced Algorithms and Data Structures introduces a collection of algorithms for complex programming challenges in data analysis, machine learning, and graph computing. You’ll discover cutting-edge approaches to a variety of tricky scenarios. You’ll even learn to design your own data structures for projects that require a custom solution.

What's Inside
  • Build on basic data structures you already know
  • Profile your algorithms to speed up applications
  • Store and query strings efficiently
  • Distribute clustering algorithms with MapReduce
  • Solve logistics problems using graphs and optimization algorithms
About the Reader
For intermediate programmers.

About the Author
Marcello La Rocca is a research scientist and a full-stack engineer. His focus is on optimization algorithms, genetic algorithms, machine learning, and quantum computing.

Quotes
An accessible introduction to the fundamental algorithms used to run the world.
- Richard Vaughan, Purple Monkey Collective

Easy to follow and filled with practical examples, this book is a fantastic introduction to data structures and algorithms.
- Zachary Fleischmann, Hawthorne Interactive

Look no further if you’re looking for a rich book to bridge the stuff of computer science courses and the pragmatic world of software development.
- George Thomas, Manhattan Associates

Teaches how to modify algorithms and data structures to solve practical problems.
- Maciej Jurkowski, Grupa Pracuj

Table of contents

  1. inside front cover
  2. Advanced Algorithms and Data Structures
  3. Copyright
  4. dedication
  5. contents
  6. front matter
    1. foreword
    2. preface
      1. Welcome to Advanced Algorithms and Data Structures
    3. acknowledgments
    4. about this book
      1. Who should read this book?
      2. How this book is organized: a roadmap
      3. About the code
      4. liveBook discussion forum
    5. about the author
    6. about the cover illustration
  7. 1 Introducing data structures
    1. 1.1 Data structures
      1. 1.1.1 Defining a data structure
      2. 1.1.2 Describing a data structure
      3. 1.1.3 Algorithms and data structures: Is there a difference?
    2. 1.2 Setting goals: Your expectations after reading this book
    3. 1.3 Packing your knapsack: Data structures meet the real world
      1. 1.3.1 Abstracting the problem away
      2. 1.3.2 Looking for solutions
      3. 1.3.3 Algorithms to the rescue
      4. 1.3.4 Thinking (literally) outside of the box
      5. 1.3.5 Happy ending
    4. Summary
  8. Part 1. Improving over basic data structures
  9. 2 Improving priority queues: d-way heaps
    1. 2.1 Structure of this chapter
    2. 2.2 The problem: Handling priority
      1. 2.2.1 Priority in practice: Bug tracking
    3. 2.3 Solutions at hand: Keeping a sorted list
      1. 2.3.1 From sorted lists to priority queues
    4. 2.4 Describing the data structure API: Priority queues
      1. 2.4.1 Priority queue at work
      2. 2.4.2 Priority matters: Generalize FIFO
    5. 2.5 Concrete data structures
      1. 2.5.1 Comparing performance
      2. 2.5.2 What’s the right concrete data structure?
      3. 2.5.3 Heap
      4. 2.5.4 Priority, min-heap, and max-heap
      5. 2.5.5 Advanced variant: d-ary heap
    6. 2.6 How to implement a heap
      1. 2.6.1 BubbleUp
      2. 2.6.2 PushDown
      3. 2.6.3 Insert
      4. 2.6.4 Top
      5. 2.6.5 Update
      6. 2.6.6 Dealing with duplicates
      7. 2.6.7 Heapify
      8. 2.6.8 Beyond API methods: Contains
      9. 2.6.9 Performance recap
      10. 2.6.10 From pseudo-code to implementation
    7. 2.7 Use case: Find the k largest elements
      1. 2.7.1 The right data structure . . .
      2. 2.7.2 . . . and the right use
      3. 2.7.3 Coding it up
    8. 2.8 More use cases
      1. 2.8.1 Minimum distance in graphs: Dijkstra
      2. 2.8.2 More graphs: Prim's algorithm
      3. 2.8.3 Data compression: Huffman codes
    9. 2.9 Analysis of branching factor
      1. 2.9.1 Do we need d-ary heaps?
      2. 2.9.2 Running time
      3. 2.9.3 Finding the optimal branching factor
      4. 2.9.4 Branching factor vs memory
    10. 2.10 Performance analysis: Finding the best branching factor
      1. 2.10.1 Please welcome profiling
      2. 2.10.2 Interpreting results
      3. 2.10.3 The mystery with heapify
      4. 2.10.4 Choosing the best branching factor
    11. Summary
  10. 3 Treaps: Using randomization to balance binary search trees
    1. 3.1 Problem: Multi-indexing
      1. 3.1.1 The gist of the solution
    2. 3.2 Solution: Description and API
    3. 3.3 Treap
      1. 3.3.1 Rotations
      2. 3.3.2 A few design questions
      3. 3.3.3 Implementing search
      4. 3.3.4 Insert
      5. 3.3.5 Delete
      6. 3.3.6 Top, peek, and update
      7. 3.3.7 Min, max
      8. 3.3.8 Performance recap
    4. 3.4 Applications: Randomized treaps
      1. 3.4.1 Balanced trees
      2. 3.4.2 Introducing randomization
      3. 3.4.3 Applications of Randomized Treaps
    5. 3.5 Performance analysis and profiling
      1. 3.5.1 Theory: Expected height
      2. 3.5.2 Profiling height
      3. 3.5.3 Profiling running time
      4. 3.5.4 Profiling memory usage
      5. 3.5.5 Conclusions
    6. Summary
  11. 4 Bloom filters: Reducing the memory for tracking content
    1. 4.1 The dictionary problem: Keeping track of things
    2. 4.2 Alternatives to implementing a dictionary
    3. 4.3 Describing the data structure API: Associative array
    4. 4.4 Concrete data structures
      1. 4.4.1 Unsorted array: Fast insertion, slow search
      2. 4.4.2 Sorted arrays and binary search: Slow insertion, fast(-ish) search
      3. 4.4.3 Hash table: Constant-time on average, unless you need ordering
      4. 4.4.4 Binary search tree: Every operation is logarithmic
      5. 4.4.5 Bloom filter: As fast as hash tables, but saves memory (with a catch)
    5. 4.5 Under the hood: How do Bloom filters work?
    6. 4.6 Implementation
      1. 4.6.1 Using a Bloom filter
      2. 4.6.2 Reading and writing bits
      3. 4.6.3 Find where a key is stored
      4. 4.6.4 Generating hash functions
      5. 4.6.5 Constructor
      6. 4.6.6 Checking a key
      7. 4.6.7 Storing a key
      8. 4.6.8 Estimating accuracy
    7. 4.7 Applications
      1. 4.7.1 Cache
      2. 4.7.2 Routers
      3. 4.7.3 Crawler
      4. 4.7.4 IO fetcher
      5. 4.7.5 Spell checker
      6. 4.7.6 Distributed databases and file systems
    8. 4.8 Why Bloom filters work
      1. 4.8.1 Why there are no false negatives . . .
      2. 4.8.2 . . . But there are false positives
      3. 4.8.3 Bloom filters as randomized algorithms
    9. 4.9 Performance analysis
      1. 4.9.1 Running time
      2. 4.9.2 Constructor
      3. 4.9.3 Storing an element
      4. 4.9.4 Looking up an element
    10. 4.10 Estimating Bloom filter precision
      1. 4.10.1 Explanation of the false-positive ratio formula
    11. 4.11 Improved variants
      1. 4.11.1 Bloomier filter
      2. 4.11.2 Combining Bloom filters
      3. 4.11.3 Layered Bloom filter
      4. 4.11.4 Compressed Bloom filter
      5. 4.11.5 Scalable Bloom filter
    12. Summary
  12. 5 Disjoint sets: Sub-linear time processing
    1. 5.1 The distinct subsets problem
    2. 5.2 Reasoning on solutions
    3. 5.3 Describing the data structure API: Disjoint set
    4. 5.4 Naïve solution
      1. 5.4.1 Implementing naïve solution
    5. 5.5 Using a tree-like structure
      1. 5.5.1 From list to trees
      2. 5.5.2 Implementing the tree version
    6. 5.6 Heuristics to improve the running time
      1. 5.6.1 Path compression
      2. 5.6.2 Implementing balancing and path compression
    7. 5.7 Applications
      1. 5.7.1 Graphs: Connected components
      2. 5.7.2 Graphs: Kruskal’s algorithm for minimum spanning tree
      3. 5.7.3 Clustering
      4. 5.7.4 Unification
    8. Summary
  13. 6 Trie, radix trie: Efficient string search
    1. 6.1 Spell-check
      1. 6.1.1 A prncess, a Damon, and an elf walkz into a bar
      2. 6.1.2 Compression is the key
      3. 6.1.3 Description and API
    2. 6.2 Trie
      1. 6.2.1 Why is it better again?
      2. 6.2.2 Search
      3. 6.2.3 Insert
      4. 6.2.4 Remove
      5. 6.2.5 Longest prefix
      6. 6.2.6 Keys matching a prefix
      7. 6.2.7 When should we use tries?
    3. 6.3 Radix tries
      1. 6.3.1 Nodes and edges
      2. 6.3.2 Search
      3. 6.3.3 Insert
      4. 6.3.4 Remove
      5. 6.3.5 Longest common prefix
      6. 6.3.6 Keys starting with a prefix
    4. 6.4 Applications
      1. 6.4.1 Spell-checker
      2. 6.4.2 String similarity
      3. 6.4.3 String sorting
      4. 6.4.4 T9
      5. 6.4.5 Autocomplete
    5. Summary
  14. 7 Use case: LRU cache
    1. 7.1 Don’t compute things twice
    2. 7.2 First attempt: Remembering values
      1. 7.2.1 Description and API
      2. 7.2.2 Fresh data, please
      3. 7.2.3 Handling asynchronous calls
      4. 7.2.4 Marking cache values as “Loading”
    3. 7.3 Memory is not enough (literally)
    4. 7.4 Getting rid of stale data: LRU cache
      1. 7.4.1 Sometimes you have to double down on problems
      2. 7.4.2 Temporal ordering
      3. 7.4.3 Performance
    5. 7.5 When fresher data is more valuable: LFU
      1. 7.5.1 So how do we choose?
      2. 7.5.2 What makes LFU different
      3. 7.5.3 Performance
      4. 7.5.4 Problems with LFU
    6. 7.6 How to use cache is just as important
    7. 7.7 Introducing synchronization
      1. 7.7.1 Solving concurrency (in Java)
      2. 7.7.2 Introducing locks
      3. 7.7.3 Acquiring a lock
      4. 7.7.4 Reentrant locks
      5. 7.7.5 Read locks
      6. 7.7.6 Other approaches to concurrency
    8. 7.8 Cache applications
    9. Summary
  15. Part 2. Multidimensional queries
  16. 8 Nearest neighbors search
    1. 8.1 The nearest neighbors search problem
    2. 8.2 Solutions
      1. 8.2.1 First attempts
      2. 8.2.2 Sometimes caching is not the answer
      3. 8.2.3 Simplifying things to get a hint
      4. 8.2.4 Carefully choose a data structure
    3. 8.3 Description and API
    4. 8.4 Moving to k-dimensional spaces
      1. 8.4.1 Unidimensional binary search
      2. 8.4.2 Moving to higher dimensions
      3. 8.4.3 Modeling 2-D partitions with a data structure
    5. Summary
  17. 9 K-d trees: Multidimensional data indexing
    1. 9.1 Right where we left off
    2. 9.2 Moving to k-D spaces: Cycle through dimensions
      1. 9.2.1 Constructing the BST
      2. 9.2.2 Invariants
      3. 9.2.3 The importance of being balanced
    3. 9.3 Methods
      1. 9.3.1 Search
      2. 9.3.2 Insert
      3. 9.3.3 Balanced tree
      4. 9.3.4 Remove
      5. 9.3.5 Nearest neighbor
      6. 9.3.6 Region search
      7. 9.3.7 A recap of all methods
    4. 9.4 Limits and possible improvements
    5. Summary
  18. 10 Similarity Search Trees: Approximate nearest neighbors search for image retrieval
    1. 10.1 Right where we left off
      1. 10.1.1 A new (more complex) example
      2. 10.1.2 Overcoming k-d trees’ flaws
    2. 10.2 R-tree
      1. 10.2.1 A step back: Introducing B-trees
      2. 10.2.2 From B-Tree to R-tree
      3. 10.2.3 Inserting points in an R-tree
      4. 10.2.4 Search
    3. 10.3 Similarity search tree
      1. 10.3.1 SS-tree search
      2. 10.3.2 Insert
      3. 10.3.3 Insertion: Variance, means, and projections
      4. 10.3.4 Insertion: Split nodes
      5. 10.3.5 Delete
    4. 10.4 Similarity Search
      1. 10.4.1 Nearest neighbor search
      2. 10.4.2 Region search
      3. 10.4.3 Approximated similarity search
    5. 10.5 SS+-tree
      1. 10.5.1 Are SS-trees better?
      2. 10.5.2 Mitigating hyper-sphere limitations
      3. 10.5.3 Improved split heuristic
      4. 10.5.4 Reducing overlap
    6. Summary
  19. 11 Applications of nearest neighbor search
    1. 11.1 An application: Find nearest hub
      1. 11.1.1 Sketching a solution
      2. 11.1.2 Trouble in paradise
    2. 11.2 Centralized application
      1. 11.2.1 Filtering points
      2. 11.2.2 Complex decisions
    3. 11.3 Moving to a distributed application
      1. 11.3.1 Issues handling HTTP communication
      2. 11.3.2 Keeping the inventory in sync
      3. 11.3.3 Lessons learned
    4. 11.4 Other applications
      1. 11.4.1 Color reduction
      2. 11.4.2 Particle interaction
      3. 11.4.3 Multidimensional DB queries optimization
      4. 11.4.4 Clustering
    5. Summary
  20. 12 Clustering
    1. 12.1 Intro to clustering
      1. 12.1.1 Types of learning
      2. 12.1.2 Types of clustering
    2. 12.2 K-means
      1. 12.2.1 Issues with k-means
      2. 12.2.2 The curse of dimensionality strikes again
      3. 12.2.3 K-means performance analysis
      4. 12.2.4 Boosting k-means with k-d trees
      5. 12.2.5 Final remarks on k-means
    3. 12.3 DBSCAN
      1. 12.3.1 Directly vs density-reachable
      2. 12.3.2 From definitions to an algorithm
      3. 12.3.3 And finally, an implementation
      4. 12.3.4 Pros and cons of DBSCAN
    4. 12.4 OPTICS
      1. 12.4.1 Definitions
      2. 12.4.2 OPTICS algorithm
      3. 12.4.3 From reachability distance to clustering
      4. 12.4.4 Hierarchical clustering
      5. 12.4.5 Performance analysis and final considerations
    5. 12.5 Evaluating clustering results: Evaluation metrics
      1. 12.5.1 Interpreting the results
    6. Summary
  21. 13 Parallel clustering: MapReduce and canopy clustering
    1. 13.1 Parallelization
      1. 13.1.1 Parallel vs distributed
      2. 13.1.2 Parallelizing k-means
      3. 13.1.3 Canopy clustering
      4. 13.1.4 Applying canopy clustering
    2. 13.2 MapReduce
      1. 13.2.1 Imagine you are Donald Duck . . .
      2. 13.2.2 First map, then reduce
      3. 13.2.3 There is more under the hood
    3. 13.3 MapReduce k-means
      1. 13.3.1 Parallelizing canopy clustering
      2. 13.3.2 Centroid initialization with canopy clustering
      3. 13.3.3 MapReduce canopy clustering
    4. 13.4 MapReduce DBSCAN
    5. Summary
  22. Part 3. Planar graphs and minimum crossing number
  23. 14 An introduction to graphs: Finding paths of minimum distance
    1. 14.1 Definitions
      1. 14.1.1 Implementing graphs
      2. 14.1.2 Graphs as algebraic types
      3. 14.1.3 Pseudo-code
    2. 14.2 Graph properties
      1. 14.2.1 Undirected
      2. 14.2.2 Connected
      3. 14.2.3 Acyclic
    3. 14.3 Graph traversal: BFS and DFS
      1. 14.3.1 Optimizing delivery routes
      2. 14.3.2 Breadth first search
      3. 14.3.3 Reconstructing the path to target
      4. 14.3.4 Depth first search
      5. 14.3.5 It’s queue vs stack again
      6. 14.3.6 Best route to deliver a parcel
    4. 14.4 Shortest path in weighted graphs: Dijkstra
      1. 14.4.1 Differences with BFS
      2. 14.4.2 Implementation
      3. 14.4.3 Analysis
      4. 14.4.4 Shortest route for deliveries
    5. 14.5 Beyond Dijkstra’s algorithm: A*
      1. 14.5.1 How good is A* search?
      2. 14.5.2 Heuristics as a way to balance real-time data
    6. Summary
  24. 15 Graph embeddings and planarity: Drawing graphs with minimal edge intersections
    1. 15.1 Graph embeddings
      1. 15.1.1 Some basic definitions
      2. 15.1.2 Complete and bipartite graphs
    2. 15.2 Planar graphs
      1. 15.2.1 Using Kuratowski’s theorem in practice
      2. 15.2.2 Planarity testing
      3. 15.2.3 A naïve algorithm for planarity testing
      4. 15.2.4 Improving performance
      5. 15.2.5 Efficient algorithms
    3. 15.3 Non-planar graphs
      1. 15.3.1 Finding the crossing number
      2. 15.3.2 Rectilinear crossing number
    4. 15.4 Edge intersections
      1. 15.4.1 Straight-line segments
      2. 15.4.2 Polylines
      3. 15.4.3 Bézier curves
      4. 15.4.4 Intersections between quadratic Bézier curves
      5. 15.4.5 Vertex-vertex and edge-vertex intersections
    5. Summary
  25. 16 Gradient descent: Optimization problems (not just) on graphs
    1. 16.1 Heuristics for the crossing number
      1. 16.1.1 Did you just say heuristics?
      2. 16.1.2 Extending to curve-line edges
    2. 16.2 How optimization works
      1. 16.2.1 Cost functions
      2. 16.2.2 Step functions and local minima
      3. 16.2.3 Optimizing random sampling
    3. 16.3 Gradient descent
      1. 16.3.1 The math of gradient descent
      2. 16.3.2 Geometrical interpretation
      3. 16.3.3 When is gradient descent applicable?
      4. 16.3.4 Problems with gradient descent
    4. 16.4 Applications of gradient descent
      1. 16.4.1 An example: Linear regression
    5. 16.5 Gradient descent for graph embedding
      1. 16.5.1 A different criterion
      2. 16.5.2 Implementation
    6. Summary
  26. 17 Simulated annealing: Optimization beyond local minima
    1. 17.1 Simulated annealing
      1. 17.1.1 Sometimes you need to climb up to get to the bottom
      2. 17.1.2 Implementation
      3. 17.1.3 Why simulated annealing works
      4. 17.1.4 Short-range vs long-range transitions
      5. 17.1.5 Variants
      6. 17.1.6 Simulated annealing vs gradient descent: Which one should I use?
    2. 17.2 Simulated annealing + traveling salesman
      1. 17.2.1 Exact vs approximated solutions
      2. 17.2.2 Visualizing cost
      3. 17.2.3 Pruning the domain
      4. 17.2.4 State transitions
      5. 17.2.5 Adjacent vs random swaps
      6. 17.2.6 Applications of TSP
    3. 17.3 Simulated annealing and graph embedding
      1. 17.3.1 Minimum edge crossing
      2. 17.3.2 Force-directed drawing
    4. Summary
  27. 18 Genetic algorithms: Biologically inspired, fast-converging optimization
    1. 18.1 Genetic algorithms
      1. 18.1.1 Inspired by nature
      2. 18.1.2 Chromosomes
      3. 18.1.3 Population
      4. 18.1.4 Fitness
      5. 18.1.5 Natural selection
      6. 18.1.6 Selecting individuals for mating
      7. 18.1.7 Crossover
      8. 18.1.8 Mutations
      9. 18.1.9 The genetic algorithm template
      10. 18.1.10 When does the genetic algorithm work best?
    2. 18.2 TSP
      1. 18.2.1 Fitness, chromosomes, and initialization
      2. 18.2.2 Mutations
      3. 18.2.3 Crossover
      4. 18.2.4 Results and parameters tuning
      5. 18.2.5 Beyond TSP: Optimizing the routes of the whole fleet
    3. 18.3 Minimum vertex cover
      1. 18.3.1 Applications of vertex cover
      2. 18.3.2 Implementing a genetic algorithm
    4. 18.4 Other applications of the genetic algorithm
      1. 18.4.1 Maximum flow
      2. 18.4.2 Protein folding
      3. 18.4.3 Beyond genetic algorithms
      4. 18.4.4 Algorithms, beyond this book
    5. Summary
  28. appendix A. A quick guide to pseudo-code
    1. A.1 Variables and basics
    2. A.2 Arrays
    3. A.3 Conditional instructions
      1. A.3.1 Else-if
      2. A.3.2 Switch
    4. A.4 Loops
      1. A.4.1 For loop
      2. A.4.2 While loop
      3. A.4.3 Break and continue
    5. A.5 Blocks and indent
    6. A.6 Functions
      1. A.6.1 Overloading and default arguments
      2. A.6.2 Tuples
      3. A.6.3 Tuples and destructuring objects
  29. appendix B. Big-O notation
    1. B.1 Algorithms and performance
    2. B.2 The RAM model
    3. B.3 Order of magnitude
    4. B.4 Notation
    5. B.5 Examples
  30. appendix C. Core data structures
    1. C.1 Core data structures
    2. C.2 Array
    3. C.3 Linked List
    4. C.4 Tree
      1. C.4.1 Binary search trees
    5. C.5 Hash table
      1. C.5.1 Storing key-value pairs
      2. C.5.2 Hashing
      3. C.5.3 Conflicts resolution in hashing
      4. C.5.4 Performance
    6. C.6 Comparative analysis of core data structures
  31. appendix D. Containers as priority queues
    1. D.1 Bag
    2. D.2 Stack
    3. D.3 Queue
    4. D.4 A comparative analysis of containers
  32. appendix E. Recursion
    1. E.1 Simple recursion
      1. E.1.1 Pitfalls
      2. E.1.2 Good recursion
    2. E.2 Tail recursion
    3. E.3 Mutual recursion
  33. appendix F. Classification problems and randomized algorithm metrics
    1. F.1 Decision problems
    2. F.2 Las Vegas algorithms
    3. F.3 Monte Carlo algorithms
    4. F.4 Classification metrics
      1. F.4.1 Accuracy
      2. F.4.2 Precision and recall
      3. F.4.3 Other metrics and recap
  34. index
  35. inside back cover

Product information

  • Title: Advanced Algorithms and Data Structures
  • Author(s): Marcello La Rocca
  • Release date: July 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617295485