Chapter 1. The Security Data Lake

Leveraging Big Data Technologies to Build a Common Data Repository for Security

The term data lake comes from the big data community and is appearing more and more often in the security field. A data lake (or data hub) is a central location where all security data is collected and stored; in that respect, it is similar to log management or security information and event management (SIEM). In line with the Apache Hadoop big data movement, one objective of a data lake is to run on commodity hardware and storage, which is cheaper than special-purpose storage arrays or SANs. Furthermore, the lake should be accessible to third-party tools, processes, and workflows, and to the teams across the organization that need the data. In contrast, log management tools do not make it easy to access data through standard interfaces (APIs), nor do they provide a way to run arbitrary analytics code against the data.
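To make "arbitrary analytics code" concrete, here is a minimal sketch of the kind of ad hoc question an analyst might ask once events are openly readable. It assumes, purely for illustration, that events are stored as one JSON object per line; the field names (`src_ip`, `action`, `outcome`), the sample records, and the function `failed_logins_by_source` are all hypothetical and not taken from any particular product.

```python
import json
from collections import Counter

# Hypothetical sample of events as they might sit in a lake file,
# one JSON object per line. Field names and values are assumptions.
SAMPLE_EVENTS = """\
{"src_ip": "10.0.0.5", "action": "login", "outcome": "failure"}
{"src_ip": "10.0.0.5", "action": "login", "outcome": "failure"}
{"src_ip": "10.0.0.7", "action": "login", "outcome": "success"}
{"src_ip": "10.0.0.9", "action": "login", "outcome": "failure"}
{"src_ip": "10.0.0.5", "action": "file_read", "outcome": "success"}
"""


def failed_logins_by_source(lines):
    """Count failed login events per source IP from JSON-lines input."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("action") == "login" and event.get("outcome") == "failure":
            counts[event.get("src_ip", "unknown")] += 1
    return counts


if __name__ == "__main__":
    counts = failed_logins_by_source(SAMPLE_EVENTS.splitlines())
    for ip, n in counts.most_common():
        print(ip, n)
```

The same function would work on any iterable of lines, such as an open file handle, which is the point: when the data is reachable through ordinary interfaces, the analysis is just code.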

Comparing Data Lakes to SIEM

Are data lakes and SIEM the same thing? In short, no. A data lake is not a replacement for SIEM. The concept of a data lake covers data storage and possibly some data processing; the purpose and function of a SIEM cover much more.

The SIEM space was born out of the need to consolidate security data. SIEM architectures quickly showed their weakness: they could not scale to the volume of IT data available, and log management stepped in to deal with the data volumes. Then the big data movement came about and started offering ...
