Chapter 7. Security, Access Control, and Auditing

When Hadoop was getting started, its basic security model might have been described as "build a fence around an elephant": once inside the fence, security was lax. While HDFS has access control mechanisms, security was something of an afterthought in the Hadoop world. More recently, as Hadoop has become mainstream, security issues are being addressed through the development of new tools, such as Sentry and Knox, as well as established mechanisms like Kerberos.

Large, well-established computing systems have methods for access control and authorization, encryption, and audit logging, as mandated by regulations such as HIPAA, FISMA, and PCI.

Authentication answers the question, "Who are you?" Traditional strong authentication methods include Kerberos, Lightweight Directory Access Protocol (LDAP), and Active Directory (AD). These are handled outside of Hadoop, usually on the client side, or within the web server if appropriate.

Authorization answers the question, "What can you do?" Here Hadoop's mechanisms are scattered across components. For example, the MapReduce job queue system stores its authorization rules differently than HDFS, which uses the familiar read/write/execute permissions for user/group/other. HBase provides table- and column-family-level authorization, and Accumulo adds cell-level authorization.
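HDFS's user/group/other model works like POSIX file permissions. The following sketch (plain Python, not the Hadoop API; the function name and arguments are illustrative) shows how an octal mode such as 640 resolves to an access decision:

```python
# Illustrative sketch of an HDFS-style permission check (not the Hadoop
# API). Like POSIX, HDFS stores read/write/execute bits for the file's
# owner, its group, and everyone else.

def may_access(mode, owner, group, user, user_groups, want):
    """mode is an octal int like 0o640; want is 'r', 'w', or 'x'."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:                     # owner bits: highest three
        return bool((mode >> 6) & bit)
    if group in user_groups:              # group bits: middle three
        return bool((mode >> 3) & bit)
    return bool(mode & bit)               # other bits: lowest three

# A file with mode 640 (rw-r-----) owned by alice in group analytics:
print(may_access(0o640, "alice", "analytics", "alice", {"staff"}, "w"))      # True
print(may_access(0o640, "alice", "analytics", "bob", {"analytics"}, "r"))    # True
print(may_access(0o640, "alice", "analytics", "carol", {"staff"}, "r"))      # False
```

In practice you would set and inspect these bits with `hdfs dfs -chmod`, `-chown`, and `-ls` rather than computing them yourself; the sketch only shows how the three permission classes are resolved.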

Data protection generally refers to encryption, both at rest and in transit. HTTP, RPC, JDBC, and ODBC all provide encryption in transit, over the wire.
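As an illustration, wire encryption for Hadoop's RPC and HDFS data-transfer protocols is typically enabled through configuration. The property names below are the standard Hadoop ones, but treat the exact values as a sketch to verify against your distribution's documentation:

```xml
<!-- core-site.xml: protect Hadoop RPC.
     Valid levels: authentication | integrity | privacy (privacy = encrypted) -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the DataNode block data-transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```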
