
Identity and Data Security for Web Development by Tim Messerschmidt, Jonathan LeBlanc


Chapter 1. Introduction

One of the most important investments that you can make in a system, company, or application is in your security and identity infrastructures. We can’t go a week without hearing about another user/customer data breach, stolen credit cards, or identity theft. Even though you can put an entire series of hurdles in the way of a potential attacker, the possibility will always exist that your databases will be breached, information will be stolen, and an attacker will attempt to crack the sensitive data that is stored (if encrypted).

There is no bulletproof, secure method for protecting your data. Identity and data security has always been about mitigating risk, protecting the secure data, and buying yourself enough time to take action and reduce damage if something like this should ever happen to you.

As we dive down into the concepts, technology, and programming methodologies behind building a secure interface for data and identity, you will explore the trade-offs and core concepts that you need to understand as you embark on making those final decisions about your security. The best place to start is to explore the major problems with identity and data security in the industry right now.

The Problems with Current Security Models

The current state of industry security is not one in which the technology can’t keep up with the potential attack vectors; it’s one in which development choices lead us down a path of weak systems. One of the biggest mistakes that many of us make is to assume that users will understand how to protect their own accounts, such as with strong password choices or two-factor authentication, or that, even if they do understand, they won’t pick the most usable choice over the most secure one. We, as developers, have to protect our users in the same way that we try to protect our systems, and we must assume that users will not do that for themselves.

To do that, we have to purge a few misconceptions from our heads:

Users will always use the most secure options

The simple fact is that the worst thing to count on is that users will be able, or willing, to use the option that best secures their data. The onus has to be on the site or service owner to ensure that data provided by users for their security (such as a password) is hardened so that minimum levels of security are imposed (see more about data encryption and security in Chapter 2). For instance, when two-factor authentication services are offered, a typical adoption rate is roughly 5% to 10% of users.

We should always make systems more secure, at the cost of usability

This is typically one of the reactions to the preceding point—to make a system as secure as possible, at the cost of usability of the system for the user. This is simply not the case; numerous mechanisms can be put in place to enhance security without drastically affecting the user. We’ll explore this further in “Security over Usability”.

Our security will never be breached

From startups to large companies, many engineers have put too much faith in the security of their systems. This has led to lax data encryption standards, meaning that personal and privileged information, such as credit card data, home addresses, etc., is stored as cleartext—data that is not encrypted in any way. When the system is breached, hackers have to put in no effort to capture and use that data.

Assume Your Data Will Be Stolen and Use Proper Data Encryption

In June 2015, a massive breach of US government data was said to expose the personal information of millions of government workers, because the data itself was not encrypted.1 No matter how big you are, you should always assume that the possibility exists that your database security will be breached, and data stolen. All sensitive information should always be properly encrypted.

Let’s drill down into some of these issues a bit further to see the cause and effect of the choices we make as users and developers.

Poor Password Choices

As we stated previously, users are notorious for choosing highly unsecure passwords for their accounts. To expand on that proof point, let’s look at the top passwords of 2015 (listed in Table 1-1), compiled by SplashData from files containing millions of stolen passwords that have been posted online during the previous year.2

Before we get too far up in arms about people choosing these passwords, we need to be aware of some possible issues with the data used to compile this list:

  • Because most of this data comes from information leaks, it could be that these passwords are just easier to crack through dictionary or brute-force attacks.

  • We don’t know the source of much of this data, so we can’t validate the security measures in place on the sites or services.

  • The data may contain anomalies or simply bad data. If a default password is being set by a service with a lot of leaked data (and never changed), it will push it higher on the list. If we are analyzing data from multiple sources using information that was poorly parsed, or has those anomalies, the list will be skewed.

With that said, even though those passwords may constitute a smaller number than the lists purport them to be, and the data may be highly skewed, they still exist. When building a data and identity security system, you have to provide an adequate level of protection for these people. Typically, you want to build for the weakest possible authentication system, which, depending on your security requirements, might comprise this list.

In many ways this is because of what we expect of people when they are creating a password: provide a password with mixed case, at least one symbol and number, and nothing recognizable in a dictionary or guessable by those who know you. These types of expectations create poor usability for users, in that they won’t be able to remember the password, and all but ensure that they either pick the easiest way they can to enter the site, or write down that complex password on a Post-it note on their display. Usability needs to be a part of identity security for it to be effective.

Security over Usability

Favor security too much over the experience and you’ll make the website a pain to use.

Anthony T, founder of UX Movement

Your main objective when handling the data and identity of your users is to ensure their security, but at the same time you don’t want to alienate your entire user base by making your sign-in forms complex, or by forcing a multiscreen, manual checkout process for purchasing goods, or by continually challenging users for identification details as they are trying to use your service. Those are surefire ways of ensuring that your users never return.


Some of the main reasons for shopping-cart abandonment include users being uncomfortable with the buying process (it is too complex or lengthy) or being forced to sign up before purchasing. Many of these concerns can be solved through usability considerations, such as a single-page checkout and a simplified guest checkout.

The concept of usability versus security is always a balancing act. You need to ensure that you have a high-enough confidence in the security of your users, and at the same time do as much behind the scenes as you can so that they aren’t forced to break out of the experience of your site to continually verify themselves.

Here are some of the questions that we can ask ourselves when thinking this through:

  • Can I obtain identity information to increase my confidence that the user is who she says she is, without imposing additional security checks?

  • If I have a high confidence that the user is who she says she is, can I build a more usable experience for that user versus one that I have no confidence in?

  • What content requires user identification, and when should I impose additional levels of security to verify that?

We’ll explore these concepts further in Chapter 3, as you learn about trust zones and establishing identity information on a user.

Improper Data Encryption

Data security and identification isn’t about planning for the best, it’s about planning for the worst. If there is the possibility of something happening, you should assume that it will happen and have a plan in place to decrease or mitigate the damage that is done.

On March 27, 2015, Slack announced that its systems had been breached, and user information was stolen. The damage of the security incident was lessened because of its strong data encryption methods. From the company’s blog on the incident, “Slack maintains a central user database that includes usernames, email addresses, and one-way encrypted (hashed) passwords. Slack’s hashing function is bcrypt with a randomly generated salt per password, which makes it computationally infeasible that your password could be re-created from the hashed form.” In addition, following this incident, Slack introduced two-factor authentication for users, as well as a password kill switch for team owners that automatically logged out all users, on all devices, and forced them to create a new password.

In this case, data encryption and quick action prevented a massive theft of user accounts, and lessened the damage to Slack’s credibility and the confidence its users had in the company. Data encryption isn’t always about trying to prevent data from being stolen; it’s meant to slow down hackers long enough to make it infeasible for them to decrypt massive amounts of data, or to delay them until you can take appropriate action.

The Weakest Link: Human Beings

As developers and service providers, our biggest interest should be treating our users’ data with the most respect we can provide. Hence, we try to secure any kind of information a user provides to us by using encryption algorithms, offer safe ways to communicate, and continuously harden our infrastructure in an ongoing struggle.

The most important element in this chain, the human being, is often taken out of the equation. Therefore, we open up our application to threats that we might not have considered when laying out and designing our security layer. The truth is, users tend to go the easy way. People are likely to choose easy-to-remember and short passwords, simple-to-guess usernames, and might not be educated about current authentication technology like two-factor authentication (also known as 2FA). We discuss two-factor authentication in depth in Chapter 5—it certainly deserves extra attention and focus. We will also discuss a technology derived from 2FA, called n-factor authentication, which represents a scalable security approach depending on the use case.

It is easy to understand why people tend to use and especially reuse simple passwords—it saves them time while setting up user profiles and makes authenticating against services and applications an easy task. Especially with the rise of mobile technology, users are often faced with small screen real estate and touchscreen keyboards, which can add an additional burden.

The phenomenon described here is also known as password fatigue. Fortunately, there are multiple tools that we, as developers, can use to counter these problems and ensure a smooth and pleasing registration and authentication flow within our applications while still maintaining user security.


Many operating systems, browsers, and third-party applications try to solve password fatigue by allowing users to generate randomized passwords and by offering a way to store those passwords under protection of a master password.

A popular example is the password-management application Keychain that was introduced with Mac OS 8.6. Keychain is deeply integrated into OS X and nowadays in iOS (via iCloud) and allows for storing various types of data including credit cards, passwords, and private keys.

More and more services like 1Password, Dashlane, and LastPass offer to generate passwords for their users. This removes the need for users to come up with a secure password and is often seen as a convenient way to speed up user account registration.

Katie Sherwin, a member of the Nielsen Norman Group, proposes simplifying password authentication flows through three approaches that improve user experience:3

  • Show the rules

  • Show the user input

  • Show strength meters

By applying these three rules, we can ensure that users feel comfortable with the passwords they use and get a clear indication about the password’s strength. Further research indicates that users who see a strength meter choose more secure passwords—even if the strength indicator is not implemented that well.4

Those who saw a meter tended to choose stronger passwords than those who didn’t, but the type of meter did not make a significant difference.

Dinei Florencio, Cormac Herley, and Paul C. van Oorschot,
“An Administrator’s Guide to Internet Password Research”

Single Sign-on

Single sign-on, also known as SSO, is a technology that leverages existing user accounts in order to authenticate against various services. The idea behind this concept is prefilling and securing a central user account instead of forcing the user to register at a variety of services over and over again.

Common choices that try to accommodate the wish to reuse user profiles to either provide profile information or to simply authenticate against other services include OpenID, OAuth 1.0, OAuth 2.0, and various hybrid models like OpenID Connect. In Chapter 4 we will focus on a selection of authentication techniques and will discuss the technical implementation details as well as the security implications.

Understanding Entropy in Password Security

Before we get too far into the weeds, we should first address how we can distinguish a weak password from a strong one, if that password was created by a human being. The standard industry mechanism for determining password strength is called information entropy, which is measured in the number of bits of information in a provided source, such as a password.


Typically, if you are using passphrases, a good level of entropy to have at minimum is 36.86 bits, which coincides with the average entropy level of 3 random words selected from a list of 5,000 possible unique words.

Password entropy is a measurement of how unpredictable a password is. This measurement is based on a few key characteristics:

  • The symbol set that is used

  • The expansion of the symbol set through lowercase/uppercase characters

  • Password length

Using this information, password entropy, expressed in bits, is used to predict how difficult it would be for the password to be cracked through guessing, dictionary attacks, brute-force attacks, etc.
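As a rough illustration of how these three characteristics combine, the following Python sketch estimates the symbol set from the character classes present in a password and multiplies the per-symbol entropy by the length. The function names are hypothetical, and the result is only an upper bound: it assumes that every character was chosen independently and at random, which is rarely true of passwords that people pick.

```python
import math
import string

# ASCII character classes and their sizes: 26, 26, 10, and 32 symbols.
CHARACTER_CLASSES = (
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.digits,
    string.punctuation,
)


def estimated_pool_size(password: str) -> int:
    """Add up the sizes of the character classes that appear in the password."""
    return sum(
        len(char_class)
        for char_class in CHARACTER_CLASSES
        if any(ch in char_class for ch in password)
    )


def naive_entropy_bits(password: str) -> float:
    """Upper-bound estimate: length times log2 of the estimated symbol set."""
    pool = estimated_pool_size(password)
    return len(password) * math.log2(pool) if pool else 0.0


print(estimated_pool_size("monkey"))    # 26 (lowercase only)
print(estimated_pool_size("Monkey1"))   # 62 (lowercase, uppercase, digits)
```

Adding a single uppercase letter and a digit more than doubles the assumed symbol set, which is why composition rules look attractive on paper even though people tend to apply them in predictable ways.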

When you are looking at determining overall password entropy, there are two main ways of generating passwords that we should explore: randomly generated passwords (computer generated) and human-selected passwords.


According to “A Large-Scale Study of Web Password Habits,” by Dinei Florencio and Cormac Herley of Microsoft Research, the entropy level of the average password is estimated to be 40.54 bits.5

Entropy in Randomly Selected Passwords

When we look at randomly selected passwords (computer generated), the process for determining the overall entropy of the passwords is fairly straightforward because there is no unpredictable human element involved. Depending on the symbol set that we use, we can build a series of passwords with a desired level of entropy fairly easily.

First, the generally accepted formula that we use to calculate entropy is $H = \log_2(b^l)$, where:


  • H = The password entropy, measured in bits

  • b = The number of possible symbols in the symbol set

  • l = The number of symbols in the password (or length)
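A minimal Python sketch of this formula follows. The function name entropy_bits is an illustrative choice, and the calculation relies on the identity that the base-2 logarithm of b raised to the power l equals l times the base-2 logarithm of b, which avoids building very large intermediate numbers.

```python
import math


def entropy_bits(symbol_count: int, length: int) -> float:
    """Return H = log2(b ** l) in bits, computed as l * log2(b)."""
    if symbol_count < 1 or length < 0:
        raise ValueError("symbol_count must be >= 1 and length must be >= 0")
    return length * math.log2(symbol_count)


# Three words drawn at random from a list of 5,000 unique words.
print(round(entropy_bits(5000, 3), 2))   # 36.86
# A 12-character case-sensitive alphanumeric password (62 symbols).
print(round(entropy_bits(62, 12), 2))    # 71.45
```

The first call reproduces the 36.86-bit passphrase figure quoted earlier in this section.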

To come up with the value of b, we can simply choose the symbol set that we are using from Table 1-2.

Table 1-2. Entropy for each symbol in a symbol set

Symbol set name                                    Number of symbols in set    Entropy per symbol (in bits)
Arabic numerals (0–9)                              10                          3.322
Hexadecimal numerals (0–9, A–F)                    16                          4.000
Case-insensitive Latin alphabet (a–z or A–Z)       26                          4.700
Case-insensitive alphanumeric (a–z or A–Z, 0–9)    36                          5.170
Case-sensitive Latin alphabet (a–z, A–Z)           52                          5.700
Case-sensitive alphanumeric (a–z, A–Z, 0–9)        62                          5.954
All ASCII printable characters                     95                          6.570
All extended ASCII printable characters            218                         7.768
Binary (0–255 or 8 bits or 1 byte)                 256                         8.000
Diceware word list                                 7,776                       12.925 (per word)

The symbol set you might not be familiar with is the diceware word list. The method behind diceware is to use a single die (from a pair of dice), and roll it five times. The numeric values on the die each time create a five-digit number (e.g., 46231, matching the value of each individual roll). This number is then used to look up a word from a given word list. There are 7,776 possible unique words using this method. See the diceware word list for the complete reference.
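The dice-rolling procedure can be sketched in Python as follows. This is only an illustration: the names are hypothetical, the secrets module stands in for a physical die, and the word list here is a placeholder that maps every possible key to a dummy string, where a real diceware list would map each key to an actual word.

```python
import itertools
import secrets


def roll_key(rolls: int = 5) -> str:
    """Simulate rolling one die `rolls` times and join the faces into a key."""
    return "".join(str(secrets.randbelow(6) + 1) for _ in range(rolls))


# Placeholder list covering every five-roll key, from "11111" to "66666".
PLACEHOLDER_WORDS = {
    "".join(faces): "word" + "".join(faces)
    for faces in itertools.product("123456", repeat=5)
}


def passphrase(word_count: int, words: dict = PLACEHOLDER_WORDS) -> str:
    """Look up one word per five-roll key and join the words with spaces."""
    return " ".join(words[roll_key()] for _ in range(word_count))


print(len(PLACEHOLDER_WORDS))   # 7776, that is, 6 ** 5 possible keys
print(passphrase(4))            # four placeholder words; output varies per run
```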

Using the formula, length of the password, and numbers of symbols in a given symbol set, you can estimate the bits of entropy from a randomly generated password.
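Turning the formula around gives the length needed to reach a desired level of entropy. The sketch below is one way to do that in Python, assuming a 62-symbol case-sensitive alphanumeric set and using the standard library's secrets module for the random choices; the function names and the 64-bit target are illustrative.

```python
import math
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits  # 62 symbols


def length_for_entropy(target_bits: float, symbol_count: int) -> int:
    """Smallest length l such that l * log2(symbol_count) >= target_bits."""
    return math.ceil(target_bits / math.log2(symbol_count))


def random_password(target_bits: float, symbols: str = ALPHANUMERIC) -> str:
    """Draw each character independently and uniformly from `symbols`."""
    length = length_for_entropy(target_bits, len(symbols))
    return "".join(secrets.choice(symbols) for _ in range(length))


print(length_for_entropy(64, 62))   # 11 characters from the 62-symbol set
print(length_for_entropy(64, 10))   # 20 digits for the same target
print(random_password(64))          # output varies per run
```

The smaller the symbol set, the longer the password has to be to reach the same number of bits.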

Entropy in Human-Selected Passwords

Before we get into measuring entropy levels within a password that was created by a human being, rather than being randomly generated based on security standards, we need to understand that arriving at these numbers is nontrivial. Many methods have been proposed for doing so (NIST, Shannon Entropy, Guessing Entropy, etc.), but most of these fall short in one way or another.

Shannon Entropy is seen to give an overly optimistic view of password security (while providing no real actionable improvement hints), and NIST an inaccurate (yet conservative) one. Because we always want to err on the side of caution with password security, let’s quickly look at the NIST study on how to measure human-selected passwords, as that will give us a good starting point.

According to NIST special publication 800-63-2, if we take a human-selected password, we can measure the assumed entropy with the following guidelines:6

  • The entropy of the first character is 4 bits.

  • The entropy of the next 7 characters is 2 bits per character (they state that this is “roughly consistent with Shannon’s estimate that when statistical effects extending over not more than 8 letters are considered, the entropy is roughly 2.3 bits per character”).

  • Characters 9 through 20 have an entropy of 1.5 bits per character.

  • Characters 21 and above have an entropy of 1 bit per character.

  • A 6-bit bonus is given to password rules that require both uppercase and nonalphabetic characters. (This is also a conservative bit estimate, as the NIST publication notes that these special characters will most likely come at the beginning or end of the password, reducing the total search space.)

  • An additional 6-bit bonus is given to passwords with a length of 1 to 19 characters that follow an extensive dictionary check to ensure the password is not contained within a large dictionary. Passwords of 20 characters or more do not receive this bonus because they are assumed to consist of multiple dictionary words placed together into passphrases.

Let’s take that idea and see what the entropy of a few examples would be:

monkey (6 characters) = 14 bits of entropy

4 bits for the first character, 10 bits for the following 5 characters

Monkey1 (7 characters) = 22 bits of entropy

4 bits for the first character, 12 bits for the following 6 characters, 6-bit bonus for uppercase and nonalphabetic characters being used

tvMD128!Rrsa (12 characters) = 36 bits of entropy

4 bits for the first character, 14 bits for the following 7 characters, 6 bits for the following 4 characters, 6-bit bonus for uppercase and nonalphabetic characters being used, 6-bit bonus for a nondictionary string within 1–19 characters

tvMD128!aihdfo#Jh43 (19 characters) = 46.5 bits of entropy

4 bits for the first character, 14 bits for the following 7 characters, 16.5 bits for the following 11 characters, 6-bit bonus for uppercase and nonalphabetic characters being used, 6-bit bonus for a nondictionary string within 1–19 characters

tvMD128!aihdfo#Jh432 (20 characters) = 42 bits of entropy

4 bits for the first character, 14 bits for the following 7 characters, 18 bits for the following 12 characters, 6-bit bonus for uppercase and nonalphabetic characters being used

You can start to see some holes in the assumptions that the NIST study makes with the last two password examples. First, one additional character causes the loss of 6 bonus bits of entropy, because a password of that length is assumed to be a passphrase rather than a complex string that a user chose. Second, the study assumes that a string of that length is most likely several dictionary words put together, such as “treemanicuredonkeytornado,” which, based on the NIST study, would give us 41 bits of entropy.
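The guidelines and worked examples above can be reproduced with a short Python sketch. The function name and the passes_dictionary_check flag are hypothetical; a real implementation would need an actual dictionary lookup, which is left to the caller here. As in the examples, the composition bonus is applied whenever the password itself contains both an uppercase and a nonalphabetic character.

```python
def nist_entropy_bits(password: str, passes_dictionary_check: bool = False) -> float:
    """Estimate entropy using the per-character rules listed above."""
    n = len(password)
    if n == 0:
        return 0.0

    bits = 4.0                                 # first character
    bits += 2.0 * max(0, min(n, 8) - 1)        # characters 2 through 8
    bits += 1.5 * max(0, min(n, 20) - 8)       # characters 9 through 20
    bits += 1.0 * max(0, n - 20)               # characters 21 and above

    has_upper = any(ch.isupper() for ch in password)
    has_non_alpha = any(not ch.isalpha() for ch in password)
    if has_upper and has_non_alpha:
        bits += 6.0                            # composition bonus

    if passes_dictionary_check and n < 20:
        bits += 6.0                            # dictionary bonus, 1 to 19 characters only

    return bits


print(nist_entropy_bits("monkey"))                                  # 14.0
print(nist_entropy_bits("tvMD128!aihdfo#Jh43", True))               # 46.5
print(nist_entropy_bits("tvMD128!aihdfo#Jh432", True))              # 42.0
print(nist_entropy_bits("treemanicuredonkeytornado"))               # 41.0
```

The drop from 46.5 to 42 bits when one character is added is the first of the two holes described above.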

As we go further, you can see why determining the security of a human-created password can be tricky, and that’s because humans are unpredictable. If we plug a system of security requirements into a computer-generated password system, and store that in a password vault application like 1Password, KeePass, or LastPass, then we can have a very predictable environment. That’s why, for the most part, we usually take one of two steps (sometimes both) in securing identity in web development:

  1. You require users, when they create their password, to strengthen their login. This can be requirements for length, nonalphabetic characters, uppercase and lowercase characters, nondictionary words, etc. For obvious reasons, the usability of this solution is quite bad, and it may alienate many users, but the security increases. The problem here is that when we make it harder to create a password, the user will more likely forget that password, and then require the use of the “forgot your password” reset flow.

  2. You attempt to harden the data, as best you can, behind the scenes. This usually involves encryption, salting, and key stretching (all concepts we will dive into in Chapter 2), to try to help prevent weak passwords that are stolen from being compromised. When you have a solution like this, you may also see a mechanism that allows only a certain number of login attempts before temporarily locking the account, to prevent potential brute-force attacks against weak passwords. This solution is higher on the usability side, because users can pick practically any password they want, but lowers the overall security of their account.
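The temporary lockout mentioned in the second step might look something like the following Python sketch. Everything here is an assumption made for illustration: the class name, the limit of five attempts, the five-minute lockout, and the in-memory dictionaries, which a real service would replace with shared, persistent storage.

```python
import time
from typing import Optional


class LoginThrottle:
    """Lock an account for a while after too many consecutive failed logins."""

    def __init__(self, max_attempts: int = 5, lockout_seconds: float = 300.0):
        self.max_attempts = max_attempts
        self.lockout_seconds = lockout_seconds
        self._failures = {}       # username -> consecutive failed attempts
        self._locked_until = {}   # username -> time at which the lock expires

    def is_locked(self, username: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        expires = self._locked_until.get(username)
        return expires is not None and now < expires

    def record_failure(self, username: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._failures[username] = self._failures.get(username, 0) + 1
        if self._failures[username] >= self.max_attempts:
            # Start the lockout window and reset the counter for the next round.
            self._locked_until[username] = now + self.lockout_seconds
            self._failures[username] = 0

    def record_success(self, username: str) -> None:
        self._failures.pop(username, None)
        self._locked_until.pop(username, None)
```

A login handler would call is_locked before checking the password, then record_failure or record_success depending on the outcome. The optional now parameter exists only to make the behavior easy to test.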

In the end, we’re back to questions of usability versus security, and the truth of the matter is that our ideal scenario, for all parties, is somewhere in between. Remember, the two aren’t mutually exclusive.

Breaking Down System Usage of a Username and Password

Another important step in understanding the concept of a username and password is to break down what they represent in an identification system. If we put this simply, they are an identification of who you are (the username, or public key) and then a verification of that fact with something that only you should know (the password, or private key).

With that understanding in place, there are two ways that we can think about handling data in an authentication system:

Harden the system

In this case, we take an existing (or new) system that is built on top of a traditional username and password, and attempt to strengthen it.

Remove the username and password

In new or innovative technology solutions, this is the case where we apply the concepts of a username and password, but do so in a different way.

As we dive further into each chapter, our main goals will be to build upon these two concepts, focusing on hardening the system, or finding a new methodology for building our identity and data security with new tools and techniques.

Securing Our Current Standards for Identity

Enhancing the security of an existing system is usually the choice of most of us, as we are building on top of existing work, or building a product that uses a username and password as the preferred login mechanism for users.

As we explored earlier in this chapter, users are usually the worst people to put in charge of protecting their own security through their passwords. The vast majority of the population will choose passwords that they can remember, which is almost always the complete opposite of what we would traditionally think of as a secure password.

You know from earlier sections how to approximate the predictability of a password, and that you should always build security toward the most unsecure element in the chain, not the average. With that said, there are certain standard mechanisms that we use for account security, and others that we should avoid.

Good and Bad Security Algorithms

Not all encryption algorithms are created equal when it comes to the security of our data and privileged user information. Some are built for speed, for quickly and accurately encrypting and decrypting large amounts of data. Others are designed to be slow. Let’s say your database of a million encrypted user records has been stolen, and the attacker is attempting to crack the encryption, such as by trying every word in the dictionary, to reveal the data underneath. Would you prefer to make this as fast as possible or as slow as possible? The correct answer is that you want this process to be as slow as possible for the attacker.

With regular cryptographic hash functions, an attacker can guess billions of passwords per second. With password security hashing algorithms, depending on the configuration, the attacker may be able to guess only a few thousand passwords per second, which is a massive difference.
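To put that difference in perspective, the following Python sketch works through the arithmetic for one hypothetical case: exhausting every 8-character lowercase password. The guess rates of one billion and one thousand per second are round numbers picked to match "billions" and "a few thousand"; they are not measurements of any particular algorithm or hardware.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def seconds_to_exhaust(symbol_count: int, length: int, guesses_per_second: float) -> float:
    """Time needed to try every password of the given length and symbol set."""
    return symbol_count ** length / guesses_per_second


keyspace = 26 ** 8                                   # 208,827,064,576 candidates
fast_seconds = seconds_to_exhaust(26, 8, 1e9)        # about 209 seconds
slow_years = seconds_to_exhaust(26, 8, 1e3) / SECONDS_PER_YEAR   # about 6.6 years

print(f"{keyspace:,} candidates")
print(f"fast hash: {fast_seconds:.0f} seconds")
print(f"slow hash: {slow_years:.1f} years")
```

Under these assumptions the same search goes from a few minutes to several years, and that is for a single password.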

The good

The following hashing algorithms are meant to be used for password security, and are built to be purposefully slow to make cracking the data harder:


PBKDF2

PBKDF2 stands for Password-Based Key Derivation Function 2, and was created by RSA Laboratories. It applies a pseudorandom function, such as a hash, cipher, or HMAC, to the input (password) along with a salt. The process is repeated many times, which produces a derived key.

bcrypt

Created by Niels Provos and David Mazières, bcrypt is a password-hashing function based on the Blowfish cipher. It incorporates a salt into the process to protect the key, and also has an interesting adaptive functionality to it. Over time, the iteration count can be increased to make it slower, so it remains resistant to brute-force attacks.

scrypt

Created by Colin Percival, scrypt is another key derivation function that is designed to combat large-scale hardware attacks by requiring high amounts of memory and therefore slowing down computation.
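Two of these algorithms are available in Python's standard hashlib module, so a minimal sketch can show the shared pattern of a random salt per password plus a tunable cost. The helper names, the 200,000-iteration count, and the scrypt parameters below are illustrative assumptions rather than recommendations, and hashlib.scrypt is only present when Python is built against a suitable OpenSSL. bcrypt is left out because it requires a third-party package.

```python
import hashlib
import hmac
import os

PBKDF2_ITERATIONS = 200_000  # illustrative assumption; tune for your own hardware


def new_salt() -> bytes:
    """Generate a fresh random salt; use a different one for every password."""
    return os.urandom(16)


def pbkdf2_hash(password: str, salt: bytes, iterations: int = PBKDF2_ITERATIONS) -> bytes:
    """Derive a key by repeatedly applying HMAC-SHA-256 to the password and salt."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def scrypt_hash(password: str, salt: bytes) -> bytes:
    """Derive a key with scrypt; n and r here imply roughly 16 MiB of memory."""
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=2 ** 14,
        r=8,
        p=1,
        maxmem=64 * 1024 * 1024,
        dklen=32,
    )


def verify(candidate_key: bytes, stored_key: bytes) -> bool:
    """Compare derived keys in constant time."""
    return hmac.compare_digest(candidate_key, stored_key)


salt = new_salt()
stored = pbkdf2_hash("monkey", salt)
print(verify(pbkdf2_hash("monkey", salt), stored))    # True
print(verify(pbkdf2_hash("Monkey1", salt), stored))   # False
```

The salt and the cost parameters are stored next to the derived key so that the same computation can be repeated at login, and so that the cost can be raised later.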

The bad (for passwords)

The following are our standard cryptographic hashing algorithms, which are meant to be fast. In the case of password security, this is not a good scenario, because a fast algorithm lets an attacker test guesses quickly, whereas a slow one makes it much harder to crack the data:


MD5

MD5, or message-digest algorithm, was designed by Ronald Rivest in 1991, and produces a 128-bit hash value, typically expressed as a 32-digit hexadecimal number.

SHA-1

SHA stands for Secure Hash Algorithm. Designed by the NSA, SHA-1 produces a 160-bit (20-byte) hash value. This hash value is typically rendered as a 40-digit hexadecimal number.

SHA-2

Also designed by the NSA, SHA-2 is the successor of SHA-1, and consists of six hash functions with hash values that are 224, 256, 384, or 512 bits (SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, SHA-512/256).
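The digest sizes quoted above can be checked with Python's hashlib module. In this small sketch the helper name digest_summary is hypothetical, and the usedforsecurity=False flag (available in Python 3.9 and later) is passed only so that MD5 can be inspected on systems that restrict it.

```python
import hashlib


def digest_summary(name: str, data: bytes = b"monkey") -> tuple:
    """Return (digest size in bits, number of hexadecimal digits) for a hash."""
    h = hashlib.new(name, data, usedforsecurity=False)
    return h.digest_size * 8, len(h.hexdigest())


for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
    bits, hex_digits = digest_summary(name)
    print(f"{name}: {bits} bits, {hex_digits} hex digits")
```

Each of these calls returns almost instantly, which is exactly the property that makes these functions a poor choice for storing passwords.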

What Data Should Be Protected?

We’ve hinted at this a few times during this chapter, but when it comes to asking yourself, “What information absolutely needs to be encrypted?” the answer is pretty simple: anything that is personally identifiable (identity data, personal information, payment details), or anything that is imperative to your system that could open up additional leaks or holes in your architecture if released.

Account Recovery Mechanisms and Social Engineering

After we’ve reviewed the details worth protecting, we should take this knowledge into account when looking at recovery mechanisms. Often social engineering or weak recovery mechanisms lead to exposure of information—even though protection mechanisms were implemented in order to prevent exactly this. If you are familiar with these matters, feel free to skip to this chapter’s wrap-up.

Popular examples include customer support providing account details they’re not supposed to share, and badly planned password-reset flows. A compromised email account can lead to easy access to a user’s account—securing our users by offering sensible security questions and allowing them to provide specific responses can help lower the risk of information leaks.

Social engineering is a non-technical method of intrusion hackers use that relies heavily on human interaction and often involves tricking people into breaking normal security procedures. It is one of the greatest threats that organizations today encounter.⁠7

TechTarget SearchSecurity

The Problem with Security Questions

While the overall knowledge and consciousness about secure passwords is steadily growing, another volatile area—security questions—is often ignored. Instead of offering users an array of personal questions or even allowing for the definition of their own security questions, many generic phrases are offered that are often as easy to find out as searching for a person’s social media profile.

Security questions often appear as repetitive and sometimes even inadvertently comedic collections that can be cumbersome to answer and hard to remember (“What was my favorite dish as a child?” “What’s your favorite book?”). Soheil Rezayazdi published a list of Nihilistic Security Questions on McSweeney’s Internet Tendency that should at least cause a slight smile on your face—here are our personal top five:8

  1. When did you stop trying?

  2. In what year did you abandon your dreams?

  3. At what age did your childhood pet run away?

  4. What was the name of your favorite unpaid internship?

  5. What is the name of your least favorite child?

In all seriousness, the impact of social engineering is often completely underestimated or even ignored. It is often easier to talk one’s way past a barrier than to circumvent it technically or break it down. The scope of social engineering can be anything between looking up some facts about a person online and sneaking into office buildings; while this might sound like an exaggeration (and often does not have to happen), it makes sense to prepare and train staff accordingly.

If you are looking for more information on this topic, great resources on social engineering are Kevin Mitnick’s books Ghost in the Wires (Back Bay Books), The Art of Intrusion (Wiley), and The Art of Deception (Wiley).9

Next Up

Now that you understand all of the concepts that we are going to be using and talking about throughout the rest of the chapters, let’s jump into the next chapter by drilling down into how hashing, salting, and data encryption can be added to your systems.
