Chapter 4. OAuth Data Design

OAuth is a distributed architecture, and the authorization server has its own data stores. When you get started with OAuth, an early consideration should be enabling a future-proof data setup. Your authorization server and its data should be deployable in a straightforward manner, without adversely affecting your business data and APIs.

In this chapter, we explain how an authorization server manages data. We then explain the design choices you have for operating this data. In particular, we discuss user accounts and user attributes. We also discuss manageability, so that you get the right level of dependency between your APIs and your authorization server. We then cover some wider topics, related to multi-tenancy and multi-region, so that you take your OAuth data with you as your business data grows. Finally, we show how to use the authorization server’s user management APIs to populate user accounts. First, let’s understand the types of data your authorization server uses.

Authorization Server Data

Your authorization server stores several types of identity data, which you manage separately from your business data. The types can be broadly categorized as configuration data, operational data and user identity data, as shown in Figure 4-1.

Figure 4-1. Data managed by the authorization server

Your configuration data includes client settings. You typically must register clients at the authorization server before they can be issued tokens. For each client you need to define policies such as supported authentication methods. When users log in via a client, the authorization server selects the appropriate authentication methods depending on its configured policy.

Configuration also controls the API permissions included in access tokens for each client, and the crypto keys that the authorization server uses to digitally sign access tokens. Because secret and key management is a common challenge, many cloud providers offer services, such as a vault, to store keys. You can also, for example, use a hardware security module to generate and store key material. In this way the authorization server does not have to store any sensitive key material.

Your authorization server writes operational data due to user activity. This includes token and session-related information, and various types of logs. You keep this operational data only as long as you need it. You use technical logs for support purposes, and audit logs for insights into security events, such as the frequency of failed login attempts.

Your user identity data requires the most design consideration. It is usual to create user accounts in the authorization server so that you can store attributes against users. The authorization server also stores other security-related data against user accounts, such as hashed passwords, or public keys used to verify more advanced types of login credentials. In many use cases, users are created as the result of activity in clients that run OAuth flows where users can sign up. In other use cases, your users may be created by administrator actions or migration processes. Your authorization server can use external identity systems to authenticate users, in which case the authorization server creates a linked account associated with the main user account.

Your authorization server can connect to various data sources. When operating in a cloud native environment, you may connect your authorization server to cloud native SQL or NoSQL databases. You may also be able to use cloud-managed storage. Many authorization servers can use the Lightweight Directory Access Protocol (LDAP) to interact with existing directory services containing user accounts and user credentials.

When getting started with OAuth, you can use the default data stores that the authorization server provides. First, focus on configuring a working integration between your clients and APIs that allows users to authenticate and clients to retrieve access tokens that they can send to APIs. So, let us explain the basic OAuth configuration that you need.

OAuth Configuration Settings

When getting started, manage configuration settings using the authorization server’s administrative user interface, where the meaning of settings is visually clearest. In the admin UI, you first configure a user authentication method. You also register a client. You then assign the user authentication method to the client. Example 4-1 shows a JSON object with the settings for an OAuth client, which include a client ID, client credentials (the client secret) and a list of scopes that the client is allowed to request.

Example 4-1. Basic client configuration
{
    "client_id": "my-web-client",
    "client_secret": "drLChAwreS6&teh7Va?1",
    "redirect_uri": "https://www.example.com/callback",
    "post_logout_redirect_uri": "https://www.example.com/loggedout",
    "scope": "openid profile purchases",
    "authentication_methods": ["username_password"]
}

Once you have configured a client in the authorization server, and the client has implemented a code flow, the client can get tokens. Clients that implement the code flow can also trigger user authentication as we have described in Chapter 2. When sending the protocol messages, the client uses a copy of the configuration settings. The client in Example 4-1 can use OpenID Connect because it supports the openid scope. The client can therefore receive an ID token with information about the authentication event. The client could decide to only use a subset of its allowed scopes, such as omitting openid to use only OAuth, without OpenID Connect.
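As a minimal sketch of how a client might use its copy of these settings, the following Python builds the front channel request that starts a code flow. The authorization endpoint URL is a hypothetical value, not one from this chapter, and a real client would usually also add PKCE parameters, which are omitted here for brevity.

```python
import secrets
from urllib.parse import urlencode

# Hypothetical endpoint: the real URL depends on your authorization server.
AUTHORIZATION_ENDPOINT = "https://login.example.com/oauth/v2/oauth-authorize"


def build_authorization_url(client_id: str, redirect_uri: str, scopes: list[str]) -> tuple[str, str]:
    """Return the code flow request URL and the state value the client must remember."""
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        # The client may request any subset of the scopes registered for it.
        "scope": " ".join(scopes),
        "state": state,
    }
    return f"{AUTHORIZATION_ENDPOINT}?{urlencode(params)}", state


url, state = build_authorization_url(
    "my-web-client",
    "https://www.example.com/callback",
    ["openid", "purchases"],
)
```

Omitting openid from the scopes list would give a plain OAuth request without OpenID Connect.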

Clients that implement the code flow can optionally be configured to require user consent. Enable this for use cases where the user must approve API permissions granted to the client. Doing so causes the authorization server to present a consent screen to the user once authentication completes. Scopes in the authorization server can be configured to be required or optional. For any optional scopes, the user can deselect values in the consent screen. When consent is used, only scopes approved by the user are issued to access tokens.
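The consent rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not how any particular authorization server implements it, and the function and parameter names are assumptions.

```python
def scopes_after_consent(requested: list[str], optional: set[str], deselected: set[str]) -> list[str]:
    """Return the scopes to issue to the access token after the consent screen.

    A scope is dropped only when it is optional and the user deselected it.
    Required scopes are always kept, even if a deselection is submitted for them.
    """
    return [
        scope
        for scope in requested
        if not (scope in optional and scope in deselected)
    ]


granted = scopes_after_consent(
    requested=["openid", "profile", "purchases"],
    optional={"profile"},
    deselected={"profile"},
)
```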

You also design identity attributes to be issued to access tokens. These access token values are called claims and are configured in the authorization server alongside scopes. Each claim is associated with one or more scopes. When a scope is issued to an access token, the claims associated with the scope are also issued to the access token. Each claim is evaluated at runtime and can contain any value that you assign. For clients that implement the code flow, user attributes can be issued as claims. For the example client in Example 4-1 you might decide to associate the purchases scope with a roles claim and a department claim. At runtime, the values of the roles and department claims issued will vary depending on the user interacting with the client. In Chapter 6 we dive deeper into using scopes and claims.
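To make the relationship between scopes and claims concrete, here is a small Python sketch. The mapping and the user attribute values are hypothetical, and a real authorization server evaluates claims with its own configuration rather than code like this.

```python
# Hypothetical configuration: each scope lists the claims associated with it.
SCOPE_CLAIMS = {
    "profile": ["given_name", "family_name"],
    "purchases": ["roles", "department"],
}


def claims_for_token(issued_scopes: list[str], user_attributes: dict) -> dict:
    """Evaluate, at runtime, the claims to issue for the scopes in the access token."""
    claims = {}
    for scope in issued_scopes:
        for claim_name in SCOPE_CLAIMS.get(scope, []):
            if claim_name in user_attributes:
                claims[claim_name] = user_attributes[claim_name]
    return claims


# Hypothetical user: the claim values vary depending on who is logged in.
user = {"given_name": "Dana", "family_name": "Demo", "roles": ["customer"], "department": "sales"}
token_claims = claims_for_token(["openid", "purchases"], user)
```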

The authorization server stores many more configuration settings. These include token formats, lifetimes and signing algorithms. APIs should cryptographically verify the access tokens they receive. Therefore, in the authorization server, you also need to configure a key store for token signing. A default one is provided for you, which you can replace. When required, you can also configure key stores and trust stores used for TLS. This cryptographic material should be different for each stage of your deployment pipeline. Configuration data therefore consists of multiple areas, as shown in Figure 4-2.

Figure 4-2. OAuth configuration settings

Configuration settings are often easiest to manage in plain text storage formats, such as JSON, XML or YAML. This enables you to version control the configuration using an infrastructure-as-code (IaC) approach. You can then store the non-secret settings in source control, and the secret settings in a secure vault. You then deploy the same configuration, with minor differences, down a pipeline, with stages like test, staging and production. You manage the minor differences as environment variables and encrypted secrets. We provide Kubernetes deployment examples that demonstrate these techniques.
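The idea of one shared configuration with small per-stage differences can be sketched as follows. The setting names and environment variable names are hypothetical, and in a real deployment the secret value would come from a vault or an encrypted secret rather than a plain mapping.

```python
import os
from typing import Mapping

# Non-secret settings that could live in source control (hypothetical names).
BASE_CONFIG = {
    "base_url": "https://login.example.com",
    "access_token_lifetime_seconds": 900,
}


def stage_config(base: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Combine the shared configuration with stage-specific values from the environment."""
    config = dict(base)
    # Minor per-stage difference, supplied as an environment variable.
    if "BASE_URL" in env:
        config["base_url"] = env["BASE_URL"]
    # Secrets are never stored in source control, so they only come from the environment.
    config["signing_key_password"] = env["SIGNING_KEY_PASSWORD"]
    return config


test_stage = stage_config(
    BASE_CONFIG,
    {"BASE_URL": "https://login.test.example.com", "SIGNING_KEY_PASSWORD": "not-a-real-secret"},
)
```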

With the basic configuration and a working integration in place you can start to think more about user accounts and user attributes. Therefore, let’s explain the storage of user data in the authorization server.

Designing User Accounts

In most use cases you need to create user accounts in the authorization server. A user account can contain informational fields and also more advanced information. For example, if a user authenticates using an external identity provider, the authorization server can create a linked account that stores the external user ID. Similarly, if a user authenticates with newer cryptography-backed authentication methods, the authorization server can store public keys against the user account. We explain more about these forms of user authentication in Chapter 14.

Personal Data

Authorization servers provide ways to store user account records and assign attributes to users. These include built-in fields along with any custom user attributes that you define. You can get started with just the built-in fields. The default schema usually includes the fields shown in Table 4-1. Your authorization server can use these values during authentication operations. For example, a forgot password process may send an email to the user and use the given_name value in the email content.

Table 4-1. Example user attributes
Field         Description
account_id    A generated database identifier for the user.
subject       An immutable user identifier supplied in access tokens.
given_name    The forename(s) of the user.
family_name   The surname(s) of the user.
emails        One or more emails associated to the user.

Authorization servers usually provide built-in fields based on OpenID Connect claims, to enable issuing those claims to tokens. Some authorization servers model their user attributes on the System for Cross-domain Identity Management (SCIM) Core Schema (RFC 7643). SCIM is an interoperable set of specifications for managing user identity data.

As a best practice, you should aim to make the authorization server the source of truth for all personal data. Only store personal data that you need, and only store it for as long as you need to. Ideally, applications and APIs should not store any personal data and should instead receive it from the authorization server in tokens or by calling the userinfo endpoint. This ensures that personal data is protected and audited in a single place. In Table 4-2 some extra fields have been added, for address, social security and terms accepted information.

Table 4-2. Personal data user attributes
Field                    Description
account_id               A generated database identifier for the user.
subject                  An immutable user identifier supplied in access tokens.
given_name               The forename(s) of the user.
family_name              The surname(s) of the user.
emails                   One or more emails associated to the user.
address                  The user’s address fields.
social_security_number   A government issued identifier for the user.
terms_accepted           Whether the user has accepted legal terms for your organization.
terms_accepted_date      When the user last accepted legal terms for your organization.

Table 4-2 shows a cleaner data design than storing personal data in multiple places. Centralizing personal data usually makes it easier to meet user privacy regulations, such as the General Data Protection Regulation (GDPR) in Europe. If you need to leave some denormalized personal fields in the business data, make these fields read-only.

The authorization server supports other privacy-related behaviors. For example, you might decide to show a post-authentication screen to ask the user to accept your organization’s legal terms, then record the date when accepted. You can use the consent features of the authorization server when users need to grant your application access to their personal data. Ideally, you can configure the authorization server to show these screens only once rather than on every user login. If the conditions change, you can update the authorization server to force the user to re-consent on the next login.

Protect personal data

Store personal data in your authorization server and encrypt it at rest. Use the features of the authorization server to help manage privacy and consent before sharing personal data with clients.

Personal data is not the only type of user data. Most digital solutions also store business-related user attributes. You might classify some of these to be part of a user’s core identity and others to be specific to a particular product or area. Let’s explore this further.

Business User Attributes

When getting started with OAuth, it is common for business data to store existing user attributes. This may include both personal data and also the user’s business settings. A standalone website where users are modeled as customers might use the values shown in Table 4-3. Each customer record contains a customer_id to uniquely identify the user, along with other business-related fields. If the user makes an online purchase, the website creates a business resource such as a purchase record and associates it with the customer ID.

Table 4-3. User attributes in business data
Field              Description
customer_id        The user’s business identifier.
tenant_id          The user’s organization.
name               The user’s name.
email              The user’s email address.
membership_level   Business privileges may depend on the user’s membership level.
roles              Business privileges may depend on the user’s roles.

When updating to an OAuth architecture, you usually migrate some existing user attributes to the authorization server’s user accounts. In a simple use case, such as a standalone website, the easiest way to do this is to run a migration process that moves all user attributes to the authorization server. We provide a user migration worked example later in the chapter.

If you work at a large organization you might have multiple product lines, each with its own user stores and business attributes. Migrating all user attributes across all products to the authorization server is likely to be very disruptive. You may change or remodel some business user attributes often. Fields such as roles or address may even have different meanings or data shapes in different contexts. Without care, this could lead to productivity problems where user data has to be frequently updated in the authorization server.

Therefore, in larger setups, think carefully about which attributes to store in the authorization server and which to leave in the business data. Aim to use the authorization server as a toolbox that operates on stable user attributes. In your organization, an identity team who are security specialists might operate the authorization server. The best people to manage volatile business settings are likely to be your API teams.

For your organization’s digital solutions, you may consider some business fields to be central to the user’s identity. When these are the same for all products, managing them centrally is usually a better option than duplicating them across products. Multiple APIs and applications can then receive the same values from the authorization server. The fields from Table 4-4 are often good choices to manage in the authorization server. In use cases where values vary considerably across products, leave them in the business data.

Table 4-4. Shared business settings
Field          Description
customer_id    The user’s business identifier.
tenant_id      The user’s organization.
country_code   The user’s country of residence.
roles          The user’s classification to determine allowed access.

You should be free to design where to store user attributes based on your own requirements. For example, instead of storing a business user identifier in the identity data you might prefer to store the subject claim within the business data. Even if you prefer to store some user attributes in your business data, your authorization server should still be able to access them. To understand how, let’s start by discussing API user identities.

API User Identities

When designing user accounts, do some early thinking about how your APIs will identify users and authorize requests on their behalf. When the authorization server issues access tokens to a client, it includes a subject claim in the token, to identify the user account. The subject is typically a stable technical value such as a Universally Unique Identifier (UUID). It remains the same even if the user’s name changes. When the client sends the access token to the API, the API needs to identify the user in business terms. The subject claim may not be the best value to enable this.

In Figure 4-3 the customer_id field represents an existing business user identifier that the API understands. For example, customers may generate business data after events such as an online purchase. APIs then store purchase database records mapped to the customer ID. Including identifying fields in your authorization server’s user account records makes it easy to issue them as claims in access tokens. APIs that receive access tokens can then easily locate the user’s business resources.

Figure 4-3. API request with business user identity
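The following Python sketch illustrates how an API might use a business user identifier from the access token to locate resources. It assumes the API has already validated the access token and extracted its claims, and the in-memory purchase records are hypothetical stand-ins for a database.

```python
# Hypothetical business data: purchase records mapped to customer IDs.
PURCHASES = [
    {"purchase_id": 1, "customer_id": "2099", "item": "book"},
    {"purchase_id": 2, "customer_id": "3100", "item": "pen"},
]


def purchases_for_user(claims: dict) -> list[dict]:
    """Return only the purchases that belong to the user identified in the token claims."""
    customer_id = claims.get("customer_id")
    if customer_id is None:
        # Without a business identity the API cannot locate the user's resources.
        raise PermissionError("The access token has no customer_id claim")
    return [purchase for purchase in PURCHASES if purchase["customer_id"] == customer_id]


# Claims as they might look after token validation (values are illustrative).
claims = {"sub": "6ffbe3c2-dac9-11ed-b212-169713eddc49", "customer_id": "2099"}
results = purchases_for_user(claims)
```

The API never needs to map the technical subject value to a customer, because the authorization server already issued the identifier that the business data understands.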

Field names for user identifier

The SCIM core schema provides an external_id field that you can use to store a business user identifier, or you can use a custom field if you prefer.

When designing API identities, also be aware of the risk of unintended user tracking. You do not have to always send the same user identity to every API. It is possible to define distinct user identifiers for each client (or group of clients). Use pairwise pseudonymous identifiers (PPID) from the OpenID Connect core specification for that purpose. One use case where you might use PPIDs is when issuing tokens to a less trusted subdivision of a large organization. Figure 4-4 shows an authorization server issuing access tokens with distinct subject claims for the same user account to different sets of clients:

Figure 4-4. Pairwise pseudonymous identifiers
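One way to derive such identifiers, which the OpenID Connect Core specification lists as an example method, is to hash the sector identifier together with the local account ID and a secret salt. The Python sketch below illustrates that calculation. The sector identifiers, account ID and salt are hypothetical values, and your authorization server may use a different algorithm.

```python
import hashlib


def pairwise_subject(sector_identifier: str, local_account_id: str, salt: bytes) -> str:
    """Calculate sub = SHA-256(sector_identifier || local_account_id || salt)."""
    digest = hashlib.sha256()
    digest.update(sector_identifier.encode("utf-8"))
    digest.update(local_account_id.encode("utf-8"))
    digest.update(salt)
    return digest.hexdigest()


# The salt must be kept secret by the authorization server (illustrative value only).
SALT = b"example-secret-salt"

sub_for_main_apps = pairwise_subject("apps.example.com", "account-2099", SALT)
sub_for_subdivision = pairwise_subject("partner.example.org", "account-2099", SALT)
```

The same user account produces a stable subject per group of clients, but the two groups cannot correlate the user by comparing subject values.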

Once your APIs receive useful user identifiers for existing users, you should also understand how the authorization server can interact with your APIs.

Identity Operations

During security-related operations, such as user registration, user authentication or token issuance, the authorization server should have access to any user attributes it needs. For example, the authorization server should be able to verify the correctness of a business user attribute that the user enters in a custom authentication screen. Similarly, the authorization server should be able to retrieve a business user attribute and issue it as an access token claim. To do so, the authorization server makes an external call, such as to a utility API endpoint that you provide. The authorization server sends its user attributes and can receive external attributes in the response, as shown in Figure 4-5.

Figure 4-5. External user attributes flow

If user attributes are stored in multiple data sources, your user account design should also account for future user onboarding. Aim for a solution that allows your APIs to receive the same claims in access tokens for both new and existing users. The example in Figure 4-6 shows a self-signup operation that could run during a code flow. The user fills in a registration form. The authorization server then sends some of the form fields in an external API request, to create a customer record in the business data. The external API response returns a customer_id. The authorization server creates its own user account with the remainder of the form fields and the customer ID. The business user identity for the new user account is thus populated and can be immediately issued to access tokens.

Figure 4-6. Future user registration
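The signup sequence can be sketched in Python as follows. The business API is replaced by an in-memory stub, and the form field names and the choice of which fields go to the business data are assumptions made for illustration.

```python
import uuid


class BusinessApiStub:
    """Hypothetical stand-in for the external API that owns customer records."""

    def __init__(self) -> None:
        self.customers: dict[int, dict] = {}
        self._last_id = 1000

    def create_customer(self, fields: dict) -> int:
        self._last_id += 1
        self.customers[self._last_id] = dict(fields)
        return self._last_id


def register_user(form: dict, business_api: BusinessApiStub, accounts: list[dict]) -> dict:
    """Create the customer record first, then the user account that links to it."""
    # Send only the business-related form fields to the external API.
    customer_id = business_api.create_customer({"membership_level": form["membership_level"]})
    # Store the remaining fields, plus the returned customer ID, in the user account.
    account = {
        "subject": str(uuid.uuid4()),
        "given_name": form["given_name"],
        "family_name": form["family_name"],
        "email": form["email"],
        "customer_id": customer_id,
    }
    accounts.append(account)
    return account


api = BusinessApiStub()
accounts: list[dict] = []
form = {"given_name": "Dana", "family_name": "Demo", "email": "dana.demo@example.com", "membership_level": "gold"}
account = register_user(form, api, accounts)
```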

Shared user identifier

If you split user attributes between the authorization server and your business data, ensure that there is a shared user identifier that you can use to match data records together.

Once you have produced a user account design, you implement it by migrating existing user attributes to the authorization server. You do this using the authorization server’s user management APIs.

User Management APIs

The user accounts that you store in your authorization server belong to you and you should be in control of the data. We have shown how you can use a schema that meets your preferred data design. You also need API access to user accounts in a way that meets your usage requirements. The authorization server’s user management APIs enable you to operate on built-in user attributes and also any custom user attributes you have defined. Therefore, user management APIs should provide similar behaviors to your business-focused APIs.

Access to user management APIs requires a message credential. Typically, the authorization server will require an access token. You may be able to configure a particular scope such as accounts, and also associate claims to that scope, to control access. This enables you to design access tokens that restrict clients to a particular user account or a subset of user accounts. Field-level rules are also common. For example, an administrator may be able to set an initial temporary password for a user but should not be able to read passwords.
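These access rules can be illustrated with a short Python sketch. It assumes the access token has already been validated, and the tenant_id claim used to restrict a client to a subset of accounts is a hypothetical design choice rather than a standard behavior.

```python
def is_allowed(claims: dict, account: dict, action: str, field: str) -> bool:
    """Decide whether the caller may perform an action on one field of a user account."""
    # The token must contain the scope that grants access to user management.
    if "accounts" not in claims.get("scope", "").split():
        return False
    # Hypothetical claim that restricts the caller to a subset of user accounts.
    tenant_id = claims.get("tenant_id")
    if tenant_id is not None and account.get("tenant_id") != tenant_id:
        return False
    # Field-level rule: passwords can be set but never read.
    if field == "password" and action == "read":
        return False
    return True


admin_claims = {"scope": "accounts", "tenant_id": "tenant-a"}
user_account = {"userName": "danademo", "tenant_id": "tenant-a"}
```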

You usually first call user management APIs when populating user accounts in the authorization server for your existing users. The user management APIs can also be used for several other use cases. These could include creating, updating, deleting or deactivating users from an administrator application. In fact, user management APIs should allow you to build a self-service portal where users can update their profile from within a frontend application.

Some authorization servers follow standards for user management APIs, such as the SCIM protocol from RFC 7644. Authorization servers that follow this standard provide you with a well-designed JSON-based API that you can call from any application or API, regardless of technology stack.

When migrating users to the authorization server you typically develop a small program such as a console app. This program iterates over users in an existing data source, then creates or updates user accounts in the authorization server. Example 4-2 shows a request payload to create a new user account.

Example 4-2. Example SCIM request
{
    "schemas": [
        "urn:ietf:params:scim:schemas:core:2.0:User"
    ],
    "userName": "danademo",
    "emails": [
        {
            "value": "dana.demo@example.com",
            "primary": true
        }
    ],
    "customer_id": 345,
    "active": true
}

A successful create operation returns the new user. Example 4-3 shows an example response that includes a generated user identifier in the id field. You can invoke various other operations, such as updating a user via a PUT request, partially updating the user via a PATCH request, or removing the user with a DELETE request.

Example 4-3. Example SCIM response
{
    "schemas": [
        "urn:ietf:params:scim:schemas:core:2.0:User"
    ],
    "userName": "danademo",
    "emails": [
        {
            "value": "dana.demo@example.com",
            "primary": true
        }
    ],
    "customer_id": 345,
    "active": true,
    "id": "VVNFUjo2ZmZiZTNjMi1kYWM5LTExZWQtYjIxMi0xNjk3MTNlZGRjNDk"
}
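As an illustration of a partial update, the following Python sketch builds a SCIM PATCH payload that deactivates a user. The message schema and operation format come from the SCIM protocol specification. The helper name is an assumption, and you would send the payload in a PATCH request to the user's resource URL, which ends with the id value returned when the user was created.

```python
import json


def deactivate_user_patch() -> dict:
    """Build a SCIM PatchOp message that sets the user's active attribute to false."""
    return {
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [
            {"op": "replace", "path": "active", "value": False},
        ],
    }


payload = json.dumps(deactivate_user_patch(), indent=4)
```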

You now understand that you can flexibly store user accounts and user attributes and also have API access to that data. In some use cases though, your APIs may have more advanced data requirements. These can include hosting APIs in multiple global locations or providing API access to users from multiple organizations. Let’s briefly summarize how the authorization server fits into these deployments.

Multi-Region

In global setups, when using APIs across multiple regions with different legal requirements, you may need to meet data sovereignty regulations in some regions. This can result in you needing to store each user’s business transactions and personal data in the user’s home region.

In a zero-trust architecture, you should not rely on the infrastructure to enforce the correct regional access. For example, a basic multi-region deployment might store data in the location that receives the API request, or use data replication to save all transactions to all regions. Global load balancing may occasionally route users to the wrong location, e.g. one that the user is physically closest to but that is not their home region. Figure 4-7 shows an example where the infrastructure initially routes the user to the incorrect region. The API gateway inspects the access token and reads a region or country claim from the access token. When required, the API gateway then re-routes the user to the correct region.

API gateways can use access token claims to implement dynamic user routing
Figure 4-7. Dynamic user routing
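The routing decision can be sketched in Python as follows. The sketch assumes the gateway has already validated the access token and read its claims, and the region claim name and regional base URLs are hypothetical.

```python
from typing import Optional

# Hypothetical base URLs of the API deployment in each region.
REGION_BASE_URLS = {
    "eu": "https://api-eu.example.com",
    "us": "https://api-us.example.com",
}


def reroute_target(current_region: str, claims: dict, path: str) -> Optional[str]:
    """Return the URL to forward the request to, or None when this region should handle it."""
    home_region = claims.get("region")
    if home_region not in REGION_BASE_URLS:
        raise ValueError("The access token has no usable region claim")
    if home_region == current_region:
        return None
    return REGION_BASE_URLS[home_region] + path


# A user whose home region is the EU arrives at the US gateway.
target = reroute_target("us", {"sub": "abc", "region": "eu"}, "/purchases")
```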

In a multi-region deployment, you should use independent deployments of the same authorization server. All regions usually share the same configuration settings, but you should avoid replicating operational data or user data. Instead, aim to store a user’s personal data only within the user’s home region. You can follow an equivalent approach for any sensitive transactions in the business data.

You may also need to provide API access to multiple companies, where each company has its own users. Let’s explain the options your authorization server provides to enable isolation between tenants.

Multi-Tenancy

In some use cases, you may need to supply APIs and clients to business partners, in a B2B2C model, where each organization has its own user base. It is then essential in clients and APIs that each user can never access data that belongs to other organizations. You can meet this requirement in multiple ways. At one end of the spectrum, you could provide each business partner a separate copy of all identity and business components, to ensure full isolation. In some deployments this may not be the most cost-effective design.

Therefore, you could instead design data and components that are multi-tenant aware. You do this by assigning each business partner a tenant ID and using it to partition resources in both the identity data and business data. You would then issue the tenant ID to access tokens, and use it during API authorization, to only allow users to access information for their tenant. Doing so may give you additional deployment options. For example, a default deployment could store data for multiple smaller tenants, while larger tenants pay extra for an isolated deployment.
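A tenant check during API authorization might look like the following Python sketch. It assumes the access token has already been validated, and the tenant_id claim name and the order records are illustrative.

```python
# Hypothetical business data partitioned by tenant ID.
ORDERS = [
    {"order_id": 1, "tenant_id": "tenant-a", "total": 20},
    {"order_id": 2, "tenant_id": "tenant-b", "total": 35},
]


def can_access(claims: dict, resource: dict) -> bool:
    """Allow access only when the resource belongs to the tenant in the access token."""
    tenant_id = claims.get("tenant_id")
    return tenant_id is not None and resource.get("tenant_id") == tenant_id


def orders_for_tenant(claims: dict) -> list[dict]:
    """Filter collections so that users never see another organization's data."""
    return [order for order in ORDERS if can_access(claims, order)]


visible = orders_for_tenant({"sub": "abc", "tenant_id": "tenant-a"})
```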

You can configure your authorization server to be multi-tenant aware by including a tenant ID user attribute in the identity data. You then need to design how to populate this value when you onboard users. Alternatively, your authorization server can provide high isolation per tenant, using a mechanism such as a profile or realm. This can result in isolated copies of data sources, configuration, credentials and OAuth endpoints. You then give each tenant a distinct base URL to the authorization server as we show in Figure 4-8.

The authorization server isolates tenants with distinct profiles, databases and URLs.
Figure 4-8. Multi-tenancy using isolated profiles and data sources

We have walked through a number of data design considerations. Next, we provide some practical content, to show how to get connected to an authorization server’s user management APIs and migrate user accounts.

User Migration Code Example

The GitHub repository for this book provides a code example that demonstrates a user migration. The code example is a simple console app that acts as a SCIM client and calls the authorization server’s SCIM endpoints to create user accounts. You can follow the README instructions to first deploy a cloud native authorization server, then run an example program that iterates over users. Originally, each customer record consists of both business and identity attributes, as shown in Figure 4-9. Business resources are linked to customer IDs.

Figure 4-9. Original user attributes

In this example, you have decided to store personal data in the authorization server, along with a business user identity and role information. Membership details are left in the business data, along with fine-grained permissions. After the migration, user attributes exist in both the authorization server’s user accounts and the business data, with personal data removed from the business data. The two data sources are linked together by the customer ID, as shown in Figure 4-10.

Figure 4-10. Migrated user attributes

In the repository, the original business data consists of customer records with the JSON structure shown in Example 4-4.

Example 4-4. Original customer record
[
    {
        "id": 2099,
        "userName": "dana",
        "email": "dana.demo@example.com",
        "country": "US",
        "membershipLevel": "gold",
        "membershipExpires": "2024-10",
        "roles": [
            "customer"
        ]
    }
]

The migration program uses a JSON configuration file shown in Example 4-5. The configuration settings include the URL of a token endpoint from which the program retrieves an access token with the accounts scope. In the example setup, access tokens issued with this scope have full access to administer all user accounts. The program sends a client credentials grant request to get such an access token, then sends the access token in multiple requests to the SCIM endpoint, each of which creates a user account in the authorization server.

Example 4-5. SCIM client configuration settings
{
    "clientId": "scim-client",
    "clientSecret": "?rlfRiVoPrLp8v$St0Ph",
    "scope": "accounts",
    "tokenEndpoint": "http://localhost:8443/oauth/v2/oauth-token",
    "scimEndpoint": "http://localhost:8443/scim/Users"
}
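The two kinds of request that the program sends can be sketched in Python as follows. This is not the repository's code. It only builds the requests without sending them, and it assumes that the client authenticates by posting its secret in the request body, which your authorization server may or may not be configured to allow.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

CONFIG = {
    "clientId": "scim-client",
    "clientSecret": "?rlfRiVoPrLp8v$St0Ph",
    "scope": "accounts",
    "tokenEndpoint": "http://localhost:8443/oauth/v2/oauth-token",
    "scimEndpoint": "http://localhost:8443/scim/Users",
}


def build_token_request(config: dict) -> Request:
    """Build the client credentials grant request for an access token."""
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": config["clientId"],
        "client_secret": config["clientSecret"],
        "scope": config["scope"],
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(config["tokenEndpoint"], data=form.encode("ascii"), headers=headers, method="POST")


def build_create_user_request(config: dict, access_token: str, scim_user: dict) -> Request:
    """Build the SCIM request that creates one user account."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/scim+json",
    }
    body = json.dumps(scim_user).encode("utf-8")
    return Request(config["scimEndpoint"], data=body, headers=headers, method="POST")


token_request = build_token_request(CONFIG)
```

To run the requests you could pass each one to urllib.request.urlopen, read the access_token field from the JSON token response, then call build_create_user_request once per user.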

As the program migrates each user, it removes personal information from the customer record. Example 4-6 shows the user attributes that remain in the business data after the migration.

Example 4-6. Migrated customer record
[
  {
    "id": 2099,
    "membershipLevel": "gold",
    "membershipExpires": "2024-10"
  }
]

Once the program completes you can list all user accounts created in the authorization server. Example 4-7 shows an example user account containing personal data, a business user identity and roles.

Example 4-7. User account in the authorization server
[
  {
    "id": "VVNFUjplYmYwZjQyYy0wYTY2LTQxOWItYjA0Yy0yZDkzMmY0NzgxMGU",
    "userName": "dana.demo@example.com",
    "schemas": [
      "urn:ietf:params:scim:schemas:core:2.0:User"
    ],
    "active": true,
    "name": {
      "givenName": "Dana",
      "familyName": "Demo"
    },
    "emails": [
      {
        "value": "dana.demo@example.com",
        "primary": true
      }
    ],
    "addresses": [
      {
        "country": "US"
      }
    ],
    "roles": [
      {
        "value": "customer",
        "primary": true
      }
    ],
    "customerId": "2099"
  }
]
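The transformation from the original customer record to the two migrated records can be sketched in Python as follows. This is an illustration rather than the repository's implementation. The original record has no given name or family name, so the sketch does not populate the name field shown in Example 4-7.

```python
def split_customer_record(customer: dict) -> tuple[dict, dict]:
    """Split a customer record into a SCIM user and the remaining business record."""
    scim_user = {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": customer["email"],
        "active": True,
        "emails": [{"value": customer["email"], "primary": True}],
        "addresses": [{"country": customer["country"]}],
        # Mark the first role as primary (an assumption made for this sketch).
        "roles": [
            {"value": role, "primary": index == 0}
            for index, role in enumerate(customer["roles"])
        ],
        # The shared identifier that links the user account to the business data.
        "customerId": str(customer["id"]),
    }
    remaining = {
        "id": customer["id"],
        "membershipLevel": customer["membershipLevel"],
        "membershipExpires": customer["membershipExpires"],
    }
    return scim_user, remaining


original = {
    "id": 2099,
    "userName": "dana",
    "email": "dana.demo@example.com",
    "country": "US",
    "membershipLevel": "gold",
    "membershipExpires": "2024-10",
    "roles": ["customer"],
}
scim_user, remaining = split_customer_record(original)
```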

Although OAuth migrations require some design thinking, the code example shows you that implementations are straightforward. You can implement a migration by writing a simple API client in any technology stack. The migration has improved the data architecture, and the process is reliable and repeatable. User migrations should only have minimal impact on the existing business data.

Finally, note that user migrations should be repeatable, in case you ever want to upgrade to a different authorization server in the future. Do so by exporting existing user accounts using the old authorization server’s user management APIs and importing them using the new authorization server’s user management APIs.

Summary

Authorization servers use three main types of data: configuration data, operational data and user identity data. Configuration data is largely the same for all stages of your deployment pipeline, unlike the operational and user data. A key area of identity data is the storage of user accounts. You should be able to design your user attribute storage in a way that best meets your business needs. Aim to store core user identity fields in the authorization server, including personal data and any business user identities.

The best way to manage identities in an OAuth architecture requires some up-front design thinking. Once complete, you have powerful options for managing difficult security requirements. These can include management of personal data, dealing with user consent and legal terms, and implementing multi-tenant and multi-region behaviors.

Once you have designed your user accounts in the authorization server you can return user attributes to applications in OAuth tokens, or expose them from the authorization server’s userinfo endpoint. This enables you to share the user data stored in the authorization server across many applications. The fields returned to APIs in access tokens are called claims and are fully explained in Chapter 6. In particular, use of claims enables your APIs to receive a business user identity and other values needed for authorization.

In Chapter 5 we will explain how APIs use access tokens. You will learn how to receive access tokens, validate them and use their claims to implement authorization. We also demonstrate a technique for writing integration tests, so that you can test API security as early as possible in your deployment pipeline, on a development computer.
