Chapter 4. The Document Management System

The Challenge

After successfully implementing an advanced Bank Statements Analyzer for Mark Erbergzuck you decide to run some errands—including going to an appointment with your dentist. Dr. Avaj has run her practice successfully for many years. Her happy patients retain their white teeth well into old age. The downside of such a successful practice is that every year more and more patient documents get generated. Every time she needs to find a record of prior treatment, her assistants spend longer and longer searching their filing cabinets.

She realizes that it’s time to automate the process of managing these documents and keeping track of them. Luckily, she has a patient who can do that for her! You are going to help by writing software for her that manages these documents and enables her to find the information that will allow her practice to thrive and grow.

The Goal

In this chapter you’ll be learning about a variety of different software development principles. Key to the design of managing documents is an inheritance relationship, which means extending a class or implementing an interface. In order to do this the right way you’ll get to understand the Liskov Substitution Principle, named after famed computer scientist Barbara Liskov.

Your understanding of when to use inheritance will get fleshed out with a discussion of the “Composition over Inheritance” principle.

Finally, you’ll extend your knowledge of how to write automated test code by understanding what makes a good and maintainable test. Now that we’ve spoiled the plot of this chapter, let’s get back to understanding what requirements Dr. Avaj has for the Document Management System.

Note

If at any point you want to look at the source code for this chapter, you can look at the package com.iteratrlearning.shu_book.chapter_04 in the book’s code repository.

Document Management System Requirements

A friendly cup of tea with Dr. Avaj has revealed that she has the documents that she wants to manage as files on her computer. The Document Management System needs to be able to import these files and record some information about each file that can be indexed and searched. There are three types of documents that she cares about:

Reports

A body of text detailing some consultation of operation on a patient.

Letters

A text document that gets sent to an address. (You’re probably familiar with these already, come to think of it.)

Images

The dental practice often records x-rays or photos of teeth and gums. These have a size.

In addition, all documents need to record the path to the file that is being managed and what patient the document is about. Dr. Avaj needs to be able to search these documents, and query whether each of the attributes about a different type of document contains certain pieces of information; for example, to search for letters where the body contains “Joe Bloggs.”

During the conversation, you also established that Dr. Avaj might wish to add other types of documents in the future.

Fleshing Out the Design

When approaching this problem, there are lots of big design choices to make and modeling approaches that we could take. These choices are subjective, and you’re welcome to try to code up a solution to Dr. Avaj’s problem before or after reading this chapter. In “Alternative Approaches” you can see the reasons why we avoid different choices and the overarching principles behind them.

One good first step to approaching any program is to start with test-driven development (TDD), which is what we did when writing the book’s sample solution. We won’t be covering TDD until Chapter 5, so let’s begin with thinking about the behaviors that your software needs to perform and incrementally fleshing out the code that implements these behaviors.

The Document Management System should be able to import documents on request and add them into its internal store of documents. In order to fulfill this requirement, let’s create the DocumentManagementSystem class and add two methods:

void importFile(String path)

Takes a path to a file that our user wants to import to the Document Management System. As this is a public API method that might take input from users in a production system, we take our path as a String rather than relying on a more type-safe class like java.nio.Path or java.io.File.

List<Document> contents()

Returns a list of all the documents that the Document Management System currently stores.

You’ll notice that contents() returns a list of some Document class. We’ve not said what this class entails yet, but it’ll reappear in due course. For now, you can pretend that it’s an empty class.

Importers

A key characteristic of this system is that we need to be able to import documents of different types. For the purposes of this system you can rely on the files’ extensions in order to decide how to import them, since Dr. Avaj has been saving files with very specific extensions. All her letters have the .letter extension, reports have .report, and .jpg is the only image format used.

The simplest thing to do would be to just throw all the code for the importing mechanism into a single method, as shown in Example 4-1.

Example 4-1. Switch of extension example
switch(extension) {
    case "letter":
        // code for importing letters.
        break;

    case "report":
        // code for importing reports.
        break;


    case "jpg":
        // code for importing images.
        break;

    default:
        throw new UnknownFileTypeException("For file: " + path);
}

This approach would have solved the problem in question but would be hard to extend. Every time you want to add another type of file that gets processed you would need to implement another entry in the switch statement. Over time this method would become intractably long and hard to read.

If you keep your main class nice and simple and split out different implementation classes for importing different types of documents, then it’s easy to locate and understand each importer in isolation. In order to support different document types, an Importer interface is defined. Each Importer will be a class that can import a different type of file.

Now that we know we need an interface to import the files, how should we represent the file that is going to be imported? We have a couple of different options: use a plain String to represent the path of the file, or use a class that represents a file, like java.io.File.

You could make the case that we should apply the principle of strong typing here: take a type that represents the file and reduce the scope for errors versus using a String. Let’s take that approach and use a java.io.File object as the parameter in our Importer interface to represent the file being imported, as shown in Example 4-2.

Example 4-2. Importer
interface Importer {
    Document importFile(File file) throws IOException;
}

You might be asking, Why don’t you use a File for the public API of DocumentManagementSystem as well then? Well, in the case of this application, our public API would probably be wrapped up in some kind of user interface, and we aren’t sure what form that is taking files in. As a result we kept things simple and just used a String type.

The Document Class

Let’s also define the Document class at this point in time. Each document will have multiple attributes that we can search on. Different documents have different types of attributes. We have several different options that we can consider the pros and cons of when defining the Document.

The first and simplest way to represent a document would be to use a Map<String, String>, which is a map from attribute names to values associated with those attributes. So why not just pass a Map<String, String> around through the application? Well, introducing a domain class to model a single document is not just drinking the OOP Koolaid, but also provides a series of practical improvements in application maintability and readability.

For a start, the value of giving concrete names to components within an application cannot be overstated. Communication is King! Good teams of software developers use a Ubiquitous Language to describe their software. Matching the vocabulary that you use within the code of your application to the vocabulary that you use to talk to clients like Dr. Avaj makes things a lot easier to maintain. When you have a conversation with a colleague or client you will invariably need to agree upon some common language with which to describe different aspects of the software. By mapping this to the code itself, it makes it really easy to know what part of the code to change. This is called discoverability.

Note

The term Ubiquitous Language was coined by Eric Evans and originates in Domain Driven Design. It refers to the use of a common language that is clearly degined and shared between both developers and users.

Another principle that should encourage you to introduce a class to model a document is strong typing. Many people use this term to refer to the nature of a programming language, but here we’re talking about the more practical use of strong typing in implementing your software. Types allow us to restrict the way in which data is used. For example, our Document class is immutable: once it has been created you can’t change, or mutate, any of its attributes. Our Importer implementations create the documents; nothing else modifies them. If you ever see a Document with an error in one of its attributes, you can narrow the source of the bug down to the specific Importer that created the Document. You can also infer from the immutability that it’s possible to index or cache any information associated with the Document and you know that it will be correct forever, since documents are immutable.

Another design choice that developers might consider when modeling their Document would be make the Document extend HashMap<String, String>. At first that seems great because the HashMap has all the functionality you need to model a Document. However, there are several reasons why this is a bad choice.

Software design is often as much about restricting functionality that is undesirable as it is about building things that you do want. We would have instantly thrown away the aforementioned benefits from immutability by allowing anything in the application to modify the Document class if it were just a subclass of HashMap. Wrapping the collection also gives us an opportunity to give more meaningful names to the methods, instead of, for example, looking up an attribute by calling the get() method, which doesn’t really mean anything! Later on we’ll go into more detail about inheritance versus composition, because this is really a specific example of that discussion.

In short, domain classes allow us to name a concept and restrict the possible behaviors and values of this concept in order to improve discoverability and reduce the scope for bugs. As a result, we’ve chosen to model the Document as shown in Example 4-3. If you’re wondering why it isn’t public like most interfaces, this is discussed later in “Scoping and Encapsulation Choices”.

Example 4-3. Document
public class Document {
    private final Map<String, String> attributes;

    Document(final Map<String, String> attributes) {
        this.attributes = attributes;
    }

    public String getAttribute(final String attributeName) {
        return attributes.get(attributeName);
    }
}

One final thing to note about Document is that it has a package-scoped constructor. Often Java classes make their constructor public, but this can be a bad choice as it allows code anywhere in your project to create objects of that type. Only code in the Document Management System should be able to create Documents, so we keep the constructor package scoped and restrict access to only the package that the Document Management System lives in.

Attributes and Hierarchical Documents

In our Document class we used Strings for attributes. Doesn’t this go against the principle of strong typing? The answer here is yes and no. We are storing attributes as text so that they can be searched through a text-based search. Not only that, but we want to ensure that all attributes are created in a very generic form that is independent of the Importer that created them. Strings aren’t a bad choice as such in this context. It should be noted that passing Strings around throughout an application in order to represent information is often considered a bad idea. In contrast with something being strongly typed, this is termed stringly typed!

In particular, if more complicated use was being made of the attribute values, then having different attribute types parsed out would be useful. For example, if we wanted to be able to find addresses within a certain distance or images with a height and width less than a certain size, then having strongly typed attributes would be a boon. It would be a lot easier to make comparisons with a width value that is an integer. In the case of this Document Management System, however, we simply don’t need that functionality.

You could design the Document Management System with a class hierarchy for Documents that models the Importer hierarchy. For example, a ReportImporter imports instances of the Report class that extends the Document class. This passes our basic sanity check for subclassing. In other words, it allows you to say a Report is a Document and it makes sense as a sentence. We chose not to go down that direction, however, as the right way to model classes in an OOP setting is to think in terms of behavior and data.

The documents are all modeled very generically in terms of named attributes, rather than specific fields that exist within different subclasses. Additionally, as far as this system is concerned, documents have very little behavior associated with them. There was simply no point in adding a class hierarchy here when it provided no benefit. You might think that this statement in and of itself is a little arbitrary, but it informs us of another principle: KISS.

You learned about the KISS principle in Chapter 2. KISS means that designs are better if they are kept simple. It’s often very hard to avoid unnecessary complexity, but it’s worth trying hard to do so. Whenever someone says, “we might need X” or “it would be cool if we also did Y,” just say No. Bloated and complex designs are paved with good intentions around extensibility and code that is a nice-to-have rather than must-have.

Implementing and Registering Importers

You can implement the Importer interface to look up different types of files. Example 4-4 shows the way that images are imported. One of the great things about Java’s core library is that it provides a lot of built-in functionality right out of the box. Here we read an image file using the ImageIO.read method and then extract the width and height of the image from the resulting BufferedImage object.

Example 4-4. ImageImporter
import static com.iteratrlearning.shu_book.chapter_04.Attributes.*;

class ImageImporter implements Importer {
    @Override
    public Document importFile(final File file) throws IOException {
        final Map<String, String> attributes = new HashMap<>();
        attributes.put(PATH, file.getPath());

        final BufferedImage image = ImageIO.read(file);
        attributes.put(WIDTH, String.valueOf(image.getWidth()));
        attributes.put(HEIGHT, String.valueOf(image.getHeight()));
        attributes.put(TYPE, "IMAGE");

        return new Document(attributes);
    }
}

Attribute names are constants defined in the Attributes class. This avoids bugs where different importers end up using different strings for the same attribute name; for example, "Path" versus "path". Java itself doesn’t have a direct concept of a constant as such, Example 4-5 shows the commonly used idiom. This constant is public because we want to be able to use it from different importers, though you may well have a private or package scoped constant instead. The use of the final keyword ensures that it can’t be reassigned to and static ensures that there is only a single instance per class.

Example 4-5. How to define a constant in Java
public static final String PATH = "path";

There are importers for all three different types of files and you will see the other two implemented in “Extending and Reusing Code”. Don’t worry, we’re not hiding anything up our sleeves. In order to be able to use the Importer classes when we import files, we also need to register the importers to look them up. We use the extension of the file that we want to import as the key of the Map, as shown in Example 4-6.

Example 4-6. Registering the importers
    private final Map<String, Importer> extensionToImporter = new HashMap<>();

    public DocumentManagementSystem() {
        extensionToImporter.put("letter", new LetterImporter());
        extensionToImporter.put("report", new ReportImporter());
        extensionToImporter.put("jpg", new ImageImporter());
    }

Now that you know how to import documents, we can implement search. We won’t be focusing on the most efficient way to implement searching of documents here since we’re not trying to implement Google, just get the information to Dr. Avaj that she requires. A conversation with Dr. Avaj revealed that she wanted to be able to look up information about different attributes of a Document.

Her requirements could be met by just being able to find subsequences within attribute values. For example, she might want to search for documents that have a patient called Joe, and with Diet Coke in the body. We thus devised a very simple query language that consisted of a series of attribute name and substring pairs separated by commas. Our aforementioned query would be written as "patient:Joe,body:Diet Coke".

Since the search implementation keeps things simple rather than trying to be highly optimized, it just does a linear scan over all the documents recorded in the system and tests each one against the query. The query String that is passed to the search method is parsed into a Query object that can then be tested against each Document.

The Liskov Substitution Principle (LSP)

We’ve talked about a few specific design decisions related to classes—for example, modeling different Importer implementations with classes, and why we didn’t introduce a class hierarchy for the Document class and why we didn’t just make Document extend HashMap. But really there’s a broader principle at stake here, one that allows us to generalize these examples into an approach that you can use in any piece of software. This is called the Liskov Substitution Principle (LSP) and it helps us understand how to subclass and implement interfaces correctly. LSP forms the L of the SOLID principles that we’ve been referring to throughout this book.

The Liskov Substitution Principle is often stated in these very formal terms, but is actually a very simple concept. Let’s demystify some of this terminology. If you hear type in this context, just think of a class or an interface. The term subtype means establish a parent-to-child relationship between types; in other words, extend a class or implement an interface. So informally you can think of this as meaning that child classes should maintain the behavior they inherit from their parents. We know, we know—it sounds like an obvious statement, but we can be more specific and split out LSP into four distinct parts:

Preconditions cannot be strengthened in a subtype

A precondition establishes the conditions under which some code will work. You can’t just assume what you’ve written will work anyway, anyhow, anywhere. For example, all our Importer implementations have the precondition that the file being imported exists and is readable. As a result, the importFile method has validation code before any Importer is invoked, as can be seen in Example 4-7.

Example 4-7. importFile definition
    public void importFile(final String path) throws IOException {
        final File file = new File(path);
        if (!file.exists()) {
            throw new FileNotFoundException(path);
        }

        final int separatorIndex = path.lastIndexOf('.');
        if (separatorIndex != -1) {
            if (separatorIndex == path.length()) {
                throw new UnknownFileTypeException("No extension found For file: " + path);
            }
            final String extension = path.substring(separatorIndex + 1);
            final Importer importer = extensionToImporter.get(extension);
            if (importer == null) {
                throw new UnknownFileTypeException("For file: " + path);
            }

            final Document document = importer.importFile(file);
            documents.add(document);
        } else {
            throw new UnknownFileTypeException("No extension found For file: " + path);
        }
    }

LSP means that you can’t require any more restrictive preconditions than your parent required. So, for example, you can’t require your document to be smaller than 100KB in size if your parent should be able to import any size of document.

Postconditions cannot be weakened in a subtype

This might sound a bit confusing because it reads a lot like the first rule. Postconditions are things that have to be true after some code has run. For example, after importFile() has run, if the file in question is valid it must be in the list of documents returned by contents(). So if the parent has some kind of side effect or returns some value, then the child must do so as well.

Invariants of the supertype must be preserved in a subtype

An invariant is something that never changes, like the ebb and flow of the tides. In the context of inheritance, we want to make sure that any invariants that are expected to be maintained by the parent class should also be maintained by the children.

The History Rule

This is the hardest aspect of LSP to understand. In essence, the child class shouldn’t allow state changes that your parent disallowed. So, in our example program we have an immutable Document class. In other words, once it has been instantiated you can’t remove, add, or alter any of the attributes. You shouldn’t subclass this Document class and create a mutable Document class. This is because any user of the parent class would expect certain behavior in response to calling methods on the Document class. If the child were mutable, it could violate callers’ expectations about what calling those methods does.

Alternative Approaches

You could have taken a completely different approach when it comes to designing the Document Management System. We’ll take a look at some of these alternatives now as we think they are instructive. None of the choices could be considered wrong as such, but we do think the chosen approach is best.

Making Importer a Class

You could have chosen to make a class hierarchy for importers, and have a class at the top for the Importer rather than an interface. Interfaces and classes provide a different set of capabilities. You can implement multiple interfaces, while classes can contain instance fields and it’s more usual to have method bodies in classes.

In this case the reason to have a hierarchy is to enable different importers to be used. You’ve already heard about our motivation for avoiding brittle class-based inheritance relationships, so it should be pretty clear that using interfaces is a better choice here.

That’s not to say that classes wouldn’t be a better choice elsewhere. If you want to model a strong is a relationship in your problem domain that involves state or a lot of behavior, then class-based inheritance is more appropriate. It’s just not the choice we think is most appropriate here.

Scoping and Encapsulation Choices

If you have taken the time to peruse the code you might notice that the Importer interface, its implementations, and our Query class are all package scoped. Package scope is the default scope, so if you see a class file with class Query at the top you know it’s package scoped, and if it says public class Query it’s public scoped. Package scoping means that other classes within the same package can see or have access to the class, but no one else can. It’s a cloaking device.

A strange thing about the Java ecosystem is that even though package scope is the default scope, whenever we’ve been involved in software development projects there are always more public-scoped classes than package-scoped ones. Perhaps the default should have been public all along, but either way package scope is a really useful tool. It helps you encapsulate these kinds of design decisions. A lot of this section has commented on the different choices that are available to you around designing the system, and you may want to refactor to one of these alternative designs when maintaining the system. This would be harder if we leaked details about this implementation outside of the package in question. Through diligent use of package scoping you can stop classes outside of the package making so many assumptions about that internal design.

We think it’s also worth reiterating that this is simply a justification and explanation of these design choices. There’s nothing inherently wrong with making other choices listed in this section—they may work out to be more appropriate depending on how the application evolves over time.

Extending and Reusing Code

When it comes to software, the only constant is change. Over time you may want to add features to your product, customer requirements may change, and regulations could force you alter your software. As we alluded to earlier, there may be more documents that Dr. Avaj would like to add to our Document Management System. In fact, when we first came to showcase the software that we’ve written for her she immediately realized that invoicing clients was something that she also wanted to keep track of in this system. An invoice is a document with a body and an amount and has an .invoice extension. Example 4-8 shows an example invoice.

Example 4-8. Invoice example
Dear Joe Bloggs

Here is your invoice for the dental treatment that you received.

Amount: $100

regards,

  Dr Avaj
  Awesome Dentist

Fortunately for us, all of Dr. Avaj’s invoices are in the same format. As you can see, we need to extract an amount of money from this, and the amount line starts with the Amount: prefix. The person’s name is at the beginning of the letter on a line with the prefix Dear. In fact, our system implements a general method of finding the suffix of a line with a given prefix, shown in Example 4-9. In this example, the field lines has already been initialized with the lines of the file that we’re importing. We pass this method a prefix—for example, “Amount:”—and it associates the rest of the line, the suffix, with a provided attribute name.

Example 4-9. addLineSuffix definition
    void addLineSuffix(final String prefix, final String attributeName) {
        for(final String line: lines) {
            if (line.startsWith(prefix)) {
                attributes.put(attributeName, line.substring(prefix.length()));
                break;
            }
        }
    }

We in fact have a similar concept when we try to import a letter. Consider the example letter presented in Example 4-10. Here you can extract the name of the patient by looking for a line starting with Dear. Letters also have addresses and bodies of text that you want to extract from the contents of the text file.

Example 4-10. Letter example
Dear Joe Bloggs

123 Fake Street
Westminster
London
United Kingdom

We are writing to you to confirm the re-scheduling of your appointment
with Dr. Avaj from 29th December 2016 to 5th January 2017.

regards,

  Dr Avaj
  Awesome Dentist

We also have a similar problem when it comes to importing patient reports. Dr. Avaj’s reports prefix the name of the patient with Patient: and have a body of text to include, just like letters. You can see an example of a report in Example 4-11.

Example 4-11. Report example
Patient: Joe Bloggs

On 5th January 2017 I examined Joe's teeth.
We discussed his switch from drinking Coke to Diet Coke.
No new problems were noted with his teeth.

So one option here would be to have all three text-based importers implement the same method to find the suffixes of text lines with a given prefix that was listed in Example 4-9. Now if we were charging Dr. Avaj based on the number of lines of code that we had written, this would be a great strategy. We could triple the amount of money that we would make for basically the same work!

Sadly (or maybe not so sadly, given the aforementioned incentives), customers rarely pay based on the number of lines of code produced. What matters are the requirements that the customer wants. So we really want to be able to reuse this code across the three importers. In order to reuse the code we need to actually have it live in some class. You have essentially three options to consider, each with pros and cons:

  • Use a utility class

  • Use inheritance

  • Use a domain class

The simplest option to start with is to create a utility class. You could call this ImportUtil. Then every time you wanted to have a method that needs to be shared between different importers it could go in this utility class. Your utility class would end up being a bag of static methods.

While a utility class is nice and simple, it’s not exactly the pinnacle of object-oriented programming. The object-oriented style involves having concepts in your application be modeled by classes. If you want to create a thing, then you invoke new Thing() for whatever your thing is. Attributes and behavior associated with the thing should be methods on the Thing class.

If you follow this principle of modeling real-world objects as classes, it does genuinely make it easier to understand your application because it gives you a structure and maps a mental model of your domain onto your code. You want to alter the way that letters are imported? Well then edit the LetterImporter class.

Utility classes violate this expectation and often end up turning into bundles of procedural code with no single responsibility or concept. Over time, this can often lead to the appearance of a God Class in our codebase; in other words, a single large class that ends up hogging a lot of responsibility.

So what should you do if you want to associate this behavior to a concept? Well, the next most obvious approach might be to use inheritance. In this approach you would have the different importers extend a TextImporter class. You could then place all the common functionality on that class and reuse it in subclasses.

Inheritance is a perfectly solid choice of design in many circumstances. You’ve already seen the Liskov Substitution Principle and how it puts constraints on the correctness of our inheritance relationship. In practice, inheritance is often a poor choice when the inheritance fails to model some real-world relationship.

In this case, a TextImporter is an Importer and we can ensure that our classes follow the LSP rules, but it doesn’t really seem like a strong concept to work with. The issue with inheritance relationships that don’t correspond to real-world relationships is that they tend to be brittle. As your application evolves over time you want abstractions that evolve with the application rather than against it. As a rule of thumb, it’s a bad idea to introduce an inheritance relationship purely to enable code reuse.

Our final choice is to model the text file using a domain class. To use this approach we would model some underlying concept and build out our different importers by invoking methods on top of the underlying concept. So what’s the concept in question here? Well, what we’re really trying to do is manipulate the contents of a text file, so let’s call the class a TextFile. It’s not original or creative, but that’s the point. You know where the functionality for manipulating text files lies, because the class is named in a really dead simple manner.

Example 4-12 shows the definition of the class and its fields. Note that this isn’t a subclass of a Document because a document shouldn’t be coupled to just text files—we may import binary files like images as well. This is just a class that models the underlying concept of a text file and has associated methods for extracting data from text files.

Example 4-12. TextFile definition
class TextFile {
    private final Map<String, String> attributes;
    private final List<String> lines;

    // class continues ...

This is the approach that we pick in the case of importers. We think this allows us to model our problem domain in a flexible way. It doesn’t tie us into a brittle inheritance hierarchy, but still allows us to reuse the code. Example 4-13 shows how to import invoices. The suffixes for the name and amount are added, along with setting the type of the invoice to be an amount.

Example 4-13. Importing invoices
    @Override
    public Document importFile(final File file) throws IOException {
        final TextFile textFile = new TextFile(file);

        textFile.addLineSuffix(NAME_PREFIX, PATIENT);
        textFile.addLineSuffix(AMOUNT_PREFIX, AMOUNT);

        final Map<String, String> attributes = textFile.getAttributes();
        attributes.put(TYPE, "INVOICE");
        return new Document(attributes);
    }

You can also see another example of an importer that uses the TextFile class in Example 4-14. No need to worry about how TextFile.addLines is implemented; you can see an explanation of that in Example 4-15.

Example 4-14. Importing letters
    @Override
    public Document importFile(final File file) throws IOException {
        final TextFile textFile = new TextFile(file);

        textFile.addLineSuffix(NAME_PREFIX, PATIENT);

        final int lineNumber = textFile.addLines(2, String::isEmpty, ADDRESS);
        textFile.addLines(lineNumber + 1, (line) -> line.startsWith("regards,"), BODY);

        final Map<String, String> attributes = textFile.getAttributes();
        attributes.put(TYPE, "LETTER");
        return new Document(attributes);
    }

These classes weren’t first written like this, though. They evolved into their current state. When we started coding up the Document Management System, the first text-based importer, the LetterImporter, had all of its text extraction logic written inline in the class. This is a good way to start. Trying to seek out code to reuse often results in inappropriate abstractions. Walk before you run.

As we started writing the ReportImporter it become increasingly apparent that a lot of the text extraction logic could be shared between the two importers, and that really they should be written in terms of method invocations upon some common domain concept that we have introduced here—the TextFile. In fact, we even copy and pasted the code that was to be shared between the two classes to begin with.

That isn’t to say that copy and pasting code is good—far from it. But it’s often better to duplicate a little bit of code when you start writing some classes. Once you’ve implemented more of the application, the right abstraction—e.g., a TextFile class will become apparent. Only when you know a little bit more about the right way to remove duplication should you go down the route of removing the duplication.

In Example 4-15 you can see how the TextFile.addLines method was implemented. This is common code used by different Importer implementations. Its first argument is a start index, which tells you which line number to start on. Then there’s an isEnd predicate that is applied to the line and returns true if we’ve reached the end of the line. Finally, we have the name of the attribute that we’re going to associate with this value.

Example 4-15. addLines definition
    int addLines(
        final int start,
        final Predicate<String> isEnd,
        final String attributeName) {

        final StringBuilder accumulator = new StringBuilder();
        int lineNumber;
        for (lineNumber = start; lineNumber < lines.size(); lineNumber++) {
            final String line = lines.get(lineNumber);
            if (isEnd.test(line)) {
                break;
            }

            accumulator.append(line);
            accumulator.append("\n");
        }
        attributes.put(attributeName, accumulator.toString().trim());
        return lineNumber;
    }

Test Hygiene

As you learned in Chapter 2, writing automated tests has a lot of benefits in terms of software maintainability. It enables us to reduce the scope for regressions and understand which commit caused them. It also enables us to refactor our code with confidence. Tests aren’t a magic panacea, though. They require that we write and maintain a lot of code in order to get these benefits. As you know, writing and maintaining code is a difficult proposition, and many developers find that when they first start writing automated tests that they can take a lot of developer time.

In order to solve the problem of test maintainability you need to get to grips with test hygiene. Test hygiene means to keep your test code clean and ensure that it is maintained and improved along with your codebase under test. If you don’t maintain and treat your tests, over time they will become a burden on your developer productivity. In this section you’ll learn about a few key points that can help to keep tests hygienic.

Test Naming

The first thing to think about when it comes to tests is their naming. Developers can get highly opinionated about naming—it’s an easy topic to talk about a lot because everyone can relate to it and think about the problem. We think the thing to remember is that there’s rarely a clear, really good name for something, but there are many, many, bad names.

The first test we wrote for the Document Management System was testing that we import a file and create a Document. This was written before we had introduced the concept of an Importer and weren’t testing Document-specific attributes. The code is in Example 4-16.

Example 4-16. Test for importing files
    @Test
    public void shouldImportFile() throws Exception
    {
        system.importFile(LETTER);

        final Document document = onlyDocument();

        assertAttributeEquals(document, Attributes.PATH, LETTER);
    }

This test was named shouldImportFile. The key driving principles when it comes to test naming are readability, maintainability, and acting as executable documentation. When you see a report of a test class being run, the names should act as statements that document what functionality works and what does not. This allows a developer to easily map from application behavior to a test that asserts that this behavior is implemented. By reducing the impedence mismatch between behavior and code, we make it easier for other developers to understand what is happening in the future. This is a test that confirms that the document management system imports a file.

There are lots of naming anti-patterns, however. The worst anti-pattern is to name a test something completely nondescript—for example, test1. What on earth is test1 testing? The reader’s patience? Treat people who are reading your code like you would like them to treat you.

Another common test naming anti-pattern is just named after a concept or a noun—for example, file or document. Test names should describe the behavior under test, not a concept. Another test naming anti-pattern is to simply name the test after a method that is invoked during testing, rather than the behavior. In this case the test might be named importFile.

You might ask, by naming our test shouldImportFile haven’t we committed this sin here? There’s some merit to the accusation, but here we’re just describing the behavior under test. In fact, the importFile method is tested by various tests; for example, shouldImportLetterAttributes, shouldImportReportAttributes, and shouldImportImageAttributes. None of those tests are called importFile—they are all describing more specific behaviors.

OK, now you know what bad naming looks like, so what is good test naming? You should follow three rules of thumb and use them to drive test naming:

Use domain terminology

Align the vocabulary used in your test names with that used when describing the problem domain or referred by the application itself.

Use natural language

Every test name should be something that you can easily read as a sentence. It should always describe some behavior in a readable way.

Be descriptive

Code will be read many times more often than it is written. Don’t skimp on spending more time thinking of a good name that’s descriptive up front and easier to understand later down the line. If you can’t think of a good name, why not ask a colleague? In golf, you win by putting in the fewest shots. Programming isn’t like that; shortest isn’t necessarily best.

You can follow the convention used in the DocumentManagementSystemTest of prefixing test names with the word “should,” or choose not to; that’s merely a matter of personal preference.

Behavior Not Implementation

If you’re writing a test for a class, a component, or even a system, then you should only be testing the public behavior of whatever is being tested. In the case of the Document Management System, we only have tests for the behavior of our public API in the form of DocumentManagementSystemTest. In this test we test the public API of the DocumentManagementSystem class and thus the whole system. The API can be seen in Example 4-17.

Example 4-17. Public API of the DocumentManagementSystem class
public class DocumentManagementSystem
{
    public void importFile(final String path) {
        ...
    }

    public List<Document> contents() {
        ...
    }

    public List<Document> search(final String query) {
        ...
    }
}

Our tests should only invoke these public API methods and not try to inspect the internal state of the objects or the design. This is one of the key mistakes made by developers that leads to hard-to-maintain tests. Relying on specific implementation details results in brittle tests because if you change the implementation detail in question, the test can start to fail even if the behavior is still working. Take a look at the test in Example 4-18.

Example 4-18. Test for importing letters
    @Test
    public void shouldImportLetterAttributes() throws Exception
    {
        system.importFile(LETTER);

        final Document document = onlyDocument();

        assertAttributeEquals(document, PATIENT, JOE_BLOGGS);
        assertAttributeEquals(document, ADDRESS,
            "123 Fake Street\n" +
                "Westminster\n" +
                "London\n" +
                "United Kingdom");
        assertAttributeEquals(document, BODY,
            "We are writing to you to confirm the re-scheduling of your appointment\n" +
            "with Dr. Avaj from 29th December 2016 to 5th January 2017.");
        assertTypeIs("LETTER", document);
    }

One way of testing this letter-importing functionality would have been to write the test as a unit test on the LetterImporter class. This would have looked fairly similar: importing an example file and then making an assert about the result returned from the importer. In our tests, though, the mere existence of the LetterImporter is an implementation detail. In “Extending and Reusing Code”, you saw numerous other alternative choices for laying out our importer code. By laying out our tests in this manner, we give ourselves the choice to refactor our internals to a different design without breaking our tests.

So we’ve said that relying on the behavior of a class relies on using the public API, but there’s also some parts of the behavior that aren’t usually restricted just through making methods public or private. For example, we might not want to rely on the order of documents being being returned from the contents() method. That isn’t a property that’s restricted by the public API of the DocumentManagementSystem class, but simply something that you need to be careful to avoid doing.

A common anti-pattern in this regard is exposing otherwise private state through a getter or setter in order to make testing easier. You should try to avoid doing this wherever possible as it makes your tests brittle. If you have exposed this state to make testing superficially easier, then you end up making maintaining your application harder in the long run. This is because any change to your codebase that involves changing the way this internal state is represented now also requires altering your tests. This is sometimes a good indication that you need to refactor out a new class that can be more easily and effectively tested.

Don’t Repeat Yourself

“Extending and Reusing Code” extensively discusses how we can remove duplicate code from our application and where to place the resulting code. The exact same reasoning around maintenance applies equally to test code. Sadly, developers often simply don’t bother to remove duplication from tests in the same way as they would for application code. If you take a look at Example 4-19 you’ll see a test that repeatedly makes asserts about the different attributes that a resulting Document has.

Example 4-19. Test for importing images
    @Test
    public void shouldImportImageAttributes() throws Exception
    {
        system.importFile(XRAY);

        final Document document = onlyDocument();

        assertAttributeEquals(document, WIDTH, "320");
        assertAttributeEquals(document, HEIGHT, "179");
        assertTypeIs("IMAGE", document);
    }

Normally you would have to look up the attribute name for every attribute and assert that it is equal to an expected value. In the case of the tests here, this is a common enough operation that a common method, assertAttributeEquals, was extracted with this logic. Its implementation is shown in Example 4-20.

Example 4-20. Implementing a new assertion
    private void assertAttributeEquals(
        final Document document,
        final String attributeName,
        final String expectedValue)
    {
        assertEquals(
            "Document has the wrong value for " + attributeName,
            expectedValue,
            document.getAttribute(attributeName));
    }

Good Diagnostics

Tests would be no good if they didn’t fail. In fact, if you’ve never seen a test fail how do you know if it’s working at all? When writing tests the best thing to do is to optimize for failure. When we say optimize, we don’t mean make the test run faster when it fails—we mean ensure that it is written in a way that makes understanding why and how it failed as easy as possible. The trick to this is good diagnostics.

By diagnostics we mean the message and information that gets printed out when a test fails. The clearer this message is about what has failed, the easier it is to debug the test failure. You might ask why even bother with this when a lot of the time Java tests are run from within modern IDEs that have debuggers built in? Well, sometimes tests may be run within continuous integration environments, and sometimes they may be from the command line. Even if you’re running them within an IDE it is still helpful to have good diagnostic information. Hopefully, we’ve convinced you of the need for good diagnostics, but what do they look like in code?

Example 4-21 shows a method that asserts that the system only contains a single document. We will explain the hasSize() method in a little bit.

Example 4-21. Test that the system contains a single document
    private Document onlyDocument()
    {
        final List<Document> documents = system.contents();
        assertThat(documents, hasSize(1));
        return documents.get(0);
    }

The simplest type of assert that JUnit offers us is assertTrue(), which will take a boolean value that it expects to be true. Example 4-22 shows how we could have just used assertTrue to implement the test. In this case the value is being checked to equal 0 so that it will fail the shouldImportFile test and thus demonstrate the failure diagnostics. The problem with this is that we don’t get very good diagnostics—just an AssertionError with no information in the message shown in Figure 4-1. You don’t know what failed, and you don’t know what values were being compared. You know nothing, even if your name isn’t Jon Snow.

Example 4-22. assertTrue example
assertTrue(documents.size() == 0);
assertTrue example
Figure 4-1. Screenshot of assertTrue failing

The most commonly used assertion is assertEquals, which takes two values and checks they are equal and is overloaded to support primitive values. So here we can assert that the size of the documents list is 0, as shown in Example 4-23. This produces a slightly better diagnostic as shown in Figure 4-2, you know that the expected value was 0 and the actual value was 1, but it still doesn’t give you any meaningful context.

Example 4-23. assertEquals example
assertEquals(0, documents.size());
assertEquals example
Figure 4-2. Screenshot of assertEquals example failing

The best way of making an assert about the size itself is to use a matcher for asserting the collection size as this provides the most descriptive diagnostics. Example 4-24 has our example written in that style and demonstrates the output as well. As Figure 4-3 shows, this is much clearer as to what went wrong without you needing to write any more code.

Example 4-24. assertThat example
assertThat(documents, hasSize(0));
assertThat example
Figure 4-3. Screenshot of assertThat example failing

What is going on here is that JUnit’s assertThat() is being used. The method assertThat() takes a value as its first parameter and a Matcher as its second. The Matcher encapsulates the concept of whether a value matches some property and also its associated diagnostics. The hasSize matcher is statically imported from a Matchers utility class that contains a bundle of different matchers and checks that the size of a collection is equal to its parameter. These matchers come from the Hamcrest library, which is a very commonly used Java library that enables cleaner testing.

Another example of how you can build better diagnostics was shown in Example 4-20. Here an assertEquals would have given us the diagnostic for the attribute’s expected value and actual value. It wouldn’t have told us what the name of the attribute was, so this was added into the message string to help us understand failure.

Testing Error Cases

One of the absolute worst and most common mistakes to make when writing software is only to test the beautiful, golden, happy path of your application—the code path that is executed when the sun is shining on you and nothing goes wrong. In practice lots of things can go wrong! If you don’t test how your application behaves in these situations, you’re not going to end up with software that will work reliably in a production setting.

When it comes to importing documents into our Document Management System there are a couple of error cases that might happen. We might try to import a file that doesn’t exist or can’t be read, or we might try to import a file that we don’t know how to extract text from or read.

Our DocumentManagementSystemTest has a couple of tests, shown in Example 4-25, that test these two scenarios. In both cases we try to import a path file that will expose the problem. In order to make an assert about the desired behavior we use the expected = attribute of JUnit’s @Test annotation. This enables you to say Hey listen, JUnit, I’m expecting this test to throw an exception, it’s of a certain type.

Example 4-25. Testing for error cases
    @Test(expected = FileNotFoundException.class)
    public void shouldNotImportMissingFile() throws Exception
    {
        system.importFile("gobbledygook.txt");
    }

    @Test(expected = UnknownFileTypeException.class)
    public void shouldNotImportUnknownFile() throws Exception
    {
        system.importFile(RESOURCES + "unknown.txt");
    }

You may want an alternative behavior to simply throwing an exception in the case of an error, but it’s definitely helpful to know how to assert that an exception is thrown.

Constants

Constants are values that do not change. Let’s face it—they are one of the few well-named concepts when it comes to computer programming. The Java programming language doesn’t use an explicit const keyword like C++ does, but conventionally developers create static field fields in order to represent constants. Since many tests consist of examples of how a part of your computer program should be used, they often consist of many constants.

It’s a good idea when it comes to constants that have some kind of nonobvious meaning to give them a proper name that can be used within tests. We do that extensively through the DocumentManagementSystemTest, and in fact, have a block at the top dedicated to declaring constants, shown in Example 4-26.

Example 4-26. Constants
public class DocumentManagementSystemTest
{
    private static final String RESOURCES =
        "src" + File.separator + "test" + File.separator + "resources" + File.separator;
    private static final String LETTER = RESOURCES + "patient.letter";
    private static final String REPORT = RESOURCES + "patient.report";
    private static final String XRAY = RESOURCES + "xray.jpg";
    private static final String INVOICE = RESOURCES + "patient.invoice";
    private static final String JOE_BLOGGS = "Joe Bloggs";

Takeaways

  • You learned how to build a Document Management System.

  • You recognized the different trade-offs between different implementation approaches.

  • You understood several principles that drive the design of software.

  • You were introduced to the Liskov Substitution Principle as a way to think about inheritance.

  • You learned about situations where inheritance wasn’t appropriate.

Iterating on You

If you want to extend and solidify the knowledge from this section you could try one of these activities:

  • Take the existing sample code and add an implementation for importing prescription documents. A prescription should have a patient, a drug, a quantity, a date, and state the conditions for taking a drug. You should also write a test that checks that the prescription import works.

  • Try implementing the Game of Life Kata.

Completing the Challenge

Dr. Avaj is really pleased with your Document Management System and she now uses it extensively. Her needs are effectively met by the features because you drove your design from her requirements toward application behavior and into your implementation details. This is a theme that you will return to when TDD is introduced in the next chapter.

Get Real-World Software Development now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.