Chapter 1. Tetranucleotide Frequency: Counting Things

Counting the bases in DNA is perhaps the “Hello, World!” of bioinformatics. The Rosalind DNA challenge describes a program that will take a sequence of DNA and print a count of how many As, Cs, Gs, and Ts are found. There are surprisingly many ways to count things in Python, and I’ll explore what the language has to offer. I’ll also demonstrate how to write a well-structured, documented program that validates its arguments as well as how to write and run tests to ensure the program works correctly.

In this chapter, you’ll learn:

  • How to start a new program using new.py

  • How to define and validate command-line arguments using argparse

  • How to run a test suite using pytest

  • How to iterate the characters of a string

  • Ways to count elements in a collection

  • How to create a decision tree using if/elif statements

  • How to format strings

Getting Started

Before you start, be sure you have read “Getting the Code and Tests” in the Preface. Once you have a local copy of the code repository, change into the 01_dna directory:

$ cd 01_dna

Here you’ll find several solution*.py programs along with tests and input data you can use to see if the programs work correctly. To get an idea of how your program should work, start by copying the first solution to a program called dna.py:

$ cp solution1_iter.py dna.py

Now run the program with no arguments, or with the -h or --help flags. It will print usage documentation (note that usage is the first word of the output):

$ ./dna.py
usage: dna.py [-h] DNA
dna.py: error: the following arguments are required: DNA

If you get an error like “permission denied,” you may need to run chmod +x dna.py to change the mode of the program by adding the executable bit.

This is one of the first elements of reproducibility. Programs should provide documentation on how they work. While it’s common to have something like a README file or even a paper to describe a program, the program itself must provide documentation on its parameters and outputs. I’ll show you how to use the argparse module to define and validate the arguments as well as to generate the documentation, meaning that there is no possibility that the usage statement generated by the program could be incorrect. Contrast this with how README files and change logs and the like can quickly fall out of sync with a program’s development, and I hope you’ll appreciate that this sort of documentation is quite effective.

You can see from the usage line that the program expects something like DNA as an argument, so let’s give it a sequence. As described on the Rosalind page, the program prints the counts for each of the bases A, C, G, and T, in that order and separated by a single space each:

$ ./dna.py ACCGGGTTTT
1 2 3 4

When you go to solve a challenge on the Rosalind.info website, the input for your program will be provided as a downloaded file; therefore, I’ll write the program so that it will also read the contents of a file. I can use the command cat (for concatenate) to print the contents of one of the files in the tests/inputs directory:

$ cat tests/inputs/input2.txt
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

This is the same sequence shown in the example on the website. Accordingly, I know that the output of the program should be this:

$ ./dna.py tests/inputs/input2.txt
20 12 17 21

Throughout the book, I’ll use the pytest tool to run the tests that ensure programs work as expected. When I run the command pytest, it will recursively search the current directory for tests and functions that look like tests. Note that you may need to run python3 -m pytest or pytest.exe if you are on Windows. Run this now, and you should see something like the following to indicate that the program passes all four of the tests found in the tests/dna_test.py file:

$ pytest
=========================== test session starts ===========================
...
collected 4 items

tests/dna_test.py ....                                              [100%]

============================ 4 passed in 0.41s ============================

A key element to testing software is that you run your program with known inputs and verify that it produces the correct output. While that may seem like an obvious idea, I’ve had to object to “testing” schemes that simply ran programs but never verified that they behaved correctly.

Creating the Program Using new.py

If you copied one of the solutions, as shown in the preceding section, then delete that program so you can start from scratch:

$ rm dna.py

Without looking at my solutions yet, I want you to try to solve this problem. If you think you have all the information you need, feel free to jump ahead and write your own version of dna.py, using pytest to run the provided tests. Keep reading if you want to go step-by-step with me to learn how to write the program and run the tests.

Every program in this book will accept some command-line argument(s) and create some output, like text on the command line or new files. I’ll always use the new.py program described in the Preface to start, but this is not a requirement. You can write your programs however you like, starting from whatever point you want, but your programs are expected to have the same features, such as generating usage statements and properly validating arguments.

Create your dna.py program in the 01_dna directory, as this contains the test files for the program. Here is how I will start the dna.py program. The --purpose argument will be used in the program’s documentation:

$ new.py --purpose 'Tetranucleotide frequency' dna.py
Done, see new script "dna.py."

If you run the new dna.py program, you will see that it defines many different types of arguments common to command-line programs:

$ ./dna.py --help
usage: dna.py [-h] [-a str] [-i int] [-f FILE] [-o] str

Tetranucleotide frequency 1

positional arguments:
  str                   A positional argument 2

optional arguments:
  -h, --help            show this help message and exit 3
  -a str, --arg str     A named string argument (default: ) 4
  -i int, --int int     A named integer argument (default: 0) 5
  -f FILE, --file FILE  A readable file (default: None) 6
  -o, --on              A boolean flag (default: False) 7
1

The --purpose from new.py is used here to describe the program.

2

The program accepts a single positional string argument.

3

The -h and --help flags are automatically added by argparse and will trigger the usage.

4

This is a named option with short (-a) and long (--arg) names for a string value.

5

This is a named option with short (-i) and long (--int) names for an integer value.

6

This is a named option with short (-f) and long (--file) names for a file argument.

7

This is a Boolean flag that will be True when either -o or --on is present and False when they are absent.

This program only needs the str positional argument, and you can use DNA for the metavar value to give some indication to the user as to the meaning of the argument. Delete all the other parameters. Note that you never define the -h and --help flags, as argparse uses those internally to respond to usage requests. See if you can modify your program until it will produce the usage that follows (if you can’t produce the usage just yet, don’t worry, I’ll show this in the next section):

$ ./dna.py -h
usage: dna.py [-h] DNA

Tetranucleotide frequency

positional arguments:
  DNA         Input DNA sequence

optional arguments:
  -h, --help  show this help message and exit

If you can manage to get this working, I’d like to point out that this program will accept exactly one positional argument. If you try running it with any other number of arguments, the program will immediately halt and print an error message:

$ ./dna.py AACC GGTT
usage: dna.py [-h] DNA
dna.py: error: unrecognized arguments: GGTT

Likewise, the program will reject any unknown flags or options. With very few lines of code, you have built a documented program that validates the arguments to the program. That’s a very basic and important step toward reproducibility.

Using argparse

The program created by new.py uses the argparse module to define the program’s parameters, validate that the arguments are correct, and create the usage documentation for the user. The argparse module is a standard Python module, which means it’s always present. Other modules can also do these things, and you are free to use any method you like to handle this aspect of your program. Just be sure your programs can pass the tests.

I wrote a version of new.py for Tiny Python Projects that you can find in the bin directory of that book’s GitHub repo. That version is somewhat simpler than the version I want you to use. I’ll start by showing you a version of dna.py created using this earlier version of new.py:

#!/usr/bin/env python3 1
""" Tetranucleotide frequency """ 2

import argparse 3


# --------------------------------------------------
def get_args(): 4
    """ Get command-line arguments """ 5

    parser = argparse.ArgumentParser( 6
        description='Tetranucleotide frequency',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('dna', metavar='DNA', help='Input DNA sequence') 7

    return parser.parse_args() 8


# --------------------------------------------------
def main(): 9
    """ Make a jazz noise here """

    args = get_args() 10
    print(args.dna)   11


# --------------------------------------------------
if __name__ == '__main__': 12
    main()
1

The colloquial shebang (#!) tells the operating system to use the env command (environment) to find python3 to execute the rest of the program.

2

This is a docstring (documentation string) for the program or module as a whole.

3

I import the argparse module to handle command-line arguments.

4

I always define a get_args() function to handle the argparse code.

5

This is a docstring for a function.

6

The parser object is used to define the program’s parameters.

7

I define a dna argument, which will be positional because the name dna does not start with a dash. The metavar is a short description of the argument that will appear in the short usage. No other arguments are needed.

8

The function returns the results of parsing the arguments. The help flags or any problems with the arguments will cause argparse to print a usage statement/error messages and exit the program.

9

All programs in the book will always start in the main() function.

10

The first step in main() will always be to call get_args(). If this call succeeds, then the arguments must have been valid.

11

The DNA value is available in the args.dna attribute, as this is the name of the argument.

12

This is a common idiom in Python programs to detect when the program is being executed (as opposed to being imported) and to execute the main() function.

The shebang line is used by the Unix shell when the program is invoked as a program, like ./dna.py. It does not work on Windows, where you are required to run python.exe dna.py to execute the program.

While this code works completely adequately, the value returned from get_args() is an argparse.Namespace object that is dynamically generated when the program runs. That is, I am using code like parser.add_argument() to modify the structure of this object at runtime, so Python is unable to know positively at compile time what attributes will be available in the parsed arguments or what their types would be. While it may be obvious to you that there can only be a single, required string argument, there is not enough information in the code for Python to discern this.

To compile a program is to turn it into the machine code that a computer can execute. Some languages, like C, must be compiled separately before they can be run. Python programs are often compiled and run in one step, but there is still a compilation phase. Some errors can be caught at compilation, and others don’t turn up until runtime. For instance, syntax errors will prevent compilation. It is preferable to have compile-time errors over runtime errors.

To see why this could be a problem, I’ll alter the main() function to introduce a type error. That is, I’ll intentionally misuse the type of the args.dna value. Unless otherwise stated, all argument values returned from the command line by argparse are strings. If I try to divide the string args.dna by the integer value 2, Python will raise an exception and crash the program at runtime:

def main():
    args = get_args()
    print(args.dna / 2) 1
1

Dividing a string by an integer will produce an exception.

If I run the program, it crashes as expected:

$ ./dna.py ACGT
Traceback (most recent call last):
  File "./dna.py", line 30, in <module>
    main()
  File "./dna.py", line 25, in main
    print(args.dna / 2)
TypeError: unsupported operand type(s) for /: 'str' and 'int'

Our big squishy brains know that this is an inevitable error waiting to happen, but Python can’t see the problem. What I need is a static definition of the arguments that cannot be modified when the program is run. Read on to see how type annotations and other tools can detect these sorts of bugs.

Tools for Finding Errors in the Code

The goal here is to write correct, reproducible programs in Python. Are there ways to spot and avoid problems like misusing a string in a numeric operation? The python3 interpreter found no problems that prevented me from running the code. That is, the program is syntactically correct, so the code in the preceding section produces a runtime error because the error happens only when I execute the program. Years back I worked in a group where we joked, “If it compiles, ship it!” This is clearly a myopic approach when coding in Python.

I can use tools like linters and type checkers to find some kinds of problems in code. Linters are tools that check for program style and many kinds of errors beyond bad syntax. The pylint tool is a popular Python linter that I use almost every day. Can it find this problem? Apparently not, as it gives the biggest of thumbs-ups:

$ pylint dna.py

-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 9.78/10, +0.22)

The flake8 tool is another linter that I often use in combination with pylint, as it will report different kinds of errors. When I run flake8 dna.py, I get no output, which means it found no errors to report.

The mypy tool is a static type checker for Python, meaning it is designed to find misused types such as trying to divide a string by a number. Neither pylint nor flake8 is designed to catch type errors, so I cannot be legitimately surprised they missed the bug. So what does mypy have to say?

$ mypy dna.py
Success: no issues found in 1 source file

Well, that’s just a little disappointing; however, you must understand that mypy is failing to report a problem because there is no type information. That is, mypy has no information to say that dividing args.dna by 2 is wrong. I’ll fix that shortly.

Introducing Named Tuples

To avoid the problems with dynamically generated objects, all of the programs in this book will use a named tuple data structure to statically define the arguments from get_args(). Tuples are essentially immutable lists, and they are often used to represent record-type data structures in Python. There’s quite a bit to unpack with all that, so let’s step back to lists.

To start, lists are ordered sequences of items. The items can be heterogeneous; in theory, this means all the items can be of different types, but in practice, mixing types is often a bad idea. I’ll use the python3 REPL to demonstrate some aspects of lists. I recommend you use help(list) to read the documentation.

Use empty square brackets ([]) to create an empty list that will hold some sequences:

>>> seqs = []

The list() function will also create a new, empty list:

>>> seqs = list()

Verify that this is a list by using the type() function to return the variable’s type:

>>> type(seqs)
<class 'list'>

Lists have methods that will add values to the end of the list, like list.append() to add one value:

>>> seqs.append('ACT')
>>> seqs
['ACT']

and list.extend() to add multiple values:

>>> seqs.extend(['GCA', 'TTT'])
>>> seqs
['ACT', 'GCA', 'TTT']

If you type the variable by itself in the REPL, it will be evaluated and stringified into a textual representation:

>>> seqs
['ACT', 'GCA', 'TTT']

This is basically the same thing that happens when you print() a variable:

>>> print(seqs)
['ACT', 'GCA', 'TTT']

You can modify any of the values in-place using the index. Remember that all indexing in Python is 0-based, so 0 is the first element. Change the first sequence to be TCA:

>>> seqs[0] = 'TCA'

Verify that it was changed:

>>> seqs
['TCA', 'GCA', 'TTT']

Like lists, tuples are ordered sequences of possibly heterogeneous objects. Whenever you put commas between items in a series, you are creating a tuple:

>>> seqs = 'TCA', 'GCA', 'TTT'
>>> type(seqs)
<class 'tuple'>

It’s typical to place parentheses around tuple values to make this more explicit:

>>> seqs = ('TCA', 'GCA', 'TTT')
>>> type(seqs)
<class 'tuple'>

Unlike lists, tuples cannot be changed once they are created. If you read help(tuple), you will see that a tuple is a built-in immutable sequence, so I cannot add values:

>>> seqs.append('GGT')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

or modify existing values:

>>> seqs[0] = 'TCA'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

It’s fairly common in Python to use tuples to represent records. For instance, I might represent a Sequence having a unique ID and a string of bases:

>>> seq = ('CAM_0231669729', 'GTGTTTATTCAATGCTAG')

While it’s possible to use indexing to get the values from a tuple just as with lists, that’s awkward and error-prone. Named tuples allow me to assign names to the fields, which makes them more ergonomic to use. To use named tuples, I can import the namedtuple() function from the collections module:

>>> from collections import namedtuple

As shown in Figure 1-2, I use the namedtuple() function to create the idea of a Sequence that has fields for the id and the seq:

>>> Sequence = namedtuple('Sequence', ['id', 'seq'])
mpfb 0102
Figure 1-2. The namedtuple() function generates a way to make objects of the class Sequence that have the fields id and seq

What exactly is Sequence here?

>>> type(Sequence)
<class 'type'>

I’ve just created a new type. You might call the Sequence() function a factory because it’s a function used to generate new objects of the class Sequence. It’s a common naming convention for these factory functions and class names to be TitleCased to set them apart.

Just as I can use the list() function to create a new list, I can use the Sequence() function to create a new Sequence object. I can pass the id and seq values positionally to match the order they are defined in the class:

>>> seq1 = Sequence('CAM_0231669729', 'GTGTTTATTCAATGCTAG')
>>> type(seq1)
<class '__main__.Sequence'>

Or I can use the field names and pass them as key/value pairs in any order I like:

>>> seq2 = Sequence(seq='GTGTTTATTCAATGCTAG', id='CAM_0231669729')
>>> seq2
Sequence(id='CAM_0231669729', seq='GTGTTTATTCAATGCTAG')

While it’s possible to use indexes to access the ID and sequence:

>>> 'ID = ' + seq1[0]
'ID = CAM_0231669729'
>>> 'seq = ' + seq1[1]
'seq = GTGTTTATTCAATGCTAG'

…​the whole point of named tuples is to use the field names:

>>> 'ID = ' + seq1.id
'ID = CAM_0231669729'
>>> 'seq = ' + seq1.seq
'seq = GTGTTTATTCAATGCTAG'

The record’s values remain immutable:

>>> seq1.id = 'XXX'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute

I often want a guarantee that a value cannot be accidentally changed in my code. Python doesn’t have a way to declare that a variable is constant or immutable. Tuples are by default immutable, and I think it makes sense to represent the arguments to a program using a data structure that cannot be altered. The inputs are sacrosanct and should (almost) never be modified.

Adding Types to Named Tuples

As nice as namedtuple() is, I can make it even better by importing the NamedTuple class from the typing module to use as the base class for the Sequence. Additionally, I can assign types to the fields using this syntax. Note the need to use an empty line in the REPL to indicate that the block is complete:

>>> from typing import NamedTuple
>>> class Sequence(NamedTuple):
...     id: str
...     seq: str
...

The ... you see are line continuations. The REPL is showing that what’s been entered so far is not a complete expression. You need to enter a blank line to let the REPL know that you’re done with the code block.

As with the namedtuple() method, Sequence is a new type:

>>> type(Sequence)
<class 'type'>

The code to instantiate a new Sequence object is the same:

>>> seq3 = Sequence('CAM_0231669729', 'GTGTTTATTCAATGCTAG')
>>> type(seq3)
<class '__main__.Sequence'>

I can still access the fields by names:

>>> seq3.id, seq3.seq
('CAM_0231669729', 'GTGTTTATTCAATGCTAG')

Since I defined that both fields have str types, you might assume this would not work:

>>> seq4 = Sequence(id='CAM_0231669729', seq=3.14)

I’m sorry to tell you that Python itself ignores the type information. You can see the seq field that I hoped would be a str is actually a float:

>>> seq4
Sequence(id='CAM_0231669729', seq=3.14)
>>> type(seq4.seq)
<class 'float'>

So how does this help us? It doesn’t help me in the REPL, but adding types to my source code will allow type-checking tools like mypy to find such errors.

Representing the Arguments with a NamedTuple

I want the data structure that represents the program’s arguments to include type information. As with the Sequence class, I can define a class that is derived from the NamedTuple type where I can statically define the data structure with types. I like to call this class Args, but you can call it whatever you like. I know this probably seems like driving a finishing nail with a sledgehammer, but trust me, this kind of detail will pay off in the future.

The latest new.py uses the NamedTuple class from the typing module. Here is how I suggest you define and represent the arguments:

#!/usr/bin/env python3
"""Tetranucleotide frequency"""

import argparse
from typing import NamedTuple 1


class Args(NamedTuple): 2
    """ Command-line arguments """
    dna: str 3


# --------------------------------------------------
def get_args() -> Args: 4
    """ Get command-line arguments """

    parser = argparse.ArgumentParser(
        description='Tetranucleotide frequency',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('dna', metavar='DNA', help='Input DNA sequence')

    args = parser.parse_args() 5

    return Args(args.dna) 6


# --------------------------------------------------
def main() -> None: 7
    """ Make a jazz noise here """

    args = get_args()
    print(args.dna / 2) 8


# --------------------------------------------------
if __name__ == '__main__':
    main()
1

Import the NamedTuple class from the typing module.

2

Define a class for the arguments which is based on the NamedTuple class. See the following note.

3

The class has a single field called dna that has the type str.

4

The type annotation on the get_args() function shows that it returns an object of the type Args.

5

Parse the arguments as before.

6

Return a new Args object that contains the single value from args.dna.

7

The main() function has no return statement, so it returns the default None value.

8

This is the type error from the earlier program.

If you run pylint on this program, you may encounter the errors “Inheriting NamedTuple, which is not a class. (inherit-non-class)” and “Too few public methods (0/2) (too-few-public-methods).” You can disable these warnings by adding “inherit-non-class” and “too-few-public-methods” to the “disable” section of your pylintrc file, or use the pylintrc file included in the root of the GitHub repository.

If you run this program, you’ll see it still creates the same uncaught exception. Both flake8 and pylint will continue to report that the program looks fine, but see what mypy tells me now:

$ mypy dna.py
dna.py:32: error: Unsupported operand types for / ("str" and "int")
Found 1 error in 1 file (checked 1 source file)

The error message shows that there is a problem on line 32 with the operands, which are the arguments to the division (/) operator. I’m mixing string and integer values. Without the type annotations, mypy would be unable to find a bug. Without this warning from mypy, I’d have to run my program to find it, being sure to exercise the branch of code that contains the error. In this case, it’s all rather obvious and trivial, but in a much larger program with hundreds or thousands of lines of code (LOC) with many functions and logical branches (like if/else), I might not stumble upon this error. I rely on types and programs like mypy (and pylint and flake8 and so on) to correct these kinds of errors rather than relying solely on tests, or worse, waiting for users to report bugs.

Reading Input from the Command Line or a File

When you attempt to prove that your program works on the Rosalind.info website, you will download a data file containing the input to your program. Usually, this data will be much larger than the sample data described in the problem. For instance, the example DNA string for this problem is 70 bases long, but the one I downloaded for one of my attempts was 910 bases.

Let’s make the program read input both from the command line and from a text file so that you don’t have to copy and paste the contents from a downloaded file. This is a common pattern I use, and I prefer to handle this option inside the get_args() function since this pertains to processing the command-line arguments.

First, correct the program so that it prints the args.dna value without the division:

def main() -> None:
    args = get_args()
    print(args.dna) 1
1

Remove the division type error.

Check that it works:

$ ./dna.py ACGT
ACGT

For this next part, you need to bring in the os module to interact with your operating system. Add import os to the other import statements at the top, then add these two lines to your get_args() function:

def get_args() -> Args:
    """ Get command-line arguments """

    parser = argparse.ArgumentParser(
        description='Tetranucleotide frequency',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('dna', metavar='DNA', help='Input DNA sequence')

    args = parser.parse_args()

    if os.path.isfile(args.dna):  1
        args.dna = open(args.dna).read().rstrip()   2

    return Args(args.dna)
1

Check if the dna value is a file.

2

Call open() to open a filehandle, then chain the fh.read() method to return a string, then chain the str.rstrip() method to remove trailing whitespace.

The fh.read() function will read an entire file into a variable. In this case, the input file is small and so this should be fine, but it’s very common in bioinformatics to process files that are gigabytes in size. Using read() on a large file could crash your program or even your entire computer. Later I will show you how to read a file line-by-line to avoid this.

Now run your program with a string value to ensure it works:

$ ./dna.py ACGT
ACGT

and then use a text file as the argument:

$ ./dna.py tests/inputs/input2.txt
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Now you have a flexible program that reads input from two sources. Run mypy dna.py to make sure there are no problems.

Testing Your Program

You know from the Rosalind description that given the input ACGT, the program should print 1 1 1 1 since that is the number of As, Cs, Gs, and Ts, respectively. In the 01_dna/tests directory, there is a file called dna_test.py that contains tests for the dna.py program. I wrote these tests for you so you can see what it’s like to develop a program using a method to tell you with some certainty when your program is correct. The tests are really basic—given an input string, the program should print the correct counts for the four nucleotides. When the program reports the correct numbers, then it works.

Inside the 01_dna directory, I’d like you to run pytest (or python3 -m pytest or pytest.exe on Windows). The program will recursively search for all files with names that start with test_ or end with _test.py. It will then run for any functions in these files that have names starting with test_.

When you run pytest, you will see a lot of output, most of which is failing tests. To understand why these tests are failing, let’s look at the tests/dna_test.py module:

""" Tests for dna.py """ 1

import os 2
import platform 3
from subprocess import getstatusoutput 4

PRG = './dna.py' 5
RUN = f'python {PRG}' if platform.system() == 'Windows' else PRG 6
TEST1 = ('./tests/inputs/input1.txt', '1 2 3 4') 7
TEST2 = ('./tests/inputs/input2.txt', '20 12 17 21')
TEST3 = ('./tests/inputs/input3.txt', '196 231 237 246')
1

This is the docstring for the module.

2

The standard os module will interact with the operating system.

3

The platform module is used to determine if this is being run on Windows.

4

From the subprocess module I import a function to run the dna.py program and capture the output and status.

5

These following lines are global variables for the program. I tend to avoid globals except in my tests. Here I want to define some values that I’ll use in the functions. I like to use UPPERCASE_NAMES to highlight the global visibility.

6

The RUN variable determines how to run the dna.py program. On Windows, the python command must be used to run a Python program, but on Unix platforms, the dna.py program can be directly executed.

7

The TEST* variables are tuples that define a file containing a string of DNA and the expected output from the program for that string.

The pytest module will run the test functions in the order in which they are defined in the test file. I often structure my tests so that they progress from the simplest cases to more complex, so there’s usually no point in continuing after a failure. For instance, the first test is always that the program to test exists. If it doesn’t, then there’s no point in running more tests. I recommend you run pytest with the -x flag to stop on the first failing test along with the -v flag for verbose output.

Let’s look at the first test. The function is called test_exists() so that pytest will find it. In the body of the function, I use one or more assert statements to check if some condition is truthy.1 Here I assert that the program dna.py exists. This is why your program must exist in this directory—otherwise it wouldn’t be found by the test:

def test_exists(): 1
    """ Program exists """

    assert os.path.exists(PRG) 2
1

The function name must start with test_ to be found by pytest.

2

The os.path.exists() function returns True if the given argument is a file. If it returns False, then the assertion fails and this test will fail.

The next test I write is always to check that the program will produce a usage statement for the -h and --help flags. The subprocess.getstatusoutput() function will run the dna.py program with the short and long help flags. In each case, I want to see that the program prints text starting with the word usage:. It’s not a perfect test. It doesn’t check that the documentation is accurate, only that it appears to be something that might be a usage statement. I don’t feel that every test needs to be completely exhaustive. Here’s the test:

def test_usage() -> None:
    """ Prints usage """

    for arg in ['-h', '--help']: 1
        rv, out = getstatusoutput(f'{RUN} {arg}') 2
        assert rv == 0 3
        assert out.lower().startswith('usage:') 4
1

Iterate over the short and long help flags.

2

Run the program with the argument and capture the return value and output.

3

Verify that the program reports a successful exit value of 0.

4

Assert that the lowercased output of the program starts with the text usage:.

Command-line programs usually indicate an error to the operating system by returning a nonzero value. If the program runs successfully, it ought to return a 0. Sometimes that nonzero value may correlate to some internal error code, but often it just means that something went wrong. The programs I write will, likewise, always strive to report 0 for successful runs and some nonzero value when there are errors.

Next, I want to ensure that the program will die when given no arguments:

def test_dies_no_args() -> None:
    """ Dies with no arguments """

    rv, out = getstatusoutput(RUN) 1
    assert rv != 0 2
    assert out.lower().startswith('usage:') 3
1

Capture the return value and output from running the program with no arguments.

2

Verify that the return value is a nonzero failure code.

3

Check that the output looks like a usage statement.

At this point in testing, I know that I have a program with the correct name that can be run to produce documentation. This means that the program is at least syntactically correct, which is a decent place to start testing. If your program has typographical errors, then you’ll be forced to correct those to even get to this point.

Running the Program to Test the Output

Now I need to see if the program does what it’s supposed to do. There are many ways to test programs, and I like to use two basic approaches I call inside-out and outside-in. The inside-out approach starts at the level of testing individual functions inside a program. This is often called unit testing, as functions might be considered a basic unit of computing, and I’ll get to this in the solutions section. I’ll start with the outside-in approach. This means that I will run the program from the command line just as the user will run it. This is a holistic approach to check if the pieces of the code can work together to create the correct output, and so it’s sometimes called an integration test.

The first such test will pass the DNA string as a command-line argument and check if the program produces the right counts formatted in the correct string:

def test_arg():
    """ Uses command-line arg """

    for file, expected in [TEST1, TEST2, TEST3]: 1
        dna = open(file).read() 2
        retval, out = getstatusoutput(f'{RUN} {dna}') 3
        assert retval == 0 4
        assert out == expected 5
1

Unpack the tuples into the file containing a string of DNA and the expected value from the program when run with this input.

2

Open the file and read the dna from the contents.

3

Run the program with the given DNA string using the function subprocess.getstatusoutput(), which gives me both the return value from the program and the text output (also called STDOUT, which is pronounced standard out).

4

Assert that the return value is 0, which indicates success (or 0 errors).

5

Assert that the output from the program is the string of numbers expected.

The next test is almost identical, but this time I’ll pass the filename as the argument to the program to verify that it correctly reads the DNA from a file:

def test_file():
    """ Uses file arg """

    for file, expected in [TEST1, TEST2, TEST3]:
        retval, out = getstatusoutput(f'{RUN} {file}') 1
        assert retval == 0
        assert out == expected
1

The only difference from the first test is that I pass the filename instead of the contents of the file.

Now that you’ve looked at the tests, go back and run the tests again. This time, use pytest -xv, where the -v flag is for verbose output. Since both -x and -v are short flags, you can combine them like -xv or -vx. Read the output closely and notice that it’s trying to tell you that the program is printing the DNA sequence but that the test is expecting a sequence of numbers:

$ pytest -xv
============================= test session starts ==============================
...

tests/dna_test.py::test_exists PASSED                                    [ 25%]
tests/dna_test.py::test_usage PASSED                                     [ 50%]
tests/dna_test.py::test_arg FAILED                                       [ 75%]

=================================== FAILURES ===================================
___________________________________ test_arg ___________________________________

    def test_arg():
        """ Uses command-line arg """

        for file, expected in [TEST1, TEST2, TEST3]:
            dna = open(file).read()
            retval, out = getstatusoutput(f'{RUN} {dna}')
            assert retval == 0
>           assert out == expected  1
E           AssertionError: assert 'ACCGGGTTTT' == '1 2 3 4'  2
E             - 1 2 3 4
E             + ACCGGGTTTT

tests/dna_test.py:36: AssertionError
=========================== short test summary info ============================
FAILED tests/dna_test.py::test_arg - AssertionError: assert 'ACCGGGTTTT' == '...
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 failed, 2 passed in 0.35s ==========================
1

The > at the beginning of this line shows that this is the source of the error.

2

The output from the program was the string ACCGGGTTTT but the expected value was 1 2 3 4. Since these are not equal, an AssertionError exception is raised.

Let’s fix that. If you think you know how to finish the program, please jump right into your solution. First, perhaps try running your program to verify that it will report the correct number of As:

$ ./dna.py A
1 0 0 0

And then Cs:

$ ./dna.py C
0 1 0 0

and so forth with Gs and Ts. Then run pytest to see if it passes all the tests.

After you have a working version, consider trying to find as many different ways as you can to get the same answer. This is called refactoring a program. You need to start with something that works correctly, and then you try to improve it. The improvements can be measured in many ways. Perhaps you find a way to write the same idea using less code, or maybe you find a solution that runs faster. No matter what metric you’re using, keep running pytest to ensure the program is correct.

Solution 1: Iterating and Counting the Characters in a String

If you don’t know where to start, I’ll work through the first solution with you. The goal is to travel through all the bases in the DNA string. So, first I need to create a variable called dna by assigning it some value in the REPL:

>>> dna = 'ACGT'

Note that any value enclosed in quotes, whether single or double, is a string. Even a single character in Python is considered a string. I will often use the type() function to verify the type of a variable, and here I see that dna is of the class str (string):

>>> type(dna)
<class 'str'>

Type help(str) in the REPL to see all the wonderful things you can do with strings. This data type is especially important in genomics, where strings comprise so much of the data.

In the parlance of Python, I want to iterate through the characters of a string, which in this case are the nucleotides of DNA. A for loop will do that. Python sees a string as an ordered sequence of characters, and a for loop will visit each character from beginning to end:

>>> for base in dna: 1
...     print(base)  2
...
A
C
G
T
1

Each character in the dna string will be copied into the base variable. You could call this char, or c for character, or whatever else you like.

2

Each call to print() will end with a newline, so you’ll see each base on a separate line.

Later you will see that for loops can be used with lists and dictionaries and sets and lines in a file—basically any iterable data structure.

Counting the Nucleotides

Now that I know how to visit each base in the sequence, I need to count each base rather than printing it. That means I’ll need some variables to keep track of the numbers for each of the four nucleotides. One way to do this is to create four variables that hold integer counts, one for each base. I will initialize four variables for counting by setting their initial values to 0:

>>> count_a = 0
>>> count_c = 0
>>> count_g = 0
>>> count_t = 0

I could write this in one line by using the tuple unpacking syntax that I showed earlier:

>>> count_a, count_c, count_g, count_t = 0, 0, 0, 0

I need to look at each base and determine which variable to increment, making its value increase by 1. For instance, if the current base is a C, then I should increment the count_c variable. I could write this:

for base in dna:
    if base == 'C': 1
        count_c = count_c + 1 2
1

The == operator is used to compare two values for equality. Here I want to know if the current base is equal to the string C.

2

Set count_c equal to 1 greater than the current value.

The == operator is used to compare two values for equality. It works to compare two strings or two numbers. I showed earlier that division with / will raise an exception if you mix strings and numbers. What happens if you mix types with this operator, for example '3' == 3? Is this a safe operator to use without first comparing the types?

As shown in Figure 1-3, a shorter way to increment a variable uses the += operator to add whatever is on the righthand side (often noted as RHS) of the expression to whatever is on the lefthand side (or LHS):

mpfb 0103
Figure 1-3. The += operator will add the value on the righthand side to the variable on the lefthand side

Since I have four nucleotides to check, I need a way to combine three more if expressions. The syntax in Python for this is to use elif for else if and else for any final or default case. Here is a block of code I can enter in the program or the REPL that implements a simple decision tree:

dna = 'ACCGGGTTTT'
count_a, count_c, count_g, count_t = 0, 0, 0, 0
for base in dna:
    if base == 'A':
        count_a += 1
    elif base == 'C':
        count_c += 1
    elif base == 'G':
        count_g += 1
    elif base == 'T':
        count_t += 1

I should end up with counts of 1, 2, 3, and 4 for each of the sorted bases:

>>> count_a, count_c, count_g, count_t
(1, 2, 3, 4)

Now I need to report the outcome to the user:

>>> print(count_a, count_c, count_g, count_t)
1 2 3 4

That is the exact output the program expects. Notice that print() will accept multiple values to print, and it inserts a space between each value. If you read help(print) in the REPL, you’ll find that you can change this with the sep argument:

>>> print(count_a, count_c, count_g, count_t, sep='::')
1::2::3::4

The print() function will also put a newline at the end of the output, and this can likewise be changed using the end option:

>>> print(count_a, count_c, count_g, count_t, end='\n-30-\n')
1 2 3 4
-30-

Writing and Verifying a Solution

Using the preceding code, you should be able to create a program that passes all the tests. As you write, I would encourage you to regularly run pylint, flake8, and mypy to check your source code for potential errors. I would even go further and suggest you install the pytest extensions for these so that you can routinely incorporate such tests:

$ python3 -m pip install pytest-pylint pytest-flake8 pytest-mypy

Alternatively, I’ve placed a requirements.txt file in the root directory of the GitHub repo that lists various dependencies I’ll use throughout the book. You can install all these modules with the following command:

$ python3 -m pip install -r requirements.txt

With those extensions, you can run the following command to run not only the tests defined in the tests/dna_test.py file but also tests for linting and type checking using these tools:

$ pytest -xv --pylint --flake8 --mypy tests/dna_test.py
========================== test session starts ===========================
...
collected 7 items

tests/dna_test.py::FLAKE8 SKIPPED                                  [ 12%]
tests/dna_test.py::mypy PASSED                                     [ 25%]
tests/dna_test.py::test_exists PASSED                              [ 37%]
tests/dna_test.py::test_usage PASSED                               [ 50%]
tests/dna_test.py::test_dies_no_args PASSED                        [ 62%]
tests/dna_test.py::test_arg PASSED                                 [ 75%]
tests/dna_test.py::test_file PASSED                                [ 87%]
::mypy PASSED                                                      [100%]
================================== mypy ==================================

Success: no issues found in 1 source file
====================== 7 passed, 1 skipped in 0.58s ======================

Some tests are skipped when a cached version indicates nothing has changed since the last test. Run pytest with the ---cache-clear option to force the tests to run. Also, you may find you fail linting tests if your code is not properly formatted or indented. You can automatically format your code using yapf or black. Most IDEs and editors will provide an auto-format option.

That’s a lot to type, so I’ve created a shortcut for you in the form of a Makefile in the directory:

$ cat Makefile
.PHONY: test

test:
	python3 -m pytest -xv --flake8 --pylint --pylint-rcfile=../pylintrc \
    --mypy dna.py tests/dna_test.py

all:
	../bin/all_test.py dna.py

You can learn more about these files by reading Appendix A. For now, it’s enough to understand that if you have make installed on your system, you can use the command make test to run the command in the test target of the Makefile. If you don’t have make installed or you don’t want to use it, that’s fine too, but I suggest you explore how a Makefile can be used to document and automate processes.

There are many ways to write a passing version of dna.py, and I’d like to encourage you to keep exploring before you read the solutions. More than anything, I want to get you used to the idea of changing your program and then running the tests to see if it works. This is the cycle of test-driven development, where I first create some metric to decide when the program works correctly. In this instance, that is the dna_test.py program that is run by pytest.

The tests ensure I don’t stray from the goal, and they also let me know when I’ve met the requirements of the program. They are the specifications (also called specs) made incarnate as a program that I can execute. How else would I ever know when a program worked or was finished? Or, as Louis Srygley puts it, “Without requirements or design, programming is the art of adding bugs to an empty text file.”

Testing is essential to creating reproducible programs. Unless you can absolutely and automatically prove the correctness and predictability of your program when run with both good and bad data, then you’re not writing good software.

Additional Solutions

The program I wrote earlier in this chapter is the solution1_iter.py version in the GitHub repo, so I won’t bother reviewing that version. I would like to show you several alternate solutions that progress from simpler to more complex ideas. Please do not take this to mean they progress from worse to better. All versions pass the tests, so they are all equally valid. The point is to explore what Python has to offer for solving common problems. Note I will omit code they all have in common, such as the get_args() function.

Solution 2: Creating a count() Function and Adding a Unit Test

The first variation I’d like to show will move all the code in the main() function that does the counting into a count() function. You can define this function anywhere in your program, but I generally like get_args() first, main() second, and then other functions after that but before the final couplet that calls main().

For the following function, you will also need to import the typing.Tuple value:

def count(dna: str) -> Tuple[int, int, int, int]: 1
    """ Count bases in DNA """

    count_a, count_c, count_g, count_t = 0, 0, 0, 0 2
    for base in dna:
        if base == 'A':
            count_a += 1
        elif base == 'C':
            count_c += 1
        elif base == 'G':
            count_g += 1
        elif base == 'T':
            count_t += 1

    return (count_a, count_c, count_g, count_t) 3
1

The types show that the function takes a string and returns a tuple containing four integer values.

2

This is the code from main() that did the counting.

3

Return a tuple of the four counts.

There are many reasons to move this code into a function. To start, this is a unit of computation—given a string of DNA, return the tetranucleotide frequency—so it makes sense to encapsulate it. This will make main() shorter and more readable, and it allows me to write a unit test for the function. Since the function is called count(), I like to call the unit test test_count(). I have placed this function inside the dna.py program just after the count() function rather than in the dna_test.py program just as a matter of convenience. For short programs, I tend to put my functions and unit tests together in the source code, but as projects grow larger, I will segregate unit tests into a separate module. Here’s the test function:

def test_count() -> None: 1
    """ Test count """

    assert count('') == (0, 0, 0, 0) 2
    assert count('123XYZ') == (0, 0, 0, 0)
    assert count('A') == (1, 0, 0, 0) 3
    assert count('C') == (0, 1, 0, 0)
    assert count('G') == (0, 0, 1, 0)
    assert count('T') == (0, 0, 0, 1)
    assert count('ACCGGGTTTT') == (1, 2, 3, 4)
1

The function name must start with test_ to be found by pytest. The types here show that the test accepts no arguments and, because it has no return statement, returns the default None value.

2

I like to test functions with both expected and unexpected values to ensure they return something reasonable. The empty string should return all zeros.

3

The rest of the tests ensure that each base is reported in the correct position.

To verify that my function works, I can use pytest on the dna.py program:

$ pytest -xv dna.py
=========================== test session starts ===========================
...

dna.py::test_count PASSED                                           [100%]

============================ 1 passed in 0.01s ============================

The first test passes the empty string and expects to get all zeros for the counts. This is a judgment call, honestly. You might decide your program ought to complain to the user that there’s no input. That is, it’s possible to run the program using the empty string as the input, and this version will report the following:

$ ./dna.py ""
0 0 0 0

Likewise, if I passed an empty file, I’d get the same answer. Use the touch command to create an empty file:

$ touch empty
$ ./dna.py empty
0 0 0 0

On Unix systems, /dev/null is a special filehandle that returns nothing:

$ ./dna.py /dev/null
0 0 0 0

You may feel that no input is an error and report it as such. The important thing about the test is that it forces me to think about it. For instance, should the count() function return zeros or raise an exception if it’s given an empty string? Should the program crash on empty input and exit with a nonzero status? These are decisions you will have to make for your programs.

Now that I have a unit test in the dna.py code, I can run pytest on that file to see if it passes:

$ pytest -v dna.py
============================ test session starts =============================
...
collected 1 item

dna.py::test_count PASSED                                              [100%]

============================= 1 passed in 0.01s ==============================

When I’m writing code, I like to write functions that do just one limited thing with as few parameters as possible. Then I like to write a test with a name like test_ plus the function name, usually right after the function in the source code. If I find I have many of these kinds of unit tests, I might decide to move them to a separate file and have pytest execute that file.

To use this new function, modify main() like so:

def main() -> None:
    args = get_args()
    count_a, count_c, count_g, count_t = count(args.dna) 1
    print('{} {} {} {}'.format(count_a, count_c, count_g, count_t)) 2
1

Unpack the four values returned from count() into separate variables.

2

Use str.format() to create the output string.

Let’s focus for a moment on Python’s str.format(). As shown in Figure 1-4, the string '{} {} {} {}' is a template for the output I want to generate, and I’m calling the str.format() function directly on a string literal. This is a common idiom in Python that you’ll also see with the str.join() function. It’s important to remember that, in Python, even a literal string (one that literally exists inside your source code in quotes) is an object upon which you can call methods.

mpfb 0104
Figure 1-4. The str.format() function uses a template containing curly brackets to define placeholders that are filled in with the values of the arguments

Every {} in the string template is a placeholder for some value that is provided as an argument to the function. When using this function, you need to ensure that you have the same number of placeholders as arguments. The arguments are inserted in the order in which they are provided. I’ll have much more to say about the str.format() function later.

I’m not required to unpack the tuple returned by the count() function. I can pass the entire tuple as the argument to the str.format() function if I splat it by adding an asterisk (*) to the front. This tells Python to expand the tuple into its values:

def main() -> None:
    args = get_args()
    counts = count(args.dna) 1
    print('{} {} {} {}'.format(*counts)) 2
1

The counts variable is a 4-tuple of the integer base counts.

2

The *counts syntax will expand the tuple into the four values needed by the format string; otherwise, the tuple would be interpreted as a single value.

Since I use the counts variable only once, I could skip the assignment and shrink this to one line:

def main() -> None:
    args = get_args()
    print('{} {} {} {}'.format(*count(args.dna))) 1
1

Pass the return value from count() directly to the str.format() method.

The first solution is arguably easier to read and understand, and tools like flake8 would be able to spot when the number of {} placeholders does not match the number of variables. Simple, verbose, obvious code is often better than compact, clever code. Still, it’s good to know about tuple unpacking and splatting variables as I’ll use these in ideas in later programs.

Solution 3: Using str.count()

The previous count() function turns out to be quite verbose. I can write the function using a single line of code using the str.count() method. This function will count the number of times one string is found inside another string. Let me show you in the REPL:

>>> seq = 'ACCGGGTTTT'
>>> seq.count('A')
1
>>> seq.count('C')
2

If the string is not found, it will report 0, making this safe to count all four nucleotides even when the input sequence is missing one or more bases:

>>> 'AAA'.count('T')
0

Here is a new version of the count() function using this idea:

def count(dna: str) -> Tuple[int, int, int, int]: 1
    """ Count bases in DNA """

    return (dna.count('A'), dna.count('C'), dna.count('G'), dna.count('T')) 2
1

The signature is the same as before.

2

Call the dna.count() method for each of the four bases.

This code is much more succinct, and I can use the same unit test to verify that it’s correct. This is a key point: functions should act like black boxes. That is, I do not know or care what happens inside the box. Something goes in, an answer comes out, and I only really care that the answer is correct. I am free to change what happens inside the box so long as the contract to the outside—the parameters and return value—stays the same.

Here’s another way to create the output string in the main() function using Python’s f-string syntax:

def main() -> None:
    args = get_args()
    count_a, count_c, count_g, count_t = count(args.dna) 1
    print(f'{count_a} {count_c} {count_g} {count_t}') 2
1

Unpack the tuple into each of the four counts.

2

Use an f-string to perform variable interpolation.

It’s called an f-string because the f precedes the quotes. I use the mnemonic format to remind me this is to format a string. Python also has a raw string that is preceded with an r, which I’ll discuss later. All strings in Python—bare, f-, or r-strings—can be enclosed in single or double quotes. It makes no difference.

With f-strings, the {} placeholders can perform variable interpolation, which is a 50-cent word that means turning a variable into its contents. Those curlies can even execute code. For instance, the len() function will return the length of a string and can be run inside the braces:

>>> seq = 'ACGT'
>>> f'The sequence "{seq}" has {len(seq)} bases.'
'The sequence "ACGT" has 4 bases.'

I usually find f-strings easier to read than the equivalent code using str.format(). Which you choose is mostly a stylistic decision. I would recommend whichever makes your code more readable.

Solution 4: Using a Dictionary to Count All the Characters

So far I’ve discussed Python’s strings, lists, and tuples. This next solution introduces dictionaries, which are key/value stores. I’d like to show a version of the count() function that internally uses dictionaries so that I can hit on some important points to understand:

def count(dna: str) -> Tuple[int, int, int, int]: 1
    """ Count bases in DNA """

    counts = {} 2
    for base in dna: 3
        if base not in counts: 4
            counts[base] = 0 5
        counts[base] += 1 6

    return (counts.get('A', 0), 7
            counts.get('C', 0),
            counts.get('G', 0),
            counts.get('T', 0))
1

Internally I’ll use a dictionary, but nothing changes about the function signature.

2

Initialize an empty dictionary to hold the counts.

3

Use a for loop to iterate through the sequence.

4

Check if the base does not yet exist in the dictionary.

5

Initialize the value for this base to 0.

6

Increment the count for this base by 1.

7

Use the dict.get() method to get each base’s count or the default of 0.

Again, the contract for this function—the type signature—hasn’t changed. It’s still a string in and 4-tuple of integers out. Inside the function, I’m going to use a dictionary that I’ll initialize using the empty curly brackets:

>>> counts = {}

I could also use the dict() function. Neither is preferable:

>>> counts = dict()

I can use the type() function to check that this is a dictionary:

>>> type(counts)
<class 'dict'>

The isinstance() function is another way to check the type of a variable:

>>> isinstance(counts, dict)
True

My goal is to create a dictionary that has each base as a key and the number of times it occurs as a value. For example, given the sequence ACCGGGTTT, I want counts to look like this:

>>> counts
{'A': 1, 'C': 2, 'G': 3, 'T': 4}

I can access any of the values using square brackets and a key name like so:

>>> counts['G']
3

Python will raise a KeyError exception if I attempt to access a dictionary key that doesn’t exist:

>>> counts['N']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'N'

I can use the in keyword to see if a key exists in a dictionary:

>>> 'N' in counts
False
>>> 'T' in counts
True

As I am iterating through each of the bases in the sequence, I need to see if a base exists in the counts dictionary. If it does not, I need to initialize it to 0. Then I can safely use the += assignment to increment the count for a base by 1:

>>> seq = 'ACCGGGTTTT'
>>> counts = {}
>>> for base in seq:
...     if not base in counts:
...         counts[base] = 0
...     counts[base] += 1
...
>>> counts
{'A': 1, 'C': 2, 'G': 3, 'T': 4}

Finally, I want to return a 4-tuple of the counts for each of the bases. You might think this would work:

>>> counts['A'], counts['C'], counts['G'], counts['T']
(1, 2, 3, 4)

But ask yourself what would happen if one of the bases was missing from the sequence. Would this pass the unit test I wrote? Definitely not. It would fail on the very first test using an empty string because it would generate a KeyError exception. The safe way to ask a dictionary for a value is to use the dict.get() method. If the key does not exist, then None will be returned:

>>> counts.get('T')
4
>>> counts.get('N')

The dict.get() method accepts an optional second argument that is the default value to return when the key does not exist, so this is the safest way to return a 4-tuple of the base counts:

>>> counts.get('A', 0), counts.get('C', 0), counts.get('G', 0),
    counts.get('T', 0)
(1, 2, 3, 4)

No matter what you write inside your count() function, ensure that it will pass the test_count() unit test.

Solution 5: Counting Only the Desired Bases

The previous solution will count every character in the input sequence, but what if I only want to count the four nucleotides? In this solution, I will initialize a dictionary with values of 0 for the wanted bases. I’ll need to also bring in typing.Dict to run this code:

def count(dna: str) -> Dict[str, int]: 1
    """ Count bases in DNA """

    counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0} 2
    for base in dna: 3
        if base in counts: 4
            counts[base] += 1 5

    return counts 6
1

The signature now indicates I’ll be returning a dictionary that has strings for the keys and integers for the values.

2

Initialize the counts dictionary with the four bases as keys and values of 0.

3

Iterate through the bases.

4

Check if the base is found as a key in the counts dictionary.

5

If so, increment the counts for this base by 1.

6

Return the counts dictionary.

Since the count() function is now returning a dictionary rather than a tuple, the test_count() function needs to change:

def test_count() -> None:
    """ Test count """

    assert count('') == {'A': 0, 'C': 0, 'G': 0, 'T': 0} 1
    assert count('123XYZ') == {'A': 0, 'C': 0, 'G': 0, 'T': 0} 2
    assert count('A') == {'A': 1, 'C': 0, 'G': 0, 'T': 0}
    assert count('C') == {'A': 0, 'C': 1, 'G': 0, 'T': 0}
    assert count('G') == {'A': 0, 'C': 0, 'G': 1, 'T': 0}
    assert count('T') == {'A': 0, 'C': 0, 'G': 0, 'T': 1}
    assert count('ACCGGGTTTT') == {'A': 1, 'C': 2, 'G': 3, 'T': 4}
1

The returned dictionary will always have the keys A, C, G, and T. Even for the empty string, these keys will be present and set to 0.

2

All the other tests have the same inputs, but now I check that the answer comes back as a dictionary.

When writing these tests, note that the order of the keys in the dictionaries is not important. The two dictionaries in the following code have the same content even though they were defined differently:

>>> counts1 = {'A': 1, 'C': 2, 'G': 3, 'T': 4}
>>> counts2 = {'T': 4, 'G': 3, 'C': 2, 'A': 1}
>>> counts1 == counts2
True

I would point out that the test_count() function tests the function to ensure it’s correct and also serves as documentation. Reading these tests helps me see the structure of the possible inputs and expected outputs from the function.

Here’s how I need to change the main() function to use the returned dictionary:

def main() -> None:
    args = get_args()
    counts = count(args.dna) 1
    print('{} {} {} {}'.format(counts['A'], counts['C'], counts['G'], 2
                               counts['T']))
1

counts is now a dictionary.

2

Use the str.format() method to create the output using the values from the dictionary.

Solution 6: Using collections.defaultdict()

I can rid my code of all the previous efforts to initialize dictionaries and check for keys and such by using the defaultdict() function from the collections module:

>>> from collections import defaultdict

When I use the defaultdict() function to create a new dictionary, I tell it the default type for the values. I no longer have to check for a key before using it because the defaultdict type will automatically create any key I reference using a representative value of the default type. For the case of counting the nucleotides, I want to use the int type:

>>> counts = defaultdict(int)

The default int value will be 0. Any reference to a nonexistent key will cause it to be created with a value of 0:

>>> counts['A']
0

This means I can instantiate and increment any base in one step:

>>> counts['C'] += 1
>>> counts
defaultdict(<class 'int'>, {'A': 0, 'C': 1})

Here is how I could rewrite the count() function using this idea:

def count(dna: str) -> Dict[str, int]:
    """ Count bases in DNA """

    counts: Dict[str, int] = defaultdict(int) 1

    for base in dna:
        counts[base] += 1 2

    return counts
1

The counts will be a defaultdict with integer values. The type annotation here is required by mypy so that it can be sure that the returned value is correct.

2

I can safely increment the counts for this base.

The test_count() function looks quite different. I can see at a glance that the answers are very different from the previous versions:

def test_count() -> None:
    """ Test count """

    assert count('') == {} 1
    assert count('123XYZ') == {'1': 1, '2': 1, '3': 1, 'X': 1, 'Y': 1, 'Z': 1} 2
    assert count('A') == {'A': 1} 3
    assert count('C') == {'C': 1}
    assert count('G') == {'G': 1}
    assert count('T') == {'T': 1}
    assert count('ACCGGGTTTT') == {'A': 1, 'C': 2, 'G': 3, 'T': 4}
1

Given an empty string, an empty dictionary will be returned.

2

Notice that every character in the string is a key in the dictionary.

3

Only A is present, with a count of 1.

Given the fact that the returned dictionary may not contain all the bases, the code in main() needs to use the count.get() method to retrieve each base’s frequency:

def main() -> None:
    args = get_args()
    counts = count(args.dna) 1
    print(counts.get('A', 0), counts.get('C', 0), counts.get('G', 0), 2
          counts.get('T', 0))
1

The counts will be a dictionary that may not contain all of the nucleotides.

2

It’s safest to use the dict.get() method with a default value of 0.

Solution 7: Using collections.Counter()

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

Antoine de Saint-Exupéry

I don’t actually like the last three solutions all that much, but I needed to step through how to use a dictionary both manually and with defaultdict() so that you can appreciate the simplicity of using collections.Counter():

>>> from collections import Counter
>>> Counter('ACCGGGTTT')
Counter({'G': 3, 'T': 3, 'C': 2, 'A': 1})

The best code is code you never write, and Counter() is a prepackaged function that will return a dictionary with the frequency of the items contained in the iterable you pass it. You might also hear this called a bag or a multiset. Here the iterable is a string composed of characters, and so I get back the same dictionary as in the last two solutions, but having written no code.

It’s so simple that you could pretty much eschew the count() and test_count() functions and integrate it directly into your main():

def main() -> None:
    args = get_args()
    counts = Counter(args.dna) 1
    print(counts.get('A', 0), counts.get('C', 0), counts.get('G', 0), 2
          counts.get('T', 0))
1

The counts will be a dictionary containing the frequencies of the characters in args.dna.

2

It is still safest to use dict.get() as I cannot be certain that all the bases are present.

I could argue that this code belongs in a count() function and keep the tests, but the Counter() function is already tested and has a well-defined interface. I think it makes more sense to use this function inline.

Going Further

The solutions here only handle DNA sequences provided as UPPERCASE TEXT. It’s not unusual to see these sequences provided as lowercase letters. For instance, in plant genomics, it’s common to use lowercase bases to denote regions of repetitive DNA. Modify your program to handle both uppercase and lowercase input by doing the following:

  1. Add a new input file that mixes case.

  2. Add a test to tests/dna_test.py that uses this new file and specifies the expected counts insensitive to case.

  3. Run the new test and ensure your program fails.

  4. Alter the program until it will pass the new test and all of the previous tests.

The solutions that used dictionaries to count all available characters would appear to be more flexible. That is, some of the tests only account for the bases A, C, G, and T, but if the input sequence were encoded using IUPAC codes to represent possible ambiguity in sequencing, then the program would have to be entirely rewritten. A program hard-coded to look only at the four nucleotides would also be useless for protein sequences that use a different alphabet. Consider writing a version of the program that will print two columns of output with each character that is found in the first column and the character’s frequency in the second. Allow the user to sort ascending or descending by either column.

Review

This was kind of a monster chapter. The following chapters will be a bit shorter, as I’ll build upon many of the foundational ideas I’ve covered here:

  • You can use the new.py program to create the basic structure of a Python program that accepts and validates command-line arguments using argparse.

  • The pytest module will run all functions with names starting with test_ and will report the results of how many tests pass.

  • Unit tests are for functions, and integration tests check if a program works as a whole.

  • Programs like pylint, flake8, and mypy can find various kinds of errors in your code. You can also have pytest automatically run tests to see if your code passes these checks.

  • Complicated commands can be stored as a target in a Makefile and executed using the make command.

  • You can create a decision tree using a series of if/else statements.

  • There are many ways to count all the characters in a string. Using the collections.Counter() function is perhaps the simplest method to create a dictionary of letter frequencies.

  • You can annotate variables and functions with types, and use mypy to ensure the types are used correctly.

  • The Python REPL is an interactive tool for executing code examples and reading documentation.

  • The Python community generally follows style guidelines such as PEP8. Tools like yapf and black can automatically format code according to these suggestions, and tools like pylint and flake8 will report deviations from the guidelines.

  • Python strings, lists, tuples, and dictionaries are very powerful data structures, each with useful methods and copious documentation.

  • You can create a custom, immutable, typed class derived from named tuples.

You may be wondering which is the best of the seven solutions. As with many things in life, it depends. Some programs are shorter to write and easier to understand but may fare poorly when confronting large datasets. In Chapter 2, I’ll show you how to benchmark programs, pitting them against each other in multiple runs using large inputs to determine which performs the best.

1 Boolean types are True or False, but many other data types are truthy or conversely falsey. The empty str ("") is falsey, so any nonempty string is truthy. The number 0 is falsey, so any nonzero value is truthy. An empty list, set, or dict is falsey, so any nonempty one of those is truthy.

Get Mastering Python for Bioinformatics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.