Chapter 4. Head Aches

Stand on your own head for a change / Give me some skin to call my own

They Might Be Giants, “Stand on Your Own Head” (1988)

The challenge in this chapter is to implement the head program, which will print the first few lines or bytes of one or more files. This is a good way to peek at the contents of a regular text file and is often a much better choice than cat. When faced with a directory of something like output files from some process, using head can help you quickly scan for potential problems. It’s particularly useful when dealing with extremely large files, as it will only read the first few bytes or lines of a file (as opposed to cat, which will always read the entire file).

In this chapter, you will learn how to do the following:

  • Create optional command-line arguments that accept values

  • Parse a string into a number

  • Write and run a unit test

  • Use a match arm with a guard

  • Convert between types using From, Into, and as

  • Use take on an iterator or a filehandle

  • Preserve line endings while reading a filehandle

  • Read bytes versus characters from a filehandle

  • Use the turbofish operator

How head Works

I’ll start with an overview of head so you know what’s expected of your program. There are many implementations of the original AT&T Unix operating system, such as Berkeley Software Distribution (BSD), SunOS/Solaris, HP-UX, and Linux. Most of these operating systems have some version of a head program that will default to showing the first 10 lines of one or more files. Most will probably have options -n to control the number of lines shown and -c to instead show some number of bytes. The BSD version has only these two options, which I can see via man head:

HEAD(1)                   BSD General Commands Manual                  HEAD(1)

NAME
     head -- display first lines of a file

SYNOPSIS
     head [-n count | -c bytes] [file ...]

DESCRIPTION
     This filter displays the first count lines or bytes of each of the speci-
     fied files, or of the standard input if no files are specified.  If count
     is omitted it defaults to 10.

     If more than a single file is specified, each file is preceded by a
     header consisting of the string ''==> XXX <=='' where ''XXX'' is the name
     of the file.

EXIT STATUS
     The head utility exits 0 on success, and >0 if an error occurs.

SEE ALSO
     tail(1)

HISTORY
     The head command appeared in PWB UNIX.

BSD                              June 6, 1993                              BSD

With the GNU version, I can run head --help to read the usage:

Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]K         print the first K bytes of each file;
                             with the leading '-', print all but the last
                             K bytes of each file
  -n, --lines=[-]K         print the first K lines instead of the first 10;
                             with the leading '-', print all but the last
                             K lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
      --help     display this help and exit
      --version  output version information and exit

K may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

Note that the GNU version lets you specify -n and -c with negative numbers and with suffixes like K, M, etc., which the challenge program will not implement. In both the BSD and GNU versions, the files are optional positional arguments, and the program reads STDIN by default or when a filename is a dash.

To demonstrate how head works, I’ll use the files found in 04_headr/tests/inputs:

  • empty.txt: an empty file

  • one.txt: a file with one line of text

  • two.txt: a file with two lines of text

  • three.txt: a file with three lines of text and Windows line endings

  • ten.txt: a file with 10 lines of text

Given an empty file, there is no output, which you can verify with head tests/inputs/empty.txt. As mentioned, head will print the first 10 lines of a file by default:

$ head tests/inputs/ten.txt
one
two
three
four
five
six
seven
eight
nine
ten

The -n option allows you to control the number of lines that are shown. For instance, I can choose to show only the first two lines with the following command:

$ head -n 2 tests/inputs/ten.txt
one
two

The -c option shows only the given number of bytes from a file. For instance, I can show just the first two bytes:

$ head -c 2 tests/inputs/ten.txt
on

Oddly, the GNU version will allow you to provide both -n and -c and defaults to showing bytes. The BSD version will reject both arguments:

$ head -n 1 -c 2 tests/inputs/one.txt
head: can't combine line and byte counts

Any value for -n or -c that is not a positive integer will generate an error that will halt the program, and the error will echo back the illegal value:

$ head -n 0 tests/inputs/one.txt
head: illegal line count -- 0
$ head -c foo tests/inputs/one.txt
head: illegal byte count -- foo

When there are multiple arguments, head adds a header and inserts a blank line between each file. Notice in the following output that the first character in tests/inputs/one.txt is an Ö, a silly multibyte character I inserted to force the program to discern between bytes and characters:

$ head -n 1 tests/inputs/*.txt
==> tests/inputs/empty.txt <==

==> tests/inputs/one.txt <==
Öne line, four words.

==> tests/inputs/ten.txt <==
one

==> tests/inputs/three.txt <==
Three

==> tests/inputs/two.txt <==
Two lines.

With no file arguments, head will read from STDIN:

$ cat tests/inputs/ten.txt | head -n 2
one
two

As with cat in Chapter 3, any nonexistent or unreadable file is skipped and a warning is printed to STDERR. In the following command, I will use blargh as a nonexistent file and will create an unreadable file called cant-touch-this:

$ touch cant-touch-this && chmod 000 cant-touch-this
$ head blargh cant-touch-this tests/inputs/one.txt
head: blargh: No such file or directory
head: cant-touch-this: Permission denied
==> tests/inputs/one.txt <==
Öne line, four words.

This is as much as this chapter’s challenge program will need to implement.

Getting Started

You might have anticipated that the program I want you to write will be called headr (pronounced head-er). Start by running cargo new headr, then add the following dependencies to your Cargo.toml:

[dependencies]
clap = "2.33"

[dev-dependencies]
assert_cmd = "2"
predicates = "2"
rand = "0.8"

Copy my 04_headr/tests directory into your project directory, and then run cargo test. All the tests should fail. Your mission, should you choose to accept it, is to write a program that will pass these tests. I propose you again split your source code so that src/main.rs looks like this:

fn main() {
    if let Err(e) = headr::get_args().and_then(headr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}

Begin your src/lib.rs by bringing in clap and the Error trait and declaring MyResult, which you can copy from the source code in Chapter 3:

use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

The program will have three parameters that can be represented with a Config struct:

#[derive(Debug)]
pub struct Config {
    files: Vec<String>, 1
    lines: usize, 2
    bytes: Option<usize>, 3
}
1

files will be a vector of strings.

2

The number of lines to print will be of the type usize.

3

bytes will be an optional usize.

The primitive usize is the pointer-sized unsigned integer type, and its size varies from 4 bytes on a 32-bit operating system to 8 bytes on a 64-bit system. Rust also has an isize type, which is a pointer-sized signed integer, which you would need to represent negative numbers as the GNU version does. Since you only want to store positive numbers à la the BSD version, you can stick with an unsigned type. Note that Rust also has the types u32/i32 (unsigned/signed 32-bit integer) and u64/i64 (unsigned/signed 64-bit integer) if you want finer control over how large these values can be.
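
To make the difference between the unsigned and signed types concrete, here is a minimal sketch that uses only the standard library. The function names are hypothetical and chosen only for illustration:

```rust
use std::mem::size_of;

/// Size of `usize` in bytes: 4 on a 32-bit target, 8 on a 64-bit target.
fn usize_bytes() -> usize {
    size_of::<usize>()
}

/// An unsigned type cannot represent a result below zero,
/// so `checked_sub` reports that case as `None`.
fn unsigned_diff(a: usize, b: usize) -> Option<usize> {
    a.checked_sub(b)
}

/// A signed type can go negative, which GNU-style counts would need.
fn signed_diff(a: isize, b: isize) -> isize {
    a - b
}
```

With these definitions, unsigned_diff(2, 5) is None while signed_diff(2, 5) is -3, and the fixed-width types keep the same size on every platform (four bytes for u32, eight for u64).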

The lines and bytes parameters will be used in a couple of functions, one of which expects a usize and the other a u64. This will provide an opportunity later to discuss how to convert between types. Your program should use 10 as the default value for lines, but bytes will be an Option, which I first introduced in Chapter 2. This means that bytes will either be Some(n) for a usize value n if the user provides a valid value or None if they do not.

Next, create your get_args function in src/lib.rs with the following outline. You need to add the code to parse the arguments and return a Config struct:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust head")
        // What goes here?
        .get_matches();

    Ok(Config {
        files: ...
        lines: ...
        bytes: ...
    })
}
Tip

All the command-line arguments for this program are optional because files will default to a dash (-), lines will default to 10, and bytes can be left out. The optional arguments in Chapter 3 were flags, but here lines and bytes will need Arg::takes_value set to true.

The values that clap returns will be strings, so you will need to convert lines and bytes to integers in order to place them into the Config struct. In the next section, I’ll show you how to do this. In the meantime, create a run function that prints the configuration:

pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", config); 1
    Ok(()) 2
}
1

Pretty-print the config. You could also use dbg!(config).

2

Return a successful result.

Writing a Unit Test to Parse a String into a Number

All command-line arguments are strings, and so it falls on our code to check that the lines and bytes values are valid integer values. In the parlance of computer science, we must parse these values to see if they look like positive whole numbers. The str::parse function will parse a string slice into some other type, such as a usize. This function will return a Result that will be an Err variant when the value cannot be parsed into a number, or an Ok containing the converted number. I’ve written a function called parse_positive_int that attempts to parse a string value into a positive usize value. Add the following function to your src/lib.rs:

fn parse_positive_int(val: &str) -> MyResult<usize> { 1
    unimplemented!(); 2
}
1

This function accepts a &str and will either return a positive usize or an error.

2

The unimplemented! macro will cause the program to panic or prematurely terminate with the message not implemented.

Tip

You can manually call the panic! macro to kill the program with a given error.
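
As a brief illustration of the str::parse behavior described above, the following sketch simply forwards the Result that parsing produces. The helper name to_usize is hypothetical and is not part of the challenge program:

```rust
use std::num::ParseIntError;

/// Parse a string slice into a usize and hand back the Result unchanged.
/// The target type is inferred from the function's return type.
fn to_usize(val: &str) -> Result<usize, ParseIntError> {
    val.parse()
}
```

Here to_usize("42") is Ok(42) and to_usize("foo") is an Err. Note that to_usize("0") is Ok(0), so parsing alone does not establish that a value is positive.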

In all the previous chapters, we’ve used only integration tests that run and test the program as a whole from the command line just as the user will do. Next, I will show you how to write a unit test to check the parse_positive_int function in isolation. I recommend adding this just after the parse_positive_int function:

#[test]
fn test_parse_positive_int() {
    // 3 is an OK integer
    let res = parse_positive_int("3");
    assert!(res.is_ok());
    assert_eq!(res.unwrap(), 3);

    // Any string is an error
    let res = parse_positive_int("foo");
    assert!(res.is_err());
    assert_eq!(res.unwrap_err().to_string(), "foo".to_string());

    // A zero is an error
    let res = parse_positive_int("0");
    assert!(res.is_err());
    assert_eq!(res.unwrap_err().to_string(), "0".to_string());
}
Note

To run just this one test, execute cargo test parse_positive_int. Stop reading now and write a version of the function that passes the test. I’ll wait here until you finish.

TIME PASSES.
AUTHOR GETS A CUP OF TEA AND CONSIDERS HIS LIFE CHOICES.
AUTHOR RETURNS TO THE NARRATIVE.

How did that go? Swell, I bet! Here is the function I wrote that passes the preceding tests:

fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse() { 1
        Ok(n) if n > 0 => Ok(n), 2
        _ => Err(From::from(val)), 3
    }
}
1

Attempt to parse the given value. Rust infers the usize type from the return type.

2

If the parse succeeds and the parsed value n is greater than 0, return it as an Ok variant.

3

For any other outcome, return an Err with the given value.

I’ve used match several times so far, but this is the first time I’m showing that match arms can include a guard, which is an additional check after the pattern match. I don’t know about you, but I think that’s pretty sweet. Without the guard, I would have to write something much longer and more redundant, like this:

fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse() {
        Ok(n) => {
            if n > 0 {
                Ok(n) 1
            } else {
                Err(From::from(val)) 2
            }
        }
        _ => Err(From::from(val)),
    }
}
1

After the value is parsed as a usize, check if it is greater than 0. If so, return an Ok.

2

Otherwise, turn the given value into an error.
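
Guards are not specific to Result values. As a small sketch on an unrelated, made-up example, the following function mixes guarded arms with a literal pattern and a catch-all:

```rust
/// Describe a temperature in degrees Celsius.
/// Arms are tried top to bottom; a guarded arm matches only
/// when its pattern matches and its `if` condition is true.
fn describe(temp: i32) -> &'static str {
    match temp {
        t if t < 0 => "freezing",
        0 => "exactly zero",
        t if t < 25 => "mild",
        _ => "hot",
    }
}
```

For instance, describe(-5) returns "freezing" and describe(10) returns "mild". The order of the arms matters because the first arm whose pattern and guard both succeed is the one that runs.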

Converting Strings into Errors

When I’m unable to parse a given string value into a positive integer, I want to return the original string so it can be included in an error message. To do this in the parse_positive_int function, I am using the redundantly named From::from to turn the input &str value into an Error. Consider the following version, where I put the unparsable string directly into the Err:

fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse() {
        Ok(n) if n > 0 => Ok(n),
        _ => Err(val), // This will not compile
    }
}

If I try to compile this, I get the following error:

error[E0308]: mismatched types
  --> src/lib.rs:71:18
   |
71 |         _ => Err(val),
   |                  ^^^ expected struct `Box`, found `&str`
   |
   = note: expected struct `Box<dyn std::error::Error>`
           found reference `&str`
   = note: for more on the distinction between the stack and the heap,
     read https://doc.rust-lang.org/book/ch15-01-box.html,
     https://doc.rust-lang.org/rust-by-example/std/box.html, and
     https://doc.rust-lang.org/std/boxed/index.html
     help: store this in the heap by calling `Box::new`
   |
71 |         _ => Err(Box::new(val)),
   |                  +++++++++   +

The problem is that the function is expected to return a MyResult, which is defined as either an Ok containing a value of any type T or an Err containing something that implements the Error trait and is stored in a Box:

type MyResult<T> = Result<T, Box<dyn Error>>;

In the preceding code, &str neither implements Error nor lives in a Box. I can try to fix this according to the compiler error suggestions by placing the value into a Box. Unfortunately, this still won’t compile as I still haven’t satisfied the Error trait:

error[E0277]: the trait bound `str: std::error::Error` is not satisfied
  --> src/lib.rs:71:18
   |
71 |         _ => Err(Box::new(val)),
   |                  ^^^^^^^^^^^^^ the trait `std::error::Error`
   |                                is not implemented for `str`
   |
   = note: required because of the requirements on the impl of
     `std::error::Error` for `&str`
   = note: required for the cast to the object type `dyn std::error::Error`

Enter the std::convert::From trait, which helps convert from one type to another. As the documentation states:

The From is also very useful when performing error handling. When constructing a function that is capable of failing, the return type will generally be of the form Result<T, E>. The From trait simplifies error handling by allowing a function to return a single error type that encapsulates multiple error types.

Figure 4-1 shows that I can convert &str into an Error using either std::convert::From or std::convert::Into. They each accomplish the same task, but val.into() is the shortest thing to type.

Figure 4-1. There are many ways to convert a &str to an Error using From and Into traits.
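
The following sketch spells out several conversions of this kind in code. It relies on the standard library's From<&str> implementation for Box<dyn Error>. The function name to_errors is hypothetical, and the list is not guaranteed to match the figure exactly:

```rust
use std::error::Error;

/// Convert the same &str into a boxed error in four equivalent ways.
fn to_errors(val: &str) -> Vec<Box<dyn Error>> {
    // Fully qualified call on the From trait; the target type
    // comes from the annotation on the left.
    let a: Box<dyn Error> = From::from(val);
    // The same conversion through the Into trait.
    let b: Box<dyn Error> = Into::into(val);
    // Method-call form of Into, the shortest to type.
    let c: Box<dyn Error> = val.into();
    // Naming the target type explicitly on the right-hand side.
    let d = Box::<dyn Error>::from(val);
    vec![a, b, c, d]
}
```

Each of the four errors prints as the original string, so calling to_string on any element of to_errors("foo") gives "foo".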

Now that you have a way to convert a string to a number, integrate it into your get_args. See if you can get your program to print a usage like the following. Note that I use the short and long names from the GNU version:

$ cargo run -- -h
headr 0.1.0
Ken Youens-Clark <kyclark@gmail.com>
Rust head

USAGE:
    headr [OPTIONS] [FILE]...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c, --bytes <BYTES>    Number of bytes
    -n, --lines <LINES>    Number of lines [default: 10]

ARGS:
    <FILE>...    Input file(s) [default: -]

Run the program with no inputs and verify the defaults are correctly set:

$ cargo run
Config {
    files: [ 1
        "-",
    ],
    lines: 10, 2
    bytes: None, 3
}
1

files should default to a dash (-) as the filename.

2

The number of lines should default to 10.

3

bytes should be None.

Now run the program with arguments and ensure they are correctly parsed:

$ cargo run -- -n 3 tests/inputs/one.txt
Config {
    files: [
        "tests/inputs/one.txt", 1
    ],
    lines: 3, 2
    bytes: None, 3
}
1

The positional argument tests/inputs/one.txt is parsed as one of the files.

2

The -n option for lines sets this to 3.

3

The -c option for bytes defaults to None.

If I provide more than one positional argument, they will all go into files, and the -c argument will go into bytes. In the following command, I’m again relying on the bash shell to expand the file glob *.txt into all the files ending in .txt. PowerShell users should refer to the equivalent use of Get-ChildItem shown in the section “Iterating Through the File Arguments”:

$ cargo run -- -c 4 tests/inputs/*.txt
Config {
    files: [
        "tests/inputs/empty.txt", 1
        "tests/inputs/one.txt",
        "tests/inputs/ten.txt",
        "tests/inputs/three.txt",
        "tests/inputs/two.txt",
    ],
    lines: 10, 2
    bytes: Some( 3
        4,
    ),
}
1

There are five files ending in .txt.

2

lines is still set to the default value of 10.

3

The -c 4 results in the bytes now being Some(4).

Any value for -n or -c that cannot be parsed into a positive integer should cause the program to halt with an error:

$ cargo run -- -n blargh tests/inputs/one.txt
illegal line count -- blargh
$ cargo run -- -c 0 tests/inputs/one.txt
illegal byte count -- 0

The program should disallow -n and -c being present together. Be sure to consult the clap documentation as you figure this out:

$ cargo run -- -n 1 -c 1 tests/inputs/one.txt
error: The argument '--lines <LINES>' cannot be used with '--bytes <BYTES>'
Note

Just parsing and validating the arguments is a challenge, but I know you can do it. Stop reading here and get your program to pass all the tests included with cargo test dies:

running 3 tests
test dies_bad_lines ... ok
test dies_bad_bytes ... ok
test dies_bytes_and_lines ... ok

Defining the Arguments

Welcome back. Now that your program can pass all of the tests included with cargo test dies, compare your solution to mine. Note that the two options for lines and bytes will take values. This is different from the flags implemented in Chapter 3 that are used as Boolean values:

    let matches = App::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust head")
        .arg(
            Arg::with_name("lines") 1
                .short("n")
                .long("lines")
                .value_name("LINES")
                .help("Number of lines")
                .default_value("10"),
        )
        .arg(
            Arg::with_name("bytes") 2
                .short("c")
                .long("bytes")
                .value_name("BYTES")
                .takes_value(true)
                .conflicts_with("lines")
                .help("Number of bytes"),
        )
        .arg(
            Arg::with_name("files") 3
                .value_name("FILE")
                .help("Input file(s)")
                .multiple(true)
                .default_value("-"),
        )
        .get_matches();
1

The lines option takes a value and defaults to 10.

2

The bytes option takes a value, and it conflicts with the lines parameter so that they are mutually exclusive.

3

The files parameter is positional, takes one or more values, and defaults to a dash (-).

Tip

The Arg::value_name will be printed in the usage documentation, so be sure to choose a descriptive name. Don’t confuse this with the Arg::with_name that uniquely defines the name of the argument for accessing within your code.

Following is how I can use parse_positive_int inside get_args to validate lines and bytes. When the function returns an Err variant, I use ? to propagate the error to main and end the program; otherwise, I return the Config:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("headr")... // Same as before

    let lines = matches
        .value_of("lines") 1
        .map(parse_positive_int) 2
        .transpose() 3
        .map_err(|e| format!("illegal line count -- {}", e))?; 4

    let bytes = matches 5
        .value_of("bytes")
        .map(parse_positive_int)
        .transpose()
        .map_err(|e| format!("illegal byte count -- {}", e))?;

    Ok(Config {
        files: matches.values_of_lossy("files").unwrap(), 6
        lines: lines.unwrap(), 7
        bytes 8
    })
}
1

ArgMatches::value_of returns an Option<&str>.

2

Use Option::map to unpack a &str from Some and send it to parse_positive_int.

3

The result of Option::map will be an Option<Result>, and Option::transpose will turn this into a Result<Option>.

4

In the event of an Err, create an informative error message. Use ? to propagate an Err or unpack the Ok value.

5

Do the same for bytes.

6

The files option should have at least one value, so it should be safe to call Option::unwrap.

7

The lines argument has a default value and is safe to unwrap.

8

The bytes argument should be left as an Option. Use the struct field init shorthand since the name of the field is the same as the variable.

In the preceding code, I could have written the Config with every key/value pair like so:

Ok(Config {
    files: matches.values_of_lossy("files").unwrap(),
    lines: lines.unwrap(),
    bytes: bytes,
})

While that is valid code, it’s not idiomatic Rust. The Rust code linter, Clippy, will suggest using field init shorthand:

$ cargo clippy
warning: redundant field names in struct initialization
  --> src/lib.rs:61:9
   |
61 |         bytes: bytes,
   |         ^^^^^^^^^^^^ help: replace it with: `bytes`
   |
   = note: `#[warn(clippy::redundant_field_names)]` on by default
   = help: for further information visit https://rust-lang.github.io/
     rust-clippy/master/index.html#redundant_field_names

It’s quite a bit of work to validate all the user input, but now I have some assurance that I can proceed with good data.
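
The Option::map and Option::transpose chain used in get_args can be tried in isolation without clap. This is a minimal sketch using only the standard library; the name parse_optional and the wording of its error message are assumptions made for illustration:

```rust
/// Parse an optional string into an optional number.
/// None stays None, Some("3") becomes Ok(Some(3)),
/// and an unparsable value becomes an Err with a message.
fn parse_optional(val: Option<&str>) -> Result<Option<usize>, String> {
    val.map(|v| v.parse::<usize>()) // Option<Result<usize, ParseIntError>>
        .transpose() // Result<Option<usize>, ParseIntError>
        .map_err(|e| format!("bad number -- {}", e))
}
```

The transpose step is what lets a single ? operator handle the error case while still leaving a missing value as None.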

Processing the Input Files

This challenge program should handle the input files just like the one in Chapter 3, so I suggest you add the open function to src/lib.rs:

fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}

Be sure to add all the required dependencies:

use clap::{App, Arg};
use std::error::Error;
use std::fs::File;
use std::io::{self, BufRead, BufReader};

Expand your run function to try opening the files, printing errors as you encounter them:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files { 1
        match open(&filename) { 2
            Err(err) => eprintln!("{}: {}", filename, err), 3
            Ok(_) => println!("Opened {}", filename), 4
        }
    }
    Ok(())
}
1

Iterate through each of the filenames.

2

Attempt to open the given file.

3

Print errors to STDERR.

4

Print a message that the file was successfully opened.

Run your program with a good file and a bad file to ensure it seems to work. In the following command, blargh represents a nonexistent file:

$ cargo run -- blargh tests/inputs/one.txt
blargh: No such file or directory (os error 2)
Opened tests/inputs/one.txt

Next, try to read the lines and then the bytes of a given file, then try to add the headers separating multiple file arguments. Look closely at the error output from head when handling invalid files. Notice that readable files have a header first and then the file output, but invalid files only print an error. Additionally, there is an extra blank line separating the output for the valid files:

$ head -n 1 tests/inputs/one.txt blargh tests/inputs/two.txt
==> tests/inputs/one.txt <==
Öne line, four words.
head: blargh: No such file or directory

==> tests/inputs/two.txt <==
Two lines.
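
One possible way to think about the header logic is as a small pure function. The sketch below is an assumption about how you might structure it, not a description of how head itself is implemented. The name header and its parameters are hypothetical:

```rust
/// Build the header line for one file, if a header is needed.
/// `file_num` is the zero-based position of the file in the argument
/// list and `num_files` is the total number of file arguments.
fn header(filename: &str, file_num: usize, num_files: usize) -> Option<String> {
    if num_files < 2 {
        // A single input gets no header at all.
        return None;
    }
    // Every header after the first is preceded by a blank line.
    let separator = if file_num > 0 { "\n" } else { "" };
    Some(format!("{}==> {} <==", separator, filename))
}
```

Because the blank line is tied to the position in the argument list, a skipped file such as blargh still leaves a separator before the next readable file, matching the output shown above.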

I’ve specifically designed some challenging inputs for you to consider. To see what you face, use the file command to report file type information:

$ file tests/inputs/*.txt
tests/inputs/empty.txt: empty 1
tests/inputs/one.txt:   UTF-8 Unicode text 2
tests/inputs/ten.txt:   ASCII text 3
tests/inputs/three.txt: ASCII text, with CRLF, LF line terminators 4
tests/inputs/two.txt:   ASCII text 5
1

This is an empty file just to ensure your program doesn’t fall over.

2

This file contains Unicode, as I put an umlaut over the O in Öne to force you to consider the differences between bytes and characters.

3

This file has 10 lines to ensure the default of 10 lines is shown.

4

This file has Windows-style line endings.

5

This file has Unix-style line endings.

Tip

On Windows, the newline is the combination of the carriage return and the line feed, often shown as CRLF or \r\n. On Unix platforms, only the newline is used, so LF or \n. These line endings must be preserved in the output from your program, so you will have to find a way to read the lines in a file without removing the line endings.

Reading Bytes Versus Characters

Before continuing, you should understand the difference between reading bytes and characters from a file. In the early 1960s, the American Standard Code for Information Interchange (ASCII, pronounced as-key) table of 128 characters represented all possible text elements in computing. It takes only seven bits (2^7 = 128) to represent this many characters. Usually a byte consists of eight bits, so the notions of byte and character were interchangeable.

Since the creation of Unicode (Universal Coded Character Set) to represent all the writing systems of the world (and even emojis), some characters may require up to four bytes. The Unicode standard defines several ways to encode characters, including UTF-8 (Unicode Transformation Format using eight bits). As noted, the file tests/inputs/one.txt begins with the character Ö, which is two bytes long in UTF-8. If you want head to show you this one character, you must request two bytes:

$ head -c 2 tests/inputs/one.txt
Ö

If you ask head to select just the first byte from this file, you get the byte value 195, which is not a valid UTF-8 string. The output is a special character that indicates a problem converting a character into Unicode:

$ head -c 1 tests/inputs/one.txt
�

The challenge program is expected to re-create this behavior. This is not an easy program to write, but you should be able to use std::io, std::fs::File, and std::io::BufReader to figure out how to read bytes and lines from each of the files. Note that in Rust, a String must be a valid UTF-8-encoded string, and this struct has, for instance, the method String::from_utf8_lossy that might prove useful. I’ve included a full set of tests in tests/cli.rs that you should have copied into your source tree.
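
The distinction between bytes and characters can be checked directly on a string in memory. The following sketch uses only the standard library; the helper names are hypothetical:

```rust
/// Return the number of bytes and the number of characters in a string.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

/// Decode the first `n` bytes of a string, replacing any invalid
/// UTF-8 sequence with the replacement character U+FFFD.
fn lossy_prefix(text: &str, n: usize) -> String {
    let bytes = text.as_bytes();
    // Never slice past the end of the input.
    let end = n.min(bytes.len());
    String::from_utf8_lossy(&bytes[..end]).into_owned()
}
```

For the text Öne, the counts are four bytes but three characters. Taking two bytes yields Ö, while taking only one byte yields the replacement character, just as head -c 1 does.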

Note

Stop reading here and finish the program. Use cargo test frequently to check your progress. Do your best to pass all the tests before looking at my solution.

Solution

This challenge proved more interesting than I anticipated. I thought it would be little more than a variation on cat, but it turned out to be quite a bit more difficult. I’ll walk you through how I arrived at my solution.

Reading a File Line by Line

After opening the valid files, I started by reading lines from the filehandle. I decided to modify some code from Chapter 3:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => {
                for line in file.lines().take(config.lines) { 1
                    println!("{}", line?); 2
                }
            }
        }
    }
    Ok(())
}
1

Use Iterator::take to select the desired number of lines from the filehandle.

2

Print the line to the console.

I think this is a fun solution because it uses the Iterator::take method to select the desired number of lines. I can run the program to select one line from a file, and it appears to work well:

$ cargo run -- -n 1 tests/inputs/ten.txt
one
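
To experiment with Iterator::take without touching the filesystem, you can wrap a string in std::io::Cursor, which implements BufRead. This is a minimal sketch with a hypothetical helper name, not part of the challenge program:

```rust
use std::io::{self, BufRead, Cursor};

/// Collect at most `n` lines from in-memory text.
/// BufRead::lines removes the line endings from each line.
fn first_lines(text: &str, n: usize) -> io::Result<Vec<String>> {
    Cursor::new(text).lines().take(n).collect()
}
```

Asking for two lines of "one\ntwo\nthree\n" returns one and two, and asking for more lines than exist simply returns everything available.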

If I run cargo test, the program passes almost half the tests, which seems pretty good for having implemented only a small portion of the specifications; however, it’s failing all the tests that use the Windows-encoded input file. To fix this problem, I have a confession to make.

Preserving Line Endings While Reading a File

I hate to break it to you, dear reader, but the catr program in Chapter 3 does not completely replicate the original cat program because it uses BufRead::lines to read the input files. The documentation for that function says, “Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.” I hope you’ll forgive me because I wanted to show you how easy it can be to read the lines of a file, but you should be aware that the catr program replaces Windows CRLF line endings with Unix-style newlines.

To fix this, I must instead use BufRead::read_line, which, according to the documentation, “will read bytes from the underlying stream until the newline delimiter (the 0xA byte) or EOF is found. Once found, all bytes up to, and including, the delimiter (if found) will be appended to buf.”1 Following is a version that will preserve the original line endings. With these changes, the program will pass more tests than it fails:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(mut file) => { 1
                let mut line = String::new(); 2
                for _ in 0..config.lines { 3
                    let bytes = file.read_line(&mut line)?; 4
                    if bytes == 0 { 5
                        break;
                    }
                    print!("{}", line); 6
                    line.clear(); 7
                }
            }
        };
    }
    Ok(())
}
1

Accept the filehandle as a mut (mutable) value.

2

Use String::new to create a new, empty mutable string buffer to hold each line.

3

Use for to iterate through a std::ops::Range to count up from zero to the requested number of lines. The variable name _ indicates I do not intend to use it.

4

Use BufRead::read_line to read the next line.

5

The filehandle will return zero bytes when it reaches the end, so break out of the loop.

6

Print the line, including the original line ending.

7

Use String::clear to empty the line buffer.
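
The difference between BufRead::lines and BufRead::read_line is easy to see side by side on in-memory input. In this sketch, std::io::Cursor stands in for a filehandle and both function names are hypothetical:

```rust
use std::io::{BufRead, Cursor};

/// Read the first line with BufRead::lines, which strips the line ending.
fn first_line_stripped(text: &str) -> String {
    Cursor::new(text)
        .lines()
        .next()
        .and_then(Result::ok)
        .unwrap_or_default()
}

/// Read the first line with BufRead::read_line, which keeps the line ending.
fn first_line_raw(text: &str) -> String {
    let mut line = String::new();
    // The input is a valid &str held in memory, so this read
    // is not expected to fail.
    Cursor::new(text)
        .read_line(&mut line)
        .expect("reading from memory");
    line
}
```

Given "Three\r\nlines\r\n", the first function returns Three while the second returns the line with its \r\n intact.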

If I run cargo test at this point, the program will pass almost all the tests for reading lines and will fail all those for reading bytes and handling multiple files.

Reading Bytes from a File

Next, I’ll handle reading bytes from a file. After I attempt to open the file, I check to see if config.bytes is Some number of bytes; otherwise, I’ll use the preceding code that reads lines. For the following code, be sure to add use std::io::Read to your imports:

for filename in config.files {
    match open(&filename) {
        Err(err) => eprintln!("{}: {}", filename, err),
        Ok(mut file) => {
            if let Some(num_bytes) = config.bytes { 1
                let mut handle = file.take(num_bytes as u64); 2
                let mut buffer = vec![0; num_bytes]; 3
                let bytes_read = handle.read(&mut buffer)?; 4
                print!(
                    "{}",
                    String::from_utf8_lossy(&buffer[..bytes_read]) 5
                );
            } else {
                ... // Same as before
            }
        }
    };
}
1

Use pattern matching to check if config.bytes is Some number of bytes to read.

2

Use take to read the requested number of bytes.

3

Create a mutable buffer of a fixed length num_bytes filled with zeros to hold the bytes read from the file.

4

Read the desired number of bytes from the filehandle into the buffer. The value bytes_read will contain the number of bytes that were actually read, which may be fewer than the number requested.

5

Convert the selected bytes into a string, which may not be valid UTF-8. Note the range operation to select only the bytes actually read.

Tip

The take method from the std::io::Read trait expects its argument to be the type u64, but I have a usize. I cast or convert the value using the as keyword.
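One caveat worth sketching: a single call to Read::read is allowed to return fewer bytes than the buffer can hold even before the end of the input, which can happen with pipes or other slow sources. A variation I might consider is to call Read::read_to_end on the handle returned by take, which keeps reading until the limit or the end of the input is reached. The function name read_up_to is hypothetical, and I use a Cursor to stand in for a file:

```rust
use std::io::{Cursor, Read};

/// Read at most `num_bytes` bytes from any reader.
fn read_up_to(reader: impl Read, num_bytes: usize) -> std::io::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    // take expects a u64, so cast the usize as in the chapter's code.
    // read_to_end loops until the take limit or EOF is reached.
    reader.take(num_bytes as u64).read_to_end(&mut buffer)?;
    Ok(buffer)
}

fn main() -> std::io::Result<()> {
    let bytes = read_up_to(Cursor::new("Hello, world"), 5)?;
    println!("{}", String::from_utf8_lossy(&bytes)); // Hello
    Ok(())
}
```

The buffer grows only as far as the bytes actually read, so there is no need to slice it afterward.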

As you saw in the case of selecting only part of a multibyte character, converting bytes to characters could fail because strings in Rust must be valid UTF-8. The String::from_utf8 function will return an Ok only if the string is valid, but String::from_utf8_lossy will convert invalid UTF-8 sequences to the unknown or replacement character:

$ cargo run -- -c 1 tests/inputs/one.txt
�
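Here is a small sketch contrasting the two conversions. I chose the character Ö as an example because it occupies two bytes in UTF-8, so taking only the first byte yields an invalid sequence. The helper names strict and lossy are my own:

```rust
/// Returns the string only if the bytes are valid UTF-8.
fn strict(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Replaces invalid sequences with U+FFFD, the replacement character.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let bytes = "Öne".as_bytes(); // [0xC3, 0x96, b'n', b'e']
    println!("{:?}", strict(&bytes[..1])); // None
    println!("{}", lossy(&bytes[..1])); // �
    println!("{}", lossy(&bytes[..2])); // Ö
}
```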

Let me show you another, much worse, way to read the bytes from a file. You can read the entire file into a string, convert that into a vector of bytes, and then select the first num_bytes:

let mut contents = String::new(); 1
file.read_to_string(&mut contents)?; // Danger here 2
let bytes = contents.as_bytes(); 3
print!("{}", String::from_utf8_lossy(&bytes[..num_bytes])); // More danger 4
1

Create a new string buffer to hold the contents of the file.

2

Read the entire file contents into the string buffer.

3

Use str::as_bytes to convert the contents into bytes (u8 or unsigned 8-bit integers).

4

Use String::from_utf8_lossy to turn a slice of bytes into a string.

As I’ve noted before, this approach can crash your program or computer if the file’s size exceeds the amount of memory on your machine. Another serious problem with the preceding code is that it assumes the slice operation bytes[..num_bytes] will succeed. If you use this code with an empty file, for instance, you’ll be asking for bytes that don’t exist. This will cause your program to panic and exit immediately with an error message:

$ cargo run -- -c 1 tests/inputs/empty.txt
thread 'main' panicked at 'range end index 1 out of range for slice of
length 0', src/lib.rs:80:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
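If you do need to slice a buffer, one way to avoid this particular panic is to clamp the end of the range to the length of the slice. The following is a minimal sketch with a hypothetical helper called first_n. It also shows that the get method returns None rather than panicking when the range is out of bounds:

```rust
/// Return at most the first `n` bytes without ever panicking.
fn first_n(bytes: &[u8], n: usize) -> &[u8] {
    let end = n.min(bytes.len()); // Never go past the end of the slice
    &bytes[..end]
}

fn main() {
    let empty: &[u8] = b"";
    // Indexing with empty[..1] would panic, but get returns an Option.
    println!("{:?}", empty.get(..1)); // None
    println!("{:?}", first_n(empty, 1)); // []
    println!("{:?}", first_n(b"abc", 2)); // [97, 98]
}
```

This addresses only the slicing problem. Reading the whole file into memory remains a poor choice for large inputs.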

Following is a safe—and perhaps the shortest—way to read the desired number of bytes from a file:

let bytes: Result<Vec<_>, _> = file.bytes().take(num_bytes).collect();
print!("{}", String::from_utf8_lossy(&bytes?));

In the preceding code, the type annotation Result<Vec<_>, _> is necessary because, without it, the compiler works backward from String::from_utf8_lossy, which expects a &[u8], and infers that the Ok value of bytes is a bare slice ([u8]), which has an unknown size. I must indicate I want a Vec, which is a smart pointer to heap-allocated memory. The underscores (_) indicate partial type annotation, which instructs the compiler to infer the remaining types. Without any type annotation for bytes, the compiler complains thusly:

error[E0277]: the size for values of type `[u8]` cannot be known at
compilation time
  --> src/lib.rs:95:58
   |
95 |                     print!("{}", String::from_utf8_lossy(&bytes?));
   |                                                          ^^^^^^^ doesn't
   |                                        have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[u8]`
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature

Note

You’ve now seen that the underscore (_) serves several different functions. As the prefix or name of a variable, it shows the compiler you don’t want to use the value. In a match arm, it is the wildcard for handling any case. When used in a type annotation, it tells the compiler to infer the type.

You can also indicate the type information on the righthand side of the expression using the turbofish operator (::<>). Often it’s a matter of style whether you indicate the type on the lefthand or righthand side, but later you will see examples where the turbofish is required for some expressions. Here’s what the previous example would look like with the type indicated with the turbofish instead:

let bytes = file.bytes().take(num_bytes).collect::<Result<Vec<_>, _>>();
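As a self-contained sketch of the turbofish in two common places, the following uses it with str::parse to name the target number type and with Iterator::collect to name the target collection. The function first_bytes is a hypothetical wrapper, and a Cursor stands in for the filehandle:

```rust
use std::io::{self, Cursor, Read};

/// Collect up to `n` bytes, stopping at the first read error.
fn first_bytes(reader: impl Read, n: usize) -> io::Result<Vec<u8>> {
    // The turbofish tells collect what to build; the underscores
    // leave the element and error types for the compiler to infer.
    reader.bytes().take(n).collect::<Result<Vec<_>, _>>()
}

fn main() {
    // The turbofish tells parse which number type to produce.
    let n = "3".parse::<usize>().unwrap();
    let bytes = first_bytes(Cursor::new("abcdef"), n).unwrap();
    println!("{}", String::from_utf8_lossy(&bytes)); // abc
}
```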

The unknown character produced by String::from_utf8_lossy (b'\xef\xbf\xbd') is not exactly the same output produced by the BSD head (b'\xc3'), making this somewhat difficult to test. If you look at the run helper function in tests/cli.rs, you’ll see that I read the expected value (the output from head) and use the same function to convert what could be invalid UTF-8 so that I can compare the two outputs. The run_stdin function works similarly:

fn run(args: &[&str], expected_file: &str) -> TestResult {
    // Extra work here due to lossy UTF-8
    let mut file = File::open(expected_file)?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    let expected = String::from_utf8_lossy(&buffer); 1

    Command::cargo_bin(PRG)?
        .args(args)
        .assert()
        .success()
        .stdout(predicate::eq(&expected.as_bytes() as &[u8])); 2

    Ok(())
}
1

Handle any invalid UTF-8 in expected_file.

2

Compare the output and expected values as a slice of bytes ([u8]).

Printing the File Separators

The last piece to handle is the separators between multiple files. As noted before, valid files have a header that puts the filename inside ==> and <== markers. Files after the first have an additional newline at the beginning to visually separate the output. This means I will need to know the number of the file that I’m handling, which I can get by using the Iterator::enumerate method. Following is the final version of my run function that will pass all the tests:

pub fn run(config: Config) -> MyResult<()> {
    let num_files = config.files.len(); 1

    for (file_num, filename) in config.files.iter().enumerate() { 2
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(mut file) => {
                if num_files > 1 { 3
                    println!(
                        "{}==> {} <==",
                        if file_num > 0 { "\n" } else { "" }, 4
                        filename
                    );
                }

                if let Some(num_bytes) = config.bytes {
                    let mut handle = file.take(num_bytes as u64);
                    let mut buffer = vec![0; num_bytes];
                    let bytes_read = handle.read(&mut buffer)?;
                    print!(
                        "{}",
                        String::from_utf8_lossy(&buffer[..bytes_read])
                    );
                } else {
                    let mut line = String::new();
                    for _ in 0..config.lines {
                        let bytes = file.read_line(&mut line)?;
                        if bytes == 0 {
                            break;
                        }
                        print!("{}", line);
                        line.clear();
                    }
                }
            }
        };
    }

    Ok(())
}
1

Use the Vec::len method to get the number of files.

2

Use the Iterator::enumerate method to track the file number and filenames.

3

Only print headers when there are multiple files.

4

Print a newline when file_num is greater than 0, which indicates any file after the first.
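The header logic can be isolated into a small function, which makes it easy to check on its own. This is only a sketch, and the function name header is my own invention rather than part of the program above:

```rust
/// Build the header for one file, or None when there is only one file.
fn header(file_num: usize, num_files: usize, filename: &str) -> Option<String> {
    if num_files > 1 {
        Some(format!(
            "{}==> {} <==",
            // Files after the first get a leading newline as a separator.
            if file_num > 0 { "\n" } else { "" },
            filename
        ))
    } else {
        None
    }
}

fn main() {
    let files = ["a.txt", "b.txt"];
    for (file_num, filename) in files.iter().enumerate() {
        if let Some(text) = header(file_num, files.len(), filename) {
            println!("{}", text);
        }
    }
}
```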

Going Further

There’s no reason to stop this party now. Consider implementing how the GNU head handles numeric values with suffixes and negative values. For instance, -c 1K (or --bytes=1K) means print the first 1,024 bytes of the file, and -n -3 (or --lines=-3) means print all but the last three lines of the file. You’ll need to change lines and bytes to signed integer values to store both positive and negative numbers. Be sure to run the GNU head with these arguments, capture the output to test files, and write tests to cover the new features you add.
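As a starting point, here is one hypothetical way to parse such values. Instead of a signed integer, this sketch uses an enum to record whether the count is measured from the start or the end of the file, which is a design choice of mine and not the only option. The names Count and parse_count are assumptions, and I handle only the K and M suffixes as powers of 1,024, which is a small subset of what the GNU head accepts:

```rust
#[derive(Debug, PartialEq)]
enum Count {
    First(u64),      // Print the first N lines or bytes
    AllButLast(u64), // Print all but the last N lines or bytes
}

fn parse_count(input: &str) -> Result<Count, String> {
    // A leading minus sign means "all but the last N".
    let (negative, rest) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };

    // Only K and M are handled in this sketch.
    let (digits, multiplier) = if let Some(d) = rest.strip_suffix('K') {
        (d, 1024)
    } else if let Some(d) = rest.strip_suffix('M') {
        (d, 1024 * 1024)
    } else {
        (rest, 1)
    };

    let value = digits
        .parse::<u64>()
        .map_err(|_| format!("invalid count: {}", input))?;
    let total = value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("count too large: {}", input))?;

    Ok(if negative {
        Count::AllButLast(total)
    } else {
        Count::First(total)
    })
}

fn main() {
    println!("{:?}", parse_count("1K")); // Ok(First(1024))
    println!("{:?}", parse_count("-3")); // Ok(AllButLast(3))
}
```

Whatever representation you choose, compare your program's behavior against the real GNU head before trusting it.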

You could also add an option for selecting characters in addition to bytes. You can use the str::chars method to split a string into characters. Finally, copy the test input file with the Windows line endings (tests/inputs/three.txt) to the tests for Chapter 3. Edit the mk-outs.sh for that program to incorporate this file, and then expand the tests and program to ensure that line endings are preserved.
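For the character option, a minimal sketch might combine str::chars with Iterator::take. The function name first_chars is hypothetical, and this version assumes the text has already been read into a string:

```rust
/// Return the first `n` characters (not bytes) of a string.
fn first_chars(text: &str, n: usize) -> String {
    // chars yields whole Unicode scalar values, so a multibyte
    // character is never split the way a byte slice could split it.
    text.chars().take(n).collect()
}

fn main() {
    println!("{}", first_chars("Öne line", 1)); // Ö
    println!("{}", first_chars("Öne line", 3)); // Öne
}
```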

Summary

This chapter dove into some fairly sticky subjects, such as converting types like a &str to a usize, a String to an Error, and a usize to a u64. When I was learning Rust, I felt like it took me quite a while to understand the differences between &str and String and why I need to use From::from to create the Err part of MyResult. If you still feel confused, just know that you won’t always. If you keep reading the docs and writing more code, it will eventually make sense.

Here are some things you accomplished in this chapter:

  • You learned to create optional parameters that can take values. Previously, the options were flags.

  • You saw that all command-line arguments are strings. You used the str::parse method to attempt the conversion of a string like "3" into the number 3.

  • You learned how to write and run a unit test for an individual function.

  • You learned to convert types using the as keyword or with traits like From and Into.

  • You found that using _ as the name or prefix of a variable is a way to indicate to the compiler that you don’t intend to use the value. When used in a type annotation, it tells the compiler to infer the type.

  • You learned that a match arm can incorporate an additional Boolean condition called a guard.

  • You learned how to use BufRead::read_line to preserve line endings while reading a filehandle.

  • You found that the take method works on both iterators and filehandles to limit the number of elements you select.

  • You learned to indicate type information on the lefthand side of an assignment or on the righthand side using the turbofish operator.

In the next chapter, you’ll learn more about Rust iterators and how to break input into lines, bytes, and characters.

1 EOF is an acronym for end of file.
