Chapter 4. Head Aches

Stand on your own head for a change / Give me some skin to call my own

They Might Be Giants, “Stand on Your Own Head” (1988)

The challenge in this chapter is to implement the head program, which will print the first few lines or bytes of one or more files. This is a good way to peek at the contents of a regular text file and is often a much better choice than cat. When faced with a directory of something like output files from some process, using head can help you quickly scan for potential problems. It’s particularly useful when dealing with extremely large files, as it will only read the first few bytes or lines of a file (as opposed to cat, which will always read the entire file).

In this chapter, you will learn how to do the following:

Create optional command-line arguments that accept numeric values
Convert between types using as
Use take on an iterator or a filehandle
Preserve line endings while reading a filehandle
Read bytes versus characters from a filehandle
Use the turbofish operator

How head Works

I’ll start with an overview of head so you know what’s expected of your program. There are many implementations of the original AT&T Unix operating system, such as Berkeley Software Distribution (BSD), SunOS/Solaris, HP-UX, and Linux. Most of these operating systems have some version of a head program that will default to showing the first 10 lines of 1 or more files. Most will probably have options -n to control the number of lines shown and -c to instead show some number of bytes. The BSD version has only these two options, which I can see via man head:

HEAD(1)                   BSD General Commands Manual                  HEAD(1)

NAME
     head -- display first lines of a file

SYNOPSIS
     head [-n count | -c bytes] [file ...]

DESCRIPTION
     This filter displays the first count lines or bytes of each of the speci-
     fied files, or of the standard input if no files are specified.  If count
     is omitted it defaults to 10.

     If more than a single file is specified, each file is preceded by a
     header consisting of the string ''==> XXX <=='' where ''XXX'' is the name
     of the file.

EXIT STATUS
     The head utility exits 0 on success, and >0 if an error occurs.

SEE ALSO
     tail(1)

HISTORY
     The head command appeared in PWB UNIX.

BSD                              June 6, 1993                              BSD

With the GNU version, I can run head --help to read the usage:

Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]K         print the first K bytes of each file;
                             with the leading '-', print all but the last
                             K bytes of each file
  -n, --lines=[-]K         print the first K lines instead of the first 10;
                             with the leading '-', print all but the last
                             K lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
      --help     display this help and exit
      --version  output version information and exit

K may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

Note that the GNU version accepts negative numbers for -n and -c as well as values with suffixes like K, M, etc., none of which the challenge program will implement. In both the BSD and GNU versions, the files are optional positional arguments, and the program reads STDIN when no files are given; the GNU version also reads STDIN when a filename is a dash.

To demonstrate how head works, I’ll use the files found in 04_headr/tests/inputs:

empty.txt: an empty file
one.txt: a file with one line of text
two.txt: a file with two lines of text
three.txt: a file with three lines of text and Windows line endings
twelve.txt: a file with 12 lines of text

Given an empty file, there is no output, which you can verify with head tests/inputs/empty.txt. As mentioned, head will print the first 10 lines of a file by default:

$ head tests/inputs/twelve.txt
one
two
three
four
five
six
seven
eight
nine
ten

The -n option allows you to control the number of lines that are shown. For instance, I can choose to show only the first two lines with the following command:

$ head -n 2 tests/inputs/twelve.txt
one
two

The -c option shows only the given number of bytes from a file. For instance, I can show just the first two bytes:

$ head -c 2 tests/inputs/twelve.txt
on

Oddly, the GNU version will allow you to provide both -n and -c and defaults to showing bytes. The BSD version will reject both arguments:

$ head -n 1 -c 2 tests/inputs/one.txt
head: can't combine line and byte counts

Any value for -n or -c that is not a positive integer will generate an error that will halt the program, and the error message will include the illegal value:

$ head -n 0 tests/inputs/one.txt
head: illegal line count -- 0
$ head -c foo tests/inputs/one.txt
head: illegal byte count -- foo

When there are multiple arguments, head adds a header and inserts a blank line between each file. Notice in the following output that the first character in tests/inputs/one.txt is an Ö, a silly multibyte character I inserted to force the program to discern between bytes and characters:

$ head -n 1 tests/inputs/*.txt
==> tests/inputs/empty.txt <==

==> tests/inputs/one.txt <==
Öne line, four words.

==> tests/inputs/three.txt <==
Three

==> tests/inputs/twelve.txt <==
one

==> tests/inputs/two.txt <==
Two lines.

With no file arguments, head will read from STDIN:

$ cat tests/inputs/twelve.txt | head -n 2
one
two

As with cat in Chapter 3, any nonexistent or unreadable file is skipped and a warning is printed to STDERR. In the following command, I will use blargh as a nonexistent file and will create an unreadable file called cant-touch-this:

$ touch cant-touch-this && chmod 000 cant-touch-this
$ head blargh cant-touch-this tests/inputs/one.txt
head: blargh: No such file or directory
head: cant-touch-this: Permission denied
==> tests/inputs/one.txt <==
Öne line, four words.

This is as much as this chapter’s challenge program will need to implement.

Getting Started

You might have anticipated that the program I want you to write will be called headr (pronounced head-er). Start by running cargo new headr, then add the following dependencies to your Cargo.toml:

[dependencies]
anyhow = "1.0.79"
clap = { version = "4.5.0", features = ["derive"] }

[dev-dependencies]
assert_cmd = "2.0.13"
predicates = "3.0.4"
pretty_assertions = "1.4.0"
rand = "0.8.5"

Copy my 04_headr/tests directory into your project directory, and then run cargo test. All the tests should fail. Your mission, should you choose to accept it, is to write a program that will pass these tests. I propose you begin src/main.rs with the following code to represent the program’s three parameters with an Args struct:

#[derive(Debug)]
struct Args {
    files: Vec<String>, 
    lines: u64, 
    bytes: Option<u64>, 
}

files will be a vector of strings.
lines, the number of lines to print, will be of the type u64.
bytes will be an optional u64.

Tip

All the command-line arguments for this program are optional because files should default to a dash (-), lines will default to 10, and bytes can be left out.

The primitive u64 is an unsigned integer that uses 8 bytes of memory and is similar to a usize, which is a pointer-sized unsigned integer type with a size that varies from 4 bytes on a 32-bit operating system to 8 bytes on a 64-bit system. Rust also has an isize type, which is a pointer-sized signed integer that you would need to represent negative numbers as the GNU version does. Since you only want to store positive numbers à la the BSD version, you can stick with an unsigned type. Note the other Rust types of u32/i32 (unsigned/signed 32-bit integer) and u64/i64 (unsigned/signed 64-bit integer) if you want finer control over how large these values can be.
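If you want to check these sizes on your own machine, the following is a small sketch using std::mem::size_of. The helper name pointer_width_bytes is made up for this illustration and is not part of the challenge program.

```rust
use std::mem::size_of;

/// Hypothetical helper: the size in bytes of `usize` on the target platform.
fn pointer_width_bytes() -> usize {
    size_of::<usize>()
}

fn main() {
    // u64 is always 8 bytes, regardless of the platform.
    println!("u64:   {} bytes", size_of::<u64>());
    // usize and isize follow the pointer width of the target.
    println!("usize: {} bytes", pointer_width_bytes());
    println!("isize: {} bytes", size_of::<isize>());
    // An unsigned type cannot go below zero.
    println!("u64 range: {}..={}", u64::MIN, u64::MAX);
}
```

The usize and isize values printed will depend on whether you compile for a 32-bit or a 64-bit target.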

The lines and bytes parameters will be used in functions that expect the types usize and u64, so later we’ll discuss how to convert between these types. Your program should use 10 as the default value for lines, but bytes will be an Option, which I first introduced in Chapter 2. This means that bytes will either be a Some value wrapping a u64 if the user provides a valid value or None if they do not.

I challenge you to parse the command-line arguments into this struct however you like. To use the derive pattern, annotate the preceding Args accordingly. If you prefer to follow the builder pattern, consider writing a get_args function with the following outline:

fn get_args() -> Args {
    let matches = Command::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust version of `head`")
        // What goes here?
        .get_matches();

    Args {
        files: ...
        lines: ...
        bytes: ...
    }
}

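Update main to parse and pretty-print the arguments. The following uses Args::parse from the derive pattern; if you wrote a get_args function instead, call that here: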

fn main() {
    let args = Args::parse();
    println!("{:#?}", args);
}

See if you can get your program to print a usage like the following. Note that I use the short and long names from the GNU version:

$ cargo run -- -h
Rust version of `head`

Usage: headr [OPTIONS] [FILE]...

Arguments:
  [FILE]...  Input file(s) [default: -]

Options:
  -n, --lines <LINES>  Number of lines [default: 10]
  -c, --bytes <BYTES>  Number of bytes
  -h, --help           Print help
  -V, --version        Print version

Run the program with no inputs and verify the defaults are correctly set:

$ cargo run
Args {
    files: [
        "-", 
    ],
    lines: 10, 
    bytes: None, 
}

files should default to a dash (-) as the filename.
lines should default to 10.
bytes should be None.

Now run the program with arguments and ensure they are correctly parsed:

$ cargo run -- -n 3 tests/inputs/one.txt
Args {
    files: [
        "tests/inputs/one.txt", 
    ],
    lines: 3, 
    bytes: None, 
}

The positional argument tests/inputs/one.txt is parsed as one of the files.
The -n option for lines sets this to 3.
The -c option for bytes was not given, so bytes defaults to None.

If I provide more than one positional argument, they will all go into files, and the -c argument will go into bytes. In the following command, I’m again relying on the bash shell to expand the file glob *.txt into all the files ending in .txt. PowerShell users should refer to the equivalent use of Get-ChildItem shown in the section “Iterating Through the File Arguments”:

$ cargo run -- -c 4 tests/inputs/*.txt
Args {
    files: [
        "tests/inputs/empty.txt", 
        "tests/inputs/one.txt",
        "tests/inputs/three.txt",
        "tests/inputs/twelve.txt",
        "tests/inputs/two.txt",
    ],
    lines: 10, 
    bytes: Some( 
        4,
    ),
}

There are five files ending in .txt.
lines is still set to the default value of 10.
The -c 4 results in bytes now being Some(4).

Any value for -n or -c that cannot be parsed into a positive integer should cause the program to halt with an error. Use clap::value_parser to ensure that the integer arguments are valid and convert them to numbers:

$ cargo run -- -n blargh tests/inputs/one.txt
error: invalid value 'blargh' for '--lines <LINES>':
invalid digit found in string
$ cargo run -- -c 0 tests/inputs/one.txt
error: invalid value '0' for '--bytes <BYTES>':
0 is not in 1..18446744073709551615
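To make the validation rule concrete, here is a minimal sketch, using only the standard library, of the kind of check a value parser performs. The function name parse_positive_int and the wording of its error messages are my own assumptions for illustration; they are not part of clap’s API, and clap’s messages differ.

```rust
/// Hypothetical helper (not part of clap): parse a string into a positive u64.
fn parse_positive_int(val: &str) -> Result<u64, String> {
    match val.parse::<u64>() {
        // Zero parses as a u64 but is rejected because the count must be at least 1.
        Ok(0) => Err(format!("{val} is not a positive integer")),
        Ok(n) => Ok(n),
        // Non-numeric or negative input cannot be parsed as a u64 at all.
        Err(e) => Err(format!("invalid value '{val}': {e}")),
    }
}

fn main() {
    for input in ["3", "0", "blargh", "-1"] {
        match parse_positive_int(input) {
            Ok(n) => println!("{input:>6} => Ok({n})"),
            Err(e) => println!("{input:>6} => Err({e})"),
        }
    }
}
```

Returning a Result with a descriptive message keeps the decision about how to report the error in the caller’s hands.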

The program should disallow the use of both -n and -c:

$ cargo run -- -n 1 -c 1 tests/inputs/one.txt
error: the argument '--lines <LINES>' cannot be used with '--bytes <BYTES>'

Usage: headr --lines <LINES> <FILE>...

Note

Just parsing and validating the arguments is a challenge, but I know you can do it. Stop reading here and get your program to pass all the tests included with cargo test dies:

running 3 tests
test dies_bad_lines ... ok
test dies_bad_bytes ... ok
test dies_bytes_and_lines ... ok

Defining the Arguments

Welcome back. I will first show the builder pattern with a get_args function as in the previous chapter. Note that the two optional arguments, lines and bytes, accept numeric values. This is different from the optional arguments implemented in Chapter 3 that are used as Boolean flags. Note that the following code requires use clap::{Arg, Command}:

fn get_args() -> Args {
    let matches = Command::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust version of `head`")
        .arg(
            Arg::new("lines") 
                .short('n')
                .long("lines")
                .value_name("LINES")
                .help("Number of lines")
                .value_parser(clap::value_parser!(u64).range(1..))
                .default_value("10"),
        )
        .arg(
            Arg::new("bytes") 
                .short('c')
                .long("bytes")
                .value_name("BYTES")
                .conflicts_with("lines")
                .value_parser(clap::value_parser!(u64).range(1..))
                .help("Number of bytes"),
        )
        .arg(
            Arg::new("files") 
                .value_name("FILE")
                .help("Input file(s)")
                .num_args(0..)
                .default_value("-"),
        )
        .get_matches();

    Args {
        files: matches.get_many("files").unwrap().cloned().collect(),
        lines: matches.get_one("lines").cloned().unwrap(),
        bytes: matches.get_one("bytes").cloned(),
    }
}

The lines option takes a value and defaults to 10.
The bytes option takes a value, and it conflicts with the lines parameter so that they are mutually exclusive.
The files parameter is positional, takes zero or more values, and defaults to a dash (-).

Alternatively, the clap derive pattern requires annotating the Args struct:

#[derive(Parser, Debug)]
#[command(author, version, about)]
/// Rust version of `head`
struct Args {
    /// Input file(s)
    #[arg(default_value = "-", value_name = "FILE")]
    files: Vec<String>,

    /// Number of lines
    #[arg(
        short('n'),
        long,
        default_value = "10",
        value_name = "LINES",
        value_parser = clap::value_parser!(u64).range(1..)
    )]
    lines: u64,

    /// Number of bytes
    #[arg(
        short('c'),
        long,
        value_name = "BYTES",
        conflicts_with("lines"),
        value_parser = clap::value_parser!(u64).range(1..)
    )]
    bytes: Option<u64>,
}

Tip

In the derive pattern, the default Arg::long value will be the name of the struct field, for example, lines and bytes. The default value for Arg::short will be the first letter of the struct field, so l or b. I specify the short names n and c, respectively, to match the original tool.

It’s quite a bit of work to validate all the user input, but now I have some assurance that I can proceed with good data.

Processing the Input Files

I recommend that you have your main call a run function. Be sure to add use anyhow::Result for the following:

fn main() {
    if let Err(e) = run(Args::parse()) {
        eprintln!("{e}");
        std::process::exit(1);
    }
}

fn run(_args: Args) -> Result<()> {
    Ok(())
}

This challenge program should handle the input files as in Chapter 3, so I suggest you add the same open function:

fn open(filename: &str) -> Result<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}

Be sure to add all these additional imports:

use std::fs::File;
use std::io::{self, BufRead, BufReader};

Expand your run function to try opening the files, printing errors as you encounter them:

fn run(args: Args) -> Result<()> {
    for filename in args.files { 
        match open(&filename) { 
            Err(err) => eprintln!("{filename}: {err}"), 
            Ok(_) => println!("Opened {filename}"), 
        }
    }
    Ok(())
}

Iterate through each of the filenames.
Attempt to open the given file.
Print errors to STDERR.
Print a message that the file was successfully opened.

Run your program with a good file and a bad file to ensure it seems to work. In the following command, blargh represents a nonexistent file:

$ cargo run -- blargh tests/inputs/one.txt
blargh: No such file or directory (os error 2)
Opened tests/inputs/one.txt

Without looking ahead to my solution, figure out how to read the lines and then the bytes of a given file. Next, add the headers separating multiple file arguments. Look closely at the error output from the original head program when handling invalid files, noticing that readable files have a header first and then the file output, but invalid files only print an error. Additionally, there is an extra blank line separating the output for the valid files:

$ head -n 1 tests/inputs/one.txt blargh tests/inputs/two.txt
==> tests/inputs/one.txt <==
Öne line, four words.
head: blargh: No such file or directory

==> tests/inputs/two.txt <==
Two lines.

I’ve specifically designed some challenging inputs for you to consider. To see what you face, use the file command to report file type information:

$ file tests/inputs/*.txt
tests/inputs/empty.txt:  empty 
tests/inputs/one.txt:    UTF-8 Unicode text 
tests/inputs/three.txt:  ASCII text, with CRLF, LF line terminators 
tests/inputs/twelve.txt: ASCII text 
tests/inputs/two.txt:    ASCII text

empty.txt is an empty file, included just to ensure your program doesn’t fall over.
one.txt contains Unicode, as I put an umlaut over the O in Öne to force you to consider the differences between bytes and characters.
three.txt has Windows-style line endings.
twelve.txt has 12 lines to ensure the default of 10 lines is shown.
two.txt has Unix-style line endings.

Tip

On Windows, the newline is the combination of the carriage return and the line feed, often shown as CRLF or \r\n. On Unix platforms, only the newline is used, so LF or \n. These line endings must be preserved in the output from your program, so you will have to find a way to read the lines in a file without removing the line endings.

Reading Bytes Versus Characters

Before continuing, you should understand the difference between reading bytes and characters from a file. In the early 1960s, the American Standard Code for Information Interchange (ASCII, pronounced as-key) table of 128 characters represented all possible text elements in computing. It takes only seven bits (2⁷ = 128) to represent this many characters. Usually a byte consists of eight bits, so the notions of byte and character were interchangeable.

Since the creation of Unicode (Universal Coded Character Set) to represent all the writing systems of the world (and even emojis), some characters may require up to four bytes. The Unicode standard defines several ways to encode characters, including UTF-8 (Unicode Transformation Format using eight bits). As noted, the file tests/inputs/one.txt begins with the character Ö, which is two bytes long in UTF-8. If you want head to show you this one character, you must request two bytes:

$ head -c 2 tests/inputs/one.txt
Ö

If you ask head to select just the first byte from this file, you get the byte value 195, which is not a valid UTF-8 string. The output is a special character that indicates a problem converting a character into Unicode:

$ head -c 1 tests/inputs/one.txt
�
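The same distinction shows up inside Rust strings. The following sketch compares byte and character counts for the first line of one.txt; the helper byte_and_char_counts is a name I made up for this illustration.

```rust
/// Hypothetical helper: return (number of bytes, number of characters) in a string.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    // `len` counts bytes, while `chars().count()` counts Unicode scalar values.
    (text.len(), text.chars().count())
}

fn main() {
    let text = "Öne line, four words.";
    let (num_bytes, num_chars) = byte_and_char_counts(text);
    println!("bytes = {num_bytes}, chars = {num_chars}");

    // The first character occupies more than one byte in UTF-8.
    let first = text.chars().next().unwrap();
    println!("{first:?} occupies {} bytes", first.len_utf8());

    // The first byte on its own is only part of that character.
    println!("first byte = {}", text.as_bytes()[0]);
}
```

Because the Ö takes two bytes, the byte count is one larger than the character count.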

The challenge program is expected to re-create this behavior. This is not an easy program to write, but you should be able to use std::io, std::fs::File, and std::io::BufReader to figure out how to read bytes and lines from each of the files. Note that in Rust, a String must be a valid UTF-8-encoded string, and so the method String::from_utf8_lossy might prove useful. I’ve included a full set of tests in tests/cli.rs that you should have copied into your source tree.

Note

Stop reading here and finish the program. Use cargo test frequently to check your progress. Do your best to pass all the tests before looking at my solution.

Solution

This challenge proved more interesting than I anticipated. I thought it would be little more than a variation on cat, but it turned out to be quite a bit more difficult. I’ll walk you through how I arrived at my solution.

Reading a File Line by Line

After opening the valid files, I started by reading lines from the filehandle. I decided to modify some code from Chapter 3:

fn run(args: Args) -> Result<()> {
    for filename in args.files {
        match open(&filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(file) => {
                for line in file.lines().take(args.lines as usize) { 
                    println!("{}", line?); 
                }
            }
        }
    }
    Ok(())
}

Use Iterator::take to select the desired number of lines from the filehandle.
Print the line to the console.

Tip

The Iterator::take method expects its argument to be the type usize, but I have a u64. I cast or convert the value using the as keyword.
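As a side note on casting, the as keyword never fails, but it silently truncates a value that does not fit into the target type. The sketch below contrasts it with the checked conversion TryFrom from the standard library. The helper to_usize_checked is hypothetical, and the truncation is demonstrated with u8 so that it behaves the same on every platform.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: convert a u64 to usize, returning None if it does not fit.
fn to_usize_checked(n: u64) -> Option<usize> {
    usize::try_from(n).ok()
}

fn main() {
    let lines: u64 = 10;

    // `as` performs the conversion unconditionally.
    let cast = lines as usize;
    println!("as:       {cast}");

    // `try_from` reports a value that is too large instead of truncating it.
    println!("try_from: {:?}", to_usize_checked(lines));

    // Casting 300 to a u8 wraps around, because a u8 can hold only 0 through 255.
    let big: u64 = 300;
    println!("300 as u8 = {}", big as u8);
    println!("u8::try_from(300) = {:?}", u8::try_from(big));
}
```

On a 64-bit system, a u64 always fits into a usize, so the cast used in this chapter is safe there; the checked form matters mostly on 32-bit targets.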

I think this is a fun solution because it uses the Iterator::take method to select the desired number of lines. I can run the program to select one line from a file, and it appears to work well:

$ cargo run -- -n 1 tests/inputs/twelve.txt
one

If I run cargo test, the program passes almost half the tests, which seems pretty good for having implemented only a small portion of the specifications; however, it’s failing all the tests that use the Windows-encoded input file. To fix this problem, I have a confession to make.

Preserving Line Endings While Reading a File

I hate to break it to you, dear reader, but the catr program in Chapter 3 does not completely replicate the original cat program because it uses BufRead::lines to read the input files. The documentation for that function says, “Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.” I hope you’ll forgive me because I wanted to show you how easy it can be to read the lines of a file, but you should be aware that the catr program replaces Windows CRLF line endings with Unix-style newlines.

To fix this, I must instead use BufRead::read_line, which, according to the documentation, “will read bytes from the underlying stream until the newline delimiter (the 0xA byte) or EOF is found. Once found, all bytes up to, and including, the delimiter (if found) will be appended to buf.”¹ Following is a version that will preserve the original line endings. With these changes, the program will pass more tests than it fails:

fn run(args: Args) -> Result<()> {
    for filename in args.files {
        match open(&filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(mut file) => { 
                let mut line = String::new(); 
                for _ in 0..args.lines { 
                    let bytes = file.read_line(&mut line)?; 
                    if bytes == 0 { 
                        break;
                    }
                    print!("{line}"); 
                    line.clear(); 
                }
            }
        };
    }
    Ok(())
}

Accept the filehandle as a mutable value.
Use String::new to create a new, empty mutable string buffer to hold each line.
Use for to iterate through a std::ops::Range to count up from zero to the requested number of lines. The variable name _ indicates I do not intend to use it.
Use BufRead::read_line to read the next line into the string buffer.
The filehandle will return zero bytes when it reaches the end of the file, so break out of the loop.
Print the line, including the original line ending.
Use String::clear to empty the line buffer.

If I run cargo test at this point, the program will pass almost all the tests for reading lines and will fail all those for reading bytes and handling multiple files.
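The difference between BufRead::lines and BufRead::read_line can be checked in isolation with an in-memory reader. This is a minimal sketch that uses std::io::Cursor in place of a real file; the two helper functions are hypothetical names made up for the comparison.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical helper: read the first line with its original ending intact.
fn first_line_raw(input: &str) -> std::io::Result<String> {
    let mut reader = Cursor::new(input);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line)
}

/// Hypothetical helper: read the first line via `lines`, which strips the ending.
fn first_line_stripped(input: &str) -> std::io::Result<String> {
    Cursor::new(input)
        .lines()
        .next()
        .unwrap_or_else(|| Ok(String::new()))
}

fn main() -> std::io::Result<()> {
    let windows = "Three\r\nlines,\r\nfour words.\r\n";
    // The debug format makes the carriage return and newline visible.
    println!("lines:     {:?}", first_line_stripped(windows)?);
    println!("read_line: {:?}", first_line_raw(windows)?);
    Ok(())
}
```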

Reading Bytes from a File

Next, I’ll handle reading bytes from a file. After I attempt to open the file, I check to see if args.bytes is Some number of bytes; otherwise, I’ll use the preceding code that reads lines. For the following code, be sure to add use std::io::Read to your imports:

for filename in args.files {
    match open(&filename) {
        Err(err) => eprintln!("{filename}: {err}"),
        Ok(mut file) => {
            if let Some(num_bytes) = args.bytes { 
                let mut buffer = vec![0; num_bytes as usize]; 
                let bytes_read = file.read(&mut buffer)?; 
                print!(
                    "{}",
                    String::from_utf8_lossy(&buffer[..bytes_read]) 
                );
            } else {
                ... // Same as before
            }
        }
    };
}

Use pattern matching to check if args.bytes is Some number of bytes to read.
Create a mutable buffer of a fixed length num_bytes filled with zeros to hold the bytes read from the file.
Read bytes from the filehandle into the buffer. The value bytes_read will contain the number of bytes that were read, which may be fewer than the number requested.
Convert the selected bytes into a string, which may not be valid UTF-8. Note the range operation to select only the bytes actually read.
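One caveat worth knowing is that a single call to Read::read is allowed to return fewer bytes than requested even when more data is available. A possible alternative, sketched here under that assumption, combines Read::take with Read::read_to_end so that reading continues until the limit or the end of the input. The helper read_up_to is a hypothetical name, and std::io::Cursor stands in for a file.

```rust
use std::io::{Cursor, Read};

/// Hypothetical helper: read at most `num_bytes` from any reader.
fn read_up_to(reader: impl Read, num_bytes: u64) -> std::io::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    // `take` caps the reader at `num_bytes`; `read_to_end` keeps reading
    // until that cap or the end of the input, whichever comes first.
    reader.take(num_bytes).read_to_end(&mut buffer)?;
    Ok(buffer)
}

fn main() -> std::io::Result<()> {
    let data = Cursor::new("one\ntwo\n");
    let bytes = read_up_to(data, 2)?;
    println!("{}", String::from_utf8_lossy(&bytes));
    Ok(())
}
```

Read::take accepts a u64, so this version needs no cast from the bytes argument.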

As you saw in the case of selecting only part of a multibyte character, converting bytes to characters could fail because strings in Rust must be valid UTF-8. The String::from_utf8 function will return an Ok only if the string is valid, but String::from_utf8_lossy will convert invalid UTF-8 sequences to the unknown or replacement character:

$ cargo run -- -c 1 tests/inputs/one.txt
�
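The two conversions can be compared side by side with the single byte 195. In this sketch, the helper lossy is a hypothetical wrapper that simply returns an owned String from String::from_utf8_lossy.

```rust
/// Hypothetical helper: lossy conversion that returns an owned String.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // 195 starts a two-byte UTF-8 sequence, so on its own it is invalid.
    let partial: Vec<u8> = vec![195];

    // The strict conversion reports an error.
    println!("strict is_err: {}", String::from_utf8(partial.clone()).is_err());

    // The lossy conversion substitutes U+FFFD, the replacement character,
    // which is itself three bytes long in UTF-8.
    let replaced = lossy(&partial);
    println!("lossy: {replaced} {:?}", replaced.as_bytes());

    // With both bytes present, the conversion yields the original character.
    println!("complete: {}", lossy(&[195, 150]));
}
```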

Let me show you another, much worse, way to read the bytes from a file. You can read the entire file into a string, convert that into a vector of bytes, and then select the first num_bytes:

let mut contents = String::new(); 
file.read_to_string(&mut contents)?; // Danger here 
let bytes = contents.as_bytes(); 
print!(
    "{}",
    String::from_utf8_lossy(&bytes[..num_bytes as usize]) // More danger 
);

Create a new string buffer to hold the contents of the file.
Read the entire file contents into the string buffer.
Use str::as_bytes to convert the contents into bytes (u8 or unsigned 8-bit integers).
Use String::from_utf8_lossy to turn a slice of bytes into a string.

As I’ve noted before, this approach can crash your program or computer if the file’s size exceeds the amount of memory on your machine. Another serious problem with the preceding code is that it assumes the slice operation bytes[..num_bytes] will succeed. If you use this code with an empty file, for instance, you’ll be asking for bytes that don’t exist. This will cause your program to panic and exit immediately with an error message:

$ cargo run -- -c 1 tests/inputs/empty.txt
thread 'main' panicked at src/main.rs:53:55:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
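The panic itself, though not the memory problem, could be avoided by clamping the end of the range to the length of the slice. The following is a sketch of that idea only; first_bytes is a hypothetical helper and is not part of my solution.

```rust
/// Hypothetical helper: return at most the first `n` bytes without panicking.
fn first_bytes(bytes: &[u8], n: usize) -> &[u8] {
    // Clamp the end index to the slice length so the range is always valid.
    &bytes[..n.min(bytes.len())]
}

fn main() {
    let empty: &[u8] = b"";

    // Indexing `empty[..1]` would panic, but `get` returns None instead.
    println!("{:?}", empty.get(..1));

    // The clamped version returns whatever is available.
    println!("{:?}", first_bytes(empty, 1));
    println!("{:?}", first_bytes(b"one", 2));
}
```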

Following is a safe—and perhaps the shortest—way to read the desired number of bytes from a file. Be sure to add the trait use std::io::Read to your imports:

let bytes: Result<Vec<_>, _> = file.bytes().take(num_bytes as usize).collect();
print!("{}", String::from_utf8_lossy(&bytes?));

In the preceding code, the type annotation Result<Vec<_>, _> is necessary as the compiler infers the type of bytes as a slice, which has an unknown size. I must indicate I want a Vec, which is a smart pointer to heap-allocated memory. The underscores (_) indicate partial type annotation, causing the compiler to infer the types. Without any type annotation for bytes, the compiler complains thusly:

error[E0277]: the size for values of type `[u8]` cannot be known at
compilation time
  --> src/main.rs:50:59
   |
50 |                     print!("{}", String::from_utf8_lossy(&bytes?));
   |                                                          ^^^^^^^ doesn't
   |                                        have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[u8]`
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature

Note

You’ve now seen that the underscore (_) serves various functions. As the prefix or name of a variable, it shows the compiler you don’t want to use the value. In a match arm, it is the wildcard for handling any case. When used in a type annotation, it tells the compiler to infer the type.
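Here is a short sketch that gathers these uses of the underscore in one place. The function classify and the variable names are made up for the illustration.

```rust
/// Hypothetical function using `_` as the wildcard arm of a match.
fn classify(n: u64) -> &'static str {
    match n {
        0 => "zero",
        1 => "one",
        // The wildcard handles every remaining value.
        _ => "many",
    }
}

fn main() {
    // As a variable name, `_` says the loop counter is not needed.
    for _ in 0..2 {
        println!("tick");
    }

    // As a prefix, it silences the unused-variable warning.
    let _unused = 42;

    // In a type annotation, it asks the compiler to infer the element type.
    let squares: Vec<_> = (1..=3).map(|n| n * n).collect();
    println!("{squares:?}");

    println!("{}", classify(7));
}
```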

You can also indicate the type information on the righthand side of the expression using the turbofish operator (::<>). Often it’s a matter of style whether you indicate the type on the lefthand or righthand side, but later you will see examples where the turbofish is required for some expressions. Here’s what the previous example would look like with the type indicated with the turbofish instead:

let bytes = file
    .bytes()
    .take(num_bytes as usize)
    .collect::<Result<Vec<_>, _>>();
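The turbofish is not specific to filehandles. As a self-contained sketch, the following uses it with both str::parse and Iterator::collect; the function parse_all is a hypothetical example, not code from the challenge program.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse every input or stop at the first failure.
fn parse_all(inputs: &[&str]) -> Result<Vec<u64>, ParseIntError> {
    inputs
        .iter()
        // The turbofish tells `parse` which numeric type to produce.
        .map(|s| s.parse::<u64>())
        // Collecting into a Result yields the first error, if there is one.
        .collect::<Result<Vec<_>, _>>()
}

fn main() {
    println!("{:?}", parse_all(&["1", "2", "3"]));
    println!("{:?}", parse_all(&["1", "two"]).is_err());
}
```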

The unknown character produced by String::from_utf8_lossy (b'\xef\xbf\xbd') is not exactly the same output produced by the BSD head (b'\xc3'), making this somewhat difficult to test. If you look at the run helper function in tests/cli.rs, you’ll see that I read the expected value (the output from head) and used the same function to convert what could be invalid UTF-8 so that I can compare the two outputs. The run_stdin function works similarly:

fn run(args: &[&str], expected_file: &str) -> Result<()> {
    // Extra work here due to lossy UTF
    let mut file = File::open(expected_file)?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    let expected = String::from_utf8_lossy(&buffer); 

    let output = Command::cargo_bin(PRG)?.args(args).output().expect("fail");
    assert!(output.status.success());
    assert_eq!(String::from_utf8_lossy(&output.stdout), expected); 

    Ok(())
}

Handle any invalid UTF-8 in expected_file.
Compare the output and expected values as lossy strings.

Printing the File Separators

The last piece to handle is the separators between multiple files. As noted before, valid files have a header that puts the filename inside ==> and <== markers. Files after the first have an additional newline at the beginning to visually separate the output. This means I will need to know the file number that I’m handling, which I can get by using the Iterator::enumerate method. Following is the final version of my run function that will pass all the tests:

fn run(args: Args) -> Result<()> {
    let num_files = args.files.len(); 

    for (file_num, filename) in args.files.iter().enumerate() { 
        match open(filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(mut file) => {
                if num_files > 1 { 
                    println!(
                        "{}==> {filename} <==",
                        if file_num > 0 { "\n" } else { "" }, 
                    );
                }

                if let Some(num_bytes) = args.bytes {
                    let mut buffer = vec![0; num_bytes as usize];
                    let bytes_read = file.read(&mut buffer)?;
                    print!(
                        "{}",
                        String::from_utf8_lossy(&buffer[..bytes_read])
                    );
                } else {
                    let mut line = String::new();
                    for _ in 0..args.lines {
                        let bytes = file.read_line(&mut line)?;
                        if bytes == 0 {
                            break;
                        }
                        print!("{line}");
                        line.clear();
                    }
                }
            }
        }
    }

    Ok(())
}

Use the Vec::len method to get the number of files.
Use the Iterator::enumerate method to track the file number and filenames.
Only print headers when there are multiple files.
Print a newline when file_num is greater than 0, which indicates any file after the first.
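The header logic can also be pulled out into a small function, which makes it easy to check on its own. This is a sketch of one way to do that; the function header is a hypothetical refactoring and does not appear in my solution.

```rust
/// Hypothetical helper: build the header for the file at position `file_num`,
/// or return None when only one file was given.
fn header(file_num: usize, num_files: usize, filename: &str) -> Option<String> {
    if num_files > 1 {
        // Every file after the first gets a leading blank line.
        let separator = if file_num > 0 { "\n" } else { "" };
        Some(format!("{separator}==> {filename} <=="))
    } else {
        None
    }
}

fn main() {
    let files = ["one.txt", "two.txt"];
    for (file_num, filename) in files.iter().enumerate() {
        if let Some(text) = header(file_num, files.len(), filename) {
            println!("{text}");
        }
    }
}
```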

Going Further

There’s no reason to stop this party now. Consider implementing how the GNU head handles numeric values with suffixes and negative values. For instance, -c 1K means print the first 1,024 bytes of the file, and -n -3 means print all but the last three lines of the file. You’ll need to change lines and bytes to signed integer values to store both positive and negative numbers. Be sure to run the GNU head with these arguments, capture the output to test files, and write tests to cover the new features you add.
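As one possible starting point for the suffix handling, here is a rough sketch that parses a signed count with an optional multiplier. It is an assumption-laden illustration: the name parse_count is made up, only the K and M suffixes are handled, and it has not been checked against every form the GNU head accepts.

```rust
/// Hypothetical helper: parse values like "10", "-3", "1K", or "2M" into an i64.
/// Only the K (1024) and M (1024 * 1024) suffixes are handled in this sketch.
fn parse_count(val: &str) -> Option<i64> {
    // Split off an optional one-letter multiplier suffix.
    let (digits, multiplier) = match val.strip_suffix('K') {
        Some(rest) => (rest, 1024),
        None => match val.strip_suffix('M') {
            Some(rest) => (rest, 1024 * 1024),
            None => (val, 1),
        },
    };
    // Parsing as i64 accepts a leading minus sign, so negative counts work.
    // `checked_mul` returns None instead of overflowing.
    digits.parse::<i64>().ok()?.checked_mul(multiplier)
}

fn main() {
    for input in ["10", "-3", "1K", "2M", "foo"] {
        println!("{input:>4} => {:?}", parse_count(input));
    }
}
```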

You could also add an option for selecting characters in addition to bytes. You can use the String::chars function to split a string into characters. Finally, copy the test input file with the Windows line endings (tests/inputs/three.txt) to the tests for Chapter 3. Edit the mk-outs.sh for that program to incorporate this file, and then expand the tests and program to ensure that line endings are preserved.
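For the character-selection idea, a minimal sketch might combine String::chars with Iterator::take. The function first_chars is a hypothetical name, and a real implementation would still need to read the text from a filehandle first.

```rust
/// Hypothetical helper: return the first `n` characters of a string.
fn first_chars(text: &str, n: usize) -> String {
    // `chars` yields whole characters, so multibyte characters are never split.
    text.chars().take(n).collect()
}

fn main() {
    let text = "Öne line, four words.";
    println!("{}", first_chars(text, 1));
    println!("{}", first_chars(text, 3));
}
```

Unlike selecting bytes, this never produces the replacement character for valid input.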

Summary

This chapter dove into some fairly sticky subjects, such as converting types like string inputs to a u64 and then casting these to usize. If you still feel confused, just know that you won’t always. If you keep reading the docs and writing more code, it will eventually make sense.

Here are some things you accomplished in this chapter:

You learned to create optional parameters that can take values. Previously, the options were flags.
You saw that all command-line arguments are strings and used clap to attempt the conversion of a string like "3" into the number 3.
You learned to convert types using the as keyword.
You found that using _ as the name or prefix of a variable is a way to indicate to the compiler that you don’t intend to use the value. When used in a type annotation, it tells the compiler to infer the type.
You learned how to use BufRead::read_line to preserve line endings while reading a filehandle.
You found that the take method works on both iterators and filehandles to limit the number of elements you select.
You learned to indicate type information on the lefthand side of an assignment or on the righthand side using the turbofish operator.

In the next chapter, you’ll learn more about Rust iterators and how to break input into lines, bytes, and characters.

¹ EOF is an acronym for end of file.
