Chapter 3. Creating Small Tools: Do one thing and do it well


Every operating system includes small tools.

Small tools written in C perform specialized small tasks, such as reading and writing files, or filtering data. If you want to perform more complex tasks, you can even link several tools together. But how are these small tools built? In this chapter, you’ll look at the building blocks of creating small tools. You’ll learn how to control command-line options, how to manage streams of information, and how to use redirection, so you’ll be tooled up in no time.

Small tools can solve big problems

A small tool is a C program that does one task and does it well. It might display the contents of a file on the screen or list the processes running on the computer. Or it might display the first 10 lines of a file or send it to the printer. Most operating systems come with a whole set of small tools that you can run from the command prompt or the terminal. Sometimes, when you have a big problem to solve, you can break it down into a series of small problems, and then write small tools for each of them.

Note

Operating systems like Linux are mostly made up of hundreds and hundreds of small tools.

A small tool does one task and does it well.


If one small part of your program needs to convert data from one format to another, that’s the perfect kind of task for a small tool.

Here’s how the program should work

  1. Take the GPS from the bike and download the data.

    It creates a file called gpsdata.csv with one line of data for every location.

  2. The geo2json tool needs to read the contents of the gpsdata.csv line by line...

  3. ...and then write that data in JSON format into a file called output.json.

  4. The web page that contains the map application reads the output.json file.

    It displays all of the locations on the map.

But you’re not using files...

The problem is, instead of reading and writing files, your program is currently reading data from the keyboard and writing it to the display.


But that isn’t good enough. The user won’t want to type in all of the data if it’s already stored in a file somewhere. And if the data in JSON format is just displayed on the screen, there’s no way the map within the web page will be able to read it.

You need to make the program work with files. But how do you do that? If you want to use files instead of the keyboard and the display, what code will you have to change? Will you have to change any code at all?

Brain Power

Is there a way of making our program use files without changing code? Without even recompiling it?

Geek Bits

Tools that read data line by line, process it, and write it out again are called filters. If you have a Unix machine, or you’ve installed Cygwin on Windows, you already have a few filter tools installed.

head: This tool displays the first few lines of a file.

tail: This filter displays the lines at the end of a file.

sed: The stream editor lets you do things like search and replace text.

You’ll see later how to combine filters together to form filter chains.

You can use redirection

You’re using scanf() and printf() to read from the keyboard and write to the display. But the truth is, they don’t talk directly to the keyboard and display. Instead, they use the Standard Input and Standard Output. The Standard Input and Standard Output are created by the operating system when the program runs.


The operating system controls how data gets into and out of the Standard Input and Output. If you run a program from the command prompt or terminal, the operating system will send all of the keystrokes from the keyboard into the Standard Input. If the operating system reads any data from the Standard Output, by default it will send that data to the display.

The scanf() and printf() functions don’t know, or care, where the data comes from or goes to. They just read from the Standard Input and write to the Standard Output.

Now this might sound like it’s kind of complicated. After all, why not just have your program talk directly to the keyboard and screen? Wouldn’t that be simpler?

Well, there’s a very good reason why operating systems communicate with programs using the Standard Input and the Standard Output:

You can redirect the Standard Input and Standard Output so that they read and write data somewhere else, such as to and from files.

You can redirect the Standard Input with <...

Instead of entering data at the keyboard, you can use the < operator to read the data from a file.


The < operator tells the operating system that the Standard Input of the program should be connected to the gpsdata.csv file instead of the keyboard. So you can send the program data from a file. Now you just need to redirect its output.


...and redirect the Standard Output with >

To redirect the Standard Output to a file, you need to use the > operator.

Because you’ve redirected the Standard Output, you don’t see any data appearing on the screen at all. But the program has now created a file called output.json.
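To make the redirection idea concrete, here is a minimal sketch of a geo2json-style filter. It is an assumption about what such a tool might look like, not the book's own listing: the function name convert(), the "latitude,longitude,info" input layout, and the JSON-like output are all chosen for illustration. The function takes its streams as parameters so it can be exercised on its own; a main() would simply call convert(stdin, stdout), which is what scanf() and printf() do implicitly.

```c
#include <stdio.h>

/*
 * Reads "latitude,longitude,info" lines from `in` and writes them to `out`
 * in a JSON-like layout. Returns the number of locations converted.
 * (Hypothetical helper; the data layout is an assumption.)
 */
int convert(FILE *in, FILE *out)
{
    float latitude;
    float longitude;
    char info[80];
    int count = 0;

    fputs("data=[\n", out);
    /* fscanf() returns how many fields it filled; 3 means a complete line. */
    while (fscanf(in, "%f,%f,%79[^\n]", &latitude, &longitude, info) == 3) {
        if (count > 0)
            fputs(",\n", out);  /* separate entries with a comma */
        fprintf(out, "{latitude: %f, longitude: %f, info: '%s'}",
                latitude, longitude, info);
        count++;
    }
    fputs("\n]\n", out);
    return count;
}
```

If this were compiled into a program called geo2json, nothing in the code would need to change to work with files. Assuming a Unix-style shell, a command along the lines of ./geo2json < gpsdata.csv > output.json would connect the Standard Input and Standard Output to those two files.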

The output.json file is the one you needed to create for the mapping application. Let’s see if it works.


But there’s a problem with some of the data...

Your program seems to be able to read GPS data and format it correctly for the mapping application. But after a few days, a problem creeps in.


So what happened here? The problem is that there was some bad data in the GPS data file.

But the geo2json program doesn’t do any checking of the data it reads; it just reformats the numbers and sends them to the output.

That should be easy to fix. You need to validate the data.
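One way the validation might look is sketched below. The limits are the usual geographic ranges (latitude from -90 to 90, longitude from -180 to 180); the function name check_location() and the STATUS_BAD_DATA constant are hypothetical, and the value 2 is chosen to match the exit status this chapter mentions for bad data.

```c
/* Status returned for bad data (assumed to match the exit status of 2). */
#define STATUS_BAD_DATA 2

/*
 * Returns 0 if the coordinates are within the valid geographic ranges,
 * or STATUS_BAD_DATA if either value is out of range.
 */
int check_location(float latitude, float longitude)
{
    if (latitude < -90.0f || latitude > 90.0f)
        return STATUS_BAD_DATA;
    if (longitude < -180.0f || longitude > 180.0f)
        return STATUS_BAD_DATA;
    return 0;
}
```

A tool could call this for every line it reads and return the nonzero status from main() as soon as a line fails the check.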

Brain Power

Study the code. What do you think happened? Is the code doing what you asked it to? Why weren’t there any error messages? Why did the mapping program think that the entire output.json file was corrupt?

Geek Bits

If your program finds a problem in the data, it exits with a status of 2. But how can you check that error status after the program has finished? Well, it depends on what operating system you’re using. If you’re running on a Mac, Linux, some other kind of Unix machine, or if you’re using Cygwin on a Windows machine, you can display the error status by typing echo $? at the prompt straight after the program finishes.

If you’re using the Command Prompt in Windows, then it’s a little different: type echo %ERRORLEVEL% instead.

Both commands do the same thing: they display the number returned by the program when it finished.


Introducing the Standard Error

The Standard Output is the default way of outputting data from a program. But what if something exceptional happens, like an error? You’ll probably want to deal with things like error messages a little differently from the usual output.

That’s why the Standard Error was invented. The Standard Error is a second output that was created for sending error messages.

Human beings generally have two ears and one mouth, but processes are wired a little differently. Every process has one ear (the Standard Input) and two mouths (the Standard Output and the Standard Error).


Let’s see how the operating system sets these up.

By default, the Standard Error is sent to the display

Remember how when a new process is created, the operating system points the Standard Input at the keyboard and the Standard Output at the screen? Well, the operating system creates the Standard Error at the same time and, like the Standard Output, the Standard Error is sent to the display by default.


That means that if someone redirects the Standard Input and Standard Output so they use files, the Standard Error will continue to send data to the display.


And that’s really cool, because it means that even if the Standard Output is redirected somewhere else, by default, any messages sent down the Standard Error will still be visible on the screen.

So you can fix the problem of our hidden error messages by simply displaying them on the Standard Error.

But how do you do that?

fprintf() prints to a data stream

You’ve already seen that the printf() function sends data to the Standard Output. What you didn’t know is that the printf() function is just a version of a more general function called fprintf(). A call to printf() behaves exactly like a call to fprintf() with stdout as the first argument.

The fprintf() function allows you to choose where you want to send text to. You can tell fprintf() to send text to stdout (the Standard Output) or stderr (the Standard Error).
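As a small illustration of choosing the stream, the sketch below wraps fprintf() in two helper functions. The names print_normal() and print_error() are made up for this example; only fprintf(), stdout, and stderr come from the standard library.

```c
#include <stdio.h>

/* Sends ordinary output to the Standard Output.
 * This is equivalent to printf("%s\n", text). */
int print_normal(const char *text)
{
    return fprintf(stdout, "%s\n", text);
}

/* Sends an error message to the Standard Error instead, so it stays
 * visible even when the Standard Output has been redirected to a file. */
int print_error(const char *text)
{
    return fprintf(stderr, "%s\n", text);
}
```

Both helpers return whatever fprintf() returns, which is the number of characters written.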

Let’s update the code to use fprintf()

With just a couple of small changes, you can get our error messages printing on the Standard Error.


That means that the code should now work in exactly the same way, except the error messages should appear on the Standard Error instead of the Standard Output.

Let’s run the code and see.

Small tools are flexible

One of the great things about small tools is their flexibility. If you write a program that does one thing really well, chances are you will be able to use it in lots of contexts. A program that can search for text inside a file, say, is likely to be useful in more than one place.

For example, think about your geo2json tool. You created it to help display cycling data, right? But there’s no reason you can’t use it for some other purpose...like investigating...the...


To see how flexible our tool is, let’s use it for a completely different problem. Instead of just displaying data on a map, let’s try to use it for something a little more complex. Say you want to read in a whole set of GPS data like before, but instead of just displaying everything, let’s just display the information that falls inside the Bermuda Rectangle.

That means you will display only data that matches these conditions:

((latitude > 26) && (latitude < 34))

((longitude > -76) && (longitude < -64))

So where do you need to begin?

Don’t change the geo2json tool

Our geo2json tool displays all of the data it’s given. So what should we do? Should we modify geo2json so that it exports data and also checks the data?

Well, we could, but remember, a small tool does one job and does it well.

You don’t really want to modify the geo2json tool, because you want it to do just one task. If you make the program do something more complex, you’ll cause problems for your users who expect the tool to keep working in exactly the same way.


So if you don’t want to change the geo2json tool, what should you do?

A different task needs a different tool

If you want to skip over the data that falls outside the Bermuda Rectangle, you should build a separate tool that does just that.

So, you’ll have two tools: a new bermuda tool that filters out data that is outside the Bermuda Rectangle, and then your original geo2json tool that will convert the remaining data for the map.

This is how you’ll connect the programs together: the data goes into the bermuda tool first, and whatever bermuda lets through is passed on to geo2json.

By breaking the problem down into two tasks, you will be able to leave geo2json untouched. That means its current users will still be able to use it. The question is:

How will you connect your two tools together?

Connect your input and output with a pipe

You’ve already seen how to use redirection to connect the Standard Input and the Standard Output of a program to a file. But now you’ll connect the Standard Output of the bermuda tool to the Standard Input of the geo2json tool with a pipe.

The | symbol is a pipe that connects the Standard Output of one process to the Standard Input of another process.

That way, whenever the bermuda tool sees a piece of data inside the Bermuda Rectangle, it will send the data to its Standard Output. The pipe will send that data from the Standard Output of the bermuda tool to the Standard Input of the geo2json tool.

The operating system will handle the details of exactly how the pipe will do this. All you have to do to get things running is issue a single command that joins the two tools with the | symbol.

So now it’s time to build the bermuda tool.

The bermuda tool

The bermuda tool will work in a very similar way to the geo2json tool: it will read through a set of GPS data, line by line, and then send data to the Standard Output.

But there will be two big differences. First, it won’t send every piece of data to the Standard Output, just the lines that are inside the Bermuda Rectangle. The second difference is that the bermuda tool will always output data in the same CSV format used to store GPS data.

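In pseudocode, the tool reads the latitude, longitude, and other data for each line. If the latitude is between 26 and 34 and the longitude is between -76 and -64, it displays the latitude, longitude, and other data.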

Let’s turn the pseudocode into C.
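Here is one possible translation, offered as a sketch rather than the definitive listing. It assumes each input line has the form "latitude,longitude,info", and the function name bermuda_filter() is hypothetical. As elsewhere, the streams are passed in as parameters; a main() would call bermuda_filter(stdin, stdout).

```c
#include <stdio.h>

/*
 * Copies to `out` only the lines of `in` whose coordinates fall inside
 * the Bermuda Rectangle, keeping the same CSV layout.
 * Returns the number of lines that were kept.
 */
int bermuda_filter(FILE *in, FILE *out)
{
    float latitude;
    float longitude;
    char info[80];
    int kept = 0;

    while (fscanf(in, "%f,%f,%79[^\n]", &latitude, &longitude, info) == 3) {
        /* The two conditions given in the text for the rectangle. */
        if ((latitude > 26) && (latitude < 34) &&
            (longitude > -76) && (longitude < -64)) {
            fprintf(out, "%f,%f,%s\n", latitude, longitude, info);
            kept++;
        }
    }
    return kept;
}
```

Because the output is still CSV, it can be fed directly into geo2json. Assuming both tools are compiled under those names and a Unix-style shell is in use, a command such as (./bermuda | ./geo2json) < spooky.csv > output.json would chain them together.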

Do this!

You can download the spooky.csv file at http://oreillyhfc.appspot.com/spooky.csv.

But what if you want to output to more than one file?

We’ve looked at how to read data from one file and write to another file using redirection, but what if the program needs to do something a little more complex, like send data to more than one file?

Imagine you need to create another tool that will read a set of data from a file, and then split it into other files.


So what’s the problem? You can write to files, right? Trouble is, with redirection you can write to only two files at most, one from the Standard Output and one from the Standard Error. So what do you do?

Roll your own data streams

When a program runs, the operating system gives it three file data streams: the Standard Input, the Standard Output, and the Standard Error. But sometimes you need to create other data streams on the fly.


The good news is that the operating system doesn’t limit you to the ones you are dealt when the program starts. You can roll your own as the program runs.

Each data stream is represented by a pointer to a file, and you can create a new data stream using the fopen() function.

The fopen() function takes two parameters: a filename and a mode. The mode can be w to write to a file, r to read from a file, or a to append data to the end of a file.

Once you’ve created a data stream, you can print to it using fprintf(), just like before:

fprintf(out_file, "Don't wear %s with %s", "red", "green");

But what if you need to read from a file? Well, there’s also an fscanf() function to help you do that:

fscanf(in_file, "%79[^\n]\n", sentence);

Finally, when you’re finished with a data stream, you need to close it. The truth is that all data streams are automatically closed when the program ends, but it’s still a good idea to always close the data stream yourself:

fclose(in_file);

fclose(out_file);
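Putting fopen(), fscanf(), fprintf(), and fclose() together, the sketch below splits one file into two. It is a simplified illustration, not the chapter's categorize program: the function name split_file() and its parameters are hypothetical, and it assumes lines of at most 79 characters with no blank lines.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copies each line of in_path that contains `word` into match_path and
 * every other line into other_path.
 * Returns 0 on success, or 1 if any of the files cannot be opened.
 */
int split_file(const char *in_path, const char *word,
               const char *match_path, const char *other_path)
{
    char line[80];
    FILE *in;
    FILE *match;
    FILE *other;

    if (!(in = fopen(in_path, "r"))) {
        fprintf(stderr, "Can't open %s for reading.\n", in_path);
        return 1;
    }
    match = fopen(match_path, "w");
    other = fopen(other_path, "w");
    if (!match || !other) {
        fprintf(stderr, "Can't open the output files.\n");
        fclose(in);
        if (match) fclose(match);
        if (other) fclose(other);
        return 1;
    }
    /* Read up to 79 characters of a line, then consume the newline. */
    while (fscanf(in, "%79[^\n]\n", line) == 1) {
        if (strstr(line, word))
            fprintf(match, "%s\n", line);
        else
            fprintf(other, "%s\n", line);
    }
    /* Close every stream that was opened. */
    fclose(in);
    fclose(match);
    fclose(other);
    return 0;
}
```

A call such as split_file("spooky.csv", "UFO", "ufos.csv", "other.csv") would then write to two files at once, which redirection alone cannot do for ordinary output.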

Let’s try this out now.

The program runs, but...

If you compile and run the program, it will read the spooky.csv file and split up the data, line by line, into three other files: ufos.csv, disappearances.csv, and other.csv.

That’s great, but what if a user wanted to split up the data differently? What if they wanted to search for different words or write to different files? Could they do that without needing to recompile the program each time?

There’s more to main()

The thing is, any program you write will need to give the user the ability to change the way it works. If it’s a GUI program, you will probably need to give it preferences. And if it’s a command-line program, like our categorize tool, it will need to give the user the ability to pass it command-line arguments.

To read them, the main() function can take two parameters, argc and argv. The argv parameter is an array of strings that holds the command-line arguments. As with any array in C, you need some way of knowing how long the array is, and that is the job of argc: it is a count of the number of elements in the array.

Command-line arguments really give your program a lot more flexibility, and it’s worth thinking about which things you want your users to tweak at runtime. It will make your program a lot more valuable to them.

OK, let’s see how you can add a little flexibility to the categorize program.

Watch it!

The first element of the argv array, argv[0], contains the name of the program as it was run by the user.

That means that the first proper command-line argument is argv[1].
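The sketch below shows the indexing in action. The helper name list_arguments() is hypothetical; it takes the same argc and argv that main(int argc, char *argv[]) receives, so a main() could pass its own parameters straight through.

```c
#include <stdio.h>

/*
 * Prints every command-line argument after the program name to `out`.
 * Returns how many arguments were printed.
 */
int list_arguments(int argc, char *argv[], FILE *out)
{
    int i;

    /* argv[0] is the program name, so the real arguments start at 1. */
    for (i = 1; i < argc; i++)
        fprintf(out, "argument %d: %s\n", i, argv[i]);
    return argc > 1 ? argc - 1 : 0;
}
```

Starting the loop at 1 rather than 0 is the whole point: it skips the program name and stops before argc, so the code never reads past the end of the array.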

Safety Check

Although at Head First Labs we never make mistakes (cough), it’s important in real-world programs to check for problems when you open a file for reading or writing. Fortunately, if there’s a problem opening a data stream, the fopen() function will return the value 0 (a null pointer). That means if you want to check for errors, you should change code like:

FILE *in = fopen("i_dont_exist.txt", "r");

to this:

FILE *in;
if (!(in = fopen("i_dont_exist.txt", "r"))) {
  fprintf(stderr, "Can't open the file.\n");
  return 1;
}


Chances are, any program you write is going to need options. If you create a chat program, it’s going to need preferences. If you write a game, the user will want to change the shape of the blood spots. And if you’re writing a command-line tool, you are probably going to need to add command-line options.

Command-line options are the little switches you often see with command-line tools.

Let the library do the work for you

Many programs use command-line options, so there’s a special library function you can use to make dealing with them a little easier. It’s called getopt(), and each time you call it, it returns the next option it finds on the command line.

Let’s see how it works. Imagine you have a program that can take a set of different options.

This program needs one option that will take a value (-e = engines) and another that is simply on or off (-a = awesomeness). You can handle these options by calling getopt() in a loop.

Inside the loop, you have a switch statement to handle each of the valid options. The string ae: tells the getopt() function that a and e are valid options. The e is followed by a colon to tell getopt() that the -e needs to be followed by an extra argument. getopt() will point to that argument with the optarg variable.

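When the loop finishes, you tweak the argv and argc variables to skip past all of the options and get to the main command-line arguments. getopt() records the index of the next unprocessed element in the optind variable, so you subtract optind from argc and add it to argv. After that, the argv array begins with the first argument that follows the options.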

The Polite Guide to Standards

The unistd.h header is not actually part of the standard C library. Instead, it gives your programs access to some of the POSIX libraries. POSIX was an attempt to create a common set of functions for use across all popular operating systems.

Watch it!

After processing the arguments, the 0th argument will no longer be the program name.

argv[0] will instead point to the first command-line argument that follows the options.

Your C Toolbox

You’ve got Chapter 3 under your belt, and now you’ve added small tools to your toolbox. For a complete list of tooltips in the book, see Appendix B.

