Program: Data Mining

Suppose that I, as a published author, want to track how my book is selling in comparison to others. This information can be obtained for free just by clicking on the page for my book on any of the major bookseller sites, reading the sales rank number off the screen, and typing the number into a file, but that’s tedious. As I somewhat haughtily wrote in the book that this example looks for, “computers get paid to extract relevant information from files; people should not have to do such mundane tasks.” This program uses the regular expressions API and, in particular, newline matching to extract a value from an HTML page. It also reads from a URL (discussed later in Section 17.7). The pattern to look for is something like this (bear in mind that the HTML may change at any time, so I want to keep the pattern fairly general):

<b>QuickBookShop.web Sales Rank: </b>
26,252
</font><br>

As the pattern may extend over more than one line, I read the entire web page from the URL into a single long string using my FileIO.readerToString( ) method (see Section 9.6) instead of the more traditional line-at-a-time paradigm. I then plot a graph using an external program (see Section 26.2); this could (and should) be changed to use a Java graphics program. The complete program is shown in Example 4-2.
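
To make the "one long string" idea concrete, here is a minimal sketch that reads a whole Reader into a String and then matches a rank that sits on the line after its label. It deliberately uses the standard java.util.regex package rather than the org.apache.regexp classes used in Example 4-2, and the names RankExtractDemo, readAll, and extractRank are hypothetical, chosen only for this illustration. The sample HTML is the fragment shown above, not a live page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RankExtractDemo {

    // Label, optional whitespace (including newlines), then digits with optional commas.
    private static final Pattern RANK =
        Pattern.compile("Sales Rank: </b>\\s*(\\d[\\d,]*)");

    /** Read everything from a Reader into one String (hypothetical stand-in for a FileIO-style helper). */
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Return the rank with commas removed, or null if the label is not found. */
    static String extractRank(String page) {
        Matcher m = RANK.matcher(page);
        return m.find() ? m.group(1).replace(",", "") : null;
    }

    public static void main(String[] args) {
        String html = "<b>QuickBookShop.web Sales Rank: </b>\n    26,252\n</font><br>\n";
        String page = readAll(new StringReader(html));
        System.out.println(extractRank(page)); // prints 26252
    }
}
```

Because \s matches line terminators as well as spaces, no special multi-line flag is needed once the whole page is in a single string; a line-at-a-time loop would see the label and the number on different lines and never match.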

Example 4-2. BookRank.java

import com.darwinsys.util.FileIO;
import java.io.*;
import java.net.*;
import java.text.*;
import java.util.*;
import org.apache.regexp.*;

/** Graph of a book's sales rank on a given bookshop site. */

public class BookRank {
    public final static String ISBN = "0937175307";
    public final static String DATA_FILE = "lint.sales";
    public final static String GRAPH_FILE = "lint.png";
    public final static String TITLE = "Checking C Prog w/ Lint";
    public final static String QUERY = "

    /** Grab the sales rank off the web page and log it. */
    public static void main(String[] args) throws Exception {

        // Looking for something like this in the input:
        //     <b>QuickBookShop.web Sales Rank: </b>
        //     26,252
        //     </font><br>

        // From Patrick Killelea: match number with
        // comma included, just print as is. Loses if you fall below 100,000.
        RE r = new RE("\\.web Sales Rank: </b>\\s*(\\d*),*(\\d+)\\s");

        // Read the given search URL looking for the rank information.
        // Read as a single long string, so can match multi-line entries.
        // If found, append to sales data file.
        BufferedReader is = new BufferedReader(new InputStreamReader(
            new URL(QUERY + ISBN).openStream(  )));
        String input = FileIO.readerToString(is);
        if (r.match(input)) {
            PrintWriter FH = new PrintWriter(
                new FileWriter(DATA_FILE, true));
            String date = // `date +'%m %d %H %M %S %Y'`;
                new SimpleDateFormat("MM dd HH mm ss yyyy ").
                format(new Date(  ));
            FH.println(date + r.getParen(1) + r.getParen(2));
            FH.close(  );
        }

        // Draw the graph, using gnuplot.

        String gnuplot_cmd = 
            "set term png\n" + 
            "set output \"" + GRAPH_FILE + "\"\n" +
            "set xdata time\n" +
            "set ylabel \"Amazon sales rank\"\n" +
            "set bmargin 3\n" +
            "set logscale y\n" +
            "set yrange [1:60000] reverse\n" +
            "set timefmt \"%m %d %H %M %S %Y\"\n" +
            "plot \"" + DATA_FILE + 
                "\" using 1:7 title \"" + TITLE + "\" with lines\n";

        // Start gnuplot and send it the commands on its standard input.
        Process p = Runtime.getRuntime(  ).exec("/usr/local/bin/gnuplot");
        PrintWriter gp = new PrintWriter(p.getOutputStream(  ));
        gp.print(gnuplot_cmd);
        gp.close(  );
    }
}
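
The gnuplot command "using 1:7" in the listing above depends on the layout of each logged line: the timestamp occupies six space-separated fields and the rank is the seventh. The following sketch, with the hypothetical names LogLineDemo and logLine, builds one such line and splits it to show where the rank lands. It uses the 24-hour pattern letter HH, which corresponds to the %H in gnuplot's timefmt; the 12-hour letter hh would make afternoon samples look like morning ones.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class LogLineDemo {

    /** Build one data-file line: six timestamp fields followed by the rank. */
    static String logLine(Date when, String rank) {
        // Trailing space in the pattern separates the year from the rank.
        SimpleDateFormat fmt =
            new SimpleDateFormat("MM dd HH mm ss yyyy ", Locale.ROOT);
        return fmt.format(when) + rank;
    }

    public static void main(String[] args) {
        String line = logLine(new Date(), "26252");
        String[] fields = line.split("\\s+");
        System.out.println(line);
        // Fields 1-6 are month, day, hour, minute, second, year; field 7 is the rank.
        System.out.println("fields = " + fields.length + ", rank = " + fields[6]);
    }
}
```

Keeping the timestamp as separate numeric fields is what lets gnuplot parse it with a single timefmt string while still addressing the rank by column number.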
