Over the next two chapters, we will be dealing with code that is messy and undertested, but does something cool. Note that both of these chapters work from the same project codebase.
The first cool part is that if you're interested in machine learning but lack experience with it, the particular algorithm we're using is fairly simple and still very powerful. It's called a Naive Bayes Classifier (NBC). You can use it to classify things based on previous knowledge. A spam filter is a frequently cited example. An NBC has two basic steps. First, you give it data that you already know how a human would classify (e.g., “These 35 subject lines are from spam emails”). That is called “training” the algorithm. Then, you give it a new piece of data and ask it what category that data likely fits into (e.g., “Here is the subject line of an email we just received. Is it spam or not?”).
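To make those two steps concrete, here is a minimal sketch of a word-counting Naive Bayes Classifier in Python. This is not the project's code; the class name, the whitespace tokenization, and the Laplace smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    # Hypothetical minimal NBC: counts words per category, then scores
    # new text with log probabilities and Laplace (add-one) smoothing.
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-category word tallies
        self.category_counts = Counter()         # training examples per category
        self.vocabulary = set()

    def train(self, text, category):
        # Step 1: record an example a human has already classified.
        self.category_counts[category] += 1
        for word in text.lower().split():
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        # Step 2: return the category with the highest score for new data.
        words = text.lower().split()
        total = sum(self.category_counts.values())
        best, best_score = None, float("-inf")
        for category in self.category_counts:
            # Prior: how common this category is in the training data.
            score = math.log(self.category_counts[category] / total)
            seen = sum(self.word_counts[category].values())
            denom = seen + len(self.vocabulary)  # add-one smoothing
            for word in words:
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best

classifier = NaiveBayesClassifier()
classifier.train("win a free prize now", "spam")
classifier.train("free money click here", "spam")
classifier.train("meeting notes for monday", "ham")
classifier.train("lunch on friday", "ham")
print(classifier.classify("free prize money"))  # -> spam
```

The log-probability trick is worth noting: multiplying many small probabilities underflows quickly, so real implementations sum their logarithms instead.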
The second cool thing (if you’re into playing music at all) is that our specific application of the algorithm will use the chords in songs, along with each song’s difficulty, as training data. Following that, we can feed it the chords of other songs, and it will automatically characterize their difficulty for us. At the end of these two chapters, we’ll make some tweaks to have the algorithm guess at whether a segment of text is understandable or not (with the assumption that we understand English, but not Japanese).
This might seem like an intimidating or complex problem, but two things ...