July 2018
Beginner to intermediate
406 pages
9h 55m
English
As mentioned earlier, we will use the Text and Score features to train our classifier. The problem with Text is that the classifier works on numeric features, not on raw strings, so we will have to convert each post's text into one or more numbers. So, what statistics could be useful to extract from a post? Let's start with the number of HTML links, assuming that good posts have a higher chance of having links in them.
We can do this with regular expressions. The following captures all HTML link tags that start with http:// (ignoring the other protocols for now):
import re
link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>', re.MULTILINE | re.DOTALL)
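As a rough sketch of how this pattern could be turned into a numeric feature, the snippet below counts the matches in a post body. The helper name count_links and the sample post are hypothetical and used only for illustration. This version still counts links inside code blocks.

```python
import re

# Same pattern as in the text: anchor tags whose href starts with http://
link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>',
                        re.MULTILINE | re.DOTALL)


def count_links(html):
    """Return the number of http:// anchor tags found in an HTML string.

    Hypothetical helper, not taken from the text.
    """
    return len(link_match.findall(html))


# Made-up post body for illustration only
sample_post = (
    '<p>See <a href="http://example.com/docs">the docs</a> and '
    '<a href="http://example.com/faq" rel="nofollow">the FAQ</a>.</p>'
)
num_links = count_links(sample_post)
```

Because the pattern only matches http://, a link that uses https:// or a relative URL would not be counted here.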
However, we do not want to count links that are part of a code block. If, for example, ...