Designing more features

Besides the number of hyperlinks, the number of code lines is possibly another good proxy for a post's quality. At the very least, it indicates that the post's author is interested in answering the question. The code is embedded in the <pre>...</pre> tag. Once we have extracted it, we count the number of words in the post while ignoring the code lines:

# we will use regular expression to remove HTML tags
tag_match = re.compile('<[^>]*>', re.MULTILINE | re.DOTALL)
whitespace_match = re.compile(r'\s+', re.MULTILINE | re.DOTALL)

def extract_features_from_body(s):
    num_code_lines = 0
    link_count_in_code = 0

    # remove source code and count how many lines the post has
    code_free_s = s
    for match_str in code_match.findall(s):
        ...
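Because the listing above is cut short, here is a minimal, self-contained sketch of the same idea: count the lines inside the <pre> blocks, then count the words that remain once the code and the other tags are stripped. It is not the book's implementation. The names pre_block, any_tag, whitespace, and count_code_lines_and_words are hypothetical, and the pattern assumes a plain <pre> opening tag without attributes.

```python
import re

# Assumed patterns, not taken from the listing above: code lives in plain
# <pre>...</pre> blocks, and every other tag is stripped before counting words.
pre_block = re.compile(r'<pre>(.*?)</pre>', re.DOTALL | re.IGNORECASE)
any_tag = re.compile(r'<[^>]*>', re.DOTALL)
whitespace = re.compile(r'\s+')


def count_code_lines_and_words(body):
    """Return (num_code_lines, num_text_tokens) for an HTML post body."""
    num_code_lines = 0
    for code in pre_block.findall(body):
        # Count only the non-empty lines inside each <pre> block.
        num_code_lines += sum(1 for line in code.splitlines() if line.strip())

    # Drop the code blocks, then the remaining tags, then collapse whitespace.
    text = pre_block.sub(' ', body)
    text = any_tag.sub(' ', text)
    text = whitespace.sub(' ', text).strip()

    # Tokens are whitespace-separated, so punctuation stays attached to words.
    num_text_tokens = len(text.split()) if text else 0
    return num_code_lines, num_text_tokens


post = ('<p>Try this:</p><pre>import os\nprint(os.getcwd())\n</pre>'
        '<p>It prints the <b>current</b> directory.</p>')
print(count_code_lines_and_words(post))  # prints (2, 7)
```

Removing the code before stripping the remaining tags keeps identifiers and keywords out of the word count, so the two features measure different things.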

