Besides the number of hyperlinks, the number of code lines is possibly another good proxy for a post's quality. At the very least, it indicates that the post's author is interested in answering the question. We can find the code embedded in the <pre>...</pre> tag. Once we have extracted it, we should count the number of normal (non-code) words in the post:
import re

# we will use regular expressions to remove HTML tags
tag_match = re.compile('<[^>]*>', re.MULTILINE | re.DOTALL)
whitespace_match = re.compile(r'\s+', re.MULTILINE | re.DOTALL)

# code_match is not defined in this listing; it is assumed to be a
# pattern defined earlier that matches <pre>...</pre> blocks

def extract_features_from_body(s):
    num_code_lines = 0
    link_count_in_code = 0

    # remove source code and count how many lines the post has
    code_free_s = s
    for match_str in code_match.findall(s):
        ...
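The listing above is cut off, so the following is only a minimal sketch of the two counts the text describes, not the original function body. The helper name `count_code_lines_and_words`, the `code_match` pattern, and the sample `post` are assumptions made for illustration. Words are approximated as whitespace-separated tokens after the code blocks and remaining HTML tags have been removed.

```python
import re

# Assumed pattern: the listing references code_match without defining it.
code_match = re.compile(r'<pre>(.*?)</pre>', re.MULTILINE | re.DOTALL)
tag_match = re.compile(r'<[^>]*>', re.MULTILINE | re.DOTALL)
whitespace_match = re.compile(r'\s+', re.MULTILINE | re.DOTALL)


def count_code_lines_and_words(s):
    """Hypothetical helper: return (num_code_lines, num_text_tokens)."""
    # Count the lines inside every <pre>...</pre> block,
    # ignoring leading and trailing newlines of each block.
    num_code_lines = 0
    for code in code_match.findall(s):
        stripped = code.strip('\n')
        if stripped:
            num_code_lines += stripped.count('\n') + 1

    # Drop the code blocks, then the remaining tags, then collapse whitespace.
    code_free_s = code_match.sub(' ', s)
    text = tag_match.sub(' ', code_free_s)
    text = whitespace_match.sub(' ', text).strip()

    # Approximate "normal words" as whitespace-separated tokens.
    num_text_tokens = len(text.split()) if text else 0
    return num_code_lines, num_text_tokens


# Made-up sample post, for illustration only.
post = (
    "<p>Use a list comprehension:</p>"
    "<pre>xs = [1, 2, 3]\nys = [x * 2 for x in xs]\n</pre>"
    "<p>That is all.</p>"
)
print(count_code_lines_and_words(post))  # prints (2, 7)
```

Replacing each match with a space instead of an empty string keeps the words on either side of a removed tag from running together, which would otherwise undercount the tokens.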