July 2018
Let's play with the toy dataset, consisting of the following posts:
| Post filename | Post content |
|---|---|
| 01.txt | This is a toy post about machine learning. Actually, it contains not much interesting stuff |
| 02.txt | Imaging databases can get huge |
| 03.txt | Most imaging databases save images permanently |
| 04.txt | Imaging databases store images |
| 05.txt | Imaging databases store images |
In this post dataset, we want to find the post most similar to the short post "imaging databases".
Assuming that the posts are located in the "data/toy" directory (see the Jupyter notebook), we can feed them to CountVectorizer:
>>> from pathlib import Path # for easy path management
>>> TOY_DIR = Path('data/toy')
>>> posts ...
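Since the listing above is cut off, here is a minimal, self-contained sketch of the idea rather than the book's own code. It does not use CountVectorizer. Instead it counts words by hand with the standard library, under the assumption that tokens are lowercase words of two or more characters (which mirrors CountVectorizer's default behavior). The names `bag_of_words` and `euclidean` are hypothetical helpers introduced for illustration, and the posts are inlined from the table so that the sketch runs without the data/toy directory.

```python
import math
import re
from collections import Counter

# Toy posts copied from the table above, inlined so the sketch runs
# without the data/toy directory.
posts = {
    "01.txt": "This is a toy post about machine learning. "
              "Actually, it contains not much interesting stuff",
    "02.txt": "Imaging databases can get huge",
    "03.txt": "Most imaging databases save images permanently",
    "04.txt": "Imaging databases store images",
    "05.txt": "Imaging databases store images",
}


def bag_of_words(text):
    """Hypothetical helper: count lowercase tokens of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def euclidean(a, b):
    """Hypothetical helper: Euclidean distance between two sparse count vectors."""
    # A Counter returns 0 for missing words, so the union of keys is enough.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))


new_post = "imaging databases"
new_vec = bag_of_words(new_post)

# Distance from every toy post to the new post.
distances = {
    name: euclidean(bag_of_words(text), new_vec)
    for name, text in posts.items()
}

for name, dist in sorted(distances.items()):
    print(f"{name}: {dist:.2f}")

# min() keeps the first post it sees among equally close candidates.
best = min(distances, key=distances.get)
print("Most similar post:", best)
```

Because 04.txt and 05.txt contain the same text in this table, they end up at the same distance from the new post, and raw word counts alone cannot separate them.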