Chapter 11. Data Munging with fastai’s Mid-Level API
We have seen what Tokenizer
and Numericalize
do to a collection of
texts, and how they’re used inside the data block API, which
handles those transforms for us directly using the TextBlock
. But what
if we want to apply only one of those transforms, either to see
intermediate results or because we have already tokenized texts? More
generally, what can we do when the data block API is not flexible enough
to accommodate our particular use case? For this, we need to use
fastai’s mid-level API for processing data. The data block
API is built on top of that layer, so it will allow you to do everything
the data block API does, and much much more.
Going Deeper into fastai’s Layered API
The fastai library is built on a layered API. In the very top layer are applications that allow us to train a model in five lines of
code, as we saw in Chapter 1. In the case of creating
DataLoaders
for a text classifier, for instance, we used this line:
from
fastai.text.all
import
*
dls
=
TextDataLoaders
.
from_folder
(
untar_data
(
URLs
.
IMDB
),
valid
=
'test'
)
The factory method TextDataLoaders.from_folder
is very convenient when
your data is arranged the exact same way as the IMDb dataset, but in
practice, that often won’t be the case. The data block API
offers more flexibility. As we saw in the preceding chapter, we can get the
same result with the following:
path
=
untar_data
(
URLs
.
IMDB
)
dls
=
DataBlock
(
blocks
=
(
TextBlock
.
from_folder
(
path
),
CategoryBlock
Get Deep Learning for Coders with fastai and PyTorch now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.