Now that we understand the data and the DeepSpeech model architecture, let's set up the environment to train the model. Creating a virtual environment for the project is an optional preliminary step, but it is always recommended. Training these models on GPUs is also recommended.
In addition to Python version 3.5 and TensorFlow version 1.7+, the following are some of the prerequisites:
- python-Levenshtein: To compute the character error rate (CER), which is based on the Levenshtein (edit) distance between the predicted and the reference transcript
- python_speech_features: To extract MFCC features from raw data
- pysoundfile: To read FLAC files
- scipy: Helper functions for windowing
- tqdm: For displaying a progress bar
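Before installing anything, it can help to see which of these packages are already available to the current interpreter. The following is a minimal sketch, not part of the project code: the `REQUIRED` mapping and the `find_missing` helper are hypothetical names introduced here for illustration, and the mapping assumes the usual import names (`Levenshtein` for python-Levenshtein, `soundfile` for pysoundfile). It only checks whether each module can be located; it does not verify versions.

```python
import importlib.util
import sys

# Hypothetical mapping from the package names listed above to the module
# names they are normally imported under (an assumption of this sketch).
REQUIRED = {
    "tensorflow": "tensorflow",
    "python-Levenshtein": "Levenshtein",
    "python_speech_features": "python_speech_features",
    "pysoundfile": "soundfile",
    "scipy": "scipy",
    "tqdm": "tqdm",
}


def find_missing(required=REQUIRED):
    """Return the package names whose module cannot be located."""
    missing = []
    for package_name, module_name in required.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None:
            missing.append(package_name)
    return missing


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = find_missing()
    if missing:
        print("Not installed yet:", ", ".join(sorted(missing)))
    else:
        print("All listed prerequisites can be imported.")
```

Running the script inside the virtual environment, once it has been created and the packages installed, should report that nothing is missing.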
Let's create the virtual environment and install ...