A novel solution for a data augmentation and bias problem in NLP using TensorFlow

by KC Tung

Released February 2020

Publisher(s): O'Reilly Media, Inc.

ISBN: 0636920373759

Video description

The TensorFlow ecosystem contains many valuable assets, one of which is the highly acclaimed TensorFlow high-level API. It offers a fast and lightweight way to reduce lead time in deep learning model development and hypothesis testing, making it possible to quickly develop a novel deep learning solution to an important practical need: data bias and augmentation in NLP. Solving this problem would have a far-reaching impact on model bias, offensive-language detection, language personalization, and classification.

KC Tung (Microsoft) details his work to satisfy a need of an enterprise customer (one of the largest airlines in the world) for a model that can accurately review, classify, and store texts from aircraft maintenance logs to comply with FAA regulations on aviation safety. The customer’s data is imbalanced and biased toward certain categories.

Training machine learning models with imbalanced data inevitably leads to model bias, and text generation is a novel and important approach to data augmentation. In NLP, many current approaches to augmenting minority data are unsupervised and are limited to synonym swap, insertion, deletion, or oversampling. These generalized approaches often lead to a trade-off between precision and recall. They also don’t work well in practice, as enterprise data is almost always domain specific. What is needed is a better framework that generates new text by learning from any domain-specific underrepresented text.
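For context, the generalized baselines named above (oversampling and random deletion) can be sketched in a few lines of plain Python. This is a minimal illustration, not the method from the talk: the function names, the deletion probability, and the two sample maintenance-log sentences are invented for the example.

```python
import random

def oversample(texts, target_size, rng):
    """Random oversampling: duplicate minority examples until target_size is reached."""
    extra = [rng.choice(texts) for _ in range(target_size - len(texts))]
    return texts + extra

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented minority-class examples (not from the customer's data).
minority = [
    "replaced hydraulic pump on left main gear",
    "inspected fuel line for leak near engine two",
]

# Step 1: grow the minority class by duplication.
balanced = oversample(minority, 6, rng)

# Step 2: perturb each copy by randomly deleting about 20% of its tokens.
noisy = [" ".join(random_deletion(t.split(), 0.2, rng)) for t in balanced]
```

Both steps only recombine words already present in the minority class, which is one reason such baselines add little domain-specific variety.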

KC presents a novel deep learning framework built with TensorFlow to quickly achieve this goal. A benchmark model is first trained on a balanced dataset. One class in that dataset is then undersampled to act as the underrepresented minority class. A gated recurrent unit (GRU) model learns to generate more of the underrepresented text, which helps train a long short-term memory (LSTM) model that classifies text. The result on holdout data shows that the model trained with generated text is surprisingly effective: classification accuracy and per-class precision and recall are all on par with the benchmark model. In short, the work demonstrates how the enterprise customer quickly applied the TensorFlow high-level API to build a novel production-grade solution for deployment. It also demonstrates the effectiveness of a novel data-augmentation framework, identifies a “killer app” (a new core value) for text generation, and offers best practices and guidance for navigating machine learning model bias and its business impact.
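The two model roles described above can be sketched with the Keras API in TensorFlow. This is a rough sketch under stated assumptions, not the architecture from the talk: the vocabulary size, embedding width, layer sizes, and number of classes are all placeholder values, and tokenization, training, and sampling from the generator are left out.

```python
import tensorflow as tf

# All sizes below are assumptions chosen for illustration only.
VOCAB_SIZE = 1000
EMBED_DIM = 32
NUM_CLASSES = 4

def build_generator(vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM, units=64):
    """GRU language model: emits next-token logits at every sequence position."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.GRU(units, return_sequences=True),
        tf.keras.layers.Dense(vocab_size),  # unnormalized logits over the vocabulary
    ])

def build_classifier(vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM,
                     units=64, num_classes=NUM_CLASSES):
    """LSTM text classifier: one probability vector per input sequence."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

generator = build_generator()
generator.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

classifier = build_classifier()
classifier.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Shape check on a dummy batch of 2 sequences, 20 token ids each.
dummy_tokens = tf.zeros((2, 20), dtype=tf.int32)
next_token_logits = generator(dummy_tokens)   # (batch, time, vocab)
class_probs = classifier(dummy_tokens)        # (batch, classes)
```

In a setup like the one described, the generator would be fit only on minority-class text, and its sampled output would be added to the classifier's training set before the classifier is fit.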

KC also details how to containerize the TensorFlow application and serve it in a Kubernetes cluster in the cloud, all with open source Python libraries. The TensorFlow high-level API proves to be indispensable for a fast and high-quality deep learning model development experience. Most importantly, this TensorFlow model may be deployed as a container in the cloud, on-premises, or at the edge, providing great flexibility to meet various solution architecture or business needs.

Prerequisite knowledge

Experience with TensorFlow, Keras, or other machine learning frameworks (useful but not required)
Familiarity with NLP, deep learning, text classification, and text generation (useful but not required)

What you'll learn

The TensorFlow high-level API for building production-grade models
Quick starts for deep learning model development, with hidden gems in tf.data examples
A new "killer app" for machine text generation using TensorFlow
A reference architecture for TensorFlow model deployment in the cloud or at the edge
