Data Science with Java

by Michael R. Brzustowicz
June 2017
Beginner to intermediate
233 pages
5h 57m
English
O'Reilly Media, Inc.

Chapter 4. Data Operations

Now that we know how to read data into a useful data structure, we can operate on that data using what we know about statistics and linear algebra. We perform many operations on data before we subject it to a learning algorithm. This step, often called preprocessing, comprises cleaning the data, regularizing or scaling it, reducing it to a smaller size, encoding text values as numerical values, and splitting it into parts for model training and testing. Often our data is already in one form or another (e.g., List or double[][]), and the learning routines we use may accept either or both of those formats. Additionally, a learning algorithm may need to know whether the labels are binary, multiclass, or encoded in some other way, such as text. We need to account for this and prepare the data before it goes into the learning algorithm. The steps in this chapter can be part of an automated pipeline that takes raw data from the source and prepares it for either learning or prediction algorithms.
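As a rough illustration of two of the preprocessing steps named above (scaling and splitting), here is a minimal sketch that works on a double[][]. The class PreprocessSketch and its methods minMaxScale and trainTestSplit are hypothetical names chosen for this example, not code from the book. Min-max scaling is only one of several possible scaling choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not from the book): scales columns and splits rows. */
final class PreprocessSketch {

    /** Rescales each column to [0, 1]; a constant column maps to 0. */
    static double[][] minMaxScale(double[][] data) {
        int rows = data.length;
        int cols = rows == 0 ? 0 : data[0].length;
        double[][] scaled = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows; i++) {
                scaled[i][j] = range == 0.0 ? 0.0 : (data[i][j] - min) / range;
            }
        }
        return scaled;
    }

    /** Shuffles row indices with a fixed seed and returns {train, test}. */
    static double[][][] trainTestSplit(double[][] data, double testFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        int testSize = (int) Math.round(data.length * testFraction);
        double[][] test = new double[testSize][];
        double[][] train = new double[data.length - testSize][];
        for (int k = 0; k < indices.size(); k++) {
            double[] row = data[indices.get(k)];
            if (k < testSize) {
                test[k] = row;
            } else {
                train[k - testSize] = row;
            }
        }
        return new double[][][] {train, test};
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise the two methods.
        double[][] raw = {{1.0, 200.0}, {2.0, 400.0}, {3.0, 600.0}, {4.0, 800.0}};
        double[][] scaled = minMaxScale(raw);
        double[][][] parts = trainTestSplit(scaled, 0.25, 42L);
        System.out.println("scaled = " + Arrays.deepToString(scaled));
        System.out.println("train rows = " + parts[0].length
                + ", test rows = " + parts[1].length);
    }
}
```

Fixing the seed keeps the split reproducible, which is convenient when the same pipeline prepares data for both training and later evaluation.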

Transforming Text Data

Many learning and prediction algorithms require numerical input. One of the simplest ways to achieve this is by creating a vector space model in which we define a vector of known length and then assign a collection of text snippets (or even words) to a corresponding collection of vectors. The general process of converting text to vectors has many options and variations. Here we will assume that there exists a large ...
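To make the vector space idea concrete, here is a minimal bag-of-words sketch: it builds a fixed vocabulary from a small corpus, so every vector has the same known length, and then maps any text snippet to a vector of term counts. The class BagOfWordsSketch, its tokenizer, and the choice of raw counts are assumptions made for illustration. The excerpt is cut off before the book's own approach is described, so this should not be read as the author's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical bag-of-words vectorizer (an illustration, not the book's code). */
final class BagOfWordsSketch {

    /** Maps each known term to its position in the output vector. */
    private final Map<String, Integer> termIndex = new LinkedHashMap<>();

    /** Builds the vocabulary from the corpus, in sorted order for stable indices. */
    BagOfWordsSketch(String... corpus) {
        TreeSet<String> terms = new TreeSet<>();
        for (String doc : corpus) {
            terms.addAll(tokenize(doc));
        }
        for (String term : terms) {
            termIndex.put(term, termIndex.size());
        }
    }

    /** Lowercases and splits on anything that is not a letter or digit (an assumed rule). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Counts occurrences of each known term; unknown terms are ignored. */
    double[] vectorize(String text) {
        double[] vector = new double[termIndex.size()];
        for (String token : tokenize(text)) {
            Integer index = termIndex.get(token);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        BagOfWordsSketch bow = new BagOfWordsSketch("the cat sat", "the dog sat on the cat");
        System.out.println("terms  = " + bow.termIndex.keySet());
        System.out.println("vector = " + Arrays.toString(bow.vectorize("the dog sat on the cat")));
    }
}
```

Because the vocabulary is fixed up front, snippets of any length map to vectors of the same dimension, which is what most learning routines expect as input.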



Publisher Resources

ISBN: 9781491934104