7 Numerical Representations of Chemical Data for Structure-Based Machine Learning

Gyoung S. Na

Korea Research Institute of Chemical Technology (KRICT), Daejeon, 141 Gajeong-ro, Republic of Korea

7.1 Machine Readable Data Formats

Chemical data are usually represented as chemical formulas, molecular structures, or composite formats that combine feature vectors with chemical compounds. For machine learning, such unstructured chemical data must be converted into numerical formats, because machine learning algorithms are essentially mathematical functions that map numerical input values to numerical target values. For example, fully-connected neural networks (FCNNs) [1] take feature vectors as their inputs, whereas convolutional neural networks (CNNs) [2] take feature matrices or tensors. Thus, the first step of machine learning for chemical applications is to represent the chemical data in a numerical format, such as feature vectors, feature matrices, or mathematical graphs.
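As a minimal sketch of these three formats, the snippet below encodes a single water molecule (H2O) as a feature vector, a feature matrix, and a graph. The atom ordering, the choice of element counts and one-hot rows as features, and all variable names are illustrative assumptions, not encodings prescribed by this chapter.

```python
# Water (H2O): one oxygen atom bonded to two hydrogen atoms.
# Atom order and feature choices below are illustrative assumptions.
atoms = ["O", "H", "H"]
bonds = [(0, 1), (0, 2)]  # each O-H bond as a pair of atom indices

# Fixed element order shared by the vector and matrix encodings.
elements = ["H", "O"]

# 1) Feature vector: number of atoms of each element in the molecule.
feature_vector = [float(atoms.count(e)) for e in elements]

# 2) Feature matrix: one row per atom, one-hot encoded over `elements`.
feature_matrix = [[1.0 if a == e else 0.0 for e in elements] for a in atoms]

# 3) Mathematical graph: symmetric adjacency matrix built from the bonds.
n_atoms = len(atoms)
adjacency = [[0] * n_atoms for _ in range(n_atoms)]
for i, j in bonds:
    adjacency[i][j] = 1
    adjacency[j][i] = 1

print(feature_vector)
print(feature_matrix)
print(adjacency)
```

The vector discards all structural information, the matrix keeps per-atom features, and the adjacency matrix keeps the bonding pattern, which is why different model families expect different formats.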

7.1.1 Feature Vectors

The feature vector is the most essential data format in machine learning. Formally, the feature vectors of the data are defined on a numerical space $\chi \subseteq \mathbb{R}^d$, where $d$ is the number of features in each feature vector. Figure 7.1 presents three examples of feature vectors for data representations. Numerically, we can describe an animal with its species, height, weight, ...
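The animal description can be sketched as a feature vector in code. This is a hedged illustration only: the species vocabulary, the one-hot encoding of the species, the units, and the helper name `animal_to_vector` are all assumptions made for the example and are not taken from Figure 7.1.

```python
# Hypothetical species vocabulary; a categorical feature must be made numeric,
# and one-hot encoding is one common (assumed) way to do that.
SPECIES = ["cat", "dog", "horse"]


def animal_to_vector(species, height_cm, weight_kg):
    """Return a feature vector: one-hot species followed by height and weight."""
    one_hot = [1.0 if species == s else 0.0 for s in SPECIES]
    return one_hot + [float(height_cm), float(weight_kg)]


# Made-up example values for a single animal.
x = animal_to_vector("dog", 60, 25)
d = len(x)  # number of features, i.e. the dimension of the feature space

print(x, d)
```

Here every animal maps to a point in the same d-dimensional real space, so any model that expects fixed-length numerical inputs can consume it.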
