Chapter 3. Vectors: Representing Semantic Information

Language comes so naturally to humans that its complexity is hard to appreciate. We go from concept and meaning to spoken or written word (and back) mostly unconsciously. If computers were human, they could communicate in natural language just as easily; instead, AI researchers have studied symbolic natural language processing (NLP) for decades with mixed results. The advent of modern machine learning and the age of big data have revolutionized NLP and brought a paradigm shift in our approach, enabling us to encode language as high-dimensional vectors.1 In this chapter, you will learn how ML systems create, train on, and employ vectors to work with natural language.

Vector Basics

Computers and ML models understand only numbers. To work with the information contained in natural language, they need that information in numeric form. Vectors are that numeric form.

Vectors for semantic search (called embeddings) represent natural language as a set of values across many dimensions. When people train ML models for use in search engines, the goal is a model that generates vectors that are close together for text with similar meanings and far apart for text with different meanings.
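To make that concrete, here is a minimal sketch of generating embeddings and comparing them. It assumes the open source sentence-transformers library and its all-MiniLM-L6-v2 model, neither of which this chapter prescribes; any sentence-embedding model would illustrate the same behavior.

    from sentence_transformers import SentenceTransformer, util

    # Assumed example model; other embedding models work similarly.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = [
        "A man is eating food.",
        "Someone is having a meal.",       # similar meaning to the first
        "The stock market fell sharply.",  # different meaning
    ]

    # Each sentence becomes one vector with hundreds of dimensions.
    embeddings = model.encode(sentences)

    # Cosine similarity: near 1.0 for similar meanings, lower otherwise.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # high
    print(util.cos_sim(embeddings[0], embeddings[2]))  # low

Comparing the first sentence against the other two shows exactly the property described above: the paraphrase lands close by, the unrelated sentence far away.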

A vector (when centered at the origin) is a list of values, one for each axis of an n-dimensional space. Figure 3-1 shows the vector (4, 6) in two dimensions, X and Y. You visualize this vector by drawing a line from the origin to the (X, Y) point. We ...
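To ground the geometry, the following sketch (an illustrative example using NumPy, not taken from the chapter) builds the vector (4, 6), measures its length, and compares it with two other 2-D vectors using cosine similarity, the closeness measure commonly applied to embeddings.

    import numpy as np

    # The 2-D vector from Figure 3-1: one value per axis.
    v = np.array([4.0, 6.0])

    # Length (Euclidean norm): sqrt(4^2 + 6^2), roughly 7.21.
    print(np.linalg.norm(v))

    def cos_sim(a, b):
        # Cosine of the angle between a and b, ignoring their lengths.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    w = np.array([5.0, 7.0])   # points in nearly the same direction as v
    u = np.array([6.0, -4.0])  # perpendicular to v

    print(cos_sim(v, w))  # close to 1.0: very similar direction
    print(cos_sim(v, u))  # 0.0: orthogonal, no similarity

The same arithmetic carries over unchanged to embeddings; the only difference is that real embedding vectors have hundreds or thousands of axes instead of two.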
