图解大模型 : 生成式AI 原理与实战

by Jay Alammar, Maarten Grootendorst
May 2025
Intermediate to advanced
382 pages
10h 33m
Chinese
Posts & Telecom Press
    )[:, 0]
    return songs_df.iloc[similar_songs]

# Extract the recommendations
print_recommendations(2172)
2.6 Summary
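In this chapter, we covered LLM tokens, tokenizers, and practical ways of using token embeddings. This prepares us for a deeper look at language models in the next chapter, and it also opens the door to using embeddings beyond language models.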
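We explored how tokenizers act as the first step in processing the input to an LLM, converting raw text into token IDs. Common tokenization schemes break text into words, subwords, characters, or bytes, depending on the requirements of the specific application.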
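The four granularities mentioned above can be compared on a single string. The following is a minimal sketch, not a real tokenizer: the regular expression, the hand-written `subword_vocab`, and the greedy longest-match helper `greedy_subwords` are all assumptions made for illustration.

```python
import re

text = "Tokenizers split text."

# Word, character, and byte granularities.
word_tokens = re.findall(r"\w+|[^\w\s]", text)   # words and punctuation marks
char_tokens = list(text)                         # individual characters
byte_tokens = list(text.encode("utf-8"))         # raw UTF-8 byte values

# Subword granularity, using a hypothetical hand-written vocabulary.
subword_vocab = ["Token", "izers", "split", "text", ".", " "]

def greedy_subwords(s, vocab):
    """Split s by repeatedly taking the longest vocabulary entry that matches.

    Falls back to a single character when nothing matches.
    """
    tokens, i = [], 0
    while i < len(s):
        match = max((v for v in vocab if s.startswith(v, i)), key=len, default=s[i])
        tokens.append(match)
        i += len(match)
    return tokens

subword_tokens = greedy_subwords(text, subword_vocab)

# Token IDs are positions in the vocabulary (every piece is in the vocabulary here).
token_to_id = {tok: idx for idx, tok in enumerate(subword_vocab)}
token_ids = [token_to_id[t] for t in subword_tokens]

print(word_tokens)
print(subword_tokens)
print(token_ids)
```

Trained tokenizers learn their vocabulary from data rather than taking it from a hand-written list, but the final lookup from token to integer ID works the same way.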
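By touring real-world pretrained tokenizers (from BERT to GPT-2, GPT-4, and other models), we saw that some tokenizers do better in certain respects (for example, preserving information such as capitalization, newlines, or tokens from other languages). In other respects, tokenizers simply differ from one another (for example, in how they break up certain words), and no choice is better or worse.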
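To make the point about preserved information concrete, here is a toy comparison. Both functions are hypothetical stand-ins, not the actual BERT or GPT tokenizers: one lowercases and splits on whitespace, the other works on UTF-8 bytes.

```python
text = "Hello\nWORLD"

def uncased_whitespace_tokenize(s):
    # Hypothetical tokenizer: lowercases and splits on any whitespace,
    # so capitalization and newlines cannot be recovered afterwards.
    return s.lower().split()

def byte_tokenize(s):
    # Byte-level tokens keep every character of the input.
    return list(s.encode("utf-8"))

def byte_detokenize(ids):
    return bytes(ids).decode("utf-8")

lossy = " ".join(uncased_whitespace_tokenize(text))
lossless = byte_detokenize(byte_tokenize(text))

print(repr(lossy))
print(repr(lossless))
```

The first round trip loses both the capital letters and the newline, while the second reproduces the input exactly. Whether that loss matters depends on the task, for instance it matters a great deal for code.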
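Tokenizer design involves three main decisions: the tokenization algorithm (such as BPE, WordPiece, or SentencePiece), the tokenization parameters (including vocabulary size, special tokens, capitalization handling, and the treatment of different languages), and the dataset the tokenizer is trained on.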
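The three decisions can be seen together in a small byte-pair encoding (BPE) trainer. This is a simplified sketch under stated assumptions: the function name `train_bpe`, the tiny corpus, and the use of `num_merges` as a stand-in for vocabulary size are illustrative choices, and production tokenizers add pre-tokenization, special tokens, and byte-level fallbacks.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

# Algorithm: BPE. Parameter: number of merges. Dataset: this toy corpus.
merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

Changing any of the three inputs changes the result: a different corpus yields different merges, and a larger `num_merges` yields longer tokens and a larger vocabulary.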
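Language models can produce high-quality contextualized token embeddings, which improve on the original static embeddings. These contextualized token embeddings can be used for tasks such as named-entity recognition, extractive text summarization, and text classification.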
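The difference between static and contextualized embeddings can be illustrated with a toy example. The two-dimensional vectors below are invented, and `contextualize` is a single attention-style averaging step with no learned weights, so it only hints at what a real language model does across many trained layers.

```python
import math

# Hypothetical static embeddings: one fixed vector per token.
static = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def contextualize(tokens):
    """Return one vector per token, each a softmax-weighted average of all
    static vectors in the sequence (weights come from dot products)."""
    vecs = [static[t] for t in tokens]
    out = []
    for q in vecs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vecs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, vecs)) / total
            for d in range(len(q))
        ])
    return out

bank_near_river = contextualize(["river", "bank"])[1]
bank_near_money = contextualize(["money", "bank"])[1]

print(static["bank"])      # identical in every sentence
print(bank_near_river)     # pulled toward "river"
print(bank_near_money)     # pulled toward "money"
```

The static lookup returns the same vector for "bank" everywhere, while the contextualized vector shifts with its neighbors, which is the property that helps token-level tasks such as named-entity recognition.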
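Beyond token embeddings, language models can also produce text embeddings that cover entire sentences or even documents. This provides powerful support for the many language model applications shown in Part II of this book.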
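One common way to obtain a single text embedding is to average the token embeddings of a sentence (mean pooling) and compare the results with cosine similarity. The sketch below assumes made-up token vectors; the helper names `mean_pool` and `cosine` are illustrative, and dedicated text embedding models are trained specifically for this purpose rather than relying on plain averaging.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one text vector."""
    dims = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / len(token_vectors) for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical token embeddings for three short texts.
text_a = [[1.0, 0.0], [0.8, 0.2]]
text_b = [[0.9, 0.1], [0.7, 0.3]]
text_c = [[0.0, 1.0], [0.1, 0.9]]

emb_a, emb_b, emb_c = (mean_pool(t) for t in (text_a, text_b, text_c))

print(cosine(emb_a, emb_b))  # texts with similar token vectors score high
print(cosine(emb_a, emb_c))  # texts with dissimilar token vectors score low
```

Because each text is reduced to one fixed-size vector, texts of different lengths can be compared directly, which is what search and clustering applications rely on.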
Before LLMs, word embedding methods such as word2vec, GloVe, and fastText were very popular. In language processing ...
Publisher Resources

ISBN: 9787115670830