Natural Language Processing with Python and NLTK (Python和NLTK实现自然语言处理)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing

Chapter 3: Creating Corpora

This chapter covers the following topics.

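  • Setting up a custom corpus.
  • Creating a wordlist corpus.
  • Creating a part-of-speech tagged word corpus.
  • Creating a chunked phrase corpus.
  • Creating a categorized text corpus.
  • Creating a categorized chunk corpus reader.
  • Lazy corpus loading.
  • Creating a custom corpus view.
  • Creating a MongoDB-backed corpus reader.
  • Corpus editing with file locking.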

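This chapter explains how to use corpus readers and how to create custom corpora. If you want to train your own models, such as a part-of-speech tagger or a text classifier, you need a custom corpus to train them on. Model training is covered in later chapters.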

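First, you will learn how to use the existing corpus data that ships with NLTK. This is essential for later chapters, where you will need to access corpora to obtain training data. You have already accessed the WordNet corpus in Chapter 1 of this module. This chapter introduces many more corpora.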

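This chapter also discusses how to create custom corpus readers, which are useful when NLTK does not recognize a corpus's file format, or when the corpus is not stored in files at all but in a database such as MongoDB. Familiarity with tokenization, covered in Chapter 1 of this module, is essential.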

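A corpus is a collection of text documents; corpora is the plural of corpus. The word is Latin for body, and here it refers to a body of text. A custom corpus is therefore really just a set of text files in a directory, often alongside many other directories of text files.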

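You should already have installed the NLTK data package by following the instructions on the NLTK website. This chapter assumes that the data is installed in C:\nltk_data on Windows, or in /usr/share/nltk_data on Linux, UNIX, and Mac OS X.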

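NLTK defines a list of data directories, or paths, in nltk.data.path. A custom corpus must live in one of these paths so that NLTK can find it. To avoid conflicts with the official data package, we will create a custom nltk_data directory in the home directory. The following Python code creates this directory and verifies that it is in the list of known paths specified by nltk.data.path. ...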
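
The step described above (create the directory, then check that it is one of the known data paths) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the book's own listing: the helper name `ensure_corpus_dir` is hypothetical, and the list of search paths is passed in as a parameter so that the sketch runs without NLTK installed. In real use that list would be `nltk.data.path`.

```python
import os
import tempfile


def ensure_corpus_dir(base_dir, search_paths):
    """Create base_dir if missing; report whether it exists and is a known path.

    ensure_corpus_dir is a hypothetical helper used only for illustration.
    search_paths stands in for nltk.data.path.
    """
    os.makedirs(base_dir, exist_ok=True)  # no error if the directory already exists
    return os.path.isdir(base_dir), base_dir in search_paths


# Demonstration in a throwaway directory, so nothing is written to the real
# home directory. The path list below is an assumed stand-in for nltk.data.path.
with tempfile.TemporaryDirectory() as fake_home:
    data_dir = os.path.join(fake_home, "nltk_data")
    exists, known = ensure_corpus_dir(data_dir, [data_dir, "/usr/share/nltk_data"])
    print(exists, known)
```

With NLTK installed, the corresponding call would presumably be `ensure_corpus_dir(os.path.expanduser('~/nltk_data'), nltk.data.path)` after `import nltk.data`. Both values should come back true if the home-directory path is in NLTK's search list, which is what the text says the check confirms.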

Publisher Resources

ISBN: 9781835083451