Python和NLTK实现自然语言处理 (Natural Language Processing with Python and NLTK)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing

Chapter 5: Extracting Chunks

This chapter covers the following topics.

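  • Chunking and chinking with regular expressions.
  • Merging and splitting chunks with regular expressions.
  • Expanding and removing chunks with regular expressions.
  • Partial parsing with regular expressions.
  • Training a tagger-based chunker.
  • Classification-based chunking.
  • Extracting named entities.
  • Extracting proper noun chunks.
  • Extracting location chunks.
  • Training a named entity chunker.
  • Training a chunker with NLTK-Trainer.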

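Chunk extraction, or partial parsing, is the process of extracting short phrases from a part-of-speech tagged sentence. It differs from full parsing in that we are interested in standalone chunks, or phrases, rather than a complete parse tree (for more on parse trees, see Wikipedia). The underlying idea is that meaningful phrases can be extracted from a sentence by looking for particular patterns of part-of-speech tags.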

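As in Chapter 4, we will use the Penn Treebank corpus for basic training and testing of chunk extraction. We will also use the CoNLL2000 corpus, because its format is relatively simple and flexible and supports multiple chunk types (see Sections 3.2 and 3.5 of this module for more details on the conll2000 corpus and IOB tags).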
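
To make the IOB tags mentioned above concrete, here is a minimal sketch that converts a small hand-built chunk tree into (word, tag, IOB) triples with NLTK's tree2conlltags and back again with conlltags2tree. The sentence, its tags, and the NP chunks are illustrative assumptions; nothing is loaded from the conll2000 corpus, so no corpus download is needed.

```python
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.tree import Tree

# Hand-built chunk tree (illustrative data, not read from conll2000):
# two NP chunks separated by an unchunked verb.
tree = Tree('S', [
    Tree('NP', [('the', 'DT'), ('book', 'NN')]),
    ('has', 'VBZ'),
    Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')]),
])

# B-NP marks the first word of an NP chunk, I-NP a word inside one,
# and O a word outside every chunk.
iob = tree2conlltags(tree)
for word, tag, iob_tag in iob:
    print(word, tag, iob_tag)

# The triples carry enough information to rebuild the same tree.
rebuilt = conlltags2tree(iob)
```

Because each word carries its own chunk label, the same flat format can hold several chunk types in one sentence.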

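Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define which kinds of words make up a chunk. We can also define patterns for the kinds of words that should not be in a chunk. These unchunked words are known as chinks.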

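The ChunkRule class specifies what to include in a chunk, while the ChinkRule class specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks those chunks up.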

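First, we need to know how to define chunk patterns. These are modified regular expressions designed to match sequences of part-of-speech tags. A single tag is specified with angle brackets, such as <NN> to match a noun tag. Multiple tags can be combined, as in <DT><NN> to match a determiner followed by a noun. Inside the angle brackets, regular expression syntax matches individual tag patterns, so <NN.*> matches all nouns, including NN and NNS. Regular expression syntax can also be used outside the angle brackets to match patterns of tags. <DT>?<NN.*> ...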
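
The difference between regular expression syntax inside and outside the angle brackets can be sketched with the standard library alone. The helpers tag_pattern_to_regex and find_tag_matches below are hypothetical and are not NLTK's implementation; they only approximate the idea by treating a tag sequence as one string and narrowing "." so that it cannot cross a tag boundary. The tag list is an illustrative assumption.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Hypothetical helper: turn a tag pattern into a plain regex string."""
    # Whitespace between tags is not significant.
    compact = re.sub(r'\s+', '', tag_pattern)
    # Wrap each <...> in a group so operators outside the brackets
    # (such as ?) apply to the whole tag, not just to the final '>'.
    grouped = re.sub(r'<([^<>]+)>', r'(?:<\1>)', compact)
    # Narrow '.' so it matches within one tag only.
    return grouped.replace('.', '[^<>]')

def find_tag_matches(tag_pattern, tags):
    """Return every run of tags matched by tag_pattern, as lists of tag names."""
    tag_string = ''.join('<%s>' % tag for tag in tags)
    regex = re.compile(tag_pattern_to_regex(tag_pattern))
    return [re.findall(r'<([^<>]+)>', match.group())
            for match in regex.finditer(tag_string)]

tags = ['DT', 'NN', 'VBZ', 'JJ', 'NNS']

print(find_tag_matches('<NN>', tags))         # [['NN']]
print(find_tag_matches('<NN.*>', tags))       # [['NN'], ['NNS']]
print(find_tag_matches('<DT> <NN>', tags))    # [['DT', 'NN']]
print(find_tag_matches('<DT>?<NN.*>', tags))  # [['DT', 'NN'], ['NNS']]
```

In the last call the ? sits outside the angle brackets, so it makes the whole determiner tag optional, while the .* inside the brackets only widens which noun tags are accepted.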

ISBN: 9781835083451