Natural Language Processing with Python and NLTK (Python和NLTK实现自然语言处理)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing

Chapter 8: Distributed Processing and Handling Large Datasets

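This chapter covers the following topics.

  • Distributed tagging with execnet.
  • Distributed chunking with execnet.
  • Parallel list processing with execnet.
  • Storing a frequency distribution in Redis.
  • Storing a conditional frequency distribution in Redis.
  • Storing an ordered dictionary in Redis.
  • Distributed word scoring with Redis and execnet.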

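NLTK is a great tool for in-memory, single-processor natural language processing. Sometimes, however, you have a large amount of data to process and want to take advantage of multiple CPUs, multi-core CPUs, or even multiple computers. Or you may want to store frequencies and probabilities in a persistent, shared database so that multiple processes can access the data at the same time. For the first case, we will use execnet with NLTK for parallel and distributed processing. For the second case, you will learn how to use the Redis data structure server/database to store frequency distributions and more.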

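execnet is a distributed execution library for Python. It lets you create gateways and channels for executing code remotely. A gateway is a connection from the calling process to a remote environment, which can be a local subprocess or a remote node reached over SSH. A channel is created from a gateway and handles communication between the channel's creator and the remote code. In this sense, execnet acts as a kind of message passing interface (Message Passing Interface, MPI), in which gateways establish the connections and channels carry messages back and forth.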
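The gateway and channel roles described above can be illustrated with the standard library alone. The sketch below is not execnet's API: `LocalGateway`, `REMOTE_SOURCE`, and the JSON-lines message format are hypothetical stand-ins chosen for illustration, with a local subprocess playing the remote environment and its stdin/stdout pipes playing the channel. With execnet itself, the corresponding steps would go through `execnet.makegateway()` and the gateway's `remote_exec()` method.

```python
import json
import subprocess
import sys

# Code executed in the "remote" environment (here, a local subprocess).
# It reads one JSON message per line and answers with one JSON message per line.
REMOTE_SOURCE = """
import json, sys
for line in iter(sys.stdin.readline, ""):
    words = json.loads(line)
    print(json.dumps([w.upper() for w in words]), flush=True)
"""


class LocalGateway:
    """Hypothetical stand-in for a gateway: a connection to a local subprocess."""

    def __init__(self, source):
        # Start a fresh Python interpreter that runs the given source.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", source],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def send(self, obj):
        # Channel, outbound direction: one JSON document per line.
        self.proc.stdin.write(json.dumps(obj) + "\n")
        self.proc.stdin.flush()

    def receive(self):
        # Channel, inbound direction: block until the remote side replies.
        return json.loads(self.proc.stdout.readline())

    def exit(self):
        # Closing stdin ends the remote loop; then wait for the process to stop.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


gw = LocalGateway(REMOTE_SOURCE)
gw.send(["natural", "language"])
reply = gw.receive()
gw.exit()
print(reply)  # ['NATURAL', 'LANGUAGE']
```

The point of the sketch is the division of labor: the gateway object owns the connection to the other process, while `send` and `receive` only move messages across it.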

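Because many NLTK processes use 100% of a CPU while computing, execnet is an ideal way to distribute that computation for maximum resource utilization. You can create one gateway per CPU core, and it does not matter whether the cores are on the local machine or spread across remote machines. In many cases, you only need to keep the trained objects and data on a single machine and send them to the remote nodes as needed. ...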
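One way to prepare work for one worker per CPU core is to deal the input into batches before anything is sent, and to tag each item with its position so results can be put back in order. The sketch below covers only that bookkeeping. The helper names `partition` and `reassemble` are hypothetical, and the "remote" computation is replaced by a local word count, so no processes are actually started.

```python
import os


def partition(items, n_workers):
    """Deal items round-robin into n_workers batches of (index, item) pairs."""
    batches = [[] for _ in range(n_workers)]
    for index, item in enumerate(items):
        batches[index % n_workers].append((index, item))
    return batches


def reassemble(result_batches):
    """Merge batches of (index, result) pairs back into the original order."""
    pairs = [pair for batch in result_batches for pair in batch]
    return [value for _, value in sorted(pairs, key=lambda pair: pair[0])]


sentences = [
    "NLTK tags words",
    "execnet sends messages between processes",
    "Redis stores counts",
]

# Assumption: one worker per core, but never more workers than items.
n_workers = min(os.cpu_count() or 1, len(sentences))
batches = partition(sentences, n_workers)

# Stand-in for the remote computation: count the words in each sentence.
results = [[(i, len(s.split())) for i, s in batch] for batch in batches]

word_counts = reassemble(results)
print(word_counts)  # [3, 5, 3]
```

Because every item carries its index, the final order does not depend on how many workers were used or on which batch finishes first.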


Publisher Resources

ISBN: 9781835083451