Natural Language Processing with Python and NLTK (Python和NLTK实现自然语言处理)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing

Chapter 10: Text Mining at Scale

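This chapter uses some of the libraries covered in earlier chapters, but here we want to learn how they scale up in a big-data environment. We assume you have some knowledge of big data, Hadoop, and Hive. We will explore how Python libraries such as NLTK, Scikit-learn, and Pandas can be used on a Hadoop cluster that holds large amounts of unstructured data.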

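In the context of NLP and text mining, this chapter discusses some of the most common use cases and provides code snippets to help readers get the job done. It looks at three main examples that represent the vast majority of text mining problems. It shows how to run NLTK at scale to perform some of the NLP tasks completed in earlier chapters, and it gives some examples of text classification on big data.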

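Another aspect of machine learning and natural language processing at scale is understanding whether a problem can be parallelized. This chapter briefly discusses some of the problems mentioned in the previous chapter and explores whether they are big-data problems. In some cases, it may even be possible to solve these problems using big data.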
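As a rough illustration of what "parallelizable" means here, the sketch below counts words over independent chunks of a corpus and then merges the partial counts. The sample documents and the helper names (`count_words`, `chunked_count`) are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter
from functools import reduce

def count_words(docs):
    """Count lowercase, whitespace-separated tokens in a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def chunked_count(docs, n_chunks):
    """Count each chunk independently, then merge the partial results."""
    chunks = [docs[i::n_chunks] for i in range(n_chunks)]
    # Each chunk is processed without looking at any other chunk,
    # so in principle each call could run on a separate worker.
    partials = [count_words(chunk) for chunk in chunks]
    return reduce(lambda a, b: a + b, partials, Counter())

# Hypothetical sample corpus.
docs = ["big data is big", "text mining at scale", "data at scale"]
assert chunked_count(docs, 2) == count_words(docs)
```

Word counting splits cleanly because partial counts combine by simple addition. A task whose result for one document depends on all the others would not divide this easily.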

Since most of the libraries we have learned so far are written in Python, we next address one of the main issues: how to use Python on big data (Hadoop).

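By the end of this chapter, we expect readers to be able to:

  • understand big-data technologies such as Hadoop and Hive, and how to work with them from Python;
  • use NLTK, Scikit-learn, and PySpark on big data, progressing from simple to more advanced usage.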

There are many ways to run Python processes on Hadoop. This section covers some of the mainstream ones: streaming MapReduce jobs, Python UDFs in Hive, and Python Hadoop wrappers.

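Hadoop jobs are usually written in the form of a map function and a reduce function. For a given task, the user must write implementations of both. These mappers and reducers are typically implemented in Java. Hadoop also provides streaming, which lets users write mappers and reducers in other languages, such as Python, in much the same way as in Java. We assume readers have already run the word count example using Python. Later in this chapter we will run the same example using NLTK. ...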
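For readers who want a reminder of the word count example, here is a minimal sketch of a streaming-style mapper and reducer in Python. It is not the book's own listing: the function names and the `run_locally` helper are assumptions for illustration, and the sort step only imitates the shuffle that Hadoop performs between the two stages.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum the counts for each word.

    Assumes the input records are sorted by key, which is what the
    framework provides to a streaming reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Hypothetical helper: simulate map -> sort -> reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))

print(run_locally(["the cat", "the dog"]))
```

In an actual streaming job, each function would live in its own script that reads records from `sys.stdin` and prints results to standard output. The job submission command is omitted here because the location of the streaming jar depends on the installation.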

ISBN: 9781835083451