NLTK应用开发指南

by Posts & Telecom Press, Nitin K Hardeniya
May 2024
Intermediate to advanced
172 pages
2h 39m
Chinese
Packt Publishing

Chapter 10 Text Mining at Scale

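This chapter revisits some of the libraries covered in earlier chapters, this time focusing on how to use them at scale in a big-data environment. It therefore assumes that readers already have some familiarity with big-data frameworks such as Hadoop and Hive. On that basis, we look at how Python libraries such as NLTK, scikit-learn, and pandas can be applied on a Hadoop cluster holding large volumes of unstructured data.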

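We also discuss some common use cases in NLP and text mining, providing code snippets along the way to help you with the related work. Specifically, we look at three main examples that together cover the majority of text mining problems. These examples show how to run NLTK at scale to perform the NLP tasks introduced in the first few chapters of this book. A few further examples show how to perform text classification on big data.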

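Another question that arises when running machine learning and NLP at large scale is whether the task can be parallelized. Here we briefly revisit some of the problems from the previous chapter and consider whether they are big-data problems, or whether under certain conditions they can be solved in a big-data fashion.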
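One way to check whether a task parallelizes is to ask whether it splits into independent per-document steps whose partial results can be merged. The following is a minimal sketch of that idea, not code from the book: the function names are hypothetical, and the regular-expression tokenizer is a stand-in assumption for an NLTK tokenizer. It uses threads only to show the decomposition; on a cluster the same split would be spread across machines.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_tokens(doc):
    """Per-document step: depends only on its own input (hypothetical helper)."""
    # Stand-in tokenizer (assumption); an NLTK tokenizer could be used instead.
    return Counter(re.findall(r"\w+", doc.lower()))


def count_sequential(docs):
    """Baseline: process the documents one after another."""
    total = Counter()
    for doc in docs:
        total.update(count_tokens(doc))
    return total


def count_parallel(docs, workers=2):
    """Same computation, split into independent tasks and merged afterwards."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_tokens, docs)
        # Merging is a plain sum, so the order of the partial results is irrelevant.
        return sum(partials, Counter())
```

Because each document is counted independently and the merge is an order-insensitive sum, both functions return the same counts. Tasks that need global state across documents do not decompose this simply.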

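Since most of the libraries we have learned so far are written in Python, how to process big data with Python (on Hadoop) is also one of the main topics of this chapter.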

After reading this chapter, we hope readers will have mastered the following:

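  • A good understanding of big-data technologies such as Hadoop and Hive, and of how to use Python with them.
  • A step-by-step, tutorial-based grasp of how to use NLTK, scikit-learn, and PySpark on big data.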

There are many ways to run a Python process on Hadoop. Here we discuss some of the most popular ones, which use Python on Hadoop to implement streaming MapReduce jobs[1], Python in Hive ...
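To make the streaming MapReduce idea concrete, here is a minimal sketch of a token-count mapper and reducer that exchange tab-separated `key<TAB>value` lines, which is the usual convention for Hadoop Streaming. It is an illustration under stated assumptions rather than the book's own code: the names `mapper`, `reducer`, and `run_locally` are hypothetical, and the regular-expression tokenizer stands in for an NLTK tokenizer.

```python
import re
from itertools import groupby

# Stand-in tokenizer (assumption); a real job might call an NLTK tokenizer here.
TOKEN_RE = re.compile(r"\w+")


def mapper(lines):
    """Emit one 'token<TAB>1' record per token found in the input lines."""
    for line in lines:
        for token in TOKEN_RE.findall(line.lower()):
            yield f"{token}\t1"


def reducer(sorted_records):
    """Sum the counts for each token; expects records already sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map, then shuffle/sort, then reduce inside a single process."""
    return list(reducer(sorted(mapper(lines))))
```

On a cluster, each function would typically live in its own script that reads `sys.stdin` and prints to standard output, with the scripts supplied to the Hadoop Streaming jar through its `-mapper` and `-reducer` options. The `sorted` call in `run_locally` plays the role of the framework's shuffle-and-sort phase.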

Publisher Resources

ISBN: 9781836205913