Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
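... option is to use the multiprocessing library (https://oreil.ly/hoqxv). For parallelizing data frame operations in particular, there are several scalable alternatives worth trying, such as Dask (https://dask.org), Modin (https://oreil.ly/BPMLh), and Vaex (https://oreil.ly/hb66b). The pandarallel library (https://oreil.ly/-PCJa) adds parallel operations directly to Pandas.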
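As a minimal sketch of the multiprocessing option, the following example distributes a text-cleaning function over worker processes with `multiprocessing.Pool`. The function `clean_text`, the helper `parallel_map`, and the sample strings are hypothetical and only illustrate the pattern; they are not taken from the book.

```python
import re
from multiprocessing import Pool


def clean_text(text):
    """Hypothetical cleaning step: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def parallel_map(func, items, processes=2, chunksize=100):
    """Apply func to every item, using worker processes if processes > 1."""
    if processes <= 1:
        # Sequential fallback, useful for debugging and small inputs.
        return [func(item) for item in items]
    with Pool(processes=processes) as pool:
        # pool.map returns the results in the order of the inputs.
        return pool.map(func, items, chunksize=chunksize)


if __name__ == "__main__":
    # The guard matters on platforms that start fresh worker processes.
    docs = ["  Hello   World ", "Text\tAnalytics\nwith Python"] * 1000
    cleaned = parallel_map(clean_text, docs, processes=2)
    print(cleaned[:2])
```

Whether the parallel version is actually faster depends on how expensive the per-document function is compared with the cost of sending data to the workers, so it is worth timing both paths on a sample.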
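In any case, you should keep an eye on the progress and get a rough estimate of the runtime. As mentioned in Chapter 1, the tqdm library helps with this because it provides progress bars for iterators as well as for data frame operations (https://oreil.ly/Rbh_-). The notebooks in this book's GitHub repository use tqdm wherever possible.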
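The two uses of tqdm mentioned above might look roughly like this. The function `count_tokens` and the sample data are hypothetical, and the sketch assumes that both tqdm and Pandas are installed.

```python
import pandas as pd
from tqdm import tqdm


def count_tokens(text):
    """Hypothetical per-document step: count whitespace-separated tokens."""
    return len(text.split())


docs = ["a short text", "another slightly longer text", "one more"]

# Progress bar for a plain iterator: wrap the iterable in tqdm().
lengths = [count_tokens(doc) for doc in tqdm(docs, desc="iterating")]

# Progress bar for Pandas: tqdm.pandas() registers progress_apply
# on Series and DataFrame objects.
tqdm.pandas(desc="applying")
df = pd.DataFrame({"text": docs})
df["num_tokens"] = df["text"].progress_apply(count_tokens)

print(df)
```

Besides the bar itself, tqdm prints the elapsed time and an estimate of the remaining time, which covers the rough runtime estimate described above.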
4.8 Additional Notes
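In this chapter, we walked through the entire language-processing pipeline, starting with data cleaning. However, there are some aspects that we could not cover in detail but that may be helpful or even necessary for your projects.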
4.8.1 Language Detection
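Many corpora contain text in several languages. Whenever you work with a multilingual corpus, you have to choose one of the following options:
• Ignore the other languages if they account for only a negligible share of the texts, and treat every text as if it were written in the corpus's main language, for example English.
• Translate all texts into the main language, for example with Google Translate.
• Identify the language and apply language-dependent preprocessing in the subsequent steps.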
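The third option can be sketched as a simple dispatch on a language code. This assumes every document already carries such a code; the stop word lists below are tiny, hypothetical placeholders, and `preprocess` is an illustrative helper rather than anything from the book.

```python
# Tiny, hypothetical stop word lists; a real project would take them
# from an NLP library instead.
STOP_WORDS = {
    "en": {"the", "is", "a", "of"},
    "de": {"der", "die", "das", "ist", "ein"},
}


def preprocess(text, lang, default_lang="en"):
    """Tokenize on whitespace and drop stop words of the given language."""
    # Fall back to the main language of the corpus for unknown codes.
    stop_words = STOP_WORDS.get(lang, STOP_WORDS[default_lang])
    return [tok for tok in text.lower().split() if tok not in stop_words]


corpus = [
    ("en", "The weather is nice"),
    ("de", "Das Wetter ist gut"),
]
for lang, text in corpus:
    print(lang, preprocess(text, lang))
```

The fallback to `default_lang` corresponds to the first option in the list: texts in languages without dedicated resources are treated like the main language.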
There are many libraries for language detection. We recommend Facebook's fastText library (https://oreil.ly/6QhAj). fastT ...
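Since the preview breaks off here, the following is only a sketch of how language detection with fastText might be wrapped, not the book's own code. The helper `detect_language` and its `threshold` and `default` parameters are hypothetical. The usage comment assumes that the `fasttext` package is installed and that a pretrained language-identification model (for example the file `lid.176.ftz`) has been downloaded.

```python
def detect_language(model, text, threshold=0.5, default="unknown"):
    """Return the language code predicted by a fastText language-ID model.

    `model` is any object with a fastText-style predict(text, k) method
    that returns a pair (labels, probabilities), where labels look like
    '__label__en'.
    """
    # fastText predicts one line at a time, so newlines must be removed.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    if len(labels) == 0 or probs[0] < threshold:
        # Prediction missing or too uncertain: return the fallback value.
        return default
    # Strip the '__label__' prefix to obtain the bare language code.
    return labels[0].replace("__label__", "")


# Usage sketch (assumes the fasttext package and a downloaded model file):
#
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   detect_language(model, "Guten Morgen, wie geht es dir?")
```

Passing the model in as an argument keeps the helper independent of where the model file is stored and makes it easy to apply to a data frame column.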

ISBN: 9787519864446