Python文本分析
by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
Content preview from Python文本分析
Feature Engineering and Syntactic Similarity
5.6.4 Tips for Long-Running Programs Such as Syntactic Similarity Analysis
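Here are some efficiency tips for long-running programs:
Benchmark first to avoid long waits
Before running many computations on the entire dataset, run a single computation and use it to estimate the total runtime and memory the whole algorithm will need. You should understand how runtime and memory grow as the problem gets larger (linearly, polynomially, or exponentially). Otherwise, you may wait for hours (or even days) only to see the computation run out of memory when it is just 10% complete.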
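The estimate described above can be sketched as follows. This is a minimal illustration, not code from the book: `pairwise_overlap` is a hypothetical toy workload standing in for the real similarity computation, the sample sizes are arbitrary, and the projection assumes that runtime follows a power law in the number of documents.

```python
import math
import time


def pairwise_overlap(docs):
    """Toy quadratic workload: count shared tokens for every document pair."""
    token_sets = [set(doc.split()) for doc in docs]
    total = 0
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            total += len(token_sets[i] & token_sets[j])
    return total


def time_once(func, data):
    """Wall-clock seconds for a single call of func(data)."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


def growth_exponent(n_small, t_small, n_large, t_large):
    """Estimate k in runtime ~ n**k from two measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


def extrapolate(n_measured, t_measured, n_target, exponent):
    """Project the runtime at n_target, assuming runtime ~ n**exponent."""
    return t_measured * (n_target / n_measured) ** exponent


# Hypothetical sample data; replace with slices of the real corpus.
docs = [f"word{i % 50} word{i % 7} common token" for i in range(800)]

t_small = time_once(pairwise_overlap, docs[:400])
t_large = time_once(pairwise_overlap, docs[:800])
k = growth_exponent(400, t_small, 800, t_large)
estimate = extrapolate(800, t_large, 100_000, k)
print(f"estimated growth exponent: {k:.2f}")
print(f"projected runtime for 100,000 documents: {estimate / 60:.1f} minutes")
```

Timings on small samples are noisy, so the projection is only a rough order-of-magnitude guide. Memory growth can be checked in the same spirit, for example with the standard library's `tracemalloc` module.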
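Try to split the problem into smaller parts
Splitting a problem into smaller parts has many benefits. When we searched for the most similar documents in the news corpus, the whole process finished in only about 20 minutes and did not use much memory. Had we computed everything directly, we would most likely have found out that we lacked the memory only after a long run. Splitting the problem into parts also makes it easy to exploit multicore architectures or even to distribute the work across several machines.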
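The idea of splitting the similarity computation into parts can be sketched as follows. This is a simplified illustration using only the standard library, not the book's implementation: the bag-of-words vectors, the helper names, and the tiny `block_size` are all assumptions made for the example. The point is that only one block of the similarity matrix is examined at a time and only the best pair seen so far is kept.

```python
import math
from collections import Counter


def normalize(doc):
    """Turn a document into an L2-normalized bag-of-words vector (dict)."""
    counts = Counter(doc.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def dot(u, v):
    """Dot product of two sparse vectors (cosine similarity if normalized)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def best_pair_in_block(vectors, rows, cols):
    """Best (similarity, i, j) with i in rows, j in cols, and i < j."""
    best = (-1.0, -1, -1)
    for i in rows:
        for j in cols:
            if i < j:
                sim = dot(vectors[i], vectors[j])
                if sim > best[0]:
                    best = (sim, i, j)
    return best


def most_similar_pair(docs, block_size=2):
    """Scan the similarity matrix block by block, keeping only the best pair."""
    vectors = [normalize(d) for d in docs]
    n = len(vectors)
    best = (-1.0, -1, -1)
    for r in range(0, n, block_size):
        for c in range(r, n, block_size):  # upper triangle of blocks only
            rows = range(r, min(r + block_size, n))
            cols = range(c, min(c + block_size, n))
            block_best = best_pair_in_block(vectors, rows, cols)
            if block_best[0] > best[0]:
                best = block_best
    return best


# Hypothetical mini-corpus for illustration.
docs = [
    "stocks fall as markets react to rate hike",
    "local team wins the championship final",
    "markets react as stocks fall after rate hike",
    "new recipe ideas for the summer",
]
similarity, i, j = most_similar_pair(docs, block_size=2)
print(f"most similar pair: documents {i} and {j} (cosine {similarity:.3f})")
```

Because each block is independent of the others, the calls to `best_pair_in_block` could be handed to separate worker processes or machines, with only the per-block winners sent back for the final comparison.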
5.7 Summary
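In this chapter, we presented examples of vectorization and syntactic similarity. Almost every machine learning project that involves text (such as classification, topic modeling, and sentiment detection) fundamentally requires text vectors.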
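Feature engineering has proven to be a very powerful tool for helping these sophisticated machine learning algorithms achieve excellent performance. You should therefore try a variety of vectorizers, experiment with their parameters, and inspect the resulting feature space. The vectorization methods and parameters are numerous, and each has its own use. Tuning them takes time, but the effort usually pays off handsomely, because the results of every later step in the analysis pipeline benefit substantially.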
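To get a feel for how parameters shape the feature space, the following sketch counts the features produced under different settings. It is a deliberately simplified, standard-library stand-in for the vocabulary-building step of a vectorizer, not a library API: `build_vocabulary` and its `ngram_range` and `min_df` parameters are hypothetical names chosen for this illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, ngram_range=(1, 1), min_df=1):
    """Sorted n-gram features that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()  # count each feature once per document
        for n in range(ngram_range[0], ngram_range[1] + 1):
            seen.update(ngrams(tokens, n))
        doc_freq.update(seen)
    return sorted(f for f, df in doc_freq.items() if df >= min_df)


# Hypothetical mini-corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

for ngram_range in [(1, 1), (1, 2)]:
    for min_df in [1, 2]:
        vocab = build_vocabulary(docs, ngram_range, min_df)
        print(f"ngram_range={ngram_range}, min_df={min_df}: {len(vocab)} features")
```

Even on three short sentences, adding bigrams more than doubles the number of features, while a minimum document frequency of 2 removes most of them again. On a real corpus, the same kind of comparison shows quickly which settings produce a feature space of manageable size.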
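The similarity measure used in this chapter is just one example of document similarity. If your requirements are more complex, you can learn about more sophisticated similarity algorithms in the following chapters. ...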
Publisher Resources

ISBN: 9787519864446