Web机器学习

by Posts & Telecom Press, Andrea Isoni
May 2024
Intermediate to advanced
234 pages
3h 58m
Chinese
Packt Publishing

Chapter 4 Web Mining Techniques

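Web data mining techniques are used to explore data on the Internet and extract relevant information from it. Searching Web content is a complex process that relies on several algorithms, and those algorithms are the focus of this chapter. Given a search query, a search engine analyzes the data on each web page and finds the pages relevant to the query. The data on a web page usually falls into two kinds: the page content and the hyperlinks to other pages. In general, a search engine consists of the following components: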

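  • a Web crawler, or spider, that collects web pages;
  • a parser that extracts content and preprocesses the pages;
  • an indexer that organizes the pages into data structures;
  • an information retrieval system that finds the most important documents according to how relevant they are to the query;
  • a ranking algorithm that orders the pages in some meaningful way.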
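
To make the indexer and retrieval components more concrete, here is a minimal sketch under stated assumptions. It builds an inverted index (term to pages) and scores pages by raw term counts. The function names, the sample `pages` dictionary, and the count-based scoring are illustrative assumptions, not the method the book goes on to describe.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: term -> {url: count of the term in that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index


def search(index, query):
    """Score each page by summing the counts of the query terms it contains."""
    scores = Counter()
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken by URL so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical parsed pages (URL -> extracted text), for illustration only.
pages = {
    "a.html": "web mining extracts information from web pages",
    "b.html": "a crawler collects pages",
    "c.html": "ranking orders pages for a query",
}
index = build_index(pages)
results = search(index, "web pages")
```

Here `results` lists the matching URLs from most to least relevant. A real retrieval system would typically weight terms (for example with TF-IDF) rather than use raw counts.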

The core technologies behind these components are Web structure mining and Web content mining.

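The crawler, the indexer, and the ranking mechanism deal with the structure of the Web (the network formed by hyperlinks). The remaining parts of the search engine (the parser and the retrieval system) are Web content analysis methods, because they parse web pages and retrieve the textual information in them.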

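Going a step further, the content of the collected pages can be analyzed in more depth with natural language processing techniques such as Latent Dirichlet Allocation (LDA), opinion mining, or sentiment analysis tools. These techniques are especially useful for extracting the subjective views of the people who published the Web content, which is why they appear in many commercial applications in marketing and consulting. The end of this chapter discusses these sentiment analysis techniques. We begin with Web structure mining.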

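Web structure mining has two main tasks: discovering the relationships between web pages, and using the link structure to find relevant pages. For the first task, a crawler typically follows the links and stores the links and pages it collects in an indexer. The second task requires computing the importance of each page and ranking the pages accordingly.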
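
As an illustration of the second task, the sketch below computes page importance from the link structure alone, using PageRank by power iteration. This excerpt does not say which importance measure the chapter uses, so PageRank is an assumption chosen because it is a widely known link-based measure. The `links` graph, the damping factor of 0.85, and the fixed iteration count are likewise illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share (the "random jump" term).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph, for illustration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
ordered = sorted(rank, key=rank.get, reverse=True)
```

In this small graph, page `c` receives links from three pages and ends up with the highest score, while `d`, which no page links to, ends up with the lowest.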

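A crawler starts from a set of URLs (the seed pages), extracts the links from those pages, and then crawls the extracted links. It then extracts new links from the newly crawled pages, and the process repeats until a predefined criterion is met. URLs that have not yet been crawled are stored in ...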
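
The crawl loop described above can be sketched as follows. This is a minimal breadth-first version under several assumptions: the stopping criterion is a page limit (`max_pages`), uncrawled URLs wait in a simple FIFO queue named `to_visit`, and pages come from a hypothetical in-memory dictionary (`FAKE_WEB`) instead of the network, so the example runs offline. None of these names come from the book.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Crawl breadth-first from the seed URLs until max_pages pages are stored.

    fetch is any callable that returns a page's HTML, or None on failure.
    """
    to_visit = deque(seeds)  # URLs discovered but not yet crawled
    seen = set(seeds)        # every URL ever queued, to avoid repeats
    pages = {}               # url -> HTML of the pages crawled so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:     # the page could not be fetched; skip it
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return pages


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": '<a href="/">home</a>',
    "http://example.com/c": "no links here",
}
crawled = crawl(["http://example.com/"], FAKE_WEB.get)
```

Passing `fetch` as a parameter keeps the loop independent of how pages are downloaded. A production crawler would also need politeness delays, `robots.txt` handling, and error handling, all of which this sketch leaves out.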

Publisher Resources

ISBN: 9781836203612