精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
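Microsoft Academic Graph dataset
It contains 166,192,182 papers, is available through the Open Academic Graph, and may be used for research purposes only. The full dataset is 104 GB. Each observation has 18 variables that identify the paper, including its title, abstract, author names, keywords, and field of study.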
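This dataset was designed to be easy to store in and read from a database. For a machine learning model it is not tidy enough and needs some basic data wrangling. Some instructors like to skip this step and let students go straight to modeling and results in order to hold their interest. We do not do that here; everything starts from scratch.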
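The first step is to wrangle some of the variables into the right form and build an item-based collaborative filter, to see whether we can quickly and effectively find papers that are very similar to one another.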
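The kind of wrangling described here can be sketched as follows. This is a minimal illustration rather than the book's own code: it assumes records arrive as one JSON object per line, and the field names (`id`, `title`, `year`, `fos`) and both sample records are hypothetical stand-ins for the dataset's real variables.

```python
import json

# Hypothetical sample: two records in JSON-lines form. The field names
# ("id", "title", "year", "fos") are assumptions for illustration only.
RAW_LINES = [
    '{"id": "a1", "title": "Graph kernels", "year": 2005, '
    '"fos": ["Machine learning", "Graph theory"]}',
    '{"id": "b2", "title": "Untitled note", "year": null}',
]


def tidy_record(line):
    """Parse one JSON line and normalize missing values."""
    rec = json.loads(line)
    return {
        "id": rec["id"],
        "title": rec.get("title", ""),
        # Keep a missing year as None rather than guessing a value.
        "year": rec.get("year"),
        # A missing field-of-study list becomes an empty list, so later
        # code can iterate over it without special cases.
        "fos": rec.get("fos") or [],
    }


papers = [tidy_record(line) for line in RAW_LINES]
for paper in papers:
    print(paper["id"], paper["year"], paper["fos"])
```

Normalizing missing values once, at load time, keeps the later feature-building steps free of repeated null checks.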
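Origins of item-based collaborative filtering
This method was originally developed by Amazon as an improvement on user-based product recommendation algorithms. Sarwar et al. describe in detail the difficulties and benefits of switching a recommendation algorithm from a user-based to an item-based approach (Sarwar et al., 2001).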
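Item-based collaborative filtering makes recommendations according to how similar items are to one another. The work has two stages: first compute a similarity score between items, then sort all the scores and take the top N most similar items as the recommendations.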
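The two stages can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the text does not name a similarity measure at this point, so cosine similarity is an assumption, and the item names, feature keys, and the helper `top_n_similar` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_u * norm_v)


def top_n_similar(item_id, features, n=2):
    """Stage 1: score every other item. Stage 2: sort and keep the top n."""
    target = features[item_id]
    scores = [
        (other_id, cosine(target, vec))
        for other_id, vec in features.items()
        if other_id != item_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical item features: each item is a bag of feature weights.
features = {
    "paper_a": {"ml": 1.0, "graphs": 1.0},
    "paper_b": {"ml": 1.0, "graphs": 1.0, "nlp": 1.0},
    "paper_c": {"chemistry": 1.0},
    "paper_d": {"ml": 1.0},
}

print(top_n_similar("paper_a", features, n=2))
```

Scoring one item against all others is linear in the number of items; at the scale of the full dataset, precomputing or approximating these scores would likely be necessary.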
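Building an item-based recommender
An item-based recommender performs the following three tasks.
(1) Generate information about the "things," or items.
(2) Score all items to find other items that are "similar" to a given item.
(3) Return the ranked scores together with their items.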
9.2 First pass: data import, cleaning, and feature parsing
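Like all good scientific experiments, we start with a hypothesis. In this example, we assume that papers published at around the same time and in the same field of study are the most useful to the user. We ...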
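One way to turn this hypothesis into features is sketched below. It is an assumption-laden illustration, not the book's code: the field names (`year`, `fos`), the sample papers, the decade-wide year bins, and the helper names are all hypothetical choices.

```python
# Hypothetical papers; field names and values are illustrative only.
papers = [
    {"id": "p1", "year": 2001, "fos": ["Machine learning"]},
    {"id": "p2", "year": 2003, "fos": ["Machine learning", "Statistics"]},
    {"id": "p3", "year": 1975, "fos": ["Chemistry"]},
]


def to_features(paper, year_bin=10):
    """Turn one paper into a sparse indicator vector (a dict).

    Years are grouped into bins (decades by default) so that papers
    published around the same time share a feature; each field of
    study becomes its own indicator.
    """
    features = {}
    if paper.get("year") is not None:
        bin_start = (paper["year"] // year_bin) * year_bin
        features[f"year_bin={bin_start}"] = 1
    for field in paper.get("fos", []):
        features[f"fos={field}"] = 1
    return features


vectors = {p["id"]: to_features(p) for p in papers}


def shared_features(a, b):
    """Count indicators two papers have in common (a crude overlap score)."""
    return len(vectors[a].keys() & vectors[b].keys())


print(vectors["p1"])
print(shared_features("p1", "p2"), shared_features("p1", "p3"))
```

Fixed bins are a simplification: papers from 1999 and 2000 fall into different decades despite being a year apart, so a real pipeline might compare years directly instead.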
ISBN: 9787115509680