精通特征工程 (Feature Engineering for Machine Learning)
by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Fancy Tricks with Simple Numbers
2.2.2 Quantization or Binning
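For this exercise, we use data from round 6 of the Yelp dataset challenge (http://www.yelp.com/dataset_challenge) and use it to create a much smaller classification dataset. The Yelp dataset contains user reviews of businesses from 10 cities across North America and Europe. Each business is labeled with zero or more categories.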
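Statistics on the Yelp reviews dataset (round 6)
There are 782 business categories.
The full dataset contains 1,569,264 (≈1.6 million) reviews and 61,184 businesses.
"Restaurants" (990,627 reviews) and "Nightlife" (210,028 reviews) are the most common categories by review count.
No business is categorized as both a restaurant and a nightlife venue, so the two groups of reviews do not overlap.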
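To put these figures in proportion, the short sketch below works out what share of all reviews each of the two categories holds, plus a rough average number of reviews per business. It uses only the numbers quoted above. The variable names are illustrative, and the average assumes every review belongs to one of the 61,184 businesses.

```python
# Figures quoted in the dataset statistics above.
total_reviews = 1_569_264
total_businesses = 61_184
restaurant_reviews = 990_627
nightlife_reviews = 210_028

# Share of all reviews held by each category.
restaurant_share = restaurant_reviews / total_reviews
nightlife_share = nightlife_reviews / total_reviews

# The two groups do not overlap, so their review counts can simply be added.
combined_share = (restaurant_reviews + nightlife_reviews) / total_reviews

# Rough average, assuming every review belongs to one of the counted businesses.
mean_reviews_per_business = total_reviews / total_businesses

print(f"Restaurants: {restaurant_share:.1%}")   # Restaurants: 63.1%
print(f"Nightlife:   {nightlife_share:.1%}")    # Nightlife:   13.4%
print(f"Combined:    {combined_share:.1%}")     # Combined:    76.5%
print(f"Mean reviews per business: {mean_reviews_per_business:.1f}")  # 25.6
```

Together the two categories account for roughly three quarters of all reviews.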
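Each business has a review count. Suppose our task is to use collaborative filtering to predict the rating a user might give to a business. The review count is likely to be a very useful input feature, because popularity and high ratings are usually strongly correlated. The question now is whether we should use the raw review count or process it further. Figure 2-4, generated by Example 2-2, shows a histogram of the review counts of all businesses. It follows the same pattern as the listen counts in the previous example: most businesses have very few reviews, but some have several thousand.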
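The skew described above, and one way of processing such counts further, can be sketched on synthetic data. The block below does not use the Yelp files. It draws a hypothetical heavy-tailed sample (`review_counts`, from a Pareto distribution chosen only for illustration) and groups each count into a power-of-10 bin with a helper named `log10_bin`. Both names are assumptions introduced here, and this is a minimal sketch of one quantization scheme, not the book's own example.

```python
import math
import random
from collections import Counter

random.seed(0)  # make the synthetic sample reproducible

# Hypothetical stand-in for per-business review counts: a heavy-tailed
# Pareto sample truncated to integers >= 1. This is NOT the real Yelp data.
review_counts = [int(random.paretovariate(1.0)) for _ in range(10_000)]


def log10_bin(count):
    """Map a positive count to its power-of-10 bin: 1-9 -> 0, 10-99 -> 1, ..."""
    return math.floor(math.log10(count))


# Count how many synthetic "businesses" fall into each bin.
bins = Counter(log10_bin(c) for c in review_counts)

for b in sorted(bins):
    low, high = 10 ** b, 10 ** (b + 1) - 1
    print(f"{low} to {high} reviews: {bins[b]} businesses")
```

On data shaped like this, most items land in the lowest bin while a few reach the highest ones. Bins whose width grows with the count keep both ends of the range distinguishable, which a small number of equal-width bins would not.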
Example 2-2. Visualizing business review counts in the Yelp dataset
>>> import pandas as pd
>>> import json
# Load the business data
>>> biz_file = open('yelp_academic_dataset_business.json') ...
ISBN: 9787115509680