Mastering Feature Engineering (精通特征工程)

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Chapter 5
Categorical Variables: Counting Data in the Age of Automation
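As the name suggests, a categorical variable represents categories or labels. For example, a categorical variable can represent the major cities of the world, the four seasons of the year, or the industry a company belongs to (oil, travel, technology). In a real-world dataset, the number of categories is always finite. Categories can be represented by numbers, but unlike numeric variables, the values of a categorical variable cannot be ordered. (As industry types, oil and travel cannot be ranked against each other.) Categorical variables are also called nonordinal variables.
A simple question serves as a litmus test for whether something should be a categorical variable: "Do we need to know how different two values are, or only whether they are different?" A stock price of $500 is five times a stock price of $100, so stock price should be represented by a continuous numeric variable. The industry of a company (oil, travel, technology, and so on), by contrast, should be represented by a categorical variable.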
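The litmus test can be sketched in a few lines of Python. This is an illustration only: the `Industry` enum and the helper names `price_ratio` and `same_industry` are hypothetical, not taken from the book. A plain `Enum` is used because its members support equality checks but refuse ordering comparisons, which mirrors how a categorical variable behaves.

```python
from enum import Enum


class Industry(Enum):
    # The integer codes are arbitrary labels; they carry no magnitude.
    OIL = 1
    TRAVEL = 2
    TECH = 3


def price_ratio(a: float, b: float) -> float:
    """For a numeric variable, 'how different' is a meaningful question."""
    return a / b


def same_industry(a: Industry, b: Industry) -> bool:
    """For a categorical variable, only 'whether different' is meaningful."""
    return a is b


ratio = price_ratio(500.0, 100.0)

# A plain Enum defines no ordering, so '<' raises TypeError.
try:
    Industry.OIL < Industry.TRAVEL
    ordering_supported = True
except TypeError:
    ordering_supported = False
```

Here `ratio` captures the "five times" relationship between the two stock prices, while `ordering_supported` ends up `False` because asking whether oil is "less than" travel has no answer.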
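Large categorical variables are extremely common in transactional records. For example, many web services track users with an ID. The ID is a categorical variable whose number of distinct values ranges from a few hundred to hundreds of millions, depending on how many users the service has. Another example of a large categorical variable is the IP address of an internet transaction. Although user IDs and IP addresses are represented numerically, they are categorical variables because their magnitude is usually irrelevant to the task at hand. For instance, when detecting fraud in individual transactions, the IP address is a relevant variable: some IP addresses or subnets generate more fraudulent transactions than others. But the subnet 164.203.x.x is not inherently more fraudulent than the subnet 164.202.x.x; the numeric value of the subnet address does not matter here.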
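As a rough sketch of treating a subnet as a category rather than a number, the snippet below groups transactions by their /16 subnet and counts fraud per group. The transaction records are made-up toy data, and the helper name `subnet16` is hypothetical; only the standard-library `ipaddress` and `collections` modules are used.

```python
import ipaddress
from collections import Counter


def subnet16(ip: str) -> str:
    """Return the /16 subnet of an IPv4 address as a string label."""
    network = ipaddress.ip_network(f"{ip}/16", strict=False)
    return str(network)


# Toy (ip, is_fraud) records invented purely for illustration.
transactions = [
    ("164.203.10.1", False),
    ("164.203.77.9", False),
    ("164.202.3.4", True),
    ("164.202.8.8", True),
    ("164.202.1.2", False),
]

totals = Counter()
frauds = Counter()
for ip, is_fraud in transactions:
    key = subnet16(ip)  # the subnet is used only as a category key
    totals[key] += 1
    frauds[key] += int(is_fraud)

# Fraud rate per category, learned from counts rather than from the
# numeric size of the address.
fraud_rate = {key: frauds[key] / totals[key] for key in totals}
```

In this toy data the numerically smaller subnet happens to have the higher fraud rate, which is the point: any signal comes from what was observed for each category, not from the address value.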
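The vocabulary of a document corpus can be represented as one large categorical variable whose categories are the unique words. Representing so many distinct categories is computationally expensive. If a category (such as a word) in a data point ...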
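To make the idea of a vocabulary as a categorical variable concrete, here is a minimal sketch that assigns each unique word an integer label. The function name `build_vocabulary`, the whitespace tokenization, and the two sample documents are all assumptions for illustration; real pipelines typically use a more careful tokenizer.

```python
def build_vocabulary(documents):
    """Map each unique lowercase word to an integer category label."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                # Labels are assigned in order of first appearance;
                # their numeric values carry no meaning.
                vocab[word] = len(vocab)
    return vocab


# Toy corpus invented for illustration.
docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(docs)
num_categories = len(vocab)
```

The number of categories equals the number of distinct words, so it grows with the corpus, which is why a vocabulary quickly becomes a large categorical variable.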
ISBN: 9787115509680