Python机器学习基础教程

by Andreas C. Müller, Sarah Guido
January 2018
Intermediate to advanced
301 pages
8h 54m
Chinese
Posts & Telecom Press
5.2.2 The Danger of Overfitting the Parameters and the Validation Set
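Given this result, we might be tempted to report that we found a model that reaches 97% accuracy on our dataset. However, this claim could be overly optimistic (or simply wrong), for the following reason: we tried many different parameters and selected the one with the highest accuracy on the test set, but that accuracy will not necessarily carry over to new data. Because we used the test data to tune the parameters, we can no longer use it to assess how good the model is. This is the same reason we needed to split the data into a training set and a test set in the first place. We need an independent dataset for evaluation, one that was not used while creating the model.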
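The optimism described above can be illustrated with a small sketch that is not from the book. It assumes a hypothetical setup in which the labels are pure coin flips, so no model can truly beat 50% accuracy, and the 500 "parameter settings" are just random linear classifiers. Picking the candidate with the best test-set accuracy still yields a score well above chance, while the same candidate falls back to roughly 50% on fresh data. All names (`candidates`, `accuracy`, `X_new`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

# Labels are coin flips, independent of the features (assumption for
# illustration): the true accuracy of any classifier here is 50%.
X_test = rng.normal(size=(100, n_features))
y_test = rng.integers(0, 2, size=100)
X_new = rng.normal(size=(10_000, n_features))
y_new = rng.integers(0, 2, size=10_000)

# 500 hypothetical "parameter settings": random linear classifiers.
candidates = rng.normal(size=(500, n_features))


def accuracy(w, X, y):
    """Fraction of samples where the prediction (X @ w > 0) equals the label."""
    return float(np.mean((X @ w > 0).astype(int) == y))


# Select the candidate that scores best on the test set ...
test_scores = np.array([accuracy(w, X_test, y_test) for w in candidates])
best = candidates[np.argmax(test_scores)]

# ... then check how that same candidate does on data it was not selected on.
best_test_acc = float(test_scores.max())
new_data_acc = accuracy(best, X_new, y_new)
print(f"best accuracy on the test set: {best_test_acc:.2f}")
print(f"same model on new data:        {new_data_acc:.2f}")
```

The gap between the two printed numbers comes entirely from selecting on the test set, which is why a dataset used for tuning cannot also serve as the final evaluation.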
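One way to resolve this problem is to split the data again, so that we have three sets: the training set to build the model, the validation (or development) set to select the parameters of the model, and the test set to evaluate the performance of the selected parameters. Figure 5-5 illustrates these three sets: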
In[19]:
mglearn.plots.plot_threefold_split()
Figure 5-5: A threefold split of the data into training set, validation set, and test set
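After selecting the best parameters using the validation set, we can rebuild a model using the parameter settings we found, but this time training on both the training data and the validation data. This way, we use as much data as possible to build the model. The implementation looks like this: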
In[20]:
from sklearn.svm import SVC
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# ...
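The workflow described in this section (tune on the validation set, retrain on training plus validation data, evaluate once on the test set) can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not necessarily the authors' exact code: the second split's `random_state=1`, the grid of `gamma` and `C` values, and the variable names `X_valid`, `best_parameters`, and `final_svm` are all chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

# First split: hold out the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# Second split (random_state=1 is an assumption): carve a validation set
# out of the remaining train+validation data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)
print("train/validation/test sizes:",
      X_train.shape[0], X_valid.shape[0], X_test.shape[0])

# Illustrative parameter grid; scores come from the validation set only.
best_score = 0.0
best_parameters = {}
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        score = svm.score(X_valid, y_valid)
        if score > best_score:
            best_score = score
            best_parameters = {"C": C, "gamma": gamma}

# Rebuild the model on training + validation data with the chosen
# parameters, then touch the test set exactly once.
final_svm = SVC(**best_parameters)
final_svm.fit(X_trainval, y_trainval)
test_score = final_svm.score(X_test, y_test)
print("best validation score:", round(best_score, 2))
print("best parameters:", best_parameters)
print("test score with best parameters:", round(test_score, 2))
```

Because the test set plays no part in choosing `gamma` and `C`, the final test score is an honest estimate of how the selected model behaves on new data; the validation score, having been used for selection, is not.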

Publisher Resources

ISBN: 9787115475619