R在数据科学中的应用,第2版 (R for Data Science, 2nd Edition)

by Hadley Wickham, Mine Cetinkaya-Rundel, Garrett Grolemund
May 2025
Intermediate to advanced
578 pages
8h 9m
Chinese
O'Reilly Media, Inc.

Chapter 18. Missing Values

This work was translated using AI. Feedback and comments are welcome: translation-feedback@oreilly.com

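Introduction

You already learned the basics of missing values earlier in the book. You first saw them in Chapter 1, where they triggered a warning when making a plot; in "summarize()", where they interfered with computing summary statistics; and in "Missing Values", where you learned about their infectious nature and how to check for their presence. Now we'll come back to them in more depth so you can learn more of the details.

We'll start by discussing some general tools for working with missing values recorded as NAs. We'll then explore the idea of implicitly missing values, that is, values that are simply absent from your data, and show some tools you can use to make them explicit. We'll finish with a related discussion of empty groups, caused by factor levels that don't appear in the data.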

Prerequisites

The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.

library(tidyverse)

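Explicit Missing Values

To begin, let's explore a few handy tools for creating or eliminating explicit missing values, that is, cells where you see an NA.

Last Observation Carried Forward

A common use for missing values is as a data entry convenience. When data is entered by hand, a missing value sometimes indicates that the value in the previous row has been repeated (or carried forward):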

treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

You can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns:

treatment |>
  fill(everything())
#> # A tibble: 4 × 3
#>   person           treatment response
#>   <chr>                <dbl>    <dbl>
#> 1 Derrick Whitmore         1        7
#> 2 Derrick Whitmore         2       10
#> 3 Derrick Whitmore         3       10
#> 4 Katherine Burke          1        4
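
Note that fill(everything()) also carried the response value 10 into row 3, which may not be what you want if that response is truly unknown. Because fill() selects columns the same way select() does, you can restrict it to the columns that really were abbreviated during data entry. The following is a minimal sketch that reuses the treatment data from above; the name person_only is chosen here for illustration only.

```r
library(tidyr)
library(tibble)

# Same data as in the text above
treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

# Fill only `person`; the missing `response` in row 3 stays NA
person_only <- treatment |>
  fill(person)

person_only$person
person_only$response
```

With this call the names are carried down, while the missing response is left for you to handle explicitly.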

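This treatment is sometimes called "last observation carried forward," or locf for short. You can use the .direction argument to fill in missing values that have been generated in more exotic ways.

Fixed Values

Sometimes missing values represent some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them: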
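
As a rough illustration of .direction, suppose the label had been written on the last row of each block rather than the first. The data below is hypothetical (it is not from the book); the sketch relies only on the documented .direction values of tidyr::fill(), which are "down" (the default), "up", "downup", and "updown".

```r
library(tidyr)
library(tibble)

# Hypothetical data: the name appears on the LAST row of each person's block
scores <- tribble(
  ~person,            ~response,
  NA,                 7,
  NA,                 10,
  "Derrick Whitmore", 9,
  NA,                 4,
  "Katherine Burke",  5
)

# Carry each name upward to the rows above it
filled_up <- scores |>
  fill(person, .direction = "up")

filled_up$person
```

Here "up" is the right choice only because of how this made-up table was recorded; for real data, check the recording convention before picking a direction.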

x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#> [1] 1 4 5 7 0
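
coalesce() is not limited to a single replacement value: it takes any number of vectors and, position by position, returns the first value that is not missing. A small sketch with two made-up vectors (primary and backup are illustrative names) and a final fallback of 0:

```r
library(dplyr)

# Hypothetical measurements from two sources
primary <- c(1, NA, NA)
backup  <- c(NA, 2, NA)

# Take `primary` where available, then `backup`, then fall back to 0
combined <- coalesce(primary, backup, 0)
combined
#> [1] 1 2 0
```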

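Sometimes you'll hit the opposite problem, where some concrete value actually represents a missing value. This typically arises in data generated by older software that has no proper way to represent missing values, so it must instead use some special value such as 99 or -999.

If possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(), as in read_csv(path, na = "99"). If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use dplyr::na_if():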

x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
#> [1]  1  4  5  7 NA
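
The two approaches above can be sketched together as follows. The CSV text and column names are made up for illustration, and the example assumes readr 2.0 or later (for I() literal data and show_col_types). Two points worth noting: the na argument replaces readr's default missing-value strings, so the sketch lists "" and "NA" alongside "99", and it applies to every column, so a legitimate 99 elsewhere would also become NA. For several sentinel values after the fact, one simple option is to chain na_if() calls.

```r
library(readr)
library(dplyr)

# Hypothetical CSV contents, passed as literal data via I()
csv_text <- I("id,score\n1,7\n2,99\n3,4\n")

# Treat "99" (plus the usual "" and "NA") as missing while reading
scores <- read_csv(csv_text, na = c("", "NA", "99"), show_col_types = FALSE)
scores$score
#> [1]  7 NA  4

# After reading: convert more than one sentinel value by chaining na_if()
readings <- c(1, 4, 99, 7, -999)
cleaned <- readings |>
  na_if(99) |>
  na_if(-999)
cleaned
#> [1]  1  4 NA  7 NA
```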

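Before we continue, there's one special type of missing value that you'll encounter from time to time: a NaN (pronounced "nan"), or not a number. It's not that important to know about, because it generally behaves just like NA: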

x <- c(NA, NaN)
x * 10
#> [1]  NA NaN
x == 1
#> [1] ...
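
If you ever do need to tell the two apart, base R provides is.nan(). The short sketch below uses only base R functions; the vector is made up for illustration. is.na() treats both NA and NaN as missing, whereas is.nan() is TRUE only for NaN.

```r
# A numeric vector containing NA, NaN, and an ordinary value
x <- c(NA, NaN, 1)

is.na(x)
#> [1]  TRUE  TRUE FALSE
is.nan(x)
#> [1] FALSE  TRUE FALSE
```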

Publisher Resources

ISBN: 9798341657304