Advanced Analytics with Spark, 2nd Edition (Spark高级数据分析(第2版))

by Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
June 2018
Beginner to intermediate
246 pages
6h 57m
Chinese
Posts & Telecom Press
Analyzing Co-occurrence Networks with GraphX
$$
\chi^2 = \frac{T \left( \lvert YY \cdot NN - YN \cdot NY \rvert - T/2 \right)^2}{YA \cdot NA \cdot YB \cdot NB}
$$
Note that this formulation of the chi-square statistic includes the term "− T/2". This is Yates's continuity correction (https://en.wikipedia.org/wiki/Yates's_correction_for_continuity), which some formulations omit.
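The formula can be sketched as a small pure-Scala function. This is a minimal sketch, not necessarily the book's own code: the object name `ChiSquare` and the function `chiSq` are hypothetical, and it assumes the usual 2×2 contingency-table reading of the symbols, where YY is the number of documents containing both concepts, YA and YB are the numbers containing concept A and concept B, and T is the total number of documents. The remaining counts are derived from those four.

```scala
object ChiSquare {
  // Assumed meaning of the inputs (not stated in this excerpt):
  //   YY = documents containing both A and B
  //   YB = documents containing B, YA = documents containing A
  //   T  = total number of documents
  def chiSq(YY: Long, YB: Long, YA: Long, T: Long): Double = {
    val NB = T - YB            // documents without B
    val NA = T - YA            // documents without A
    val YN = YA - YY           // A present, B absent
    val NY = YB - YY           // A absent, B present
    val NN = T - NY - YN - YY  // neither concept present
    // Work in Double so that products of large counts cannot overflow a Long.
    val inner = math.abs(YY.toDouble * NN - YN.toDouble * NY) - T / 2.0
    T * inner * inner / (YA.toDouble * NA * YB * NB)
  }
}
```

For example, with T = 100, YA = 30, YB = 40 and YY = 20, `ChiSquare.chiSq(20, 40, 30, 100)` evaluates to roughly 11.16.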
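If the samples are in fact independent, we expect this statistic to follow a chi-squared distribution with the appropriate degrees of freedom. If r and c are the cardinalities of the two random variables being compared, the number of degrees of freedom is (r − 1)(c − 1). Here each variable is binary (a concept is either present in a document or absent), so r = c = 2 and (r − 1)(c − 1) = 1. A large chi-square statistic indicates that the variables are unlikely to be independent, which means the co-occurrence of the two concepts is meaningful. More specifically, the CDF (cumulative distribution function) of the chi-squared distribution with one degree of freedom, evaluated at the statistic, gives the confidence level with which we can reject the null hypothesis that the variables are independent; the corresponding p-value is one minus that CDF value.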
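To make the link between the statistic and a confidence level concrete, the sketch below evaluates the CDF of the chi-squared distribution with one degree of freedom, which equals erf(sqrt(x / 2)). The Scala standard library has no error function, so this sketch substitutes the Abramowitz and Stegun approximation 7.1.26 (absolute error around 1.5e-7). The object and method names are hypothetical, and a statistics library would normally be used instead.

```scala
object ChiSquareCdf {
  // Error function via Abramowitz and Stegun 7.1.26 (an approximation, not exact).
  def erf(x: Double): Double = {
    val sign = if (x < 0) -1.0 else 1.0
    val ax = math.abs(x)
    val t = 1.0 / (1.0 + 0.3275911 * ax)
    val poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
      - 0.284496736) * t + 0.254829592) * t
    sign * (1.0 - poly * math.exp(-ax * ax))
  }

  // CDF of the chi-squared distribution with one degree of freedom.
  def cdf1(x: Double): Double =
    if (x <= 0) 0.0 else erf(math.sqrt(x / 2.0))

  // p-value for an observed statistic: the upper-tail probability.
  def pValue(x: Double): Double = 1.0 - cdf1(x)
}
```

Under this sketch, a statistic near 3.84 corresponds to a CDF value of about 0.95, so only statistics above that threshold would let us reject independence at the 95% confidence level.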
In this section, we use GraphX to compute the chi-square statistic for every pair of concepts in the co-occurrence graph.
7.7.1 Processing EdgeTriplets
The easiest part of the chi-square statistic to compute is T, the total number of documents under consideration. We get it simply by counting the entries in the medline RDD:
val T = medline.count()
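Counting how many documents each concept appears in is also relatively easy. We covered this earlier in the chapter when building the topicDist DataFrame, but now we need to represent it as an RDD of topic hash values and their counts: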
val topicDistRdd = topicDist.map{
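The snippet above is cut off in this excerpt and is not reconstructed here. As a plain-Scala analogue of the two counting steps (the total T and the per-topic document counts keyed by a hash), the following sketch works on an in-memory collection rather than on Spark. Everything in it is an assumption made for illustration: the `TopicCounts` object, the modeling of a document as a sequence of topic strings, and the `hashId` helper, which folds the first eight bytes of an MD5 digest into a Long.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object TopicCounts {
  // Hypothetical helper: fold the first 8 bytes of an MD5 digest into a Long ID.
  def hashId(s: String): Long = {
    val bytes = MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
    bytes.take(8).zipWithIndex
      .map { case (b, i) => (b & 0xFFL) << (8 * i) }
      .reduce(_ | _)
  }

  // T: the total number of documents under consideration.
  def documentCount(docs: Seq[Seq[String]]): Long = docs.size.toLong

  // For each topic hash, the number of documents that mention the topic.
  // A topic repeated within one document is counted once for that document.
  def topicDocCounts(docs: Seq[Seq[String]]): Map[Long, Long] =
    docs.flatMap(_.distinct)
      .groupBy(hashId)
      .map { case (id, topics) => id -> topics.size.toLong }
}
```

Keying the counts by a Long hash rather than by the topic string mirrors the vertex IDs that GraphX requires, which is why the text asks for hash values here.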
ISBN: 9787115482525