Mastering Spark for Data Science (精通Spark数据科学)

by Posts & Telecom Press, Andrew Morgan, Antoine Amend, David George, Matthew Hallett
May 2024
Intermediate to advanced
457 pages
6h 33m
Chinese
Packt Publishing

Chapter 9 News Dictionary and Real-Time Tagging System

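Whereas a hierarchical data warehouse stores data in files within folders, a typical Hadoop-based system relies on a flat architecture to store data. Without proper data governance or a clear understanding of everything the data contains, a data lake will inevitably turn into a swamp, in which an interesting dataset such as GDELT is nothing more than a folder holding a large number of unstructured text files. For this reason, data classification is probably one of the most widely used machine learning techniques in large organizations: it lets users categorize and label their data correctly, publish those categories as part of their metadata solution, and so reach specific information in the most efficient way. Without a proper tagging mechanism applied up front, ideally at ingest time, finding all news articles about a particular topic would require parsing the entire dataset for specific keywords. In this chapter we describe an innovative way of labeling incoming GDELT data in an unsupervised manner and in near real time, using Spark Streaming and the 1% Twitter firehose.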
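To make the cost argument concrete, here is a minimal sketch in plain Python (not Spark, and not the authors' implementation) that contrasts a keyword scan over every article with a lookup in a tag index built once at ingest time. The corpus, the article ids, the tags, and the function names `full_scan`, `build_tag_index`, and `lookup` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus; ids, texts, and tags are made up for illustration.
ARTICLES = {
    "a1": {"text": "Central bank raises interest rates again", "tags": ["economy"]},
    "a2": {"text": "Striker scores twice in the cup final", "tags": ["sport"]},
    "a3": {"text": "Markets slide as interest rates climb", "tags": ["economy", "markets"]},
}


def full_scan(articles, keyword):
    """Without tags: read every article and search its text for the keyword."""
    keyword = keyword.lower()
    return sorted(aid for aid, a in articles.items() if keyword in a["text"].lower())


def build_tag_index(articles):
    """With tags assigned at ingest time: build a tag -> article ids index once."""
    index = defaultdict(set)
    for aid, a in articles.items():
        for tag in a["tags"]:
            index[tag].add(aid)
    return index


def lookup(index, tag):
    """Answer a topic query from the index without touching any article text."""
    return sorted(index.get(tag, set()))


TAG_INDEX = build_tag_index(ARTICLES)
```

In this sketch, `full_scan(ARTICLES, "interest rates")` has to read all three articles, whereas `lookup(TAG_INDEX, "economy")` reads only the index. The gap widens as the dataset grows, which is the motivation for tagging records as they arrive.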

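In this chapter, we will cover the following topics.

  • Bootstrapping a Naive Bayes classifier with Stack Exchange data.
  • Lambda versus Kappa architectures for real-time streaming applications.
  • Kafka and Twitter4J in a Spark Streaming application.
  • Thread safety when deploying models.
  • Using Elasticsearch as a caching layer.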

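Data classification is a supervised learning technique, which means you can only predict the labels and categories that were learned from a training dataset. Because that training dataset must be properly labeled, this becomes the main challenge we address in this chapter.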
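The constraint that a supervised classifier can only return labels it has seen can be illustrated with a small multinomial Naive Bayes classifier written in plain Python. This is a sketch under stated assumptions, not the chapter's Spark-based code: the training sentences, the labels `economy` and `sport`, and the helper names `train` and `predict` are invented for this example, and tokenization is a simple whitespace split.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (text, label) pairs. Returns the model as a plain dict."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # per-label word frequencies
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the training label with the highest log-posterior score."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in text.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero the score
            score += math.log((counts[w] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled training set; the labels are illustrative only.
TRAINING = [
    ("central bank raises interest rates", "economy"),
    ("markets fall on inflation fears", "economy"),
    ("striker scores in cup final", "sport"),
    ("coach praises team after match", "sport"),
]
MODEL = train(TRAINING)
```

Whatever text is passed to `predict`, the result is always one of the labels present in `TRAINING`. An article about an election, for instance, would still be assigned `economy` or `sport`, which is why the quality and coverage of the labeled training set matter so much.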

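In the context of news articles, none of the data has been properly labeled, so strictly speaking there is nothing we can learn from it. The common-sense approach for a data scientist is to start labeling some input records by hand and use them as the training dataset. However, the number of classes can be relatively large (at least in our case, potentially hundreds of labels), so the amount of data to label (thousands of articles) can be significant and would take considerable effort. A first solution is to outsource this laborious task to a "Mechanical Turk" (Mechanical ...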

Publisher Resources

ISBN: 9781836203858