An overview of the challenges MLflow tackles and a primer on how to get started.
Designing application architectures for real-time decisions.
There are advantages to having a search engine built into a database.
A look at Apache Kylin’s architecture and features in version 2.0.
Tools from maps to drones respond to crises with increasing speed and accuracy.
Transform your basemaps using CARTO and PostGIS.
Performing business analytics on the data lake using next-gen open source tools.
Python and R are widely accepted as logical languages for data science—but what about Go?
Apache Arrow makes it possible to use multiple languages and heterogeneous data infrastructure.
An analytics database can offer performance and scalability advantages.
Leading data-driven organizations point out five common pitfalls.
Assessing cost, performance, and run time of a typical Spark workload.
You’ve got three options: scaling up, scaling out, or using R as an abstraction layer.
Near-real-time processing yields increased efficiency and an opportunity for unified architecture.
The FusionNet deep learning architecture tackles three-dimensional objects with underlying architectures that “think” in three dimensions.
Addressing the challenge of delivering big data analytics to the masses.
In this excerpt, Karthik Ramasamy and Sijie Guo of Twitter discuss the operational experience of DistributedLog and Heron, two powerful real-time analytics tools that were open sourced by the company in early 2016.
How Spark will fit into—and change—the current ecosystem of distributed computing tools.
Using Python and other tools for natural language processing, sentiment analysis, and data wrangling.
Learn the basics of machine learning and deep learning using TensorFlow.
Using Apache Beam to become data-driven, even before you have big data.
A single, multitenant platform built with open source technologies, based on an understanding of basic common needs.
Kappa architecture and Bayesian models yield quick, accurate analytics in cloud monitoring systems.
Radu Gheorghe demonstrates how to create, retrieve, update, and delete documents in Elasticsearch. He also covers special Elasticsearch fields, like _type, _source, and _version, and the relationship between Elasticsearch shards and Lucene indices.
Visualizations that show comparisons, connections, and conclusions offer analytical clarity.
The O’Reilly Podcast: Nikolaus Bates-Haus on tools and techniques for addressing data variety and augmentation at scale.
How QoS enables business-critical and low-priority applications to coexist in a single cluster.
How teaching others can help society and advance your career.
Apache Hadoop co-founders Doug Cutting and Mike Cafarella explore the future of Hadoop.
Companies are differentiating themselves by acting on data in real time. But what does “real time” really mean? Jack Norris discusses the challenges of coordinating data flows, analysis, and integration at scale to shape business as it happens.
Eric Frenkiel explains how a trinity of real-time technologies—Kafka, Spark, MemSQL—is enabling Uber and others to power their companies with predictive apps and analytics.
Common mistakes that thwart a simple data visualization technique.
Moving data transformation into the hands of administrators, analysts, and other non-developers.
Exploring the intersections and compatibility of data science and procurement.
How to build, maintain, and derive value from your Hadoop data lake.
Query languages, like BQL, offer a bridge between domain experts and software experts.
Over-allocated but underutilized clusters require more than best-practice solutions.
Jeff Carpenter and Eben Hewitt design a data model for a sample application and represent it in CQL.
Metadata, governance, and other considerations for building ground-to-cloud.
Comparing the effects of storage format, modeling/filtering, caching, and other factors on analytical query speed and storage cost.
Knowing the architectures is key to thinking strategically and delivering value.
How to group users’ events using machine learning and distributed computing.
Choosing the right tools for the job.
Ranking algorithms bolster intrusion detection systems.
Globalize your data with Apache Cassandra.
Data management is an important step in deriving business value from your Hadoop data lake.
Streaming analytics are only worthwhile if the data leads to action.
Identification of data sources is the first step in warehouse development. In this video training segment, Michael Blaha provides a framework by reviewing data modeling constructs and terminology, including dependent and independent entity types. Using IE (information engineering) notation and the ERwin tool, Blaha walks through a sample operational data model.
Everyone loves data, so it's no surprise that we've been innovating by orders of magnitude in data storage. But has analytics innovation kept up?
What it looks like to analyze, visualize, and even forecast human society using global news coverage.
Consolidating data across silos improves business insight.
How Baidu combined Tachyon with Spark SQL to increase speed 30-fold.
Learn how to deploy machine learning solutions using Azure ML.
Collaboration and data security tools win the day.
In the new O’Reilly video training "Introduction to Hadoop YARN," David Yahalom explains everything you need to know about using this new data processing platform to extend Hadoop’s potential. In this segment, Yahalom explains YARN’s architecture and daemons.
Todd Lipcon investigates the trade-offs between real-time transactional access and fast analytic performance. He also describes Kudu, the new addition to the open source Hadoop ecosystem that fills the storage gap.
A data-driven market report.
Tools and learning resources for building intelligent, real-time products.
Tensor methods for machine learning are fast, accurate, and scalable, but we'll need well-developed libraries.
The Lambda Architecture has its merits, but alternatives are worth exploring.