Ben Lorica

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.

Twitter and the Micro-Messaging Revolution: Communication, Connections, and Immediacy--140 Characters at a Time
by Abdur Chowdhury, Gregor Hochmuth, Ben Lorica, Roger Magoulas, Sarah Milstein, Tim O'Reilly
June 2009
Ebook: $99.00

Where 2.0: The State of the Geospatial Web
by Brady Forrest, Ben Lorica, Roger Magoulas, Andrew Turner
June 2009
Ebook: $399.00

Virtual Worlds: A Business Guide
by Ben Lorica, Roger Magoulas
June 2009
Ebook: $249.00

Recent Posts | All O'Reilly Posts


A growing number of applications are being built with Spark

May 31 2014

One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than … read more

Welcome to Intelligence Matters

May 15 2014

Editor’s note: this post originally appeared on the O’Reilly Radar blog and was co-authored by Ben Lorica and Roger Magoulas. Today we’re kicking off Intelligence Matters (IM), a new series exploring current issues in artificial intelligence, including the connection between … read more

Welcome to Intelligence Matters

May 14 2014

Editor’s note: this post was co-authored by Ben Lorica and Roger Magoulas. Today we’re kicking off Intelligence Matters (IM), a new series exploring current issues in artificial intelligence, including the connection between artificial intelligence, human intelligence and the brain. IM … read more

Network Science Dashboards

April 26 2014

With Network Science well on its way to being an established academic discipline, we’re beginning to see tools that leverage it. Applications that draw heavily from this discipline make heavy use of visual representations and come with interfaces aimed at … read more

Verticalized Big Data solutions

April 19 2014

As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused on specific verticals. At their core big data applications … read more

5 Fun Facts about HBase that you didn’t know

April 06 2014

With HBaseCon right around the corner, I wanted to take stock of one of the more popular components in the Hadoop ecosystem. Over the last few years, many more companies have come to rely on HBase to run key products … read more

Crowdsourcing Feature discovery

March 15 2014

Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that as a data scientist he got frustrated with having to … read more

Instrumenting collaboration tools used in data projects

March 08 2014

As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, companies are beginning to treat analytics as mission-critical software and … read more

Interface Languages and Feature Discovery

March 02 2014

Here are a few more observations based on conversations I had during the just concluded Strata Santa Clara conference. Interface languages: Python, R, SQL (and Scala). This is a great time to be a data scientist or data engineer who … read more

Extending GraphLab to tables

February 23 2014

GraphLab’s SFrame, an interesting and somewhat under-the-radar tool, was unveiled at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With the addition of SFrame, users can leverage GraphLab’s many algorithms on data … read more

Bridging the gap between research and implementation

February 15 2014

One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a … read more

Big Data solutions through the combination of tools

February 09 2014

As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A … read more

Business analysts want access to advanced analytics

January 29 2014

I talk with many new companies who build tools for business analysts and other non-technical users. These new tools streamline and simplify important data tasks including interactive analysis (e.g., pivot tables and cohort analysis), interactive visual analysis (as popularized by … read more

What I use for data visualization

January 26 2014

Depending on the nature of the problem, data size, and deliverable, I still draw upon an array of tools for data visualization. As I survey the Design track at next month’s Strata conference, I see creators and power users of … read more

IPython: A unified environment for interactive data analysis

January 19 2014

As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython. In development since 2001, IPython … read more

Big Data systems are making a difference in the fight against cancer

January 10 2014

As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where … read more

A compelling family of DSLs for Data Science

January 06 2014

An important reason why pydata tools and Spark appeal to data scientists is that they both cover many data science tasks and workloads (Spark users can move seamlessly between batch and streaming). Being able to use the same programming style … read more

Six reasons why I recommend scikit-learn

December 28 2013

I use a variety of tools for advanced analytics; most recently I’ve been using Spark (and MLlib), R, scikit-learn, and GraphLab. When I need to get something done quickly, I’ve been turning to scikit-learn for my first pass analysis. For … read more

Financial analytics as a service

December 22 2013

In relatively short order Amazon’s internal computing services has become the world’s most successful cloud computing platform. Conceived in 2003 and launched in 2006, AWS grew quickly and is now the largest web hosting company in the world. With the … read more

Expanding options for mining streaming data

December 15 2013

Stream processing was on the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks is behind the recent chatter. Through a combination of simpler … read more

Reproducing Data Projects

December 07 2013

As I talk to people and companies building the next generation of tools for data scientists, collaboration and reproducibility keep popping up. Collaboration is baked into many of the newer tools I’ve seen (including ones that have yet to be … read more

Data Scientists and Data Engineers like Python and Scala

December 01 2013

In exchange for getting personalized recommendations many Meetup members declare topics that they’re interested in. I recently looked at the topics listed by members of a few local, data Meetups that I’ve frequented. These Meetups vary in size from 600 … read more

Data Wrangling gets a fresh look

November 25 2013

Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data … read more

Day-Long Immersions and Deep Dives at Strata Santa Clara 2014

November 16 2013

As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data … read more

How companies are using Spark

November 10 2013

When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That … read more

Simplifying interactive, realtime, and advanced analytics

November 03 2013

Here are a few observations based on conversations I had during the just concluded Strata NYC conference. Interactive query analysis on Hadoop remains a hot area. A recent O’Reilly survey confirmed SQL is an important skill for data scientists. A … read more

The emergence of Crowdsourcing specialists

October 25 2013

A little over four years ago, I attended the first Crowdsourcing meetup at the offices of CrowdFlower (then called Dolores Labs). The crowdsourcing community has grown explosively since that initial gathering, and there are now conference tracks and conferences devoted … read more

Deep Learning oral traditions

October 20 2013

This past week I had the good fortune of attending two great talks on Deep Learning, given by Googlers Ilya Sutskever and Jeff Dean. Much of the excitement surrounding Deep Learning stems from impressive results in a variety of perception … read more

Stream Mining essentials

October 13 2013

A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their … read more

Semi-automatic method for grading a million homework assignments

October 06 2013

One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days a “large class” was only in the few hundreds (still a challenge for the TAs and instructor). But in the … read more

Gaining access to the best machine-learning methods

September 29 2013

For companies in the early stages of grappling with big data, the analytic lifecycle (model building, deployment, maintenance) can be daunting. In earlier posts I highlighted some new tools that simplify aspects of the analytic lifecycle, including the early phases … read more

Databricks aims to build next-generation analytic tools for Big Data

September 26 2013

Key technologists behind the Berkeley Data Analytics Stack (BDAS) have launched a company that will build software – centered around Apache Spark and Shark – for analyzing big data. Details of their product and strategy are sparse, as the company … read more

Stream Processing and Mining just got more interesting

September 22 2013

Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message … read more

How Twitter monitors millions of time-series

September 15 2013

One of the keys to Twitter’s ability to process 500 million tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics … read more

Data Analysis: Just one component of the Data Science workflow

September 08 2013

Judging from articles in the popular press, the term data scientist has increasingly come to refer to someone who specializes in data analysis (statistics, machine-learning, etc.). This is unfortunate since the term originally described someone who could cut across disciplines. … read more

Running batch and long-running, highly available service jobs on the same cluster

September 01 2013

As organizations increasingly rely on large computing clusters, tools for leveraging and efficiently managing compute resources become critical. Specifically, tools that allow multiple services and frameworks to run on the same cluster can significantly increase utilization and efficiency. Schedulers take into … read more

Data analysis tools target non-experts

August 25 2013

A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while other tools make it easier to combine code, visuals, and text in the same workflow. They enable users who … read more

Interactive Big Data analysis using approximate answers

August 17 2013

Interactive query analysis for Hadoop-scale data has recently attracted the attention of many companies and open source developers – some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, … read more

Surfacing anomalies and patterns in Machine Data

August 11 2013

I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big … read more

Big Data and Advertising: In the trenches

August 05 2013

The $35B merger of Omnicom and Publicis put the convergence of Big Data and Advertising in the front pages of business publications. Adtech companies have long been at the forefront of many data technologies, strategies, and techniques. By now it’s … read more

Near realtime, streaming, and perpetual analytics

July 28 2013

Simple example of a near realtime app built with Hadoop and HBase. Over the past year Hadoop emerged from its batch processing roots and began to take on interactive and near realtime applications. There are numerous examples that fall under … read more

Tightly integrated engines streamline Big Data analysis

July 20 2013

The choice of tools for data science includes factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch … read more

Data scientists tackle the analytic lifecycle

July 15 2013

What happens after data scientists build analytic models? Model deployment, monitoring, and maintenance are topics that haven’t received as much attention in the past, but I’ve been hearing more about these subjects from data scientists and software developers. I remember … read more

Pattern-detection and Twitter’s Streaming API

July 06 2013

Researchers and companies who need social media data frequently turn to Twitter’s API to access a random sample of tweets. Those who can afford to pay (or have been granted access) use the more comprehensive feed (the firehose) available through … read more

Moving from Batch to Continuous Computing at Yahoo!

June 29 2013

My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the … read more

Analytic engines that factor in security labels

June 23 2013

Originated by the NSA, Apache Accumulo is a BigTable inspired data store known for being highly scalable and for its interesting security model. Federal agencies and Defense contractors have deployed Accumulo on clusters of a thousand or more servers. It … read more

HBase looks more appealing to data scientists

June 16 2013

When Hadoop users need to develop apps that are “latency sensitive”, many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, … read more

It’s getting easier to build Big Data applications

June 09 2013

Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera … read more

Tracking the progress of large-scale Query Engines

June 04 2013

As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing … read more

How signals, geometry, and topology are influencing data science

May 24 2013

I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology, differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection perhaps it shouldn’t be so … read more
