Using OpenRefine

Name: Using OpenRefine
ISBN: 9781783289080

by Ruben Verborgh, Max De Wilde

September 2013

Intermediate to advanced

114 pages

2h 51m

English

Packt Publishing

Table of Contents
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book

Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example files
Errata
Piracy
Questions
1. Diving Into OpenRefine
Introducing OpenRefine
Recipe 1 – installing OpenRefine
Windows
Mac
Linux
Recipe 2 – creating a new project
File formats supported by OpenRefine
Recipe 3 – exploring your data
Recipe 4 – manipulating columns
Collapsing and expanding columns
Moving columns around
Renaming and removing columns
Recipe 5 – using the project history
Recipe 6 – exporting a project
Recipe 7 – going for more memory
Windows
Mac
Linux
Summary
2. Analyzing and Fixing Data
Recipe 1 – sorting data
Reordering rows
Recipe 2 – faceting data
Text facets
Numeric facets
Customized facets
Faceting by star or flag
Recipe 3 – detecting duplicates
Recipe 4 – applying a text filter
Recipe 5 – using simple cell transformations
Recipe 6 – removing matching rows
Summary
3. Advanced Data Operations
Recipe 1 – handling multi-valued cells
Recipe 2 – alternating between rows and records mode
Recipe 3 – clustering similar cells
Recipe 4 – transforming cell values
Recipe 5 – adding derived columns
Recipe 6 – splitting data across columns
Recipe 7 – transposing rows and columns
Summary
4. Linking Datasets
Recipe 1 – reconciling values with Freebase
Recipe 2 – installing extensions
Recipe 3 – adding a reconciliation service
Recipe 4 – reconciling with Linked Data
Recipe 5 – extracting named entities
Summary
A. Regular Expressions and GREL
Regular expressions for text patterns
Character classes
Quantifiers
Anchors
Choices
Groups
Overview
General Refine Expression Language (GREL)
Transforming data
Creating custom facets
Solving problems with GREL
Index

Content preview from Using OpenRefine

Recipe 3 – detecting duplicates

In this recipe, you will learn what duplicates are, how to spot them, and why spotting them matters.

The only type of customized facet that we left out in the previous recipe is the duplicates facet. Duplicates are records that appear twice (or more) in a dataset. Keeping identical records wastes space and can create ambiguity, so we will want to remove them. This facet is an easy way to detect them, but it has a downside: out of the box, it only works on text strings (to learn how to tweak it to work on integers as well, have a look at Appendix A, Regular Expressions and GREL).

Unfortunately, this means we cannot use the duplicates facet on the Record ID column. The next best thing is to run ...
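
To make the idea of flagging duplicates concrete, here is a minimal Python sketch of the underlying logic: count how often each value occurs in a column and mark every cell whose value occurs more than once. This is not OpenRefine's implementation and not the book's method. The function name `flag_duplicates` and the sample record IDs are hypothetical, and converting every value to a string before comparing is an assumption made here to mirror the text-only limitation described above.

```python
from collections import Counter


def flag_duplicates(values):
    """Return one boolean per input value: True if that value occurs more than once."""
    # Assumption: compare values by their string form, so that numbers
    # and text are handled the same way.
    keys = [str(v) for v in values]
    counts = Counter(keys)
    return [counts[k] > 1 for k in keys]


# Hypothetical sample column; not data from the book.
record_ids = [7, 12, 7, 31]
flags = flag_duplicates(record_ids)

for value, is_dup in zip(record_ids, flags):
    print(value, "duplicate" if is_dup else "unique")
```

Filtering on the `True` flags corresponds to selecting the duplicate side of a two-valued facet; which of the repeated rows to keep is a separate decision.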
