Chapter 1. An Introduction to Regular Expressions

Many data science, analyst, and technology professionals have encountered regular expressions at some point. This esoteric, miniature language is used for matching complex text patterns, and it looks mysterious and intimidating at first. However, regular expressions (also called “regex”) are a powerful tool that requires only a small time investment to learn. They are supported almost everywhere there is data. Many analytical and technology platforms support them, including SQL, Python, R, Alteryx, Tableau, LibreOffice, Java, Scala, .NET, and Go. Major text editors and IDEs such as Atom, Notepad++, Emacs, Vim, IntelliJ IDEA, and PyCharm also support searching files with regular expressions.

Regular expressions are this widespread because they are broadly useful, and, perhaps surprisingly, they do not have a steep learning curve. If you frequently find yourself manually scanning documents or parsing substrings just to identify text patterns, you might want to give them a look. In data science and data engineering especially, they can assist with a wide spectrum of tasks, from wrangling data to qualifying and categorizing it.
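To make the idea of "identifying text patterns" concrete, here is a minimal sketch in Python, one of the platforms listed above. The sample sentence and the invoice-number format (`INV-` followed by five digits) are assumptions invented purely for illustration; they are not taken from this report.

```python
import re

# Hypothetical sample text; the invoice-number format is assumed for illustration.
text = "Shipped INV-20417 on Monday; INV-20418 and INV-7 are still pending."

# "INV-" followed by exactly five digits. The \b word boundaries keep the
# pattern from matching inside a longer word or a longer run of digits.
pattern = re.compile(r"\bINV-\d{5}\b")

# findall returns every non-overlapping match as a list of strings.
matches = pattern.findall(text)
print(matches)  # ['INV-20417', 'INV-20418']
```

A single pattern replaces what would otherwise be a loop of manual substring checks, and the malformed `INV-7` is skipped because it does not have five digits.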

In this report, I will cover enough regular expression features to make them useful for a great majority of tasks you may encounter.

Setting Up

You can test the examples I am about to walk through in a number of places. I recommend using Regular Expressions 101, a free web-based application to test ...
