5.9. Remove Duplicate Lines
Problem
You have a log file, database query output, or some other type of file or string with duplicate lines. You need to remove all but one of each duplicate line using a text editor or other similar tool.
Solution
There is a variety of software (including the Unix command-line utility uniq and the Windows PowerShell cmdlet Get-Unique) that can help you remove duplicate lines in a file or string. The following sections contain three regex-based approaches that can be especially helpful when trying to accomplish this task in a nonscriptable text editor with regular expression search-and-replace support.
When you’re programming, options two and three should be avoided since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, the first option (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.
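For comparison, the hash-based alternative mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the book: the function name unique_lines is hypothetical, and a Python set stands in for the "hash object" that tracks lines already seen.

```python
def unique_lines(lines):
    """Return the lines with duplicates removed, keeping first occurrences in order."""
    seen = set()      # hash-based record of every line encountered so far
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


if __name__ == "__main__":
    sample = "b\na\nb\nc\na"
    print("\n".join(unique_lines(sample.splitlines())))
```

Each line is checked against the set once, so the work grows roughly linearly with the number of lines, and no sorting is needed to preserve the original order.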
Option 1: Sort lines and remove adjacent duplicates
If you’re able to sort the lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. Sorting first lets you remove the duplicates with a simpler and more efficient search-and-replace operation than would otherwise be possible.
After sorting the lines, use the following regex and replacement string to get rid of the duplicates:
^(.*)(?:(?:\r?\n|\r)\1)+$ ...
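As a rough illustration of Option 1 in code, the sketch below applies the regex shown above using Python's re module. The replacement string is not visible in this excerpt, so the backreference \1 used here is an assumption, as are the function names. The re.MULTILINE flag is set so that ^ and $ match at line boundaries rather than only at the start and end of the whole string.

```python
import re

# The regex from the text. MULTILINE makes ^ and $ match at each line break.
DUPLICATE_RUN = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text):
    """Collapse each run of identical adjacent lines into a single line."""
    # Assumption: replace the whole run with the first captured line.
    return DUPLICATE_RUN.sub(r"\1", text)


def remove_duplicates_sorted(text):
    """Sort the lines first so that all duplicates become adjacent."""
    lines = sorted(text.splitlines())
    return remove_adjacent_duplicates("\n".join(lines))


if __name__ == "__main__":
    print(remove_adjacent_duplicates("a\na\nb\nb\nb\nc"))
    print(remove_duplicates_sorted("b\na\nb\na"))
```

Without the sorting step, only duplicates that already sit next to each other are removed, which matches the caveat given for this option.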