5.9. Remove Duplicate Lines
You have a log file, database query output, or some other type of file or string with duplicate lines. You need to remove all but one of each duplicate line using a text editor or other similar tool.
There is a variety of software (including the Unix command-line utility uniq and the Windows PowerShell cmdlet Get-Unique) that can help you remove duplicate lines in a file or string. The following sections contain three regex-based approaches that can be especially helpful when trying to accomplish this task in a nonscriptable text editor with regular expression search-and-replace support.
When you’re programming, options two and three should be avoided since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, the first option (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.
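For comparison, here is a minimal Python sketch of the hash-based approach mentioned above. The function name remove_duplicate_lines is purely illustrative, and the sketch assumes the whole text fits in memory and that it is acceptable to normalize line endings to "\n".

```python
def remove_duplicate_lines(text: str) -> str:
    """Keep the first occurrence of each line, preserving the original order."""
    seen = set()      # hash set of lines already encountered
    unique = []       # lines to keep, in order of first appearance
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            unique.append(line)
    # Line endings are normalized to "\n"; a trailing newline is not kept.
    return "\n".join(unique)


if __name__ == "__main__":
    print(remove_duplicate_lines("a\nb\na\nc\nb"))
```

Because each line is looked up in a set, this makes a single pass over the input and does not require sorting, so the original line order survives.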
Option 1: Sort lines and remove adjacent duplicates
If you’re able to sort the lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. Sorting first lets you use a simpler and more efficient search-and-replace operation to remove the duplicates than would otherwise be possible.
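When scripting is available, the same sort-then-collapse idea can be sketched without a regex. The following Python example is only an illustration, not the regex-based solution this recipe describes; the function name collapse_adjacent and its sort_first parameter are hypothetical.

```python
from itertools import groupby


def collapse_adjacent(text: str, sort_first: bool = True) -> str:
    """Collapse each run of identical adjacent lines into a single line.

    With sort_first=True the lines are sorted beforehand, so every
    duplicate becomes adjacent and is removed (the original order is lost).
    With sort_first=False only duplicates that are already adjacent go away.
    """
    lines = text.splitlines()
    if sort_first:
        lines = sorted(lines)
    # groupby yields one key per run of consecutive equal lines.
    return "\n".join(key for key, _run in groupby(lines))


if __name__ == "__main__":
    print(collapse_adjacent("b\na\nb\nc\na"))
```

Passing sort_first=False shows why sorting matters: without it, a duplicate that is separated from its twin by another line is left in place.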
After sorting the lines, use the following regex and replacement string to get rid of the duplicates: