5.13. Replace Repeated Whitespace with a Single Space
Problem
As part of a cleanup routine for user input or other data, you want to replace repeated whitespace characters with a single space. Any tabs, line breaks, or other whitespace should also be replaced with a space.
Solution
To implement either of the following regular expressions, simply replace all matches with a single space character. Recipe 3.14 shows the code to do this.
Clean any whitespace characters
\s+
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
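As a rough illustration of applying this regex, here is a minimal Python sketch that uses the standard `re.sub` function. The helper name `collapse_whitespace` is hypothetical and is not taken from this recipe or from Recipe 3.14.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of one or more whitespace characters with one space."""
    return re.sub(r"\s+", " ", text)

# Tabs, line breaks, and repeated spaces all collapse to a single space.
print(collapse_whitespace("one\t\ttwo \n three"))  # one two three

# Even a lone tab is replaced, because + matches one or more characters.
print(collapse_whitespace("a\tb"))  # a b
```

Note that leading or trailing whitespace is reduced to a single space rather than removed, so a separate trim step may still be needed.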
Clean horizontal whitespace characters
[●\t\xA0]+
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby 1.8 |
[●\t\u00A0]+
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, Python, Ruby 1.9 |
\h+
| Regex options: None |
| Regex flavors: PCRE 7.2, Perl 5.10 |
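The horizontal-only variant might be applied in Python along the following lines. This is a sketch under two assumptions: that the ● in the patterns above stands for a literal space character, and that the hypothetical helper name `collapse_horizontal_whitespace` is acceptable for illustration. It uses the character-class form, since `\h` is listed above only for PCRE and Perl.

```python
import re

# Space, tab, and no-break space (U+00A0); line breaks are deliberately excluded.
HORIZONTAL_WS = re.compile(r"[ \t\xA0]+")

def collapse_horizontal_whitespace(text: str) -> str:
    """Replace runs of spaces, tabs, and no-break spaces with one space."""
    return HORIZONTAL_WS.sub(" ", text)

sample = "a \t b\n\tc\u00a0\u00a0d"
print(repr(collapse_horizontal_whitespace(sample)))  # 'a b\n c d'
```

The line break in the sample survives, which is the practical difference from the `\s+` version.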
Discussion
A common text cleanup routine is to replace repeated whitespace characters with a single space. In HTML, for example, repeated whitespace is simply ignored when rendering a page (with a few exceptions). Removing repeated whitespace can therefore help to reduce the file size of some pages (or at least page sections) without any negative effects.
Clean any whitespace characters
In this solution, any sequence of whitespace characters (line breaks, tabs, spaces, etc.) is replaced with a single space. Since the ‹+› quantifier repeats the ‹\s› whitespace class one or more times, even a single tab character, for example, will be replaced with a space. If you replaced ...