8.2. Finding URLs Within Full Text
Problem
You want to find URLs in a larger body of text. URLs may or may not be enclosed in punctuation, such as parentheses, that are not part of the URL.
Solution
URL without spaces:
\b(https?|ftp|file)://\S+
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
URL without spaces or final punctuation:
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
URL without spaces or final punctuation. URLs that start with the
www or ftp subdomain can omit the scheme:
\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
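As a rough illustration of how the three patterns differ, the following Python sketch compiles each one with the case-insensitive option and runs it over a sample sentence. The names NO_SPACES, NO_FINAL_PUNCT, OPTIONAL_SCHEME, find_urls, and SAMPLE are assumptions made for this example, and www.example.com is a placeholder address.

```python
import re

# The three Solution patterns, compiled with the "case insensitive" option.
NO_SPACES = re.compile(r"\b(https?|ftp|file)://\S+", re.IGNORECASE)
NO_FINAL_PUNCT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)
OPTIONAL_SCHEME = re.compile(
    r"\b((https?|ftp|file)://|(www|ftp)\.)"
    r"[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def find_urls(pattern, text):
    """Return every overall match of pattern in text (hypothetical helper)."""
    # finditer plus group(0) is used because the patterns contain capturing
    # groups, which would make re.findall return the groups instead.
    return [m.group(0) for m in pattern.finditer(text)]


# Sample text invented for this sketch.
SAMPLE = "Visit http://www.somesite.com/page, or see (www.example.com/docs)."

print(find_urls(NO_SPACES, SAMPLE))        # keeps the trailing comma
print(find_urls(NO_FINAL_PUNCT, SAMPLE))   # drops the trailing comma
print(find_urls(OPTIONAL_SCHEME, SAMPLE))  # also finds the scheme-less URL
```

On this sample, the first pattern includes the comma after "page", the second stops before it, and the third additionally picks up the www address without the enclosing parenthesis.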
Discussion
Given the text:
Visit http://www.somesite.com/page, where you will find more information.
what is the URL?
Before you say http://www.somesite.com/page, think about
this: punctuation and spaces are valid characters in URLs. Though RFC
3986 (see Recipe 8.7) does not allow literal
spaces in URLs, all major browsers accept URLs with literal spaces just
fine. Some WYSIWYG web authoring tools even make it easy for the user to
put spaces in file and folder names, and include those spaces literally
in links to those files.
That means that if we use a regular expression that allows all valid URLs, it will find this URL in the preceding text:
http://www.somesite.com/page, where ...
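The effect can be reproduced with a small Python sketch. The pattern below is a hypothetical one built for illustration only: it is the "no final punctuation" regex from the Solution with a literal space added to the first character class, and it is not a pattern from this recipe. The names ALLOWS_SPACES, TEXT, and overmatched are likewise assumptions.

```python
import re

# Hypothetical pattern for illustration only: the "no final punctuation"
# regex with a literal space added to the first character class.
ALLOWS_SPACES = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.; ]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

TEXT = "Visit http://www.somesite.com/page, where you will find more information."

# Because spaces are now allowed, the greedy class runs to the end of the
# sentence and only the final period is given back.
match = ALLOWS_SPACES.search(TEXT)
overmatched = match.group(0) if match else None
print(overmatched)
```

Once spaces are permitted, nothing in the sentence tells the regex where the URL ends, so the match swallows the words that follow the address.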