8.2. Finding URLs Within Full Text

Problem

You want to find URLs in a larger body of text. URLs may or may not be enclosed in punctuation, such as parentheses, that are not part of the URL.

Solution

URL without spaces:

\b(https?|ftp|file)://\S+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
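
As a rough illustration of how this first regex behaves, the following Python sketch applies it to the sample sentence used in the Discussion. The names `URL_NO_SPACES` and `first_url` are made up for this example, and Python is just one of the flavors listed above.

```python
import re

# First pattern from the Solution: a scheme, "://", then any run of
# non-whitespace characters. re.IGNORECASE mirrors "Case insensitive".
URL_NO_SPACES = re.compile(r"\b(https?|ftp|file)://\S+", re.IGNORECASE)

text = "Visit http://www.somesite.com/page, where you will find more information."

match = URL_NO_SPACES.search(text)
first_url = match.group(0)

# \S+ stops only at whitespace, so the comma after "page" is part of the match.
print(first_url)
```

Because `\S+` consumes everything up to the next space, the match ends with the comma that follows the URL in the sentence.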

URL without spaces or final punctuation:

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
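
A minimal Python sketch of this second regex follows, assuming the two lines of the pattern are joined into a single line. The sample text, the `ftp.example.org` address, and the names `URL_NO_FINAL_PUNCT` and `urls` are invented for illustration.

```python
import re

# Second pattern from the Solution, written as one line. The first character
# class allows punctuation inside the URL; the second class, which must match
# the last character, leaves out ? ! : , . ; so they cannot end the match.
URL_NO_FINAL_PUNCT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

# Hypothetical sample text with a trailing comma and enclosing parentheses.
text = "Visit http://www.somesite.com/page, or (see ftp://ftp.example.org/pub/readme.txt)."

# finditer with group(0) returns whole matches; findall would return only
# the captured scheme because the pattern contains a capturing group.
urls = [m.group(0) for m in URL_NO_FINAL_PUNCT.finditer(text)]
print(urls)
```

Parentheses are in neither character class, so the closing parenthesis ends the second match, and the greedy first class backtracks by one character to drop the trailing comma from the first.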

URL without spaces or final punctuation. URLs that start with the www or ftp subdomain can omit the scheme:

\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
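
The third regex can be tried the same way. This is only a sketch: the sample text, the `example.com` addresses, and the names `URL_OPTIONAL_SCHEME` and `urls` are assumptions made for the example.

```python
import re

# Third pattern from the Solution, written as one line. The URL may start
# with a scheme and "://", or with a literal "www." or "ftp." prefix.
URL_OPTIONAL_SCHEME = re.compile(
    r"\b((https?|ftp|file)://|(www|ftp)\.)"
    r"[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

# Hypothetical sample text mixing scheme-less and fully qualified URLs.
text = (
    "Docs are at www.example.com/docs. "
    "Mirror: FTP.example.com/pub, or https://example.com/a?b=1."
)

urls = [m.group(0) for m in URL_OPTIONAL_SCHEME.finditer(text)]
print(urls)
```

Matches found through the `www.` or `ftp.` alternative carry no scheme, so code that turns them into links would likely need to add one itself.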

Discussion

Given the text:

Visit http://www.somesite.com/page, where you will find more information.

what is the URL?

Before you say http://www.somesite.com/page, think about this: punctuation and spaces can both appear in URLs. Though RFC 3986 (see Recipe 8.7) does not allow literal spaces in URLs, all major browsers accept URLs with literal spaces. Some WYSIWYG web authoring tools even make it easy for the user to put spaces in file and folder names, and then include those spaces literally in links to those files.

That means that if we use a regular expression that allows all valid URLs, it will find this URL in the preceding text:

http://www.somesite.com/page, where ...
