7.12. Extracting the Path from a URL

Problem

You want to extract the path from a string that holds a URL. For example, you want to extract /index.html from http://www.regexcookbook.com/index.html or from /index.html#fragment.

Solution

Extract the path from a string known to hold a valid URL. The following finds a match for all URLs, even for URLs that have no path:

\A
# Skip over scheme and authority, if any
([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
# Path
([a-z0-9\-._~%!$&'()*+,;=:@/]*)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?([a-z0-9\-._~%!$&'()*+,;=:@/]*)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the path from a string known to hold a valid URL. Only match URLs that actually have a path:

\A
# Skip over scheme and authority, if any
([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
# Path
(/?[a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)
# Query, fragment, or end of URL
([#?]|\Z)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?(/?[a-z0-9\-._~%!$&'()*+,;=@]+↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)([#?]|$)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the path from a string known to hold a valid URL. Use atomic grouping to match only those URLs that actually have a path:

\A # Skip over scheme and authority, if ...

Get Regular Expressions Cookbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.