8.10. Extracting the Host from a URL

Problem

You want to extract the host from a string that holds a URL. For example, you want to extract www.regexcookbook.com from http://www.regexcookbook.com/.

Solution

Extract the host from a URL known to be valid

\A
[a-z][a-z0-9+\-.]*://               # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?      # User
([a-z0-9\-._~%]+                    # Named or IPv4 host
|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])   # IPv6+ host
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-z0-9\-._~%!$&'()*+,;=:]+\])
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the host while validating the URL

\A
[a-z][a-z0-9+\-.]*://                       # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
([a-z0-9\-._~%]+                            # Named host
|\[[a-f0-9:.]+\]                            # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
(:[0-9]+)?                                  # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?↵
(#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, ...

Get Regular Expressions Cookbook, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.