October 2017
Beginner to intermediate
236 pages
7h 38m
English
Unlike readLines(), the read_html() function does not read the source code line by line. Instead, it reads the entire HTML source into a single object while preserving the original HTML structure. To see the readable content of the page, you have to retrieve the plain text enclosed by the relevant HTML tags.
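The difference between the two functions can be seen on a small page. The following is a minimal sketch, assuming the rvest package (and xml2, which it depends on) is installed; the sample HTML and the names page, path, rawLines, and htmlDoc are made up for illustration and do not come from the book's example:

```r
library(rvest)  # makes read_html() available (the function comes from xml2)

# Hypothetical four-line sample page, written to a temporary file
page <- c(
  "<html>",
  "<head><title>Sample page</title></head>",
  "<body><p>Hello, world.</p></body>",
  "</html>"
)
path <- tempfile(fileext = ".html")
writeLines(page, path)

# readLines() returns a character vector with one element per line of the file
rawLines <- readLines(path)
length(rawLines)

# read_html() parses the whole file into a single document object
htmlDoc <- read_html(path)
class(htmlDoc)

# The tag hierarchy is kept: list the names of the children of the root <html> node
xml2::xml_name(xml2::xml_children(htmlDoc))
```

Here readLines() gives back the raw lines as text, while read_html() gives back one parsed document whose nodes can be queried by tag.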
The rvest library provides functions that select HTML tags and retrieve the plain text they contain. For example, suppose you want to retrieve the title of the web page. The title is enclosed by the <title>…</title> HTML tag pair. The following code returns the plain text title of the page:
html_text(html_nodes(htmlTextData, xpath = "//title"))
Notice that there ...