Although printing to the terminal is a lot of fun, it’s not incredibly useful when it comes to data aggregation and analysis. To make the majority of web scrapers remotely useful, you need to be able to save the information that they scrape.
This chapter covers three main methods of data management that are sufficient for almost any imaginable application. Do you need to power the backend of a website or create your own API? You’ll probably want your scrapers to write to a database. Need a fast and easy way to collect documents off the internet and put them on your hard drive? You’ll probably want to create a file stream for that. Need occasional alerts, or aggregated data once a day? Send yourself an email!
Above and beyond web scraping, the ability to store and interact with large amounts of data is incredibly important for just about any modern programming application. In fact, the information in this chapter is necessary for implementing many of the examples in later sections of the book. I highly recommend that you at least skim this chapter if you’re unfamiliar with automated data storage.
You can store media files in two main ways: by reference and by downloading the file itself. You can store a file by reference by storing the URL where the file is located. This has several advantages:
Scrapers run much faster and require much less bandwidth when they don’t have to download files.
You save space on your own machine by storing only the URLs.
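The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not code from any particular library: the function names `store_by_reference` and `local_path_for` are made up for the example, and an actual download would still require a call such as `urllib.request.urlretrieve(url, path)`.

```python
from urllib.parse import urlparse
import os

def store_by_reference(url):
    # Storing by reference: the "stored" media is just the URL string,
    # which can be written to a database or file like any other text.
    return url

def local_path_for(url, download_dir='downloaded'):
    # Storing the file itself: derive a local path from the URL.
    # The download itself could then be performed with, e.g.,
    # urllib.request.urlretrieve(url, path).
    filename = os.path.basename(urlparse(url).path) or 'index.html'
    return os.path.join(download_dir, filename)

print(store_by_reference('http://example.com/img/logo.jpg'))
print(local_path_for('http://example.com/img/logo.jpg'))
```

Storing by reference costs one short string per file; storing the file itself costs the full download, which is why the trade-offs listed above matter.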