Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe, to which StackOverflow belongs, under a cc-wiki license. At the time of writing this book, the latest data dump can be found at https://archive.org/details/stackexchange. It contains data dumps of all Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need the stackoverflow.com-Posts.7z file, which is 5.2 GB.
After downloading and extracting it, we have around 26 GB of data in XML format, containing all questions and answers as individual row tags within the root tag posts:
<?xml version="1.0" encoding="utf-8"?>
<posts>
...
<row Id="4572748" PostTypeId="2" ParentId="4568987" ...
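Because the extracted file is far too large to load into memory at once, one plausible way to read it is to stream the row tags one at a time. The following is only a minimal sketch using the standard library's xml.etree.ElementTree.iterparse. The helper name iter_rows and the two-row SAMPLE document are hypothetical and made up for illustration; only the attribute names Id, PostTypeId, and ParentId are taken from the excerpt above.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, one at a time.

    `source` may be a filename or a binary file-like object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root <posts> tag.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            # Copy the attributes before the element is discarded.
            yield dict(elem.attrib)
            # Detach already processed rows from the root so that
            # memory use stays flat on a multi-gigabyte file.
            root.clear()


# Hypothetical miniature stand-in for the real dump (assumption:
# the real rows carry more attributes than shown here).
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<posts>'
    b'<row Id="1" PostTypeId="1" />'
    b'<row Id="2" PostTypeId="2" ParentId="1" />'
    b'</posts>'
)

rows = list(iter_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row["Id"], row["PostTypeId"], row.get("ParentId"))
```

For the real data, you would pass the path of the extracted XML file instead of the in-memory sample and process each row inside the loop rather than collecting everything into a list.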