November 2019
Intermediate to advanced
346 pages
9h 36m
English
We begin by reading the HIPAA dataset into a dataframe and dropping any rows that contain NAs (step 1). Next, in step 2, we see that most breaches are relatively small in scale, while a small number are massive. This is consistent with the Pareto principle. In step 3, we plot breaches by sector and see that the largest breaches occur in Business Associates. Then, in step 4, we examine which states have the most HIPAA breaches. In step 5, we learn that the cause of the largest breaches is usually unknown! In steps 6 and 7, we perform basic NLP on the descriptions of the breaches, which allows us to extract additional information of interest. In step 8, we can see that TF-IDF was able to find some very informative ...
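
To make the NA-dropping and TF-IDF steps concrete, here is a minimal from-scratch sketch using only the Python standard library. It is not the recipe's code: the records, the field names (`sector`, `affected`, `description`), and the helper functions `tf_idf` and `top_terms` are all hypothetical and invented for illustration, and the weighting is the plain textbook form (term frequency times `log(N / df)`). Library vectorizers such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy records, invented for illustration only (NOT taken from the HIPAA dataset).
records = [
    {"sector": "Business Associate", "affected": 5000,
     "description": "stolen laptop contained unencrypted patient records"},
    {"sector": "Healthcare Provider", "affected": 800,
     "description": "phishing email exposed patient records"},
    {"sector": "Health Plan", "affected": 1200,
     "description": "mailing error exposed patient records"},
    {"sector": "Healthcare Provider", "affected": None,
     "description": "stolen laptop"},
]

# Analogue of step 1: drop every record that has a missing value.
clean = [r for r in records if all(v is not None for v in r.values())]


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights, k=2):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Analogue of steps 6 to 8: weight the description terms and inspect the top ones.
weights = tf_idf([r["description"] for r in clean])
for record, w in zip(clean, weights):
    print(record["sector"], top_terms(w))
```

In this toy corpus, words that appear in every description (such as "patient" and "records") receive a weight of zero, while words specific to one description (such as "phishing") rank highest. That is the property that lets TF-IDF surface informative terms from breach descriptions.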