7 Detection of Malicious Emails and URLs Using Text Mining
Heetakshi Fating, Aditya Narawade, Sandeep Kumar Satapathy and Shruti Mishra*
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India
Abstract
This work builds a combined model from two classifiers: the first decides whether an email is malicious, and any email judged non-malicious is then analyzed further to check whether it contains a malicious URL. For the URL model, features are created and then filtered with the information gain feature selection technique, while the email model relies on tokenization. For the combined model, a new dataset was created containing only non-malicious emails with a mix of good and bad URLs. Features were created in the same manner as for the URL model, to determine whether emails flagged as non-malicious were entirely non-malicious or contained a malicious URL of any sort. For malicious URL detection alone, the Random Forest algorithm achieved the best accuracy, 80.7%; Random Forest also achieved an accuracy of 98.9% on the email dataset. For the final combined model, the Support Vector Machine and Logistic Regression algorithms gave better accuracies than the other algorithms, 81.88% and 81.49% respectively.
Keywords: Malicious emails, malicious URLs, machine learning, feature creation, feature selection
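The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the regular expression, the function names, and the two toy predicates standing in for the trained email and URL models are all hypothetical.

```python
import re
from typing import Callable, List

# Simple pattern for pulling http(s) URLs out of an email body.
# This is an assumption for illustration, not the chapter's extraction method.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(email_text: str) -> List[str]:
    """Return every http(s) URL found in the email text."""
    return URL_PATTERN.findall(email_text)


def classify_email(
    email_text: str,
    email_is_malicious: Callable[[str], bool],
    url_is_malicious: Callable[[str], bool],
) -> str:
    """Two-stage cascade: email model first, URL model only on emails that pass."""
    # Stage 1: the email-level model decides on the message as a whole.
    if email_is_malicious(email_text):
        return "malicious email"
    # Stage 2: emails that pass stage 1 are checked URL by URL.
    if any(url_is_malicious(url) for url in extract_urls(email_text)):
        return "malicious URL"
    return "non-malicious"


# Hypothetical stand-ins for the trained models, for illustration only.
def toy_email_model(text: str) -> bool:
    return "verify your account" in text.lower()


def toy_url_model(url: str) -> bool:
    return "@" in url or url.count(".") > 3
```

In practice the two callables would wrap trained classifiers (for example, a Random Forest over tokenized email text and another over engineered URL features); passing them in as parameters keeps the cascade logic independent of how each model is built.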
7.1 Introduction
A major and growing ...