6 Comparative Analysis of Various Ensemble Approaches for Web Page Classification

J. Dutta*, Yong Woon Kim and Dalia Dominic

Centre for Digital Innovation, CHRIST (Deemed to be University), Mysore Road, Kumbalgodu, Bangalore, India

Abstract

The amount of data available on web pages is enormous, and extracting the relevant information and classifying the pages is an important task. Web page classification finds applications in web content filtering, maintaining and expanding web directories, building efficient crawlers, and so on. Machine learning methods, known for their well-established classification approaches, have proved effective in web page classification. The present work uses ensemble methods such as Bagging Meta Estimator, Random Forest, Adaptive Boosting, Gradient Tree Boosting, Extreme Gradient Boosting and Stacking to improve on the results of a single classifier. One dataset was created manually to classify web pages into IoT projects and non-IoT projects. Another, publicly available dataset is used to classify publication- and conference-related web pages. The advantage of the ensemble methods over single classifiers has been validated, and various parameters for tuning the ensemble classifiers have been presented and analysed, with accuracy as the performance metric. Parameters such as learning rate, number of estimators, and maximum number of features have been tuned, among others, and a comparison has been presented.

Keywords: Machine learning, web scraping, web page classification, ...
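
The comparison the abstract describes, a single classifier against an ensemble scored by accuracy, can be sketched in a few lines. The example below is only an illustration under stated assumptions and is not the authors' pipeline. It uses the Python standard library rather than any particular machine learning package, synthetic two-feature data in place of the web page datasets, and decision stumps as a hypothetical base learner. The names `make_toy_data`, `fit_stump`, `fit_bagging` and `accuracy` are invented for this sketch. The number of estimators is the one ensemble parameter exposed here.

```python
import random
from collections import Counter


def make_toy_data(n, rng):
    """Synthetic stand-in for page features (assumption: not the chapter's data)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Two noisy numeric features whose mean shifts with the class label.
        features = [rng.gauss(label * 1.0, 1.0), rng.gauss(label * 1.0, 1.0)]
        data.append((features, label))
    return data


def fit_stump(data):
    """Fit a one-level decision tree by exhaustive search over splits."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            t = x[f]
            for polarity in (0, 1):
                # Count training errors for the rule "predict polarity if x[f] > t".
                errors = sum(
                    1
                    for xi, yi in data
                    if (polarity if xi[f] > t else 1 - polarity) != yi
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    _, f, t, polarity = best
    return lambda x: polarity if x[f] > t else 1 - polarity


def fit_bagging(data, n_estimators, rng):
    """Bagging: train each stump on a bootstrap sample, predict by majority vote."""
    models = []
    for _ in range(n_estimators):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


def accuracy(model, data):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_toy_data(120, rng)
test = make_toy_data(200, rng)

single_acc = accuracy(fit_stump(train), test)
bagged_acc = accuracy(fit_bagging(train, 15, rng), test)  # odd count avoids tied votes

print(f"single stump accuracy: {single_acc:.3f}")
print(f"bagged stumps accuracy: {bagged_acc:.3f}")
```

Whether the ensemble actually beats the single classifier depends on the data and the base learner, so the two printed accuracies should be read as a template for the comparison rather than as evidence for it. Changing the second argument of `fit_bagging` is a simple way to see how the number of estimators affects the result.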
