Basic Concepts in Data Mining
This chapter describes basic concepts in data mining, typical tasks for data
mining, and basic data structures as targets of data mining.
5.1 What is Data Mining?
First, the fundamental concepts of data mining [Han et al. 2006, Tan et al.
2006], which can be used as principal techniques for constructing hypotheses
in the analysis of social big data, will be briefly described. In a nutshell,
data mining is the discovery of frequent patterns and meaningful structures
in the large amounts of data used by applications. Principal techniques for
validating hypotheses in social big data, such as multivariate analysis, will
be explained in a separate chapter.
One of the basic techniques for data mining is association rule mining,
also known as association analysis. It discovers frequent co-occurrences
among the structured data used in business applications, which are usually
managed by database management systems (DBMS) such as relational
database systems. An algorithm called Apriori is used in many cases for
that purpose. For example, association rule mining discovers combinations
of items that frequently co-occur in groups of items (i.e., the contents of
shopping carts) that customers have purchased at the same time in retail
stores such as supermarkets. Association rules are derived from the frequent
combinations of items discovered by the algorithm. Based on association
rules, many application systems recommend sets of items by revising
how the items are arranged. Association rule mining has also been extended
and applied to product purchase histories and to click streams on Web pages
in order to discover frequent patterns in series data. Mining such
historical data is specifically called historical data mining.
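The shopping cart example can be made concrete with a minimal Apriori-style
sketch. It is an illustration under stated assumptions, not the implementation
used by any particular system: the function names, the cart contents, and the
thresholds `min_support` and `min_confidence` are all hypothetical. Support is
taken as the fraction of carts containing an itemset, and confidence as the
support of the whole itemset divided by the support of the rule's left-hand side.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    carts = [frozenset(t) for t in transactions]
    n = len(carts)
    # Level 1 candidates: every single item seen in any cart.
    candidates = [frozenset([i]) for i in sorted({i for c in carts for i in c})]
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for cand in candidates:
            support = sum(1 for c in carts if cand <= c) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        candidates = [
            cand for cand in joined
            if all(frozenset(s) in level for s in combinations(cand, k - 1))
        ]
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples derived from frequent itemsets."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


# Hypothetical shopping carts, for illustration only.
carts = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
itemsets = frequent_itemsets(carts, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.8)
```

With these carts, the only rule that passes both thresholds is
{butter} -> {bread}: every cart containing butter also contains bread.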
On the other hand, a classifier is learned from data whose classes
(i.e., categories) are known in advance. Then, when new data arrive, the
classes to which they should belong are determined by using the learned
classifier. This task, called classification, is one of the basic data mining
techniques. Naïve Bayes and decision trees are typical classifiers.
Classification is used in a variety of applications, such as identifying
promising customers, detecting spam e-mails, and determining the categories
of new specimens in science or medicine. Determining continuous values such
as temperatures and stock prices is also called prediction of future values.
Prediction requires methods such as regression analysis as a basic approach
or multivariate analysis as a more advanced approach. These analytical
approaches have been developed more or less independently of data mining.
However, they are regarded as extensions of data mining and will be
described separately in this book as key technologies for social big data
mining. By combining two or more existing classifiers, ensemble learning
can create a classifier that is more accurate than each of the original ones.
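The learn-then-classify cycle described above can be sketched with a small
Naïve Bayes classifier for the spam example. This is a minimal illustration
under assumptions of our own: the training messages, the function names
`train_naive_bayes` and `predict`, and the use of add-one (Laplace) smoothing
are not taken from the text.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Learn counts from (tokens, label) pairs whose classes are known."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior for new data."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled messages, for illustration only.
training = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(training)
label = predict(model, ["money", "offer"])
```

Working in log space avoids numerical underflow when many word
probabilities are multiplied together.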
It may be possible to define degrees of similarity between data even
if the categories of the data are not known in advance. The opposite concept
of similarity is dissimilarity, or distance. Grouping the data in a
collection so that data similar to each other, under the defined similarity,
fall into the same group is called cluster analysis or clustering, which is
also one of the basic technologies of data mining. Unlike classification,
clustering does not require the names and characteristics of clusters to be
known in advance. Techniques such as the hierarchical agglomerative method
and the nonhierarchical k-means method are often used for clustering.
Promising applications of clustering include the discovery of groups of
similar customers for marketing.
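As an illustration of distance-based grouping, the following is a minimal
k-means sketch. It assumes Euclidean distance as the dissimilarity measure
and takes the initial centroids as an argument so that the result is
reproducible. The function name `kmeans`, the sample points, and the starting
centroids are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

No class labels are supplied: the two groups emerge from the distances
alone, which is the contrast with classification drawn above. In practice
the result depends on the choice of k and of the initial centroids.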
A data mining task that detects exceptional values, or values that
differ from standard values, is called outlier detection. There are methods
for outlier detection based on statistical models, data distances, and data
densities. Outliers can alternatively be found using clustering and
classification. Outlier detection has been used in applications such as
the detection of credit card fraud and network intrusions.
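Of the approaches listed above, the statistical one is the simplest to
sketch. The example below flags values whose z-score (distance from the mean
in units of standard deviation) exceeds a threshold. It assumes roughly
normally distributed data. The function name `zscore_outliers`, the threshold
of 2.0, and the sample amounts are hypothetical choices made for illustration.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical transaction amounts with one exceptional value.
amounts = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(amounts)
```

Because the mean and standard deviation are themselves pulled toward extreme
values, this simple test can miss outliers in small samples. Distance-based
and density-based methods avoid relying on a single global distribution.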
5.2 Technical Issues and Related Technologies
Here the relationships between data mining and its peripheral technologies
are summarized in order to better understand the features of data
mining. Various technologies are related to data mining, such as
databases, information retrieval, and Web search (i.e., search engines); the
relationships between data mining and these technologies are described
below.
