Adaptable Similarity Search using Non-Relevant
Information
Ashwin T.V.
vashwin @in. ibm. con
Rahul Gupta
rahulgup #in. ibm. con
Sugata Ghosal
gsugata@in, ibm. con
IBM India Research Lab, Hauz Khas, New Delhi 110016, India
Abstract 1 Introduction
Many modern database applications require
content-based similarity search capability in
numeric attribute space. Further, users' no-
tion of similarity varies between search ses-
sions. Therefore online techniques for adap-
tively refining the similarity metric based on
relevance feedback from the user are neces-
sary. Existing methods use retrieved items
marked relevant by the user to refine the sim-
ilarity metric, without taking into account
the information about non-relevant (or un-
satisfactory) items. Consequently items in
database close to non-relevant ones continue
to be retrieved in further iterations. In this
paper a robust technique is proposed to incor-
porate non-relevant information to efficiently
discover the feasible search region. A de-
cision surface is determined to split the at-
tribute space into relevant and non-relevant
regions. The decision surface is composed
of hyperplanes, each of which is normal to
the minimum distance vector from a non-
relevant point to the convex hull of the rel-
evant points. A similarity metric, estimated
using the relevant objects is used to rank and
retrieve database objects in the relevant re-
gion. Experiments on simulated and bench-
mark datasets demonstrate robustness and st-
perior performance of the proposed technique
over existing adaptive similarity search tech-
niques.
Keywords :
Information retrieval, Database brows-
ing, Database navigation, Ellipsoid query processing,
Relevance ]eedback, Non-relevant judgement
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the VLDB copyright notice
and
the title of the publication and its date appear, and notice is
given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings
of the 28th VLDB Conference,
Hong Kong, China, 2002
In many modern database applications, it is neces-
sary to be able to pose queries in terms of similarity
of data objects rather than relational operations like
equality or inequality. Examples include finding sets
of stocks that behave in approximately the same (or
for that matter opposite) way in a temporal database
[16]; searching for structurally similar proteins from a
spatial database which can cause a particular medi-
cal condition [13]; retrieval of 3D objects from a CAD
database [2], finding similar objects based on content
from multimedia databases containing audio, image or
video [11]. Also, several approximation schemes have
been developed to efficiently process similarity queries
using multidimensional indexing structures [19, 1, 3].
In this paper, we focus on accuracy improvement of
similarity based retrieval of database objects using rel-
evance feedback.
To support similarity based modern database appli-
cations, multidimensional attribute or feature vectors
are extracted from the original object, and stored in
the database. Given a query object (associated with it
an attribute vector), database objects whose attribute
vectors are most similar to the query vector are re-
trieved for the user. Usually k top matches are re-
trieved. For text applications,
vector space model
with
cosine similarity metric is being widely used [17],[4].
x.q
The cosine similarity is defined as, S(2, q-) - II~[JT~l'
where ff.g stands for inner-product of ~7 and g, and
I lgll denotes the magnitude of ~7. On the other hand,
in
metric space model
of information retrieval, second-
order (L2) distance metrics are typically used. The
second-order L2 norm is a quadratic distance met-
ric and is defined as T~(~, q, Q)= (~-q-,)TQ(:~_
q-,).
Imposing constraints on the structure of the matrix
Q, metrics like Euclidean (Q is the identity matrix),
weighted Euclidean (diagonal Q) and generalized Eu-
clidean (symmetric Q with off-diagonal entries) are ob-
tained. Examples of application of quadratic distance
metrics include Euclidean metric in discrete wavelet
transform (DWT) based attribute space for time-series
stock data [16], color histogram space for color im-
age databases [11], Fourier descriptors (FD) for shape
databases [2]; weighted Euclidean metric for nmltime-
47

Get Proceedings 2002 VLDB Conference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.