Streaming Queries over Streaming Data
Sirish Chandrasekaran
Michael J. Franklin
University of California at Berkeley
{sirish,franklin}@cs.berkeley.edu
Abstract
Recent work on querying data streams has fo-
cused on systems where newly arriving data is
processed and continuously streamed to the user
in real-time. In many emerging applications, how-
ever, ad hoc queries and/or intermittent connectiv-
ity also require the processing of data that arrives
prior to query submission or during a period of
disconnection. For such applications, we have de-
veloped PSoup, a system that combines the pro-
cessing of ad-hoc and continuous queries by treat-
ing data and queries symmetrically, allowing new
queries to be applied to old data and new data to
be applied to old queries. PSoup also supports in-
termittent connectivity by separating the compu-
tation of query results from the delivery of those
results. PSoup builds on adaptive query process-
ing techniques developed in the Telegraph project
at UC Berkeley. In this paper, we describe PSoup
and present experiments that demonstrate the ef-
fectiveness of our approach.
1 Introduction
The proliferation of the Internet, the Web, and sensor networks
has fueled the development of applications that treat
data as a continuous stream, rather than as a fixed
set. Telephone call records, stock and sports tickers, and
data feeds from sensors are examples of streaming data.
Recently, a number of systems have been proposed to ad-
dress the mismatch between traditional database technol-
ogy and the needs of query processing over streaming data
(e.g., [HFCD+00, AF00, CDTW00, BW01, CCCC+02]).
This work has been supported in part by the National Science Foundation
under the ITR grants IIS0086057 and SI0122599, and by IBM, Microsoft,
Siemens, and the UC MICRO program.
Permission to copy without fee all or part of this material is granted pro-
vided that the copies are not made or distributed for direct commercial
advantage, the VLDB copyright notice and the title of the publication and
its date appear, and notice is given that copying is by permission of the
Very Large Data Base Endowment. To copy otherwise, or to republish,
requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference,
Hong Kong, China, 2002
In contrast to traditional DBMSs, which answer streams
of queries over a non-streaming database, these continuous
query (CQ) systems treat queries as fixed entities and
stream the data over them.
Previous systems allow only the queries or the data to
be streamed, but not both. As a result, they cannot support
queries that require access to both data that arrived previ-
ously and data that will arrive in the future. Furthermore,
existing CQ systems continuously deliver results as they
are computed. In many situations, however, such contin-
uous delivery may be infeasible or inefficient. Two such
scenarios are:
Data Recharging: Data Recharging [CFZ01] is a process
through which personal devices such as PDAs periodically
connect to the network to refresh their data contents. For
example, consider a business traveler who wishes to stay
apprised of information ranging from the movements of fi-
nancial markets to the latest football scores, all within a
certain historical window. These interests are encoded into
queries to be executed at a remote server, the results of
which must be downloaded to the user's PDA when it is
connected to the network infrastructure.
Monitoring: Consider a user who wants to track interesting
pieces of information such as the number of music
downloads from within his subnet in the last hour, or recent
postings on Slashdot (http://www.slashdot.org/) with a score
greater than a certain threshold. Even when online,
the user might only periodically wish to see summaries
of recent activity, rather than being interrupted by every
update. Aggregated over many users, the bandwidth and
server load wasted on transmitting data that is never ac-
cessed will be significant. A more efficient approach is to
return the current results of a standing query on demand.
To support such applications, we propose PSoup, a
query processor based on the Telegraph [HFCD+00] query
processing framework. The core insight in PSoup that al-
lows us to support such applications is that both data and
queries are streaming, and more importantly, they are duals
of each other: multiquery processing is viewed as a join of
query and data streams. In addition, PSoup partially
materializes results to support disconnected operation, and
to improve data throughput and query response times.
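
The duality described above can be illustrated with a minimal sketch. It is not PSoup's implementation: the class `SymmetricJoin`, its methods, and the dictionary-based tuples are hypothetical names chosen for illustration, and the sketch assumes simple selection predicates with no windows. A newly inserted query probes the stored data, a newly arriving tuple probes the stored queries, and the matches are materialized so that results can be fetched on demand rather than pushed continuously.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Query:
    """A standing query; the predicate form is an assumption of this sketch."""
    qid: int
    predicate: Callable[[dict], bool]


class SymmetricJoin:
    """Hypothetical illustration: queries and data are both stored and probed."""

    def __init__(self) -> None:
        self.data: List[dict] = []              # all data tuples seen so far
        self.queries: Dict[int, Query] = {}     # all registered queries
        self.results: Dict[int, List[int]] = {} # qid -> indices of matching tuples

    def insert_query(self, query: Query) -> None:
        # New query applied to old data.
        self.queries[query.qid] = query
        self.results[query.qid] = [
            i for i, tup in enumerate(self.data) if query.predicate(tup)
        ]

    def insert_data(self, tup: dict) -> None:
        # New data applied to old queries.
        index = len(self.data)
        self.data.append(tup)
        for qid, query in self.queries.items():
            if query.predicate(tup):
                self.results[qid].append(index)

    def fetch(self, qid: int) -> List[dict]:
        # Delivery is separate from computation: read the materialized matches.
        return [self.data[i] for i in self.results[qid]]


join = SymmetricJoin()
join.insert_data({"sym": "A", "price": 10})               # arrives before the query
join.insert_query(Query(1, lambda t: t["price"] > 5))     # probes old data
join.insert_data({"sym": "B", "price": 3})                # probes old queries
join.insert_data({"sym": "C", "price": 8})
matches = [t["sym"] for t in join.fetch(1)]               # tuples A and C
```

In this sketch the query sees tuple A even though A arrived before the query was registered, and a disconnected client can call `fetch` whenever it reconnects.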
1.1 Overview of the System
A user interacts with PSoup by initially registering a query
specification with the system. The system returns a handle