Abstract:
This paper presents the "situation manager", a tool that includes both a language and an efficient runtime execution mechanism aimed at reducing the complexity of active applications. This tool follows the observation that in many cases there is a gap between current tools that enable one to react to a single event (following the ECA: event-condition-action paradigm) and the reality in which a single event may not require any reaction; however, the reaction should be given to patterns over the event history. The concept of situation presented in this paper extends the concept of composite event in its expressive power, flexibility, and usability. This paper motivates the work, surveys other efforts in this area, and discusses both the language and the execution model.
B. Bhattacharjee, S. Padmanabhan, T. Malkemus, T. Lai, L. Cranston, M. Huras, “Efficient Query Processing for Multi-Dimensionally Clustered Tables in DB2,” Proc. Intl. Conf. Very Large Data Bases (VLDB), 2003.
Abstract:
We have introduced a Multi-Dimensional Clustering (MDC) physical layout scheme in DB2 version 8.0 for relational tables. Multi-Dimensional Clustering is based on the definition of one or more orthogonal clustering attributes (or expressions) of a table. The table is organized physically by associating records with similar values for the dimension attributes in a cluster. Each clustering key is allocated one or more blocks of physical storage with the aim of storing the multiple records belonging to the cluster in almost contiguous fashion. Block oriented indexes are created to access these blocks. In this paper, we describe novel techniques for query processing operations that provide significant performance improvements for MDC tables. Current database systems employ a repertoire of access methods including table scans, index scans, index ANDing, index ORing. We have extended these access methods for efficiently processing the block based MDC tables. One important concept at the core of processing MDC tables is the block oriented access technique. In addition, since MDC tables can include regular record oriented indexes, we employ novel techniques to combine block and record indexes. Block oriented processing is extended to nested loop joins and star joins as well. We show results from experiments using a star-schema database to validate our claims of performance with minimal overhead.
C. C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for Clustering Evolving Data Streams,” Proc. Intl. Conf. Very Large Data Bases (VLDB), 2003.
Abstract:
The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream renders most traditional algorithms too ineffcient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream.
The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume.
This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is divide the clustering process into an online component which periodically stores detailed summary statistics and an online component which uses only this summary statistics. The online component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of effcient choice, storage, and use of this statistical data for a fast data stream turns out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a micro-clustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.
L. Lim, M. Wang, J. S. Vitter, “SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads,” Proc. Intl. Conf. Very Large Data Bases (VLDB), 2003.
Abstract:
Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries. These selectivities are typically used for cost-based query optimization. While the problem of building an accurate histogram for a given attribute or attribute set has been well-studied, little attention has been given to the problem of building and tuning a set of histograms collectively for multidimensional queries in a self-managed manner based only on query feedback. In this paper, we present SASH, a Self-Adaptive Set of Histograms that addresses the problem of building and maintaining a set of histograms. SASH uses a novel two-phase method to automatically build and maintain itself using query feedback information only. In the online tuning phase, the current set of histograms is tuned in response to the estimation error of each query in an online manner. In the restructuring phase, a new and more accurate set of histograms replaces the current set of histograms. The new set of histograms (attribute sets and memory distribution) is found using information from a batch of query feedback. We present experimental results that show the effectiveness and accuracy of our approach.
M. Stillger, G. M. Lohman, V. Markl, M. Kandil, “LEO - DB2's LEarning Optimizer,” Proc. Intl. Conf. Very Large Data Bases (VLDB), 2001, pp. 19-28.
Abstract:
Most modern DBMS optimizers rely upon a cost model to choose the best query execution plan (QEP) for any given query. Cost estimates are heavily dependent upon the optimizer’s estimates for the number of rows that will result at each step of the QEP for complex queries involving many predicates and/or operations. These estimates rely upon statistics on the database and modeling assumptions that may or may not be true for a given database. In this paper we introduce LEO, DB2's LEarning Optimizer, as a comprehensive way to repair incorrect statistics and cardinality estimates of a query execution plan. By monitoring previously executed queries, LEO compares the optimizer’s estimates with actuals at each step in a QEP, and computes adjustments to cost estimates and statistics that may be used during future query optimizations. This analysis can be done either on-line or off-line on a separate system, and either incrementally or in batches. In this way, LEO introduces a feedback loop to query optimization that enhances the available information on the database where the most queries have occurred, allowing the optimizer to actually learn from its past mistakes. Our technique is general and can be applied to any operation in a QEP, including joins, derived results after several predicates have been applied, and even to DISTINCT and GROUP-BY operators. As shown by performance measurements on a 10 GB TPCH data set, the runtime overhead of LEO’s monitoring is insignificant, whereas the potential benefit to response time from more accurate cardinality and cost estimates can be orders of magnitude.
R. Agrawal, A. Evfimievski, R. Srikant, “Information Sharing Across Private Databases,” Proc. ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.
Abstract:
Literature on information integration across databases tacitly assumes that the data in each database can be revealed to the other databases. However, there is an increasing need for sharing information across autonomous entities in such a way that no information apart from the answer to the query is revealed. We formalize the notion of minimal information sharing across private databases, and develop protocols for intersection, equijoin, intersection size, and equijoin size. We also show how new applications can be built using the proposed protocols.
R. Fagin, P. G. Kolaitis, L. Popa. , “Data Exchange: Getting to the Core,” Proc. ACM Symp. on Principles of Database Systems (PODS), San Diego, California, June 2003.
Abstract:
Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. Given a source instance, there may be many target instances that satisfy the constraints of the data exchange problem. In an earlier paper, we identified a special class of solutions that we call ``universal''. A universal solution has homomorphisms into every possible solution, and hence is a ``most general possible'' solution. However, there may be many universal solutions for a given source instance. This naturally brings up the question of whether there is a ``best'' universal solution, and hence a best solution for data exchange. In this paper, we answer this question by considering the well-known notion of the “core” of a structure, a notion that was first studied in graph theory, but has also played a role in conjunctive-query processing. The core of a structure is the smallest substructure that is also a homomorphic image of the structure. All universal solutions have the same core (up to isomorphism); we show that this core is also a universal solution, and hence the smallest universal solution. The uniqueness of the core of a universal solution together with its minimality make the core an ideal solution for data exchange. Furthermore, we show that the core is the best among all universal solutions for answering conjunctive queries with inequalities. Finally, we consider the complexity of producing the core. We show that the complexity of deciding if a graph H is the core of a graph G is DP-complete. Consequently, unless NP = coNP, there is no polynomial-time algorithm for producing the core. In the case of data exchange, however, we give natural conditions where there are polynomial-time algorithms for computing the core.
