LEO - Learning Emperical Results in Query Optimization

Innovation Matters


The learning optimizer (LEO) project uses autonomic feedback loops to create a self-tuning database query optimizer. LEO self-validates and adjusts its model to improve query optimization and execution without requiring any user interaction. The goal of LEO is to reduce the total cost of owning database management systems by simplifying database administration.


Structured query language (SQL) has emerged as the industry standard for querying relational database management systems, largely because a user need only specify what data is wanted, not the details of how to access that data. A query optimizer uses a mathematical model of query execution to automatically determine the best way to access and process any given SQL query. This model is heavily dependent upon the optimizer’s estimates for the number of rows that will result at each step of the query execution plan (QEP), especially for complex queries involving many predicates and/or operations. These estimates rely upon statistics on the database and modeling assumptions that may or may not be true for a given database.

The goal of the LEO (Learning Optimizer) project is to develop an autonomic query optimizer that automatically self-validates its model without requiring any user interaction to repair incorrect statistics or cardinality estimates. By monitoring queries as they execute, the autonomic optimizer compares the optimizer’s estimates with actuals at each step in a QEP, and then computes adjustments to its estimates that may be used during future optimizations of similar queries. Moreover, estimation errors can also trigger re-optimization of a query in mid-execution. The autonomic refinement of the optimizer’s model can result in a reduction of query execution time by orders of magnitude at negligible additional run-time cost. Database users will benefit from LEO by reduced administration time, fewer problem queries, and overall improved query performance with increased robustness and predictability of query response times.

LEO
The LEO Feedback Loop

The LEO feedback loop is comprised of four steps: monitoring, analysis, feedback and feedback exploitation. The monitoring component saves at query compilation time the cardinality estimates derived by the optimizer for the best (i.e., least-cost) plan, and during query execution saves the actual cardinalities observed for that plan. The analysis component uses the monitored information to identify modeling errors and compute corrective adjustments. This analysis is a standalone process that may be run separately from the database server, possibly on another system. The feedback component modifies the catalog statistics of the database according to the learned information. The exploitation component closes the feedback loop by using the learned knowledge in the system catalog to provide adjustments to the query optimizer’s cardinality estimates.

The four components can operate independently, but form a consecutive sequence that constitutes a continuous learning mechanism by incrementally capturing plans, monitoring their execution, analyzing the monitor output, and adjusting the optimizer's model with the learned knowledge. In addition to using query feedback to reactively adjust the model for future queries, the LEO project considers progressive optimization to re-optimize a query on the fly whenever significant modeling errors are detected, as well as proactive correlation detection through sampling to bootstrap the Learning Optimizer.

The LEO project is multi-disciplinary, as it touches on database research, statistics, information theory, data mining and artificial intelligence. Major research challenges addressed in the LEO project are:
(1) ensuring stability and convergence of the autonomic system,
(2) guaranteeing consistency of the overall optimizer's model upon refinements,
(3) carrying out efficient monitoring in shared-nothing configurations.


Related Publications  

Ashraf I. Aboulnaga, Peter J. Haas, Sam Lightstone, Guy M. Lohman, Volker G. Markl, Ivan Popivanov and Vijayshankar Raman. Automated Statistics Collection in DB2 Stinger. VLDB 2004. June 2004.

Ihab F. Ilyas, Volker Markl, Peter J. Haas, Paul Brown and Ashraf Aboulnaga. Automatic relationship discovery in self-managing database systems. International Conference Autonomic Computing (ICAC '04). 2004.

Ihab Ilyas, Peter J. Haas, Volker G. Markl, Paul G. Brown and Ashraf I. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004 . ACM, February 2004.

Ihab Ilyas, Volker G. Markl, Peter J. Haas, Paul G. Brown and Ashraf I. Aboulnaga. CORDS: Automatic Generation of Correlation Statistics in DB2. VLDB 2004. June 2004.

Volker Markl, G. M. Lohman and V. Raman. LEO: An Autonomic Query Optimizer for DB2. IBM Systems Journal Special Issue on Autonomic Computing, January 2001.

Vijayshankar Raman, Volker G. Markl, David E. Simmen, Guy M. Lohman and Mir Hamid Pirahesh. Progressive Optimization in Action. VLDB 2004. June 2004.

M. Stillger, G. M. Lohman, V. Markl and M. Kandil. LEO - DB2's LEarning Optimizer. Proc. Intl. Conf. Very Large Data Bases (VLDB). 2001.


News and Information:
Automated Statistics Profiling as the first delivery of LEO technology into DB2 will be part of DB2 Stinger.

The LEO project will be presented at VLDB 2004 in Toronto in August by one talk and two demos.

"Automated Statistics Profiling" will be featured in a talk at IDUG Europe in October in Prague, as well as at the DB2 North America Tech Confernece 2004 in Las Vegas.

Rate this article

Innovator's corner  

Volker MarklVolker Markl Researcher

What is the most exciting potential future use for the work you're doing?
A learning optimizer (LEO) is a major step towards self-tuning database management systems (DBMS). This autonomic technology significantly reduces the administration needs and thus the total cost of ownership for a DBMS. The major potential of LEO is to make databases transparent to the end-user, and even could enable DBA-less databases for small to medium businesses. Automatic problem determination and solution are a major requirement for an on-demand computing society, and feedback loops as used by LEO will be a major enabling factor in this world, extending well beyond databases.


What is the most interesting part of your research?
I very much enjoy the multidisciplinary aspect of LEO. Autonomic computing combines many sciences in the areas of math, computer science, and engineering. To create a self-tuning optimizer requires considering systems architecture, software engineering practices, database systems, control theory, statistics, information theory and artificial intelligence. Combining all of these areas is a challenge, but also a very rewarding undertaking.


What inspired you to go into this field?
The magnitude of the problem, as well the potential impact that a solution will have. The consequences will be the creation of new applications and markets for database management systems. Being a generalist, another major motivation for me is the fact that the technology developed in LEO is not limited to DBMS, but is applicable to many other domains and systems, like operating systems, workflow management, and even business applications.


What is your favorite invention of all time?
I am very impressed by simple inventions that have a drastic impact on society and are used by almost everybody. I find it hard to pinpoint a single invention. To me a true genius invents something simple, and even though nobody has thought about that invention, when knowing it everybody would say "Well, that is a simple idea, I could have come up with that.” From a historic perspective, this includes items that we nowadays take for granted, like the telephone, radio waves or electricity. I am also very impressed by people who challenge traditional thinking. In this way, Goedel's proof to me is the greatest proof of all times. In computer science, next to the great abstraction of Turing machines and the idea of combining instructions and data, I am very impressed with invention of B-trees. The idea to have a tree grow upside down, by splitting the root instead of adding leaves, is simple, but allows for efficient algorithms together with worst case guarantees for basic operations of insertion, update and delete, making B-Trees one of the most important enabling technologies of database systems. The most significant recent invention to me has been search engines, more specifically Google, as it has drastically simplified the way I research literature or procure information in general.

Research team  

Related Research  

Disciplines: Computer Science
Research Areas: Data Management
Research Labs: Almaden Research Center