summaryrefslogtreecommitdiff
path: root/src/backend/optimizer/README
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/optimizer/README')
-rw-r--r--src/backend/optimizer/README148
1 files changed, 79 insertions, 69 deletions
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 20a4e47742..bbc1204395 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -1,6 +1,20 @@
Summary
-------
+These directories take the Query structure returned by the parser, and
+generate a plan used by the executor. The /plan directory generates the
+actual output plan, the /path code generates all possible ways to join the
+tables, and /prep handles special cases like inheritance. /util is utility
+stuff. /geqo is the separate "genetic optimization" planner --- it does
+a semi-random search through the join tree space, rather than exhaustively
+considering all possible join trees. (But each join considered by geqo
+is given to /path to create paths for, so we consider all possible
+implementation paths for each specific join even in GEQO mode.)
+
+
+Join Tree Construction
+----------------------
+
The optimizer generates optimal query plans by doing a more-or-less
exhaustive search through the ways of executing the query. During
the planning/optimizing process, we build "Path" trees representing
@@ -26,8 +40,17 @@ the WHERE clause "tab1.col1 = tab2.col1" generates a JoinInfo for tab1
listing tab2 as an unjoined relation, and also one for tab2 showing tab1
as an unjoined relation.
+If we have only a single base relation in the query, we are done here.
+Otherwise we have to figure out how to join the base relations into a
+single join relation.
+
2) Consider joining each RelOptInfo to each other RelOptInfo specified in
its RelOptInfo.joininfo, and generate a Path for each possible join method.
+(If we have a RelOptInfo with no join clauses, we have no choice but to
+generate a clauseless Cartesian-product join; so we consider joining that
+rel to each other available rel. But in the presence of join clauses we
+will only consider joins that use available join clauses.)
+
At this stage each input RelOptInfo is a single relation, so we are joining
every relation to the other relations as joined in the WHERE clause. We
generate a new "join" RelOptInfo for each possible combination of two
@@ -53,9 +76,17 @@ that represent the scan methods used for the two input relations.
3) If we only had two base relations, we are done: we just pick the
cheapest path for the join RelOptInfo. If we had more than two, we now
need to consider ways of joining join RelOptInfos to each other to make
-join RelOptInfos that represent more than two base relations. This process
-is repeated until we have finally built a RelOptInfo that represents all
-the base relations in the query. Then we pick its cheapest Path.
+join RelOptInfos that represent more than two base relations.
+
+The join tree is constructed using a "dynamic programming" algorithm:
+in the first pass (already described) we consider ways to create join rels
+representing exactly two base relations. The second pass considers ways
+to make join rels that represent exactly three base relations; the next pass,
+four relations, etc. The last pass considers how to make the final join
+relation that includes all base rels --- obviously there can be only one
+join rel at this top level, whereas there can be more than one join rel
+at lower levels. At each level we use joins that follow available join
+clauses, if possible, just as described for the first level.
For example:
@@ -69,6 +100,7 @@ For example:
{1 2},{2 3},{3 4}
{1 2 3},{2 3 4}
{1 2 3 4}
+ (other possibilities will be excluded for lack of join clauses)
SELECT *
FROM tab1, tab2, tab3, tab4
@@ -78,56 +110,43 @@ For example:
Tables 1, 2, 3, and 4 are joined as:
{1 2},{1 3},{1 4}
- {1 2 3},{1 3 4},{1,2,4}
+ {1 2 3},{1 3 4},{1 2 4}
{1 2 3 4}
-In the default left-handed joins, each RelOptInfo adds one
-single-relation RelOptInfo in each join pass, and the added RelOptInfo
-is always the inner relation in the join. In right-handed joins, the
-added RelOptInfo is the outer relation in the join. In bushy plans,
-multi-relation RelOptInfo's can be joined to other multi-relation
-RelOptInfo's.
+We consider left-handed plans (the outer rel of an upper join is a joinrel,
+but the inner is always a base rel); right-handed plans (outer rel is always
+a base rel); and bushy plans (both inner and outer can be joins themselves).
+For example, when building {1 2 3 4} we consider joining {1 2 3} to {4}
+(left-handed), {4} to {1 2 3} (right-handed), and {1 2} to {3 4} (bushy),
+among other choices. Although the jointree scanning code produces these
+potential join combinations one at a time, all the ways to produce the
+same set of joined base rels will share the same RelOptInfo, so the paths
+produced from different join combinations that produce equivalent joinrels
+will compete in add_pathlist.
+
+Once we have built the final join rel, we use either the cheapest path
+for it or the cheapest path with the desired ordering (if that's cheaper
+than applying a sort to the cheapest other path).
+
Optimizer Functions
-------------------
-These directories take the Query structure returned by the parser, and
-generate a plan used by the executor. The /plan directory generates the
-actual output plan, the /path code generates all possible ways to join the
-tables, and /prep handles special cases like inheritance. /util is utility
-stuff. /geqo is the separate "genetic optimization" planner --- it does
-a semi-random search rather than exhaustively considering all possible
-join trees.
-
planner()
handle inheritance by processing separately
-init_query_planner()
preprocess target list
preprocess qualifications(WHERE)
--query_planner()
- cnfify()
- Summary:
-
- Simple cases with all AND's are handled by removing the AND's:
-
- convert: a = 1 AND b = 2 AND c = 3
- to: a = 1, b = 2, c = 3
-
- Qualifications with OR's are handled differently. OR's inside AND
- clauses are not modified drastically:
-
- convert: a = 1 AND b = 2 AND (c = 3 OR d = 4)
- to: a = 1, b = 2, c = 3 OR d = 4
-
- OR's in the upper level are more complex to handle:
-
- convert: (a = 1 AND b = 2) OR c = 3
- to: (a = 1 OR c = 3) AND (b = 2 OR c = 3)
- finally: (a = 1 OR c = 3), (b = 2 OR c = 3)
-
- These clauses all have to be true for a result to be returned,
- so the optimizer can choose the most restrictive clauses.
-
+ simplify constant subexpressions
+ canonicalize_qual()
+ Attempt to reduce WHERE clause to either CNF or DNF canonical form.
+ CNF (top-level-AND) is preferred, since the optimizer can then use
+ any of the AND subclauses to filter tuples; but quals that are in
+ or close to DNF form will suffer exponential expansion if we try to
+ force them to CNF. In pathological cases either transform may expand
+ the qual unreasonably; so we may have to leave it un-normalized,
+ thereby reducing the accuracy of selectivity estimates.
pull out constants from target list
get a target list that only contains column names, no expressions
if none, then return
@@ -142,20 +161,14 @@ planner()
find selectivity of columns used in joins
-----make_one_rel_by_joins()
jump to geqo if needed
- again:
- make_rels_by_joins():
- for each joinrel:
- make_rels_by_clause_joins()
- for each rel's joininfo list:
- if a join from the join clause adds only one relation, do the join
- or make_rels_by_clauseless_joins()
- update_rels_pathlist_for_joins()
- generate nested,merge,hash join paths for new rel's created above
- merge_rels_with_same_relids()
- merge RelOptInfo paths that have the same relids because of joins
- rels_set_cheapest()
- set cheapest path
- if all relations in one RelOptInfo, return
+ else call make_rels_by_joins() for each level of join tree needed
+ make_rels_by_joins():
+ For each joinrel of the prior level, do make_rels_by_clause_joins()
+ if it has join clauses, or make_rels_by_clauseless_joins() if not.
+ Also generate "bushy plan" joins between joinrels of lower levels.
+ Back at make_one_rel_by_joins(), apply set_cheapest() to extract the
+ cheapest path for each newly constructed joinrel.
+ Loop back if this wasn't the top join level.
do group(GROUP)
do aggregate
put back constants
@@ -164,7 +177,6 @@ planner()
make sort(ORDER BY)
-
Optimizer Data Structures
-------------------------
@@ -174,13 +186,13 @@ RelOptInfo - a relation or joined relations
JoinInfo - join clauses, including the relids needed for the join
Path - every way to generate a RelOptInfo(sequential,index,joins)
- SeqScan - a plain Path node with nodeTag = T_SeqScan
+ SeqScan - a plain Path node with nodeTag = T_SeqScan
IndexPath - index scans
NestPath - nested-loop joins
MergePath - merge joins
HashPath - hash joins
- PathKeys - a data structure representing the ordering of a path
+ PathKeys - a data structure representing the ordering of a path
The optimizer spends a good deal of its time worrying about the ordering
of the tuples returned by a path. The reason this is useful is that by
@@ -192,21 +204,19 @@ generated during the optimization process are marked with their sort order
(to the extent that it is known) for possible use by a higher-level merge.
It is also possible to avoid an explicit sort step to implement a user's
-ORDER BY clause if the final path has the right ordering already.
-Currently, this is not very well implemented: we avoid generating a
-redundant sort if the chosen path has the desired order, but we do not do
-anything to encourage the selection of such a path --- so we only avoid the
-sort if the path that would be chosen anyway (because it is cheapest
-without regard to its ordering) is properly sorted. The path winnowing
-process needs to be aware of the desired output order and account for the
-cost of doing an explicit sort while it is choosing the best path.
+ORDER BY clause if the final path has the right ordering already, so the
+sort ordering is of interest even at the top level. subplanner() will
+look for the cheapest path with a sort order matching the desired order,
+and will compare its cost to the cost of using the cheapest-overall path
+and doing an explicit sort.
When we are generating paths for a particular RelOptInfo, we discard a path
if it is more expensive than another known path that has the same or better
sort order. We will never discard a path that is the only known way to
-achieve a given sort order. In this way, the next level up will have the
-maximum freedom to build mergejoins without sorting, since it can pick from
-any of the paths retained for its inputs.
+achieve a given sort order (without an explicit sort, that is). In this
+way, the next level up will have the maximum freedom to build mergejoins
+without sorting, since it can pick from any of the paths retained for its
+inputs.
See path/pathkeys.c for an explanation of the PathKeys data structure that
represents what is known about the sort order of a particular Path.