diff options
Diffstat (limited to 'src/backend/optimizer/README')
| -rw-r--r-- | src/backend/optimizer/README | 148 |
1 files changed, 79 insertions, 69 deletions
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README index 20a4e47742..bbc1204395 100644 --- a/src/backend/optimizer/README +++ b/src/backend/optimizer/README @@ -1,6 +1,20 @@ Summary ------- +These directories take the Query structure returned by the parser, and +generate a plan used by the executor. The /plan directory generates the +actual output plan, the /path code generates all possible ways to join the +tables, and /prep handles special cases like inheritance. /util is utility +stuff. /geqo is the separate "genetic optimization" planner --- it does +a semi-random search through the join tree space, rather than exhaustively +considering all possible join trees. (But each join considered by geqo +is given to /path to create paths for, so we consider all possible +implementation paths for each specific join even in GEQO mode.) + + +Join Tree Construction +---------------------- + The optimizer generates optimal query plans by doing a more-or-less exhaustive search through the ways of executing the query. During the planning/optimizing process, we build "Path" trees representing @@ -26,8 +40,17 @@ the WHERE clause "tab1.col1 = tab2.col1" generates a JoinInfo for tab1 listing tab2 as an unjoined relation, and also one for tab2 showing tab1 as an unjoined relation. +If we have only a single base relation in the query, we are done here. +Otherwise we have to figure out how to join the base relations into a +single join relation. + 2) Consider joining each RelOptInfo to each other RelOptInfo specified in its RelOptInfo.joininfo, and generate a Path for each possible join method. +(If we have a RelOptInfo with no join clauses, we have no choice but to +generate a clauseless Cartesian-product join; so we consider joining that +rel to each other available rel. But in the presence of join clauses we +will only consider joins that use available join clauses.) + At this stage each input RelOptInfo is a single relation, so we are joining every relation to the other relations as joined in the WHERE clause. We generate a new "join" RelOptInfo for each possible combination of two @@ -53,9 +76,17 @@ that represent the scan methods used for the two input relations. 3) If we only had two base relations, we are done: we just pick the cheapest path for the join RelOptInfo. If we had more than two, we now need to consider ways of joining join RelOptInfos to each other to make -join RelOptInfos that represent more than two base relations. This process -is repeated until we have finally built a RelOptInfo that represents all -the base relations in the query. Then we pick its cheapest Path. +join RelOptInfos that represent more than two base relations. + +The join tree is constructed using a "dynamic programming" algorithm: +in the first pass (already described) we consider ways to create join rels +representing exactly two base relations. The second pass considers ways +to make join rels that represent exactly three base relations; the next pass, +four relations, etc. The last pass considers how to make the final join +relation that includes all base rels --- obviously there can be only one +join rel at this top level, whereas there can be more than one join rel +at lower levels. At each level we use joins that follow available join +clauses, if possible, just as described for the first level. For example: @@ -69,6 +100,7 @@ For example: {1 2},{2 3},{3 4} {1 2 3},{2 3 4} {1 2 3 4} + (other possibilities will be excluded for lack of join clauses) SELECT * FROM tab1, tab2, tab3, tab4 @@ -78,56 +110,43 @@ For example: Tables 1, 2, 3, and 4 are joined as: {1 2},{1 3},{1 4} - {1 2 3},{1 3 4},{1,2,4} + {1 2 3},{1 3 4},{1 2 4} {1 2 3 4} -In the default left-handed joins, each RelOptInfo adds one -single-relation RelOptInfo in each join pass, and the added RelOptInfo -is always the inner relation in the join. In right-handed joins, the -added RelOptInfo is the outer relation in the join. In bushy plans, -multi-relation RelOptInfo's can be joined to other multi-relation -RelOptInfo's. +We consider left-handed plans (the outer rel of an upper join is a joinrel, +but the inner is always a base rel); right-handed plans (outer rel is always +a base rel); and bushy plans (both inner and outer can be joins themselves). +For example, when building {1 2 3 4} we consider joining {1 2 3} to {4} +(left-handed), {4} to {1 2 3} (right-handed), and {1 2} to {3 4} (bushy), +among other choices. Although the jointree scanning code produces these +potential join combinations one at a time, all the ways to produce the +same set of joined base rels will share the same RelOptInfo, so the paths +produced from different join combinations that produce equivalent joinrels +will compete in add_pathlist. + +Once we have built the final join rel, we use either the cheapest path +for it or the cheapest path with the desired ordering (if that's cheaper +than applying a sort to the cheapest other path). + Optimizer Functions ------------------- -These directories take the Query structure returned by the parser, and -generate a plan used by the executor. The /plan directory generates the -actual output plan, the /path code generates all possible ways to join the -tables, and /prep handles special cases like inheritance. /util is utility -stuff. /geqo is the separate "genetic optimization" planner --- it does -a semi-random search rather than exhaustively considering all possible -join trees. - planner() handle inheritance by processing separately -init_query_planner() preprocess target list preprocess qualifications(WHERE) --query_planner() - cnfify() - Summary: - - Simple cases with all AND's are handled by removing the AND's: - - convert: a = 1 AND b = 2 AND c = 3 - to: a = 1, b = 2, c = 3 - - Qualifications with OR's are handled differently. OR's inside AND - clauses are not modified drastically: - - convert: a = 1 AND b = 2 AND (c = 3 OR d = 4) - to: a = 1, b = 2, c = 3 OR d = 4 - - OR's in the upper level are more complex to handle: - - convert: (a = 1 AND b = 2) OR c = 3 - to: (a = 1 OR c = 3) AND (b = 2 OR c = 3) - finally: (a = 1 OR c = 3), (b = 2 OR c = 3) - - These clauses all have to be true for a result to be returned, - so the optimizer can choose the most restrictive clauses. - + simplify constant subexpressions + canonicalize_qual() + Attempt to reduce WHERE clause to either CNF or DNF canonical form. + CNF (top-level-AND) is preferred, since the optimizer can then use + any of the AND subclauses to filter tuples; but quals that are in + or close to DNF form will suffer exponential expansion if we try to + force them to CNF. In pathological cases either transform may expand + the qual unreasonably; so we may have to leave it un-normalized, + thereby reducing the accuracy of selectivity estimates. pull out constants from target list get a target list that only contains column names, no expressions if none, then return @@ -142,20 +161,14 @@ planner() find selectivity of columns used in joins -----make_one_rel_by_joins() jump to geqo if needed - again: - make_rels_by_joins(): - for each joinrel: - make_rels_by_clause_joins() - for each rel's joininfo list: - if a join from the join clause adds only one relation, do the join - or make_rels_by_clauseless_joins() - update_rels_pathlist_for_joins() - generate nested,merge,hash join paths for new rel's created above - merge_rels_with_same_relids() - merge RelOptInfo paths that have the same relids because of joins - rels_set_cheapest() - set cheapest path - if all relations in one RelOptInfo, return + else call make_rels_by_joins() for each level of join tree needed + make_rels_by_joins(): + For each joinrel of the prior level, do make_rels_by_clause_joins() + if it has join clauses, or make_rels_by_clauseless_joins() if not. + Also generate "bushy plan" joins between joinrels of lower levels. + Back at make_one_rel_by_joins(), apply set_cheapest() to extract the + cheapest path for each newly constructed joinrel. + Loop back if this wasn't the top join level. do group(GROUP) do aggregate put back constants @@ -164,7 +177,6 @@ planner() make sort(ORDER BY) - Optimizer Data Structures ------------------------- @@ -174,13 +186,13 @@ RelOptInfo - a relation or joined relations JoinInfo - join clauses, including the relids needed for the join Path - every way to generate a RelOptInfo(sequential,index,joins) - SeqScan - a plain Path node with nodeTag = T_SeqScan + SeqScan - a plain Path node with nodeTag = T_SeqScan IndexPath - index scans NestPath - nested-loop joins MergePath - merge joins HashPath - hash joins - PathKeys - a data structure representing the ordering of a path + PathKeys - a data structure representing the ordering of a path The optimizer spends a good deal of its time worrying about the ordering of the tuples returned by a path. The reason this is useful is that by @@ -192,21 +204,19 @@ generated during the optimization process are marked with their sort order (to the extent that it is known) for possible use by a higher-level merge. It is also possible to avoid an explicit sort step to implement a user's -ORDER BY clause if the final path has the right ordering already. -Currently, this is not very well implemented: we avoid generating a -redundant sort if the chosen path has the desired order, but we do not do -anything to encourage the selection of such a path --- so we only avoid the -sort if the path that would be chosen anyway (because it is cheapest -without regard to its ordering) is properly sorted. The path winnowing -process needs to be aware of the desired output order and account for the -cost of doing an explicit sort while it is choosing the best path. +ORDER BY clause if the final path has the right ordering already, so the +sort ordering is of interest even at the top level. subplanner() will +look for the cheapest path with a sort order matching the desired order, +and will compare its cost to the cost of using the cheapest-overall path +and doing an explicit sort. When we are generating paths for a particular RelOptInfo, we discard a path if it is more expensive than another known path that has the same or better sort order. We will never discard a path that is the only known way to -achieve a given sort order. In this way, the next level up will have the -maximum freedom to build mergejoins without sorting, since it can pick from -any of the paths retained for its inputs. +achieve a given sort order (without an explicit sort, that is). In this +way, the next level up will have the maximum freedom to build mergejoins +without sorting, since it can pick from any of the paths retained for its +inputs. See path/pathkeys.c for an explanation of the PathKeys data structure that represents what is known about the sort order of a particular Path. |
