summaryrefslogtreecommitdiff
path: root/doc/src/sgml/perform.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/perform.sgml')
-rw-r--r--doc/src/sgml/perform.sgml214
1 files changed, 125 insertions, 89 deletions
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 32e17ee5f8..b4b8f8dcb8 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1071,26 +1071,42 @@ WHERE tablename = 'road';
are independent of each other,
an assumption that does not hold when column values are correlated.
Regular statistics, because of their per-individual-column nature,
- do not capture the knowledge of cross-column correlation;
- <firstterm>multivariate statistics</firstterm> can be used to instruct
- the server to obtain statistics across such a set of columns,
- which are later used by the query optimizer
- to determine cardinality and selectivity
- of clauses involving those columns.
- Multivariate statistics are currently the only use of
- <firstterm>extended statistics</firstterm>.
+ cannot capture any knowledge about cross-column correlation.
+ However, <productname>PostgreSQL</> has the ability to compute
+ <firstterm>multivariate statistics</firstterm>, which can capture
+ such information.
</para>
<para>
- Extended statistics are created using
+ Because the number of possible column combinations is very large,
+ it's impractical to compute multivariate statistics automatically.
+ Instead, <firstterm>extended statistics objects</firstterm>, more often
+ called just <firstterm>statistics objects</>, can be created to instruct
+ the server to obtain statistics across interesting sets of columns.
+ </para>
+
+ <para>
+ Statistics objects are created using
<xref linkend="sql-createstatistics">, which see for more details.
- Data collection is deferred until the next <command>ANALYZE</command>
- on the table, after which the stored values can be examined in the
+ Creation of such an object merely creates a catalog entry expressing
+ interest in the statistics. Actual data collection is performed
+ by <command>ANALYZE</command> (either a manual command, or background
+ auto-analyze). The collected values can be examined in the
<link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
catalog.
</para>
<para>
+ <command>ANALYZE</command> computes extended statistics based on the same
+ sample of table rows that it takes for computing regular single-column
+ statistics. Since the sample size is increased by increasing the
+ statistics target for the table or any of its columns (as described in
+ the previous section), a larger statistics target will normally result in
+ more accurate extended statistics, as well as more time spent calculating
+ them.
+ </para>
+
+ <para>
The following subsections describe the types of extended statistics
that are currently supported.
</para>
@@ -1099,142 +1115,162 @@ WHERE tablename = 'road';
<title>Functional Dependencies</title>
<para>
- The simplest type of extended statistics are functional dependencies,
- a concept used in definitions of database normal forms.
- Put simply, it is said that column <literal>b</> is functionally
- dependent on column <literal>a</> if knowledge of the value of
- <literal>a</> is sufficient to determine the value of <literal>b</>.
- In normalized databases, functional dependencies are allowed only on
- primary keys and superkeys. However, many data sets are in practice not
- fully normalized for various reasons; intentional denormalization for
- performance reasons is a common example.
+ The simplest type of extended statistics tracks <firstterm>functional
+ dependencies</>, a concept used in definitions of database normal forms.
+ We say that column <structfield>b</> is functionally dependent on
+ column <structfield>a</> if knowledge of the value of
+ <structfield>a</> is sufficient to determine the value
+ of <structfield>b</>, that is there are no two rows having the same value
+ of <structfield>a</> but different values of <structfield>b</>.
+ In a fully normalized database, functional dependencies should exist
+ only on primary keys and superkeys. However, in practice many data sets
+ are not fully normalized for various reasons; intentional
+ denormalization for performance reasons is a common example.
+ Even in a fully normalized database, there may be partial correlation
+ between some columns, which can be expressed as partial functional
+ dependency.
</para>
<para>
- The existance of functional dependencies directly affects the accuracy
- of estimates in certain queries.
- The reason is that conditions on the dependent columns do not
- restrict the result set, but the query planner (lacking functional
- dependency knowledge) considers them independent, resulting in
- underestimates.
- To inform the planner about the functional dependencies, we collect
- measurements of dependency during <command>ANALYZE</>. Assessing
- the degree of dependency between all sets of columns would be
- prohibitively expensive, so the search is limited to potential
- dependencies defined using the <literal>dependencies</> option of
- extended statistics. It is advisable to create
- <literal>dependencies</> statistics if and only if functional
- dependencies actually exist, to avoid unnecessary overhead on both
- <command>ANALYZE</> and query planning.
+ The existence of functional dependencies directly affects the accuracy
+ of estimates in certain queries. If a query contains conditions on
+ both the independent and the dependent column(s), the
+ conditions on the dependent columns do not further reduce the result
+ size; but without knowledge of the functional dependency, the query
+ planner will assume that the conditions are independent, resulting
+ in underestimating the result size.
</para>
<para>
- To inspect functional dependencies on a statistics
- <literal>stts</literal>, you may do this:
+ To inform the planner about functional dependencies, <command>ANALYZE</>
+ can collect measurements of cross-column dependency. Assessing the
+ degree of dependency between all sets of columns would be prohibitively
+ expensive, so data collection is limited to those groups of columns
+ appearing together in a statistics object defined with
+ the <literal>dependencies</> option. It is advisable to create
+ <literal>dependencies</> statistics only for column groups that are
+ strongly correlated, to avoid unnecessary overhead in both
+ <command>ANALYZE</> and later query planning.
+ </para>
+
+ <para>
+ Here is an example of collecting functional-dependency statistics:
<programlisting>
-CREATE STATISTICS stts (dependencies)
- ON zip, city FROM zipcodes;
+CREATE STATISTICS stts (dependencies) ON zip, city FROM zipcodes;
+
ANALYZE zipcodes;
+
SELECT stxname, stxkeys, stxdependencies
FROM pg_statistic_ext
- WHERE stxname = 'stts';
+ WHERE stxname = 'stts';
stxname | stxkeys | stxdependencies
---------+---------+------------------------------------------
stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130}
(1 row)
</programlisting>
- where it can be seen that column 1 (a zip code) fully determines column
+ Here it can be seen that column 1 (zip code) fully determines column
5 (city) so the coefficient is 1.0, while city only determines zip code
about 42% of the time, meaning that there are many cities (58%) that are
represented by more than a single ZIP code.
</para>
<para>
- When computing the selectivity, the planner inspects all conditions and
- attempts to identify which conditions are already implied by other
- conditions. The selectivity estimates from any redundant conditions are
- ignored from a selectivity point of view. In the example query above,
- the selectivity estimates for either of the conditions may be eliminated,
- thus improving the overall estimate.
+ When computing the selectivity for a query involving functionally
+ dependent columns, the planner adjusts the per-condition selectivity
+ estimates using the dependency coefficients so as not to produce
+ an underestimate.
</para>
<sect4>
<title>Limitations of Functional Dependencies</title>
<para>
- Functional dependencies are a very simple type of statistics, and
- as such have several limitations. The first limitation is that they
- only work with simple equality conditions, comparing columns and constant
- values. It's not possible to use them to eliminate equality conditions
- comparing two columns or a column to an expression, range clauses,
- <literal>LIKE</> or any other type of conditions.
+ Functional dependencies are currently only applied when considering
+ simple equality conditions that compare columns to constant values.
+ They are not used to improve estimates for equality conditions
+ comparing two columns or comparing a column to an expression, nor for
+ range clauses, <literal>LIKE</> or any other type of condition.
</para>
<para>
- When eliminating the implied conditions, the planner assumes that the
- conditions are compatible. Consider the following example, where
- this assumption does not hold:
-
+ When estimating with functional dependencies, the planner assumes that
+ conditions on the involved columns are compatible and hence redundant.
+ If they are incompatible, the correct estimate would be zero rows, but
+ that possibility is not considered. For example, given a query like
<programlisting>
-EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
- QUERY PLAN
------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
- Filter: ((a = 1) AND (b = 10))
- Rows Removed by Filter: 10000
+SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
</programlisting>
-
- While there are no rows with such combination of values, the planner
- is unable to verify whether the values match &mdash; it only knows that
- the columns are functionally dependent.
+ the planner will disregard the <structfield>city</> clause as not
+ changing the selectivity, which is correct. However, it will make
+ the same assumption about
+<programlisting>
+SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
+</programlisting>
+ even though there will really be zero rows satisfying this query.
+ Functional dependency statistics do not provide enough information
+ to conclude that, however.
</para>
<para>
- This assumption is related to queries executed on the database; in many
- cases, it's actually satisfied (e.g. when the GUI only allows selecting
- compatible values). But if that's not the case, functional dependencies
- may not be a viable option.
+ In many practical situations, this assumption is usually satisfied;
+ for example, there might be a GUI in the application that only allows
+ selecting compatible city and zipcode values to use in a query.
+ But if that's not the case, functional dependencies may not be a viable
+ option.
</para>
</sect4>
</sect3>
<sect3>
- <title>Multivariate N-Distinct Coefficients</title>
+ <title>Multivariate N-Distinct Counts</title>
<para>
Single-column statistics store the number of distinct values in each
- column. Estimates of the number of distinct values on more than one
- column (for example, for <literal>GROUP BY a, b</literal>) are
+ column. Estimates of the number of distinct values when combining more
+ than one column (for example, for <literal>GROUP BY a, b</literal>) are
frequently wrong when the planner only has single-column statistical
- data, however, causing it to select bad plans.
- In order to improve n-distinct estimation when multiple columns are
- grouped together, the <literal>ndistinct</> option of extended statistics
- can be used, which instructs <command>ANALYZE</> to collect n-distinct
- estimates for all possible combinations of two or more columns of the set
- of columns in the statistics object (the per-column estimates are already
- available in <structname>pg_statistic</>).
+ data, causing it to select bad plans.
+ </para>
+
+ <para>
+ To improve such estimates, <command>ANALYZE</> can collect n-distinct
+ statistics for groups of columns. As before, it's impractical to do
+ this for every possible column grouping, so data is collected only for
+ those groups of columns appearing together in a statistics object
+ defined with the <literal>ndistinct</> option. Data will be collected
+ for each possible combination of two or more columns from the set of
+ listed columns.
</para>
<para>
- Continuing the above example, the n-distinct coefficients in a ZIP
- code table may look like the following:
+ Continuing the previous example, the n-distinct counts in a
+ table of ZIP codes might look like the following:
<programlisting>
-CREATE STATISTICS stts2 (ndistinct)
- ON zip, state, city FROM zipcodes;
+CREATE STATISTICS stts2 (ndistinct) ON zip, state, city FROM zipcodes;
+
ANALYZE zipcodes;
+
SELECT stxkeys AS k, stxndistinct AS nd
FROM pg_statistic_ext
- WHERE stxname = 'stts2';
+ WHERE stxname = 'stts2';
-[ RECORD 1 ]--------------------------------------------------------
k | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
</programlisting>
- which indicates that there are three combinations of columns that
+ This indicates that there are three combinations of columns that
have 33178 distinct values: ZIP code and state; ZIP code and city;
and ZIP code, city and state (the fact that they are all equal is
- expected given the nature of ZIP-code data). On the other hand,
- the combination of city and state only has 27435 distinct values.
+ expected given that ZIP code alone is unique in this table). On the
+ other hand, the combination of city and state has only 27435 distinct
+ values.
+ </para>
+
+ <para>
+ It's advisable to create <literal>ndistinct</> statistics objects only
+ on combinations of columns that are actually used for grouping, and
+ for which misestimation of the number of groups is resulting in bad
+ plans. Otherwise, the <command>ANALYZE</> cycles are just wasted.
</para>
</sect3>
</sect2>