diff options
Diffstat (limited to 'doc/src/sgml/trgm.sgml')
| -rw-r--r-- | doc/src/sgml/trgm.sgml | 214 |
1 files changed, 214 insertions, 0 deletions
diff --git a/doc/src/sgml/trgm.sgml b/doc/src/sgml/trgm.sgml new file mode 100644 index 0000000000..62a5e30382 --- /dev/null +++ b/doc/src/sgml/trgm.sgml @@ -0,0 +1,214 @@ +<sect1 id="pgtrgm"> + <title>pg_trgm</title> + + <indexterm zone="pgtrgm"> + <primary>pgtrgm</primary> + </indexterm> + + <para> + The <literal>pg_trgm</literal> module provides functions and index classes + for determining the similarity of text based on trigram matching. + </para> + + <sect2> + <title>Trigram (or Trigraph)</title> + <para> + A trigram is a set of three consecutive characters taken + from a string. A string is considered to have two spaces + prefixed and one space suffixed when determining the set + of trigrams that comprise the string. + </para> + <para> + eg. The set of trigrams in the word "cat" is " c", " ca", + "at " and "cat". + </para> + </sect2> + + <sect2> + <title>Public Functions</title> + <table> + <title><literal>pg_trgm</literal> functions</title> + <tgroup cols="2"> + <thead> + <row> + <entry>Function</entry> + <entry>Description</entry> + </row> + </thead> + <tbody> + <row> + <entry><literal>real similarity(text, text)</literal></entry> + <entry> + <para> + Returns a number that indicates how closely matches the two + arguments are. A zero result indicates that the two words + are completely dissimilar, and a result of one indicates that + the two words are identical. + </para> + </entry> + </row> + <row> + <entry><literal>real show_limit()</literal></entry> + <entry> + <para> + Returns the current similarity threshold used by the '%' + operator. This in effect sets the minimum similarity between + two words in order that they be considered similar enough to + be misspellings of each other, for example. + </para> + </entry> + </row> + <row> + <entry><literal>real set_limit(real)</literal></entry> + <entry> + <para> + Sets the current similarity threshold that is used by the '%' + operator, and is returned by the show_limit() function. + </para> + </entry> + </row> + <row> + <entry><literal>text[] show_trgm(text)</literal></entry> + <entry> + <para> + Returns an array of all the trigrams of the supplied text + parameter. + </para> + </entry> + </row> + <row> + <entry>Operator: <literal>text % text (returns boolean)</literal></entry> + <entry> + <para> + The '%' operator returns TRUE if its two arguments have a similarity + that is greater than the similarity threshold set by set_limit(). It + will return FALSE if the similarity is less than the current + threshold. + </para> + </entry> + </row> + </tbody> + </tgroup> + </table> + </sect2> + + <sect2> + <title>Public Index Operator Class</title> + <para> + The <literal>pg_trgm</literal> module comes with the + <literal>gist_trgm_ops</literal> index operator class that allows a + developer to create an index over a text column for the purpose + of very fast similarity searches. + </para> + <para> + To use this index, the '%' operator must be used and an appropriate + similarity threshold for the application must be set. Example: + </para> + <programlisting> +CREATE TABLE test_trgm (t text); +CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops); + </programlisting> + <para> + At this point, you will have an index on the t text column that you + can use for similarity searching. Example: + </para> + <programlisting> +SELECT + t, + similarity(t, 'word') AS sml +FROM + test_trgm +WHERE + t % 'word' +ORDER BY + sml DESC, t; + </programlisting> + <para> + This will return all values in the text column that are sufficiently + similar to 'word', sorted from best match to worst. The index will + be used to make this a fast operation over very large data sets. + </para> + </sect2> + + <sect2> + <title>Tsearch2 Integration</title> + <para> + Trigram matching is a very useful tool when used in conjunction + with a text index created by the Tsearch2 contrib module. (See + contrib/tsearch2) + </para> + <para> + The first step is to generate an auxiliary table containing all + the unique words in the Tsearch2 index: + </para> + <programlisting> +CREATE TABLE words AS SELECT word FROM + stat('SELECT to_tsvector(''simple'', bodytext) FROM documents'); + </programlisting> + <para> + Where 'documents' is a table that has a text field 'bodytext' + that TSearch2 is used to search. The use of the 'simple' dictionary + with the to_tsvector function, instead of just using the already + existing vector is to avoid creating a list of already stemmed + words. This way, only the original, unstemmed words are added + to the word list. + </para> + <para> + Next, create a trigram index on the word column: + </para> + <programlisting> +CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops); + </programlisting> + <para> + or + </para> + <programlisting> +CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops); + </programlisting> + <para> + Now, a <literal>SELECT</literal> query similar to the example above can be + used to suggest spellings for misspelled words in user search terms. A + useful extra clause is to ensure that the similar words are also + of similar length to the misspelled word. + </para> + <para> + <note> + <para> + Since the 'words' table has been generated as a separate, + static table, it will need to be periodically regenerated so that + it remains up to date with the word list in the Tsearch2 index. + </para> + </note> + </para> + </sect2> + + <sect2> + <title>References</title> + <para> + Tsearch2 Development Site + <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink> + </para> + <para> + GiST Development Site + <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink> + </para> + </sect2> + + <sect2> + <title>Authors</title> + <para> + Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia + </para> + <para> + Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia + </para> + <para> + Documentation: Christopher Kings-Lynne + </para> + <para> + This module is sponsored by Delta-Soft Ltd., Moscow, Russia. + </para> + </sect2> + +</sect1> + |
