summaryrefslogtreecommitdiff
path: root/doc/src/sgml/trgm.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/trgm.sgml')
-rw-r--r--doc/src/sgml/trgm.sgml214
1 files changed, 214 insertions, 0 deletions
diff --git a/doc/src/sgml/trgm.sgml b/doc/src/sgml/trgm.sgml
new file mode 100644
index 0000000000..62a5e30382
--- /dev/null
+++ b/doc/src/sgml/trgm.sgml
@@ -0,0 +1,214 @@
+<sect1 id="pgtrgm">
+ <title>pg_trgm</title>
+
+ <indexterm zone="pgtrgm">
+ <primary>pgtrgm</primary>
+ </indexterm>
+
+ <para>
+ The <literal>pg_trgm</literal> module provides functions and index classes
+ for determining the similarity of text based on trigram matching.
+ </para>
+
+ <sect2>
+ <title>Trigram (or Trigraph)</title>
+ <para>
+ A trigram is a set of three consecutive characters taken
+ from a string. A string is considered to have two spaces
+ prefixed and one space suffixed when determining the set
+ of trigrams that comprise the string.
+ </para>
+ <para>
+ eg. The set of trigrams in the word "cat" is " c", " ca",
+ "at " and "cat".
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Public Functions</title>
+ <table>
+ <title><literal>pg_trgm</literal> functions</title>
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Function</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>real similarity(text, text)</literal></entry>
+ <entry>
+ <para>
+ Returns a number that indicates how closely matches the two
+ arguments are. A zero result indicates that the two words
+ are completely dissimilar, and a result of one indicates that
+ the two words are identical.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>real show_limit()</literal></entry>
+ <entry>
+ <para>
+ Returns the current similarity threshold used by the '%'
+ operator. This in effect sets the minimum similarity between
+ two words in order that they be considered similar enough to
+ be misspellings of each other, for example.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>real set_limit(real)</literal></entry>
+ <entry>
+ <para>
+ Sets the current similarity threshold that is used by the '%'
+ operator, and is returned by the show_limit() function.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>text[] show_trgm(text)</literal></entry>
+ <entry>
+ <para>
+ Returns an array of all the trigrams of the supplied text
+ parameter.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>Operator: <literal>text % text (returns boolean)</literal></entry>
+ <entry>
+ <para>
+ The '%' operator returns TRUE if its two arguments have a similarity
+ that is greater than the similarity threshold set by set_limit(). It
+ will return FALSE if the similarity is less than the current
+ threshold.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
+ <sect2>
+ <title>Public Index Operator Class</title>
+ <para>
+ The <literal>pg_trgm</literal> module comes with the
+ <literal>gist_trgm_ops</literal> index operator class that allows a
+ developer to create an index over a text column for the purpose
+ of very fast similarity searches.
+ </para>
+ <para>
+ To use this index, the '%' operator must be used and an appropriate
+ similarity threshold for the application must be set. Example:
+ </para>
+ <programlisting>
+CREATE TABLE test_trgm (t text);
+CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
+ </programlisting>
+ <para>
+ At this point, you will have an index on the t text column that you
+ can use for similarity searching. Example:
+ </para>
+ <programlisting>
+SELECT
+ t,
+ similarity(t, 'word') AS sml
+FROM
+ test_trgm
+WHERE
+ t % 'word'
+ORDER BY
+ sml DESC, t;
+ </programlisting>
+ <para>
+ This will return all values in the text column that are sufficiently
+ similar to 'word', sorted from best match to worst. The index will
+ be used to make this a fast operation over very large data sets.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Tsearch2 Integration</title>
+ <para>
+ Trigram matching is a very useful tool when used in conjunction
+ with a text index created by the Tsearch2 contrib module. (See
+ contrib/tsearch2)
+ </para>
+ <para>
+ The first step is to generate an auxiliary table containing all
+ the unique words in the Tsearch2 index:
+ </para>
+ <programlisting>
+CREATE TABLE words AS SELECT word FROM
+ stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
+ </programlisting>
+ <para>
+ Where 'documents' is a table that has a text field 'bodytext'
+ that TSearch2 is used to search. The use of the 'simple' dictionary
+ with the to_tsvector function, instead of just using the already
+ existing vector is to avoid creating a list of already stemmed
+ words. This way, only the original, unstemmed words are added
+ to the word list.
+ </para>
+ <para>
+ Next, create a trigram index on the word column:
+ </para>
+ <programlisting>
+CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
+ </programlisting>
+ <para>
+ or
+ </para>
+ <programlisting>
+CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
+ </programlisting>
+ <para>
+ Now, a <literal>SELECT</literal> query similar to the example above can be
+ used to suggest spellings for misspelled words in user search terms. A
+ useful extra clause is to ensure that the similar words are also
+ of similar length to the misspelled word.
+ </para>
+ <para>
+ <note>
+ <para>
+ Since the 'words' table has been generated as a separate,
+ static table, it will need to be periodically regenerated so that
+ it remains up to date with the word list in the Tsearch2 index.
+ </para>
+ </note>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>References</title>
+ <para>
+ Tsearch2 Development Site
+ <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
+ </para>
+ <para>
+ GiST Development Site
+ <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+ <para>
+ Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
+ </para>
+ <para>
+ Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
+ </para>
+ <para>
+ Documentation: Christopher Kings-Lynne
+ </para>
+ <para>
+ This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
+ </para>
+ </sect2>
+
+</sect1>
+