author | Tom Lane <tgl@sss.pgh.pa.us> | 2015-05-15 15:01:59 -0400 |
---|---|---|
committer | Tom Lane <tgl@sss.pgh.pa.us> | 2015-05-15 15:02:13 -0400 |
commit | 8d3e0906df5496b853cc763f87b9ffd2ae27adbe (patch) | |
tree | 41de26c9c6f67d2ef6467ea231511a5a4b00cd2a /src/backend/utils/adt/network_gist.c | |
parent | 92edba2665ae7bf43ed03538311e63652f9e2373 (diff) | |
download | postgresql-8d3e0906df5496b853cc763f87b9ffd2ae27adbe.tar.gz |
Extend GB18030 encoding conversion to cover full Unicode range.

Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
points up to U+FFFF, but the actual spec defines conversions for all code
points up to U+10FFFF. That would be rather impractical as a lookup table,
but fortunately there is a simple algorithmic conversion between the
additional code points and the equivalent GB18030 byte patterns. Make use
of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
additional conversions.
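
The algorithmic conversion mentioned above can be sketched roughly as follows. This is an illustration and not the code from this commit: the function names are hypothetical, four-byte GB18030 codes are assumed to be packed big-endian into a uint32_t, and the sketch relies on the GB18030 layout in which U+10000 corresponds to the byte sequence 90 30 81 30.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a four-byte GB18030 code to a linear index.  Bytes 1 and 3 run
 * over 0x81..0xFE (126 values); bytes 2 and 4 run over 0x30..0x39 (10 values).
 * (Hypothetical helper, not taken from the PostgreSQL sources.)
 */
uint32_t
gb4_linear(uint32_t code)
{
    uint32_t b1 = (code >> 24) & 0xFF;
    uint32_t b2 = (code >> 16) & 0xFF;
    uint32_t b3 = (code >> 8) & 0xFF;
    uint32_t b4 = code & 0xFF;

    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
        + (b4 - 0x30);
}

/* Inverse of gb4_linear: rebuild the four bytes from a linear index. */
uint32_t
gb4_unlinear(uint32_t lin)
{
    uint32_t b4 = 0x30 + lin % 10;
    lin /= 10;
    uint32_t b3 = 0x81 + lin % 126;
    lin /= 126;
    uint32_t b2 = 0x30 + lin % 10;
    lin /= 10;
    uint32_t b1 = 0x81 + lin;

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/* Map a supplementary-plane code point (U+10000..U+10FFFF) to GB18030. */
uint32_t
ucs_supp_to_gb18030(uint32_t ucs)
{
    assert(ucs >= 0x10000 && ucs <= 0x10FFFF);
    return gb4_unlinear(ucs - 0x10000 + gb4_linear(0x90308130));
}

/* Map a four-byte GB18030 code at or above 0x90308130 back to Unicode. */
uint32_t
gb18030_to_ucs_supp(uint32_t code)
{
    assert(code >= 0x90308130);
    return 0x10000 + (gb4_linear(code) - gb4_linear(0x90308130));
}
```

Under these assumptions, ucs_supp_to_gb18030(0x10FFFF) yields 0xE3329A35, so the whole supplementary range is covered by arithmetic alone, with no table entries.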
Having created the infrastructure to do that, we can use the same code to
map certain linearly-related subranges of the Unicode space below U+FFFF,
allowing removal of the corresponding lookup table entries. This more
than halves the lookup table size, which is a substantial savings;
utf8_and_gb18030.so drops from nearly a megabyte to about half that.
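
The same linearization can serve the linearly-related subranges below U+FFFF. The sketch below is hypothetical and not the commit's code: the struct, table, and function names are invented for illustration, and the two range entries are given as examples of the kind of range listed in gb-18030-2000.xml; they should be checked against that file before being relied on.

```c
#include <stddef.h>
#include <stdint.h>

/* One linearly-mapped subrange (hypothetical layout). */
typedef struct
{
    uint32_t ucs_first;     /* first Unicode code point in the range */
    uint32_t gb_first;      /* first four-byte GB18030 code, packed */
    uint32_t gb_last;       /* last four-byte GB18030 code, packed */
} gb_range;

/* Example entries only; verify against gb-18030-2000.xml. */
static const gb_range bmp_ranges[] = {
    {0x0452, 0x8130D330, 0x8136A531},
    {0xFFE6, 0x8431A234, 0x8431A439},
};

/* Four-byte GB18030 code -> linear index (126 x 10 x 126 x 10 layout). */
uint32_t
gb4_linear(uint32_t code)
{
    uint32_t b1 = (code >> 24) & 0xFF;
    uint32_t b2 = (code >> 16) & 0xFF;
    uint32_t b3 = (code >> 8) & 0xFF;
    uint32_t b4 = code & 0xFF;

    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
        + (b4 - 0x30);
}

/* Linear index -> four-byte GB18030 code. */
uint32_t
gb4_unlinear(uint32_t lin)
{
    uint32_t b4 = 0x30 + lin % 10;
    lin /= 10;
    uint32_t b3 = 0x81 + lin % 126;
    lin /= 126;
    uint32_t b2 = 0x30 + lin % 10;
    lin /= 10;
    uint32_t b1 = 0x81 + lin;

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/*
 * Return the GB18030 code for ucs if it falls in a listed range, or 0 to
 * tell the caller to fall back to the ordinary lookup table.
 */
uint32_t
ucs_to_gb18030_by_range(uint32_t ucs)
{
    for (size_t i = 0; i < sizeof(bmp_ranges) / sizeof(bmp_ranges[0]); i++)
    {
        const gb_range *r = &bmp_ranges[i];
        uint32_t first = gb4_linear(r->gb_first);
        uint32_t span = gb4_linear(r->gb_last) - first;

        if (ucs >= r->ucs_first && ucs <= r->ucs_first + span)
            return gb4_unlinear(first + (ucs - r->ucs_first));
    }
    return 0;
}
```

Each range handled this way costs one table row instead of one lookup entry per code point, which is where the size reduction described above would come from.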
In support of doing that, replace ISO10646-GB18030.TXT with the data file
gb-18030-2000.xml (retrieved from
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
in which these subranges have been deleted from the simple lookup entries.

Per bug #12845 from Arjen Nienhuis. The conversion code added here is
based on his proposed patch, though I whacked it around rather heavily.