author: Alex Dowad <alexinbeijing@gmail.com> 2020-11-09 21:40:08 +0200
committer: Alex Dowad <alexinbeijing@gmail.com> 2020-11-11 11:18:58 +0200
commit: fbdcab953d086dffd5228e1ff6374cd2b1e8023c
tree: c8fad891d4bc6c925486e12849c4de96c0f85138
parent: b27a34c5a9ce1fa509fe30c141142c801ce3b6dd
Unicode -> SJIS-mac conversion doesn't reject valid codepoints after a bad transcoding hint

To give the background on this issue, here is an excerpt from JAPANESE.txt,
from the Unicode Consortium:

    Apple has defined a block of 32 corporate characters as "transcoding
    hints." These are used in combination with standard Unicode characters
    to force them to be treated in a special way for mapping to other
    encodings; they have no other effect. Sixteen of these transcoding
    hints are "grouping hints" - they indicate that the next 2-4 Unicode
    characters should be treated as a single entity for transcoding. The
    other sixteen transcoding hints are "variant tags" - they are like
    combining characters, and can follow a standard Unicode (or a sequence
    consisting of a base character and other combining characters) to
    cause it to be treated in a special way for transcoding. These always
    terminate a combining-character sequence.

    The transcoding coding hints used in this mapping table are:
      0xF860 group next 2 characters as a single entity for transcoding
      0xF861 group next 3 characters as a single entity for transcoding
      0xF862 group next 4 characters as a single entity for transcoding
      0xF87A variant tag for "negative" (i.e. black & white reversed)
      0xF87E variant tag for vertical form
      0xF87F variant tag for other alternate form

    For example, the Apple addition character 0x85AB is Roman numeral
    thirteen. There is no single Unicode for this (although there are
    standard Unicodes for Roman numerals 1-12). Using the grouping hint
    0xF862 in combination with standard Unicodes, we can map this as
    0xF862+0x0058+0x0049+0x0049+0x0049 (i.e. X + I + I + I).

Our SJIS-mac conversion code actually recognizes some special sequences
which start with an Apple 'transcoding hint'. However, if a transcoding
hint is misplaced and is not followed by one of the expected sequences,
we can just emit one error marker for the bad transcoding hint and then
process the following codepoint as normal.
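
The behavior described above can be illustrated with a small standalone sketch. It is not the libmbfl code: `toy_filter`, `toy_init`, `toy_feed`, `toy_convert` and `ERR_MARK` are hypothetical names, and the only sequence it knows is the 0xF862 + X + I + I + I example from the excerpt. "Normal" processing is reduced to passing ASCII through. How the real filter handles a mismatch deeper in a sequence is not shown in this commit, so the replay of already-matched codepoints below is an assumption of the sketch.

```c
#include <assert.h>

#define HINT_GROUP4   0xF862  /* Apple grouping hint: group next 4 characters */
#define SJIS_ROMAN_13 0x85AB  /* Apple addition: Roman numeral thirteen */
#define ERR_MARK      (-1)    /* hypothetical error marker, not a libmbfl value */
#define OUT_MAX       32

/* The one sequence this toy recognizes after the hint: X I I I */
static const int expected[4] = { 0x58, 0x49, 0x49, 0x49 };

typedef struct {
    int matched;        /* -1 = idle; else codepoints matched after the hint */
    int out[OUT_MAX];   /* converted output, or ERR_MARK entries */
    int nout;
} toy_filter;

static void toy_init(toy_filter *f)
{
    f->matched = -1;
    f->nout = 0;
}

static void emit(toy_filter *f, int v)
{
    assert(f->nout < OUT_MAX);
    f->out[f->nout++] = v;
}

static void toy_feed(toy_filter *f, int c)
{
    if (f->matched >= 0) {
        if (c == expected[f->matched]) {
            if (++f->matched == 4) {
                emit(f, SJIS_ROMAN_13);   /* whole group maps to one character */
                f->matched = -1;
            }
            return;
        }
        /* Misplaced hint: one error marker for the hint itself... */
        int n = f->matched;
        f->matched = -1;
        emit(f, ERR_MARK);
        /* ...replay anything already matched (assumption of this sketch)... */
        for (int i = 0; i < n; i++) {
            toy_feed(f, expected[i]);
        }
        /* ...then process the current codepoint as normal instead of rejecting it. */
        toy_feed(f, c);
        return;
    }
    if (c == HINT_GROUP4) {
        f->matched = 0;                   /* hint seen, wait for the sequence */
        return;
    }
    emit(f, c < 0x80 ? c : ERR_MARK);     /* toy "normal" path: ASCII only */
}

/* Feed a whole buffer; a hint still pending at end of input is not flushed here. */
static void toy_convert(toy_filter *f, const int *in, int n)
{
    for (int i = 0; i < n; i++) {
        toy_feed(f, in[i]);
    }
}
```

Feeding 0xF862, 0x58, 0x49, 0x49, 0x49 yields the single value 0x85AB, while feeding 0xF862, 0x41 yields one `ERR_MARK` followed by 0x41. The second case corresponds to the change in this commit: the codepoint after a bad hint is handed back to the converter rather than being reported as a second error.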
-rw-r--r-- ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c | 4
1 files changed, 3 insertions, 1 deletions
diff --git a/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c b/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c
index 78bf8e3671..45b87a8f98 100644
--- a/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c
+++ b/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c
@@ -408,6 +408,7 @@ mbfl_filt_conv_wchar_sjis_mac(int c, mbfl_convert_filter *filter)
}
if (c == 0xf860 || c == 0xf861 || c == 0xf862) {
+ /* Apple 'transcoding hint' codepoints (from private use area) */
filter->status = 2;
filter->cache = c;
return c;
@@ -527,8 +528,9 @@ mbfl_filt_conv_wchar_sjis_mac(int c, mbfl_convert_filter *filter)
}
if (filter->status == 0) {
+ /* Didn't find any of expected codepoints after Apple transcoding hint */
CK(mbfl_filt_conv_illegal_output(c1, filter));
- CK(mbfl_filt_conv_illegal_output(c, filter));
+ return mbfl_filt_conv_wchar_sjis_mac(c, filter);
}
break;