delta/php-git.git/ext/mbstring/libmbfl, branch master

Update 'East Asian Width' table to comply with Unicode 13.0

2021-01-19T18:38:44+00:00

Instead of manually maintaining the data in eaw_table.h, it is now automatically
generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from
the Unicode Consortium.

Something must be said about the deleted test case. Back in 2004, someone
noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was
added to expose the problem. Well, time keeps moving on, and with the changing
years, new Unicodes are born and old Unicodes die. Some characters which were
counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0,
which renders the test case obsolete.

At the same time, make a couple of spelling/grammar fixes in ucgendat.php.

Remove useless constant MBFL_ENCTYPE_MBCS

2021-01-15T19:55:41+00:00

This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.

Remove unused macros from mbfilter_cp51932.c, mbfilter_iso2022jp_mobile.c

2021-01-15T19:55:41+00:00

Remove useless mbstring encoding 'JIS-ms'

2021-01-15T19:55:41+00:00

MicroSoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstring.

Since these encodings are very similar, some code can be shared. Actually,
conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when
converting from Unicode to CP50220/1/2 that some small differences arise in how
certain katakana are handled.

The most important common code was a function called `mbfl_filt_wchar_jis_ms`.
The `jis_ms` part doubtless refers to the fact that these encodings are modified
versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported
'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested
'JIS-ms' conversion, they got something like CP50220/1/2, minus their special
ways of handling half-width katakana when converting from Unicode.

But... that 'encoding' is not something which actually exists in the world outside
of mbstring. CP50220/1/2 do exist in MicroSoft software, but not 'JIS-ms'.

For a text encoding conversion library, inventing new variant encodings and
implementing them is not very productive. Our interest is in handling text
encodings which real people actually use for... you know, storing actual text
and things like that.

Remove useless mbstring encoding 'CP50220-raw'

2021-01-15T19:55:41+00:00

CP50220 is a variant of ISO-2022-JP invented by MicroSoft, which handles some
Unicode characters which are not representable in ISO-2022-JP by converting
them to similar characters which are representable.

What, then, is CP50220-raw? An Internet search turns up absolutely nothing.
Reference works which I consulted don't say anything about it. Other text
conversion libraries don't support it.

From looking at the code: It's just the same as CP50220, but it accepts
unmapped JIS X 0208 characters passed through from other Japanese encodings
and silently encodes them using the usual ISO-2022-JP escape sequence and
representation for JIS X 0208 characters.

It's hard to see how this could be useful. OK, let me come out and say it:
it's _not_ useful. We can confidently jettison this (mis)feature.

CP5022{0,1,2}: treat truncated multibyte characters as error

2021-01-15T19:55:41+00:00

CP5022{0,1,2}: treat unrecognized escapes as error

2021-01-15T06:30:36+00:00

CP5022{0,1,2}: use JISX0201 for U+203E (overline)

2021-01-15T06:30:30+00:00

Same issue as d497c0e96f addressed for JIS7/JIS8, but for CP5022{0,1,2} this time.

CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly

2021-01-15T06:26:46+00:00

Unicode has a range of 'private' codepoints which individual applications can
use for their own purposes. When they were inventing CP932, MicroSoft mapped
these 'private' or 'user' codepoints to ten new rows added to the JIS X 0208
character table. (JIS X 0208 is based on a 94x94 table; MS used rows 95-114
for private characters.)

`mbfl_filt_conv_wchar_jis_ms` converted these private codepoints to rows 85-94
rather than 95-114. The code included a link to a document on the OpenGroup
web site, dating back to 1996 [1], which proposed mapping private codepoints to
these rows. However, that is not consistent with what mbstring does when
converting CP5022x to Unicode.

There seems to be a dearth of information on CP5022x on the web. However, I
did find one (Japanese-language) page on CP50221, which states that it maps
kuten codes 0x7F21-0x927E to the 'private' Unicode codepoints [2].

As a side note, using rows higher than 95 does seem to defeat one purpose of
using an ISO-2022-JP variant: ISO-2022-JP was specifically designed to be
"7-bit clean", but once you go beyond row 95, the ku codes are 0x80 and up,
so 8 bits are needed.

[1] https://web.archive.org/web/20000229180004/http://www.opengroup.or.jp/jvc/cde/ucs-conv.html
[2] https://www.wdic.org/w/WDIC/Microsoft%20Windows%20Codepage%20%3A%2050221

CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly

2021-01-15T06:26:38+00:00

Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS.
As Shift-JIS and ISO-2022-JP both encode characters from the JIS X 0208 charset,
CP932 and CP5022x both encode characters from JIS X 0208 _plus_ extra characters
added as MicroSoft vendor extensions.

Among the added characters are a number of symbols which MS put in the 13th row
of the 94x94 character table. (In JIS X 0208, that row is empty.)

mbfilter_cp50220x.c had an `if` clause which was intended to handle the
conversion of characters in that 13th row, but it was dead code, as the previous
clause was always true in those cases. The solution is to reverse the order of
those two clauses (just as they already appeared in mbfilter_cp932.c).