This article is rated Start-class on Wikipedia's content assessment scale.
This page has archives. Sections older than 730 days may be automatically archived by Lowercase sigmabot III when more than 3 sections are present.
Math error, request confirmation/correction by Unicode standards expert.
When I tally the total number of code points available from the three Private Use ranges, I get four (4) more than is indicated by the summary at the top of this article.
- par.3: .... 137,468 are reserved for private use, leaving 974,530 for public assignment.
- par.4: .... 65,536 code points (Supplementary Private Use Area-A and -B, which constitute the entirety of planes 15 and 16).
Basic Multilingual Plane:
- par.4: As of Unicode 12.1, the BMP comprises the following 163 blocks:
  o ....
  o Private Use Area (E000–F8FF)

   F8FFhex   63743   end of BMP Private Use Block
  -DFFFhex   57343   end of preceding Surrogate Block
  =============
   1900hex    6400   code points in BMP Private Use Block

     6,400   Private Use Block in Unicode Plane 0 (BMP)
  + 65,536   Private Use Block in Unicode Plane 15 (PUA-A)
  + 65,536   Private Use Block in Unicode Plane 16 (PUA-B)
  ========
   137,472   tally of the three (3) Private Use Blocks

   137,472   tally of the three (3) Private Use Blocks
  -137,468   code points referenced in introduction to this article
  ========
         4   fewer code points in Intro than calculated from tallies of the 3 Blocks
Tree4rest (talk) 23:46, 24 September 2019 (UTC)
- Is this possibly caused by the xxFFFE and xxFFFF code points in the PUA planes? Spitzak (talk) 23:54, 24 September 2019 (UTC)
- Yes. Although each plane has 65,536 = 2^16 code points, the last two in each plane are permanently declared non-characters. So only 65,534 are available for (any) use in planes 15 and 16. -- Elphion (talk) 00:35, 25 September 2019 (UTC)
- Actually, even though the last two codepoints in each plane are declared "non-characters", they are valid codepoints and can be encoded, say with UTF-8, even if the encoded text is non-conforming. The same is true for the few non-characters assigned inside the Arabic forms near the end of the BMP. Being "non-characters" means that they are not useful for encoding text for interchange, but they can still be used *locally* as special-purpose marks inside applications, libraries, or renderers to facilitate their implementation. They are used for that: on input, texts are filtered, and non-characters may be filtered out, replaced by a placeholder, or cause the whole document to be rejected as invalid. Internally they can then be used freely, provided the implementation does not emit transformed texts containing them, because the recipient would reject such texts outright.
- Those non-characters therefore have NO meaning (like PUA) but are more restricted than PUA, because their interchange in conforming text documents is invalid (for example, non-characters must NOT be present in documents conforming to standards like HTML, XML, or JSON). Various applications or libraries will reject them if they ever detect them. For example, a filesystem API may treat them as a sign of an encoding error, filesystem inconsistency, desynchronization, or data corruption on the media, and the filesystem could refuse to mount such a volume or to grant write access without specific permission. Special maintenance would then be needed, and it cannot be automated because it could cause security issues or corrupt important data that is not supposed to be text, such as an encrypted binary file: "repairing" the filesystem by replacing or dropping those characters could damage the data or invalidate its binary signature.
- Non-characters are very useful because they can be used to detect corruption, access violations, or failures in communication or storage protocols. They can serve as guards (notably the last two codepoints at the end of each plane), for example to create binary container formats that multiplex text parts and binary parts, all of variable length (e.g. inside audio/video/image formats, including JPEG, MPEG, PNG, WebM, Ogg and others, where text fragments may be present for tagging metadata, subtitles, titles, licensing and copyright statements, or to embed URIs or HTML, XML and JSON documents). There are not many non-characters, but they are still valid codepoints, meaning that they can be transformed bijectively between all conforming UTFs. That is not the case for surrogates, which lack this bijective property, so there is no roundtrip conversion for them: the roundtrip does not work with two successive surrogates, only with isolated surrogates, which are still forbidden in conforming texts. Surrogates have no value of their own even though they have codepoints assigned to them; they exist only to implement UTF-16. If UTF-16 were not part of the standard, there would be NO surrogates at all in the BMP, but there would still be non-characters. verdy_p (talk) 02:59, 17 October 2020 (UTC)
- The article's count is correct. The two non-characters at the end of each plane are specifically excluded from PUA-A and PUA-B per The Unicode Standard (https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#G19378). DRMcCreedy (talk) 03:43, 17 October 2020 (UTC)
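For anyone who wants to recheck the arithmetic in this thread, here is a minimal Python sketch of the tally. It assumes the range boundaries quoted above (E000–F8FF for the BMP Private Use Area, surrogates ending at DFFF, and planes 15 and 16 each minus their last two code points), plus two figures not quoted in the thread that I am taking from the standard: surrogates start at D800, and the BMP non-characters sit at FDD0–FDEF. All function and constant names are made up for illustration. The last helper only shows how Python's own UTF-8 codec treats non-characters versus surrogates; it is not a statement about what every implementation does.

```python
# Sketch: recount the private-use code points discussed in this thread.
# Names are illustrative; ranges follow the boundaries quoted above.

BMP_PUA = range(0xE000, 0xF8FF + 1)        # Private Use Area in plane 0
PUA_A = range(0xF0000, 0xFFFFD + 1)        # plane 15 without U+FFFFE, U+FFFFF
PUA_B = range(0x100000, 0x10FFFD + 1)      # plane 16 without U+10FFFE, U+10FFFF
SURROGATES = range(0xD800, 0xDFFF + 1)     # assumption: start at D800 per the standard
TOTAL_CODE_POINTS = 17 * 0x10000           # 17 planes of 65,536 code points


def noncharacters():
    """List the non-characters: U+FDD0..U+FDEF plus the last two of each plane."""
    result = list(range(0xFDD0, 0xFDEF + 1))
    for plane in range(17):
        base = plane * 0x10000
        result += [base + 0xFFFE, base + 0xFFFF]
    return result


def private_use_total():
    """Code points reserved for private use across the three blocks."""
    return len(BMP_PUA) + len(PUA_A) + len(PUA_B)


def public_total():
    """What is left after removing surrogates, private use, and non-characters."""
    return (TOTAL_CODE_POINTS - len(SURROGATES)
            - private_use_total() - len(noncharacters()))


def utf8_round_trips(code_point):
    """True if Python's UTF-8 codec encodes and decodes this code point unchanged."""
    try:
        text = chr(code_point)
        return text.encode("utf-8").decode("utf-8") == text
    except UnicodeError:
        return False


if __name__ == "__main__":
    print(private_use_total())        # 137468
    print(public_total())             # 974530
    print(utf8_round_trips(0xFFFE))   # non-character
    print(utf8_round_trips(0xD800))   # lone surrogate
```

Counting planes 15 and 16 as 65,534 each rather than 65,536 is exactly what accounts for the difference of four.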
Aramaic Scripts
Wouldn't the group heading "Aramaic Scripts" be more accurately named "Semitic Scripts" (or more pedantically, "Scripts used with Semitic languages")? Is the heading "Aramaic Scripts" an official designation made by a committee? 2601:602:8580:5E00:A8FA:C0F1:EB3C:C8C6 (talk) 07:56, 11 January 2021 (UTC)
- It's not a Unicode designation. If you look at The Unicode Standard (http://www.unicode.org/versions/Unicode13.0.0/) you'll see that they're grouped geographically. Hebrew, Arabic, Syriac, Samaritan, and Mandaic are in Chapter 9: Middle East-I, Modern and Liturgical Scripts. Thaana is in Chapter 13: South and Central Asia-II. And N'Ko is in Chapter 19: Africa. I'm not weighing in on the Aramaic vs Semitic question, just that it's not a Unicode designation. DRMcCreedy (talk) 18:25, 11 January 2021 (UTC)
History
The article is missing a History section (or similar), showing the motivation for the introduction of planes and details such as:
- When was it decided that the original 65536 code point space was insufficient for future expansion of Unicode, and that additional planes had to be added?
- In which year and Unicode version was the first additional plane added (and corresponding characters defined)?
- Which Unicode version introduced each new plane / started defining characters in a previously empty plane?
--Cousteau (talk) 15:50, 31 October 2024 (UTC)
- https://www.unicode.org/notes/tn23/Muller-Slides+Narr.pdf seems to suggest that the first use of non-BMP planes appeared in Unicode 3.1 (2001). The version history table in the Wikipedia article for Unicode suggests that planes were defined in Unicode 2.0 but did not come into use until 3.1, when the original 16-bit space ran out. Cousteau (talk) 16:02, 31 October 2024 (UTC)