Module talk:Unicode chart

Notes about notes

  • I'm not convinced the "Notes" section at the bottom is worth the space it takes up, and I only added it as a proof-of-concept gesture to mimic existing layout convention. A collapsible (show/hide, just like the section above) section at the bottom with an additional list/table of character info (one per line) would certainly be feasible and only require a few more lines of code. Its hugeness of screen space would be the primary concern, because its expansion would displace other page content possibly including wrapped text or floating images (unlike navboxes, which occupy 100% width at the very bottom).
    We should just give first and last rows for blocks with character names derived from code points (CJK, Tangut, Nushu, ...), so the largest block is Hangul Syllables with 11,184 code points, which I agree is too long for this approach. But the next biggest blocks are Yi Syllables (1,168), Egyptian Hieroglyphs (1,072), Mathematical Alphanumeric Symbols (1,024), and Cuneiform (1,024), which I think should be acceptable if the names list is initially hidden. I don't see that displacement of other text and images would be an issue, especially as the code charts are mostly only used in the corresponding Unicode Block name articles. BabelStone (talk) 11:33, 10 September 2019 (UTC)Reply
  • One intuitive solution would be to mimic typical charmap program behavior by using a Javascript click handler on each character cell that populates the footer area (of about the same size as the "Notes" section, maybe slightly smaller) with the cursor-selectable name of the last clicked-upon codepoint, plus its &escapecode; and any additional info we care to pull from Module:Unicode data (replacing any previous content). I could whip up a demo for that in the next few days. I just worry that it might be too interactive to be widely accepted.
    Nice idea but I am also concerned that turning Wikipedia into an app is a step too far. I'd like to see a prototype of it though. BabelStone (talk) 11:33, 10 September 2019 (UTC)Reply
  • A third approach might be to render the entire list (of names and whatnot) in a vertically scrollable footer panel containing "section" links, such that clicking on the character cell would cause the footer to scroll to and highlight (similar behavior to reflist anchors) the appropriate line. This might be even less popular.
    I think this is the best solution, regardless of WP:SCROLL. Only 50 blocks with non-algorithmic character names have more than 128 code points, so if we make the scroll window 128 rows only the 50 largest blocks will be affected. BabelStone (talk) 11:33, 10 September 2019 (UTC)Reply
  • On the other hand, some philosophies may have changed over the years. I mean, we do have interactive scrolling maps that pop up in a fullscreen div now (see example).
  • I haven't formed any opinion yet on how to handle combining character positioning, other than "oh god, I hope it's something other than  " lol.
    Personally I prefer NBSP as the base for combining characters as dotted circle (which we currently use) often interferes with the character. BabelStone (talk) 11:33, 10 September 2019 (UTC)Reply

cobaltcigs 17:55, 9 September 2019 (UTC)Reply

Existing charts

Interesting approach to create the Unicode code charts dynamically but I have many questions. Most only apply if this module is intended to replace the existing chart templates...

  1. What problem is this new approach solving? Is it just duplicating/replacing the existing templates? If not, what will this module be used for?
  2. Do the charts get created every time they're displayed? If so, do we care about the extra processing incurred?
  3. How to handle fonts? I saw the post at Template talk:Script#Module:Unicode chart and the notes above so I know this is a known issue.
  4.   Done How to handle a varying number of reserved characters? The current charts leave off the "Gray areas" notice if there are no non-assigned code points because having the "gray areas" notice for those blocks would be confusing. And the wording changes if there is only one non-assigned code point.
  5.   Done How to handle charts with additional footnotes? For example, Template:Unicode chart Arabic. And for the existing charts, the notes are indeed valuable.
  6.   Done How to handle non-characters? For example, U+FDD0-FDEF in Template:Unicode chart Arabic Presentation Forms-A.
  7. How to handle combining marks (which are referenced above)? Some charts have special additions for some combining characters. For example, U+A980 in Template:Unicode chart Javanese uses a dotted circle. Other combining marks, like U+1D242 in Template:Unicode chart Ancient Greek Musical Notation use a non-breaking space. Some combining marks use no additional character at all.
  8. How to handle characters with dashed boxes? For example, U+0600-0605, 061C, and 06DD in the Template:Unicode chart Arabic chart.
  9. How to handle control(ish) characters where we don't want the actual character in the chart? For example, U+061C in the Template:Unicode chart Arabic chart, and more obviously, control characters in Template:Unicode chart C0 Controls and Basic Latin and Template:Unicode chart C1 Controls and Latin-1 Supplement.
  10. How to create character name aliases? See U+061C in Template:Unicode chart Arabic and the control characters in Template:Unicode chart C0 Controls and Basic Latin and Template:Unicode chart C1 Controls and Latin-1 Supplement.
  11. How to handle block-specific formatting? For example Template:Unicode chart Javanese has a specific height and some of the characters in Template:Unicode chart Control Pictures use a different font size.
  12. How to handle character links? Like @BabelStone:, I'm not a fan of linking specific characters (but others are). It looks like your code, optionally, will link every character if an article exists, but this could increase the number of linked characters. And many characters aren't linked to the character itself, like U+2245 in Template:Unicode chart Mathematical Operators. Some link to wikt, like U+2105 in Template:Unicode chart Letterlike Symbols and all the characters in Template:Unicode chart CJK Unified Ideographs Extension A.
  13.   Done Some blocks have special parameters that need to be taken into account: Template:Unicode chart Alphabetic Presentation Forms, Template:Unicode chart Enclosed Alphanumeric Supplement, Template:Unicode chart Enclosed CJK Letters and Months, Template:Unicode chart Halfwidth and Fullwidth Forms, Template:Unicode chart Miscellaneous Symbols, and Template:Unicode chart Supplemental Symbols and Pictographs. As with most of these questions, this only applies if you're replacing existing chart templates.
  14. How to determine the chart name? Most charts use the block name for the title but some don't. For example, "C0 Controls and Basic Latin" is the chart name for the "Basic Latin" block.
  15. How to determine what to link the chart name to. For example, the Template:Unicode chart Kangxi Radicals chart links to "Kangxi radical#Unicode". Most either link to the block name itself or the block name with "(Unicode block)" appended.
  16. Will the new approach be used for the list charts that make up List of CJK Unified Ideographs, part 1 of 4 and List of CJK Unified Ideographs Extension B (Part 1 of 7)?

DRMcCreedy (talk) 04:51, 10 September 2019 (UTC)Reply

  • 1. Consistency of format, avoidance of stupidity like this.
  •   Done 2 (and 3). Here are four profiler outputs for the testcases page. Note that this is the total churning of five {{unicode chart}}s transcluded on the same page (indirectly through the {{test case}} template/module in fact). Even with those factors the processing stats are at a small fraction of allowable limits in every case except for ifexist (which should probably be the first feature taken out). Actual overhead in the wild would be lower. Based on the percentages at the bottom, it looks like the single worst bottleneck is the grand #switch statement at Template:Script. We could probably save at least 40% on parser juice by skipping that and moving its fairly trivial functionality (that of choosing a css class and a definition for same, having already obtained an ISO 15924 code from here) into some module.
  • 4.   Done Keeping a count of reserved codepoints and rendering the "note" as plural/singular/blank will be a trivial step. I just didn't think of it. I do question whether the footnote system is the appropriate way to present this.
  • 5.   Done My first version of the module actually did have a parameter accepting whole refs. I just took it out when I got the impression every existing template had the same two notes. I can put it back.
  • 6.   Done Preview of {{tl|unicode chart|name=Arabic Presentation Forms-A|version=12.0}} has them showing up as normally reserved codepoints (the default assumption based on lines missing from here), rather than choking. If we want to give the "permanently reserved" codepoints a different background and auto-generate a footnote explaining what this means, we'd have to maintain a list of them somewhere. Does anything like this occur in other blocks?
  • Also 6. I'd be more immediately concerned about this cell-stretching monstrosity at U+FDFD, which seems to be a consequence of using {{script}} in places where the original chart template does not.
  • 7. Not sure yet. I did see some interesting suggestions here.
  • 8. Depends on what the rationale is for drawing these boxes, and whether it can be detected in any way from Unicode data. Or whether it needs to be listed elsewhere as a special case. Or whether the boxes are needed at all. I don't see a footnote explaining what the boxes even indicate. No hints on my own system either.
  • 9 and 10. Each of the display-aliased characters in the templates you mentioned returns false for the Module:Unicode data function .is_printable(n), except for U+0020 SPACE and U+00A0 NO-BREAK SPACE, which return true for .is_whitespace(n). So both of these traits can easily be tested. Choosing the replacement alias we want would require maintaining a list of same. I'm not sure a printable space character should be aliased in this manner. Maybe the cell background should be a different color with a footnote explaining yes, a whitespace character is there, and yes, you can copy it and paste it elsewhere. Also not sure "XXX" is appropriate for U+0080–0081. Maybe we want to display "PAD" and "HOP" instead?
  • 11. The existing chart for Javanese shows up with a cell height of 80px which seems excessive for the apparent line height of 33px on my screen. Preview of module output for Javanese looks fine. Better in my opinion. Maybe I just don't have the right fonts installed. But yes, cell height/width params can be added if there's a demonstrated need for this. Otherwise the browser should be trusted to stretch cells for large characters as needed. See "Also 6" above.
  • 12. I think if they are going to be linked, they shouldn't be piped to something else unless the character is itself an illegal title char, and even then it shouldn't be linked to anything other than a title that paraphrases said character (e.g. [[Number sign|#]]). Making a disambiguation page (then piping the link to a more specific topic because linking to disambiguation pages is bad) was a mistake in my opinion. And nothing on Letterlike Symbols should link to wikt. Probably only the CJK Ideographs and such (which represent whole words and where wikt has, or should have, a page of that exact title which Wikipedia will never have) should link to wikt. This could be added as a separate link=wikt mode.
  • 12, continued. If the character title is a redirect to some other page (such as a list of emojis, or an article about the subject represented by some symbol), that's fine. Someday the character itself might become a separate article, which is also fine. The template need not know or care about that. I'm thinking a list of link aliases for bad-title chars (mapping '#' to Number sign and so on) would be a good solution. But only if we're going to be linking the characters at all, which is unclear.
  • 13.  Done I did keep the optional start/end parameters, because I figured subdivision would be wanted in some blocks for reasons including hugeness. Note that these need not be multiples of 16. The module will pad leftover cells accordingly with <td class="excluded"> which is currently styled the same as class="reserved" but this can be changed.
    • start/end parameters have been scrapped in favor of a single range parameter which can contain multiple ranges (connected by hyphen or en dash, and separated from each other by comma, whitespace, the word "and", or in fact anything that's not a hex digit).
  • 14 and 15. If the unicode block display names can't be made to exactly match the "official" names in all cases, we'll need a (hopefully short) list of aliases. Adding a blocknamelink parameter which continues to default to Blockname (Unicode chart) if empty would be easy and sufficient. Let's try to avoid having three sets of names wherever possible.
  •   Done 16. I don't see why not. See 13.

cobaltcigs 18:20, 10 September 2019 (UTC)Reply
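
The singular/plural/blank behaviour described in point 4 above can be sketched as follows. This is only an illustration in Python, not the module's Lua code: the function name `reserved_footnote` and the exact footnote wording are assumptions made for the example.

```python
def reserved_footnote(reserved_count: int) -> str:
    """Return the 'gray areas' footnote text for a chart.

    Hypothetical helper: the wording below is illustrative and is not
    taken from the existing chart templates or from the module.
    """
    if reserved_count < 0:
        raise ValueError("reserved_count must not be negative")
    if reserved_count == 0:
        # No non-assigned code points: omit the notice entirely.
        return ""
    if reserved_count == 1:
        return "The gray area indicates a non-assigned code point."
    return "Gray areas indicate non-assigned code points."


if __name__ == "__main__":
    for count in (0, 1, 5):
        print(count, repr(reserved_footnote(count)))
```

The same three-way switch would apply to any other per-cell category that gets its own footnote (noncharacters, control characters, and so on), each with its own counter.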

I have some follow-up:

DRMcCreedy (talk) 23:09, 10 September 2019 (UTC)Reply

  • 9 I think the current solution to control characters and invisible format characters is best, i.e. use the acronym or abbreviation in a dotted square, following the example of the official Unicode code charts. The new Basic Latin and Latin-1 Supplement charts show the control codes as reserved which is incorrect (they are assigned, with the general category Cc, but do not have formal character names, although they do have formal character name aliases). I also notice that U+003D (=) and U+007C (|) do not display properly. BabelStone (talk) 11:02, 11 September 2019 (UTC)Reply

Update:

  • I've restored the refs parameter. Any refs inputted here will be numbered before the auto-generated refs. Perhaps I should also have it sanitize anything that's not actually a <ref> by wrapping it in a <ref> tag so it doesn't appear in the title bar.
  • I prefer having the auto-generated refs first; that way the version note, which covers the whole chart, is the very first one, and additional notes, usually covering just a few codepoints, are at the end. This is just a preference. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)
  • I've added a range parameter that allows multiple ranges to be specified. Potentially in the wrong order, even. Perhaps they should be force-sorted ascendingly. And sanitized to avoid duplication due to overlap.
  • Black blocks were actually easy to detect. Previous code assumed anything containing "<" was <reserved-NNNN> when it can actually be <noncharacter-NNNN> or <control-NNNN>. Whoops. It's all right there in Module:Unicode data. Will work on control chars next.
  • I've discovered Module:Unicode data/aliases includes (among other things) abbreviations for control characters. It does in fact use PAD and HOP.
  • The three characters that Unicode displays as "XXX" do indeed have abbreviations in NameAliases.txt but they all have a type of "figment" as in "figment of one's imagination". I feel strongly that we shouldn't assign abbreviations to the charts that contradict the ones used in the actual, cited Unicode charts. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)Reply
  • I gave the control characters a light blue background and an explanatory footnote similar to those for RESERVED and NONCHARACTER. Also dashed boxes around the abbreviations, which are loaded from here. Some have multiple abbreviations. The current behavior is to choose the last one, because at brief glance that seemed most correct in most cases. I'd rather we move the "official" or preferred abbreviation to the top and consistently select the first one instead. I've yet to research what, if anything, might be broken by changing abbreviation order.
  • Module:Unicode data/aliases is generated from Unicode's NameAliases.txt file. It looks like it is in the same order, so any tweaking we do to the order would be problematic when the file is updated. If we changed the script that creates the aliases we would just be moving the logic from the chart script to the generation script. Other users of the aliases may not have the same requirement, so I think the determination of what to use in the charts belongs in the chart script. I have another abbreviation issue but I'll do that in a new section for clarity. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)

cobaltcigs 09:17, 12 September 2019 (UTC)Reply
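
For the range parameter, the parsing, force-sorting and overlap clean-up mentioned above could look roughly like this. It is a Python sketch under stated assumptions, not the module's actual Lua: `parse_ranges` is a hypothetical name, and the use of word boundaries (so that the "a" and "d" in the word "and" are not mistaken for hex digits) is a choice made for this example.

```python
import re

# One hex number, optionally followed by a hyphen or en dash and a second
# hex number. The \b anchors keep letters inside ordinary words such as
# "and" from being read as code points (an assumption of this sketch).
HEX_RANGE = re.compile(
    r"\b([0-9A-Fa-f]+)\b(?:\s*[-\u2013]\s*\b([0-9A-Fa-f]+)\b)?"
)


def parse_ranges(text):
    """Parse a range string into sorted, non-overlapping (start, end) pairs."""
    ranges = []
    for match in HEX_RANGE.finditer(text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        if end < start:
            # Tolerate reversed endpoints.
            start, end = end, start
        ranges.append((start, end))

    ranges.sort()
    merged = []
    for start, end in ranges:
        # Merge ranges that overlap or touch the previous one.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    for lo, hi in parse_ranges("FB50-FDFF and 0600\u201306FF, 0650-0700"):
        print(f"U+{lo:04X}..U+{hi:04X}")
```

A separator word made up entirely of hex letters (for example "dead") would still be read as a code point here, so a real implementation might want stricter rules.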

The font problem, explained

The only way to load custom css definitions is through the <templatestyles src="Template:Something/something.css" /> extension tag. This can be produced in the module by preprocessing the previous wikitext/pseudo-html, or by using frame:extensionTag{ name = 'templatestyles', args = { src = '...'} } to the same effect. Either way, the src page must be of "content model: Sanitized CSS" meaning it must be in the template namespace and have a title ending with ".css" which puts you in a mode that checks for syntax errors and disallows the use of templates, modules, parserFunctions, or anything other than hard-coded css (with a few features excluded for security reasons).

In practice that means there's no way for a template/module parameter such as | font = font-family: 'DejaVu Sans', 'FreeSans', 'Lucida Sans Unicode'; font-size: 1.25em; (or for any string of text obtained or composed at module runtime) to create a reusable css class. So any user-supplied font specs would need to be hard-coded as a style attribute to be used at all. Workarounds include, in descending order of sloppiness:

  1. Duplicating that much code in the style attribute of every single td cell (which would be stupid as hell).
  2. Assigning the bulky style="..." crap one time only to the root table element, then having the th { ... } css (conveniently everything that's not a character cell td is a th) loaded from here attempt to negate any foreseeable user input back to the default so that the table's style attribute appears to only affect the td (codepoint grid) cells. This would be very difficult to do well, considering the defaults we'd seek to revert to could differ according to user skin and other environmental factors.
  3. Continue using {{script}} within each cell and suffer its inefficiency and incompleteness.
  4. Placing this much css (more to be added later) on a single acceptable css source page, then importing it via templatestyles.
  5. Make a better version of Template:Script by dividing the css into 154 one-liner subpages of CSS, each named to reflect the ISO 15924 code, and imported only when the need for it is detected (using this). Needing more than one in the same table will most likely be rare, so the question of how many small loads are processor-equal to one big load is probably not even worth testing.
  6. Avoid forking and turn the original Template:Script into what we want (use consistent names, include everything, and use a module instead of the switch statement and sub-template spaghetti logic).

I'm prepared to go with #4 for now, then upgrade to #5–6 only after all the other issues are addressed. ―cobaltcigs 09:17, 12 September 2019 (UTC)Reply

I've never been very keen on specifying fonts on the Wikipedia side, because 1) most fonts for most Unicode scripts are not available on most users devices without downloading them; 2) in the past editors have tended to specify fonts that they have on their own system so that it looks nice for them, without considering other users; and 3) the Wikipedia specified fonts may override users' font preferences set in their browser (or in Wikipedia settings). Personally I would rather not specify any fonts, and leave it to the user's browser to apply an appropriate font, but I know that this is a minority view, so I'm OK with your suggested solution. BabelStone (talk) 13:06, 12 September 2019 (UTC)Reply
My understanding was that certain browsers would show the little squares even if a suitable font was installed, unless specifically told to use that font. I have no idea whether this is (still?) accurate. I suppose I could add a parameter like fonts=off. Then we could ask several Windows users whether all the charts look okay with no fonts specified. ―cobaltcigs 19:04, 16 September 2019 (UTC)
  Done fonts=off parameter now exists as an option. ―cobaltcigs 21:38, 17 September 2019 (UTC)Reply

Formatting abbreviations

Besides worrying about which abbreviations are used in the charts, there's an issue of formatting. Today, long ones are often split into two or more lines to control the width of the chart. An extreme example is NULL NOTE HEAD in Template:Unicode chart Musical Symbols but this practice happens in other places like Template:Unicode_chart_Mongolian and Template:Unicode chart Variation Selectors Supplement. I haven't checked to see if the abbreviations are always in a dashed box but maybe we could have a parm like ...|abbr|1D15|{{resize|75%|NULL<br />NOTE<br />&nbsp;HEAD&nbsp;}} to preserve the ability to format these in the current fashion. In any case, formatting is something to consider. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)Reply

Eww. See User:BabelStone/sandbox#Musical Symbols for an attempt to replicate that (without any <br />&nbsp; crap, which is great!). Note that 1D173–1D17A are identified as "format" characters in this file, but "NULL NOTE HEAD" is not. Hence the difference in css/color. The pink can of course be changed later. ―cobaltcigs 20:45, 13 September 2019 (UTC)Reply
Wow, I've never realised that U+1D159 is not a format character. Are there any other characters displayed as a dashed box around text that are not format or control characters? I don't think so (variation selectors are gc=Mn). The worrying thing is there seems to be no way of extracting the information from the UCD, so it relies on visually checking the Unicode code charts, but what if it changes suddenly to a graphic character in a new version of Unicode? My gut feeling is that gc=So is wrong if the character has no visible glyph and is not whitespace. BabelStone (talk) 22:52, 13 September 2019 (UTC)Reply
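
For anyone wanting to spot-check general categories outside the wiki, Python's standard `unicodedata` module exposes the same General_Category values. The sketch below only checks the three code points discussed in this thread; the helper name `general_category` is made up for the example, and the values reported depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def general_category(code_point: int) -> str:
    """Return the two-letter General_Category abbreviation for a code point."""
    return unicodedata.category(chr(code_point))


if __name__ == "__main__":
    # U+1D159 NULL NOTEHEAD, U+1D173 BEGIN BEAM, U+FE00 VARIATION SELECTOR-1
    for cp in (0x1D159, 0x1D173, 0xFE00):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={general_category(cp)}")
```

This does not answer which characters the code charts draw as dashed boxes; as noted above, that information does not appear to be derivable from the category alone.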
I couldn't immediately work out where you are specifying a smaller font size for "NULL NOTE HEAD" compared with "Begin Beam" etc. I think that all the dashed boxes need a smaller font size because (on my system at least) the dashed letters are much larger size than Basic Latin letters, and make the cells overwide. Can we simply add "font-size:75%" for td.box in Template:Unicode chart/styles.css, or is there more to it? BabelStone (talk) 23:30, 13 September 2019 (UTC)Reply
This text uses span.small-1 { font-size:80%; } span.small-2 { font-size:59%; } wherein the suffix digit is determined by the number of spaces converted to linebreaks in whatever text is shown (which may be read from the aliases file or from a display_NNNN override parameter). Then the property white-space:pre; forces \n to show up as literal linebreaks so we don't have to resort to <br />. Thus one-word abbreviations such as ACK use the same size as regular chars. All of this can be easily changed. For now, I've tightened the dashed box and cell margins/padding a little bit. ―cobaltcigs 10:08, 14 September 2019 (UTC)Reply
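
The class-selection rule just described can be sketched as below. This is an illustrative Python rendering, not the module's Lua; the function name `format_abbreviation` is hypothetical, and capping the suffix at 2 (because only small-1 and small-2 are defined above) is an assumption of the sketch.

```python
def format_abbreviation(text: str):
    """Split an abbreviation onto separate lines and pick a size class.

    Returns (display_text, css_class). css_class is None for one-word
    abbreviations, which keep the regular character size.
    """
    words = text.split()
    breaks = max(len(words) - 1, 0)  # spaces converted to line breaks
    if breaks == 0:
        css_class = None
    else:
        # Assumption: reuse small-2 for anything with more than two breaks.
        css_class = f"small-{min(breaks, 2)}"
    # With white-space:pre on the cell, "\n" renders as a line break.
    return "\n".join(words), css_class


if __name__ == "__main__":
    for label in ("ACK", "Begin Beam", "NULL NOTE HEAD"):
        print(format_abbreviation(label))
```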

Version

There have been many past discussions about how to determine which Unicode version to show in the footnote of the chart. Because they were manually updated, it wasn't practical to have a master switch for the version. If the charts are created using Module:Unicode data it might be possible to do away with the mindless updating I do once a year for all the charts. A new Module:Unicode data/version item could be added that is manually updated after all of the other Module:Unicode data files are updated. Basically, it's just a string field to say "We've updated all the other data to version x". If the version footnote was pulled from that string, it would alleviate a lot of manual effort. It would mean adding Module:Unicode data/version to the list of "regenerate the charts if tables x, y, and z change". DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)Reply

FYI: After a few updates, all of the Module:Unicode data subpages are now up-to-date (Unicode version 12.1). DRMcCreedy (talk) 04:45, 14 September 2019 (UTC)Reply
I do like the idea of centralizing the version string. Even as a single-purpose one-liner module return "12.1" would be fine. ―cobaltcigs 10:17, 14 September 2019 (UTC)Reply
  Done ―cobaltcigs 09:25, 17 September 2019 (UTC)
P.S. Can I get a complete list of subpages that have actually changed so I can update my localhost wiki (on which I test most of this stuff before posting) accordingly? ―cobaltcigs 10:42, 14 September 2019 (UTC)Reply
I updated Module:Unicode data/category, Module:Unicode data/control, Module:Unicode data/scripts, Module:Unicode data/names/002, and Module:Unicode data/names/003. Some changes were unrelated to the release of v12.1. For example, U+2BC9, 2BFF, and 2E4F were missing for some reason. DRMcCreedy (talk) 16:43, 14 September 2019 (UTC)Reply
On Wiktionary I updated the U+2xxx names module and several others for 12.0 back in March, but I didn't bother with the Wikipedia modules because they weren't being used. But I'm glad to see that now they are at last. — Eru·tuon 09:01, 16 September 2019 (UTC)

Perhaps it would be helpful, now, to put an edit notice on all Unicode data subpages that says "Please remember to update Module:Unicode data/version if applicable" etc. ―cobaltcigs 09:25, 17 September 2019 (UTC)Reply

Pink cells

A footnote says “Pink cells indicate non-printable format characters.” That is untrue: they currently indicate all format characters, some of which are printable. It would be more useful, I think, to highlight default ignorable characters. Gorobay (talk) 03:27, 14 September 2019 (UTC)Reply

Okay. I shall try to figure out how to distinguish between these using the available data modules. ―cobaltcigs 11:19, 14 September 2019 (UTC)Reply
There wasn't a data module for the Default Ignorable property, so I added is_default_ignorable to Module:Unicode data/sandbox and created Module:Unicode data/derived core properties. — Eru·tuon 10:06, 16 September 2019 (UTC)Reply
Alternatively I could just take out the word "non-printable" (my own misconception at the time I wrote it) and make the existing footnote statement correct.
But assuming "default-ignorable" is the more important concept, is my understanding correct that the characters identified by is_default_ignorable:
(a) are a proper superset of format characters, and
(b) are disjoint from the set of control characters?
This is important to the extent that we can't highlight the same cell in two colors. The td.format css would be retired if we go this route, but the format designation would still be conveyed in the info panel below, which is (relatively speaking) a much newer feature than the pink highlight.
Counting the default-ignorables and determining whether to show the footnote as plural/singular/none (just like the others) will be a two-minute coding task given the available functions. What kind of verbiage would we want in the footnote for the default-ignorables? Seems like we should try to briefly explain what that means. ―cobaltcigs 18:43, 17 September 2019 (UTC)Reply

General Punctuation, row U+206x

I've noticed on User:BabelStone/sandbox#General Punctuation that 2061–2064 and 206A–206F appear to have neither a visible glyph nor an official abbreviation. Assuming this isn't simply a font deficiency on my end, would inserting things like this based on the pdf be wildly inappropriate (see parameters)? And I don't just mean the words ASS and NADS, lol. ―cobaltcigs 11:19, 14 September 2019 (UTC)Reply

[Demo chart: General Punctuation, row U+206x, columns 0–F. Surviving cell labels: WJ, NO DS.]

Notes:
  1. As of Unicode version 16.0.
  2. Pink cells indicate format characters.

I can't find a file in the Unicode Character Database that lists the display forms for the dotted box characters. They aren't in NamesList.txt, which is parsed into the PDF that you linked to. So they would have to be gathered manually from the PDFs, unless they can be found somewhere else. — Eru·tuon 04:13, 18 September 2019 (UTC)Reply
As far as I know, there isn't anything in the UCD. I've always determined dotted box notation manually. BTW: I think the display_20xx parms above are appropriate. DRMcCreedy (talk) 04:40, 18 September 2019 (UTC)Reply
To clarify, "manually" would mean by visual approximation. Copy/paste gives us private-use codepoints assigned to arbitrary glyphs which represent the whole abbreviation (in some font that probably doesn't exist outside the PDF). So much eww. ―cobaltcigs 13:39, 18 September 2019 (UTC)Reply
If you're interested, the fonts with the dashed glyphs (SpecialsUC4/5/6.ttf) are bundled with the free Unibook application that is used to generate the Unicode and ISO/IEC 10646 code charts. BabelStone (talk) 16:06, 18 September 2019 (UTC)Reply

Info panel demo

Lua error in Module:Unicode_chart at line 384: assign to undeclared variable 'isCombining'.

Click the links and hold your breath. ―cobaltcigs 03:13, 16 September 2019 (UTC)Reply

Thanks, I like the idea, especially showing a large version of the character. I think the U+ and character name do not need to be in a huge bold font (maybe just normal bold), and the "(assigned)" is redundant -- using a normal font and removing "assigned" should also reduce the annoying horizontal expansion and contraction of the box as you click on characters with names of different lengths. I suppose the UTF-8 is useful to some people, but I would remove the characterization of the UTF-8 hex values as I cannot see how they could be useful. BabelStone (talk) 09:05, 16 September 2019 (UTC)
  • "assigned" is the default phrase returned when a character in question is not "control", "format", "surrogate", "private-use", "unassigned", "space-separator", "line-separator", or "paragraph-separator". The struck-out categories will probably never be part of any chart, which leaves five that are potentially interesting.
  •   Done Making the chart stay continuously at width: 100%; would probably help. Setting th to 8% and td to 5.75% would add up to the same (8% + 16 × 5.75% = 100%), and might also be helpful.
  • I've got it loading named character entity references from a subpage in addition to calculating the numeric ones, which is probably the single crowdpleasingest information here. The UTF-8 is of interest to the extent that it's what our urlencoding uses (Δ is 0xCE 0x94 and %CE%94). UTF-16 less so, but I thought about it.
  •   Removed The mojibake depiction of these bytes as separate chars was slightly helpful when debugging but not meant as a serious feature.
cobaltcigs 10:13, 16 September 2019 (UTC)Reply
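
The relationship between the UTF-8 bytes, the URL encoding and the character references shown in the info panel can be reproduced with a few lines of Python. This is only a sketch using the standard library: the function name `describe` and the dictionary keys are invented for the example, and Python's `html.entities` table stands in for the subpage of named references mentioned above.

```python
from html.entities import codepoint2name
from urllib.parse import quote


def describe(char: str) -> dict:
    """Collect the encodings of a single character."""
    code_point = ord(char)
    utf8_bytes = char.encode("utf-8")
    entity_name = codepoint2name.get(code_point)  # None if no named entity
    return {
        "code_point": f"U+{code_point:04X}",
        "utf8_hex": " ".join(f"0x{byte:02X}" for byte in utf8_bytes),
        # Percent-encoding is just the UTF-8 bytes written as %XX.
        "percent_encoded": quote(char, safe=""),
        "numeric_entity": f"&#x{code_point:X};",
        "named_entity": f"&{entity_name};" if entity_name else None,
    }


if __name__ == "__main__":
    for key, value in describe("\u0394").items():
        print(f"{key}: {value}")
```

For Δ this gives 0xCE 0x94 and %CE%94, matching the figures quoted above.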
Nice method – I was surprised that it could be done without JavaScript! Maybe instead of the values from Module:Unicode data/control, which include only some of the General Categories, the table could show the long name of the actual General Category. (I've added the long names of the General Categories to Module:Unicode data/category.) — Eru·tuon 10:37, 16 September 2019 (UTC)Reply
It relies upon the css :target selector to show/hide the panel for any given codepoint. I think this would be nearly adequate if not for the vertical anchor-jumping. I suppose moving the info panel to the top (below the pdf link and above the column headers) would make it slightly less annoying, but it would look weird. Another consequence of this is that whenever multiple charts are present on the same page, opening an info panel on one chart will close the info panels on all the others. So using Javascript would probably be better. It would only require convincing the right people that the feature is worthwhile and not too app-like.
For now I've reduced the size of the bold-face character name from 125% to 110%, set the root table element to full page-width, and set the columns to fixed percentages that add up to 100%.
I've also removed the 'Amiri' font from the .script-Arab css class, because it makes the U+FDFD ligature wide enough to make these percentages meaningless. I don't know if other characters are similarly affected. I'll need to install the first three fonts to test whether they have the same problem (or, indeed, others).
I've now made it pull the "long name" (which appears to always be more interesting than the word "assigned") from Erutuon's info. Hopefully it's never nil and hopefully the extra info won't be overwritten by updates.
cobaltcigs 20:50, 16 September 2019 (UTC)Reply
You can rely on lookup_category never returning nil (at least when supplied a valid code point); memo_lookup guarantees that. The return value is either a "real" category when the code point is found in singles or ranges or Cn (Unassigned). — Eru·tuon 22:40, 16 September 2019 (UTC)Reply
Oops. Actually, what I said is true of Module:Unicode data/sandbox, but at the moment Module:Unicode data is buggy. — Eru·tuon 23:35, 30 September 2019 (UTC)Reply
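The "never nil" guarantee described above can be sketched as follows. This is a Python stand-in, not the Lua code in Module:Unicode data; the table contents are a tiny hypothetical sample and the real `memo_lookup` helper is not reproduced. The point is only the shape of the logic: check the singles, then the sorted ranges, and fall back to Cn.

```python
from bisect import bisect_right

# Hypothetical sample data; the real tables live in Module:Unicode data/category.
SINGLES = {0x00AD: "Cf"}
RANGES = [  # (first, last, category), sorted by first code point
    (0x0041, 0x005A, "Lu"),
    (0x0061, 0x007A, "Ll"),
    (0xE000, 0xF8FF, "Co"),
]
_STARTS = [first for first, _, _ in RANGES]


def lookup_category(cp: int) -> str:
    """Return a General Category for cp, defaulting to Cn (Unassigned)."""
    if cp in SINGLES:
        return SINGLES[cp]
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, category = RANGES[i]
        if cp <= last:
            return category
    return "Cn"  # never None: anything not found counts as unassigned
```

Because the final `return` is unconditional, a caller never has to guard against a missing value for a valid code point.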

Selectability: CSS vs. plain text

Putting general category after character name is good; show/hide is good; 100% width of chart is very good. At present you cannot select and copy the entire info panel information: UTF-8 and HTML headings, as well as parentheses around general category, are not selected, and there is no space between character name and general category so the two are concatenated on copy. Can we make all parts of the info panel copyable, and separate parenthesised general category from character name by a space character rather than putting in different cells? BabelStone (talk) 10:15, 17 September 2019 (UTC)Reply

What you see there is actually an intentional css effect (see instances of :before { content: 'foo'; } on the styles.css page). This is similar to navboxes (example) where they use a spaced U+00B7 MIDDLE DOT (' · ') as a non-selectable separator for list items (<li>). Here I've used commas and spaces instead, and also used the same technique for ul:before list labels. It could all easily be reverted to plain text. I'll await further discussion about whether it should, because it did take a bit of work to make it look right. And really, the whole idea here was actually to help users copy the &foo; html character entity reference without accidentally including the adjacent comma. ―cobaltcigs 12:01, 17 September 2019 (UTC)Reply
Ah, that explains it. Personally, I prefer plain text so that the user can select everything. I think we can dispense with the comma between HTML forms (semi-colon followed by a comma just looks weird), and separate them with a space. BabelStone (talk) 16:11, 18 September 2019 (UTC)Reply

Actual aliases vs. corrections

Can we have a demo of the info panel for a block with one or more characters that have a formal alias? I suggest Vertical Forms with its horrendously long name and alias for FE18. BabelStone (talk) 10:22, 17 September 2019 (UTC)Reply

Eww, a spelling error ("BRAKCET"). So the correctly spelled name is currently not loaded at all because it's recorded in the aliases file as a correction rather than an alias. Aliases are currently loaded by the module (see control characters in the Latin chart above), whereas corrections will be a new concept which I'm not yet sure how best to handle. Do we want to show the misspelled title (maybe with a {{sic}} tag, even) and note the correction as such on the next line? Or should we just replace it outright without comment? I suppose I'll begin reviewing the other corrections vs. what names they are correcting, to see how trivial or major their differences tend to be. For now, here's what the Vertical Forms block currently looks like: ―cobaltcigs 12:01, 17 September 2019 (UTC)Reply
[Chart not rendered. Lua error in Module:Unicode_chart at line 384: assign to undeclared variable 'isCombining'.]

Complete list of corrections (28) to consider

# | Codepoint name | Correction
U+01A2 | LATIN CAPITAL LETTER OI | LATIN CAPITAL LETTER GHA
U+01A3 | LATIN SMALL LETTER OI | LATIN SMALL LETTER GHA
U+0709 | SYRIAC SUBLINEAR COLON SKEWED RIGHT | SYRIAC SUBLINEAR COLON SKEWED LEFT
U+0CDE | KANNADA LETTER FA | KANNADA LETTER LLLA
U+0E9D | LAO LETTER FO TAM | LAO LETTER FO FON
U+0E9F | LAO LETTER FO SUNG | LAO LETTER FO FAY
U+0EA3 | LAO LETTER LO LING | LAO LETTER RO
U+0EA5 | LAO LETTER LO LOOT | LAO LETTER LO
U+0FD0 | TIBETAN MARK BSKA- SHOG GI MGO RGYAN | TIBETAN MARK BKA- SHOG GI MGO RGYAN
U+11EC | HANGUL JONGSEONG IEUNG-KIYEOK | HANGUL JONGSEONG YESIEUNG-KIYEOK
U+11ED | HANGUL JONGSEONG IEUNG-SSANGKIYEOK | HANGUL JONGSEONG YESIEUNG-SSANGKIYEOK
U+11EE | HANGUL JONGSEONG SSANGIEUNG | HANGUL JONGSEONG SSANGYESIEUNG
U+11EF | HANGUL JONGSEONG IEUNG-KHIEUKH | HANGUL JONGSEONG YESIEUNG-KHIEUKH
U+2118 | SCRIPT CAPITAL P | WEIERSTRASS ELLIPTIC FUNCTION
U+2448 | OCR DASH | MICR ON US SYMBOL
U+2449 | OCR CUSTOMER ACCOUNT NUMBER | MICR DASH SYMBOL
U+2B7A | LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE | LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE
U+2B7C | RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE | RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE
U+A015 | YI SYLLABLE WU | YI SYLLABLE ITERATION MARK
U+FE18 | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
U+122D4 | CUNEIFORM SIGN SHIR TENU | CUNEIFORM SIGN NU11 TENU
U+122D5 | CUNEIFORM SIGN SHIR OVER SHIR BUR OVER BUR | CUNEIFORM SIGN NU11 OVER NU11 BUR OVER BUR
U+16E56 | MEDEFAIDRIN CAPITAL LETTER HP | MEDEFAIDRIN CAPITAL LETTER H
U+16E57 | MEDEFAIDRIN CAPITAL LETTER NY | MEDEFAIDRIN CAPITAL LETTER NG
U+16E76 | MEDEFAIDRIN SMALL LETTER HP | MEDEFAIDRIN SMALL LETTER H
U+16E77 | MEDEFAIDRIN SMALL LETTER NY | MEDEFAIDRIN SMALL LETTER NG
U+1B001 | HIRAGANA LETTER ARCHAIC YE | HENTAIGANA LETTER E-1
U+1D0C5 | BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS | BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
If we do want the corrections to appear below the boldface name, similarly to aliases (see screenshot from ye olde localhoste), I'm ready to update this module accordingly. Note: I did check the list and confirm no codepoint has both a correction and an alias. Perhaps we also want some kind of footnote explaining aliases and corrections to the reader, but I'll hold off on that. ―cobaltcigs 13:55, 17 September 2019 (UTC)Reply
I like how the correction is shown directly below the name in your screenshot; it makes it easy to compare the two. — Eru·tuon 17:57, 17 September 2019 (UTC)Reply
  Done And with that in mind I've also removed the font-size enlargement of the bold-face character name. ―cobaltcigs 18:17, 17 September 2019 (UTC)Reply
The code point name is immutable so it should always be shown as-is, as you're doing (even when it's clearly wrong). As far as the data is concerned, "correction" is just another type of alias like "alternate", "abbreviation", and "figment". I think all alias types should be shown using the "Type: ALIAS" format without need for an explanation. It looks like that isn't being done for code points like U+0093, etc. Lastly, I wouldn't count on there never being a second alias to a code point with a correction type alias. There's no such restriction, even though that's the case right now. DRMcCreedy (talk) 18:56, 17 September 2019 (UTC)Reply
For the sake of clarity I'll use U+000A as a more extreme example. Are you saying it should look like this?
(visually at least; never mind the style attributes approximating current css class effects; actual markup will be much shorter)
LF
U+000A <control>
(control)
Aliases:
  • LINE FEED
  • NEW LINE
  • END OF LINE
  • LF
  • NL
  • EOL
[other stuff below...]
i.e. putting aliases of all types (including abbreviations) in a single list, in the order given in the aliases file, with zero regard for what type of alias they are, and without choosing any of them to replace the name <control> at the top. I can do that once certain that's what you mean. Let's also revisit the question of how to decide which of multiple abbreviations should be shown in the box. ―cobaltcigs 20:09, 17 September 2019 (UTC)Reply
Related: Can I also get your opinion on whether to put atypical abbreviations in the boxes for #General Punctuation, row U+206x above? ―cobaltcigs 20:15, 17 September 2019 (UTC)Reply
Yes. I'd display all of the aliases in the order they appear in NameAliases.txt (which is preserved in Module:Unicode data/aliases). But I also think the type of alias is useful to know. My preference would look like this:
LF
U+000A <control>
(control)
Control: LINE FEED
Control: NEW LINE
Control: END OF LINE
Abbreviation: LF
Abbreviation: NL
Abbreviation: EOL
[other stuff below...]
  Done, see #info-000A above. Using <ul> because <br /> is for poetry and mailing addresses. And I've just noticed the word "alias" won't actually appear to the reader. ―cobaltcigs 17:09, 18 September 2019 (UTC)Reply
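The "Type: ALIAS" layout agreed above can be sketched like this. It is a Python illustration only, not the module's Lua; the `ALIASES` table is a hand-copied sample in the order given in NameAliases.txt, and the function name `alias_lines` is hypothetical.

```python
# Sample alias data in file order: (type, alias). Only two code points are included here.
ALIASES = {
    0x000A: [
        ("control", "LINE FEED"),
        ("control", "NEW LINE"),
        ("control", "END OF LINE"),
        ("abbreviation", "LF"),
        ("abbreviation", "NL"),
        ("abbreviation", "EOL"),
    ],
    0xFE18: [
        ("correction", "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"),
    ],
}


def alias_lines(cp: int) -> list[str]:
    """One 'Type: ALIAS' line per alias, preserving file order; empty list if none."""
    return [f"{kind.capitalize()}: {alias}" for kind, alias in ALIASES.get(cp, [])]


for line in alias_lines(0x000A):
    print(line)
```

Treating "correction" as just another alias type means no special case is needed if a code point ever gains both a correction and a second alias.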
As far as which abbreviation to use in the Wikipedia chart, I think it should match the official, cited Unicode chart. I'm guessing that a lot of them match the first/only abbreviation type of named alias but obviously not always. As you mentioned, U+206x is a good example of chart abbreviations that don't match named aliases. I'm thinking a table of chart abbreviations would be required. You could probably default the chart abbreviation if no exception is found but would it be worth the processing to not find a match first or is it faster to just add them all to a table?
My concern with using different chart abbreviations than Unicode is that there is no right answer. If someone were to change the Wikipedia chart abbreviation for U+000A from LF to NL would that be wrong/revertable? What about LINE? Or LFEED? If we don't have a definitive way to determine the chart abbreviation we open ourselves up to edit wars. Being able to cite the actual Unicode chart gives us one, definitive chart abbreviation.
Great work so far, BTW. DRMcCreedy (talk) 22:10, 17 September 2019 (UTC)Reply
Okay, clearly I misinterpreted "I think all alias types should be shown using the Type: ALIAS" to mean "replace more specific alias-type labels with the word ALIAS". Makes a lot more sense with a picture drawn, glad I asked.
So my actual concern about U+206x is that stand-in symbols might be mistaken for the actual glyph even by readers otherwise familiar with "normal" control/format character abbreviations which consist of multiple capital letters. So some explanatory footnotes might really be needed there.
Agreed. My first draft of a note would be "A dashed box indicates characters which normally have no visible display or only modify the display of other characters. "Dashed Box Convention" (PDF). Unicode Consortium."
The citation might be overkill. Although the nuances are pretty complicated so maybe the citation is justified. DRMcCreedy (talk) 02:04, 18 September 2019 (UTC)Reply
Currently the display text can be overridden from the calling environment (ultimately, a block-specific template) for all assigned codepoints with few restrictions,[1] which has been done in the U+206x example (and less constructively in the "Vulgar" Latin sandbox section). If we do load a master list of favored abbreviations from a sub-module (containing everything from LF to NULL NOTE HEAD), the display_NNNN = FOO parameters could be totally deleted.
  Done and   Removed
cobaltcigs 23:14, 17 September 2019 (UTC)Reply
Oops, I completely forgot about the display_NNNN = FOO parm. I like the idea of a master list because it centralizes the data but either approach will work. DRMcCreedy (talk) 02:04, 18 September 2019 (UTC)Reply
+1 for a master list. BabelStone (talk) 16:13, 18 September 2019 (UTC)Reply
  Done ―cobaltcigs 06:42, 19 September 2019 (UTC)Reply
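The master-list approach with a default, as discussed above, might look roughly like the following. This is a hedged Python sketch rather than the actual contents of Module:Unicode chart/display; the function name `chart_label` and both sample tables are assumptions made for illustration.

```python
from typing import Optional


def chart_label(cp: int, master: dict, abbreviations: dict) -> Optional[str]:
    """Pick the text shown in a chart cell for a character with no visible glyph.

    master:        code point -> abbreviation taken from the cited Unicode chart
    abbreviations: code point -> abbreviation-type aliases, in file order
    """
    if cp in master:  # the centrally maintained list always wins
        return master[cp]
    candidates = abbreviations.get(cp, [])
    # Otherwise default to the first abbreviation alias, if there is one.
    return candidates[0] if candidates else None


# Illustrative sample tables, not the real data.
master = {0x000A: "LF"}
abbreviations = {0x000A: ["LF", "NL", "EOL"], 0x00A0: ["NBSP"]}

print(chart_label(0x000A, master, abbreviations))  # from the master list
print(chart_label(0x00A0, master, abbreviations))  # falls back to the first alias
```

A single dictionary lookup per cell is cheap either way, so the choice between listing every abbreviation and listing only the exceptions is mostly about which table is easier to maintain.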

References

  1. ^ Exception: whitespace characters, where the main grid disregards all abbreviations real or fake, instead forcing white-on-green rectangular display of the literal character to show relative size (and allow user to select/copy just like any other printable character). This differs from the source material but seems beneficial enough to justify. So for these codepoints, only in the lower info panel can the display text such as NBSP actually be overridden.

Master list complete

See Module:Unicode chart/display and make any corrections/amendments as needed. Maybe I missed a few reading all those PDFs. Except for the CJK blocks where even "skimming" would be too generous a term. display_NNNN params will be whacked soon. ―cobaltcigs 04:38, 19 September 2019 (UTC)Reply

  Removed ―cobaltcigs 06:42, 19 September 2019 (UTC)Reply
I've reviewed the list and made some changes. DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)Reply

Going horizontal

I've made the utf/html info slide to the right rather than downward when an alias list is present. Seems like a more efficient use of space. Seems to look okay next to the infamous BRAKCET correction, which I've confirmed is the longest string in the alias file. ―cobaltcigs 20:05, 18 September 2019 (UTC)Reply

I don't like the other information being forced to the right when there's an alias. It's unexpected and I don't think the savings in vertical space make up for it. Sorry, it just looks misaligned to me.
Unrelated to the down vs. side option, I have two comments on the displayed information when you click on a code point:
First, can we move the hex HTML escape sequence before the decimal one (&#x... / &#...)? I've never understood why someone would go through the trouble of calculating the decimal value of a code point in order to create an HTML escape sequence but maybe that's just me. In any case, having the hex value first would align nicer with the UTF-16 information directly above it. Hopefully the hex usage is more common anyway so it would make sense putting it first.
Second, instead of the wording "Introduced in Unicode version x", I'd like to use more precise wording that the source uses.[1] This wording change seems trivial but it gets around the messy issue of various pre-1.1 characters. If Age is 1.1 (the earliest shown in the file), it would say "Assigned as of Unicode 1.1". Otherwise it would say "Newly assigned in Unicode x". Thanks. DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)Reply
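The wording rule proposed above is simple enough to state as code. This is a Python sketch under the assumption that the Age value arrives as a version string such as "1.1" or "12.0"; the function name `age_phrase` is hypothetical and the module itself is Lua.

```python
def age_phrase(age: str) -> str:
    """Turn an Age property value into the proposed info-panel wording."""
    if age == "1.1":
        # 1.1 is the earliest version recorded, so it covers all pre-1.1 characters too.
        return "Assigned as of Unicode 1.1"
    return f"Newly assigned in Unicode {age}"


print(age_phrase("1.1"))
print(age_phrase("12.0"))
```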

Named subsets added

To more thoroughly address DRMcCreedy's item #13, I've added a way to refer to pre-defined named subsets in lieu of inputting a range. I suppose it may also be feasible to do unions/differences/intersections at some point, if there's a demand for it.

[Chart not rendered. Lua error in Module:Unicode_chart at line 384: assign to undeclared variable 'isCombining'.]


Also new is the black line indicating skipped rows. Seems like a helpful feature.

The block name is also optional now. If omitted, there's no PDF link. But we can still set a display title and a link target for the subject. This would allow greater flexibility in generating a chart that transcends block divisions, such as "all control characters" (the subset name for which could be "special" in that it's generated by a function reading an existing data file, rather than hardcoded). But here's a sillier example for now.

[Chart not rendered. Lua error in Module:Unicode_chart at line 384: assign to undeclared variable 'isCombining'.]


cobaltcigs 13:45, 20 September 2019 (UTC)Reply
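To make the named-subset idea concrete, here is a small Python sketch (the module itself is Lua, and none of these names come from it). Subsets are stored as functions so that a "special" one, such as all control characters, can be generated from existing data instead of hardcoded; here Python's `unicodedata` stands in for the module's data files, limited to U+0000..U+00FF to keep the example small. The second function shows how skipped rows could be detected for the heavy black line.

```python
import unicodedata


def latin1_controls() -> set:
    """A generated ('special') subset: General Category Cc within U+0000..U+00FF."""
    return {cp for cp in range(0x100) if unicodedata.category(chr(cp)) == "Cc"}


# Hypothetical subset registry: name -> function returning a set of code points.
SUBSETS = {
    "ascii_digits": lambda: set(range(0x30, 0x3A)),
    "latin1_controls": latin1_controls,
}


def resolve(name: str) -> set:
    return SUBSETS[name]()


def rows_with_gaps(code_points: set) -> list:
    """List the 16-cell rows to draw, with 'GAP' wherever rows were skipped."""
    rows = sorted({cp >> 4 for cp in code_points})
    out = []
    for prev, row in zip([None] + rows, rows):
        if prev is not None and row - prev > 1:
            out.append("GAP")  # where the heavy black line would go
        out.append(f"U+{row:03X}x")
    return out


print(rows_with_gaps(resolve("latin1_controls")))
# Unions, differences and intersections come for free with sets:
combined = resolve("ascii_digits") | resolve("latin1_controls")
```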

I'd lean towards a jagged line like a ripped piece of paper but the thick black line is certainly noticeable enough for the user to realize something's going on. I would, however, like the notes to say "heavy" or "thick" black line because every row has a "black horizontal line". DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)Reply

Orientation of glyphs for vertical scripts

For scripts such as Mongolian and Phags-pa which are written in vertical columns, the glyphs in the font have horizontal orientation so that complete runs of horizontal text can be rotated into vertical orientation by a higher level protocol (commonly CSS). Currently, in our code charts we rotate the glyphs into vertical orientation. This used to match the Unicode code charts, which used to show vertically-oriented glyphs for Mongolian and Phags-pa, but a few years back the editor of the Unicode code charts deliberately changed the Mongolian and Phags-pa code charts to show horizontally-oriented glyphs to reflect how the glyphs are represented at the font level. My question is, should we continue to rotate glyphs in the dynamic Mongolian and Phags-pa charts or should we leave them in horizontal orientation to match the current Unicode code charts? My preference is to rotate into vertical orientation as this matches user expectation (it is how Mongolian and Phags-pa glyphs are presented in books on these scripts). BabelStone (talk) 08:12, 28 September 2019 (UTC)Reply

I don't have a strong preference, although I do think Unicode showing them horizontally seems strange. Vertical seems better. DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)Reply

Unicode 13.0

Unicode 13.0 will be released in March. Can we complete outstanding work on the Unicode chart module by then? Or shall we continue to use the old Unicode chart templates for the Unicode 13 update? BabelStone (talk) 10:16, 10 January 2020 (UTC)Reply

The cell displayed for U+E003B (TAG SEMICOLON) contains a colon in a dashed box instead of a semicolon

The chart shown on page Tags_(Unicode_block) shows the various tag characters as their normal version in a dashed box, but the character shown in the box for U+E003B (TAG SEMICOLON) is a colon instead of a semicolon. I'm not quite sure where/how to update the template. 81.107.76.114 (talk) 00:29, 12 August 2020 (UTC)Reply
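One way to avoid this kind of slip is to derive the displayed character instead of typing it by hand, since each printable tag character sits at a fixed offset from its ASCII counterpart. The following is a Python sketch of that idea only; it makes no assumption about where the module actually stores its display text, and the function name `tag_display` is hypothetical.

```python
TAG_OFFSET = 0xE0000  # U+E0020..U+E007E mirror ASCII U+0020..U+007E


def tag_display(cp: int):
    """Return the ASCII character to draw inside the dashed box, or None if not a printable tag."""
    if 0xE0020 <= cp <= 0xE007E:
        return chr(cp - TAG_OFFSET)
    return None  # e.g. U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG need their own labels


print(tag_display(0xE003B))  # TAG SEMICOLON
print(tag_display(0xE003A))  # TAG COLON
```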

Missing end tag for table

@Cobaltcigs and Erutuon: {{Unicode chart}} has little usage guidance, and I came to Module talk:Unicode chart (this very page), which has 6 missing end tags for <table>...</table>, all associated with {{Unicode chart}}. So I went to Pages that link to "Template:Unicode chart". There are 6 pages that transclude {{Unicode chart}}, and they all have missing end tags for <table>...</table>.

So, my request is to either abandon this project, or write some usage notes that include how to use it without leaving a missing end tag lint error for <table>...</table>. —Anomalocaris (talk) 07:32, 8 October 2023 (UTC)Reply

Vanisaac mistakenly got rid of the end of the table (|}) while inserting this module into Template:Unicode chart. SWinxy added it back, but inside the noinclude tag. I just moved it so that it was transcluded. I'm not sure the module should be in the template at this point because it's still marked as "pre-alpha" and hasn't been worked on since 2019, but I'm not going to try to evaluate that. — Eru·tuon 20:48, 8 October 2023 (UTC)Reply
Ah thank you. I must've thought that Module:Unicode chart somehow emitted a |} upon transclusion of this template, but not when the module was invoked, hence why I put the |} in the noinclude. SWinxy (talk) 21:38, 8 October 2023 (UTC)Reply
Erutuon: Thank you for taking care of this! —Anomalocaris (talk) 22:59, 8 October 2023 (UTC)Reply

Trying again from scratch

When I stumbled across this (April 2024), Template:Unicode chart wasn't working and no one seemed to be actively working on it. I sent a message to User:Cobaltcigs (the last person who edited Module:Unicode chart) and when I didn't hear back, I went ahead and started trying to build my own version in the sandbox. The pages I'm using are:

After a couple of days, I've created something that works in the majority of testcases, although some edge cases for unusual characters still need to be ironed out. You can see my version at:

- Eievie (talk) 18:22, 22 April 2024 (UTC)Reply