Template talk:Wikipedia rank by size

Acebot

User:Acebot was maintaining the rankings, but it stopped in November, both here and at the other language wikis. User:Arystanbek from kkwiki asked on my talk page whether anything could be done. -- GreenC 18:50, 29 May 2020 (UTC)

@Johnuniq: do you think we can make a new module along the lines of Module:NUMBEROF, or possibly incorporate this functionality into that template? Either way, I think the .tab data should be on a different page, as the fields will be different, and to keep the data from getting too large. On the bot side, it would probably download the .tab used for NUMBEROF, crunch it to determine the rankings, and post a new .tab. The template would need to work in both directions, e.g. 1 = enwiki and enwiki = 1. If you are interested in doing the Lua side let me know, or if you have any feedback or ideas. There is also {{Wikipedia rank by size/WP}} used in List of Wikipedias, but at first look I'm not sure what it does. -- GreenC 18:50, 29 May 2020 (UTC)

I'm happy to write any module code needed. In principle the module could do all the work itself from the current data because Lua is astonishingly fast but obviously having the bot provide the data already sorted and inverted would be much more efficient. Whichever of those module solutions was used, I would be inclined to add a function to Module:NUMBEROF and have a new Module:NUMBEROF/data (possibly Module:NUMBEROF/rank). That would mean the rank overhead, even if small, would only be incurred on a page where it was used. And vice versa: the numberof overhead would only be incurred where it was used. Johnuniq (talk) 00:01, 30 May 2020 (UTC)
Alright super! I will add another .tab to Commons, which also makes the data generally available for whatever purpose. One issue will be migration since there are about 24 wikis. If it was part of Module NUMBEROF would it be #invoked by this template? Otherwise we would have to modify every instance to use Template NUMBEROF which would be a major project. - GreenC 00:59, 30 May 2020 (UTC)
Good. The contents of Template:Wikipedia rank by size would be replaced with {{#invoke:NUMBEROF|rank}} (plus the include/doc stuff), so there would be no need to change the template calls. A wiki would need: edit to Template:Wikipedia_rank_by_size; replace Module:NUMBEROF; new Module:NUMBEROF/rank. Johnuniq (talk) 02:30, 30 May 2020 (UTC)

@GreenC: For recreational purposes, I just looked at what would be involved in implementing this purely in the module. I think it would work and there would be no need for more bot code or another data table at Commons. A key point is that this template is only used on a handful of pages and Lua would be plenty fast enough (although of course I'd have to test that). Let me know if you'd like me to implement a trial version. I assume we are only interested in "xxx.wikipedia" where xxx is a code such as "en" or "de". That is, there is no attempt to rank xxx.wikiquote etc. Johnuniq (talk) 03:30, 30 May 2020 (UTC)

Oh .. I just finished the bot: c:Data:Wikipedia statistics/datarank.tab. I kind of like it because it allows other developers access to the JSON which is not available anywhere via API. Since it is working, I will keep it up to date, but it is your decision to use it or not. The pro is the data file is smaller so faster download, and no number crunching. I agree for now it will be wikipedia only, other projects would require a separate tab file for each. -- GreenC 04:11, 30 May 2020 (UTC)
Another benefit, if in the future a new column was added (depth ranking?) it shouldn't require changes to the Lua code which is a revision control nightmare across many language sites. -- GreenC 04:22, 30 May 2020 (UTC)
OK, I'll think about that soon. However, a quick look suggests that the data allows the module to easily map site → rank-by-articles (e.g. 'en' → 1), but the module would have to invert that table itself. Given that there are a bunch of columns, your choice to sort by site name is understandable, but that means the module has to grind away to generate a rank-by-articles → site table. That's a pretty minor issue and I should have the module ready in under 24 hours. Johnuniq (talk) 05:16, 30 May 2020 (UTC)
Typo in datarank.tab: sources = "Data source: Calculted from...", should be "Calculated". Johnuniq (talk) 09:10, 30 May 2020 (UTC)

I modified Module:NUMBEROF and created Module:NUMBEROF/rank + Template:Wikipedia rank by size/sandbox. The following shows the current main template (not using module) and the sandbox (using module):

  • {{Wikipedia rank by size|1}} → en
  • {{Wikipedia rank by size|2}} → ceb
  • {{Wikipedia rank by size|3}} → de
  • {{Wikipedia rank by size|4}} → fr
  • {{Wikipedia rank by size|5}} → sv
  • {{Wikipedia rank by size|6}} → nl
  • {{Wikipedia rank by size|7}} → ru
  • {{Wikipedia rank by size|en}} → 1
  • {{Wikipedia rank by size|ru}} → 7
  • {{Wikipedia rank by size/sandbox|1}} → en
  • {{Wikipedia rank by size/sandbox|2}} → ceb
  • {{Wikipedia rank by size/sandbox|3}} → de
  • {{Wikipedia rank by size/sandbox|4}} → fr
  • {{Wikipedia rank by size/sandbox|5}} → sv
  • {{Wikipedia rank by size/sandbox|6}} → nl
  • {{Wikipedia rank by size/sandbox|7}} → ru
  • {{Wikipedia rank by size/sandbox|en}} → 1
  • {{Wikipedia rank by size/sandbox|ru}} → 7

It's probably ready but I'll have to check it later. Inverting was very easy, I shouldn't have complained! Johnuniq (talk) 10:01, 30 May 2020 (UTC)
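For readers following along, the inversion discussed above can be sketched in a few lines. This is an illustration in Python rather than the Lua actually used in Module:NUMBEROF/rank, and the small `site_to_rank` table is a hand-written sample based on the ranks listed above, not the contents of the Commons data page.

```python
# Hand-written sample of the site -> rank mapping (values from the list above);
# the real data lives in the Commons .tab file and changes over time.
site_to_rank = {
    "ceb.wikipedia": 2,
    "de.wikipedia": 3,
    "en.wikipedia": 1,
    "fr.wikipedia": 4,
}


def invert(mapping):
    """Build the rank -> site table from the site -> rank table."""
    inverted = {}
    for site, rank in mapping.items():
        # If two sites shared a rank, the later one would overwrite the earlier.
        inverted[rank] = site
    return inverted


rank_to_site = invert(site_to_rank)
print(rank_to_site[1])  # en.wikipedia
```

One pass over the table is enough, which fits the observation that inverting in the module costs very little.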

Very nice. Based on /rank yes it doesn't appear to make much difference inverting in-module, doing the loops anyway, very efficient. Invariably someone will want to rank sister projects and it's an open question where to put the data, combined into one tab or separate tabs for each sister, there are pros and cons each way. It might be a good idea to have it done, everything in place, so when users update the config.tab it just works. What are your thoughts on how to organize the .tab(s)? I tried adding a "data2[]" to the JSON but Commons only accepts one "data[]" structure. I'm leaning towards separate tabs, which is more efficient for performance and cleaner for developers and manual views, but it creates a bunch of pages on Commons. The file names could match the project name, e.g. {{Wikipedia rank by size|1|wikibooks}} retrieves data from c:Data:Wikipedia statistics/rank.wikibooks.tab so defining which file to retrieve would be easier. Could also remove ".wikipedia" from the site column. -- GreenC 14:10, 30 May 2020 (UTC)
The current code assumes .wikipedia is in the data. I didn't bother checking that as I thought changes may easily occur, such as other projects. I don't think it's worth doing any other planning or development regarding other projects until someone asks for it. However, using Data:Wikipedia_statistics/rank.wikibooks.tab or maybe something like Data:Wikipedia_statistics/wikibooks/rank.tab would work. (BTW the data is datarank.tab while rank.tab might be simpler.) Unfortunately there is no way for the main module to pass a parameter to the data module (other than using a global variable which causes problems when other modules use Module:No globals), although the data module can get the current frame and peek at the parameters itself. It should be able to work out which Commons data page to get, and then the current code would handle the rest. Keeping the .wikipedia or .wikibooks suffixes is probably ok (and a good sanity check), although if implementation were ever needed, that might be revisited. Re rankings: let's hope there are never two sites with the same number of articles because if they have the same rank, the inverse lookup will only find one of them. BTW I have finished my check of the modules: I think they are ready. Johnuniq (talk) 03:48, 31 May 2020 (UTC)
Once the new NUMBEROF/rank is released, it will be copied and documented to the 70+ other language wikis, plus organic copying over time. If new features are added it requires re-copying and documentation changes which is a major amount of work (days/weeks). It probably would be less work to have the feature done up front. We don't need to create all the .tab files until someone asks for it, but when they do, the code could support it without updates. I'm fine with ~/wikibooks/rank.tab or whichever name you think is best. To be more precise I should have named it Data:Wikimedia_statistics but it's already released, and anyway there is a proposal to rebrand the organization as 'Wikipedia' so it might end up being accidentally prescient. Anyway let me know if you want to do it and if not that's cool we'll move forward with the current. -- GreenC 12:46, 31 May 2020 (UTC)

Rank plan

How about this? The current file gives xxx.wikipedia rankings and is c:Data:Wikipedia statistics/datarank.tab. Each site is xxx.wikipedia. I propose changing both the file location and site name.

Syntax

  • {{Wikipedia rank by size|1}} gives #1 for wikipedia (default parameter 2 = wikipedia).
  • {{Wikipedia rank by size|1|wikibooks}} gives #1 for wikibooks.

Files

  • Rank for xxx.wikipedia at Data:Wikipedia statistics/rank/wikipedia.tab and each site is xxx (e.g. "en" not "en.wikipedia").
  • Rank for xxx.wikibooks at Data:Wikipedia statistics/rank/wikibooks.tab and each site is xxx as above.

If the slash is a problem, rank-wikipedia.tab and rank-wikibooks.tab would be fine.

Some other consistent naming system would be fine. Above you mention wikibooks/rank.tab. I could live with that but it suggests that other statistics could be at other subpages, such as wikibooks/data.tab. All the NUMBEROF info, including wikibooks, is at Wikipedia statistics/data.tab which makes me prefer what I put above. If you ok the above I can implement it. Or tweak the proposal. Johnuniq (talk) 04:55, 1 June 2020 (UTC)
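The proposed naming scheme amounts to a simple mapping from the template's second parameter to a data page title. A minimal sketch in Python (not the module's Lua) follows. The function name `rank_data_page` is hypothetical, and treating a blank parameter the same as an omitted one is an assumption.

```python
DATA_PREFIX = "Data:Wikipedia statistics/rank/"


def rank_data_page(project=None):
    """Return the Commons data page title holding a project's rank table."""
    # Parameter 2 defaults to wikipedia; a blank value is treated the same
    # way here, which is an assumption of this sketch.
    name = (project or "").strip() or "wikipedia"
    return f"{DATA_PREFIX}{name}.tab"


print(rank_data_page())             # Data:Wikipedia statistics/rank/wikipedia.tab
print(rank_data_page("wikibooks"))  # Data:Wikipedia statistics/rank/wikibooks.tab
```

Because the file name is derived directly from the project name, adding a new project needs only a new data page, not a code change.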

Awesome. The data pages are done ex. c:Data:Wikipedia_statistics/rank/wikiversity.tab -- GreenC 17:21, 1 June 2020 (UTC)

@GreenC: Problem: I forgot about mw.loadData. While the above plan could be implemented, since the rank module would only be loaded once per page, if a page ever mixed projects it would fail. For example, if {{Wikipedia rank by size|1|wikibooks}} appeared on a page, then using {{Wikipedia rank by size|1}} later would fail because the rank module would have already loaded the wikibooks data and would not be loaded again to get the wikipedia data. I'll think about this but there is no simple solution. One approach would be to use loadData if getting wikipedia data, but require for anything else. That might work although if there were 100 "anything else" templates, the data would be parsed 100 times. The proper solution would be a separate module for every project, but that would lead to a dozen almost identical modules, and the need to deploy and maintain them. Johnuniq (talk) 23:23, 1 June 2020 (UTC)
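The load-once problem can be illustrated with a small Python analogy. This is not Scribunto code: `load_data` below is a hypothetical stand-in that caches by module name, mimicking the behaviour described above, and the wikibooks entry in `TABLES` is a placeholder value.

```python
_cache = {}


def load_data(module_name, loader):
    """Run `loader` only the first time a module name is seen, then reuse it."""
    if module_name not in _cache:
        _cache[module_name] = loader()
    return _cache[module_name]


# Placeholder rank tables; "xx" is made up for illustration.
TABLES = {
    "wikipedia": {1: "en"},
    "wikibooks": {1: "xx"},
}

# One shared data module: the first call on the page decides what it holds.
first = load_data("NUMBEROF/rank", lambda: TABLES["wikibooks"])
second = load_data("NUMBEROF/rank", lambda: TABLES["wikipedia"])
print(second[1])  # xx, because the wikipedia loader never ran


def load_rank(project):
    """One module name per project, so each table is cached separately."""
    return load_data("NUMBEROF/rank/" + project, lambda: TABLES[project])


print(load_rank("wikibooks")[1])  # xx
print(load_rank("wikipedia")[1])  # en
```

The second half corresponds to the "separate module for every project" option: correct on mixed pages, at the cost of more modules to deploy and maintain.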

Do you mean if the /rank modules were apart like /rank.wikipedia and /rank.wikiversity .. with 7 loadData() possibilities.. would that work? Or do you mean separate modules of NUMBEROF itself? -- GreenC 00:34, 2 June 2020 (UTC)
Seven rank modules, such as Module:NUMBEROF/rank/wikipedia and Module:NUMBEROF/rank/wikiversity. Module:NUMBEROF would work out which was needed and would use loadData on it. Do you know of a link showing that there are seven of what I am calling projects (although I know that is not quite the right word)? Does the link show how many individual entries would exist? That is, the sum of the number of languages for each project. I'm wondering if all the rank/wikipedia entries could be in one data file (handled by one submodule) and all the entries for the other six projects in one other data file (handled by a second submodule). Then there would be Module:NUMBEROF + Module:NUMBEROF/rank + Module:NUMBEROF/other (or better names, although I prefer simple words rather than phrases where reasonable). Johnuniq (talk) 02:26, 2 June 2020 (UTC)

Eight sister projects with separated language-sites:

If all were added to config.tab it would be 836 (currently about 300) which would make the data.tab JSON about 60k in size (currently near 20k). As for breaking /rank in two it would be 37% in /rank and remainder in /other .. the larger will be used less often anyway so that is good. I'm still confused how this is much different from the current which has the loadData() problem. Would the data pages stay the same or also be reduced to two? -- GreenC 03:27, 2 June 2020 (UTC)

There could still be separate data files, so nine in total (eight rank + statistics data). Regarding modules: /rank would handle wikipedia only for efficiency since it is actually used, while /other would handle each of the others. The /other module would have to load each of the seven other rank files and return them in a structure to be determined that would allow the main module to get whatever rank information was needed for the particular template. I suspect that would be manageable from a performance point of view, although it's less than ideal. Johnuniq (talk) 06:52, 2 June 2020 (UTC)
I don't know how well my vague plan would work, and I don't know how much trouble it is for you to produce the data files. If you want, create all the files and I will make some code work. Then we'll know if the performance is ok. Johnuniq (talk) 09:40, 2 June 2020 (UTC)
The data files are done. Other than /wikipedia.tab, they each have 2 entries: enwiki and cswiki. Do you want to fully populate them? It would add to the size of /data.tab 3x so I was holding off until required, but it might be a chicken-and-egg problem: it won't be used until it is working. -- GreenC 14:14, 2 June 2020 (UTC)

I was not aware that all that ranking data would be added to c:Data:Wikipedia statistics/data.tab as well. I dumped that page at 8 May 2020 and the only items that were not .wikipedia were:

meta.wikimedia commons.wikimedia species.wikimedia total.wikimedia
en.wikiquote total.wikiquote
en.wikivoyage total.wikivoyage
en.wikibooks total.wikibooks
en.wikinews total.wikinews
en.wikiversity total.wikiversity
en.wikisource total.wikisource
total.all

In a way it's logical that everything would be in the data file but that makes it humungous with a lot of material that is unlikely to be used. By contrast, there are pages with hundreds of NUMBEROF for .wikipedia counts. I suppose we could see how much overhead it generates but it would be good if another approach was possible. At any rate, I'll work on the ranking code but will need a day or two. Johnuniq (talk) 00:09, 3 June 2020 (UTC)

config.tab can be edited by anyone and it determines which sites are included in data.tab and the /rank tabs by the bot. It might be good and easier to offer stats and rankings for everything, anyone can edit config.tab so we have no control limiting what's included. I agree data.tab with everything all-in is about 3x current size. It's only loaded once per page so 60k is like a small picture, easily handled, but is bloatish for most use cases. Another idea is split data.tab in two mirroring how /rank works .. the total.X can still work even though they would encompass some records across multiple tabs, since they are calculated by the bot prior to writing the tab. -- GreenC 04:31, 3 June 2020 (UTC)
OK, 60KB should be a blip. In that case, if it's not too much hassle, why not fully populate the data now? That will mean it's ready for my testing in a day or two, and we will also see what the overhead is. I just took a copy of the NewPP report at List of Wikipedias (after purging it) and will compare with the result after data.tab expands to 60KB. Johnuniq (talk) 05:24, 3 June 2020 (UTC)

Hmm, I wonder if N should be supported (like in {{NUMBEROF}}) to format the result with commas or however it should be done in the local language (although commas will never be needed as the largest number is much less than 999). I suppose so. Johnuniq (talk) 07:21, 3 June 2020 (UTC)

Testing enhanced ranks

The code might be ready although I'll need a day or so to check it.

  • {{Wikipedia rank by size/sandbox|12}} → zh
  • {{Wikipedia rank by size/sandbox|12|wikipedia}} → zh
  • {{Wikipedia rank by size/sandbox|12|wikiquote}} → es
  • {{Wikipedia rank by size/sandbox|12|wikibooks}} → id
  • {{Wikipedia rank by size/sandbox|12|junk}} → -1
  • {{Wikipedia rank by size/sandbox|cs||N}} → 28
  • {{Wikipedia rank by size/sandbox|cs|wikipedia|N}} → 28
  • {{Wikipedia rank by size/sandbox|cs|wikiquote|N}} → 5
  • {{Wikipedia rank by size/sandbox|cs|wikibooks|N}} → 24
  • {{Wikipedia rank by size/sandbox|bn||N}} → 63
  • {{Wikipedia rank by size/sandbox|bn|wikipedia|N}} → 63
  • {{Wikipedia rank by size/sandbox|bn|wikiquote|N}} → 37
  • {{Wikipedia rank by size/sandbox|bn|wikibooks|N}} → 40

Johnuniq (talk) 10:21, 3 June 2020 (UTC)
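The two-way lookup shown in these tests can be sketched as follows. This is a Python illustration, not the module's Lua. The `RANKS` table contains only the values quoted in the list above, the function name `rank_lookup` is hypothetical, and returning -1 for an unknown site or rank is an assumption (the list only shows -1 for an unknown project). Local-language number formatting (the N option) is not modelled.

```python
# Only the values quoted in the test list above; real tables are much larger.
RANKS = {
    "wikipedia": {"zh": 12, "cs": 28, "bn": 63},
    "wikiquote": {"es": 12, "cs": 5, "bn": 37},
    "wikibooks": {"id": 12, "cs": 24, "bn": 40},
}


def rank_lookup(key, project=""):
    """Map a rank to a site code, or a site code to a rank, for one project."""
    project = project.strip() or "wikipedia"  # blank parameter 2 = wikipedia
    table = RANKS.get(project)
    if table is None:
        return -1  # unknown project, as in the |junk example
    key = str(key).strip()
    if key.isdigit():
        wanted = int(key)
        for site, rank in table.items():
            if rank == wanted:
                return site
        return -1  # assumption: no site holds this rank
    return table.get(key, -1)  # assumption: unknown site gives -1


print(rank_lookup(12))               # zh
print(rank_lookup(12, "wikiquote"))  # es
print(rank_lookup("cs", "wikibooks"))  # 24
print(rank_lookup(12, "junk"))       # -1
```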

This looks great, John. Yeah I think the "N" is important for languages that require native script. -- GreenC 15:30, 4 June 2020 (UTC)

config.tab

@GreenC: Please check my edits at c:Data:Wikipedia statistics/config.tab. I guessed that there was no easy way to add data, and in fact you must have munged it from manual extraction from meta. At any rate, I did that to add wikibooks and wiktionary entries from meta:Wikibooks/Table and meta:Wiktionary/Table. Notes:

  • I did not add anything extra such as total.
  • I added the entries in the order they were in meta, but kept all the wikibooks items together, and all the wiktionary items together.
  • I omitted some entries that I could see were inactive from the fact that the Special:Statistics page had a note that the project had been closed. I guessed that all the items that I added were active although I have no idea how that could be checked.
  • I wonder what order the entries in config.tab should have. I can sort them if wanted. If all the en items are together, it becomes difficult to compare config.tab with the meta tables. It might be better to sort them by sister project (so all wiktionary entries are together) then site name?
  • I can finish copying data from meta if you confirm that what has happened so far is ok.

I also compared the list of wikipedia entries from meta:List of Wikipedias/Table with config.tab. They were identical, although in a slightly different order, except that config.tab has mo (0 activeusers) that is not in meta. Johnuniq (talk) 09:33, 4 June 2020 (UTC)

Oh great please do. I am wrapped up in another project that demands my attention for a bit. I wasn't sure where an authoritative list could be found, maybe a MediaWiki API entry point? Posted a question at Wikipedia:Village_pump_(technical)#List_of_all_Wikimedia_sister_projects. -- GreenC 15:29, 4 June 2020 (UTC)
Got an answer in 7 minutes :) Extension:SiteMatrix which produces this JSON which is everything we need to generate the config.tab because it includes a "closed" key:value .. which raises the question, do we need a config.tab at all if the list can be generated automatically? -- GreenC 15:45, 4 June 2020 (UTC)
That is excellent. In due course, please refactor the bot to use the SiteMatrix. Of course we'll have to see whether the module can cope with the size of the data and the number of times it is called, but it probably can. Config.tab might have some use for extras that are not in API result? I'm thinking of the totals, but I guess the bot handles that. I might add some more to config.tab while experimenting but I would be very happy for it to be emptied or deleted. Johnuniq (talk) 00:49, 5 June 2020 (UTC)
Re performance: I compared a preview of 100 calls to {{Wikipedia rank by size}} (which currently uses a simple template #switch with fixed data) with the same calls to {{Wikipedia rank by size/sandbox}} (which uses the module). The module version took 0.58 seconds of CPU time, versus 0.43 for the template. The transclusion expansion time was 560 ms for the module versus 404 for the template. Conclusion: the module is sufficiently fast, even with 595 entries in config.tab. Johnuniq (talk) 01:21, 5 June 2020 (UTC)
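As a rough illustration of generating the site list from SiteMatrix output, here is a Python sketch. It runs on a small hand-written sample, not real API output: the nesting (numbered language groups with a "site" list, plus "count" and "specials" keys) and the use of a "closed" key on closed sites are assumptions about the response shape. The language codes "xx" and "yy" are placeholders, and `open_sites` is a hypothetical name, not the bot's code.

```python
import json

# Hand-written sample, assumed to resemble the SiteMatrix JSON structure.
SAMPLE = json.loads("""
{"sitematrix": {
  "count": 3,
  "0": {"code": "xx", "site": [
      {"dbname": "xxwiki", "code": "wiki"},
      {"dbname": "xxwikibooks", "code": "wikibooks", "closed": ""}]},
  "1": {"code": "yy", "site": [
      {"dbname": "yywiki", "code": "wiki"}]},
  "specials": [{"dbname": "commonswiki", "code": "commons"}]
}}
""")


def open_sites(matrix):
    """List sites that are not closed, named like xx.wikipedia."""
    sites = []
    for key, group in matrix["sitematrix"].items():
        if not key.isdigit():
            continue  # skip the "count" and "specials" entries
        for site in group.get("site", []):
            if "closed" in site:
                continue  # closed projects are left out
            # Assumption: the code "wiki" denotes a Wikipedia.
            project = "wikipedia" if site["code"] == "wiki" else site["code"]
            sites.append(f"{group['code']}.{project}")
    return sorted(sites)


print(open_sites(SAMPLE))  # ['xx.wikipedia', 'yy.wikipedia']
```

The specials entry is skipped here on purpose; which of those sites to include would still need a manual decision.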

Johnuniq: The bot has a new feature. It has the option to read from c:Data:Wikipedia_statistics/config.tab or from API Extension:SiteMatrix. This is controlled by a switch at Template:NUMBEROF/conf. Normally in production it would always be the API, but for temporary testing it can be toggled over to use conf.tab. This switch isn't meant for general use and will be undocumented and maybe the /conf page should be protected for template users. This method should solve immediate needs. If we run into a situation where the API is not giving the results wanted, a more complicated solution can be done, probably two sub pages on Commons "/add" and "/sub" so data[] objects can be added, deleted or modified ie. first sub then add. But I'd like to avoid that because it will be a maintenance headache to monitor those pages and data errors. -- GreenC 17:25, 7 June 2020 (UTC)

I went ahead and made it live since performance was not a problem with the 595 entries and currently 836 with everything included. -- GreenC 20:53, 7 June 2020 (UTC)
Great. I template-protected Template:NUMBEROF/conf to avoid damage that could impact many projects. For my curiosity, I started writing a Python script to get the data from SiteMatrix because I recalled using the json module at some time and it was pretty easy. I got distracted and did not quite finish, but I did enough to run into the specials entry. That includes meta + commons + species which are in c:Data:Wikipedia statistics/data.tab but not in c:Data:Wikipedia statistics/config.tab. You might want to add them there. Johnuniq (talk) 03:21, 8 June 2020 (UTC)
Thanks for the protection. Yes, I forgot to manually refresh config.tab; it is done now. There are many specials, and I picked the few that looked most relevant. The full list is below in case you see any more that should be included; they are not added automatically as many are private or don't seem relevant. -- GreenC 03:46, 8 June 2020 (UTC)
Wikimedia.org specials

Merge notes

Notes on roll-out to other wikis

  • eowiki pending review
  • krcwiki unable to edit, "malicious edit"
  • ruwiki using a different system see Module:NumberOf/today
  • ukwiki pending review
@GreenC: now you can edit/create: kk:Special:AbuseFilter/history/79/diff/prev/347 --Arystanbek (talk) 11:19, 11 June 2020 (UTC)
Arystanbek, thank you! I need to edit the following:
It is a lot. I tried kk:Module:NUMBEROF/doc but it did not work. @Arystanbek: -- GreenC 14:18, 11 June 2020 (UTC)
@GreenC: I edited this abuse filter: 49, please try again. --Arystanbek (talk) 11:18, 12 June 2020 (UTC)
Everything loaded OK except kk:Template:NUMBEROF/doc gives an error. I will leave it for you to create, as it needs translation anyway. I accidentally created kk:Template:Wikipedia rank by size and kk:Template:Wikipedia rank by size/doc which is a duplicate of kk:Үлгі:Уикипедия/Мақала саны бойынша орны - Could you delete kk:Template:Wikipedia rank by size and kk:Template:Wikipedia rank by size/doc? Thanks. -- GreenC 14:24, 12 June 2020 (UTC)
I edited kk:Template:NUMBEROF/doc template: kk:Special:Diff/2773133 and deleted duplicate templates --Arystanbek (talk) 04:52, 13 June 2020 (UTC)
I think it is all done. Thanks for your help! -- GreenC 13:16, 13 June 2020 (UTC)

Thank you, great job! --Arystanbek (talk) 18:21, 14 June 2020 (UTC)

Template:Wikipedia rank by size/WP

User:Johnuniq, I posted a request for help at Wikipedia:Requested_templates#Help_with_Template:Wikipedia_rank_by_size/WP but the board is not very active at the moment. Is this something you know how to do by any chance? Just needs two additional (optional) arguments passing through. -- GreenC 14:29, 7 September 2020 (UTC)

I believe I've done what was wanted. Please check the results! BTW, WT:WikiProject Templates is a good place to ask for help with templates. Johnuniq (talk) 07:17, 8 September 2020 (UTC)

Johnuniq. Thanks! Did not know about that forum. Looks like a new problem. After adding the template to Wikiquote#Multi-lingual_cooperation it appears /WP is not designed with other sister projects in mind. It creates a hard wiki-link to the Wikipedia page, producing [[English Wikipedia|English]] even though the second argument is wikiquote. Which raises another question about the template name. The whole thing might require a rethink. -- GreenC 14:22, 8 September 2020 (UTC)

Johnuniq: Problem solved. Check this out. -- GreenC 16:59, 11 September 2020 (UTC)
Very impressive! Johnuniq (talk) 09:48, 12 September 2020 (UTC)

I've been wondering whether external links like the following are needed at Wikiquote#Multi-lingual cooperation.

If wanted, you could try making it work with these examples:

Some info is at WP:INTERWIKI. Johnuniq (talk) 23:54, 14 September 2020 (UTC)

Johnuniq, this is a great idea, better than external links. I went ahead and changed all 7 tables. Then noticed some sites don't have "Main page" redirects, for example pl (Polish Wikivoyage). Removing the "Main Page" but keeping a trailing ":" seems to do the trick. -- GreenC 00:59, 15 September 2020 (UTC)
Hmm, pretty mysterious. I suspect (see WP:INTERWIKI) that it doesn't reliably work with multiple prefixes. Johnuniq (talk) 03:06, 15 September 2020 (UTC)

Change in JSON

The bot that maintains the ranking data at Commons, i.e. https://commons.wikimedia.org/wiki/Data:Wikipedia_statistics/rank/wikipedia.tab and all the rest, has made a change. Previously ranks ran strictly from 1..300 (or whatever), that is, if there were 300 rows the ranks were 1..300. Ties were ignored: if two sites had the same number they were not given tied scores. This has changed: they now have tied scores, so some sites share the same rank. This is to reflect the reality of the data; how end-user applications deal with ties will be up to each application. If it's a problem for this template let me know. -- GreenC 21:48, 23 September 2023 (UTC)
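For anyone adapting code to the tied ranks, here is one possible approach sketched in Python. The comment above does not say which tie scheme the bot uses, so standard competition ranking (1, 2, 2, 4) is assumed here. The counts are made up, and both function names are hypothetical.

```python
def rank_with_ties(counts):
    """Assign ranks by descending count; equal counts share a rank (1, 2, 2, 4)."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    previous_count = None
    previous_rank = 0
    for position, (site, count) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the position.
        rank = previous_rank if count == previous_count else position
        ranks[site] = rank
        previous_count, previous_rank = count, rank
    return ranks


def sites_at_rank(ranks, wanted):
    """Inverse lookup that returns every site sharing a rank, not just one."""
    return sorted(site for site, rank in ranks.items() if rank == wanted)


counts = {"aa": 500, "bb": 300, "cc": 300, "dd": 100}  # made-up article counts
ranks = rank_with_ties(counts)
print(ranks)                    # {'aa': 1, 'bb': 2, 'cc': 2, 'dd': 4}
print(sites_at_rank(ranks, 2))  # ['bb', 'cc']
print(sites_at_rank(ranks, 3))  # []
```

With ties, a rank → site lookup may match several sites or none, so a consumer has to decide what to display in each case.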