
identifying a list of toponym articles using a specific title format

Hi,

Per earlier discussion at Wikipedia talk:Naming conventions (geographic names)#Serbia village name format, we have a discrepancy in the formatting of disambiguation markers for villages in Serbia. Can someone please help produce a list of all articles in the main namespace that are underneath any of the categories of Category:Populated places in Serbia and match the regular expression /\(.*\)$/? We'd then be able to redact the list to make sure we don't have any false positives or negatives, and put this up for a mass move.

Likewise, if the query isn't too taxing on the available resources, we could do the analogous thing with articles underneath Category:Populated places in Bosnia and Herzegovina that have the same issue (it would be great to see what the impact of the mass move could be there).

TIA! --Joy [shallot] (talk) 07:21, 1 April 2021 (UTC)

Joy, let me know if this is what you're looking for.
To save them in a different format (such as TXT or CSV), go to the Output tab, pick what you want, then re-run the queries. I adjusted your RegEx to be .*\(.*\)$. I set the category depth to 10; you can experiment with going deeper to make sure that's enough to include everything. I see a lot of non-villages mixed into the data... I think some of the deep sub-categories and/or their articles may need attention. For example, Probus (emperor) is in the results because he is included in Category:Sirmium, which was a Roman city that happened to be located in Serbia. Perhaps an additional category or a different RegEx would help to prune out these non-villages. –Novem Linguae (talk) 07:53, 1 April 2021 (UTC)
Hmm, seems like depth 2 should work, and then we'll just have to go through and do some of that janitorial work to clean up the categorization itself. Thanks! @No such user: please feel free to join me :) --Joy [shallot] (talk) 10:42, 1 April 2021 (UTC)
@Joy: On a quick skim, we'll need depth of at least 3 for Bosnia and Herzegovina (because of entities). A slightly better regex is /\([A-Z].*\)$, only showing uppercase disambiguation terms. I'm a bit busy these days, but manual filtering does not look like a big job. Shall we save those results to someone's userpage? No such user (talk) 13:30, 1 April 2021 (UTC)
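For reference, a minimal Quarry-style sketch of the same filter, using the regex and category name from this thread. It only looks at direct members of the category, so the deeper levels discussed above would still need PetScan's depth setting or a recursive subcategory walk; treat it as an illustration, not the list-generating query itself.

  -- Articles directly in Category:Populated places in Serbia whose titles
  -- end in a parenthetical disambiguator (subcategories are not traversed here).
  SELECT p.page_title
  FROM categorylinks
  JOIN page AS p ON p.page_id = cl_from
  WHERE cl_to = 'Populated_places_in_Serbia'
    AND p.page_namespace = 0
    AND p.page_is_redirect = 0
    AND p.page_title RLIKE '\\(.+\\)$';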

Users with highest edit count during last 365 days (debug)

Chlod and I worked on this query, but both our attempts timed out. Here's the code. Any suggestions for speeding this up? I think the revision table has 1 billion rows, so it's quite a large table. Also, I read somewhere that WHERE, GROUP BY, and ORDER BY should be run on columns that are indexed/keyed for faster queries, but currently all 3 of those are being run on non-keyed columns. The SQL optimizer tool is giving: "Query plan 1.1 is using filesort or a temporary table. This is usually an indication of an inefficient query. If you find your query is slow, try taking advantage of available indexes to avoid filesort." –Novem Linguae (talk) 17:09, 2 April 2021 (UTC)

First of all, your date format is wrong - rev.rev_timestamp (and all other timestamps in this db) are stored in YYYYMMDDHHMMSS format, while UNIX_TIMESTAMP() is seconds since the epoch. You'd want something like DATE_FORMAT(DATE_ADD(NOW(), INTERVAL -1 YEAR), "%Y%m%d%H%i%s"), not UNIX_TIMESTAMP() - 365*24*60*60.
Even with that fixed, though, that query will (several days later, at least) show the last year's edit count of all users and ips who've made even one edit in that time. Even if gathering the data were instant, it would still take a very long time just to transfer the results. So there's that.
Assuming some other way to limit the number of users - perhaps only allowing ones with user.user_editcount > some appropriate number - you'll probably get faster results by first finding the minimum rev_id, then querying by that instead of rev_timestamp. quarry:query/53813 has the last hour's results with that method; I stopped waiting on the equivalent query directly on rev_timestamp after five minutes.
(Also, revision_userindex isn't useful here over revision, though it might be if you start limiting by user_editcount or other values in the user table as I suggested above.) —Cryptic 17:37, 2 April 2021 (UTC)
And I left out the hard part: querying the revision (or logging) table is slow, due to the volume of data, and it's not linear with respect to the amount of data being queried. Extending the time window to a month (and only displaying the twenty highest editcounts) took eight and a half minutes; going back a year would thus take at least an hour and a half, which is well over the timeout limit on Quarry and I think directly on toollabs too. Maybe you could break it apart into month-long chunks and fix it in post-processing. —Cryptic 18:33, 2 April 2021 (UTC)
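A minimal sketch of the rev_id-cutoff approach described above, for readers of the archive. The cutoff value is a hypothetical placeholder (you would first look up the lowest rev_id whose timestamp is about a year old and substitute it), and LIMIT 20 matches the "twenty highest editcounts" mentioned; this is not the exact query linked above.

  -- Count edits per actor since a pre-computed cutoff revision ID.
  -- 1000000000 is a placeholder rev_id roughly a year old, found separately.
  SELECT actor_name, COUNT(*) AS edits
  FROM revision
  JOIN actor ON actor_id = rev_actor
  WHERE rev_id >= 1000000000
  GROUP BY rev_actor, actor_name
  ORDER BY edits DESC
  LIMIT 20;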
Cryptic, thanks. This is my first time working with such a massive SQL database. I'm learning a lot. I appreciate your detailed explanation. CommanderWaterford, looks like this query ended up being impractical. –Novem Linguae (talk) 18:44, 2 April 2021 (UTC)
Hm. There must be a way to get this data. @Firefly fyi CommanderWaterford (talk) 18:52, 2 April 2021 (UTC)
For an approximate answer, how about fetching a year-old revision of Wikipedia:List of Wikipedians by number of edits/1–1000 and comparing it with the current revision? -- John of Reading (talk) 19:30, 2 April 2021 (UTC)
A good idea as a very first step.... but it would not include users like me, for example :-) who have been here for less than one year. CommanderWaterford (talk) 20:54, 2 April 2021 (UTC)
For the record, Chlod got a version of Cryptic's query working by adding LIMIT 1000. Thanks all. This was a good learning experience. –Novem Linguae (talk) 21:59, 2 April 2021 (UTC)
After 2.48 hours (8936.40 seconds), that is... Chlod (say hi!) 00:09, 3 April 2021 (UTC)
Ha! There are six actual humans ahead of me. BD2412 T 04:45, 5 April 2021 (UTC)

All pages created within last 2 days that have an {{Uncategorized}} tag

My approach was to check for pages in Category:All articles needing additional references. I tried Petscan, but couldn't find a filter for created date. I tried Quarry, and got it mostly working, but I got stuck on filtering by created date. I'm surprised there isn't a "created date" field in the `page` table. Would appreciate any help. Even if it's to smack me for missing a simple solution :-) –Novem Linguae (talk) 12:16, 12 April 2021 (UTC)

Novem Linguae, would this sorted search be sufficient? --Trialpears (talk) 12:34, 12 April 2021 (UTC)
Trialpears, hah, I had a feeling I might have missed something simple. Thank you very much for your insight. I'll leave this open in case somebody wants to fix the SQL query too. Mainly so I can learn what I did wrong with my SQL. –Novem Linguae (talk) 20:36, 12 April 2021 (UTC)
That query looks like it'll match any page with at least one edit within the last two days, not ones with zero edits older than two days. Try adding an AND rc_new=1. Also no need to join page; rc_cur_id, rc_namespace, and rc_title should all map directly to page_id, page_namespace, and page_title (perhaps in the absence of moves? I haven't ever worked with the recentchanges table, and don't know its gotchas).
What constitutes a "new page" is a bit murky in any case. Does, say, moving a page out of Draft to the mainspace count? Neither rc_new nor something like WHERE NOT EXISTS (SELECT 1 FROM revision WHERE rev_page=page_id AND rev_timestamp < /*something*/) can detect that. The pagetriage table might be able to help here, but I can't find any documentation for it. —Cryptic 21:02, 12 April 2021 (UTC)
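Combining the suggestions above (rc_new, no page join, a timestamp window) gives roughly the following shape. The tracking category used here is the one named at the top of this thread; whether it is the right proxy for the tag is an assumption, and this is not the working query linked below.

  -- New mainspace pages from the last two days that sit in the tracking category.
  SELECT rc_title
  FROM recentchanges
  JOIN categorylinks ON cl_from = rc_cur_id
  WHERE rc_namespace = 0
    AND rc_new = 1
    AND rc_timestamp >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 2 DAY), '%Y%m%d%H%i%s')
    AND cl_to = 'All_articles_needing_additional_references';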
Thanks Cryptic. I think I found the PageTriage table documentation, in case you're curious. mw:Extension:PageTriage/pagetriage_log_table. There are additional tables in the navbox at the bottom. Also, I got the SQL query working. Here it is for the record. –Novem Linguae (talk) 04:01, 13 April 2021 (UTC)
That's just a type dump, which I'd found, and could've run myself. The replicas don't show the key column since they hide the table itself behind a view, but it's not actually useful anyway without being able to see the combinations and orders of fields in the composite indices. There's no description of the fields' semantics, either, so no way to know what the data means without guessing (or, I suppose, reading the extension source). Compare mw:Manual:Revision table. —Cryptic 14:12, 13 April 2021 (UTC)

Indefinite move-protections placed long ago on articles

10+ years ago, as a first cutoff? For Wikipedia:Village pump (proposals)#Make page movers eligible to move move-protected pages. –xenotalk 21:33, 26 April 2021 (UTC)

xeno.   In progress. 1) Query 1 - All pages that are move protected (sysop only). 23,251 pages, most of which have no expiration. 2) Query 2 - All pages that are move protected (sysop only), with a column for the date protection was last set. This one is going slow and may time out. Anyone is welcome to fork it and try to speed it up. 3) Query 3 - Get rid of results that were protected in the last 10 years. I haven't thought of a good way to do this in SQL yet. I tried WHERE MAX(log_timestamp) < DATE_FORMAT(DATE_ADD(NOW(), INTERVAL -10 YEAR), "%Y%m%d%H%i%s"), but it did not like WHERE MAX, per [1]. Again, if anyone sees a good solution, feel free to fork and post. Thanks. –Novem Linguae (talk) 22:41, 26 April 2021 (UTC)
Thanks Novem Linguae, I think there is an issue with the query. A useful set could be any indefinitely move protected page with User:NawlinWiki present in the protection log. –xenotalk 22:44, 26 April 2021 (UTC)
There's issues with both the queries. In 54375, pr_id is page_restriction's primary key, and has nothing to do with page_id; you want pr_page. In 54378:
  • You're mixing namespaces, so not only is the query unindexed on both the logging and page tables, it's equating e.g. all of 4, Talk:4, User:4, User talk:4, Template:4, Template talk:4, Book:4, and Book talk:4.
  • It's not grouping on anything, so it'll only ever show one page.
  • It's showing the most recent log entry of any type, not just protection addition or modification, or even just protection-related.
A working query is hard. Ideally you'd want to join on log_page instead of log_namespace/log_title, since it'll follow page moves (there won't be a log entry for Foo if you move-protect Bar and then move it to Foo, even though it's Foo, and not the redirect at Bar, that ends up protected), but log_page was only added in October 2009. Even then, of the 31701 pages that are currently move-protected to sysop, I'm only seeing 931 pages that have any protection log entries at all (quarry:query/54382). And if I come at it from the other direction - specifically looking for move-protected pages without a corresponding protection log entry by log_page - I only see 11803 of them (quarry:query/54383). So I'm doing something very wrong too.
(BTW, DATE_FORMAT() just returns a string, so if you're not interested in an exact time, just use log_timestamp <= '2011' or such.) —Cryptic 00:48, 27 April 2021 (UTC)
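Putting those points together — grouping per page, HAVING instead of WHERE MAX, a plain string comparison for the date, and joining the protection log on log_page — a rough sketch might look like the following. It only covers post-October-2009 protections, log_type = 'protect' also matches unprotections, and it is not any of the quarry queries linked in this thread.

  -- Indefinitely sysop move-protected pages whose newest protection log
  -- entry (if any) is old or missing.
  SELECT page_namespace, page_title,
         MAX(log_timestamp) AS last_protection
  FROM page_restrictions
  JOIN page ON page_id = pr_page
  LEFT JOIN logging_logindex ON log_page = page_id AND log_type = 'protect'
  WHERE pr_type = 'move'
    AND pr_level = 'sysop'
    AND pr_expiry = 'infinity'
  GROUP BY page_id, page_namespace, page_title
  HAVING last_protection IS NULL OR last_protection <= '2011';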
  • Looking at the 11803, probably best to exclude results where they're indefinitely protected as well. –xenotalk 00:59, 27 April 2021 (UTC)
    • All but 26 are indefinite, and only five of them expire before 2027 (quarry:query/54385).
      Saner results for the post-October 2009-era pages at quarry:query/54384. I don't know why the results are so different from query 54382; they should be the same. —Cryptic 01:06, 27 April 2021 (UTC)
      • Oh, duh, I made the same pr_id/pr_page mixup in 54382 that I pointed out in Novem Linguae's queries. —Cryptic 01:14, 27 April 2021 (UTC)
      • And on a reread, you were asking about indef edit protection, not indef move protection, which makes more sense. Results are still in quarry:query/54385. —Cryptic 01:33, 27 April 2021 (UTC)
        Ha, how’s that for a random pick: Fantasy_Empires (check the sysop ;) –xenotalk 01:49, 27 April 2021 (UTC)
        Thank you for the list! A suggested refinement: "ideally a table of: PAGENAME, EDIT PROTECTION LEVEL, MOVE PROTECTION LEVEL, LAST PROTECTION LOG DATE, LAST PROTECTION LOG REASON, LAST PROTECTING SYSOP, and possibly ISREDIRECT? for these pages would be nice, for the set of pages where (NS:0 && editprotection NOT sysop && moveprotection IS sysop)." (I think some of these might be precluded by Cryptic's earlier statement.) –xenotalk 16:04, 27 April 2021 (UTC)
        I've added page_is_redirect and current edit protection level (if any) to both base queries, and a best guess for log entry to the pre-10/2009 one. Last protecting sysop can't be extracted independent of the log entry, so it's not 100% reliable. Move protection level is always sysop, since that was an initial condition. While ns=0 and editprotection != sysop could be extracted from sorting, I've filtered them out in quarry:query/54423 for the pre-10/2009 query; do you need me to do the same for the post-10/2009 one? —Cryptic 20:01, 27 April 2021 (UTC)
        I think this should give us plenty to go on - we can use filtering for the later results: thank you! –xenotalk 20:11, 27 April 2021 (UTC)
        quarry:query/54424. —Cryptic 20:19, 27 April 2021 (UTC)
        Cryptic, is it possible to filter out pages based on pageviews? — Alexis Jazz (talk or ping me) 00:42, 28 April 2021 (UTC)
        Not so far as I'm aware. —Cryptic 01:23, 28 April 2021 (UTC)

Longest-existing articles that have never had a ref tag

Thomas Ranch is about to be deleted, having existed in Wikipedia for over fifteen years without a single inline reference, and without ever having an external link to an independent source. Grub Smith, though a bit longer, has been in a similar state for an equally long time. I happened to come across these two not while searching for suspect articles but while doing general cleanup around the given name Thomas and the surname Smith. How many more articles are out there in this condition? I propose that the best way to find out, if technically feasible, would be to generate a list of articles that have never had an inline ref tag over the course of their existence, sorted by age, and push through them from the oldest on forward. If someone has the Wiki-fu to generate such a list, please have at it. BD2412 T 02:42, 5 April 2021 (UTC)

Note: Per further discussion at WP:VPT, we could limit this to start with to articles created early on, perhaps before 2007 or before 2006, for example. Obviously, any article that currently has a ref tag could be eliminated at the outset, before looking at histories (although it might be enough to generate a list of currently unreferenced articles first created before those dates). BD2412 T 03:04, 5 April 2021 (UTC)
The replicas don't have the information you want: there's no way to see whether a page has a ref tag, or to get any more data from previous versions of a page than what's displayed in the history page. The closest we can come here, even in principle, is to generate lists of articles that don't currently have any external links. Past that, it's difficult to efficiently search directly by earliest revision (though easy to get when you already have a list of pages), and defining what exactly qualifies as an article is also hard. As a first stab, quarry:query/53880 has pages that are in the main namespace, aren't true redirects or true disambigs or in Category:All set index articles, currently have no external links, and whose first revision is before 2002. Still mostly year articles and false positives. —Cryptic 03:45, 5 April 2021 (UTC)
I am thinking that the most immediate cases can be caught by an inquiry further narrowed to articles categorized in the Category:Companies category tree, or articles categorized in Category:Living people. The two articles noted above actually both have external links (in an "External links" section), but these are to their company/personal websites - that was what prompted me to focus on the absence of ref tags. I am curious, how do we know which articles have external links, if we aren't looking at the text? Can we see, for example, articles lacking a "References" section? BD2412 T 04:42, 5 April 2021 (UTC)
There's a table that lists all external links from all pages, to implement Special:Linksearch. We can't see sections. —Cryptic 12:54, 5 April 2021 (UTC)
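That table is externallinks, keyed by el_from (the linking page's page_id), so "currently has no external links at all" can be written as a NOT EXISTS test. A bare sketch, without the first-revision-date and dab/set-index filters used in the query above:

  SELECT page_title
  FROM page
  WHERE page_namespace = 0
    AND page_is_redirect = 0
    AND NOT EXISTS (SELECT 1 FROM externallinks WHERE el_from = page_id)
  LIMIT 100;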
Very interesting, thanks. BD2412 T 15:18, 5 April 2021 (UTC)
This idea goes in the same overall direction as the recently started WP:SWEEP project. – Uanfala (talk)
Regarding "I am thinking that the most immediate cases can be caught by an inquiry further narrowed to articles categorized in the Category:Companies category tree, or articles categorized in Category:Living people": the living people category will likely work well, but I found that digging down through the company category tree produces mostly dirt after just a few levels. I think you'll have more success grabbing articles that have a Company or company infobox (there are two of them, case sensitive). This set of articles is very clean. Sam at Megaputer (talk) 23:50, 8 April 2021 (UTC)

In 2019, I wrote a bot that added {{unreferenced}} to 10,000 articles. It could have added many more, but the community was dubious/conservative during an RfC and I had to dial it back to the most extreme cases only. Thus any external link anywhere on the page disqualified it from being tagged, including a single external link in the external links section or in the infobox transcluded from a template. After it ran, there were zero complaints of bad tagging. It is capable of distinguishing sections and main body. It can create lists instead of tagging. It scours the entire 6+ million corpus looking for candidates. This is a non-trivial bot: even though it sounds easy-ish, there are endless edge-case exceptions - for example, what counts as a dab vs. an article? Not at all clear when you look at it. And there are so many external link templates to look for. It can't be done with a query; it requires a purpose-built application. The suggestion of looking for the worst of the worst is interesting, but I suspect the majority of articles tagged with {{unreferenced}} have always been that way. -- GreenC 02:33, 9 April 2021 (UTC)

Most-viewed stub/start class articles

In a discussion at the WP:Core Contest, I suggested that a list of the most viewed Wikipedia articles that are start class or less would be beneficial for the competition, as presumably articles of interest to swathes of readers would be important ones to get right for our reputation. Obviously it may be useful to weed out the articles that are only prominent due to recent news stories (with a sharp spike in views, e.g. Ron DeSantis (views)), in favour of those that have had more consistent page views throughout their life on Wikipedia. I guess we could choose a certain standard deviation for page views to fall within from day to day? Note: sometimes an article has different Class ratings from different WikiProjects. Can you code in that the highest rating by the various WikiProjects is the correct one? (Sometimes it is re-classed up but the other WikiProjects don't catch up.) I wouldn't worry about the article importance class too much, as many of these may not have been classed correctly.

--Coin945 (talk) 15:42, 29 May 2021 (UTC)

Indefinite creation-protected articles

Hi, why isn't this working? I'm trying to generate a list of all indefinitely creation-protected articles at a sysop level. Anarchyte (talk) 10:09, 4 June 2021 (UTC)

If a page doesn't exist, then there won't be a row in the page table for it. And creation protection isn't stored in page_restrictions (largely for that reason); it's in protected_titles. SELECT * FROM protected_titles WHERE pt_expiry = 'infinity' AND pt_create_perm = 'sysop' AND pt_namespace = 0;. —Cryptic 10:18, 4 June 2021 (UTC)
You might also want to put in something for durations so long as to be effectively indefinite - there's ten pages creation-protected with expiries in 2030 or later, and 21 more between 2025 and 2030 (not all in mainspace or sysop-only). —Cryptic 10:35, 4 June 2021 (UTC)
You don't need a quarry for that, just use Special:ProtectedTitles. --Trialpears (talk) 10:20, 4 June 2021 (UTC)
If there's a way to filter that by duration, I don't see it. —Cryptic 10:23, 4 June 2021 (UTC)
Oh, guess that's only a thing at Special:ProtectedPages. --Trialpears (talk) 10:31, 4 June 2021 (UTC)
It doesn't really make a lot of difference here, as it turns out. I'm seeing 41,166 pages in the main namespace indefinitely salted to sysop, and only 104 with a fixed duration. —Cryptic 10:39, 4 June 2021 (UTC)

Thanks, Cryptic — that works perfectly. Anarchyte (talk) 11:38, 4 June 2021 (UTC)

Help wanted compiling list for WikiProject Sweep

  You are invited to join the discussion at Wikipedia talk:WikiProject Sweep § June 2021 follow-up. {{u|Sdkb}}talk 21:48, 4 June 2021 (UTC)

The query for getting pages created before 2004 Sep 20 with <10 non-minor edits in history took 2.5 hours to run. As per the linked thread, apparently we want the equivalent query with 2012 instead of 2004. This query clearly doesn't scale. Is there any way to optimise it? Cc @Cryptic. – SD0001 (talk) 03:54, 7 June 2021 (UTC)
Beyond my experience. Using a HAVING clause instead of that outermost subquery (superquery?), as in quarry:query/55737, might help some, but I doubt it. This really needs better criteria: besides being hard to efficiently query for and not selective enough - there's going to be at least ten times as many hits in the 20120920-and-earlier query - the number of revisions of a page, unless much, much larger than the 10 looked for here, isn't a good indicator of how many editors have reviewed it. —Cryptic 06:06, 7 June 2021 (UTC)
@Cryptic, any idea about what sort of better criteria we could use? We brainstormed a bunch, and limiting based on that seemed most promising. Overall, the set is expected to be huge since it's a huge undertaking, so massive results don't necessarily indicate something amiss. We also don't need full lists yet at this point, just counts of how large different lists would be and samples of the kinds of pages that would be on them. {{u|Sdkb}}talk 17:21, 7 June 2021 (UTC)
I was able to improve the runtime to 2 hours with quarry:query/56101, the key being to refactor it and set a LIMIT in the right place to prevent the db from chugging through the entire history of edits for pages that have too many. This can be improved further if we put more WHERE conditions in sub1, such as filtering out disambig pages that are making up much of the output. – SD0001 (talk) 07:47, 20 June 2021 (UTC)
Turns out filtering out disambigs didn't speed it up. Almost the same time: quarry:query/56118. – SD0001 (talk) 10:06, 20 June 2021 (UTC)
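For anyone reading along, one way to cap the per-page work without counting every revision is an existence test on the Nth edit: a page qualifies if its 10th non-minor revision simply doesn't exist, so the subquery can stop after at most 10 rows per page. This is only a sketch of the general idea (it ignores the creation-date cutoff), and is not necessarily the refactoring used in quarry:query/56101.

  -- Mainspace non-redirects with fewer than 10 non-minor edits.
  SELECT p.page_title
  FROM page AS p
  WHERE p.page_namespace = 0
    AND p.page_is_redirect = 0
    AND NOT EXISTS (
          SELECT 1
          FROM revision AS r
          WHERE r.rev_page = p.page_id
            AND r.rev_minor_edit = 0
          LIMIT 1 OFFSET 9
        );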

Hello!

The SQL code in c:Category talk:Animated PNG didn't work. Does anyone know how to fix it? The goal is to find all animated PNGs. Jonteemil (talk) 18:17, 3 September 2021 (UTC)

Didn't work how? Did it list images that aren't animated pngs, or images that are already in the category, or not list images that are animated pngs and not in the category, or just run forever without results? Full results aren't going to be fast - there's just short of 3.5 million rows to look at, and no good way to speed it up further - but tacking on a "LIMIT 1" to the end found c:File:3rd order intermod animation (thumbnail).png after a couple of seconds, which looks right to me? —Cryptic 22:38, 3 September 2021 (UTC)
And just letting it run gives me 28 hits in 65 seconds,
none of which are in the category, and almost all of them animated pngs. c:File:Golden Melody Awards 28th Index9..png and c:File:Зонтаг АП Путешествие в Луну.png look like bad data: the first's img_metadata starts 'a:6:{s:10:"frameCount";i:1;' instead of i:0, and the second 'a:6:{s:10:"frameCount";i:19;' despite not being animated at all. Nothing you can do about that, except maybe reupload them and hope they get detected better. I'm a bit surprised there's so few hits, so if there's files it should've found but didn't, please link a few. —Cryptic 23:02, 3 September 2021 (UTC)
@Cryptic: Nvm, I just had to wait a bit I guess, 28 results now. Jonteemil (talk) 01:19, 4 September 2021 (UTC)
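For the record, the kind of check being discussed — scanning PNG rows in the image table for a non-zero frameCount in the serialized metadata — looks roughly like the sketch below. It would be run against the commonswiki replica; the exact metadata format is an assumption based on the snippets quoted above, and this is not the query from the category talk page.

  SELECT img_name
  FROM image
  WHERE img_major_mime = 'image'
    AND img_minor_mime = 'png'
    AND img_metadata LIKE '%"frameCount";i:%'
    AND img_metadata NOT LIKE '%"frameCount";i:0;%'
  LIMIT 50;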

Unused Templates reports

Hello, this is a request I made on the talk page of the Database reports page, but it hasn't been answered over there. It is part of the overall proposal discussion I have started on the WikiProject Templates to create an Unused Templates Task Force to deal with the backlog of unused templates.

I've requested four reports per the original discussion and I'm going to relist them here:

I would like the reports to run for at least two months, as the task force is currently in an idea stage. If possible, I would like to be notified before the reports are due to expire. I need four reports from the Unused Templates database:

1) All unused templates that are not stubs or redirects.

2) All stub templates listed as unused. According to one of the users on the talk page discussion, there are either exactly or about 1,000 stub templates.

3) All redirects listed as unused. According to one of the users on the talk page discussion, there are either exactly or about 69,000 redirects.

4) Templates that were created and/or edited in the last year and the present year. --WikiCleanerMan (talk) 13:23, 28 September 2021 (UTC)

Do you need this regularly, or is a one-off query ok? —Cryptic 22:08, 18 October 2021 (UTC)

U.S. related articles possibly needing date tags

I'd like to refine this list (articles subcategorized under Category:United States to three levels that do not already have {{Use mdy dates}} or {{Use dmy dates}}) to remove all articles that are subcategorized under a different country. It'd also be nice to remove all articles that contain a date in DMY format outside of a reference. Would one or both of those be possible? {{u|Sdkb}}talk 19:08, 18 October 2021 (UTC)

You can't exclude by non-templated date format with a query, for the same reason Petscan can't: no access to article text.
Excluding by other country is possible, but I'd need a more precise definition similar to what you used for "subcategorized under Category:United States" - I suppose something like in the Category:Countries by continent tree to a depth of five? —Cryptic 22:18, 18 October 2021 (UTC)
@Cryptic, how about articles that are in Category:Wikipedia categories named after countries to a depth of three, except for those in Category:United States and the seven categories before the A countries (e.g. Wikipedia categories named after dependent territories‎)? {{u|Sdkb}}talk 22:46, 18 October 2021 (UTC)
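For reference, a fixed-depth category walk can be written as a recursive CTE. A sketch for the tree just proposed (depth 3 from Category:Wikipedia categories named after countries) is below; it is likely to be slow at this size, duplicates from multiple paths are possible (hence the DISTINCT), and it is not the query Cryptic actually ran.

  -- Walk subcategories to depth 3, then list mainspace articles in any of them.
  WITH RECURSIVE subcats (cat, depth) AS (
      SELECT CAST('Wikipedia_categories_named_after_countries' AS CHAR(255)), 0
    UNION ALL
      SELECT page_title, depth + 1
      FROM subcats
      JOIN categorylinks ON cl_to = cat
      JOIN page ON page_id = cl_from
      WHERE page_namespace = 14      -- only descend into subcategories
        AND depth < 3
  )
  SELECT DISTINCT a.page_title
  FROM subcats
  JOIN categorylinks AS cl ON cl.cl_to = subcats.cat
  JOIN page AS a ON a.page_id = cl.cl_from
  WHERE a.page_namespace = 0;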
Excluding those subcategories doesn't turn out to help much - 181004 articles, compared to 183841 with them. (Slightly more than the Petscan results because I didn't exclude soft redirects or disambigs.) Still want the full list? Removing less than 3000 hits wouldn't be worth losing clickable links if I were in your position. —Cryptic 01:13, 19 October 2021 (UTC)
I do, yeah. I'm trying to propose a bot task for it and discovering just how passionately some editors care about date formats. So although I don't personally consider it a tragedy if an article in e.g. both Category:Video games developed in the United States and Category:Video games developed in Canada gets tagged with MDY, removing those articles from the pool will very much be necessary to have a hope of achieving consensus. Cheers, {{u|Sdkb}}talk 01:48, 19 October 2021 (UTC)
Emailed to you. —Cryptic 02:08, 19 October 2021 (UTC)
I object to Canadian articles being mass-changed to mdy date format, such as Category:Video games developed in Canada. Canada has no preference between dmy and mdy; MOS:DATETIES clearly states that "Articles related to Canada may use either format with (as always) consistency within each article". We shouldn't be enforcing a standard date format on Canadian articles against the MOS, and against the fact that Canada doesn't exclusively use mdy dates. Joseph2302 (talk) 15:03, 19 October 2021 (UTC)
Yep, that's why I made this query. If it's not clear, I'm removing Canada (or other country) articles from the list of those that will be affected. {{u|Sdkb}}talk 15:54, 19 October 2021 (UTC)
All good then, if you're just planning on tagging US articles :) Joseph2302 (talk) 16:01, 19 October 2021 (UTC)
That's the idea :) A subset of the list (it was too big to paste entirely) is at User:Sdkb/sandbox/testpage. I didn't find any errors in my spot check, but let me know if you do. {{u|Sdkb}}talk 16:13, 19 October 2021 (UTC)
Here are some articles from your sandbox page that do not appear to be "sure thing" candidates for {{use dmy dates}}: Linguistic Association of Canada and the United States, Party for the Commonwealth of Canada, French ship Saint-Rémi, French ship Beaumont. – Jonesey95 (talk) 16:31, 19 October 2021 (UTC)
Hmm. I think what we're seeing there is some instances where the category tree doesn't reach deep enough. For the first one, it's included because of Category:Education in the United States → Category:Educational organizations based in the United States → Category:Academic organizations based in the United States, but to be excluded we'd have to go four layers: Category:Canada → Category:Education in Canada → Category:Educational organizations based in Canada → Category:Academic organizations based in Canada. For the two ships, I think we'd have to go five layers deep, but that might start to cause substantial leaks. Cryptic, do you know how many pages there would be in the list if we increased the depth for exclusions to four or to five? {{u|Sdkb}}talk 18:06, 19 October 2021 (UTC)
Oh, and it looks like we also need to exclude articles under Category:Military of the United States per WP:MILFORMAT. {{u|Sdkb}}talk 18:57, 19 October 2021 (UTC)

Your count for Linguistic Association of Canada and the United States is right, but your route is off - Category:Education in Canada isn't a direct subcat of Category:Canada, and Category:Academic organizations based in Canada isn't a direct subcat of Category:Educational organizations based in Canada. It goes Category:Wikipedia categories named after countries (depth 0) > Category:Canada (depth 1) > Category:Organizations based in Canada (depth 2) > Category:Organizations based in Canada by subject (depth 3) > Category:Academic organizations based in Canada (depth 4).

What depth for Category:Military of the United States? (Just articles in that category is depth 0, articles in its direct subcat Category:American military aviation is depth 1, etc. to be sure we're talking about the same thing.)

The query excluding articles in depth 4 subcats of Category:Wikipedia categories named after countries looks like it's going to die. Not all that surprising; depth 4 from Category:United States is 8165 categories, depth 3 from Category:Wikipedia categories named after countries is 39249, and depth 4 from it is 166700. —Cryptic 21:24, 19 October 2021 (UTC) Of course it finished immediately after I gave up and saved the edit - 172888 results at depth 4 without doing anything with the Military subtree; Linguistic Association of Canada and the United States and Party for the Commonwealth of Canada correctly excluded, the French ship articles are still there. —Cryptic 21:29, 19 October 2021 (UTC)

(edit conflict) I'd say a depth of three for U.S. military would hopefully capture all the articles in that realm without creating too many leaks? And not too surprising; that's a lot of categories haha. I'm doubtful we'll ever be able to get to 100%, but I think we've gotten pretty close. {{u|Sdkb}}talk 21:33, 19 October 2021 (UTC)
162590 articles. I've mailed you the list. —Cryptic 22:11, 19 October 2021 (UTC)
Much thanks! {{u|Sdkb}}talk 18:38, 20 October 2021 (UTC)

Non-existent Draft PAGENAMEs with in-links from mainspace

Hi. I'm looking for two related queries, one yielding a count, the other a handful of sample pages:

  • count – how many non-existent pages in Draft space have one or more in-links from mainspace? (I.e., if Draft:Foo_123 has one in-link from mainspace, Draft:Bar_456 has five, and Draft:Baz_678 has one hundred, the total count is 3.)
  • sample – can I get a list of around a dozen or two non-existent PAGENAMEs in Draftspace from that set, somewhat randomly selected (i.e., not all starting with 'A').

Extra credit: which red Draftspace pagename has the highest number of mainspace in-links, and how many? Thanks! Mathglot (talk) 06:02, 25 October 2021 (UTC)

All links from mainspace to draft (excluding the 71000+ irritating fake ones created by {{R with possibilities}}) are at quarry:query/59495. The only redlinked one with more than one incoming link is Draft:2021 Football Queensland Premier League. —Cryptic 14:13, 25 October 2021 (UTC)
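quarry:query/59495 itself isn't reproduced here, but the general shape of such a query, assuming the pagelinks schema with pl_namespace/pl_title/pl_from_namespace and without attempting the {{R with possibilities}} exclusion, is roughly:

  -- Mainspace links to Draft: pages that don't exist, grouped per draft title.
  SELECT pl_title AS draft_title, COUNT(*) AS mainspace_inlinks
  FROM pagelinks
  LEFT JOIN page AS target
         ON target.page_namespace = 118
        AND target.page_title = pl_title
  WHERE pl_namespace = 118          -- link target is in the Draft namespace
    AND pl_from_namespace = 0       -- link comes from an article
    AND target.page_id IS NULL      -- ...and the draft does not exist
  GROUP BY pl_title
  ORDER BY mainspace_inlinks DESC;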
@Cryptic: Super, thank you so much! Mathglot (talk) 02:58, 26 October 2021 (UTC)

List of articles with the same journal/magazine/work/website/newspaper= and publisher= in the same template

Basically, I'm looking for things like

  • {{cite journal ... |journal=Foobar |publisher=Foobar}}
  • {{cite magazine... |magazine=Foobar |publisher=Foobar}}
  • {{cite journal ... |work=Foobar |publisher=Foobar}}
  • {{cite web ... |website=Foobar |publisher=Foobar}}
  • {{cite news... |newspaper=Foobar |publisher=Foobar}}
  • {{citation ... |journal=Foobar |publisher=Foobar}}

E.g. if in a template you have any of [|journal=, |magazine=, |work=, |website=, |newspaper=] that matches a |publisher= in the same template, I'd want to know about it. It's OK if whitespace and casing differs. Thanks. Headbomb {t · c · p · b} 00:51, 17 October 2021 (UTC)

This isn't something that can be queried for. We can see template transclusions but not their arguments, unless they have a side effect like categorizing or linking to another page (and even then we couldn't see repeats if, for example, journal=Foobar|publisher=Foobar both caused the article to link to Foobar).
I think the only practical way of finding these would be to make Module:Citation/CS1 emit a tracking category for such matches. Try asking at Module talk:Citation/CS1/Feature requests? —Cryptic 22:27, 18 October 2021 (UTC)
Hmm.. that kinda sucks. Oh well, I'll try to think of another way to do this. Headbomb {t · c · p · b} 23:14, 18 October 2021 (UTC)
You must already know this, but there are some scripts in Ohconfucius's "Sources" that can target those. Dawnseeker2000 09:27, 28 October 2021 (UTC)
@Dawnseeker2000: not aware of this, but do you have a link? Headbomb {t · c · p · b} 22:33, 30 October 2021 (UTC)
Yes, it's User:Ohconfucius/script/Sources.js, lines 194–199 and 430–433. Dawnseeker2000 23:03, 30 October 2021 (UTC)

Searching for edit summary text over the revision table

So last night my time I found this IP range with a rather interesting edit summary pattern, and thought I might try to use my limited MySQL knowledge to figure out how to use database queries to search for more instances of it. I made a working query for the RecentChanges table using query 24762 but my attempt to apply it to the whole revision table failed spectacularly (it seems to have hung (it's been running for about twelve hours now) but I can't figure out how to stop it). I originally tried searching for the phrase "%https://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#/media/File:Defence_Defense_Labour_Labor_British_American_spelling_by_country.svg%" but this thread on mediawiki.org said that prefix searches like this are unbelievably slow, so I changed the query string to "China uses US spelling%" and simplified the query as far as I could without success. I can't figure out how to make use of the "revision_comment" alternative view in this case. Judging by the RecentChanges test and other independent searching of IP ranges that I've done, I'm not sure there'll be many more results, but if anyone can help me make a successful query for this info in the revision table, I'd appreciate it. Graham87 08:31, 7 November 2021 (UTC)

OK, five issues to start with here:
  1. The first part of your join condition for comment ("ON comment_id = comment.comment_id") is tautological. You want "ON comment_id = rev_comment_id".
  2. You really badly need to define the join between revision and page - either explicitly ("FROM revision JOIN page ON page_id = rev_page") or in your where clause ("WHERE /*...*/ AND page_id = rev_page") - or else you're querying over the equivalent of a table the size of those two tables multiplied together, some 70 quadrillion rows.
  3. The revision table has over a billion rows. Without some (indexed) way to restrict which rows to search - and rev_comment_id isn't indexed, and even if it were, comment.comment_text isn't indexed either - there's no way your query will even come close to completing. rev_timestamp, like you do with rc_timestamp in your recentchanges query, is a good first step; a partial match on actor_name might help, too, if narrow enough.
  4. And the alternate view you're looking for is comment_revision, not revision_comment, though in light of #3, that's not going to make an appreciable difference compared to just using comment.
  5. Avoiding the prefixed-wildcard search isn't going to matter, since comment.comment_text isn't indexed anyway. —Cryptic 09:20, 7 November 2021 (UTC)
An example with results limited to 2021 is at quarry:query/59765. A year or two at a time is probably doable, so long as you expect relatively few matches (up to a few thousand) so Quarry doesn't choke. —Cryptic 10:28, 7 November 2021 (UTC)
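The general shape, following the numbered points above (join on rev_comment_id, restrict by rev_timestamp, use the comment_revision view), is roughly as below; the summary prefix is the one quoted in the original request, and this is a sketch rather than the linked quarry query.

  SELECT rev_id, rev_timestamp, comment_text
  FROM revision
  JOIN comment_revision ON comment_id = rev_comment_id
  WHERE rev_timestamp BETWEEN '20210101000000' AND '20211231235959'
    AND comment_text LIKE 'China uses US spelling%';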
Thanks very much! I only recently noticed your second message and using your tips plus other experimentation I made this, which also seems to work OK. Searching for the URL sometimes produces different results to searching for "uses US spelling". Off your query I made query 59770. I'll leave that one running for a while. Graham87 13:37, 7 November 2021 (UTC)

Biography pages redirect opportunities

Related to this idea, I'd like to create a bunch of redirects from e.g. Jane Lastname to Jane Q. Lastname. Would it be possible to compile a list of articles that meet the following criteria?

  1. In a subcategory of Category:People (or talk page tagged with {{WikiProject Biography}}, if that's easier)
  2. The title takes the form [word1] [capitalletter]. [word2]
  3. There is no article with the title [word1] [word2], nor any other articles with the title [word1] [anything] [word2]

Cheers, {{u|Sdkb}}talk 22:24, 21 November 2021 (UTC)

Practically anything is easier than querying category trees.
quarry:query/60063 for hits starting with "A". Limiting it to a constant initial letter at a time makes the query much, much faster, to about a second and a half per hit from about six minutes per. —Cryptic 01:45, 22 November 2021 (UTC)
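A stripped-down sketch of the pattern match follows: mainspace only, titles starting with "A", and checking only that the plain "First Last" title is absent. It does not implement the "[word1] [anything] [word2]" check or the WikiProject Biography restriction, and it is not quarry:query/60063.

  SELECT p.page_title
  FROM page AS p
  WHERE p.page_namespace = 0
    AND p.page_is_redirect = 0
    AND p.page_title RLIKE '^A[a-z]+_[A-Z]\\._[A-Z][a-z]+$'
    AND NOT EXISTS (
          SELECT 1
          FROM page AS existing
          WHERE existing.page_namespace = 0
            AND existing.page_title = CONCAT(
                  SUBSTRING_INDEX(p.page_title, '_', 1), '_',
                  SUBSTRING_INDEX(p.page_title, '_', -1))
        );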
What happens if a reader searches for "Jane Lastname" when the redirect doesn't exist? They'll be presented with the search results, and given that the article Jane Q. Lastname matches with its title, it will appear at (or very near) the top of the search results. Now, what happens if at some point an article is created about Jane R. Lastname? If no redirect had been created, then both articles would remain reachable from the search results. But if a redirect had been created, then only one of the two articles would be accessible.
Even if not looking into the future, what happens now if there's also an article about Jane Lastname (musician)? Or one about another Jane Lastname that's currently titled J.K. Lastname? – Uanfala (talk) 02:09, 22 November 2021 (UTC)
@Uanfala, ack, I forgot about disambiguators; thanks for bringing that up. @Cryptic, does the quarry cover that or does it need to be adjusted? I'm not sure there's anything we could reasonably do about the J.K. Lastname example, but that seems like a pretty rare situation.
Regarding the possibility of future articles, my view is that it's responsibility of whoever creates an article to integrate it into the encyclopedia. So whoever creates Jane R. Lastname should either put a hatnote at Jane Q. Lastname (if she is the primary topic) or turn Jane Lastname into a disambiguation page (if not). Best, {{u|Sdkb}}talk 02:38, 22 November 2021 (UTC)
It may have been the creator's (or curator's) responsibility to create navigation from "Jane Lastname" to "Jane Q. Lastname", but the fact that you're here – and that your query is returning thousands upon thousands of results – is a clear indication that those people haven't reliably done that. I sympathise with your undertaking, but it will be a net positive in the long run only if there exists a robust system of tracking newly created articles of this type and disambiguating as needed. – Uanfala (talk) 03:06, 22 November 2021 (UTC)
In addition to helping readers navigate, the situation I'm seeking to prevent with this is where someone redlinks Jane Lastname, someone later creates Jane Q. Lastname without the redirect, and then a third person clicks on the redlink and tries to turn it into an article, not realizing one already exists. I've seen that happen or nearly happen a few times, so that seems like the bigger issue to me. {{u|Sdkb}}talk 20:53, 22 November 2021 (UTC)
It won't match titles with disambiguators; it specifically looks for word2 at the end of the title. I can make it do so, but it's actually easier for me to make it look for Lastname anywhere in the title, so Jane Q. Lastname would be prevented by Jane Lastname, Austria, if you'd prefer. —Cryptic 10:20, 22 November 2021 (UTC)

# of autopatrolled articles created per day

I drafted a query for this but ran into some issues. Can you guys take a look and see if you can fix it? I'm trying to use the alias day in some spots that it doesn't like, and I don't know how to fix it. Thanks. –Novem Linguae (talk) 02:18, 13 December 2021 (UTC)

COUNT(*), and put your GROUP BY clause before your ORDER BY. —Cryptic 02:22, 13 December 2021 (UTC)
Awesome, that worked, thank you. Second question. Why is this query only returning a year's worth of data? Does the pagetriage_log table get old entries deleted or something? –Novem Linguae (talk) 02:27, 13 December 2021 (UTC)
Seems so. "SELECT ptrl_timestamp FROM pagetriage_log ORDER BY ptrl_timestamp ASC LIMIT 1" gives me 20201212205515. —Cryptic 02:31, 13 December 2021 (UTC)
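For reference, the fixed shape — a day string derived from the timestamp, GROUP BY before ORDER BY — would look something like the sketch below. Whatever condition actually identifies autopatrolled creations in pagetriage_log is left as a placeholder, since, as noted elsewhere on this page, that table is only thinly documented.

  SELECT LEFT(ptrl_timestamp, 8) AS day, COUNT(*) AS entries
  FROM pagetriage_log
  -- WHERE ...   (placeholder: the condition selecting autopatrolled creations)
  GROUP BY day
  ORDER BY day DESC;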

List of article creators

I wrote a query that gets an article's author. If I wanted to get this to get the authors of multiple articles in one query, what's the best way to refactor it? WHILE? Subquery? Here's my list of page IDs. Thank you. 66411662, 63051276, 67331300, 67494602, 67251604, 67475738, 67025524, 67282385, 67492243, 67505824, 68570713, 65754113, 68673796, 67481281, 68288689, 67485694, 68624634, 67062564, 67486327, 65912571, 67495969, 65558215, 67162967, 67504737, 66978220, 65952291, 67306801, 64208962, 67222236, 67365517, 68510913, 67480539, 66411751, 65228882, 67252944, 66476730, 68469744, 67008083, 66555751, 67282708, 67419043, 68693806. –Novem Linguae (talk) 16:57, 27 December 2021 (UTC)

Rather than fetching all revisions for a given page and showing only the lowest, I'd get a list of rev_id's with a subquery (SELECT MIN(rev_id) FROM revision WHERE rev_page IN (66411662, 63051276, /* etc etc */) GROUP BY rev_page) then select page_namespace, page_title, actor_name for each of those. —Cryptic 17:11, 27 December 2021 (UTC)
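Spelled out with the first few of the page IDs above (the remaining IDs go in the same IN list), that approach is roughly:

  SELECT page_namespace, page_title, actor_name
  FROM revision
  JOIN page  ON page_id = rev_page
  JOIN actor ON actor_id = rev_actor
  WHERE rev_id IN (
      SELECT MIN(rev_id)
      FROM revision
      WHERE rev_page IN (66411662, 63051276, 67331300)  -- ...rest of the IDs
      GROUP BY rev_page
  );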

Stuck queries

I have a couple of stuck queries (61115, 58635) which still claim to be running hours beyond the max_statement_time I set. I'm aware that the tables are large, but I'm an experienced Quarry user who used to write SQL for a living, and the optimizer suggests that the queries use the expected indexes efficiently. The Stop button returns an HTML 500 response and seems to do nothing. Are these SQLs effectively dead, or is there some sort of regular clean-up process to release them? I'm reluctant to fork them or otherwise repeat my work in case it overloads the server. I remember that queries used to claim to be running after they had died for some reason (out of memory?) but in those cases it was possible to amend and re-run the query rather than it remaining permanently locked as "running". Certes (talk) 00:29, 31 December 2021 (UTC)

I don't know a thing about SQL, but the stop button not working – which of course is only one aspect of what you're describing – appears to be a known issue: https://phabricator.wikimedia.org/T290146. – Uanfala (talk) 01:12, 4 January 2022 (UTC)
I've completed that task in a less efficient way by correlating a series of big Cirrus searches, but would still be interested in any replies. Certes (talk) 20:47, 5 January 2022 (UTC)

Missing Redirects Project

 –  ― Qwerfjkltalk 17:57, 7 January 2022 (UTC)

File queries across all wikis

Hello!

I would like a query of all files with MIME type image/x-bmp and image/x-xcf, respectively, across all wikis (all language versions of Wikipedia, Commons, Wikibooks, MediaWiki etc.). They can all be found at Special:MIMESearch/image/x-bmp and Special:MIMESearch/image/x-xcf respectively on each wiki (or Special:Search/file: local: filemime:bmp / Special:Search/file: local: filemime:xcf). I would like the list to be rendered like this so a wikilink will work on any wiki:

w:File:NAME.bmp
fr:File:NAME.bmp
c:File:NAME.bmp
pt:v:File:NAME.bmp
etc.

Is that possible? To clarify, I would like two queries, one per MIME type. Thanks! Jonteemil (talk) 21:28, 9 January 2022 (UTC)

It's not possible now that wikis are on different database servers (and it was never easy before that). The best that can be done is to show you how to run the query on a single wiki, and let you run it on each yourself. —Cryptic 21:44, 9 January 2022 (UTC)
@Cryptic: I see, thanks for your reply. Since special:MIMESearch will work on every wiki it's not really necessary to make a query. The reason for requesting a query was only to have a compiled list of all BMP and XCF files across all wikis, but I guess that's not possible. Jonteemil (talk) 22:49, 9 January 2022 (UTC)

A-level articles that are not GA or FA

I need to check these articles. Can anyone do this?--28 (talk) 07:52, 10 January 2022 (UTC)

quarry:query/61425. —Cryptic 12:38, 10 January 2022 (UTC)

Missing colo(u)r redirects

Can I have a list of all mainspace and Wikipedia namespace pages containing either spelling of the word "colo(u)r", or derivative words, regardless of redirect/disambiguation status, where there is no page with the other spelling? Animal lover 666 (talk) 09:57, 13 February 2022 (UTC)

  • It'll be easier and more flexible to inspect the page titles dump (here; search for "all-titles") than to query, since you don't have any restrictions other than the titles and the strings you're looking for aren't at the start. —Cryptic 10:31, 13 February 2022 (UTC)
This query may be what you need. It interprets "derived words" liberally as anything containing Color or Colour; you probably wanted Colored but not Colorado. I don't think everything on that list deserves a redirect. Certes (talk) 13:23, 13 February 2022 (UTC)
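One way to express the "other spelling doesn't exist" check is a self-join on the title with the spelling swapped. A sketch covering only the Colour→Color direction with that capitalisation (Certes's linked query may differ, and the title scan is unindexed, so it is slow):

  SELECT p.page_namespace, p.page_title
  FROM page AS p
  LEFT JOIN page AS alt
         ON alt.page_namespace = p.page_namespace
        AND alt.page_title = REPLACE(p.page_title, 'Colour', 'Color')
  WHERE p.page_namespace IN (0, 4)        -- mainspace and Wikipedia namespace
    AND p.page_title LIKE '%Colour%'
    AND alt.page_id IS NULL;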

Help optimizing a query?

I'm trying to select (mainspace, non-redirect, non-dab) pages which have small gaps between their page_random value and the next-smallest page_random value (again, among mainspace non-redirect non-dab pages). I have a query here, which works when limited to a small subset of pages, but scales very badly as I relax the range of page_random values. I'd very much appreciate it if anyone could suggest any optimizations that would let me run it over the whole pages table (or at least larger chunks of it). Colin M (talk) 18:35, 4 March 2022 (UTC)

I can take a look by Sunday. TrangaBellam (talk) 18:41, 4 March 2022 (UTC)
The operation takes quadratic time because you can't index copy on rownum. Is LAG useful here? Certes (talk) 18:55, 4 March 2022 (UTC)
This is exactly what I needed! I had tried using LAG previously after seeing some mentions of it on StackOverflow, but I got a syntax error and just assumed the function wasn't supported by whatever SQL dialect/version Quarry uses. I must have just been using it wrong. Based on the documentation you linked, I was able to write a new query using LAG which is orders of magnitude faster. Thank you so much, Certes! Colin M (talk) 20:33, 4 March 2022 (UTC)
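For anyone finding this later, the LAG-based shape (without the disambiguation-page exclusion, which needs an extra join on page_props) is roughly as follows; it is a sketch of the idea, not Colin M's actual query:

  SELECT page_title, page_random, gap
  FROM (
      SELECT page_title, page_random,
             page_random - LAG(page_random) OVER (ORDER BY page_random) AS gap
      FROM page
      WHERE page_namespace = 0
        AND page_is_redirect = 0
  ) AS ranked
  WHERE gap IS NOT NULL
  ORDER BY gap ASC
  LIMIT 50;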

Map of countries by GAs/FAs per capita

This is purely for curiosity (or a possible Signpost feature), so take it up only if it's of interest. But I'd like to create a map of the world that shows how many good articles each country has per capita, and a similar one for FAs. The query for this would be articles in Category:Good articles (or Category:Featured articles) whose Wikidata item has country (P17) set to a given country. Those would then be adapted to shade an interactive map similar to Wugapodes' {{Interactive COVID-19 maps}}. {{u|Sdkb}}talk 08:06, 17 March 2022 (UTC)

  • You really want to be at d:WD:RAQ, not here, but here's a very basic version. They can probably add bells and whistles like merging "Kingdom of England" into its eventual successor state "United Kingdom" over there.
    You know P17 excludes people, right? (I didn't.) This query uses P27 (citizenship) too, though place of birth might be better. —Cryptic 10:36, 17 March 2022 (UTC)
    @Cryptic, okay, I got something working-ish lol: User:Sdkb/sandbox#Sandbox (set date slider to March 15). And the winner is...  Greenland, with 20 FAs per million people! (Or, to frame it slightly less impressively, 1 FA and 50,000 people lol) {{u|Sdkb}}talk 18:13, 17 March 2022 (UTC)

And here's for GAs with people, a larger data set:

The winner is still Greenland, but Iceland's exceptional performance is looking less like a fluke. I wonder why that is (and Norway, too). And I expected India to do badly, but...geez. The only two visible countries to appear to have no GAs are Turkmenistan and Guinea-Bissau; I'll go offer rewards for those. {{u|Sdkb}}talk 19:17, 17 March 2022 (UTC)

@Sdkb: I note that WikiProject Turkmenistan says they have one GA (Battle on the Irghiz River) while WikiProject Guinea-Bissau has two (2012 Guinea-Bissau coup d'état and Guinea-Bissau women's national football team). I suggest your query has an error. Chris Troutman (talk) 20:17, 17 March 2022 (UTC)
@Chris troutman, thanks for finding those! It's certainly an imperfect query—Battle on the Irghiz River doesn't have any country info on Wikidata, so that's why it was missed, but I'm not sure about Guinea-Bissau. {{u|Sdkb}}talk 20:21, 17 March 2022 (UTC)
Guinea-Bissau shows up with 2 good articles in both queries for me. —Cryptic 21:13, 17 March 2022 (UTC)
I traced the error back to this. {{u|Sdkb}}talk 21:43, 17 March 2022 (UTC)
@Sdkb: this is quite fascinating, but the scrolling image above doesn't seem to work for me. If you get a static image of some kind, I think it'd be a fun addition to add to WP:Countries. CMD (talk) 05:57, 27 March 2022 (UTC)
@Chipmunkdavis, the next iteration of the idea is at Wikipedia:Village_pump_(technical)#Pushpin_map_of_the_world's_FAs, if you're curious. WP:Countries would definitely be a good spot to add it! {{u|Sdkb}}talk 06:02, 27 March 2022 (UTC)

  Courtesy link (pushpin map): VPT Archive 196. Mathglot (talk) 04:02, 2 April 2022 (UTC)

Orphans, and poorly-linked articles in a WikiProject

Hello, I'm interested in creating a list of orphans and articles with very few inlinks, which may be helpful as a To-do list for WikiProjects. For starters, and as an illustration and real-world example, I'm interested in a list of orphan articles that are categorized in WP:WikiProject Military history ("orphan" defined as an article with no in-links from other pages in mainspace; "in WikiProject X" means the associated talk page contains template {{WikiProject X}}, or one of its redirects.) I was able to find this advanced search which is not quite the same thing but the "1,259" result may be a lower bound for the number of actual orphans.

Before generating any possibly long lists of articles, I'd like to first get a feeling for the scope of the problem. So, as a first db query can we do a summary query which returns eleven rows containing two values, where column one (let's call that 'nbr of inlinks') contains the integers 0 to 10, and column two ('article count') is the number of articles (ns=0) in WP:WikiProject Military history containing exactly that number of inlinks from other mainspace articles. So, e.g., the top data row contains a '0' on the left, and the number of orphans in WP:MILHIST on the right. The next row, a '1', and the count of MILHIST articles with exactly one (non-self) in-link from mainspace. And so on.

If the article count is not too huge, could we do a second query with one column containing the list of orphan articles in MILHIST as well? Please skip redirects, and I'm probably forgetting some other stuff, but anything that wouldn't be useful in a "ToDo" article list for a wikiproject should be skipped as well. If you can present the article titles as wikilinks, that would be handy, and if it's not too difficult to add a second column containing a wikilinked Q-item from Wikidata for those articles that have Wikidata links with at least one non-English Wikipedia link (i.e., >= 1 link excluding en-wiki and simple), that would be ideal. (Example: Olga Bogdanova is an orphan, but has d:Q26243976 attached to it, containing two non-English Wikipedia articles (ar, and ru); so in this case, col. 2 would be d:Q26243976.) Thanks! Mathglot (talk) 02:02, 2 April 2022 (UTC)

quarry:query/63558 for the summary; quarry:query/63559 for the full list (just short of 75000 rows, so it'll take a while to load). If there's a way to join any wikidata data at all after the server split, I don't know it. —Cryptic 13:05, 2 April 2022 (UTC)
Well done; I was wondering how to do this efficiently. Beware that some unlinked pages such as 1st Brigade are dabs which don't need more links, and apparent orphans may be linked via redirects. Certes (talk) 13:23, 2 April 2022 (UTC)
Incoming links directly from redirects (like from 1971 war (disambiguation) to 1971 war) shouldn't be counted either - for every article with no incoming links at all, there were another four whose only incoming links were from redirs. (It's conceivable that there are redirs that link to pages other than the ones they're redirecting to; those get removed too, but they're quite rare, they shouldn't be doing that, and the links wouldn't display anyway.) Removing those and dabs isn't a problem: quarry:query/63563 (summary), quarry:query/63564 (full list).
Counting incoming links via redirects, like from (hypothetical) redirect Alpha to Beta, where we want to count articles that link to Alpha in our count for incoming links to Beta, is harder to do in a single query. I can't get it working efficiently for even a single page, let alone the large aggregate query. —Cryptic 15:06, 2 April 2022 (UTC)
Thank you so much for this. I'm a little confused what the difference is between the two versions, with the roughly 1:4 difference in size for the first row of each. I understand the part that dab pages shouldn't be included, but shouldn't that reduce the count after query 63558 was modified to 63563? Is the second one including dabs with few inlinks, or excluding them? What accounts for the 4x larger counts in that first row? As a practical matter, if we want to fix the articles with no in-links, which is the better query result list to use, the one indicating '1206', or '4125' articles? Putting it another way, how do I describe in English what those two numbers are, so I can explain them to members at the Project? Mathglot (talk) 21:13, 2 April 2022 (UTC)
63563 displays slightly more articles than 63559 did. The extra articles had more than ten incoming links including redirects, but have ten or fewer now we are excluding redirects. The increase would be greater, but it is reduced by excluding the dabs. Certes (talk) 21:57, 2 April 2022 (UTC)
63558 shows '1206' for row 1 (inlinks=0), and 63563 shows '4125' in row 1; what do those numbers represent exactly, and why the big difference? I assume that '1206' means, "there are 1206 mainspace pages in the project with zero mainspace inlinks, excluding circular redirects", is that correct? What does the '4125' mean? Mathglot (talk) 22:02, 2 April 2022 (UTC)
The second pair of queries removes dabs, but also doesn't count inlinks that are from redirect pages. An article like 1st-2nd Chasseur Regiment was counted in the first set of queries as having one inbound link, despite that link being from 1st-2nd Chasseurs Regiment which redirects to it. The second one correctly counts it as having no inbound links - you don't really unorphan an article by creating a redirect to it, unless something links to that redirect. There's 1206 pages with no mainspace pages in the project linking to them at all; 4125 non-dab mainspace pages with no pages linking to them except perhaps redirects to that page. —Cryptic 22:12, 2 April 2022 (UTC)
Aha, now I get it. Thanks. Mathglot (talk) 22:26, 2 April 2022 (UTC)
I've had a go at including links to redirects in quarry:query/63571. This removes 11,000 articles which go over the ten-link limit once links to redirects are included. It's basically Cryptic's query, sliced and reassembled, plus a simple visit to the redirect table, but I get to tick off WITH RECURSIVE in my I-Spy book of SQL keywords. Analysis by link count is 0: 1540, 1: 5586, 2: 8008, 3: 9115, 4: 8892, 5: 7756, 6: 6817, 7: 6045, 8: 5156, 9: 4545, 10: 4104. Certes (talk) 23:32, 2 April 2022 (UTC)
Nice refinement, and the first two or three rows are probably worth plotting over time. I'll check around for some graphing extension or template that can plot it, or else a tool that can create a graph we can upload to Commons once every X interval. But where are you getting the summary figures from? I don't see them. I checked a few adjacent quarry numbers, but still can't find it. These values come from a query I might want to rerun periodically, so a link to it would be great.
Interesting pattern in those counts, don't you think? A concave-down curve, peaking at 3. Wonder if there's a hidden narrative in the data there? I'm guessing something like, "After creating the page, the editor goes on to find around 3 articles to link it from. They figure that at '3', they've done their duty for integrating it into the rest of the project, and they're done. Some over-achievers do more, and the lazy ones do less." Or, what's your narrative? (I realize I'm off-topic for *requesting* a query, but these data paint an interesting picture to think about.) Mathglot (talk) 00:10, 3 April 2022 (UTC)
The number of incoming links may resemble a Poisson distribution, with every relevant article having some probability of being linked to the near-orphan. I didn't do the counting in SQL: it's much quicker to download the data and run cut|uniq on it! I've forked quarry:query/63580 with a nasty hack to include totals at the start of the single allowed result set. I've also added a list of link targets (article title plus redirects), separated by "|". Certes (talk) 12:36, 3 April 2022 (UTC)
No need to hack just to get more than one resultset (e.g. quarry:query/28650), though you're still better off with it here to avoid running the expensive query twice. —Cryptic 13:12, 3 April 2022 (UTC)
I meant that I couldn't get the WITH to apply to both result sets. The not-a-temporary-table vanishes at any hint of a semicolon which, as you say, means repeating the expensive bit. Certes (talk) 13:34, 3 April 2022 (UTC)
Oooh, that's really nice, thanks! If I understand it correctly (only bits of it), the part above the UNION gathers the 11 summary rows, and the part below it gathers the long article list, and the hack somehow involves renaming or recasting things to make them look enough like the same type of thing, that they can appear in the same column, and then be sorted in such a way as to dump the 11 summary rows first; is that very approximately what's going on there? I just wanted to say I very much appreciate and have enjoyed watching both of you develop this, as well as the interplay resulting in various refinements, and I hope you've enjoyed it, too. I think we have a winner, here, and this can be our "regular query" to rerun from time to time. Thank you both very much! Mathglot (talk) 17:31, 3 April 2022 (UTC)
Yes, that's what's going on. The final step would be more legible and maintainable as two separate queries, but running them within the Quarry framework would mean doing the hard work twice, doubling both the server load and the time you have to wait. Certes (talk) 21:09, 3 April 2022 (UTC)
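As an illustration only (not the actual orphan query), the summary-plus-detail hack being described has roughly this shape: one WITH expression feeding both halves of a UNION, with a sort key to keep the summary rows first.
-- Toy example of the pattern: the CTE is computed once, the first SELECT
-- produces a one-row summary, the second the detail rows, and sort_key
-- forces the summary to the top of the single result set.
WITH sizes AS (
  SELECT page_title, page_len
  FROM page
  WHERE page_namespace = 4 AND page_len < 10
)
SELECT 0 AS sort_key, 'pages found' AS item, COUNT(*) AS n FROM sizes
UNION ALL
SELECT 1, page_title, page_len FROM sizes
ORDER BY sort_key, n;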
@Certes: Understood, thanks. One last thing: I've forked this to quarry:63623 so I can run it. If perchance you play around with 63580 or tweak it further, can you ping me and let me know, so I can fork another? Otherwise, I think we're done. Thanks! Mathglot (talk) 20:44, 5 April 2022 (UTC)
Will do, though I'm not planning any further tweaks. (By the way, it's quarry:query/63623. Bizarrely, quarry:63623 looks for a user of that name.) Certes (talk) 22:08, 5 April 2022 (UTC)
Also, a piggyback question: if we wanted to run the summary query once a week, or month, or whatever to watch the progress in reducing orphans, is there some tool you know of in xtools or someplace accessible from a WikiProject page, that would let us plug in the counts, and generate a simple time-series graph as a bar chart or line graph, showing how the counts vary over time? That would be ideal. Thanks, Mathglot (talk) 21:18, 2 April 2022 (UTC)
There are instructions on rerunning queries at Wikipedia:Request a query/Archive 1#Newly created Presbyterian articles. —Cryptic 22:12, 2 April 2022 (UTC)
Thanks. Mathglot (talk) 00:10, 3 April 2022 (UTC)

This was about more than mere curiosity, and I hope it will lead to significant improvements in orphan backlog reduction. I've created a pilot at WikiProject Military history, and if you're interested in following along, see Wikipedia talk:WikiProject Military history#De-orphaning articles, and follow the link to the demo page. Mathglot (talk) 01:01, 4 April 2022 (UTC)

List of fully-protected or interface protected double redirects

As bots can't fix these. Thanks! 🐶 EpicPupper (he/him | talk) 21:50, 15 April 2022 (UTC)

quarry:query/63824. There's only seven admin-protected ones (and two autoconfirmed). No explicitly interface-admin-protected ones at all, but there are twelve from titles ending in .js or .css - quarry:query/63825. —Cryptic 22:27, 15 April 2022 (UTC)
Thanks @Cryptic! Is there a way to filter by only admin-protected ones? Thanks! 🐶 EpicPupper (he/him | talk) 22:36, 15 April 2022 (UTC)
Add AND pr_level IN ('sysop', 'interface-admin') to the end of the first query, before the semicolon. I didn't mainly to prove that there aren't any interface-admin-protected ones, as opposed to a typo in the group name, or a missed rename like the recent 'oversight' to 'suppress'. (I'm fairly sure the current configuration disallows explicit protection to interface-admin anyway, and if we haven't changed the internal name from sysop to administrator yet we aren't likely ever to, but might as well be sure.) —Cryptic 22:48, 15 April 2022 (UTC)
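A minimal sketch of the kind of query being discussed, assuming the standard page, redirect and page_restrictions tables (the real queries are quarry:query/63824 and 63825):
-- Double redirects (a redirect whose target is itself a redirect) where the
-- first page has an edit restriction at sysop or interface-admin level.
SELECT p1.page_namespace, p1.page_title, pr_level
FROM page p1
JOIN redirect r1 ON r1.rd_from = p1.page_id
JOIN page p2 ON p2.page_namespace = r1.rd_namespace AND p2.page_title = r1.rd_title
JOIN page_restrictions ON pr_page = p1.page_id AND pr_type = 'edit'
WHERE p1.page_is_redirect = 1
  AND p2.page_is_redirect = 1
  AND pr_level IN ('sysop', 'interface-admin');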

cite book template with |location=<location> and without |publisher=<publisher name>

Because of discussion at Help talk:Citation Style 1 § |location= without |publisher=, I wonder if there is a way to get the number of articles that meet these criteria:

  1. has a {{cite book}} template that has:
    |location=<location> AND
    (
    |publisher= is empty OR
    |publisher= is omitted
    )
  2. has a {{citation}} template that:
    does not have:
    (
    |journal=<journal title> OR
    |magazine=<magazine title> OR
    |newspaper=<newspaper title> OR
    |periodical=<periodical title> OR
    |website=<website title> OR
    |work=<work title>
    ) AND has
    |location=<location> AND
    (
    |publisher= is empty OR
    |publisher= is omitted
    )

Sure, we can modify Module:Citation/CS1 to add templates that match these criteria to a category but it might be better to know beforehand the magnitude of the problem if there is one.

All I want is the numbers. Thanks.

Trappist the monk (talk) 15:37, 23 April 2022 (UTC)

I don't think Quarry can do this because it has no access to the template parameters or wikitext. A search like this might work for question 1 but doesn't return any matches for me so the regexp may contain errors. Question 2 could be done similarly in theory but might be over the 300-character limit or time out (needing multiple runs with prefix:A etc.). Certes (talk) 16:54, 23 April 2022 (UTC)
Thanks for that. Yeah, errors; \s does not match whitespace characters. For example, this search does not find thousands of articles using {{cite book}} (with a space) but this search does. Similarly, \w is just an escaped w... Rewriting your regex like this:
hastemplate:"cite book" insource:/\{ *cite book *(\| *(([^p]|p[^u]|pu[^b])[^\|\}]*|publisher *= *))*\| *location *= *[A-Za-z][^\|]*(\| *(([^p]|p[^u]|pu[^b])[^\|\}]*|publisher *= *))*\}\}/i
returns some number of results; Sweden, for example, has {{cite book}} with |location= and without |publisher= but, alas, all {{cite book}} templates in U.S. state, though returned by the search, have both |location= and |publisher=.
Writing a regex to find something that doesn't exist is difficult; that cirrus search has crippled regex doesn't make it any easier... Thanks for the attempt. Were I you, I would not spend any more time on this.
Trappist the monk (talk) 18:19, 23 April 2022 (UTC)
Thanks for spotting my silly mistakes: I forgot Cirrus was so limited. If only we had (?!…) or anything else resembling PCRE. Certes (talk) 19:13, 23 April 2022 (UTC)
Just in case someone tries to use this in future: false positives come from [^p] matching a space. [a-oq-z] works less badly. Certes (talk) 22:53, 24 April 2022 (UTC)

Documentaries with plot sections

Would it be possible to take the list here and filter it to only contain entries that have the wikitext "==Plot==" or "== Plot =="? {{u|Sdkb}}talk 02:23, 28 April 2022 (UTC)

Not with a query - there's no access to page text. —Cryptic 08:40, 28 April 2022 (UTC)
Actually, correcting myself slightly - we can see section names in the very limited case where there's a redirect to that section. There's 94 redirects to sections of those articles, none named "Plot"; the closest there is is Riley ReynoldsHot Girls Wanted#Synopsis. —Cryptic 14:16, 28 April 2022 (UTC)

Edit summaries

Hi, I'd like to know if I can find out how many times a particular edit summary from a popular semi-automated tool has been used. Ohconfucious's MOS:NUMDATES script's former edit summary was: date formats per [[MOS:DATEFORMAT]] by [[WP:MOSNUMscript|script]] (date formats per MOS:DATEFORMAT by script). Dawnseeker2000 21:30, 8 May 2022 (UTC)

A quick check through recent changes shows 3300 in the last month (quarry:query/64419). A full check is likely to take much longer. Certes (talk) 00:05, 9 May 2022 (UTC)
quarry:query/64420 can be limited to date ranges and takes about 2 seconds per day, or 12 minutes per year, so could be run in sections. That text started to appear on 10 December 2014, so nine one-year runs might work. Certes (talk) 00:25, 9 May 2022 (UTC)
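A hedged sketch of the date-ranged approach described above (quarry:query/64420 is the authoritative version); the one-year window here is illustrative.
-- Count edits whose summary starts with the script's text, one year at a time.
SELECT COUNT(*)
FROM revision
JOIN comment ON comment_id = rev_comment_id
WHERE rev_timestamp BETWEEN '20150101000000' AND '20160101000000'
  AND comment_text LIKE 'date formats per [[MOS:DATEFORMAT]] by [[WP:MOSNUMscript|script]]%';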
Thank you sir. Dawnseeker2000 06:51, 9 May 2022 (UTC)

Queries

Hello, folks,

Is it possible to run a query on my own contributions and logs? This is not for a higher purpose, I am just curious about how many Teahouse invitations I've posted (I think it's in the tens of thousands) so I'd run a query to see how many of my edit summaries state "Welcome to Wikipedia: check out the Teahouse!" since that is the default edit summary when posting an invitation. I'm also curious about the pages that I've deleted, if I could run a query on the deletion summaries on Special:Log/Liz to see the proportion that were based on a particular speedy deletion criteria or whether they were deleted PRODs or because of an AfD decision. This query might be more complicated because sometimes multiple CSD criteria are listed in a deletion summary or the wording is not concise. But I don't think I've ever left a blank deletion summary so there would be some reason listed for every page deletion I've made.

If these tools are really intended to help manage the project, I understand that my request is purely personal. It's just that my edit count is growing and I know at a certain point in the future, Edit Count will no longer be available for me to see how my edits break down, so I thought I'd inquire about these two queries before my edit counts and log entries become too large for these to be practical requests.

Many thanks for any feedback you have, even if you tell me that these results are not possible to obtain. I'm constantly in awe of editors and admins who create bots or write scripts and make tools which make our editing here so much easier and simple. Your talents are appreciated! Thanks again. Liz Read! Talk! 20:24, 9 May 2022 (UTC)

For the first request, you can use https://sigma.toolforge.org/summary.py (which is linked as "edit summary search" at the bottom of your contributions page). I'll let someone else handle the second request. Oh, and you can still use WikiScan to get similar information to what the edit count report provides when you cross its limit. * Pppery * it has begun... 20:43, 9 May 2022 (UTC)
For your CSD counts, here's an approximation I wrote in SQL. To refresh these numbers in the future, log in to Quarry, click "fork", then click "Submit query". –Novem Linguae (talk) 22:17, 9 May 2022 (UTC)
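A minimal sketch of the sort of deletion-log count being described, assuming deletion summaries mention the criterion in a recognisable form (the G13 pattern below is illustrative, not the actual query):
-- Count one admin's page deletions whose summary mentions a given CSD criterion.
SELECT COUNT(*)
FROM logging
JOIN actor ON actor_id = log_actor
JOIN comment ON comment_id = log_comment_id
WHERE log_type = 'delete'
  AND log_action = 'delete'
  AND actor_name = 'Liz'
  AND comment_text LIKE '%G13%';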

Quarry : subpages

Hey, I'm trying to list French sub talkpages 'À_faire' (like fr:Discussion:Dantès/À faire) that still exist while the talkpage (like fr:Discussion:Dantès) OR the main page (like fr:Dantès) doesn't exist.

I started with:

SELECT page_title
FROM page talkpage
WHERE talkpage.page_title LIKE '%/À_faire'
AND talkpage.page_namespace IN (1,5,11,15,101,119,711,829)

But I'm not able to finish this SQL query with a NOT EXISTS in order to check:

Maybe this is a starting point:

AND NOT EXISTS (
SELECT 1
FROM page mainpage
WHERE mainpage.page_namespace=talkpage.page_namespace-1
AND mainpage.page_title=talkpage.page_title)

But... it's going to list French sub talkpages 'À_faire' when pages like fr:Dantès/À faire don't exist, but that's not what I'm looking for. I need to check whether fr:Dantès OR fr:Talk:Dantès doesn't exist instead. Any clue?

Thanks --LD (talk) 22:10, 12 May 2022 (UTC)

This should show '/À_faire' pages where neither fr:Dantès nor fr:Talk:Dantès exists.
AND NOT EXISTS (
SELECT 1
FROM page mainpage
WHERE mainpage.page_namespace IN (talkpage.page_namespace-1, talkpage.page_namespace)
AND mainpage.page_title=SUBSTRING(page_title,1,CHAR_LENGTH(page_title)-8))
8 here is short for CHAR_LENGTH('/À_faire'). Did you also want to see pages where one of the two exists but not the other? Certes (talk) 22:50, 12 May 2022 (UTC)
Thanks for answering @Certes; yeah, I'd also like to see that. LD (talk) 22:59, 12 May 2022 (UTC)
AND (NOT EXISTS (SELECT 1 FROM page mainpage
    WHERE mainpage.page_namespace = talkpage.page_namespace-1
    AND mainpage.page_title=SUBSTRING(page_title,1,CHAR_LENGTH(page_title)-8))
 OR NOT EXISTS (SELECT 1 FROM page mainpage
    WHERE mainpage.page_namespace = talkpage.page_namespace
    AND mainpage.page_title=SUBSTRING(page_title,1,CHAR_LENGTH(page_title)-8)))
I.e. AND (NOT EXISTS(main page) OR NOT EXISTS(talk page)) Certes (talk) 23:02, 12 May 2022 (UTC)
@Certes, thanks again. Erm, something might be wrong: the query listed fr:Discussion:'Pataphysique/À_faire while the main page fr:'Pataphysique and the talk page fr:Discussion:'Pataphysique both exist.
It should rather list <any 'À_faire' talk subpage> WHEN <the main page> AND/OR <the talk page> doesn't exist. LD (talk) 23:20, 12 May 2022 (UTC)
Try quarry:query/64526. Although documented as working on characters, SUBSTRING seems to work on bytes, so multi-byte characters like À give the wrong answer. Certes (talk) 23:48, 12 May 2022 (UTC)
Thanks for solving that, @Certes, I really appreciate it! LD (talk) 00:16, 13 May 2022 (UTC)

Files whose extension doesn't match the MIME type

Hello!

I'd like to find files whose extension doesn't match the MIME type. I can find them using CirrusSearch, for example Special:Search/file: local: filemime:png -intitle:/\.png/i. You must however do this MIME by MIME - you cannot do all MIMEs simultaneously. It'd be nice to do a collective query of ALL files which don't match their MIME type. You can find all possible file extensions and MIME types on Special:MediaStatistics. Jonteemil (talk) 19:58, 17 May 2022 (UTC)

Quarry:query/64650 finds just the one offender: File:Poly.pov. Certes (talk) 21:46, 17 May 2022 (UTC)
@Certes: Thanks!Jonteemil (talk) 21:46, 19 May 2022 (UTC)

Broken Queries

Hi, I'm trying to re-run a bunch of queries on Quarry that I created a while back for looking for candidates to invite to New Page Patrol. For example: https://quarry.wmcloud.org/query/42657 among others. However, when I try to re-run them, it says "Bad database name". Could you indicate what needs to be changed to get them working again? — Insertcleverphrasehere(or here)(or here)(or here) 21:26, 26 May 2022 (UTC)

@Insertcleverphrasehere: You need to enter enwiki_p in the "Enter the DB name here..." field between the description and the actual query * Pppery * it has begun... 21:29, 26 May 2022 (UTC)
@Pppery Thanks very much... that did it. — Insertcleverphrasehere(or here)(or here)(or here) 22:06, 26 May 2022 (UTC)

Unreviewed species articles

Part 1: Need a query that gets a list of PageTriage/NPP unreviewed articles. This is tricky because I don't think there's a field anywhere that stores if an article is unreviewed. Rather, you have to JOIN logging or JOIN pagetriage_log_table and look for the absence of a record. Might also have to limit it by date, since PageTriage logs appear to be self-deleting after 12 months, so the absence of a log after 12 months = reviewed. More info.

Part 2: Need a way to figure out if the unreviewed articles are species articles or not. One of the best ways is to search the article text for "is a species of", but we can't search article text in SQL. Another way might be to check for any biological sciences stub template. Another way might be to check for any article title that is in the format /[A-Z][a-z]+ [a-z]+/ (two words, first character is a capital letter, all other characters are lowercase) and I can do some manual filtering after that. –Novem Linguae (talk) 21:44, 26 May 2022 (UTC)

pagetriage_page has entries going back to 2012, and has a handy column for page creation time too. "Some" manual filtering is extremely optimistic; there's 464730 non-redirect pages in mainspace matching /^[A-Z][a-z]+_[a-z]+$/. —Cryptic 22:10, 26 May 2022 (UTC)
Thanks for the info. Good to know about pagetriage_page. The # of articles would be manageable since I'd be running it on the part 1 list of pages. I spot checked Special:NewPagesFeed just now and the first 10 articles do not match the pattern. –Novem Linguae (talk) 22:15, 26 May 2022 (UTC)
Hrm, while pagetriage_page's earliest entries are from a long time ago, it only has 171471 rows, so it clearly doesn't have all mainspace pages, or even all patrollable mainspace pages. I can't find evidence of the page_ids from the early entries ever having existed, either, whether in page, logging, or archive, and there's enough - averaging about 2000 a year from 2012 to 2020 - that I'm dubious that they've all been suppressed. Unless entire-page suppression is a lot more frequent than per-revision suppression.
There are 14758 entries in the table where ptrp_reviewed_updated is from 2021, and 138107 from 2022. So if we assume that ptrp_page_id really is a page_id, the table contains only pages still in the review window, and it contains all such pages - all very big assumptions, since the table's completely undocumented - then quarry:query/64831 has pages that have neither been reviewed nor patrolled nor autopatrolled. I haven't done any sanity-checking on its output. —Cryptic 22:38, 26 May 2022 (UTC)
That query produces suspiciously few rows; most of them are not taxa; most of the taxa are redirects. (I expect an incomplete table rather than any problem with the query). A rare example of a real result is Parapristella aubynei. Are the desired pages likely to be in Category:Articles with 'species' microformats? Certes (talk) 23:01, 26 May 2022 (UTC)
I agree (and said before running it) that the simple regex on page title isn't a very good selector for "species article". I do have reason to think that the table is complete, though: Special:NewPagesFeed says there's 14161 unreviewed articles, which almost exactly matches SELECT COUNT(*) FROM pagetriage_page JOIN page ON page_id=ptrp_page_id WHERE ptrp_reviewed=0 AND page_namespace=0 AND page_is_redirect=0; at 14159. —Cryptic 00:17, 27 May 2022 (UTC)
I've gone to the other extreme in quarry:query/64839 which produces only 38 results but all should be taxa or at least individual specimens. However, most of them are genera rather than species. Perhaps there simply aren't many to review. I've excluded redirects (we'd have to check the target's categories) and assumed the new articles have species microformats, which sounds like a stretch but seems to be true. Certes (talk) 00:48, 27 May 2022 (UTC)
Thank you both. Both those queries are helpful, and I've already reviewed like 10 species articles and redirects thanks to this data. Yes, there probably aren't many unreviewed species articles (excluding redirects) because they are easy to review and so are usually reviewed quickly. But a couple slip through the cracks.
I don't know much about microformats. The MediaWiki page doesn't give a great definition. What is a species microformat, and is a template required to mark a page as a species microformat? Some of these unreviewed articles may not have a {{Speciesbox}} or {{Taxonbar}} yet. –Novem Linguae (talk) 01:19, 27 May 2022 (UTC)
@Novem Linguae: Good question. I only know what I read in the microformats project. Their species microformat page says the category appears via {{Taxobox}} and other templates so, if no one added those, the page is probably not categorised. Certes (talk) 02:04, 27 May 2022 (UTC)

Level-5 Vital Articles lacking short descriptions

As suggested by Sdkb here, looking for a query that lists all articles in Category:All Wikipedia level-5 vital articles but not in Category:Articles with short description. Thank you! (I'm not watching this page – please use {{reply to|PerfectSoundWhatever}} on reply)PerfectSoundWhatever (t; c) 15:01, 1 June 2022 (UTC)

Normally, you could do this with Petscan, but the level-5 category is assigned to the article's talk page. If you don't get an answer to this query request, you can go through the category page manually and show all of the short descriptions using this script. – Jonesey95 (talk) 16:26, 1 June 2022 (UTC)
quarry:query/65002. The page not in mainspace is Category:Port Arthur massacre; Category talk:Port Arthur massacre's categorization into the v5 cat needs fixing separately. —Cryptic 20:46, 1 June 2022 (UTC)

List of pages by last update

Is it possible to obtain a list of recently updated talk pages efficiently, e.g. through SQL via page.page_latest? It could be a list of pages updated since a certain date, or a complete list of all pages with the date of last update, or possibly a list of all pages sorted by last update ranking. Per discussion Wikipedia:Bot_requests#Add_Reflist-talk_to_talk_pages_with_references. GreenC 16:21, 3 June 2022 (UTC)

@GreenC: quarry:query/65053 lists talk pages updated since the start of this month. Fork and change the timestamp to select a different start date. It currently shows all talk namespaces but could be limited further, e.g. page_namespace in (1, 11) shows just (article) Talk: and Template_talk:. I've excluded redirects: including them would speed the query up but produce lots of false positives. Certes (talk) 23:52, 3 June 2022 (UTC)
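For reference, the shape of that query is roughly as follows (quarry:query/65053 is the definitive version; the cut-off timestamp is illustrative):
-- Talk and Template_talk pages, excluding redirects, whose latest revision
-- is newer than the cut-off.
SELECT page_namespace, page_title, rev_timestamp
FROM page
JOIN revision ON rev_id = page_latest
WHERE page_namespace IN (1, 11)
  AND page_is_redirect = 0
  AND rev_timestamp >= '20220601000000';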
This is great! Follow up at Wikipedia:Bot_requests#Add_Reflist-talk_to_talk_pages_with_references. -- GreenC 03:02, 5 June 2022 (UTC)

List of pages where page language is not the same as the last part of title

An edit request was made at m:Module talk:Template translation where Verdy p said that the language specified in the metadata of a page may not be the same as the last part of its title. For example, m:User:Tangotango/Whatredirectshere/sv is expected to be in Swedish, but MediaWiki doesn't think so. I have tried writing a query myself using this manual but page_lang always returns NULL. NguoiDungKhongDinhDanh 03:42, 15 June 2022 (UTC)

What precisely are you looking for? Some possible results:
  1. Pages without a slash, with page_lang set, like m:Die falsche Version (de); there's 64 of them
  2. Pages with a slash, with page_lang set, where the title after the slash doesn't match page_lang, like m:IRC/wikipedia-fr/extraits choisis (fr); there's 376
  3. Pages with a slash, with page_lang set, where the title after the slash doesn't match page_lang, and the part after the slash matches the page_lang of at least one other page; there aren't any
  4. Pages with a slash, without page_lang set, like m:10 Wikiweeks/Participants; there's about 2.7 million
  5. Pages with a slash, without page_lang set, and the part after the slash matches the page_lang of at least one other page, like m:Country codes/en; there's about 1.7 million
Anything like "the part after the slash is a valid page_lang" isn't going to be possible unless there's a list of language codes already in the same database that I'm not seeing. Well, I suppose I could enumerate them all in the query, manually, but I'm not going to. —Cryptic 06:42, 15 June 2022 (UTC)
@Cryptic: If an unset page_lang means that page's language is similar to that of the project, then I'm looking for pages with at least one slash, without page_lang set, where the part after the last slash isn't en but matches the page_lang of at least one other page. Honestly, I don't expect 1000, let alone 1.7 million. NguoiDungKhongDinhDanh 07:42, 15 June 2022 (UTC)
quarry:query/65375. There's some 216000 pages even after I omit namespace Translations: (which has another 1.27 million), so it'll load slow.
If you want the results from the Translations namespace, we'll have to arrange some other way to get them to you; quarry can't handle that many results. —Cryptic 09:01, 15 June 2022 (UTC)
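A rough sketch of the condition being queried, against metawiki's page table (quarry:query/65375 is the real query, which also omits the Translations: namespace):
-- Pages with a slash and no page_lang, whose final path component is not 'en'
-- but matches the page_lang of at least one other page.
SELECT p.page_namespace, p.page_title
FROM page p
WHERE p.page_lang IS NULL
  AND p.page_title LIKE '%/%'
  AND SUBSTRING_INDEX(p.page_title, '/', -1) != 'en'
  AND EXISTS (
    SELECT 1 FROM page p2
    WHERE p2.page_lang = SUBSTRING_INDEX(p.page_title, '/', -1)
  );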
@Cryptic: page_lang seems to be inconsistent with {{PAGELANGUAGE}}. For example, m:Main Page/es is clearly a Spanish translation (expand). This request is about to be out of scope for this page so perhaps I'll need to ask at WP:VPT. Thanks a lot anyway. NguoiDungKhongDinhDanh 13:20, 15 June 2022 (UTC)

Pages that start with "List of" and end with "Canada" with "box" in the middle

I need these pages, but I don't know how to write a SQL query. Can anyone help me?--Q28 (talk) 14:21, 18 June 2022 (UTC)

@Q28: A Cirrus query finds them (with potential false positives having other text after Canada, but I don't see any). Certes (talk) 15:19, 18 June 2022 (UTC)
@Q28: Quarry:query/65478. Sam Walton (talk) 15:24, 18 June 2022 (UTC)
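For the record, a title-pattern version in SQL would look something like this (a sketch; page titles store spaces as underscores):
-- Mainspace, non-redirect titles starting "List of", containing "box",
-- and ending "Canada". Backslashes make the underscores literal in LIKE.
SELECT page_title
FROM page
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND page_title LIKE 'List\_of\_%box%Canada';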

Request for table showing the length of short descriptions in article space

I don't know if this is a query exactly, but I thought I would start here. What I am hoping to have generated is a table, a histogram of sorts, showing the number of articles that have short descriptions, put into bins by the length of the short description. Something like this:

Histogram of short description lengths in article space
Length (in characters) Number of articles
N/A (no SD) 2,000,000
1 10
2 100
3 265
... ...
99 2,204
100+ 245

The purpose of this data request is to inform an ongoing conversation about the appropriate length of short descriptions.

Redirect pages should be excluded from the counts. The maximum bucket value should be "100+", since there should be only a hundred or so longer than that. One tricky bit is that some short descriptions are assigned by infobox templates, but I think that should be OK, since the descriptions are still listed under "Local description" on the Page information page for the article. According to "state of the project" data, there should be about 4.5 million articles with short descriptions and 1.6 million without them. The histogram does not need to be up-to-the-minute, so if it based on a database dump, that is fine. Please let me know if you have any questions. Thanks. – Jonesey95 (talk) 03:19, 24 June 2022 (UTC)

quarry:query/65373. Of the 579 in the 100+ slot (which gets orphaned all by itself on the second page, alas), 398 are exactly 100, 33 are 101, 14 at 102, and no other counts have more than 8. 19 total at 150 or higher; the longest three are 219, 235, and 250. Disambigs aren't removed, unlike your State of the Project link, which is why both the no-shortdesc count and the total of the ones with shortdescs (4679722) are higher than your estimates. —Cryptic 03:52, 24 June 2022 (UTC)
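The binning itself can be sketched roughly as below, assuming (as the rest of this thread suggests) that local short descriptions live in page_props under the 'wikibase-shortdesc' property; quarry:query/65373 is the actual query.
-- Histogram of short description lengths (in bytes), with everything at
-- 100 bytes or more folded into a single "100+" bucket.
SELECT LEAST(LENGTH(pp_value), 100) AS length_bytes, COUNT(*) AS articles
FROM page_props
JOIN page ON page_id = pp_page
WHERE pp_propname = 'wikibase-shortdesc'
  AND page_namespace = 0
  AND page_is_redirect = 0
GROUP BY length_bytes
ORDER BY length_bytes;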
I'm presuming that the stats for four characters are being skewed by the 'none' command to generate a deliberate blank? - X201 (talk) 13:22, 24 June 2022 (UTC)
No, they show up in N/A. There aren't any pages in with their shortdesc set to 'none' in the database, and the example at WP:SDNONE - Alpine skiing at the 1960 Winter Olympics – Men's downhill - doesn't have a wikibase-shortdesc property at all. It's not immediately obvious how it differs in the database from a page without a local shortdesc set; I'll investigate. —Cryptic 14:00, 24 June 2022 (UTC)
And, hrm, it shows its central description from Wikidata in Page information, too, instead of a blank, 'none', or no local/central description field at all. There's clearly a special case to make it so {{short description|none}} doesn't add a property, but if it's meant to disable the Wikidata shortdesc, it's either not working or not enabled.
If this is expected behavior, I can generate a count of (non-redirect, mainspace) pages that transclude {{short description}} but don't have a property set, I suppose, and find them that way. —Cryptic 14:09, 24 June 2022 (UTC)
Scratch that, too. (That's right, I don't work with short descriptions. How could you tell?) Central description shows up whether there's a local shortdesc or not, and the way to find ones set to "none" is by looking for pages in Category:Articles with short description and no wikibase-shortdesc property set. There's currently 152676 such non-redirect pages in mainspace, and yes, they're lumped in with "n/a" in the first query. Need this query so it can be rerun, or is just the number enough? —Cryptic 14:29, 24 June 2022 (UTC)
These results are exactly what I needed. I'm sure there are quirks in the data (there are strange anomalies at 35 and 71 characters that I can't explain), but they will inform our discussion well. Thanks! – Jonesey95 (talk) 15:50, 24 June 2022 (UTC)
35 is because of "Topics referred to by the same term"; 339550 pages have that exact shortdesc. The two most common 71-length shortdescs are "Village in Federation of Bosnia and Herzegovina, Bosnia and Herzegovina" (2844 pages) and "Village and Jamoat in Districts of Republican Subordination, Tajikistan" (57 pages); the next highest repeat shortdesc has 8 pages. —Cryptic 16:21, 24 June 2022 (UTC)
And the lengths, btw, are in bytes, not characters. That shouldn't make much difference in the aggregate results, but, for example, "District in León Cortés Castro canton, San José province, Costa Rica" showed up in the results for 71 bytes despite being 68 characters long because é is represented by two bytes. —Cryptic 16:29, 24 June 2022 (UTC)
Thank you, very helpful. I'm not too worried about bytes and characters, unless an SD is all Japanese characters or something, in which case we have other problems. The results of the query will meet our needs for thinking about approximate numbers and percentages, and I have advised editors to take them with a grain of salt. – Jonesey95 (talk) 16:43, 24 June 2022 (UTC)

Revert counts for AfDs closed as redirect

For each AfD closed as redirect, how many times has the page been restored to an article and reverted to a redirect since then?

Example: WP:Articles for deletion/Children of This Earth, Special:PageHistory/Children of This Earth: 1 restore, 1 revert

Notes:

  • The current state (redirect or article) would be a useful output column.
  • Some editors restore articles temporarily to add content and self-revert, such as User:BOZ with Special:PageHistory/Yeomanry (Greyhawk) (the first example I found, was kept at AfD). These would be nice to flag or exclude. Each restore or revert in a separate row including username could be manipulated later in a spreadsheet.
  • Delete and redirect would be nice, especially if it can be implemented easily by forking the query and changing a string parameter.

I think I may be able to figure this out if pointed in the right direction. Thanks! Flatscan (talk) 04:21, 27 June 2022 (UTC)

This is hard, since page text isn't in the toolforge database mirrors. The best that's even possible would be to search for edit summaries similar to "Wikipedia:Articles for deletion/Whatever closed as redirect" - which is itself very, very slow, roughly four or five minutes per month of edits scanned even when you're looking for an exact match - and on top of that, look for later revisions in the same pages that are enough larger that you can infer that it wasn't a redirect (since, except for the page text which we can't see, you can only definitively tell if the current revision of a page is a redirect or not). You can query for tags, and there's some that could help - "New redirect" and "Reverted" and "Undo" and "Rollback" are probably the best - but they aren't all added consistently, and there isn't one for "new non-redirect created from redirect".
About the closest thing that's actually feasible as a query is to create a list of afds created by a certain person or in a certain range of dates or some other representative sample, list which of them name pages that are currently redirects and so probably were closed as redirect, and ask a human to inspect their histories one by one. Even that's not entirely straightforward, because of "(5th nomination)" and such being stuck onto the end of the AFD pagename, and it'll of course miss multi-article afds.
Some bot frameworks, on the other hand, might be able to help. I think it's AnomieBot that tags newly-created pages that have existing afds? —Cryptic 05:30, 27 June 2022 (UTC)
Thanks so much for your insights! I suspected that it may require a bot. I think that's correct – I saw it recreate a talk page on my watchlist recently. I will file this project under "if/when I have time". Flatscan (talk) 04:28, 28 June 2022 (UTC)

Request for list of pages with 5+ transclusions and Linter errors

I am looking for pages outside of Template namespace that have multiple transclusions, maybe 5 or more, and also have Linter errors. Linter errors, when they are present on a page, are listed on the "Page information" page under "Lint errors".

If this list of pages is too long, you can increase the number of transclusions to 10 or some other number that gets the list to a reasonable size. I am expecting that the list will contain a few thousand pages. – Jonesey95 (talk) 22:39, 30 June 2022 (UTC)

quarry:query/65736. Yes, Wikipedia:WikiProject Food and drink/Open tasks/tasks really is transcluded on 47149 pages. —Cryptic 23:19, 30 June 2022 (UTC)
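A rough sketch of the shape of that query (the real one is quarry:query/65736), assuming the linter table is available on the replicas and using the older tl_namespace/tl_title columns of templatelinks (newer schemas resolve targets through linktarget instead):
-- Pages outside Template: with at least 5 transclusions and at least one lint error.
SELECT p.page_namespace, p.page_title, COUNT(*) AS transclusions
FROM page p
JOIN templatelinks ON tl_namespace = p.page_namespace AND tl_title = p.page_title
WHERE p.page_namespace != 10
  AND EXISTS (SELECT 1 FROM linter WHERE linter_page = p.page_id)
GROUP BY p.page_id
HAVING COUNT(*) >= 5;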
That's the one. Thanks! There are a bunch of pages on there where the Linter error is not present when the page is transcluded, but there is no way to know until you dig through the report. I just fixed a page with five errors and hundreds of transclusions. The report also turns up strange errors: I've already found one weird signature that is present on 800 pages. – Jonesey95 (talk) 00:03, 1 July 2022 (UTC)

Potential R from short names

Pages like Red Banner of Labor, where the redirect target (case insensitive) contains the title, and is not in Redirects from short names. ― Qwerfjkltalk 22:24, 1 July 2022 (UTC)

That's not really a good example - Red Banner of Labor's redirect target doesn't include the redirect title, so this might be less useful than you think. I'll take a look. —Cryptic 22:31, 1 July 2022 (UTC)
You're going to get a lot of false positives if you use just that criterion; there's 426470 total matching redirects, even after I limit it to redirects from mainspace to mainspace, and omit cases where the redirect is identical to the target except for case. quarry:query/65758 has the results for redirects starting with "Red". I see redirects from plurals, and from postal abbreviations, and from partial place names at a glance. If you put together a list of other categories to exclude, or some other criteria (perhaps omit cases where the target is the redirect name with "s" appended), I can see if that helps. —Cryptic 22:51, 1 July 2022 (UTC)

Edit summaries by Qwerfjkl (bot) matching (\d), with revision id and page

Per User talk:Qwerfjkl#Category:Austro-Hungarian films, I'd like to have all the edits by my bot that contain removed: [[Category:...]] ([2-7]) or if not possible, just the individual queries (or a template query, I suppose). From the edits I need the revision ID and the page name (I'm planning on mass-reverting all of the edits). ― Qwerfjkltalk 16:20, 5 July 2022 (UTC)

quarry:query/65828. The newlines embedded in the comments are particularly bad form. —Cryptic 16:35, 5 July 2022 (UTC)

Color boxes

I would like to know all pages that have at least 80 color box template transclusions (not case sensitive). Not sure if this is relevant, but note that color boxes have multiple alternative template names, "{{Color box}}", "{{Colorbox}}", "{{Colour box}}", "{{Colourbox}}", "{{Color square}}", and "{{Legend3}}". TheGEICOgecko (talk) 05:28, 4 July 2022 (UTC)

Queries can only see whether there are any transclusions at all to a given page, not how many there are, the same way Special:Whatlinkshere only shows one entry per template. —Cryptic 06:16, 4 July 2022 (UTC)
@TheGEICOgecko:, those many names definitely complicate things, but you could get some rough results by playing with the following regex search
insource:"color box" insource:/(\{[Cc]olor ?box.+?){38}/

The first "insource" is needed because with the second one alone regex search times out. That one searches for any 38 repetitions of Colorbox/colorbox/Color box/... followed by any text until another Colorbox/... Just something to start with, don't expect all possible results. Ponor (talk) 08:56, 6 July 2022 (UTC)

Broken Query

I'm trying to re-run a query from a few years ago and running into some issues.

  • Old Query: https://quarry.wmcloud.org/query/32574
  • New Query: https://quarry.wmcloud.org/query/65845Insertcleverphrasehere(or here)(or here)(or here) 02:18, 7 July 2022 (UTC)
    Here you go. —Cryptic 03:16, 7 July 2022 (UTC)
    @Cryptic Are you sure this works? Just to use Rosguill as an example, who works in the same patrolling areas I do (redirects): he has definitely done more than I have, but is absent from the list. Cf. Wikipedia:Database reports/Top new article reviewers. Happy Editing--IAmChaos 03:24, 7 July 2022 (UTC)
    I have no idea whether it works. I just updated Insertcleverphrasehere's query to match the changed schema. —Cryptic 03:33, 7 July 2022 (UTC)
    Rosguill isn't in the results because the query examines only users that currently have the "patroller" right set, and they're only autoreviewer and sysop. It doesn't care that patroller doesn't grant any permissions that sysop doesn't. —Cryptic 03:39, 7 July 2022 (UTC)
    Can probably change all those AND ug_group = 'patroller' to AND (ug_group = 'patroller' OR ug_group = 'sysop') to fix this. –Novem Linguae (talk) 04:04, 7 July 2022 (UTC)
    Actually this query was intended to exclude admins. So is working as intended I think. — Insertcleverphrasehere(or here)(or here)(or here) 08:45, 7 July 2022 (UTC)
    Well, if you don't want admins, that's no problem. Edit: I just saw the discussion over at the new talk page referencing this, makes sense. Happy Editing--IAmChaos 03:01, 8 July 2022 (UTC) (edited 03:06, 8 July 2022 (UTC))

Pages in the draft namespace that may qualify for CSD G7

My filter rule is that the draft has only been edited by one person and that the current size of the page is zero bytes (that is, the page is currently empty). Q𝟤𝟪 07:44, 14 July 2022 (UTC)

Here you go.
Novem Linguae (talk) 09:00, 14 July 2022 (UTC)
quarry:query/65980 has 0-byte drafts with only one editor, not just one revision. Your first query shows ones that were created blank, which are only speedies in articlespace. —Cryptic 09:20, 14 July 2022 (UTC)
And quarry:query/65981 is the same, but ignores bot edits. —Cryptic 09:36, 14 July 2022 (UTC)
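A minimal sketch of that filter, for anyone reading along (quarry:query/65980 and 65981 are the worked versions):
-- Draft-namespace pages that are currently empty and have only ever had one editor.
SELECT page_title
FROM page
JOIN revision ON rev_page = page_id
WHERE page_namespace = 118   -- Draft:
  AND page_len = 0
GROUP BY page_id
HAVING COUNT(DISTINCT rev_actor) = 1;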

Why isn't this query working?

This is a query that tries to grab the most recent 1000 entries in the "Potential copyright violations log". You can view this log at Special:Log to see the kinds of entries the query is supposed to get. According to the PHP source code, 'pagetriage-copyvio' is the log type, but as you can see I am getting 0 results in my query. What'd I screw up? –Novem Linguae (talk) 23:48, 18 July 2022 (UTC)

You didn't. If you try SHOW CREATE TABLE logging; you'll see that rows with log_type = 'pagetriage-copyvio' are filtered from the public replicas. Since they're visible without any needed user rights at [2], and the log type is added only by an extension, it's reasonably likely they were accidentally left out of the whitelist rather than being deliberately omitted from it. You can try opening a phab task.
Why are you looking at the pagetriage-copyvio log, anyway? It's singularly useless without the purported source of the infringement. —Cryptic 01:04, 19 July 2022 (UTC)
I'm writing documentation for PageTriage. mw:Extension:PageTriage#Special:Log. I'm also working on a patch that will stop logging to and hide pagetriage-deletion, which is just an unnecessary duplicate of entries in pagetriage-curation log_action=deletion. Thanks for your help above. –Novem Linguae (talk) 01:34, 19 July 2022 (UTC)

Scan 5000 most recent comments for some text

I'm trying to get the most recent 5000 comments and scan them for some text using WHERE LIKE. Here's my attempt. However I imagine my LIMIT 5000 is applying to the first 5000 matches, not the first 5000 rows of the comment table, making my query slow. How do I get the first 5000 rows of the comment table, then scan for my text within those 5000 results? Subquery? –Novem Linguae (talk) 06:28, 23 July 2022 (UTC)

Yes, but since duplicate edit summaries reuse the same row in the comment table, you also have to join revision (or recentchanges, or logging) and use that to order, like quarry:query/66178. There's no results even though there are recent edit summaries like that because the 5000 most recent edits only cover about 37 minutes - what you're really after are the n most recent such edits along with their timestamps, as in quarry:query/66179. —Cryptic 07:18, 23 July 2022 (UTC)
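In outline, the second approach looks something like this (a sketch only; the search text and cut-off are placeholders, and quarry:query/66179 is the worked version):
-- The n most recent edits whose summary contains some text, with timestamps.
SELECT rev_timestamp, comment_text
FROM revision
JOIN comment ON comment_id = rev_comment_id
WHERE rev_timestamp >= '20220701000000'   -- assumed cut-off to bound the scan
  AND comment_text LIKE '%some text%'     -- hypothetical search text
ORDER BY rev_timestamp DESC
LIMIT 50;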

Enwiki self-made works that are FPs but not Commons QIs

I want to submit these files to QI so that I can review these pictures. Q𝟤𝟪 06:38, 21 July 2022 (UTC)

Hi. What's FP? What's QI? –Novem Linguae (talk) 08:09, 21 July 2022 (UTC)
It's WP:FP and c:Commons:Quality images. petscan:22495882. My mindreading isn't good enough to know if he wants images in c:Category:Featured pictures on Wikimedia Commons excluded too, so e.g. File:Glasshouse and fountain at lalbagh.jpg is on this list. —Cryptic 08:14, 21 July 2022 (UTC)
@Cryptic No need to rule it out. But can you provide code for Quarry? I'm not used to this. Q𝟤𝟪 03:58, 23 July 2022 (UTC)
The petscan results should be way easier to work with, but fine. quarry:query/66180. —Cryptic 07:28, 23 July 2022 (UTC)

Wiktionary query

Hello!

I was hoping that I could be assisted with a Wiktionary query. I would like a query with all articles that meet two conditions: each should be a member of deepcat:"Non-lemma forms by language", and Special:WhatLinksHere/PAGE should return 0 pages, so it should hence be orphaned. In other words, I want a list of all non-lemma words (regardless of language) that have 0 links to them. Grateful for answers. Jonteemil (talk) 23:12, 2 August 2022 (UTC)

quarry:query/66388. I'm surprised it worked - I'd thought Quarry couldn't handle more than 65536 result rows, and a tree depth of 1 (the category and its direct subcategories) was almost enough for that with some 64000 matching pages.
You sure you don't want pages that have no incoming links from mainspace, not pages with no incoming links at all? —Cryptic 00:16, 3 August 2022 (UTC)
You're right, from mainspace is what I actually want, if that will find fewer results. Thanks. Jonteemil (talk) 13:40, 3 August 2022 (UTC)
quarry:query/66402. Despite having fewer raw results than the first query, they represent more distinct pages - 81592 compared to 64449 - because the first had duplicate rows. —Cryptic 18:40, 3 August 2022 (UTC)
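The "no incoming links from mainspace" test can be sketched like this against the enwiktionary replica (the full query, including the category-tree walk, is quarry:query/66402):
-- Mainspace pages with no incoming links from other mainspace pages.
SELECT p.page_title
FROM page p
WHERE p.page_namespace = 0
  AND NOT EXISTS (
    SELECT 1 FROM pagelinks
    WHERE pl_namespace = 0
      AND pl_title = p.page_title
      AND pl_from_namespace = 0
  );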

All files in category X and category Y, including exif metadata

Hello. Could someone please construct a query for all enwiki files in both category Self-published_work and category All_free_media, and have the two result columns be page.page_title and image.img_metadata? I tried in this query but I think I messed something up, because the resultset is bigger than I was expecting, and also because it is slow. Thanks. –Novem Linguae (talk) 10:27, 6 August 2022 (UTC)

It looks right to me. There are a lot of files to check, and people who self-publish on Wikipedia are likely to use a free licence. You could add AND cl_type='file' but it may make no difference: other pages shouldn't be in those categories. Of course, you'll need to join to page and image for the name and metadata. Certes (talk) 12:10, 6 August 2022 (UTC)
Improved query.Novem Linguae (talk) 12:48, 6 August 2022 (UTC)
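For reference, the shape of the query being discussed, as a hedged sketch (category names taken from the request above):
-- Files in both categories, returning the title and EXIF metadata blob.
SELECT page_title, img_metadata
FROM page
JOIN categorylinks c1 ON c1.cl_from = page_id AND c1.cl_to = 'Self-published_work'
JOIN categorylinks c2 ON c2.cl_from = page_id AND c2.cl_to = 'All_free_media'
JOIN image ON img_name = page_title
WHERE page_namespace = 6;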

Surprising number of username = NULL in image table query

Hello. Does anyone know why, in my query of some files in the file namespace, most of the results have user_name = NULL? This suggests that almost every file in the 100,000 results has an unknown or logged out uploader. Also, deleting the "AND user_name NOT IN" part gets me the exact same number of results as having it in, and I was expecting it to eliminate some results. user_name is supposed to be the user name of the file uploader. Perhaps I messed something up. –Novem Linguae (talk) 20:58, 9 August 2022 (UTC)

LEFT JOIN user ON user_id = actor_name, mw:Manual:Actor table. You want = actor_user. —Cryptic 21:19, 9 August 2022 (UTC)
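In other words, the corrected join looks roughly like this (a sketch): actor_user holds the user ID, while actor_name holds the name (or IP) as text.
-- Resolve each file's uploader; user_name stays NULL only for IP uploaders.
SELECT img_name, actor_name, user_name
FROM image
JOIN actor ON actor_id = img_actor
LEFT JOIN user ON user_id = actor_user
LIMIT 100;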

Exclude blocked users

Hello. Can I get help getting this query to exclude blocked users please? I tried adding the below and a couple variations (HAVING, wrapping the whole thing and making it a subquery) but couldn't quite crack it. Thank you.

JOIN user ON actor_user = user_id
LEFT JOIN ipblocks ON ipb_user = user_id
WHERE ipb_sitewide != 1

Novem Linguae (talk) 10:34, 12 August 2022 (UTC)

The query as written should work correctly, if correctly means "isn't blocked at all". If you want "isn't blocked at all, or is blocked only per-page/namespace", then you have to write out WHERE ipb_sitewide IS NULL OR ipb_sitewide != 1. "NULL != 1" isn't true. —Cryptic 11:05, 12 August 2022 (UTC)
I've had a go at what I think you mean in quarry:query/66612. That version does a left join to sitewide blocks only, so ipb_id is NULL for a unblocked or partially blocked user. Certes (talk) 11:11, 12 August 2022 (UTC)
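The pattern Certes describes, as a minimal sketch (quarry:query/66612 is the actual query):
-- Left-join only sitewide blocks, then keep rows where no such block exists.
SELECT user_name
FROM user
LEFT JOIN ipblocks ON ipb_user = user_id AND ipb_sitewide = 1
WHERE ipb_id IS NULL
LIMIT 100;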

Short description eligible articles

Hi! I'd love a query written for the following conditions:

  • in mainspace
  • not a redirect
  • not in category "Articles with short description"
  • not in category "All disambiguation pages"
  • talk page not in category "WikiProject Lists articles"

Thanks! 🐶 EpicPupper (he/him | talk) 21:01, 20 August 2022 (UTC)

@EpicPupper: I've had a go in quarry:query/66793: 1,615,883 results. Any solution will have to check every article's categories and will take a while. Do you want to limit the output somehow, e.g. by page size? I've checked that all GAs and FAs have SDs; picking out the next level such as A-class or high-importance is not easy. The best way forward may still be to apply SDs via templates such as infoboxes, to reduce the count to a size where manual additions become feasible. Certes (talk) 23:33, 20 August 2022 (UTC)
Petscan can be useful for queries like this; they can help you find articles that resemble one another, which makes it easier to assign short descriptions. Here's one that looks for articles using {{Infobox person}} and that do not have a short description. – Jonesey95 (talk) 11:45, 21 August 2022 (UTC)
Yes, PetScan is particularly good when you need to check articles against talk pages. It's also great at including and excluding categories. However, it's not so good at excluding categories without any positive category inclusion (e.g. it can do "in Cat:A but not Cat:B", but not "not in Cat:B"). Certes (talk) 17:14, 21 August 2022 (UTC)
I find that it works quite well as long as something else is included, like the presence of a template. – Jonesey95 (talk) 14:41, 22 August 2022 (UTC)
@Certes, thanks. Could you add to the query to filter the results by, say, all articles with {{Infobox person}}? I'd rather not use PetScan for this particular project as I'm building an application to generate short descriptions via AI, and I think it'd be best to not rely on an external tool. Cheers! 🐶 EpicPupper (he/him | talk) 18:11, 21 August 2022 (UTC)
@EpicPupper: Done: "only" 108,775 articles now, and a lot quicker as it only needs to scan people. Quarry has a History button top right to access the full version. Person is an awkward case, as we can't easily extract a good SD. Perhaps an AWB/JWB run could extract description from a typical lead "John Smith (1888–1999) was a[n] description" to get something like "English chemist and unicyclist", but 100,000 is still an awful lot of articles. Certes (talk) 18:35, 21 August 2022 (UTC)
I'm pretty sure that there was a bot approved to do additions like this, and that is why there are "only" 108K infobox person articles left without SDs. I'd have to dig for quite a while to find that BRFA though, assuming that my memory is correct. – Jonesey95 (talk) 14:41, 22 August 2022 (UTC)

For NPP

For a major project report on NPP we need the following stats, if possible:

Sample period: 12 recent months.
# X Description Query
1 Number of new articles. quarry:query/66788
2 Number of new articles created by users with < 90 days/500 edits. quarry:query/66853
3 Number of new users with < 90 days/500 edits. who created articles and have been blocked for any reason. quarry:query/66863
4 Number of new articles created by autopatrolled users. quarry:query/66854
5 Number of new articles draftified. quarry:query/66855
6 Number of new articles submitted to AfC
7 Number of drafts deleted including G13 quarry:query/66789
8 Number of new articles deleted (by any deletion process). quarry:query/66789
9 Number of NPPers who have not edited en.Wiki for 12 months. quarry:query/66874
10 Number of NPPers who have not done any reviews for 12 months. quarry:query/66875
Creations and deletions by month since 2018-06-27. (raw data, and what exactly is counted)
  Creations
  Deletions

Stats curves on one graph. Sample period: 5 recent years (last 60 months)

  1. ☑ Number of new articles created by year over the last 5 years
  2. ☑ Number of new articles created but deleted by year over the last 5 years.

I realise this might be a tall order, please let me know if any of these criteria are very awkward or impossible. Thanks, Kudpung กุดผึ้ง (talk) 11:22, 20 August 2022 (UTC)

  • How are you defining "new article"? "A page created in the main namespace", or in some combination of namespaces (probably main and draft), is easy; "a page created in (whichever namespaces) that isn't currently a redirect if it still exists" is also. Things like not counting pages created in draft unless they were later moved to main, or not counting redirects that have since been deleted, is hard. Similarly, the most natural queries for the two five-year period requests are "pages created in the main namespace" and "pages deleted while in the main namespace", which is going to result in confusing data due to moves both ways.
    Getting the edit count of the page creator as of the time of creation is hard. Age of the page creator's account at the time of creation is also awkward, but less so. The current age and edit count is easy, but of course less useful.
    #1, approximated as pages created in main namespace excluding non-deleted redirects: quarry:query/66788.
    #7 and #8 (approximating "article" as any page in the main namespace, and not looking for newness) are both at quarry:query/66789. —Cryptic 12:00, 20 August 2022 (UTC)

Hi Cryptic. By 'new article' I mean articles in mainspace, not including redirects or dabs. A draft is not a mainspace article, but an article created in, or moved to, the draft namespace. I'll look at the quarries in the morning. Thanks. (FYI: Novem Linguae). Kudpung กุดผึ้ง (talk) 17:22, 20 August 2022 (UTC)

  • I've updated the first query to exclude (still-existing) dabs. Like whether they were redirects, this information isn't available for deleted pages, nor for non-current revisions of existing ones. #2 for account age only (not edits) at quarry:query/66853, and #4 at quarry:query/66854. All of these also exclude pages created outside the main namespace but later moved there. —Cryptic 11:44, 24 August 2022 (UTC)
  • #5 as number of mainspace pages moved to draft by month at quarry:query/66855. I've made no effort to exclude redirects and dabs from this; that would require tracking down where the page currently is, which I think is non-trivial. (Or maybe log_page is accurate here; the move log is awful to work with, and I don't trust it.) Neither dabs nor redirs get draftified much in any case. I've also not limited it to new pages, since that would require tracking any previous moves between creation (in whatever namespace) and when it was moved specifically from main to draft, and again the move log is a horrid kludge of PHP data serialized into a string. —Cryptic 12:09, 24 August 2022 (UTC)
  • Your five-year queries will be easier if it's ok to start from 2018 June 27 around 23:00 UTC, which is when the creation log begins. The data's available for earlier, but would have to be dug out of the revisions and archive tables, which is awkward and very, very slow. —Cryptic 13:44, 24 August 2022 (UTC)
    Thanks, Cryptic. Yes, from 27 June 2018 would be fine. Kudpung กุดผึ้ง (talk) 20:06, 24 August 2022 (UTC)
    Graph above. —Cryptic 02:28, 25 August 2022 (UTC)
  • #3 - blocks of article-creating new users - at quarry:query/66863. —Cryptic 03:08, 25 August 2022 (UTC)
  • The #/NPP'ers without edits/reviews shouldn't be rough (and I think I've done something similar before for no reviews), but I'm beat. Maybe tomorrow.
    On the other hand, I can't think of a way offhand to track articles submitted to AFC. Maybe wayback machine caches of an appropriate category? Though even if there's a good cat to look at and enough archives of it, that'll miss any drafts that got submitted and then reviewed in between archives. —Cryptic 03:08, 25 August 2022 (UTC)
    Could check the comment table, looking for Submitting using [[Wikipedia:AFCSW|AfC-submit-wizard]]. Won't get them all, but should get an approximation. I checked 9 drafts at the top of the queue just now, 8 used that. Possible problems: 1) Will have duplicates (folks submitting multiple times). 2) Associated page may be deleted... not sure if this messes with the comment table? 3) Does the comment table try to recycle identical comments? If so that might mess with the count. –Novem Linguae (talk) 05:57, 25 August 2022 (UTC)
    archive.ar_comment_id is NULL in the public views on the replicas, yes. comment's rows are reused extensively - it's the whole point of comment.comment_hash - so if you're trying to count edits by comment, you always join from revision or logging or some other table; querying comment by itself is almost never useful. Duplicate submissions isn't an issue; it's easy enough to only count one submission per page or per title (so you have the choice of counting a draft at a given title that was submitted, rejected, resubmitted, rejected, deleted, recreated, and submitted again either one, two, or three times). —Cryptic 12:23, 25 August 2022 (UTC)
  • Last two straightforward queries done.
    I'm dubious about the actual utility of this. Excluding pages created outside of mainspace but later moved into it seems counterproductive; it's now very, very common for a new user to create a page in draft or userspace, get autoconfirmed by editing it, and move it into mainspace themself. None of these queries count such pages. —Cryptic 22:17, 25 August 2022 (UTC)

Disambiguation pages with no WikiProject banner

Hello! I'm back for another query for disambiguation pages with no WikiProject banner. This query should output articles that are:

  • in mainspace
  • not a redirect
  • have the category "All_disambiguation_pages"
  • talk page does NOT have the category "WikiProject_Disambiguation_pages"

Thanks! I tried tweaking the above query with this, but I'm horrible at SQL and it failed. 🐶 EpicPupper (he/him | talk) 00:30, 26 August 2022 (UTC)

Here you go. —Cryptic 01:31, 26 August 2022 (UTC)
Some hits such as Boarding don't have a talk page. We don't usually create a talk page just for the {{WikiProject Disambiguation}} banner, but we do usually add the banner (or overwrite inappropriate content with it) when the talk page already exists. Certes (talk) 09:58, 26 August 2022 (UTC)
Good catch. Not counting non-existent talk pages reduces the result set from 53,406 to 243. –Novem Linguae (talk) 10:08, 26 August 2022 (UTC)
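The talk-page check behind that smaller count can be sketched as follows (category names as given in the request; the working query is the one linked above):
-- Disambiguation pages whose talk page exists but lacks the WikiProject banner's category.
SELECT art.page_title
FROM page art
JOIN categorylinks dabcat ON dabcat.cl_from = art.page_id AND dabcat.cl_to = 'All_disambiguation_pages'
JOIN page talk ON talk.page_namespace = 1 AND talk.page_title = art.page_title
WHERE art.page_namespace = 0
  AND art.page_is_redirect = 0
  AND NOT EXISTS (
    SELECT 1 FROM categorylinks cl2
    WHERE cl2.cl_from = talk.page_id
      AND cl2.cl_to = 'WikiProject_Disambiguation_pages'
  );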
Hi @Certes, I wasn't aware that this was the standard? Is there a policy page somewhere referencing it? Cheers! 🥒 EpicPickle (they/them | talk) 17:34, 26 August 2022 (UTC)
I don't think it's documented anywhere, but it's de facto practice and I'm confident that you'd find a consensus for it at WikiProject Disambiguation. Certes (talk) 19:48, 26 August 2022 (UTC)

IP editors - How much of the new content do they create?

Hi.

I am after a query to find out what % of new content IP editors create. Wakelamp d[@-@]b (talk) 13:00, 31 August 2022 (UTC)

That depends very heavily on what you define as "new content". For example, if you count only edits that added bytes, IPs added 12.5% of the bytes to mainspace and 10.9% to draft in roughly the last thirty days. —Cryptic 13:25, 31 August 2022 (UTC)
And if you also exclude edits with the "Reverted" tag, it drops to 8.75% in mainspace and 9.54% in draft. —Cryptic 01:22, 1 September 2022 (UTC)
Thank you @Cryptic, I explained myself badly. By content I meant characters. My interest is because of [[3]] Wakelamp d[@-@]b (talk) 04:24, 1 September 2022 (UTC)