Wikipedia:Village pump (technical)/Archive 215

How can I convert Google Books urls to neutral format?

Objective: For a citation, I want to include the URL for the Google Books page. (It's in Preview mode.)

Problem: I live outside the U.S., so the URL I get here is country-specific. Furthermore, the URL I get includes irrelevant data, like the search term I used to find the page in the first place.

Question: How do I convert the URL to the country-neutral, approved short form?

And a suggestion: it would be helpful if this know-how were included in WP:Help. Ttocserp 10:32, 24 August 2024 (UTC)

Ttocserp, can you link using the ISBN (e.g. https://books.google.com/books?vid=ISBN0863163165)? — Qwerfjkltalk 10:41, 24 August 2024 (UTC)
Thank you, although that will give me the book but not the page. Also, it won't work for older books, since ISBNs didn't exist then. Ttocserp 11:44, 24 August 2024 (UTC)
Yeah, someone should file a WP:BOTREQ to remove unnecessary parameters and use the correct language version of Google Books. See https://en.wikipedia.org/wiki/Special:LinkSearch?target=books.google.de for example. If possible, it would be even better to convert them all to {{cite book}} or {{Google Books URL}}. We already have a bot removing tracking parameters from URLs (although I can't remember its name), so this might be a good feature request for them. Polygnotus (talk) 11:47, 24 August 2024 (UTC)
User:Citation bot normalizes Google Book URLs. Polygnotus (talk) 12:34, 24 August 2024 (UTC)
Thank you; problem solved. Ttocserp 12:46, 24 August 2024 (UTC)
Well, not yet. @Ttocserp: nothing on this planet is easy. There are currently 198 articles with one or more links to the German version of Google Books. And there are many languages other than German and English. Working on it. Polygnotus (talk) 12:50, 24 August 2024 (UTC)
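As a rough illustration of what this kind of normalization involves, here is a toy sketch: rewrite any country-specific host to the neutral books.google.com and keep only the parameters that identify the book and page. This is not Citation bot's actual code, and the parameter whitelist below is a guess for illustration only.

```javascript
// Toy sketch of Google Books URL normalization (NOT Citation bot's code).
// Rewrites country TLDs to the neutral .com host and drops parameters
// that don't identify the book or page (search terms, UI language, etc.).
// The whitelist of kept parameters is an illustrative assumption.
function normalizeGoogleBooksUrl(rawUrl) {
    const keep = ['id', 'vid', 'pg', 'lpg']; // book id, ISBN, page anchors
    const url = new URL(rawUrl);
    if (!/^books\.google\./.test(url.hostname)) {
        return rawUrl; // not a Google Books link; leave it alone
    }
    url.hostname = 'books.google.com'; // country-neutral host
    for (const key of [...url.searchParams.keys()]) {
        if (!keep.includes(key)) {
            url.searchParams.delete(key);
        }
    }
    return url.toString();
}

console.log(normalizeGoogleBooksUrl(
    'https://books.google.de/books?id=A1B2C3&pg=PA42&q=my+search&hl=de'
));
// → https://books.google.com/books?id=A1B2C3&pg=PA42
```

The spread into an array before deleting avoids mutating `searchParams` while iterating over its live key iterator.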

So, this simple request led to a lot of to-dos:

  1. MediaWiki doesn't appear to have a way to filter Special:LinkSearch by namespace. There are many Phabricator tickets; T37758 dates back to 2012... But the functionality already exists in NewPagesPager.php, so it shouldn't be too hard to make something similar.
  2. I couldn't find a userscript that allows the user to filter Special:LinkSearch by namespace. User:Polygnotus/namespace.js does it, exceptionally poorly (it does not make API requests; it just filters whatever is displayed on the page).
    • If there are no other scripts that do this task better, then something like this should be improved and added to the list.
    • Help:Linksearch#Toolforge claims there is a toolforge tool that can filter external links by namespace but it is dead. Can it be resurrected? Can it also be used for the other Special: pages?   In progress I asked over at User_talk:Giftpflanze#toolforge:Linksearch
    • Have to check if there are more Special: pages that also lack filter capabilities but should have them.
    • AWB also appears to be unable to use External link search as a list generator. Should that be added as a feature request?   In progress T373261
  3. According to this, there are 187 language versions of Google. Someone should probably scan the latest dump to see which of them appear in it.
  4. We need either a feature request for User:Citation bot or a WP:BOTREQ for whoever wants to deal with this.   In progress I asked over at User_talk:Citation_bot#Feature_request:_Google_Books

To get an idea of the scale of the problem, here is what we have for de, nl, es, and fr: 620 articles combined contain one or more of these links. Polygnotus (talk) 13:16, 24 August 2024 (UTC)

You might find this and variant searches useful: insource:"books.google" insource:/\/\/books\.google\.(de|fr|es|nl|pt|it)/. That search finds about 1480 articles.
Trappist the monk (talk) 13:52, 24 August 2024 (UTC)
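For anyone adapting that regex, the language-TLD pattern is easy to check locally before pasting it into the search box. A quick sketch (the sample URLs here are invented for illustration):

```javascript
// Local check of the language-TLD pattern from the insource: search above.
// The sample URLs are made-up examples, not links from any article.
const pattern = /\/\/books\.google\.(de|fr|es|nl|pt|it)/;

const samples = [
    'https://books.google.de/books?id=abc',  // country-specific: matches
    'https://books.google.com/books?id=abc', // neutral host: no match
    'https://books.google.fr/books?id=abc'   // country-specific: matches
];

for (const url of samples) {
    console.log(url, pattern.test(url));
}
```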
@Trappist the monk: Thank you! I do not understand why Special:LinkSearch returns a different number than the same languages in that regex via the searchbox. Polygnotus (talk) 14:12, 24 August 2024 (UTC)
The search looks for that pattern in the wikitext of pages. Linksearch looks for links in its rendered output. Consider templates like {{geoSource}}, which can produce links to google books from wikitext like {{geoSource|DE|AHL|42}}. (Or even the Template:GeoSource page itself, which has many links to google books despite that url not appearing in its wikitext at all, through its transclusion of Template:GeoSource/doc.) —Cryptic 14:54, 24 August 2024 (UTC)
Ok, that was a stupid question. Thanks! Polygnotus (talk) 15:30, 24 August 2024 (UTC)
Since a bot can only edit URLs that literally exist in the wikitext, insource is more accurate for that purpose. Unless the bot is programmed for those special templates, which most are not since there are thousands, each with their own syntax that can change on a whim. -- GreenC 13:14, 26 August 2024 (UTC)
And because it is only looking for six of the who-knows-how-many languages google books supports. And because the search is constrained to mainspace.
Trappist the monk (talk) 15:35, 24 August 2024 (UTC)
I managed to avoid those pitfalls ;-) Polygnotus (talk) 16:06, 24 August 2024 (UTC)
Probably the best task to look at is phab:T12593, which is specifically about Special:LinkSearch. It's not as simple as you seem to think; the SQL query would be too inefficient. NewPagesPager and some others have table structures different enough that efficient queries can be written. Anomie 16:33, 24 August 2024 (UTC)
Very interesting, thank you! I will have to do a bit more research. Polygnotus (talk) 09:55, 25 August 2024 (UTC)
Polygnotus, you can use the API for this, e.g.:
async function fetchGoogleBooksLinks(languageTLD) {
    const api = new mw.Api();
    let params = {
        action: 'query',
        format: 'json',
        list: 'exturlusage',
        euquery: `books.google.${languageTLD}`,
        eunamespace: 0,
        eulimit: 'max'
    };

    let titles = [];
    let continueToken;

    do {
        if (continueToken) {
            params.eucontinue = continueToken; // resume from where the last batch stopped
        }
        console.log(`Fetching for ${languageTLD}`);
        const data = await api.get(params);
        const exturls = data.query.exturlusage;

        exturls.forEach((exturl) => {
            titles.push(exturl.title);
        });

        continueToken = data.continue ? data.continue.eucontinue : null;
    } while (continueToken);

    return titles;
}

async function fetchAllGoogleBooksLinks() {
    const tlds = ['de', 'fr', 'es', 'nl', 'pt', 'it'];
    let allTitles = [];

    for (const tld of tlds) {
        const titles = await fetchGoogleBooksLinks(tld);
        allTitles = allTitles.concat(titles);
    }
    console.log("* [[" + allTitles.join("]]\n* [[") + "]]");
}

fetchAllGoogleBooksLinks();
See User:Qwerfjkl/sandbox/Language-specific Google Books links. — Qwerfjkltalk 09:02, 26 August 2024 (UTC)
Thank you! I know this cool little trick where I can take any piece of code and double it in size... by converting it to Java. Let me tell you, bigger is not always better. Polygnotus (talk) 12:35, 26 August 2024 (UTC)
In case it helps, I keep statistics: Enwiki has 2,122,411 Google Books links as of last month. If all you found was about 1,500 malformed links, about 0.07 percent, that's pretty good. We do have a much bigger problem with the corpus of GB links, which BTW is one of the largest corpora, comparable to nytimes.com and web.archive.org. The problem is that many of them have stopped working, directly (hard-404) or indirectly (soft-404 and crunchy-404 - see WP:LINKROT#Glossary). It could be 10% or more (200,000+). Nobody is really maintaining the 2-million corpus other than the URL normalization work of Citation bot. IABot typically skips them since archives usually don't work. Dead-ish GB links sit there, unresolved, often providing nothing more than an "About the book" page confirming the author and title, with a "Buy this book" button. At one time the link had content, but GB changes things around, removing content and leaving little behind. All 2 million are at risk of becoming hard, soft, and crunchy 404s. -- GreenC 13:44, 26 August 2024 (UTC)

geohack is down

  Resolved

Looking into reports that geohack (where article coordinates go) is down. Getting 504 timeouts when following links such as [1]. — xaosflux Talk 13:37, 26 August 2024 (UTC)

Zombies in the search index

 
Screenshot of an error with a Wikipedia search

This search – -hastemplate:"short description" prefix:"List of" – returns just two results and they are the same page – List of Coastal Carolina Chanticleers head baseball coaches. One copy (2024-06-09T16:09:55) is a version of the article just before it was moved to draftspace (2024-06-09T16:49:44). The other version is the article that was re-created under the original article name. Any clues as to how we expunge the undead version from the search index? — GhostInTheMachine talk to me 14:08, 26 August 2024 (UTC)

copied from WP talk:Short descriptionGhostInTheMachine talk to me 14:13, 26 August 2024 (UTC)
phab:T331127 maybe, which should have been fixed a while ago. Izno (talk) 14:59, 26 August 2024 (UTC)
Thanks. Looks like it, but the ticket was closed a few weeks back. Should it be re-opened? — GhostInTheMachine talk to me 15:29, 26 August 2024 (UTC)
I nudged it, we'll see what the response is. Izno (talk) 15:52, 26 August 2024 (UTC)

Tech News: 2024-35

MediaWiki message delivery 20:29, 26 August 2024 (UTC)

Partially blocked user still able to edit page?

Any idea how DN27ND is still able to edit the page Nori Bunasawa? They appear to have been partially blocked from editing it on July 31, but were able to make dozens of edits to it on August 27. –Novem Linguae (talk) 07:34, 28 August 2024 (UTC)

The most likely explanation is that, because of an AfD debate, the article was deleted and the pageblock then ended. And then the article was recreated, and the editor was then able to contribute to it. The current incarnation of the article was deleted by another administrator a few minutes ago, just as I was about to click the delete button. My explanation is an informed hunch, and those with a deeper understanding of the software may have a better explanation. Cullen328 (talk) 07:44, 28 August 2024 (UTC)
On 28 August 2024, Special:Contributions/DN27ND moved User:DN27ND/sandbox3 to Nori Bunasawa. I guess a partial block from editing does not prevent moving a page to the deleted and unsalted target. The edits occurred in the sandbox, before moving. Johnuniq (talk) 07:46, 28 August 2024 (UTC)
There are some edits after the page was moved. I think Cullen328 above might be on the right track. Maybe partial blocks are by page_id rather than page_title. Thank you both for the ideas. –Novem Linguae (talk) 07:48, 28 August 2024 (UTC)
Just a brief update here...
The previously blocked user is already arguing for undeletion at Requests for Undeletion, here [3].
Given that the article could as easily have been deleted under G5 as G4, would it not be possible for an admin to just site-block the user, rather than for others to have their time wasted by his continual bad-faith actions? Axad12 (talk) 08:29, 28 August 2024 (UTC)
That's off-topic for the technical discussion. Nardog (talk) 08:48, 28 August 2024 (UTC)
You can post future updates about user behavior in Wikipedia:Teahouse#Speedy deletion criteria, which I'm subscribed to. –Novem Linguae (talk) 08:54, 28 August 2024 (UTC)
No problem, I will open a thread at ANI and copy you in. (I will say for now however that I feel that the user's behaviour at Requests for Undeletion was quite unacceptable). Axad12 (talk) 08:57, 28 August 2024 (UTC)
@Novem Linguae Partial blocks apply to a specific incarnation of a page, not to the page title (as you say, it works by page id). If you move a page to a new title, the partial block should move with it.
You also cannot use a partial block to stop an editor from creating a page.
See the manual on mediawiki: MW:Manual:Block and unblock#Partial blocks
See Phab:T271253 for a request to make this clearer. 86.23.109.101 (talk) 10:48, 28 August 2024 (UTC)
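The page-id behaviour described above can be modelled in miniature. This is a conceptual toy, not MediaWiki's actual data model, and the page titles are just those from the incident above: a block records a page id, a move keeps the id, and deleting plus re-creating (or moving a different page over the title) produces a different id the old block never covered.

```javascript
// Toy model: a partial block stores the page *id*, not the title.
// NOT MediaWiki code; just the concept from the discussion above.
let nextPageId = 1;
const pages = new Map(); // title -> page id

function createPage(title) { pages.set(title, nextPageId++); }
function movePage(from, to) { pages.set(to, pages.get(from)); pages.delete(from); }
function deletePage(title) { pages.delete(title); }

// A block applies to a title only if that title still resolves to
// the page id recorded in the block.
function isBlockedFrom(block, title) {
    return pages.get(title) === block.pageId;
}

createPage('Nori Bunasawa');                          // original article
const block = { pageId: pages.get('Nori Bunasawa') }; // partial block on it
console.log(isBlockedFrom(block, 'Nori Bunasawa'));   // true

deletePage('Nori Bunasawa');                          // deleted at AfD
createPage('User:DN27ND/sandbox3');                   // a different page id
movePage('User:DN27ND/sandbox3', 'Nori Bunasawa');    // moved over the title
console.log(isBlockedFrom(block, 'Nori Bunasawa'));   // false: new page id
```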

Inconsistent namespace naming

Perhaps this changed recently, or I did not pay attention, but Special:Watchlist has a namespace filter that refers to Article/Mainspace as "Content", the first place I've seen it named as such. I like "Content", but find it confusing when elsewhere, e.g. in Special:MovePage, it's called "(Article)". I don't have a strong opinion on the correct name, but think we should minimize confusion. ~ 🦝 Shushugah (he/him • talk) 10:55, 28 August 2024 (UTC)

Are you perhaps confusing the "All contents" entry with the "Article" entry in the drop-down? —TheDJ (talkcontribs) 11:18, 28 August 2024 (UTC)
If you really see "Content" and not "All contents" then what is your language at Special:Preferences? "All contents" in the watchlist means all non-talk pages. There is also a MediaWiki concept of "content namespaces" which can be set by a wiki with mw:Manual:$wgContentNamespaces. I think it's only mainspace for all Wikimedia wikis (no mention in InitialiseSettings or CommonSettings). I haven't seen it used in any watchlist settings. PrimeHunter (talk) 11:52, 28 August 2024 (UTC)
{{trout}} me 🎏 You are right! I did not read closely enough, despite bothering to report it here, because I would have expected it to select all content/talk pages then. Thank you for the informative links down MediaWiki rabbit hole! ~ 🦝 Shushugah (he/him • talk) 12:36, 28 August 2024 (UTC)

Onlyinclude allows for transclusion of hidden comments?

I was wondering, has it always been the case that <onlyinclude>...</onlyinclude> tags, if placed inside a hidden HTML comment, will still transclude their contents when called from another page? (For example, see here and here in my sandbox.) This seems strange to me; if the code is hidden, it seems like it should not be transcluded elsewhere. This doesn't seem to apply to includeonly, though.

I'm aware of the issue with onlyinclude+nowiki tags mentioned here, but found nothing about hidden comments. Thanks, S.A. Julio (talk) 06:54, 28 August 2024 (UTC)

I assume it has always been like that, and it seems logical to me. Comments <!-- ... --> are saved as part of the page, not stored somewhere else. If the page has onlyinclude tags, then all other parts of the page are ignored on transclusion, so the comment start and end tags are not seen. includeonly works differently: a page with includeonly but no onlyinclude is processed from the beginning on transclusion, so the comment tags are seen there. PrimeHunter (talk) 12:10, 28 August 2024 (UTC)
Okay, thanks for the clarification. S.A. Julio (talk) 19:11, 28 August 2024 (UTC)
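PrimeHunter's explanation of the processing order can be illustrated with a toy simulation. This is not the real parser; it only models the one idea discussed above: when onlyinclude sections exist, everything outside them (including the surrounding comment delimiters) is discarded before comments are handled, so commented-out onlyinclude content transcludes anyway.

```javascript
// Toy simulation of the transclusion step (NOT MediaWiki's parser).
// If <onlyinclude> sections exist, keep only their contents; the
// comment delimiters around them are part of the discarded text.
// Comment stripping then runs on whatever is left.
function transclude(wikitext) {
    const only = [...wikitext.matchAll(/<onlyinclude>([\s\S]*?)<\/onlyinclude>/g)];
    const text = only.length ? only.map(m => m[1]).join('') : wikitext;
    return text.replace(/<!--[\s\S]*?-->/g, '');
}

// The onlyinclude tags sit inside an HTML comment on the source page,
// yet their content is transcluded anyway:
const page = 'Intro <!-- <onlyinclude>secret</onlyinclude> --> outro';
console.log(transclude(page)); // → secret
```

With no onlyinclude tags (the includeonly case), the whole page reaches the comment-stripping step, so anything inside a comment really is hidden.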