Wikipedia:Bots/Requests for approval/KiranBOT 12

KiranBOT 12


Operator: Usernamekiran

Time filed: 15:59, Tuesday, September 24, 2024 (UTC)

Function overview: update Accelerated Mobile Pages/AMP links to normal links

Automatic, Supervised, or Manual: automatic

Programming language(s): pywikibot

Source code available:

Links to relevant discussions (where appropriate): requested at BOTREQ around 1.5 years ago: Wikipedia:Bot requests/Archive 84#Accelerated Mobile Pages link eradicator needed, and village pump: Wikipedia:Village_pump_(technical)/Archive_202#Accelerated_Mobile_Pages_links, recently requested at BOTREQ a few days ago: special:permalink/1247505851.

Edit period(s): either weekly or monthly

Requested edit rate: 1 edit per 50 seconds.

Estimated number of pages affected: around 8,000 for now (a rough and possibly high estimate; in any case thousands of pages), then new pages as they come in.

Namespace(s): main/article

Exclusion compliant (Yes/No): yes (for now), if required, that can be changed later

Function details: Using extensive regex patterns, the bot looks for AMP links. It avoids false matches on the general string "amp" in domains, e.g. yamaha-amplifiers.com. After finding and rewriting a link, the bot checks whether the updated link works: if it gets a 200 response code, the bot updates the link in the article. Otherwise, the bot adds the article title and the (non-updated) link to a log file (this can be saved to a log page as well). —usernamekiran (talk) 15:59, 24 September 2024 (UTC)
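A minimal sketch of the kind of matching described above, for illustration only. The bot's actual patterns are not shown in this request, so the pattern names and the helper `looks_like_amp` are hypothetical. The idea is to test the host, path, and query string separately, so that "amp" inside a longer word (as in yamaha-amplifiers.com) does not count as a match.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical patterns, not taken from the bot's source code.
AMP_HOST = re.compile(r"^amp\.", re.IGNORECASE)         # amp.example.com
AMP_PATH = re.compile(r"(^|/)amp(/|$)", re.IGNORECASE)  # /amp/ or a trailing /amp


def looks_like_amp(url: str) -> bool:
    """Return True if the URL has an AMP marker in its host, path, or query."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if AMP_HOST.match(host):
        return True
    if AMP_PATH.search(parts.path):
        return True
    # A query parameter literally named "amp", e.g. ?amp=true
    return any(
        key.lower() == "amp"
        for key, _ in parse_qsl(parts.query, keep_blank_values=True)
    )
```

Because the path pattern requires "amp" to be a whole path segment, a URL such as `https://example.com/camp/site` is not flagged.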

addendum: I should have included this already, but I forgot. In the BOTREQ and other discussions, an open-source "amputatorbot" on GitHub was discussed. That bot has a lot of functions that are irrelevant for Wikipedia. The only relevant feature is removing AMP links. But for this, the amputatorbot utilises a database storing a list of ~~~400k~~ ~200k AMP links, and another list of the canonical links for these AMP links. Maintaining this database, and the never-ending list of links for Wikipedia is not feasible. The program I created utilises comprehensive regex patterns. It also handles the archived links gracefully. —usernamekiran (talk) 17:50, 28 September 2024 (UTC)

Discussion

"Maintaining this database, and the never-ending list of links for Wikipedia is not feasible" But you wouldn't have to maintain this database, right, if the authors of that GitHub repo already do, or have made it available?
"The program I created utilises comprehensive regex patterns. It also handles the archived links gracefully." Would you mind providing those patterns here for evaluation?

Aside from that, happy for this to go to trial. @GreenC: any comments on this, and does this fall into the scope of your bot? ProcrastinatingReader (talk) 10:40, 29 September 2024 (UTC)

I will soon post the link to GitHub, and my reasoning for avoiding the database method. —usernamekiran (talk) 13:21, 29 September 2024 (UTC)
@ProcrastinatingReader: Hi. Yes, the author at GitHub has made it available, but I think the database has not been updated in 4 years; I am not sure though. I also could not find the database itself. If we utilise the database, the bot would not process the "unknown" AMP links that are not in the database. In that case we would have to use the method that we are currently using. Also, the general process would be more resource intensive, I think, i.e.: (1) search for the AMP links in articles; (2) if an AMP link is found in an article, look for it in the database; (3) find the corresponding canonical link; (4) replace it in the article. Even if the database is being maintained, we would have to keep it updated, and we would have to add our new findings to the database. I think this simpler approach would be better. KiranBOT at GitHub, AmputatorBot readme at GitHub. Kindly let me know what you think. —usernamekiran (talk) 19:50, 29 September 2024 (UTC)

PS: I notified GreenC on their talkpage. Also, in the script, I added more comments than I usually do, and the script was created over the days/in parts, so the commenting might feel a little odd. —usernamekiran (talk) 19:54, 29 September 2024 (UTC)
This sounds like a good idea. I ran into AMP URLs with the Times of India domains, and made many conversions. It seemed site specific. Like m.timesofindia.com became timesofindia.indiatimes.com and "(amp_articleshow|amp_videoshow|amp_etphotostory|amp_ottmoviereview|amp_etc..)" had the "amp_" part removed. Anyway, I'll watchlist this page and feel free to ping me for input once test edits are made. -- GreenC 23:42, 29 September 2024 (UTC)
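The Times of India conversion described above could be sketched as a site-specific rule along these lines. This is an illustration under stated assumptions, not GreenC's actual code: the function name `fix_times_of_india` is hypothetical, the segment list contains only the four names quoted in the comment (the original list is elided with "amp_etc.."), and the example path in the usage note is made up.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Only the segment names quoted in the discussion; the real list is longer.
TOI_AMP_SEGMENT = re.compile(
    r"/amp_(articleshow|videoshow|etphotostory|ottmoviereview)(?=/|$)"
)


def fix_times_of_india(url: str) -> str:
    """Rewrite an m.timesofindia.com AMP URL; return other URLs unchanged."""
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != "m.timesofindia.com":
        return url
    # Drop the "amp_" prefix from the known path segments.
    path = TOI_AMP_SEGMENT.sub(r"/\1", parts.path)
    return urlunsplit(
        (parts.scheme, "timesofindia.indiatimes.com", path,
         parts.query, parts.fragment)
    )
```

For example, `https://m.timesofindia.com/city/delhi/some-story/amp_articleshow/12345.cms` would become `https://timesofindia.indiatimes.com/city/delhi/some-story/articleshow/12345.cms`. A rule like this only helps for sites whose mapping has been inspected manually beforehand.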
@ProcrastinatingReader: if there are no further questions/doubts, is a trial in order? I am sure about one issue related to https, but I think we should discuss it after the trial. —usernamekiran (talk) 15:16, 2 October 2024 (UTC)
{{BAG assistance needed}} —usernamekiran (talk) 08:42, 5 October 2024 (UTC)
Reviewing the code, you're applying a set of rules (amp.domain.tld → www.domain.tld, /amp/ → /, ?amp=true&... → ?...) and then checking the URL responds with 200 to a HEAD request. That seems good for most cases, but there are going to be some instances where the site uses an unusual AMP URL mapping and responds with 200 to all/most/some invalid requests, especially considering we are following redirects (but not updating the URL to the followed redirect). It also will not work for the example edit from the BOTREQ? I don't know how to solve this issue without some way of checking the redirected page actually contains some of the content we are looking for, or access to a database of checked mappings. Maybe the frequency of mistakes will be low enough for this to not be a problem? I am unsure. Any thoughts from others? — The Earwig (talk) 16:10, 5 October 2024 (UTC)
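For readers following along, the three rewrite rules summarised above (amp.domain.tld → www.domain.tld, /amp/ → /, ?amp=true&... → ?...) can be sketched roughly as follows. This is a simplified illustration, not the bot's code: the function name `deamp` is hypothetical, and rebuilding the query with `urlencode` may normalise the escaping of the remaining parameters.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def deamp(url: str) -> str:
    """Apply the three generic AMP rewrite rules to a URL."""
    parts = urlsplit(url)
    # Rule 1: amp.domain.tld -> www.domain.tld
    netloc = re.sub(r"^amp\.", "www.", parts.netloc, flags=re.IGNORECASE)
    # Rule 2: remove an "amp" path segment (/amp/ in the middle, or trailing /amp)
    path = re.sub(r"/amp(?=/|$)", "", parts.path, flags=re.IGNORECASE) or "/"
    # Rule 3: drop the "amp" query parameter, keep the others in order
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != "amp"
    ]
    return urlunsplit(
        (parts.scheme, netloc, path, urlencode(kept), parts.fragment)
    )
```

As the comment above points out, a rewritten URL that answers 200 is not proof that the mapping is right, so generic rules like these cannot cover sites with an unusual AMP URL scheme.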
These are good points. Soft-404s and soft-redirects are the biggest (but not only) issues with URL changes. With soft-404s, you first process the links without committing changes, log redirect URLs, see which redirect URLs are repeating, manually inspect them to see if they are a soft-404; then process the links again with a trap added to treat the identified soft-404s as a dead link. Not all repeating redirects are soft-404s, but many will be; you have to do the discovery work. For soft-redirects, it requires foreknowledge based on manual inspections, like the Times of India example above. URL changes are difficult for these reasons, and others mentioned in WP:LINKROT#Glossary. -- GreenC 17:53, 5 October 2024 (UTC)
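The two-pass soft-404 procedure described above might look roughly like this in Python. It is a sketch under assumptions: all function names are hypothetical, the threshold of three repeats is an arbitrary starting point, and the manual inspection step between the two passes still has to be done by a person.

```python
from collections import Counter
import urllib.error
import urllib.request


def head_final_url(url, timeout=10.0):
    """Send a HEAD request, following redirects; return (status, final_url)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        return err.code, err.url
    except OSError:
        # DNS failure, refused connection, timeout, and similar errors
        return None, url


def repeated_redirect_targets(resolved, min_count=3):
    """Pass 1: list redirect targets shared by many links (soft-404 candidates).

    `resolved` is an iterable of (original_url, final_url) pairs collected
    without committing any edits.
    """
    counts = Counter(
        final for original, final in resolved if final != original
    )
    return [target for target, n in counts.most_common() if n >= min_count]


def is_live(status, final_url, soft_404s):
    """Pass 2: treat a 200 that lands on a confirmed soft-404 as a dead link."""
    return status == 200 and final_url not in soft_404s
```

In use, the output of `repeated_redirect_targets` would be reviewed by hand, and only the targets confirmed as soft-404s would be passed to `is_live` as `soft_404s` on the second run.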
@GreenC any suggestions on logic/algorithm? I will try to implement them. I don't mind further work to perfect the program. —usernamekiran (talk) 20:32, 6 October 2024 (UTC)