lowercase sigmabot III not archiving properly

For about the last three days, lowercase sigmabot III has only been archiving the Administrators' noticeboards and nothing else. Somebody mentioned that you gave it a good kick the last time it went on the fritz, so I will go ahead and notify you. Safiel (talk) 16:37, 29 April 2024 (UTC)

Thanks for the notice. I've kicked it again and added a workaround in case this issue happens again. — The Earwig (talk) 04:29, 30 April 2024 (UTC)
Hi, hope you're well. I think the bot is down again. ~~ AirshipJungleman29 (talk) 11:36, 12 June 2024 (UTC)
Thanks, AirshipJungleman29. Different issue from last time. I think I've fixed it. — The Earwig (talk) 03:01, 13 June 2024 (UTC)

Copyvio Detector and Google

Hi,

(Sorry if this is the wrong forum for asking, but if so, perhaps you could point me in the right direction?)

I use the Copyvio Detector (great tool, BTW!) in checking new AfC drafts, at least a dozen times most days. I sometimes get an error message saying that the detector has exceeded its maximum allowed Google searches. This issue has always been there, occasionally, but in the last week or two it has occurred daily. When I start reviewing, around 6am or so UK time, the first few reviews always hit this problem. Then, maybe 8am (?) the daily quota probably gets reset, or something else happens, because from then onwards everything is fine until the next morning.

So I was thinking, I don't suppose there's much we can do to increase the quota (?), but would it be possible to add another search engine as a fallback option? Either so that when the user gets that error message, they could manually tick a box to use Bing (say) instead; or maybe the Detector could automatically switch to using the alternative if Google has failed.

I realise this may not be possible, either for technical or policy reasons, but thought I'd ask at least. Cheers, -- DoubleGrazing (talk) 09:35, 8 May 2024 (UTC)Reply

Hi DoubleGrazing, using Bing or some other engine as a fallback is definitely something we’ve discussed—I hadn’t realized the issue had gotten this bad recently. The main issue here is these services usually cost money, and while the WMF pays for our Google access right now, I don’t know if I will be able to ask for access to additional search engines. First, I can take a deeper look into whether anyone is overusing their share of the tool’s resources; we might need to block/limit them. (Our plan with Google allows about 1500 articles to be checked per day.) — The Earwig alt (talk) 16:11, 8 May 2024 (UTC)Reply
Okay, thanks for shedding some more light on this; needless to say, I knew nothing about how these things work.
I guess we at AfC are taking up quite a chunk of that quota, given that we see what are by definition new drafts usually by new users. I for one run the check probably at least on ⅓ of the drafts I review (and if you think that makes me an overuser, feel absolutely free to point this out, of course!). Even at NPP we deal with relatively more experienced users, so there's that much less of a need to check for CV.
It may be that I see the problem worse than some others, mind, because of my weird early-morning AfC habit, combined with the time zone I'm in. -- DoubleGrazing (talk) 17:05, 8 May 2024 (UTC)Reply
Hi again,
Quick update on this: the problem (of the copyvio detector running out of Google quota) has lately become worse. Unlike before, when it would only manifest in the early morning UK time, and usually be fine after 8am UK / 0700 UTC, it's now also happening in the afternoon. This is relatively new, maybe in the past week or two, so I don't yet have a good feel for what time it happens exactly (in case that matters); I would have said late afternoon, but e.g. today it started already around 1pm UK / 1200 UTC.
Best, -- DoubleGrazing (talk) 12:35, 4 July 2024 (UTC)
Sorry for taking a while to get back, but I'm actively working on an improvement for this now. — The Earwig (talk) 06:43, 19 July 2024 (UTC)
Great to hear, thanks. :) DoubleGrazing (talk) 10:35, 19 July 2024 (UTC)

Earwig's Copyvio Detector

Hello, The Earwig,

I have a question about this editing tool. It seemed like I could run this 20 or more times before I got a notice that I had reached my daily limit. But now, I receive a notice if I just run it a few times. Has this limit been decreased for some reason? I use this tool quite a lot while patrolling drafts and CSD categories so it's sometimes difficult to remember to go back to reexamine some pages the next day when I have reached my daily limit for the current day. Thanks for any insight you can provide. Liz Read! Talk! 20:21, 8 June 2024 (UTC)Reply

Hi Liz. Rest assured this isn't related to your own usage of the tool. The daily limit is shared by all users, and allows for about 1000–2000 pages to be checked per day, so even if you're checking a few dozen, that's not a major contributor to the limit getting reached. We've been noticing this issue more frequently recently (see a few threads above) and we're doing some work to restrict other users of the tool who are actually overusing their share of its resources. I'm hoping to have things back to normal soon. — The Earwig (talk) 04:23, 11 June 2024 (UTC)Reply
I didn't realize that I posted two messages about the same issue. I should have reviewed your talk page before posting my subsequent message. I guess I have a sense of frustration now that I know I'm competing with RichBot for copyright inquiries. Liz Read! Talk! 03:11, 8 August 2024 (UTC)Reply

Copyvio detector constantly timing out

Hello again Ben! I am having issues with the Copyvio detector, finding it almost impossible to get it to generate a report. "The URL http://weaponsystems.net/weaponsystem/CC02%20-%20PTZ89.html timed out before any data could be retrieved" for example. Frequently it goes down completely as well. Any assistance appreciated. Thanks, — Diannaa (talk) 11:00, 13 June 2024 (UTC)Reply

Sorry, there aren't any quick fixes for this. I am working on it. — The Earwig (talk) 16:06, 13 June 2024 (UTC)
Actually, I’ve found a partial fix to improve performance. Let’s see if it helps. — The Earwig alt (talk) 17:19, 13 June 2024 (UTC)
It's much better, thanks! Fixing copyvio is tedious enough lol. — Diannaa (talk) 23:16, 13 June 2024 (UTC)

Copyvios + Arc (Also, RichBot)

Hi Ben,

I've started using the Arc browser, and for some reason whenever I try to access Copyvios on it, I get an Internal Server Error. Trying the same URL in Edge works fine. Not sure where the bug is there, but hopefully you can find it.

Also, I see above there still seem to be issues regarding usage. Did you need me to tone RichBot down a bit? - RichT|C|E-Mail 17:10, 28 June 2024 (UTC)

Hey Rich, sorry I took a bit to reply. This is my first time hearing about Arc and I don't really feel like creating an account to test, so I can't confirm on my end. Are you sure it's an Internal Server Error, or could it be a 403 Forbidden? (We may have inadvertently blocked its user agent as a crawler, which would give a 403, but I don't see anything in our block list that looks like it or Chrome [except Linux], so I don't know.) This is pretty strange.
Regarding bot usage, there are two main issues the tool's had lately: general downtime and exhausting our Google credits. I've improved the tool's performance a bit so the former is not a major issue now, but we are still frequently exhausting our daily Google quota. I've checked RichBot's usage and recently it's been consuming around 10-20% of our total Google credits. That's not too excessive, but if you could find a way to tone it down a bit without compromising its usefulness, it would be appreciated. — The Earwig (talk) 08:10, 1 July 2024 (UTC)
No worries, I have reduced RichBot to only look at 100 (plus existing CVs) per run, so 200 per day (excluding manual runs). Is there a way we can increase the credits? I don't mind throwing some £ at it if need be - RichT|C|E-Mail 09:31, 1 July 2024 (UTC)Reply
No way that I know of unfortunately; the WMF pays for it, but Google's API terms limit our usage without some kind of special arrangement that I have been unable to get. — The Earwig (talk) 15:25, 1 July 2024 (UTC)Reply
Typical Google lol... ah well, worth a shot - RichT|C|E-Mail 17:52, 1 July 2024 (UTC)Reply
Hey The Earwig. Big fan. Is there a venue where advocacy from affected editors might get us closer to that special arrangement? Firefangledfeathers (talk / contribs) 17:50, 18 July 2024 (UTC)Reply
Hi Firefangledfeathers, thank you. I'm not sure who we could talk to about this, to be honest. My former contact at the WMF no longer works there and it's not clear to me who is responsible for managing the relationship with Google right now. Going the other way, i.e. getting someone in a position of power at Google who could help, might be more fruitful. But that is just speculation; I don't know who specifically that might be. — The Earwig (talk) 06:02, 19 July 2024 (UTC)Reply
Thanks. I don't have any bright ideas. I'll probably go with the low-hanging fruit and post at WP:VPWMF. Firefangledfeathers (talk / contribs) 12:00, 19 July 2024 (UTC)Reply
And it's definitely a 500, 'The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.' - RichT|C|E-Mail 14:07, 1 July 2024 (UTC)
Ah, I think I've figured it out. Could you try now? — The Earwig (talk) 15:36, 1 July 2024 (UTC)
Much better :) Thanks :D - RichT|C|E-Mail 17:51, 1 July 2024 (UTC)

The Signpost: 4 July 2024

Administrators' newsletter – July 2024

News and updates for administrators from the past month (June 2024).

 



Hello, The Earwig,

I regularly used this tool you created, mostly when patrolling drafts or CSD-tagged articles; I'd probably use it 3 or 4 times a day. When I used it too much, I'd get a message that I was over my limit of how often I could use it. At least that's how I thought things worked. Now, I get this message every time I try to see whether a page is a copyright violation; I have not gotten a successful response to a query in many, many weeks now. So I'm wondering: is this "limit" actually for all users on this platform and not tied to individual editors? Because something odd is going on, and maybe new page patrollers or AFC reviewers are using it for every article they review, if I cannot get even one or two reports on suspicious articles or drafts I've come across. I know with AI, there are ways users can get around copyright restrictions but I still found the tool helpful.

Do you have any idea why it is suddenly no longer available to generate reports? Can you tell me the time of the day when it "resets" so that maybe I could make inquiries then? Or is there any possibility of raising this limit of reports generated? I mean, I'm glad it's become so popular but it has also become unavailable for use for those of us who just want to make a few queries a day. Thank you. Liz Read! Talk! 22:31, 19 July 2024 (UTC)

Hi Liz, truly sorry about the ongoing issues. I'm aware and working on it (see some of the threads above you), with the time I have available. I thought things had improved with the overall performance improvement last month, but it has really just made this particular problem of running out of the search quota much worse. Anyway, I am working on it now.
To answer your questions: yes the quota is shared by all users, and we cannot easily raise it. It's a hard limit enforced by Google that I cannot bypass without some special arrangement. It resets I think around midnight Pacific Time, i.e. Google's time zone.
I think the issue is some bots/automated traffic making too many queries. In the past I have been able to block them or ask them to slow down, but that approach has become less effective lately. So, I will be adding authentication to the tool to make sure only logged in users can use it and I can more accurately identify who is overusing it. I expect to finish that work this weekend and I am hopeful that will solve the issue. If it doesn't, there are other things I can try. — The Earwig (talk) 00:43, 20 July 2024 (UTC)Reply
Update: I am still working on this, but have made progress. — The Earwig (talk) 05:14, 22 July 2024 (UTC)
FYI, I've also run into this issue the last couple of days. I'm assuming you're still working on it, or that life has gotten in the way of you fixing the issue. I dream of horses (Hoofprints) (Neigh at me) 21:20, 30 July 2024 (UTC)
Yes, it's still my current focus with the free time I have. — The Earwig (talk) 00:21, 31 July 2024 (UTC)
Just circling back to see how you responded to my query last month. Still have not successfully submitted a query and gotten a report in several months now. I realize that we are all volunteers so I don't have high expectations of when this issue might be "fixed" as we all have outside lives.
But I didn't realize though that regular editors were competing with bots, that's a battle individual editors can never win so please block those bots, if possible! I don't even see how a bot would be able to handle a copyright violation report and interpret it appropriately. Liz Read! Talk! 03:06, 8 August 2024 (UTC)Reply
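For readers following the quota discussion, here is a minimal Python sketch of how a single daily quota shared by every user might be tracked, with a per-user tally so that heavy users can be identified. It is not the tool's actual code: the class name `SharedDailyQuota`, its methods, and the limits used are hypothetical, and the midnight Pacific Time reset follows the estimate given in the reply above rather than any confirmed Google behaviour.

```python
from collections import Counter
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: the quota day rolls over at midnight Pacific Time.
PACIFIC = ZoneInfo("America/Los_Angeles")


class SharedDailyQuota:
    """One daily budget shared by all users, with per-user accounting."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self.day = None            # Pacific calendar date currently being counted
        self.used_by = Counter()   # user name -> checks consumed on that date

    def try_consume(self, user: str, now: datetime) -> bool:
        """Record one check for `user`; return False if the shared budget is spent.

        `now` must be a timezone-aware datetime.
        """
        today = now.astimezone(PACIFIC).date()
        if today != self.day:
            # A new Pacific day has started, so every tally starts over.
            self.day = today
            self.used_by.clear()
        if sum(self.used_by.values()) >= self.daily_limit:
            return False
        self.used_by[user] += 1
        return True

    def share(self, user: str) -> float:
        """Fraction of today's consumed checks that `user` accounts for."""
        total = sum(self.used_by.values())
        return self.used_by[user] / total if total else 0.0
```

Because the budget is shared, one user running a few checks can be refused simply because others have already spent it, which matches what is described in this thread. The per-user tally only becomes meaningful once requests can be attributed to an account, which is the point of adding authentication.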

The Signpost: 22 July 2024

Administrators' newsletter – August 2024

News and updates for administrators from the past month (July 2024).

  Administrator changes

  Isabelle Belato
 

  Interface administrator changes

  Izno
 

  CheckUser changes

  Barkeep49

  Technical news

  • Global blocks may now target accounts as well as IPs. Administrators may locally unblock when appropriate.
  • Users wishing to permanently leave may now request "vanishing" via Special:GlobalVanishRequest. Processed requests will result in the user being renamed, their recovery email being removed, and their account being globally locked.

  Arbitration


Earwig returns 0% on url-comparison with clever close paraphrase

Hello. I noticed a {{circular}} tag at Ceteris paribus and ran this URL comparison to find out how much duplication there was, and in what section(s). To my surprise, it came back with 0.0%. However, notice these:

Comparison snippets

From: https://www.masterclass.com/articles/ceteris-paribus-explained#7MlD3BCbNL4NC0BejpGo02

1. Supply chain: Ceteris paribus considers production factors, such as logistics, sourcing, competition, and trends with buyers to determine the price of goods. For example, a bread seller observes the costs of the ingredients, labor, packaging, and distribution, in addition to competitors, economic inflation, and consumer trends. Ceteris paribus stipulates that if other factors remain the same, a decrease in the supply of bread will cause prices to rise.

2. The law of supply and demand: In the law of demand, buyers demand less of an economic good when prices are higher. The law of supply says that sellers will supply more of an economic good when prices are higher. The interaction of these two laws determines the actual market price and volume of goods. Ceteris paribus identifies, isolates, and tests the impact of an independent variable that would affect these two laws and the causal factors in the market supply and prices.

3. Gross domestic product: Economists use ceteris paribus to study the GDP, assuming that variables remain fixed to determine the effect in the money market.

4. Interest rates: If the interest rates increase, the independent variable, then the demand for debt goes down as the cost of borrowing increases, the dependent variable.

5. Minimum wage: Economists use ceteris paribus to determine the potential effects of a minimum wage increase, including the possible outcome of fewer jobs available if companies must pay employees more.


From Ceteris paribus#Applications rev. 1238986793:

The concept of ceteris paribus is crucial for economists and can be applied in researching:

  1. Supply chain. Ceteris paribus considers aspects of production, that being competition in the market, production costs, inflation, and consumer trends to conclude pricing of goods, imposing that keeping the aspects of production constant, minimising supply will adjust prices to increase.[1]
  2. Law of supply and demand. The law of demand states that, when prices rise the demand of goods fall, whilst the law of supply dictates that as prices rise sellers are more willing to supply. When these laws interrelate market prices and supply in the market are determined. Ceteris paribus is used in the law of supply and demand through determining how independent variables will impact the casual factors of prices and supply in the market.[1]
  3. Gross domestic product. Ceteris paribus is used in relation to GDP to determine how the money market will change when variables remain constant.[1]
  4. Interest rates. Through keeping interest rates as the independent variable, as interest rates rise, thus borrowing costs rise forcing a reduction in the demand for debt, that being the dependent variable.[1]
  5. Minimum wage. To define the possible effects of a rise in the minimum wage economists will use ceteris paribus. Possible effects include how wage increases may force employments down.[1]

References

  1. ^ a b c d e "Ceteris Paribus Explained: 5 Economic Uses for Ceteris Paribus". MasterClass. 2021-12-21. Retrieved 2024-06-05.

There is a lot of close paraphrase here, maybe enough to cover their tracks and confuse the detector. I remember glancing at Andrei Broder's shingle-based detection paper eons ago (might be this one) and I don't know how yours works, but if it is shingle-based, would it be feasible to add a new param to the input form, or in the settings, maybe in an 'advanced' section, to set the shingle size? In a case of paraphrase like this one, where the information is clearly copied but words are shifted around in the sentences, a shorter shingle size might do a lot better at detecting the similarities. This might kill processing time in the web search version, so maybe would only work when the 'url' radio button was selected, but still could be pretty useful for cases like that, and might make a great tool for assigning a measurable value to close paraphrase, which afaik we do not have currently, and is all very hand-wavy. Thanks, Mathglot (talk) 19:32, 6 August 2024 (UTC)Reply

It does slightly better (4.8%) specifying revision id 1151114395. What is going on here? Mathglot (talk) 20:09, 6 August 2024 (UTC)
Okay, just noticed that in both of those revisions, Earwig doesn't appear to see past the first short section of the web page, so the paraphrased section I am addressing doesn't appear to be visible to Earwig, or at least, it isn't displaying it on the comparison page, for some reason, if you scroll down. Mathglot (talk) 21:59, 6 August 2024 (UTC)Reply
That's exactly it, Mathglot. The website loads its content through JavaScript so it's not available to the tool. There isn't an easy workaround for this, but there are some options I could try further in the future. Since the content doesn't show up in the comparison view as part of the source, my hope is that people will figure out what's going on, as you were able to. — The Earwig (talk) 00:23, 7 August 2024 (UTC)Reply
Thanks for that. Even if it could see it, I wonder if it would come up with any kind of rating, due to the paraphrase? Not sure what kind of test bed you use, but if you could copy the MasterClass page and save it offline locally (post-js, or just scraping the rendered page manually and saving it) and run Earwig against that file, I'd be interested to see what it would come up with. And if you use shingling and it's parametrizable, whether the rating would change if you reduced the shingle size. Mathglot (talk) 01:14, 7 August 2024 (UTC)Reply
OK, I can do a quick experiment of that, Mathglot. The tool does use shingling, actually. I haven't seen this paper and independently came up with a similar algorithm many years ago. Internally I call the shingle size the degree, and I've exposed that as a query-string-only parameter if you would like to play with it.
I manually copied the text to a pastebin. With the tool's default shingle size of 5 words, almost no similar text is found, and the similarity score is 5.7%. With size 3, it's 38.3%. With size 2, it's 67.1%. At this point a lot of the similar content is trivial ("is a", "in the", "of the"), so the odds of a false positive are much higher, though it does at least highlight some interesting similarities, too.
The tool doesn't have a way of identifying more unique common phrases. If we could down-weight "is a" but up-weight, say, "wage economists", we could lower the default shingle size and get more sensitive results. The default size was actually 3 several years ago, but I raised it because the false positive rate was just a bit too high and it was causing confusion. So there's a delicate balancing act with the current algorithm.
Food for thought. Thanks. — The Earwig (talk) 05:20, 7 August 2024 (UTC)
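To make the shingle-size trade-off in this thread concrete, here is a minimal Python sketch of word-shingle comparison. It is an illustration under stated assumptions, not the detector's actual algorithm or scoring formula: the function names, the tokenizer, the "fraction of article shingles found in the source" score, and the tiny stop-word list are all hypothetical. The second scoring function sketches the weighting idea raised above, where shingles made only of common words count for nothing.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real list would be far longer.
STOPWORDS = frozenset({"a", "an", "the", "is", "in", "of", "on", "to", "and"})


def shingles(text: str, size: int) -> Counter:
    """Count every run of `size` consecutive words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(words[i:i + size]) for i in range(len(words) - size + 1))


def similarity(article: str, source: str, size: int = 5) -> float:
    """Fraction of the article's shingles that also occur in the source."""
    a, s = shingles(article, size), shingles(source, size)
    total = sum(a.values())
    if total == 0:
        return 0.0
    return sum((a & s).values()) / total


def weight(shingle: tuple, stopwords: frozenset = STOPWORDS) -> float:
    """Share of a shingle's words that are not stop words (0.0 for "is a")."""
    return sum(word not in stopwords for word in shingle) / len(shingle)


def weighted_similarity(article: str, source: str, size: int = 2,
                        stopwords: frozenset = STOPWORDS) -> float:
    """Like similarity(), but each shingle counts in proportion to its weight."""
    a, s = shingles(article, size), shingles(source, size)
    total = sum(weight(sh, stopwords) * n for sh, n in a.items())
    if total == 0:
        return 0.0
    shared = sum(weight(sh, stopwords) * n for sh, n in (a & s).items())
    return shared / total
```

For the toy pair "the cat sat on the mat" and "the cat lay on the mat", `similarity` gives 0.0 at size 5, 0.25 at size 3 and 0.6 at size 2, showing how a single changed word hides the overlap at larger sizes. `weighted_similarity` at size 2 gives 0.4, because the shared shingle "on the" no longer contributes.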