Good work on Attack on Pearl Harbor in popular culture

Nice job. I wonder if you would be interested in putting your compsci skills to work in an altogether different topic area to generate a similar report, and if you are interested, you could try to publish your findings in the Signpost. Check out my proposal at the end of the discussion here and comment if you have any suggestions for how such a task could be even attempted. Viriditas (talk) 05:58, 9 August 2023 (UTC)

@Viriditas: What do you mean by FNB archives? I'm not familiar with this. Uhai (talk) 19:18, 9 August 2023 (UTC)
Sorry, my mistake for not linking to it: WP:FTN. To summarize: people add nonsense to Wikipedia all the time. A small, but dedicated group of gruff, tough-minded volunteers work diligently to fix or remove this material. I’m not talking about simple vandalism, I’m talking about complex bias that can be difficult for most people to spot if they aren’t looking for it. We have a general sense of the problem on the English-language Wikipedia as you can see from the FTN archives. I also think the Signpost has posted some studies and statistics on the issue. And if you watch the noticeboard closely for any length of time, you can see there’s a pattern to the reports, often based on current events or recent publications. The problem that I raised here with you is that nobody has a general sense of what is occurring on the non-English Wikipedia sites. To uncover the scope of the problem, we would first have to put together a list of topics which have one or more matching subjects on the other wikis. That’s the easy part. The list could range from UFOs to climate change to the assassination of JFK, to crystal healing and Ayurveda. You get the idea. I’m not familiar with the best methodology going forward, so we would need to figure that out. For example, how many items should be included? 50? 100? Can it be randomized in some way? Does that even matter? Etc. Then we would need a very simple bot to just copy and paste versions of these articles and append them to a file. From there, we would need to figure out how to measure the level of extant, remaining fringe bias in this sample, compared to the English Wikipedia. That’s pretty much it. Viriditas (talk) 23:04, 9 August 2023 (UTC)
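On the randomization question, one possible approach is a seeded random draw from the candidate topic list, so that anyone can reproduce the same sample later. The following is only a minimal sketch: the function name `sample_topics`, the seed value, and the sample size are illustrative assumptions, and the topic list is just the handful of examples named above.

```python
import random


def sample_topics(topics, k, seed=2023):
    """Return k topics drawn uniformly at random, reproducibly.

    The seed is an arbitrary, hypothetical choice; publishing it alongside
    the results would let others re-create the same sample.
    """
    # Sort and de-duplicate so the draw does not depend on input order.
    pool = sorted(set(topics))
    if k > len(pool):
        raise ValueError("sample size exceeds the number of distinct topics")
    return random.Random(seed).sample(pool, k)


# Example topics taken from the discussion; a real list would be far longer.
candidates = [
    "UFOs",
    "Climate change",
    "Assassination of JFK",
    "Crystal healing",
    "Ayurveda",
]
picked = sample_topics(candidates, k=3)
print(picked)
```

Whether 50 or 100 items is enough is a separate statistical question that this sketch does not answer; it only makes the selection repeatable.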
@Viriditas: I think the biggest challenge would be how to measure bias in the sample. If done manually, you would of course need some standardized process along with native speakers to follow it, as machine translation to English may mask biases or create them where they don't exist in the source text. If done automatically, you would need specialized machine learning algorithms, probably one for each language, which themselves would have their own biases.
Matching topics to other wikis I also see as being difficult, barring there existing some resource I'm not aware of that already does so. You could translate category names and try to manually match them, I suppose, in the event machine translation doesn't give the exact corresponding category name? But this depends on articles being added to the relevant categories, which often isn't the case from what I've seen here on enwiki. Matching article titles directly has the same difficulties for the same reason. Of course, we would only need to analyze a sample rather than every single article, but we would still need a list of every article to randomly sample from for a given topic.
It's an interesting idea, and honestly could even make for a great academic paper, but it seems quite challenging and time-consuming, and the technical aspects are probably beyond what I'd be able to do alone unless the work is done mostly manually. There are a lot of potential statistical pitfalls here as well, but I think they could be overcome with careful reporting of results.
Do you, by chance, have a link to any of the Signpost issues you mentioned that discuss this issue? I'd be curious to see the methodologies used. Uhai (talk) 19:51, 10 August 2023 (UTC)
Thank you for the reply and apologies for the delay. The article titles are thankfully already handled by the Wikidata project. If you’re not familiar with it, please get involved. Basically, the article title problem is solved, so we would have a one-to-one correspondence in that regard. That’s the easy part, however, as you explain up above. I’ve seen other studies and reports measuring bias so I would need to look at those and see if they are useful and relevant. I will take a look at the Signpost back issues and see if there is anything helpful. I will consider what you said up above and see if I can find anything to share with you in response. Thanks for your time. Viriditas (talk) 07:37, 11 August 2023 (UTC)
One option would be a look at those articles which are just translations of English and those articles which aren't. Checking out the sourcing would be the next step: a possible indicator for issues is if the article in the other language is of a comparable or greater length than the English language article (already an unusual situation) and lacking sources. After that, it'd have to be spot-checking. jps (talk) 10:58, 11 August 2023 (UTC)
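The length-and-sourcing indicator described above could be sketched roughly as follows. This is a hedged illustration, not an agreed method: the function names, the 0.8 length ratio, and the 0.5 reference ratio are all assumed thresholds, and counting `<ref` tags is only a crude proxy for sourcing. Character counts also differ between languages and scripts, so a flag here would only mean "worth a spot-check".

```python
import re

# Matches opening <ref>, <ref name=...> and self-closing <ref ... /> tags,
# but not <references />. A rough proxy for how heavily an article is sourced.
REF_TAG = re.compile(r"<ref[\s>/]", re.IGNORECASE)


def count_refs(wikitext):
    """Count footnote tags in raw wikitext."""
    return len(REF_TAG.findall(wikitext))


def flag_for_review(en_text, other_text, length_ratio=0.8, ref_ratio=0.5):
    """Flag an article that is comparable in length to enwiki but thinly sourced.

    Both ratios are hypothetical thresholds chosen for illustration only.
    """
    comparable_length = len(other_text) >= length_ratio * len(en_text)
    thinly_sourced = count_refs(other_text) < ref_ratio * count_refs(en_text)
    return comparable_length and thinly_sourced


# Made-up wikitext standing in for real article revisions.
en_article = "text " * 100 + "<ref>source</ref>" * 10
long_unsourced = "texto " * 100 + "<ref>fuente</ref>"
short_stub = "texto"

print(flag_for_review(en_article, long_unsourced))
print(flag_for_review(en_article, short_stub))
```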
@Viriditas: @ජපස: Right, Wikidata hadn't occurred to me for some reason. Anything with analyzing the metadata, such as article length, of articles should be pretty easy if I can be provided with a list of enwiki categories or articles. I can join from an enwiki article to its Wikidata item and then to articles on other projects using wikitech:PAWS. From this point, comparisons can be made. I can also do things like look at external links (including refs) in articles. I'm not sure if there's any common domain names that show up in refs or external links when it comes to these fringe theories, but I know here on enwiki, Blogspot is often a problem (in more ways than one).
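As a rough illustration of the two steps mentioned above (following an enwiki article through its Wikidata item to other wikis, and tallying the domains used in refs and external links), here is a minimal sketch. It does not use PAWS or the database replicas; instead it assumes the entity has already been fetched as JSON in the shape Wikidata returns, with a `sitelinks` mapping keyed by site ID. The sample entity, the item ID `Q0`, and the helper names are hypothetical, and the "last two labels" domain rule is a naive simplification that mishandles suffixes such as `.co.uk`.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s\]|<>}]+")


def sitelinks_for(entity, wikis):
    """Map site ID -> article title for the requested wikis only."""
    return {
        site: link["title"]
        for site, link in entity.get("sitelinks", {}).items()
        if site in wikis
    }


def domain_counts(wikitext):
    """Count linked domains in wikitext, grouped by their last two labels."""
    counts = Counter()
    for url in URL_PATTERN.findall(wikitext):
        host = urlparse(url).hostname
        if host:
            counts[".".join(host.lower().split(".")[-2:])] += 1
    return counts


# Hypothetical entity, trimmed to the fields used here.
entity = {
    "id": "Q0",
    "sitelinks": {
        "enwiki": {"site": "enwiki", "title": "Example topic"},
        "dewiki": {"site": "dewiki", "title": "Beispielthema"},
        "commonswiki": {"site": "commonswiki", "title": "Category:Example"},
    },
}
titles = sitelinks_for(entity, {"enwiki", "dewiki", "frwiki"})

# Made-up wikitext with two Blogspot links and one other link.
sample = (
    "<ref>[http://example.blogspot.com/post A post]</ref> "
    "<ref>{{cite web|url=https://www.example.org/a|title=x}}</ref> "
    "[http://other.blogspot.com/ Another]"
)
domains = domain_counts(sample)
print(titles)
print(domains.most_common())
```

Grouping by the last two labels is what lets different Blogspot subdomains count as one source, at the cost of accuracy for multi-part country suffixes.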
I haven't dug in too much so far, but I'm not immediately sure of how to easily and automatically tell if articles are translations from enwiki or original material. It doesn't seem like Wikidata has any indication. I'm sure you can find out manually pretty easily by comparing the organization of sections. Maybe an automated approach could start there by parsing the wikitext and comparing. Uhai (talk) 05:48, 12 August 2023 (UTC)
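The idea of comparing how sections are organized could start from something like the sketch below. Since heading text is in different languages, it compares only the sequence of heading levels (`==`, `===`, and so on) using `difflib.SequenceMatcher` from the standard library. The function names are made up for illustration, the regex handles only plain `== Heading ==` lines, and a high score would merely suggest a possible translation, not prove one.

```python
import re
from difflib import SequenceMatcher

# Level 2-6 headings written as "== Title ==" on their own line.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)


def heading_levels(wikitext):
    """Return the sequence of heading levels, e.g. [2, 3, 2]."""
    return [len(match.group(1)) for match in HEADING.finditer(wikitext)]


def structure_similarity(wikitext_a, wikitext_b):
    """Similarity in [0, 1] between the heading-level outlines of two articles."""
    levels_a = heading_levels(wikitext_a)
    levels_b = heading_levels(wikitext_b)
    if not levels_a or not levels_b:
        # No outline to compare; avoid treating two heading-less stubs as identical.
        return 0.0
    return SequenceMatcher(None, levels_a, levels_b).ratio()


# Made-up wikitext for illustration.
en = "Intro\n== History ==\n=== Early ===\n== Reception ==\n"
de = "Einleitung\n== Geschichte ==\n=== Frühzeit ===\n== Rezeption ==\n"
fr = "Intro\n== Histoire ==\n== Accueil ==\n== Notes ==\n"

print(structure_similarity(en, de))
print(structure_similarity(en, fr))
```

Many articles on the same subject share a conventional outline, so this signal would likely need to be combined with others, such as matching reference lists or revision history, before drawing conclusions.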