Wikipedia talk:OABOT

Question

Naive question(s) from a non-wikipedia-person (JTW):

  • How much standardization is there, and how many edge cases are worth pursuing? I'm trying to figure out what tags to search for but it seems like there are layers of deprecated standards at this point.
  • A first pass is to just worry about references using the Cite Journal syntax. That's pretty standardized and easy to match. The simplest script that's worth writing is something like: find all cite journal tags, look for doi/pmid/pmc IDs, and look up an OA link to that paper if it's not present — Preceding unsigned comment added by Jamestwebber (talkcontribs) 18:27, 11 September 2015 (UTC)Reply
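A rough sketch of that "simplest script" in Python, assuming mwparserfromhell for template parsing and the Unpaywall API as the OA lookup service (both are illustrative choices, not anything specified above):

<syntaxhighlight lang="python">
# Sketch: scan {{cite journal}} templates, pull the DOI, and ask an OA
# resolver for a free link when no |url= is present. Unpaywall stands in
# here for "some OA lookup service"; any DOI-to-OA resolver would do.
import mwparserfromhell
import requests

def suggest_oa_links(wikitext, contact_email="you@example.org"):
    suggestions = []
    for tpl in mwparserfromhell.parse(wikitext).filter_templates():
        if str(tpl.name).strip().lower() != "cite journal":
            continue
        if tpl.has("url") or not tpl.has("doi"):
            continue  # already linked, or nothing to look up
        doi = str(tpl.get("doi").value).strip()
        resp = requests.get(f"https://api.unpaywall.org/v2/{doi}",
                            params={"email": contact_email}, timeout=30)
        if resp.ok:
            loc = resp.json().get("best_oa_location") or {}
            if loc.get("url"):
                suggestions.append((doi, loc["url"]))
    return suggestions
</syntaxhighlight>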

Actually, the more I look into this the more confused I get. Can we establish a set of test-cases that this bot should handle?

[edit: as is probably obvious, I've never done anything on wikipedia before. Will sign things in the future] James Webber (talk) 19:44, 11 September 2015 (UTC)Reply

Getting a free copy of an article by DOI / PMID

To get a free-to-read copy of articles by DOI, we could use the CORE search engine via its API. It accepts DOIs and other identifiers as search parameters. Note however that the indexing looks a bit faulty to me: for instance, this arXiv document is associated with a DOI, and CORE harvests arXiv, but searching for this DOI from the CORE interface does not return anything. The metadata tools we have developed for the Dissemin project overcome this issue and it should not be too hard to provide a similar API to be used by this bot. Pintoch (talk) 20:17, 11 September 2015 (UTC)Reply
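For illustration, a DOI lookup against CORE might look roughly like this (the v3 endpoint, query syntax and response fields are assumptions from memory of the public API docs, so verify them before relying on this):

<syntaxhighlight lang="python">
# Sketch: ask CORE for a work matching a DOI and return a full-text link.
import requests

def core_lookup(doi, api_key):
    resp = requests.get("https://api.core.ac.uk/v3/search/works",
                        params={"q": f'doi:"{doi}"'},
                        headers={"Authorization": f"Bearer {api_key}"},
                        timeout=30)
    resp.raise_for_status()
    for work in resp.json().get("results", []):
        if work.get("downloadUrl"):  # first full-text link found, if any
            return work["downloadUrl"]
    return None
</syntaxhighlight>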

Wikidata should be involved/merged with this

  • Hi. I was at the recent Wikipedia Science Conference. At this, Dario Taraborelli had a great suggestion that Wikidata could house ALL the mappings of useful literature: DOI <-> PMID <-> PMC <-> arXiv ID

NCBI has a useful API to help with some of this: http://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/ I'd just tack on free full-text access URLs as another useful mapping. But this should all be done on Wikidata and used in Wikipedia from Wikidata. Metacladistics (talk) 20:20, 14 September 2015 (UTC)Reply
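A sketch of a call to that ID converter (the endpoint is the one documented at the page linked above; the tool/email parameters follow NCBI's polite-use conventions and the values here are placeholders):

<syntaxhighlight lang="python">
# Sketch: map a DOI, PMID or PMC ID to its sibling identifiers.
import requests

def convert_ids(identifier):
    resp = requests.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/",
                        params={"ids": identifier, "format": "json",
                                "tool": "oabot-sketch",
                                "email": "you@example.org"},
                        timeout=30)
    resp.raise_for_status()
    records = resp.json().get("records", [])
    # Each record may carry pmcid, pmid and doi keys for the same work.
    return records[0] if records else {}
</syntaxhighlight>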

Coordination with OA Signalling project

Hi, the OA Signalling project is doing something very similar, just for openly licensed references. A short sketch of the workflow sits here, and it includes Wikidata's WikiProject Source Metadata (alluded to above) as well as a gadget to display information from the OA Button. It would be great if we could join forces on those aspects that are independent of paywalls and licensing. -- Daniel Mietchen (talk) 00:33, 24 November 2015 (UTC)Reply

Indeed, please merge this page to Wikipedia:WikiProject Open Access/Signalling OA-ness or at least move under Wikipedia:WikiProject Open Access. ALL CAPS page titles aren't standard, but are ok as redirects. Nemo 14:50, 2 June 2016 (UTC)Reply

Proof of concept

I've written a quick proof of concept here. Feedback welcome! An interesting discussion about how open access should be indicated in references is taking place here. Pintoch (talk) 18:05, 16 March 2016 (UTC)Reply

Might "Free Version"be better in front?

There's a section where the free version link has "Free version" added at the end. I think it might be better to put "Free version" or whatever marker at the front of the link rather than at the end, especially if the links are going to look like the one in the example. Chuck Baggett (talk) 15:07, 28 May 2016 (UTC)Reply

Hey, Chuck Baggett. Thanks for the feedback. This mockup is entirely hypothetical and would ultimately have to be refined and approved by the CS1 template editors and other reference buffs. I personally think it's problematic to put free version before basic identifiers like title and author and date. There may be many other ways to make the free version link more prominent and I'm open to modeling and demoing any/all of them. For now the immediate focus is on having the bot add a link (technically). How that link appears is important and not-yet-decided. We will raise that discussion in the next week or two. Cheers! Jake Ocaasi (WMF) (talk) 15:57, 31 May 2016 (UTC)Reply
I think the |url= should be used as the place to put the most useful link for our readers. A free-to-read version is arguably more useful than a paywalled one. We should make sure we still link to the version from the publisher, but that is what |doi= is for. If the free-to-read link also corresponds to an identifier, then we should also add it as an identifier (so it would appear both as |url= and |arxiv=, say). Adding a "Free version" link would generate too much clutter, I think. − Pintoch (talk) 19:18, 31 May 2016 (UTC)Reply
I think the url should be the version the editor actually read when they cited the content, but I'm open to discussing all the options. Jake Ocaasi (WMF) (talk) 18:42, 2 June 2016 (UTC)Reply

URL replacement

Re the "Edge cases for future development": it's always good to remove an URL to a paywalled version from the url parameter, as long as the DOI is provided (which can be used to easily reach the publisher's version). --Nemo 11:30, 15 August 2017 (UTC)Reply

Yeah I agree - the bot should not be blocked by |url= that are resolved versions of an existing |doi=. − Pintoch (talk) 10:18, 16 August 2017 (UTC)Reply
Another good example: in [1], the existing URL is broken and the CiteSeerX cache is probably an archived copy of that original URL. It would be very good to replace or remove the broken URL. --Nemo 11:33, 30 August 2017 (UTC)Reply

CiteSeerX

I'm duly checking the CiteSeerX links before adding them, so I now got this (after about 20 downloads):

Download Limit Exceeded: You have exceeded your daily download allowance.

Lame. --Nemo 11:40, 30 August 2017 (UTC)Reply

You can try downloading the uncached versions that they list instead (which are not hosted by them, so you should not hit any rate limit on those). They are not always listed, though. − Pintoch (talk) 12:40, 30 August 2017 (UTC)Reply

On the bright side, now that oaDOI was added as a source the CiteSeerX links are much less common, so it will take more time to hit the limit in any given day. --Nemo 07:36, 26 October 2017 (UTC)Reply

In [2], rather than linking at [3], it should link to [4].

This applies to other links to that domain, like the 2nd link it changed in that diff. Headbomb {t · c · p · b} 12:04, 30 August 2017 (UTC)Reply

Likewise, in [5] rather than link at [6], the bot should link at [7]. Headbomb {t · c · p · b} 12:27, 30 August 2017 (UTC)Reply
This should be true in general. Link to the free document/PDF when possible, rather than simply to a page where the document can be found if you look hard enough. Headbomb {t · c · p · b} 12:27, 30 August 2017 (UTC)Reply

Going to @Nemo bis: on this as well, since you've unleashed the bot on a lot of physics articles, creating a lot of these links needing to be updated to point to the PDFs. Headbomb {t · c · p · b} 12:28, 30 August 2017 (UTC)Reply

Personally I prefer links to the records because then the abstract is quickly accessible. I prefer the link to the PDF only when the interface makes the PDF hard to find. --Nemo 12:34, 30 August 2017 (UTC)Reply
Repository managers also tend to prefer that, as it gives an opportunity to the reader to discover their platform. I have met multiple researchers who were explicitly told not to give direct links to the full texts but to the landing page instead (for various reasons). If a direct link to the PDF is really preferred (by a guideline somewhere on Wikipedia), then the CiteSeerX identifier should be updated to point directly to the cached PDF (and same for arXiv), as the PDF url can be obtained directly from the identifier. − Pintoch (talk) 12:44, 30 August 2017 (UTC)Reply
CERN links are hard to find. They're buried at the bottom of a page containing videos, and half a million other links. We should put readers first, not repository managers first. Go to [8]; where is the relevant link? It will take you a while to find it. Headbomb {t · c · p · b} 12:56, 30 August 2017 (UTC)Reply
For me it took maybe a couple of seconds (without knowing the repository software). There is a clear "PDF" link text and icon, with good contrast, in a clearly delimited area, in a predictable position, without a need for JavaScript, localised in my language. This is not a case of a hard-to-find PDF. Additionally, what if the user is interested in the video after all? From the PDF URL they'll almost never be able to go back to the record. Nemo 13:06, 30 August 2017 (UTC)Reply
For me it took about 2 minutes, because I thought it was the video, and that didn't make any sense. Clicking on download also didn't give me the paper I was looking for. Then I scrolled to the bottom of the box, and there was still no link, so I went back up and dug in "files", where I finally found the link. Headbomb {t · c · p · b} 13:20, 30 August 2017 (UTC)Reply
I do realise that people react differently. For instance I tend to not click anything (I'm particularly video-blind) and to use page up/down or "end" abundantly. But still, you'll probably agree this repository is a masterpiece in usability compared to, say, Elsevier's websites. --Nemo 13:43, 30 August 2017 (UTC)Reply
I'm not saying a link shouldn't be given, but it should be a link to the document, rather than making the reader hunt for it, otherwise they'll think it's a link added in mistake, or a link only containing superficial information about the document. Headbomb {t · c · p · b} 14:16, 30 August 2017 (UTC)Reply
I think for me the safest way to reject this change is simply to say: I am happy to deploy this change if somebody takes the time to write the code for it... − Pintoch (talk) 17:12, 30 August 2017 (UTC)Reply

Article size

Sometimes the tool seems to time out on some articles. Do we know the largest article size or number of links it can handle? For now the biggest I found in my testing is [9], I think. --Nemo 12:38, 30 August 2017 (UTC)Reply

That is a problem indeed. No, I don't know what the maximum size would be. Note that there is some caching at reference scale, so the request could potentially complete if you try a second time. − Pintoch (talk) 12:45, 30 August 2017 (UTC)Reply

OAbot usage

Is there any way to see the OAbot edits a user has made? I found what seemed to be an error in a CiteSeerX link - namely, the link did not go to a full article but rather to a notice that said

[Unless] you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at

So, then I look up in the upper right hand corner and see a pdf link, which has the following info within the URL - www(dot)employees(dot)csbsju(dot)edu - and which, from the link, I can see is a link for a specific Economics class taught at the College of Saint Benedict/Saint John's University in Minnesota. I'm not sure this is all strictly legal-ish, to get at this article's content through a link that clearly isn't posted for public use. Shearonink (talk) 15:00, 2 September 2017 (UTC)Reply

I'm not sure I understand your question or what case you're talking about, but if the link gets interrupted for copyright reasons that's one more reason to consider the CiteSeerX links ok: it means they handle copyright notices so we need not worry about what remains up. As for the legal implications of linking, it's generally ok to link resources which are already public on the web (see e.g. CJEU on "new public"). --Nemo 09:08, 3 September 2017 (UTC)Reply

According to § How does the bot work?, the bot should only be looking for links when the existing linking isn't the full free article. That seems in keeping with general ideas of maximum compliance with WP:V and making as much source material as possible reachable to readers, while simultaneously not bloating refs with redundant external links. Things like DOI and PMID have the advantage of being stable and redirecting properly, whereas links to publishers' websites are susceptible to linkrot if the publisher changes their website. I noticed a series of edits today where following the doi link allows me to access the free full article, but using OABot still added a direct link to it (example). Is that intended? DMacks (talk) 02:22, 24 October 2017 (UTC)Reply

As an alternative, the CS1 citation templates have a field specifically to identify when a standard identifier does provide the full content for free (vs just free abstract and possible link to paywalled full). See Template:Cite/doc#Access level of identifiers for details. Seems like it would be preferable to note that an existing stable link is already free vs adding a less-stable additional link that goes to the same provider. DMacks (talk) 02:30, 24 October 2017 (UTC)Reply
I don't see how adding an extra (accessible) link is a problem. It's a problem only to add paywalled links. :) --Nemo 14:23, 24 October 2017 (UTC)Reply
I agree we could try to avoid proposing these links; it's just not entirely straightforward, but I'll try to look into it − Pintoch (talk) 15:12, 24 October 2017 (UTC)Reply
Actually according to the docs (WP:OABOT#Examples), it already is supposed to know about and add |doi-access=free tags. Maybe do a regex or string comparison of the proposed link and the doi (or other identifier) and if they match to within some closeness (same hostname and maybe some later string details) assume that the publisher itself (target of the doi link) hosts the free content (the url link), and presumably one can get to the free version via the doi. DMacks (talk) 07:48, 27 October 2017 (UTC)Reply
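That comparison could be as simple as the following sketch (resolving the DOI with a HEAD request and comparing hostnames; some publishers block HEAD requests, so this is an approximation of the idea, not a tested implementation):

<syntaxhighlight lang="python">
# Sketch: if the proposed free link lives on the same host as the DOI's
# landing page, the publisher itself hosts the free copy, and
# |doi-access=free is probably the better annotation.
from urllib.parse import urlparse
import requests

def norm_host(url):
    # Normalise so www.example.org and example.org compare equal.
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def same_publisher(proposed_url, doi):
    landing = requests.head(f"https://doi.org/{doi}",
                            allow_redirects=True, timeout=30).url
    return norm_host(proposed_url) == norm_host(landing)
</syntaxhighlight>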

How to run this bot on a specific page?

Sorry if I'm missing it. Can I run this bot on a page similar to checklinks or the citation bot? Thanks. - Scarpy (talk) 16:47, 24 October 2017 (UTC)Reply

Click "Start editing a random page", and then at the top you'll have an input field where you can type the name of the page you want to analyze. Analysis will take a long time, though. − Pintoch (talk) 17:16, 24 October 2017 (UTC)Reply

Issues

Hello. I am having several issues:

  • The random pages button isn't working properly. It's giving me pages to which other Wikipedians have already added links with OAbot, e.g. Biotechnology, Ant and Economics. I've only just started using this bot today, so is it starting the list from scratch for me?
  • Also, edits are not being made for me on Firefox/Microsoft Edge for Brain. I've checked the Wikipedia article to see if the link is already used (it isn't), but for some reason I can't add it. Thanks --MrLinkinPark333 (talk) 22:10, 24 October 2017 (UTC)Reply

Rights checking

Hi,

I wanted to free some references as my birthday gift, and ended up checking the rights on the proposed link.

Alexander technique and Feldenkrais method: a critical overview, by Sanjiv Jain, MD; Kristy Janssen, PA-C; and Sharon DeCelle, MS, PT, CFT

However, according to http://www.sherpa.ac.uk/romeo/search.php (checking what authors can do) "author cannot archive publisher's version/PDF". The pdf proposed in the link is clearly the publisher's version. (http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=DCD202DCC6ABDCDB516E1C80F4CB94AC?doi=10.1.1.611.4183&rep=rep1&type=pdf)

So I wonder about the rights on publications available through CiteSeerX.

Using the bot, what is the interaction with the authors? In my opinion, changing access to science should be done upstream, not downstream. I emailed Sanjiv. I'll form a firmer opinion as I gather more such experiences.

--RP87 (talk) 08:11, 25 October 2017 (UTC)Reply

RP87, I don't know what you mean by "upstream", but sure, it's better if the articles are archived by the authors themselves (green open access).
Nemo 07:39, 26 October 2017 (UTC)Reply

[10] is obviously a bogus URL. Is the bot confusing url with title (trying to construct a wiki-formatted link with visible alternative text) but then inducing editors to paste that as the "url" itself or is it failing to urlencode the url to protect whitespace? DMacks (talk) 17:34, 26 October 2017 (UTC)Reply

This is indeed a whitespace error, I will fix it in the tool. − Pintoch (talk) 08:46, 27 October 2017 (UTC)Reply

Please fix the bot or stop

Whatever is going on with this bot, people are using it to add, and even re-add, ELNEVER violations. See this followed by this, for example. I don't know if it is the bot or people not being careful enough using it, but either way COPYVIO additions are being added throughout WP. Jytdog (talk) 15:02, 28 October 2017 (UTC)Reply

Those are not WP:ELNEVER violations by a longshot. Headbomb {t · c · p · b} 18:20, 28 October 2017 (UTC)Reply
The paper that OAbot suggested a link for was published by Liebert and their policy is here and says authors can post preprints but says in bold: "The final published article (version of record) can never be archived in a repository, preprint server, or research network." The link there is to the final published article.
OA is important but ELNEVER must be followed. Jytdog (talk) 21:21, 28 October 2017 (UTC)Reply
Author webpages are neither repositories, preprint servers, nor research networks. Headbomb {t · c · p · b} 21:32, 28 October 2017 (UTC)Reply
I am going to post at ANI to have this bot paused. Done here. Jytdog (talk) 22:17, 28 October 2017 (UTC)Reply
Jytdog, author rights for sharing their paper (and which version) can be determined at this website, which we link to right in the tool on every page where you can add a link: http://www.sherpa.ac.uk/romeo/index.php Ocaasi (WMF) (talk) 23:04, 28 October 2017 (UTC)Reply
People appear to be ignoring that tool. And I just put in "science" and while the detailed entry is correct (final published version not allowed to be posted), the summary version on the results page (which I can't link to) is incorrect and lists Science as "green" (OK to post final published version). Jytdog (talk) 23:06, 28 October 2017 (UTC)Reply
The solution is to educate people, not to shut the bot down, people are responsible for the edits they make. Headbomb {t · c · p · b} 23:12, 28 October 2017 (UTC)Reply
I just had to revert this. How on earth can kitsrus.com be treated as a credible repository for reprints from the NEJM? LeadSongDog come howl! 20:09, 2 November 2017 (UTC)Reply

ANI

  There is currently a discussion at Wikipedia:Administrators' noticeboard/Incidents regarding an issue with which you may have been involved. Jytdog (talk) 22:41, 28 October 2017 (UTC)Reply

IEEE Article

I just saw this OABOT edit that added a link to an MIT website that apparently is user "benmv"'s "public" subdirectory. The 2002 paper's authors are Han and Thorup; neither of them appear to be at MIT. Neither appears to be "benmv".

Querying https://dissem.in/api/10.1109/SFCS.2002.1181890 finds nothing.

Querying oaDOI produces a hit, but it looks like a COPYLINK. That makes me doubt the source can be trusted. Glrx (talk) 23:44, 3 November 2017 (UTC)Reply

Bot adds pmc= to citation when PMC= is already present?

This edit is tagged as using the bot to add a |pmc= parameter when |PMC= was already present. If this is a bot bug, can you please try to fix it? Thanks. – Jonesey95 (talk) 15:38, 12 November 2017 (UTC)Reply

Ran into this today. I think it should be clear from the reference snippet when reviewing, but still, it shouldn't show up in the bot's queue. ~ Amory (utc) 00:44, 24 April 2018 (UTC)Reply

New maintainers

Hi CristianCantoro, Nemo bis, Ocaasi and Samwalton9, I have added you as maintainers of oabot on Toolforge so you should be able to deploy the changes made to the tool yourselves. It works as follows:

  • ssh to tools-login.wmflabs.org with your wikitech account
  • run the command "become oabot"
  • "cd www/python" to go to the directory where the source code of the bot lives
  • "git checkout master ; git pull" to sync the code from github
  • "webservice uwsgi-python restart" to restart the web server

Cheers, − Pintoch (talk) 16:06, 15 November 2017 (UTC)Reply

@Pintoch: Thanks! Sam Walton (talk) 18:06, 15 November 2017 (UTC)Reply
Thank you :) --CristianCantoro (talk) 14:15, 16 November 2017 (UTC)Reply

OABot is unable to parse these references

OABot is currently unable to add links in articles that do not use reference templates, like this one. Can we add a citation parser to OABot so that it can find links for these references? Jarble (talk) 17:03, 7 January 2018 (UTC)Reply

I would love to see that happen. I do not currently have the time to work on it, but I am keen to help anyone find their way around the codebase and to brainstorm. − Pintoch (talk) 17:51, 8 January 2018 (UTC)Reply
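For anyone brainstorming, the first step such a parser would need is identifier extraction from plain text; a sketch follows (the DOI pattern is the commonly recommended Crossref-style regex, an approximation rather than a full grammar):

<syntaxhighlight lang="python">
# Sketch: pull DOIs and PMIDs out of untemplated references.
import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')
PMID_RE = re.compile(r'\bPMID[:\s]*(\d{1,8})\b', re.IGNORECASE)

def extract_identifiers(reference_text):
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    dois = [d.rstrip('.,;') for d in DOI_RE.findall(reference_text)]
    pmids = PMID_RE.findall(reference_text)
    return {"dois": dois, "pmids": pmids}
</syntaxhighlight>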
@Pintoch: Instead of using a citation parser, it would also be possible to find open-access documents using scholar.py. As far as I know, OABot is not yet capable of doing this. Jarble (talk) 00:04, 21 January 2018 (UTC)Reply
@Jarble: I do not think this is feasible because of the rate limits imposed by Google Scholar. I also think it is against their terms of service. But I would be happy to be proved wrong. − Pintoch (talk) 19:32, 21 January 2018 (UTC)Reply

In Special:Diff/819541202 the bot (apparently under manual control) added an arXiv link to a paper, in the format |url=http://arxiv.org/pdf/math/9805045. The correct format for such links is instead |arxiv=math/9805045. Please fix. —David Eppstein (talk) 23:55, 9 January 2018 (UTC)Reply

@David Eppstein: thanks for the bug report. Will do. − Pintoch (talk) 13:30, 13 January 2018 (UTC)Reply
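The fix might amount to a small rewrite rule along these lines (a sketch, not the tool's actual code; the regex covers old-style math/9805045 and new-style 1501.00001 identifiers, and edge cases may remain):

<syntaxhighlight lang="python">
# Sketch: recognise arXiv abs/pdf URLs in |url= and emit an |arxiv= ID.
import re

ARXIV_URL_RE = re.compile(
    r'https?://arxiv\.org/(?:abs|pdf)/'
    r'([a-z\-]+(?:\.[A-Z]{2})?/\d{7}|\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$')

def url_to_arxiv_id(url):
    m = ARXIV_URL_RE.match(url.strip())
    return m.group(1) if m else None  # e.g. "math/9805045"
</syntaxhighlight>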

Zenodo

Please remove Zenodo from the sites where OABOT looks for publications to link to. I keep finding witless users of OABOT adding links to Zenodo, and Zenodo seems to have no screen for what people upload. The last diff I reverted like this was this one. Jytdog (talk) 19:53, 20 April 2018 (UTC)Reply

How do you know the author did not gain authorisation for that upload? --Nemo 20:51, 20 April 2018 (UTC)Reply
That's the wrong question. Per WP:ELNEVER "External links to websites that display copyrighted works are acceptable as long as the website is manifestly run, maintained or owned by the copyright owner; the website has licensed the work from the owner; or it uses the work in a way compliant with fair use." Since Zenodo is NOT run, maintained, or owned by the copyright owners of these papers, and does not display any evidence that the work has been properly licensed, we should not use its links. In this specific case, the pdf file is clearly directly copied from the journal website, and is not labeled as open there, so it has all appearances of being a copyright violation. But even in less clear cases we need evidence that it is not a copyright violation, not just a lack of evidence that it is one. —David Eppstein (talk) 21:22, 20 April 2018 (UTC)Reply
By this reasoning, we should not use any institutional repository. Your reading of the policy is therefore clearly wrong. --Nemo 22:23, 20 April 2018 (UTC)Reply
This is our second go round and you are displaying no more awareness of your responsibility to follow policy than the first time. You are obligated to ensure that links that you add are OK. Jytdog (talk) 22:34, 20 April 2018 (UTC)Reply
@Nemo bis: We can use institutional repositories when they clearly display the provenance of the files they provide and where that provenance clearly indicates that it was provided by the authors of the paper. So arXiv is almost always ok, because the arXiv metadata page (which is the page you should always be linking, not the direct link to the pdf, unlike several of your recent edits) states who uploaded the paper and it is almost always an author. CiteSeerX is sometimes ok, because the metadata page clearly states where CiteSeerX found the file and sometimes that is a repository or web page controlled by the author, but you need to check carefully rather than just assuming. Repositories that just scrape the web and do not indicate the provenance of their files are almost certainly not ok. It may be the case that they found their files through an author site, but you have no way of knowing that, and in any case it would be preferable to just directly link to the author site. —David Eppstein (talk) 23:08, 20 April 2018 (UTC)Reply
Thank you for qualifying your thought. This reading of the policy is already more workable, and one that Zenodo complies with. Zenodo states who uploaded the file, in the metadata section. It's true that, unlike OSF or arXiv, it doesn't necessarily provide the individual author with the permissions to alter the record, but corrections are still possible if requested and they're responsive to error reports: I believe this is what matters to achieve the goals you stated. --Nemo 21:50, 28 April 2018 (UTC)Reply
  • This is now under discussion at ANI. Jytdog (talk) 22:02, 5 November 2018 (UTC)Reply
  • You know, I did see one of these Zenodo links on my watchlist and was a little concerned. I mean, with PubMed Central there is usually a contractual requirement with the authors that the article be deposited in PMC; so we can be sure that the publication is legit. Zenodo on the other hand seems like it's often not clear. Jo-Jo Eumerus (talk, contributions) 07:10, 6 November 2018 (UTC)Reply
    • User:Jo-Jo Eumerus my understanding of zenodo is that a) they encourage scientists to upload articles and b) they don't check if the scientists are uploading appropriate versions. Many of the final published versions that have been linked to in WP via zenodo via OABOT that I have checked were COPYLINK violations. If you are unaware, the OABOT has four instructions, and the fourth one is "Is the new link likely copyright-compliant?". People are not doing that last step well, too often, as they make swift runs with OABOT. It takes time and some knowledge to check. Jytdog (talk) 14:51, 6 November 2018 (UTC)Reply

Automate PubMedCentral

In my survey of link proposals today, I noticed that the quality of PubMedCentral suggestions has reached a very good level. Out of ~250 PMC IDs found, only 2 or 3 were wrong, I think (just 404). I don't know how many such IDs are left, but considering the importance of medical references, it would be nice to ensure we use them comprehensively. An actual bot run may be in order. --Nemo 22:20, 20 April 2018 (UTC)Reply

These are the top suggested domains now:

 21446 http://citeseerx.ist.psu.edu
  8145 https://www.ncbi.nlm.nih.gov
  3461 http://www.ncbi.nlm.nih.gov
  3213 http://arxiv.org
  1297 http://onlinelibrary.wiley.com

I think the arXiv and PubMedCentral identifiers/links are uncontroversial and a bot run would be well received. --Nemo 07:22, 23 April 2018 (UTC)Reply

  • comment I would support, however more opinions should be sought from other editors...IMO--Ozzie10aaaa (talk) 11:40, 23 April 2018 (UTC)Reply
  • I support linking to ncbi, arxiv, and wiley. The CiteSeerx repository listed above seems less clearly a holder of the information being served; for example, a search for my own name yields a completely wrong affiliation (not a big deal, but suggestive that it may not be well-maintained, since I've never had that affiliation) and more than one of those publications' copyright was originally owned by the publisher (e.g. ASM) and should not, AFAIK, be served by CiteSeerx (I wish all of my publications were open access, but that's not up to me). Perhaps they have some arrangement with ASM, Rockefeller, etc. of which I'm unaware - or perhaps these have all been made open access by the respective publishers/copyright holders - not claiming that someone's copyright has been infringed or that the database is improper, but it's not clear from their site how they justify distributing so much material that may be protected elsewhere. Might want to be careful if provenance is unclear. — soupvector (talk) 16:05, 23 April 2018 (UTC)Reply
  • In favor of PMC — clear and trustworthy repository, wouldn't ever think twice about something found there. As above, though, citeseerx gives me pause; I'm new to the tool, but have been very unsure of those links. ~ Amory (utc) 00:43, 24 April 2018 (UTC)Reply
  • I second the addition of the PubMedCentral (PMC) identifiers. The National Library of Medicine in the US has done a lot of solid work with this repository and any additional visibility stands to benefit readers. Lauren maggio (talk) 13:42, 24 April 2018 (UTC)Reply
  • If there's an identifier, the bot should add it. I contacted the CiteSeerX people directly last year, and they confirmed they take down copyright violations if they find any or are notified of them. So the bot should also remove/comment out CiteSeerX identifiers if they do not resolve. Headbomb {t · c · p · b} 15:15, 24 April 2018 (UTC)Reply
  • Absolutely add PMCid where available: it contributes to wp:V to have more users able to read the sources, and the PMC version has only minor variations from the publisher's final version. The www.ncbi.nlm.nih.gov targets should be the same for http and https protocols, so why would the counts differ at all? I'm not so sure about arXiv: how consistent are these articles with their peer-reviewed versions, or did they even get through peer review? LeadSongDog come howl! 20:31, 24 April 2018 (UTC)Reply
    @LeadSongDog: The arXiv links have had multiple bots approved to add them: User:Bibcode Bot (temporarily? dead) and User:Citation bot. The links are considered desirable. They are not the version of record, so they are not automatically linked like PMC is, but they are significantly better than no access at all. Most will be fairly close to the final product, only differing in formatting and minor revisions. Headbomb {t · c · p · b} 16:07, 27 April 2018 (UTC)Reply
    Seems reasonable, thanks. LeadSongDog come howl! 16:33, 27 April 2018 (UTC)Reply
    LeadSongDog, the count differs because some suggestions have the PMC URL recorded in HTTPS and others in HTTP, presumably for some variation in the data sources. This doesn't change the edit (which is only about adding the identifier, not a full URL), so the two numbers have to be added up to get the total number of suggestions. --Nemo 21:53, 28 April 2018 (UTC)Reply
    Well, NCBI has been https only for quite some time now, with an automatic redirect from http but anyhow... the counts would only add if there is no overlap between the two. LeadSongDog come howl! 15:20, 7 May 2018 (UTC)Reply
  • A request for bot approval was filed here: Wikipedia:Bots/Requests_for_approval/OAbot_2. Feel free to comment there. − Pintoch (talk) 23:05, 28 April 2018 (UTC)Reply
  • The bot is now running. --Nemo 15:23, 15 May 2018 (UTC)Reply
  • The first run is completed, OAbot is now looking for new URLs (it might take a few days). --Nemo 21:18, 16 May 2018 (UTC)Reply
  • It's now doing about a thousand more edits. --Nemo 09:24, 22 May 2018 (UTC)Reply

False positive

The link added in this edit made using OAbot to the Coronary artery bypass surgery page goes to a different article from the one cited. Broadly similar, but they're not the same articles. Graham87 06:57, 21 April 2018 (UTC)Reply

Thank you for reporting, this was a genuine error. --Nemo 21:32, 28 April 2018 (UTC)Reply

Not a huge deal, but it seems the bot considers a DOI with http and https to be different links. I was offered on Language complexity to add http://doi.org/10.1075/hl.39.2-3.08jos when the reference already contained https://doi.org/10.1075/hl.39.2-3.08jos — going the other way (replacing http with https) would be a fine edit, I guess. ~ Amory (utc) 20:46, 27 April 2018 (UTC)Reply

Thanks for reporting this odd suggestion.
It's true that the suggestions are not able to prefer HTTPS over HTTP (for instance see the statistics above: PubMed is included both in HTTP and HTTPS), however I'm not sure this is a problem for the DOI. The proposed edit would add a "doi" parameter which uses HTTPS (per local templates and m:IWM). If anything, I suspect the problem is that the uppercase "DOI" template parameter is not being recognised as synonymous with "doi". I'll try to check if some duplicate parameters have been added. --Nemo 21:37, 28 April 2018 (UTC)Reply
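One way to catch this would be to normalise both links to a bare DOI before comparing; a sketch of that idea:

<syntaxhighlight lang="python">
# Sketch: treat http/https and doi.org/dx.doi.org variants as equal.
import re

DOI_URL_RE = re.compile(r'^https?://(?:dx\.)?doi\.org/(.+)$', re.IGNORECASE)

def bare_doi(link):
    m = DOI_URL_RE.match(link.strip())
    return m.group(1).lower() if m else None

def is_duplicate(existing_url, proposed_url):
    a, b = bare_doi(existing_url), bare_doi(proposed_url)
    return a is not None and a == b
</syntaxhighlight>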

Maybe it's just me (Firefox, macOS) but whenever the bot suggests a Zenodo pdf, my browser is automatically prompted to download it. It's rather jarring, and doesn't happen for any other source. ~ Amory (utc) 19:56, 11 May 2018 (UTC)Reply

Do you mean the PDF URLs like [11] or the record URLs like [12]? The latter loads a PDF reader, whose behaviour may depend on your JavaScript preferences inter alia. The former will most likely depend on your content preferences. --Nemo 20:04, 11 May 2018 (UTC)Reply
It's the former, but I haven't clicked any link or tried to open the file. When the OABot interface loads, I'm automatically prompted as if the link had been opened, but only for Zenodo pdfs. Try this for an example; it happens in Chrome as well. ~ Amory (utc) 20:16, 11 May 2018 (UTC)Reply
Ah, I had not understood you meant in the tool. This is to load the preview of the link at the bottom of the page, which may or may not work depending on your browser's configuration, HTTPS and the website's HTTP headers. The only way to provide a consistent experience is to cache the files and serve them ourselves. That's not been done yet. --Nemo 22:06, 11 May 2018 (UTC)Reply

  Done, this should be fixed now. --Nemo 21:08, 16 May 2018 (UTC)Reply

NBER

Edits like this one are unnecessary, since the given DOI already links to PDF of the NBER paper. --bender235 (talk) 19:53, 25 May 2018 (UTC)Reply

I personally don't add many such links, you could say I agree. There are however different opinions on this:
  • using the url= parameter is an effective way to convey the fact that the paper is actually accessible (the alternative would be to use |doi-access=free, which however is less known and not preferred yet; it doesn't even linkify the title);
  • some users appreciate having a direct link to the PDF; the DOI usually redirects to a record, where the PDF or full text link might be very hard to find and/or the user may face significant hurdles from the interface (JavaScript, banners and so on).
--Nemo 20:00, 25 May 2018 (UTC)Reply
I generally subscribe to the first point (the second doesn't apply in this case, but is indeed generally true). I'm inclined to add such urls as I think a hyperlinked title — with the accompanying pdf symbol and "(PDF)" text — makes it clear to readers they can read the document with just a click. DOIs are not universally known or understood; I would guess folks are less familiar with them than ISBNs. Moreover, if you rarely have access to items with a DOI, you may have learned not to try and click them. Basically, I think adding the direct url parameter increases access for all curious readers. ~ Amory (utc) 20:16, 25 May 2018 (UTC)Reply
I personally find direct links to the pdf version of a paper, instead of links to the metadata page, annoying. When we link to arXiv, for instance, we should always link to the abstract page, not to the pdf. The reason is that if I want the pdf I can easily get it from the metadata page but if I want the metadata it is usually much more difficult to get it from the pdf link. So my preference, if a doi exists, is to use only that, and not to add redundant extra links. —David Eppstein (talk) 21:03, 25 May 2018 (UTC)Reply
If I had to choose, sure, but one doesn't preclude the other — a DOI, pmid, or arxiv link are still present — and having the choice can't hurt. ~ Amory (utc) 20:45, 27 May 2018 (UTC)Reply
It can hurt. It misleads the reader, into thinking there are multiple distinct sources for the paper when really there is only one source, linked twice. —David Eppstein (talk) 21:28, 27 May 2018 (UTC)Reply

CiteSeerX

I'm seeing a lot of bad CiteSeerX links added (ones where the source listed by CiteSeerX is clearly not the original author of the paper, but a class site at a different school or other similar violations of WP:ELNEVER). Is there some way to admonish OABOT users more strongly to check this and if they are unsure not to add the link? Or do we have to disable this feature? —David Eppstein (talk) 17:21, 20 August 2018 (UTC)Reply

Bot added doi when DOI was already present

Here, the bot added |DOI= when |doi= was already present. DferDaisy (talk) 00:44, 6 November 2018 (UTC)Reply

Also at Romani Americans (here), at Romani people in Austria (here), at Romani people in Germany (here), and at Romani people in the Republic of Macedonia (here). Please can this be fixed so that it stops causing errors in citations? DferDaisy (talk) 00:48, 10 November 2018 (UTC)Reply
Thanks. Fixing. Nemo 15:11, 10 November 2018 (UTC)Reply
I think that's it: [13] [14] [15]. Nemo 17:21, 10 November 2018 (UTC)Reply
Thanks. DferDaisy (talk) 14:51, 11 November 2018 (UTC)Reply

404 - Not Found

Is the bot broken? Moved? I keep getting errors. --Nessie (talk) 18:48, 12 February 2019 (UTC)Reply

The code didn't survive the operating system upgrade on Toolforge: phabricator:T215789. Nemo 19:47, 12 February 2019 (UTC)Reply
Pintoch fixed it! Nemo 18:33, 13 March 2019 (UTC)Reply

Zenodo

I thought Zenodo had been removed from OAbot due to the large numbers of copyright violating links there? Guy (help!) 16:19, 8 September 2019 (UTC)Reply

error-generating edit

This edit seems to have generated a parsing error in {{cite journal}}. "url= missing title". I suppose it is the prior presence of "title-link", which I would regard as superior to zenodo. Citation handling is presently unstable anyway but the upshot is problematic for editors concerned with careful citing. Thincat (talk) 09:07, 11 September 2019 (UTC)Reply

I had also reported the problem here where it has been suggested that OAbot should not add url= if title-link is present. I agree with this: removing title-link would rarely be appropriate. Thincat (talk) 13:57, 11 September 2019 (UTC)Reply

Thank you for the report. It is very rare that such parameter conflicts arise: less than 1 in 1000 edits, as far as I can see. Normally the user resolves them shortly after adding the link.
I regularly go to Category:CS1 errors: URL–wikilink conflict, where there is a small but constant influx of pages with such a warning due to human "error" (just a mild confusion really), and I find that it's considerably easier to just fix them. Nemo 16:30, 11 September 2019 (UTC)Reply

https://en.wikipedia.org/w/index.php?title=Koopmans%27_theorem&curid=2178720&diff=917535596&oldid=911113372 The added file link duplicates the existing Handle (hdl) link. AManWithNoPlan (talk) 13:16, 24 September 2019 (UTC)Reply

The only way to avoid this is for Citation bot to add hdl-access=free when it in fact knows the full text at that location is open access. Nemo 13:56, 24 September 2019 (UTC)Reply
That's hard to tell, but I think that some work could be put into noticing that the new url and the handle are the same. AManWithNoPlan (talk) 17:31, 24 September 2019 (UTC)Reply
Yes, but then we should also verify whether the handle actually works and coincides with the same PDF, which is not foolproof. When we get the non-handle URL from Unpaywall, it's usually because the repository does not consider it its primary identifier (as exposed by OAI-PMH). OAbot also currently doesn't mess with access parameters for existing identifiers, although that might change if the automatic addition of them is approved. Nemo 19:22, 24 September 2019 (UTC)Reply
I've added a simple string check now, for the existing hdl parameters. Maybe it works. Nemo 08:30, 27 September 2019 (UTC)Reply
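Presumably something as small as the following (a guess at the shape of that check, not the actual code):

<syntaxhighlight lang="python">
# Sketch: skip a proposed URL when the citation's existing |hdl= value
# already appears inside it, so links that merely resolve the same
# handle are not re-added.
def duplicates_existing_hdl(proposed_url, hdl_value):
    hdl_value = (hdl_value or "").strip()
    return bool(hdl_value) and hdl_value in proposed_url
</syntaxhighlight>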

Wildly inappropriate

Why is OAbot adding links to a crank website? [16]. This website hosts a mix of copyright violations, misleadingly presented documents and fringe editorialising. It is completely inappropriate as a source on Wikipedia. Guy (help!) 06:39, 25 September 2019 (UTC)Reply

The cold fusion article cites "Thermal Behavior of Polarized Pd/D Electrodes Prepared by Co-Deposition", so it adds a link to a free version of that article. That the site is crank is irrelevant, given the source itself is a crank one. If there's a copyright issue, that's different, but it's completely orthogonal to the site being reliable (wrt to facts) or not. Headbomb {t · c · p · b} 06:46, 25 September 2019 (UTC)Reply
However, here the issue mostly seems to be that a conference of the same title exists, and the bot matches the wrong version of the paper. Headbomb {t · c · p · b} 06:48, 25 September 2019 (UTC)Reply
That's easy to fix though, with the next suggestion by https://fatcat.wiki/ . Nemo 06:56, 25 September 2019 (UTC)Reply
The issue is that you cannot trust that site. It hosts material in violation of copyright, it also includes modified and editorialised versions of papers but presents them as if they were the originals. That site is not appropriate for use on Wikipedia. Guy (help!) 12:46, 25 September 2019 (UTC)Reply

Breaks citation template by putting url on wrong parameter

See this diff. It concerns a {{citation}} template with |contribution=, |title=, and |title-link=. By adding |url=, OABOT broke the template (it cannot simultaneously handle title, title-link, and url). If it should have been added at all, it should have been added as |contribution-url=. —David Eppstein (talk) 20:08, 29 September 2019 (UTC)Reply
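The guard being asked for could look like this sketch (written against mwparserfromhell template objects; how the tool actually represents parameters is an assumption about its internals):

<syntaxhighlight lang="python">
# Sketch: refuse to add |url= when the title is already linked, since
# CS1/2 can link the title only once.
def can_add_url(template):
    if template.has("title-link") and str(template.get("title-link").value).strip():
        return False
    if template.has("title") and "[[" in str(template.get("title").value):
        return False  # |title= already contains a wikilink
    return not template.has("url")
</syntaxhighlight>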

I have been using OABOT recently as adding open access links to article PDFs is a valuable form of referencing. However, many of my edits were recently reverted with the same language, "semanticscholar is neither the author nor publisher of this work and should not be linked directly like this." When I followed up with the editor who did this, the response was (in part): "To link to a copy of a published journal paper, we need convincing evidence that the paper is not a copyright violation, posted against the publisher's licensing requirements. Direct links to the author's web site are ok, as are direct links to the publisher. Links to institutional repositories where the author works are ok. CiteSeerX may or may not be ok, but you can always check it and tell, because it has a landing page that shows you where it got the copy of the paper from. Semanticscholar has no such thing. It is just a direct link to a pdf file with no evidence attached to it of where it comes from or whether it should have been made freely available. So we can't use those links, and you should not be adding them." This reasoning makes sense to me, though it seems counter to OABOT's work, so I am bringing this issue here for community clarification.

This is the first I have heard of something like this, and while I strive to avoid copyright violations, I am now puzzled as OABOT is built upon using SemanticScholar. Can somebody help me clarify this, namely to what extent SemanticScholar itself indexes open-access articles that we can then use within Wikipedia citations, or whether Semantic Scholar PDFs themselves are questionable and we should avoid using them, and thus avoid using OABOT itself? Thanks for helping me to sort this out. --- FULBERT (talk) 14:20, 19 January 2020 (UTC)Reply

I don't know who made these claims, but they are wrong. Semantic Scholar uses (click "Sources" at the bottom of https://www.semanticscholar.org/) PDFs from the Association for Computational Linguistics, the Association for Computing Machinery, AMiner, ArXiv, BioOne, CiteSeer, DBLP, De Gruyter, Frontiers, HighWire Press, Hyper Articles en Ligne, IEEE, Karger, Microsoft Academic, MIT Press, OdySci Academic, Project MUSE, PubMed, The Royal Society, SAGE Publishers, Science, SciTePress, SPIE, Springer Nature, Taylor & Francis, and Wolters Kluwer. All of these are high reputation repositories, journals, publishers, and will all comply with copyright laws and other DMCA takedowns (as does Semantic Scholar itself). If there are copyright violations, it is an exception, rather than the norm. Headbomb {t · c · p · b} 14:33, 19 January 2020 (UTC)Reply
Last Summer's Wikipedia:Reliable sources/Noticeboard/Archive 272#Semantic Scholar seems to indicate that it's a generally license-compliant site. DMacks (talk) 15:35, 19 January 2020 (UTC)Reply
Where is the evidence that its pdfs come with permission of the publishers? All you get from semanticscholar is a bare pdf. For CiteSeerX, there is a landing page that shows the provenance of the files. Here there is nothing like that. "I found it on the internet and I never heard about them getting caught for copyright violations" is not a valid justification. —David Eppstein (talk) 19:59, 19 January 2020 (UTC)Reply
Just for the record, I did not participate in that discussion and am not posting it in support of any thoughts on my own, just as a pointer to a previous relevant discussion. DMacks (talk) 20:01, 19 January 2020 (UTC)Reply
"Where is the evidence that its pdfs come with permission of the publishers?" SemanticsScholar having partnerships with those repositories and publishers, and not being sued into oblivion, for one. What's next, asking Springer or Nature if they have permission to publish an authors work? The working hypothesis that Semantics Scholar steals papers is unreasonable. Headbomb {t · c · p · b} 03:52, 20 January 2020 (UTC)Reply
The hypothesis that it steals papers is borne out by my experiment below. But in general, we can't just link to copies of paywalled material because we found it on the web somewhere. We need a reason to believe that it is legitimate, rather than closing our eyes, burying our heads in the sand, and trying really hard not to see the evidence that it is illegitimate. —David Eppstein (talk) 04:44, 20 January 2020 (UTC)Reply

Ok, as an experiment, I tried looking up myself on SemanticScholar [17]. The second paper it lists for me is "The Crust and the beta-Skeleton: Combinatorial Curve Reconstruction", with a linked PDF. The paper has only ever been published in a paywalled journal. The linked PDF is the paywalled journal version. The landing page for the PDF does not have any indication of its provenance. Neither I nor either of my two coauthors have accounts on SemanticScholar (all our pages display the same "claim your account" banner), so we cannot have uploaded it there ourselves. I checked whether my coauthors perhaps uploaded the paper to the web and discovered that there is an author-uploaded version, but a different version, formatted as we submitted it rather than as the journal published it, so that version is not the one semanticscholar obtained. In short, basically the first thing I found when making even a very cursory search appears to definitely be a copyvio. Given both the ease of finding piracy there, and the difficulty of determining whether any particular link might be pirated, I believe I am justified both in reverting these link additions and in threatening blocks for anyone else who adds them. (Incidentally, I am not particularly upset that my paper is linked in this way, and am not planning to submit a takedown request to semanticscholar — as an individual researcher, I am happy for my work to become more available in this way. However, as a Wikipedia editor I believe that it would be very bad for Wikipedia to become generally thought of as a haven for piracy, and I think we should work to prevent actual Wikipedia-supported piracy when possible.) —David Eppstein (talk) 02:16, 20 January 2020 (UTC)Reply

If you signed away your copyright to the publisher, they can give a license to SemanticScholar or others. SemanticScholar states that it has licenses from (some) publishers. Given you are the author, there is no need for speculation: you can just ask the publisher directly whether they gave a license. Nemo 10:25, 20 January 2020 (UTC)Reply

I appreciate the discussion that has occurred here Nemo, David Eppstein, Headbomb, DMacks, as I find it informative. However, I think this warrants further clarification for a clearer consensus to help inform future action, specifically about the copyright issue of including Semantic Scholar pdfs within references. I started a conversation on the Reliable sources / Noticeboard for additional input. I will post whatever else I find about it here for others who may have similar questions about this. --- FULBERT (talk) 15:49, 20 January 2020 (UTC)Reply

I also have concerns over the abundance of proposed links pointing to pdfs hosted on the pdfs.semanticscholar.org domain. I'm preparing to lead an in-class activity (library science, information & society) in which students navigate the OABot to add OA links to Wikipedia articles. However, the process by which I am requiring them to consult the journal/publisher policies on Sherpa Romeo is leading me to find that the majority of the proposed links are to publisher versions of papers that are not permitted to be archived OA. Additionally, in many cases the existing citation contains a functional PMID that resolves to a legal OA copy. I love the idea of this bot project but believe the concerns raised by others are valid, and with the flood of pdfs hosted on semantic scholar, I don't feel comfortable answering my students' questions about whether Semantic Scholar would be considered an academic social network. If yes, then linking to submitted/accepted versions would be fine, but these are mostly published versions. Shackpoet (talk) 22:46, 22 September 2020 (UTC)Reply

Shackpoet, thanks for your comment. The current queue of suggestions is the result of an accelerated run to prepare the automated addition of some parameters. I'll now refresh it so that your students find easier edits. When is your course?
As for Semantic Scholar, it's not a social network but a digital preservation effort like the Internet Archive. Its main purpose is to index the full text of publicly available PDF files to provide various search and machine learning features, so it falls squarely under the criteria for fair use established by the Google Book Search Settlement Agreement. However, I understand that may be too much for the usual LIS course. Nemo 06:27, 23 September 2020 (UTC)Reply
Shackpoet, I have now updated the queue. There are now suggestions for about 1400 articles in the queue, which should suffice for your class. A relative majority of them are for links to https://www.biodiversitylibrary.org .
I left a few cases where the publisher's guidance is not to archive the work, such as the Journal of Zoology/Wiley "prohibiting" the archival of an 1897 article by Martin Jacoby. Such absurdities are illustrative of how publishers' claims need to be taken with a grain of salt and I suppose your LIS class happens after users have been taught the basics on the public domain, but if another set of examples would suit your class better let me know. Nemo 09:42, 23 September 2020 (UTC)Reply

Thanks Nemo, the class is this morning and it will make for a simpler intro editing activity to have this different set of articles that are not suggesting semanticscholar-hosted PDFs. I also appreciate the inclusion of an older piece to facilitate a discussion of PD/CC0. I'll look into your links relating semanticscholar to the Google Book Search Settlement as I'm intrigued at the fair use angle here. Shackpoet (talk) 16:34, 23 September 2020 (UTC)Reply

502 Bad Gateway

All I get is a bad gateway error.  — Chris Capoccia 💬 15:22, 22 June 2020 (UTC)Reply

Works for me, try deleting your cookies. I wonder if such problems will decrease now with the separate domain... Thank you for testing! Nemo 16:37, 22 June 2020 (UTC)Reply
Thanks for the tip. I deleted all the cookies for toolforge and now it's working again.  — Chris Capoccia 💬 19:22, 22 June 2020 (UTC)Reply

S2CID parameter vs URL to Semantic Scholar

Has the OABOT team considered using the S2CID parameter instead of a URL for Semantic Scholar?  — Chris Capoccia 💬 12:13, 30 June 2020 (UTC)Reply

The S2 crazy URLs regularly change and the PDF links expire, but the corpus ID never changes - excellent idea. AManWithNoPlan (talk) 12:24, 30 June 2020 (UTC)Reply
No, we never implement repository-specific URLs. We might skip citation templates which have an s2cid parameter in the future. Direct links to PDFs are useful, the links don't bitrot (they just redirect elsewhere in the worst case; less than 1 % of the URLs currently in the queue) and there is currently no other way to convey that the SemanticScholar URL actually provides the full text, so it's all good as is.
I'm just sorry that the removal of the URLs from the templates can now cause some extra manual work for those who use the tool, so a few days ago I removed all semanticscholar.org URLs from the queue. There's plenty of other work to do before these anyway. Nemo 12:36, 30 June 2020 (UTC)Reply
|s2cid-access=free conveys that. Which will soon autolink things. Headbomb {t · c · p · b} 12:39, 30 June 2020 (UTC)Reply
Yeah, "soon". In the meanwhile, it's used on a whopping 3 articles. Nemo 18:53, 30 June 2020 (UTC)Reply
Well it'd be used on more articles if bots supported it. It was only rolled out in April ish I believe and most S2CID links were added by Citation bot. Headbomb {t · c · p · b} 18:58, 30 June 2020 (UTC)Reply
@Headbomb: as things stand, no autolinking for Semantic Scholar will be rolled out in the next CS1/2 update - the RFC was only about DOIs. − Pintoch (talk) 21:06, 30 June 2020 (UTC)Reply
Then that should be rolled in the next update. There's no reason not to. Headbomb {t · c · p · b} 21:37, 30 June 2020 (UTC)Reply
I am clearly not going to oppose that. − Pintoch (talk) 06:34, 1 July 2020 (UTC)Reply

oabot and |title-link=

Recent changes to cs1|2 have added several articles to Category:CS1 errors: URL–wikilink conflict. Some of these are like this oabot edit where the bot appears to be ignoring the content of |title-link=. |title= can be linked to only one target. This isn't a new error, the change in cs1|2 applies a different error message and category.

This particular edit was a while ago. If the bot is still capable of making similar edits, please correct it. If already corrected, never mind.

Trappist the monk (talk) 14:53, 12 July 2020 (UTC)Reply

Thank you for the notification.
If I remember correctly, most cases were fixed a while ago by Pintoch after you wrote either here or on user talk:OAbot. I'd like a better solution than ignoring all citations with title-link entirely, though: it's fine to skip a public domain PDF when the citation already links a Wikisource copy, but it makes less sense when the link is to an English Wikipedia article which merely explains some term. Such internal links were usually moved to title-link by some bot, after being originally added as simple double-bracket links in the title parameter by users with no intention of overriding the URL. Nemo 15:11, 12 July 2020 (UTC)Reply
Still breaking cs1|2 citations. Here is the history of one cs1|2 template in On the cultivation of the plants belonging to the natural order of Proteeae:
All of this churn could have, should have, been avoided if OABOT would do the right thing and avoid adding |url= when |title-link= has a value or when |title= is wikilinked.
Please fix the bot.
Trappist the monk (talk) 12:14, 26 October 2020 (UTC)Reply
Trappist the monk, is there anything needed on my part? --Evolutionoftheuniverse (talk) 12:23, 26 October 2020 (UTC)Reply
Previewing your edits before you save them is always a good plan. Even though the bot 'made the edit', you are always responsible for what you let it do. As we can see from this example, bots are not perfect, so we should never assume that what they recommend is correct.
Trappist the monk (talk) 12:52, 26 October 2020 (UTC)Reply

Bot does not recognize jstor-access=free

In this diff, the bot added an hdl parameter to a citation that already had a jstor link with jstor-access=free. Expected behavior is that it will not edit a citation that already has a free link. Kim Post (talk) 21:35, 23 July 2020 (UTC)Reply

Expected behavior by whom? If there is a valid hdl for a reference, I would like it to be added, regardless of how jstor treats visitors. —David Eppstein (talk) 22:12, 23 July 2020 (UTC)Reply
I say that because it's the documented behavior: "The bot won't add a link to an alternative version of a source that is already signaled as free to read (that is, if the free-to-read icon appears in the rendered source)" Kim Post (talk) 12:58, 31 July 2020 (UTC)Reply
A link here is a url. However, there's a weird wrinkle in this case: this is an hdl to a public domain source, which is nonetheless not available because of "copyright restrictions". Headbomb {t · c · p · b} 16:40, 23 September 2020 (UTC)Reply

Is semanticscholar.org pukka?


I thought semanticscholar.org was generally copyright safe, but looking at some recent additions made by this bot I'm genuinely not sure. Take this example:

The linked PDF is downloaded from JSTOR with a notice that the download is for personal use only, and to "contact the publisher regarding any further use of this work".

This is a PDF of the Science article http://doi.org/10.1126/science.6207592, which is copyrighted to the journal and paywalled.

What gives? Alexbrn (talk) 08:29, 14 November 2020 (UTC)Reply

I'm not sure what you mean by "additions made by this bot". The bot (User:OAbot) does not add URLs; only users do.
I'll just note that it's not enough to look at individual terms of use. Separate licenses from the rightsholders, as well as copyright exceptions and limitations, take precedence. Nemo 11:31, 15 November 2020 (UTC)Reply
And in this case? Alexbrn (talk) 11:36, 15 November 2020 (UTC)Reply

Bad edit with OABOT


See this thread. Short version: a user used OABOT to link a preprint version of a reference that did not include the specific information the reference was cited to support. I warned the user that all such edits need to be checked manually to make sure that, when the linked and published versions differ, the differences are not relevant to the usage of the reference (as they were in this case). I'm not sure what there is for OABOT to do except maybe make that point clearer to its users. The link in this case was to escholarship.org, which is not usually problematic; the problem was that the reference was used for an author biography rather than the paper itself, and the escholarship version omitted that part.

A question: I added an empty url field to this reference with a comment warning people not to add links that do not include the author biography. Will this be sufficient to prevent other OABOT users from making the same mistake? —David Eppstein (talk) 01:33, 6 December 2020 (UTC)Reply

Thank you


Thanks to you and your handlers for all the good work. I use the Unpaywall browser extension to bypass those who don't deserve my money but find I'm needing it less and less due to your improvements. Certes (talk) 11:27, 26 December 2020 (UTC)Reply

Glad to help! Nemo 11:09, 30 December 2020 (UTC)Reply

December 2020 refresh


Seeing there was some usage lately, I'm refreshing the suggestions now. The last refresh was in September 2020, and in the meantime Unpaywall made some major fixes (mostly restoring a bunch of bronze OA URLs at Wiley). Tens of thousands of new doi-access=free suggestions are being found. Nemo 11:09, 30 December 2020 (UTC)Reply

Refresh completed. Now about half of the suggestions in the queue are for https://www.biodiversitylibrary.org, https://www.osti.gov or https://academiccommons.columbia.edu, so if you gave up on using the tool because you didn't like the suggestions, I recommend giving it another try! Nemo 09:20, 3 January 2021 (UTC)Reply

Discussion at Help talk:Citation Style 1 § Automating URL access tags


  You are invited to join the discussion at Help talk:Citation Style 1 § Automating URL access tags. {{u|Sdkb}}talk 04:55, 6 April 2021 (UTC)Reply

Update/shameless plug of WP:UPSD, a script to detect unreliable sources


It's been about 14 months since this script was created, and since its inception it has become one of the most imported scripts (currently #54, with 286+ adopters).

Since last year, it's been significantly expanded to cover more bad sources, and is more useful than ever, so I figured it would be a good time to bring the script up again. This way, others who might not know about it can take a look and try it for themselves. I would highly recommend that anyone doing citation work, who writes/expands articles, or does bad-sourcing/BLP cleanup work install the script.

The idea is that it takes something like

  • John Smith "Article of things" Deprecated.com. Accessed 2020-02-14. (John Smith "[https://www.deprecated.com/article Article of things]" ''Deprecated.com''. Accessed 2020-02-14.)

and turns it into the same citation with the unreliable link highlighted.

It will work on a variety of links, including those from {{cite web}}, {{cite journal}} and {{doi}}.

Details and instructions are available at User:Headbomb/unreliable. Questions, comments and requests can be made at User talk:Headbomb/unreliable. Headbomb {t · c · p · b} 13:16, 25 April 2021 (UTC)Reply

I vehemently disagree with the selection criteria, but I respect the work you've put into this. The shameless plug was welcome. ;) Nemo 15:17, 25 April 2021 (UTC)Reply
Not really sure what's to be disagreed with here. Everything reflects consensus on mainstream boards like WP:RSN and WikiProjects. Headbomb {t · c · p · b} 15:27, 25 April 2021 (UTC)Reply
Headbomb, I don't know what consensus supports the list of supposedly bad publication venues, but I know that it doesn't follow the consensus definition of predatory and it seems to ignore far larger problem children like Scientific Reports, Procedia, Nature Communications and friends, not to mention individual titles of legacy publishers which have had severe issues. Nemo 18:33, 25 April 2021 (UTC)Reply
WP:RSN discussions, Beall's list (with a sanity check; far from everything on it is included), Wikiproject discussions, etc., with minor tweaks to categorization (e.g. yellow vs red for some sources due to how they are actually used in Wikipedia). Again, see the note at the top of the documentation page. If you dispute the inclusion of any source, or want to add more, do feel free to start a discussion at WP:RSN and I'll be happy to update anything according to consensus (within the limitations of the script). You can also add your own custom rules if you want. Headbomb {t · c · p · b} 18:38, 25 April 2021 (UTC)Reply
Beall's list is deprecated pretty much everywhere in the civilized world. Are you saying there was a WP:RSN for every single DOI prefix or venue you included in those regular expressions? Nemo 18:47, 25 April 2021 (UTC)Reply
"Beall's list is deprecated pretty much everywhere in the civilized world". It is most definitely not. It's the starting point, but not the end, of most predatory-crap fighting efforts. As for RSNs and every DOI prefix there, no there's not an individual RSN discussion for each of them, mostly because there is no need for such discussions on the vast majority of Beall-identified stuff to begin with. But I do cross check against Predatory publishing#Characteristics. Again, if you have concerns about a specific listing, do feel free to dispute it at WP:RSN and I will happily update my script according to consensus. Headbomb {t · c · p · b} 19:03, 25 April 2021 (UTC)Reply
That's clearly a double standard: inclusion based on gut feelings, changes based on big discussions and consensus. I don't have time for such a haphazard process. But then again, it's not a problem, as long as it's clear to everyone that the list in the script just reflects personal opinions. Everyone works on their own priorities, Wikipedia is a volunteer project.
In general, I'm not in the business of producing naughty lists of publication venues, because I believe in actually reading the sources and in assessing each author for their worth. However, I'll point out that more scientific criteria are available for those who feel a need to judge works by their cover. Brembs 2018 points out some. Nemo 19:23, 25 April 2021 (UTC)Reply
Well, you're certainly free to not use it. As far as double standards go, "I believe in actually reading the sources and in assessing each author for their worth" is very much using your own gut feeling. This script, however, isn't based on that, but rather consensus. If you don't want to be part of that consensus, I can't really force you to participate. Headbomb {t · c · p · b} 19:31, 25 April 2021 (UTC)Reply

Missing several open-access articles on Behavior modification facility


Mostly those marked as open access in APA PsycNet, like doi:10.1037/h0100629. Headbomb {t · c · p · b} 14:48, 19 June 2021 (UTC)Reply

Old bad url — translation rather than text of English original


In 2019, the bot made this edit, adding a Czech translation of an originally-English reference as the url for the reference. It wasn't noticed until now. I didn't find any copies of that specific link elsewhere but it's entirely likely that this was part of a bad batch of additions that did similar things to other references. —David Eppstein (talk) 17:54, 22 November 2021 (UTC)Reply

OABOT on SqWiki


Hello! I'm a crat from SqWiki and we may be interested in having OABOT as a member in our community. What would I need to do for that to happen?

Also, one question: I already know OABOT replaces paywalled citation links with free-to-read ones. Does it also add citation links if they're missing from the citation altogether? - Klein Muçi (talk) 11:49, 16 February 2022 (UTC)Reply

Klein Muçi, sorry for the slow answer. Thanks for your interest! Unfortunately at the moment we have very little development capacity, so the best option would be to find a Python bot developer in your community to run a fork of oabot (it needs to be ported to Python 3 at least). Then there's probably a need to adapt the citation template parsing.
The bot has mostly updated {{cite journal}} templates with a DOI parameter filled in. Where there's no DOI, it can query https://dissem.in by author and title, but this is slower and fraught with potential errors. It doesn't have any capacity to act on unstructured citations, but outside the English Wikipedia it would be rather easy to add such a feature by relying on the CrossRef API, which returns a DOI for an unstructured citation (see the sketch below). Then you can just throw away the old text and replace it with a templated citation. Nemo 22:25, 22 January 2023 (UTC)Reply
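As a sketch of that last step (a hypothetical helper, not part of oabot; it assumes the public CrossRef REST API and trusts the top hit, which real code should verify against the returned match score):

    import requests

    def doi_from_unstructured(citation):
        # Ask CrossRef to match a free-form citation string to a work.
        r = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": citation, "rows": 1},
            timeout=30,
        )
        r.raise_for_status()
        items = r.json()["message"]["items"]
        return items[0]["DOI"] if items else None

    print(doi_from_unstructured("Einstein 1905 Zur Elektrodynamik bewegter Koerper Annalen der Physik"))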

New edit suggestions


Someone expressed interest in the tool so I updated the suggestion queue shown at https://oabot.toolforge.org/ . It now has suggestions for over a thousand articles. (There are many more in queue for the bot.) Most suggested links are from https://www.osti.gov, https://academiccommons.columbia.edu and https://www.biorxiv.org .

(For context, the automatic refresh has been stuck for a while due to the deprecation of Python 2, but I managed to give it a kick and run it manually.) Nemo 22:27, 22 January 2023 (UTC)Reply

Now running on python3


I've restored the web tool; it seems to work fine. It's now running on the same python3 codebase that I used in January to update suggestions. The tool had been broken for a few months after Toolforge deprecated the old python setup. Thanks Josve05a and Headbomb for the nudge!

(This edit was powered by the Turku bike kitchen.) Nemo 21:06, 3 August 2023 (UTC)Reply

Seems to be working, thanks Ocaasi for testing! Nemo 19:22, 4 August 2023 (UTC)Reply
Thanks for the fix! Headbomb {t · c · p · b} 06:59, 5 August 2023 (UTC)Reply

New edit suggestions ready


After a few bumpy weeks, the bot is mostly done doing the easy edits, and the web tool at https://oabot.toolforge.org should be stable enough for regular usage, if you have some time. Currently we have suggestions in the queue for about 4000 articles, with some of the most common domains in suggested links being https://www.biorxiv.org, https://www.osti.gov, https://academiccommons.columbia.edu, https://ris.utwente.nl, https://www.biodiversitylibrary.org, https://figshare.com, https://escholarship.org, https://hcommons.org . Nemo 21:38, 23 August 2023 (UTC)Reply

@Afernand74, Abductive, Jaireeodell, Awkwafaba, Neko-chan, Amorymeltzer, Hike395, Corn cheese, QueerEcofeminist, Ganesha811, Anas1712, Ezlev, Losipov, Ambrosia10, Rsjaffe, Styyx, Frostly, and Shackpoet:, as you used the tool before, I'd be interested in your comments on whether it's working well for you now. Nemo 06:55, 25 August 2023 (UTC)Reply
After a few spins it coughed up this error for me:
Extended content

OAbot Oops! Something went wrong. Error: (pymysql.err.OperationalError) (1203, "User s52920 already has more than 'max_user_connections' active connections") (Background on this error at: https://sqlalche.me/e/20/e3q8)

The pymysql.err.OperationalError was raised while SQLAlchemy opened a database connection for UserStats.get('en', username) in oabot/userstats.py, and surfaced through Flask's dispatch_request as sqlalchemy.exc.OperationalError with the same message.

Abductive (reasoning) 08:13, 25 August 2023 (UTC)Reply
The Sherpa/Romeo information seems to always return "Unknown" for me. — hike395 (talk) 12:40, 25 August 2023 (UTC)Reply
Sherpa/Romeo v2 became much more difficult to interpret. I no longer rely on reports from the API as a result. I would recommend providing a link to the Sherpa/Romeo record, because they really do require humans to read them on site now. -- Jaireeodell (talk) 13:50, 25 August 2023 (UTC)Reply
What should users do when a citation refers to an OA version and the suggested "free to read link" refers to the same version? For example, the DOI points to biorxiv.org landing page and the URL points to the file on that landing page. See: Bacterial phyla reference for "The novel shapeshifting bacterial phylum Saltatorellota". It would be nice if I could mark it OA somehow from the bot without adding the extra link. -- Jaireeodell (talk) 14:05, 25 August 2023 (UTC)Reply
If you are citing a BioRxiv preprint as you are apparently doing in this template:
{{Cite journal|title=Evolutionary Implications of Anoxygenic Phototrophy in the Bacterial Phylum Candidatus Palusbacterota (WPS-2)|last1=Ward|first1=Lewis M.|last2=Cardona|first2=Tanai|date=2019-01-29|last3=Holland-Moritz|first3=Hannah|s2cid=92796436|doi=10.1101/534180|doi-access=free}}
do not use {{cite journal}} because BioRxiv is not a journal. Instead, use {{cite biorxiv}}:
{{Cite biorxiv |title=Evolutionary Implications of Anoxygenic Phototrophy in the Bacterial Phylum Candidatus Palusbacterota (WPS-2) |last1=Ward |first1=Lewis M. |last2=Cardona |first2=Tanai |date=2019-01-29 |last3=Holland-Moritz |first3=Hannah |biorxiv=10.1101/534180}}
Ward, Lewis M.; Cardona, Tanai; Holland-Moritz, Hannah (2019-01-29). "Evolutionary Implications of Anoxygenic Phototrophy in the Bacterial Phylum Candidatus Palusbacterota (WPS-2)". bioRxiv 10.1101/534180.
The other preprint templates are {{cite arxiv}}, {{cite citeseerx}}, {{medrxiv}}, and {{cite ssrn}}. There is also a wrapper template {{cite preprint}}. When/if the source cited with a preprint template is published in a WP:RS journal or book, and if the WP:RS version supports the en.wiki article text, the preprint template should be converted to {{cite journal}} or {{cite book}} and fleshed out to include the appropriate bibliographic detail.
Trappist the monk (talk) 14:58, 25 August 2023 (UTC)Reply
Thanks! This is helpful. I'd have to leave the OAbot interface to fix these though, right? -- Jaireeodell (talk) 15:03, 25 August 2023 (UTC)Reply
I don't know, I don't use this tool.
Trappist the monk (talk) 15:47, 25 August 2023 (UTC)Reply
Mostly works for me! I noticed that some OAbot displays force a file download when the URL preview cannot show the file (too large?). It's a little unsettling, but not a deal breaker. -- Jaireeodell (talk) 14:20, 25 August 2023 (UTC)Reply
Seems to be working well for me. Glad to see it running smoothly. Sometimes it suggests superficial edits, but no big problems. Thanks for all the work you do on this! awkwafaba (📥) 14:41, 25 August 2023 (UTC)Reply
Working well here; my only complaint is how it loads super zoomed out while it's fetching the PDF, but that's probably just an artifact of my browser. Also echoing that there should be a link to Sherpa if the API is broken --~ฅ(ↀωↀ=)neko-channyan 14:56, 25 August 2023 (UTC)Reply
Ope, right after I posted that I got a timeout error, which probably should be handled more gracefully:
Extended content

HTTPSConnectionPool(host='lirias.kuleuven.be', port=443): Max retries exceeded with url: /bitstream/123456789/619108/1/MM_JChromB_2017_1_to%20Lirias.docx (Caused by NewConnectionError: Failed to establish a new connection: [Errno 110] Connection timed out)

The TimeoutError was raised by urllib3 while stream_url in app.py fetched the URL with requests.get(url), and surfaced through Flask as requests.exceptions.ConnectionError.

--~ฅ(ↀωↀ=)neko-channyan 15:03, 25 August 2023 (UTC)Reply
And another bug I encountered during my afternoon break: it recommended a doi.org link. That should be specifically blocked in the system. ~ฅ(ↀωↀ=)neko-channyan 22:08, 25 August 2023 (UTC)Reply

Thanks for the comments!

  • I don't know what to do about the PDF previews. I've changed them to load through a proxy in order to override some repositories which force the download, but of course this is a hack and creates other problems. The default size is very small so that the main title is hopefully visible above the fold but that's debatable. Whether download is triggered depends also on your browser preferences (I used a separate browser with different preferences for this task). More comments on how you use the previews and what features should be prioritised would be welcome.
  • There was a bug which prevented the Sherpa data from loading. Now it's slowly reappearing for the existing suggestions.
    • More generally, we currently prefill most suggestions from Unpaywall because it's faster for bot edits, but then we also reload the Dissemin data for the suggestions which the bot can't handle. That takes a while, see T281076. In such cases you should also be able to reject the edits and then reload the suggestions again for the same page: they should be regenerated with Dissemin data included.
    • We try to link Sherpa for an ISSN but that requires us to know the ISSN and that currently relies on the Dissemin data, so it won't be there either if that's missing.
    • It's worth noting that the publishers' preferences about archival are nowadays mostly irrelevant for many repositories, as repositories/authors in multiple countries (including Germany, France, Belgium, the Netherlands, Switzerland and Austria) benefit from copyright laws which override private contracts to allow archival. If you're unsure how to read Sherpa data, it might be easier to focus on such jurisdictions (if you're interested in completing all the suggestions about one repository or jurisdictions contact me and we can think of something to make it easier).
  • The interface for doi changes is a bit confusing. When it says that a link to doi.org is proposed, it doesn't actually mean that the url parameter would be changed. Was there an edit preview which suggested otherwise? We could also give a lower priority to DOI IDs (see also T283717), or filter out redundant doi.org candidates entirely, as in the sketch after this list.
  • I've filed the timeout error as T345041.
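For instance, a minimal (hypothetical) filter could drop candidate URLs whose host is doi.org before they reach the queue, since they merely duplicate the |doi= parameter:

    from urllib.parse import urlparse

    def is_redundant_doi_url(candidate):
        # A doi.org link adds nothing over an existing |doi= parameter.
        host = urlparse(candidate).netloc.lower()
        return host in {"doi.org", "dx.doi.org", "www.doi.org"}

    assert is_redundant_doi_url("https://doi.org/10.1126/science.6207592")
    assert not is_redundant_doi_url("https://zenodo.org/record/1685826")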

Nemo 17:25, 27 August 2023 (UTC)Reply

You're right, the doi.org and handle suggestions did end up just meaning it wanted to add those instead of a url. Maybe a long term goal would be to change the phrasing from "proposed link" to something else. Thank you for your work so far and for opening bug tickets!
There's also a new problem now where it's adding a blank |url-access=|url-status=|archive-url=|archive-date= block every so often ~ฅ(ↀωↀ=)neko-channyan 16:00, 28 August 2023 (UTC)Reply

Incorrect removal of free doi-access


OAbot is incorrectly removing doi-access=free from various pages. Here is an example. --Whywhenwhohow (talk) 00:51, 26 August 2023 (UTC)Reply

Already discussed at https://en.wikipedia.org/wiki/User_talk:OAbot#bot_incorrectly_removed_manually_added_free_access_tag No idea why this bot has multiple talk pages. AManWithNoPlan (talk) 01:27, 26 August 2023 (UTC)Reply

OABot should never replace an already accessible URL copy with an inferior URL without a human checking it


In Special:Diff/1172389538 to the page Lexell's theorem, User:Afernand74 replaced a link to an excellent scan, https://archive.org/details/journalfurdierei1218unse/page/45, with a link to a dramatically crummier scan, https://zenodo.org/record/1685826/files/article.pdf, using the edit summary "(Added free to read links in citations with OAbot #oabot)". This kind of replacement should never happen without a human directly comparing the two links. –jacobolus (t) 19:43, 26 August 2023 (UTC)Reply

As an aside, even when they are the best available source for a paper, Zenodo links should be added using the id parameter set to something like Zenodo:1685826, rather than linking the url parameter directly to a PDF URL. Edit: there's even a {{Zenodo}} template: Zenodo 1685826 will put up a little green unlocked symbol. –jacobolus (t) 20:00, 26 August 2023 (UTC)Reply
Special:Diff/1172389538 is an edit which was made manually.
Personally I find the Zenodo-hosted PDF clearly superior: it's more readable because it's black and white, it has an actually legible OCR and it allows downloading without having to download the entire issue. Disagreements about specific citations can be discussed in the talk page of the affected article.
You can propose changes to the citation templates in Help talk:Citation Style 1. Nemo 11:32, 27 August 2023 (UTC)Reply
The Zenodo scan has dramatically lower resolution, is 1-bit color instead of full color, does not include any of the figures, and is missing all context. The Internet Archive version includes a full-text OCR that looks like the same content, though I didn't compare OCR quality in detail. A bot or script should not ever recommend this kind of replacement unless there is a human actually checking and comparing.
Does the bot ever make such edits? When a human running it as a script sees such an edit come up, is there some kind of clear warning message that there was already a URL there, and the human needs to compare them and consider which is a better link? [The documentation page here says: The bot only adds a parameter if it does not contain anything before (so, the bot does not erase any information from the templates).]
I am not "propos[ing] changes to the citation templates". The citation templates already have an id parameter for precisely this kind of use. Bots (and humans) should use it. But even if you use the url parameter, the link should always go to a page like https://zenodo.org/record/1685826/ rather than https://zenodo.org/record/1685826/files/article.pdf. –jacobolus (t) 15:04, 27 August 2023 (UTC)Reply
The purpose of OAbot is to provide links to the full text. The web tool displays pre-existing links and recommends to check them. Nemo 17:14, 27 August 2023 (UTC)Reply
Okay, let me be clearer, since this doesn't seem to be sinking in: in the event that the change deletes existing links, there needs to be a prominent warning that a link will be deleted, and the editor making the change needs to click the previous link to see what it contained before deciding to delete it. Is there already such a warning very prominently and obviously included? If not, that is a serious bug.
One possible way to avoid this problem is to just add to the id parameter instead of replacing the URL parameter. For example, the citation could add a 'Zenodo' ID like:
Steiner, Jakob (1827), "Verwandlung und Theilung sphärischer Figuren durch Construction" [Transformation and Division of Spherical Figures by Construction], Journal für die reine und angewandte Mathematik (in German), 2 (1): 45–63, doi:10.1515/crll.1827.2.45, EuDML 183090, Zenodo1685826
This preserves the existing URL to the significantly superior scan, and only adds a new link, rather than deleting anything.
(Aside: it would be appreciated if the bot didn't litter the template markup with a bunch of empty parameters.)
Separately, the url parameter should always prefer a link of the form https://zenodo.org/record/1685826/ rather than a link of the form https://zenodo.org/record/1685826/files/article.pdf. Inserting the latter link instead of the former is also a bug. –jacobolus (t) 17:50, 27 August 2023 (UTC)Reply
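For what it's worth, a normalisation along these lines (a sketch; the regex assumes the zenodo.org/record/<id>/files/... URL scheme shown above) would keep such links on the landing page:

    import re

    def zenodo_landing_page(url):
        # Collapse a direct file link into its record's landing page.
        return re.sub(r"(https?://zenodo\.org/record/\d+)/files/\S+", r"\1", url)

    print(zenodo_landing_page("https://zenodo.org/record/1685826/files/article.pdf"))
    # prints: https://zenodo.org/record/1685826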
Using bold and red will not make your opinions stronger. They've been noted. Nemo 18:21, 27 August 2023 (UTC)Reply
They've been noted – Great, thanks. That was entirely unclear from (indeed, seemed contradicted by) your previous responses. –jacobolus (t) 18:25, 27 August 2023 (UTC)Reply

This is now happening a bunch again, with user:AB-Babayo acting as the bot. Can you please stop having perfectly fine URLs in citations replaced with crummier Zenodo URLs unless a human is in the loop, carefully examining both links and deliberately picking one? The easiest possibility would be to just never replace existing links. Another alternative would be to add to the id parameter instead of replacing the URL parameter, as discussed above. If this kind of thing keeps happening, it may be necessary to start a formal complaint, possibly asking for this bot to be shut down. –jacobolus (t) 01:35, 29 December 2023 (UTC)Reply

Please provide links both to the best version freely available and to a paywalled version if it may be better for some (or all) purposes. Certes (talk) 11:32, 29 December 2023 (UTC)Reply
The specific one I reverted was the replacement of https://archive.org/details/londonedinburg3371850lond/page/198/ by https://zenodo.org/record/1919807/files/article.pdf. But the specific one is not really the point; these edits replacing links are being made by the hundreds with no significant human involvement/intervention. This bot or a human running it as a script should not ever be doing URL replacements in an automated or semi-automated way. (Adding new URLs where there previously was none is fine.) –jacobolus (t) 11:48, 29 December 2023 (UTC)Reply
Another recent example is Special:Diff/1192242835, which replaces a landing page of a reference (with useful metadata) with a direct pdf link (more fragile and less useful because one can easily go from the landing page to the pdf but not vice versa). In general, I don't think this bot should be in the business of removing urls and replacing them with different urls. —David Eppstein (talk) 15:48, 29 December 2023 (UTC)Reply
Unfortunately many links to landing pages are broken (or just garbage inserted by VisualEditor), so it's generally an improvement to add a link that we know to be working. In that particular example the repository was migrated from digital.library.wisc.edu to minds.wisconsin.edu; a link to the current landing page would not have been replaced. The oabot tool asks to verify whether the existing links already provide adequate access.
Using the handle is good. When citation templates start autolinking hdl-access=free, as previously discussed, oabot can stop suggesting additional links (though there will still be a need to remove the garbage links from the url parameter). Nemo 13:53, 31 December 2023 (UTC)Reply
"Generally an improvement" is not good enough for automated edits. A human needs to explicitly (carefully and patiently) check. –jacobolus (t) 16:52, 31 December 2023 (UTC)Reply
Wait, it changed an old landing page link to a PDF link instead of to the new landing page? XOR'easter (talk) 16:48, 1 January 2024 (UTC)Reply
Agreed, in my estimation an archive.org link should never be replaced with a zenodo one. |id= with {{zenodo}} in some cases might be worth adding, but a link to archive.org should not be removed, and this certainly should not be done with any automated process. Umimmak (talk) 19:38, 31 December 2023 (UTC)Reply

More updates on URL management


I'm still working on addressing the comments on the identification of non-free-to-read URLs and DOIs.

  • I've added more questions and answers to this page, to address common doubts.
  • You can help by using the tool and reporting any false positive/negative OA link. (You don't even need me for this: you can check the Unpaywall API yourself, as in the sketch after this list, and email their support, which usually answers within a few days if you have specific information about a single article/journal/publisher; but check the FAQ first. Sending patches is easy enough, but I'm not sure how long it takes to get them reviewed at the moment.)
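A spot check of a single DOI against the Unpaywall API can be sketched like this (a minimal sketch; the email address is a placeholder which Unpaywall requires you to replace with your own):

    import requests

    def unpaywall_oa_status(doi, email="you@example.org"):
        # Return Unpaywall's view of whether the DOI is OA, plus the best
        # free-to-read URL it knows about (None if there is none).
        r = requests.get(
            "https://api.unpaywall.org/v2/" + doi,
            params={"email": email},
            timeout=30,
        )
        r.raise_for_status()
        data = r.json()
        best = data.get("best_oa_location") or {}
        return data.get("is_oa"), best.get("url")

    print(unpaywall_oa_status("10.1126/science.6207592"))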

I've made the first manual edit adding a url-access parameter for a paywalled URL. I'm starting quite conservatively, as I'm hoping to eventually automate this rather tedious job (and also a rather pointless one, as paywalled redundant URLs should actually be removed). So there will be frequent changes in the upcoming days; come back later if the tool gives you dissatisfying suggestions. Nemo 08:55, 7 December 2023 (UTC)Reply

Edit summary

  Resolved

Please thank your handlers for all the great work in making sources more accessible. One minor suggestion: sometimes the edit summary promises more than it delivers. Here, for example, the bot notes that subscription is required, a helpful change and probably the best that could be done with the available sources. However, the ES claims that it Added free to read link, which unfortunately wasn't possible in this case. Could that be rephrased? Thanks again, Certes (talk) 10:32, 27 December 2023 (UTC)Reply

Indeed, that's suboptimal. Those url-access changes are mostly meant for a future test run with the bot, but meanwhile users are doing them manually. I've now changed the default edit summary; hopefully it's clearer. (OAbot is currently running on a test branch of the code; I'm hoping to get it stable enough to merge to master soonish.) Nemo 13:55, 31 December 2023 (UTC)Reply

HDL


Seeing you add an HDL to a DOI ref I made has led me down a rabbit hole of finding out what it (the Handle System) is :-) Back ache (talk) 12:18, 24 June 2024 (UTC)Reply