User talk:JL-Bot/Archive 3


Recognized content updates

Hi JL-B, is the bot not updating the Recognized content page of the WikiProjects anymore? At Wikipedia:WikiProject Madonna/Recognized content it last updated on July 17, 2017, after which two articles were promoted and it never ran on the page. —IB [ Poke ] 10:16, 22 August 2017 (UTC)

@IndianBio: Are the two articles tagged with {{WikiProject Madonna}}? Headbomb {t · c · p · b} 12:14, 22 August 2017 (UTC)
@Headbomb:, yes they were from the beginning. —IB [ Poke ] 12:22, 22 August 2017 (UTC)
List of songs recorded by Madonna was promoted to a featured list on 14-Aug which was after the last run of that task (12-Aug). It would have been picked up on this upcoming weekend's run (the bot didn't run this past weekend as I was away). I just ran it against Wikipedia:WikiProject Madonna/Recognized content and it did pick that one up. What was the other page you were expecting it to find? -- JLaTondre (talk) 20:58, 22 August 2017 (UTC)

WP:JCW/TAR

Are '&' and 'and' considered synonymous? I can't remember if I mentioned that equivalence before. If not, they should be. Headbomb {t · c · p · b} 16:43, 25 August 2017 (UTC)

Yes, they are. The pattern matching treats them as equivalent. By the way, I noticed an issue. On Target3, it is grouping together all the invalid targets (#218). I'm assuming that these should instead not be included. I can easily skip them if you want. Though you may want to take a minute to look over them to see if there are any patterns you would like me to try to correct for in the original parsing. -- JLaTondre (talk) 23:43, 25 August 2017 (UTC)
No no no no please keep those! It's super useful to have them grouped together. Headbomb {t · c · p · b} 00:49, 26 August 2017 (UTC)
Pseudo-edit. Actually, the cleaner solution would be to upload everything pointing to an invalid target at Wikipedia:WikiProject Academic_Journals/Journals cited by Wikipedia/Invalid1. The results should be displayed in the same style as alphabetical listings (with every entry <nowiki>'d) so the /Target ranking is legit. Headbomb {t · c · p · b} 00:49, 26 August 2017 (UTC)

Thinking about the matching logic a bit more... beyond entries that differ only by punctuation, capitalization, or whitespace, there are a few words that don't significantly affect a journal title: The/Le/La/L', in, and/et/und/&, of/fur/für/de. Those could effectively be stripped from entries before doing a lookup. That way something like |journal=Journal Physics C gets picked up for |journal=Journal of Physics C even if no redirect has been created for it. Headbomb {t · c · p · b} 00:59, 26 August 2017 (UTC)

@Headbomb: Currently, The is ignored when determining what subpage to list the result on (ex. "The Journal" is listed on the J pages). I assume I should do the same for Le/La/L'? And for a status update: My first attempt at adding magazines and the normalization resulted in using too much memory. It has always been memory intensive and the extra content pushed it over the edge. I'm refactoring the code to eliminate that issue. It will take a bit longer than I hoped. -- JLaTondre (talk) 23:57, 5 September 2017 (UTC)
Yes, same for Le/La/L'. Running it on JCW only for now would be fine, and on MCW whenever you have time.
One thing, we now have {{R from Bluebook}} (and variants), similar to {{R from ISO 4}} and variants. Those should be used for |d-type=BB/|t-type=BB. Headbomb {t · c · p · b} 00:09, 6 September 2017 (UTC)
JLaTondre (talk · contribs) any plans on running WP:JCW/TAR again? Even the old code would be nice. Headbomb {t · c · p · b} 18:54, 9 September 2017 (UTC)
Done. -- JLaTondre (talk) 22:06, 9 September 2017 (UTC)
@JLaTondre:, I notice the bot now has some sort of misspelling tolerance in WP:JCW/TAR... What's the exact logic used? It could use a bit of tweaking. Headbomb {t · c · p · b} 14:13, 10 September 2017 (UTC)
Not intentionally. It has the first attempt at the above "the", "and", "of" normalization. Probably a regex issue. What are you seeing? -- JLaTondre (talk) 20:53, 10 September 2017 (UTC)

Looking at WP:JCW/TAR, for the #1 (Nature), it picks up Naturel/Nature'/Nature). For #3 (Science), it picks up Sciences/The Sciences. For #4 (PNAS), it picks up Proc Nati Acad Sci USA on top of the correct Proc Natl Acad Sci USA. For #10 (Cell) it picks up Cells. For #93 (Genetics), it picks up Genetica. For #96 (Nature Communications), it picks up Nature Communications. Maybe a . instead of a \. in the regex somewhere? Headbomb {t · c · p · b} 21:26, 10 September 2017 (UTC)

Keep in mind, 95% of the time it's useful to have the additional entries. But the logic could be tweaked/restricted to redlinks [or some other possible tweaks], so that entries like Genetica don't get picked up for Genetics (journal), or The Sciences for Science (journal). 21:38, 10 September 2017 (UTC)
The "a"s and "i"s are errors, but the plural is intended. "Proceeding" vs "Proceedings", "Journal" vs "Journals", and "Communication" vs "Communications" are common. I can drop the general case and just put in specific ones? -- JLaTondre (talk) 23:36, 10 September 2017 (UTC)
What about "Proc. Natl Acad. Scie."? Hence my question about the logic used. If I know what it is, I could suggest tweaks. Headbomb {t · c · p · b} 00:39, 11 September 2017 (UTC)
The current implementation is bolted on to the back end. It parses the generated output & creates regexes to find matches. It is a bit clumsy, but it was the easiest thing to do. As part of the revamp, I'm switching to normalizing the citations on the initial parsing. It should be more reliable and definitely more maintainable. The logic is:
  1. remove leading the|le|la|l'
  2. replace & with a space
  3. replace  and|et|und  with a space
  4. replace  of|fur|für|de  with a space
  5. replace  the|le|la|l'  with a space
  6. replace  in  with a space
  7. replace  for  with a space
  8. remove s from end of words
  9. replace  int|intl|international  with intl
  10. remove punctuation
I considered replacing journal with j, but usually in those cases all the words are abbreviated so it would have little effect on the output. -- JLaTondre (talk) 23:36, 11 September 2017 (UTC)
"Remove plurals" meaning striping the final s? Headbomb {t · c · p · b} 00:17, 13 September 2017 (UTC)
Yes, I updated that line to be more precise. -- JLaTondre (talk) 23:29, 13 September 2017 (UTC)
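A rough Python sketch of the ten normalization steps listed above (illustrative only; the bot's actual code is not shown in this thread and the function name here is made up):

  import re

  def normalize_title(title):
      # Illustrative sketch of the ten steps listed above; not the bot's actual code.
      t = title.lower().strip()
      t = re.sub(r"^(?:the|le|la|l')\b\s*", '', t)              # 1. leading article
      t = t.replace('&', ' ')                                   # 2. ampersand
      t = re.sub(r'\b(?:and|et|und)\b', ' ', t)                 # 3. conjunctions
      t = re.sub(r'\b(?:of|fur|für|de)\b', ' ', t)              # 4. "of" equivalents
      t = re.sub(r"\b(?:the|le|la|l')\b", ' ', t)               # 5. articles elsewhere
      t = re.sub(r'\b(?:in|for)\b', ' ', t)                     # 6-7. "in" and "for"
      t = re.sub(r's\b', '', t)                                 # 8. final s on each word
      t = re.sub(r'\b(?:int|intl|international)\b', 'intl', t)  # 9. international
      t = re.sub(r'[^\w\s]', '', t)                             # 10. punctuation
      return re.sub(r'\s+', ' ', t).strip()                     # collapse whitespace

With this, "Journal Physics C" and "Journal of Physics C" both reduce to "journal physic c" and so end up grouped together.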

On a similar note, I'm not really sure it's a good idea to "fix" the bug about stray characters. It picks up several typos that need to be fixed. E.g. in Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Target3, entry #251 picks up "American" as a typo for "America", and #253 picks up "Journal of Molluscan StudiesO." for "Journal of Molluscan Studies". This is, in general, quite useful. We just need a way to distinguish between typos and legit different journals: do pick up "Journal of Molluscan StudiesO." [which doesn't exist] as a variant of Journal of Molluscan Studies, but recognize that Sciences (magazine)/The Sciences are different publications than Science (magazine), or that in entry 104 "SPINE", "Spine", and "Spine." belong to Spine (journal) rather than Spin (magazine). Headbomb {t · c · p · b} 11:49, 14 September 2017 (UTC)

Two more words that can be dropped for matches: Part, Section. That is, Journal of Physics Part A = Journal of Physics A / Acta Crystallographica Section D = Acta Crystallographica D, etc... Headbomb {t · c · p · b} 15:18, 18 September 2017 (UTC)

Wikimania presentation / Wikidata bot

Hi JLaTondre,

Just to let you know that User:Headbomb/JCW was a resounding success at Wikimania this year. I must thank you for the high quality of your work, which enabled it. There was a lot of interest in seeing this work done for other editions (languages) of Wikipedia, and also to enlarge this to cover publishers/books and the like.

While at Wikimania, I also learned that pretty much the entire corpus of citations is now available at Wikidata, and querying from there would be 1) much more efficient, and 2) might make it possible to include 'manual' citations by finding instances of {{doi}} and querying Wikidata for the corresponding item. This would yield much more reliable results with respect to usage. Would you be interested in coding such a bot / adapting JL-Bot to make use of Wikidata data? Headbomb {t · c · p · b} 19:27, 15 August 2017 (UTC)

Glad your presentation went well. Yes, I could look into using Wikidata. Is there anything that shows how it works? I've tried tracing between Wikipedia and Wikidata, but I'm not seeing "pretty much the entire corpus of citations". I'm probably missing something? Expanding the current bot version for publishers/books would be pretty easy. Expanding it for other languages would be more difficult as it currently works off pattern matching. If Wikidata works, I'm assuming that would make multiple languages easier. -- JLaTondre (talk) 21:24, 21 August 2017 (UTC)
There must be documentation somewhere. I'll be looking into it over the next few months, maybe this is something that can be achieved with direct queries / run on the tool server (or whatever the equivalent is these days), or maybe there's an API somewhere.
Right now, I'll be focusing on creating/updating our series of templates to handle magazines and publishers, so the only thing you'd need to do is make minor changes for those compilations. Headbomb {t · c · p · b} 23:42, 21 August 2017 (UTC)

Reviewing the intro of WP:JCW... it says "This is a bot-generated list of journals cited on Wikipedia using the "journal=" parameter of the {{Citation}}, {{Cite journal}}, and {{Vcite journal}} templates." Is this accurate? I feel it should be updated to "...the "journal=" parameter of the {{Citation}}, {{Cite xxx}}, and {{Vcite journal}} templates" (or rather, any |journal= found in any citation template). Headbomb {t · c · p · b} 03:23, 22 August 2017 (UTC)

I thought the purpose was to list only journal citations (at least as far as possible). As far as I'm aware, those are the only journal citation templates. Other templates "alias" the |journal= parameter, but they aren't actually journal citations. -- JLaTondre (talk) 20:46, 22 August 2017 (UTC)
So what's the exact logic, currently? If it finds |work= or |magazine= in {{cite journal}}, does that count towards the compilation, since work/magazine/journal are aliases? Headbomb {t · c · p · b} 21:10, 22 August 2017 (UTC)
Just |journal=. The same logic is used to parse all templates and {{citation}} uses the parameters to mean different things. Once it is extended to magazines, it will be closer to your original comment - |journal= in any processed citation template will be recorded as a journal, and likewise any |magazine= will be recorded as a magazine. -- JLaTondre (talk) 23:21, 25 August 2017 (UTC)
Yes, that's how it should be. Catch |journal= from any citation templates for WP:JCW, and |magazine= from any citation template for WP:MAG/CITED. Headbomb {t · c · p · b} 01:01, 26 August 2017 (UTC)
I'm not seeing an easy way to programmatically retrieve citation templates. They are spread out over multiple categories (some of which have other stuff in them). Are you aware of anything I'm missing? -- JLaTondre (talk) 01:02, 29 August 2017 (UTC)

It'd likely need to be a hard-coded list of templates, but Category:Citation Style 1 templates plus {{citation}} should be a good start. Here's a summary below.

Style Template Supports |journal= Supports |magazine= Supports |publisher=
CS2 {{Citation}} Yes Yes Yes
CS1 {{Cite arXiv}} No No No
CS1 {{Cite AV media}} Yes Yes Yes
CS1 {{Cite AV media notes}} Yes Yes Yes
CS1 {{Cite bioRxiv}} No No No
CS1 {{Cite book}} Yes Yes Yes
CS1 {{Cite conference}} Yes Yes Yes
CS1 {{Cite encyclopedia}} Yes Yes Yes
CS1 {{Cite episode}} No No Yes
CS1 {{Cite interview}} Yes Yes Yes
CS1 {{Cite journal}} Yes Yes Yes
CS1 {{Cite magazine}} Yes Yes Yes
CS1 {{Cite mailing list}} No No Yes
CS1 {{Cite map}} Yes Yes Yes
CS1 {{Cite news}} Yes Yes Yes
CS1 {{Cite newsgroup}} Yes Yes Yes
CS1 {{Cite podcast}} Yes Yes Yes
CS1 {{Cite press release}} Yes Yes Yes
CS1 {{Cite report}} Yes Yes Yes
CS1 {{Cite serial}} Yes Yes Yes
CS1 {{Cite sign}} Yes Yes Yes
CS1 {{Cite speech}} Yes Yes Yes
CS1 {{Cite techreport}} Yes Yes Yes
CS1 {{Cite thesis}} Yes Yes Yes
CS1 {{Cite video game}} No No Yes
CS1 {{Cite web}} Yes Yes Yes
Bluebook {{Bluebook journal}} Yes No No
Vancouver {{vcite book}} No No Yes
Vancouver {{vcite conference}} No No Yes
Vancouver {{vcite journal}} Yes No No
Vancouver {{vcite news}} No No Yes
Vancouver {{vcite web}} No No Yes
Vancouver {{vcite2 journal}} Yes Yes Yes
{{Cite comic}} No No Yes
LSA {{Cite LSA}} Yes No Yes

+Redirects to such templates (e.g. {{Cite dictionary}} → {{Cite encyclopedia}}). 01:36, 29 August 2017 (UTC)

Okay. I'll stick with the hardcoded list and expand it as above. -- JLaTondre (talk) 21:03, 30 August 2017 (UTC)
Done. The current run is using all the above templates. This is expanding the output from the prior run for the 20170901 dump. -- JLaTondre (talk) 17:09, 24 September 2017 (UTC)

Wikipedia:WikiProject Magazines/Magazines cited by Wikipedia

I believe I've set up the framework for this.

Differences are

Root location is Wikipedia:WikiProject Magazines/Magazines cited by Wikipedia

I believe if you perform these find-and-replaces [avoiding this bit of the code]

Academic Journals → Journals
Journals → Magazines
JCW → MCW

You will have pretty much updated the entire structure. All talk pages should be made to point to Wikipedia talk:WikiProject Magazines/Magazines cited by Wikipedia.

this bit of the code will likely have its priorities updated, but I'd want to see the output before tweaking. Headbomb {t · c · p · b} 05:42, 22 August 2017 (UTC)

I will see what I can do this weekend. Since the original approval was specific to WP:JOURNALS, should a new request for approval be submitted or is this a reasonable continuation? -- JLaTondre (talk) 20:50, 22 August 2017 (UTC)
I honestly don't see what would be controversial about it. I'll ping Hellknowz (talk · contribs), who approved JL-Bot for WP:JCW the first time around, to see if he wants a BRFA for this one, or if there's no point. Headbomb {t · c · p · b} 21:09, 22 August 2017 (UTC)
Drop a note at WT:BRFA, so it's official-ish, but I don't see any issue extending this. —  HELLKNOWZ  ▎TALK 21:13, 22 August 2017 (UTC)
JLaTondre (talk · contribs) any update on this? Headbomb {t · c · p · b} 20:17, 28 August 2017 (UTC)
In progress. Hope to have something for the Sep dump if I don't run into any issues. I'll post on WT:BRFA closer to when I have something done. -- JLaTondre (talk) 00:58, 29 August 2017 (UTC)
Headbomb (talk · contribs) Code changes for WP:MCW are pretty much completed. I still have to wrap up the TAR changes (section below), but the rest is done. I have posted at WT:BRFA. Pending response to that, I will upload the results tomorrow. -- JLaTondre (talk) 19:27, 17 September 2017 (UTC)
Awesome! Headbomb {t · c · p · b} 20:39, 17 September 2017 (UTC)
Task approved. Current run will complete the WP:MCW pages. -- JLaTondre (talk) 17:09, 24 September 2017 (UTC)

New JCW / MCW run

I have completed integrating the TAR functionality into the main bot as well as making the discussed improvements. This includes:

  • The entries section will show formatting reflecting the entry status (bold underline for existing dab, etc.).
  • The target type (book, database, etc.) is also available if this is ever converted to a template.
  • The normalization has been changed to:
  • Update 1 & 5 above, to include Les
  • Update 4 above, to include des, du, and d'
  • Convert Part|Section|Series A to A (where A is any single character).
  • Allow up to 2 additional characters after the normalized citation, but exclude any results that match an existing page. However, the exclusion is currently an exact match (i.e. doesn't yet handle "Spine." vs "Spine (journal)"). It seems to be bringing in more valid inclusions than false positives, but there are still false positives. I'll continue to think about improvements here.
  • Invalid has been broken out to its own page.

The bot is currently uploading the latest run including this, the expanding templates (section above), and WP:MCW. -- JLaTondre (talk) 17:09, 24 September 2017 (UTC)

JLaTondre (talk · contribs) The 'rank' parameter seems to have been lost [1]. This affects other pages too. Otherwise, looks good. Headbomb {t · c · p · b} 22:28, 24 September 2017 (UTC)
Doh! Fixed. -- JLaTondre (talk) 23:23, 24 September 2017 (UTC)
There's also a newer dump than the September 1st one. Headbomb {t · c · p · b} 22:36, 24 September 2017 (UTC)
Yes. They take quite a while to download. The next one will be out in a week. -- JLaTondre (talk) 23:23, 24 September 2017 (UTC)
[2] is out now? Headbomb {t · c · p · b} 00:31, 25 September 2017 (UTC)
[3] is out as well. Headbomb {t · c · p · b} 22:02, 2 October 2017 (UTC)
Yes, it's currently downloading. -- JLaTondre (talk) 22:55, 2 October 2017 (UTC)
Awesome! The old dump was getting pretty stale, especially since we couldn't do much work based on the September 20 dump save for JCW-CleanerBot Task 1. I'll likely have some [minor] logic tweaks to propose after this one. Headbomb {t · c · p · b} 00:25, 3 October 2017 (UTC)
Done. -- JLaTondre (talk) 02:19, 4 October 2017 (UTC)

Tweaks to JCW/MCW

Minor tweaks for the compilations

  • Add bluebook support (if this hasn't been done already). You can detect {{R from Bluebook}} and variants, or Category:Redirects from Bluebook abbreviations. Prioritize this right after ISO. E.g. for WP:JCW, prioritize as ISO > Bluebook > Journal > Magazine > Newspaper > Website > Database > Book > Publisher.
  • In WP:MCW, prioritize (magazine) over (journal). E.g. in WP:MCW/POP, entry 51, National Geographic in the magazine column should link to National Geographic (magazine) rather than National Geographic (journal).
  • In WP:MCW, rather than prioritizing as ISO > Bluebook > Journal > Magazine > Newspaper > Website > Database > Book > Publisher, prioritize as ISO > Bluebook > Magazine > Journal > Newspaper > Website > Database > Book > Publisher.

Thanks! Headbomb {t · c · p · b} 16:14, 17 October 2017 (UTC)
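A minimal sketch of that kind of type prioritization (Python, illustrative only; the names and data structures are made up, not the bot's):

  JCW_PRIORITY = ['iso', 'bluebook', 'journal', 'magazine', 'newspaper',
                  'website', 'database', 'book', 'publisher']
  MCW_PRIORITY = ['iso', 'bluebook', 'magazine', 'journal', 'newspaper',
                  'website', 'database', 'book', 'publisher']

  def best_type(detected_types, priority):
      # Return the highest-priority type among those detected for a target.
      for t in priority:
          if t in detected_types:
              return t
      return None

  # best_type({'journal', 'magazine'}, MCW_PRIORITY) -> 'magazine'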

The first two were supposed to be there, but had bugs:
  • On output, it was still truncating the type to a single character (which works for the other types). For bluebook, this meant it was shortening to 'b' (the code for a book). Fixed it to correctly output 'bb'.
  • I put the logic in to know what type it was doing, but then forgot to change the order for the magazine case. Fixed.
I'll add the third. It's pretty easy to do. Thanks. -- JLaTondre (talk) 22:26, 18 October 2017 (UTC)
Awesome. Headbomb {t · c · p · b} 15:01, 19 October 2017 (UTC)
This seems to have screwed up JCW logic to work like MCW [4]. Headbomb {t · c · p · b} 00:41, 21 October 2017 (UTC)
That diff is a MCW page where it is prioritizing magazine over journal. On the JCW pages, the only changes I'm seeing are the bluebook types. -- JLaTondre (talk) 12:47, 21 October 2017 (UTC)
Yes, my bad. Looked too fast. Headbomb {t · c · p · b} 21:40, 21 October 2017 (UTC)
The third one (MCW: Magazine > Journal) is done. -- JLaTondre (talk) 00:13, 24 October 2017 (UTC)

Diacritic handling?

Most cases are probably covered by the "ignore up to 2 different characters" rule, but I was wondering about non-English titles like Comptes Rendus de l'Académie des Sciences Série I.

Would the bot realize that Comptes Rendus de l'Academie des Sciences Serie I is a variant of it (in the sense that it would recognize 'ÉéÈèĖėÊêËëĚěĔĕĒēẼẽĘęẸẹ' = e)?

Also,

Headbomb {t · c · p · b} 23:02, 23 October 2017 (UTC)

No, it currently doesn't normalize diacritics. I'll add that. I'll also remove "Série". "Les" sort should be fixed in the current run. -- JLaTondre (talk) 00:22, 24 October 2017 (UTC)
Diacritic handling has been implemented along with removing "Série". The diacritic change resulted in quite a few new entries being picked up. Only noticed one false positive ("Genèses" in "Gene (journal)", since it ends up being a one-character difference, after plural removal, from Gene) that will need to be added to the ignore list (prior section). -- JLaTondre (talk) 00:04, 26 October 2017 (UTC)
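A small sketch of this kind of diacritic folding (Python; illustrative only, not the bot's code):

  import unicodedata

  def strip_diacritics(text):
      # Decompose accented characters and drop the combining marks ('É' -> 'E').
      decomposed = unicodedata.normalize('NFKD', text)
      return ''.join(c for c in decomposed if not unicodedata.combining(c))

  # strip_diacritics("Comptes Rendus de l'Académie des Sciences Série I")
  #   -> "Comptes Rendus de l'Academie des Sciences Serie I"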

Tweak to /Target

I've been thinking of ways to get more hits for title variants. There should be a synonym list, similar to the type detection. I think you have some of that in place, but it could be expanded:

  • Abhandlungen = Abh.
  • Annal(s) = Ann.
  • Bulletin(s) = Bull.
  • Compte(s) Rendu(s) / C. R. = C.R.
  • Journal(s) = J.
  • Letter(s) = Lett.
  • Notice(s) = Not.
  • Proceeding(s) = Proc.
  • Publication(s) = Publ.
  • Review(s) = Rev.
  • Transaction(s) = Trans.
  • Zeitschrift = Z.
  • Magazine = Mag.
  • Newsletter = Newsl.
  • Encyclopaedia / Encyclopædia = Encyclopedia
  • Catalogue = Catalog
  • Supplementum / Supplement(s) = Suppl.

Also, to help find 'related' entries, if no entries exist, then the bot should combine

  • Supplementum / Supplement(s) / Suppl.

E.g. Acta Neurol Scand Suppl should be considered as Acta Neurol Scand unless Acta Neurol Scand Suppl points to a different target than Acta Neurol Scand. Headbomb {t · c · p · b} 15:08, 19 October 2017 (UTC) Headbomb {t · c · p · b} 15:01, 19 October 2017 (UTC)
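One way to express such a synonym table (Python sketch, illustrative only; whether variants are folded to the abbreviation or to the full word doesn't matter, as long as it is consistent):

  SYNONYMS = {
      'abhandlungen': 'abh', 'annal': 'ann', 'annals': 'ann',
      'bulletin': 'bull', 'bulletins': 'bull',
      'journal': 'j', 'journals': 'j',
      'letter': 'lett', 'letters': 'lett',
      'notice': 'not', 'notices': 'not',
      'proceeding': 'proc', 'proceedings': 'proc',
      'publication': 'publ', 'publications': 'publ',
      'review': 'rev', 'reviews': 'rev',
      'transaction': 'trans', 'transactions': 'trans',
      'zeitschrift': 'z', 'magazine': 'mag', 'newsletter': 'newsl',
      'encyclopaedia': 'encyclopedia', 'encyclopædia': 'encyclopedia',
      'catalogue': 'catalog',
      'supplementum': 'suppl', 'supplement': 'suppl', 'supplements': 'suppl',
  }

  def fold_synonyms(normalized_title):
      # Fold each word to its canonical token before comparing titles.
      out = []
      for word in normalized_title.lower().split():
          key = word.rstrip('.')
          out.append(SYNONYMS.get(key, key))
      return ' '.join(out)

Multi-word forms like Compte Rendu / C. R. / C.R. would need separate handling before the word-by-word pass.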

The additional normalizations have been added. As far as matches to the normalizations, the bot already works as described. It will only aggregate citations if they don't resolve to a different target. The bot is currently saving updated output with the first two changes from the previous section and the additional normalization. Please let me know if you see any issues. -- JLaTondre (talk) 20:19, 20 October 2017 (UTC)
Will do! Headbomb {t · c · p · b} 23:47, 20 October 2017 (UTC)

One issue the normalization causes/exposes is when the bot normalizes to "J" (and "Z"), you have things like Cell vs Cell Journal which are different publications. But because you normalize "Cell Journal" to "Cell J", that only differs by 1 character from "Cell", and it gets lumped with the Cell entry. Maybe instead "J" (and other variants of Journal) should be normalized to "Journal", and "Z" normalized to "Zeitschrift"? Headbomb {t · c · p · b} 00:35, 21 October 2017 (UTC)

Yes, changed the normalization to be the longer version in all cases. I also added a check for "NORMALIZED journal" (for JCW) and "NORMALIZED magazine" (for MCW) which picked up a few more cases (ex. "Popular mechanics magazine" in MCW #43). The target pages have been updated with the changes. However, the JCW Invalid page is now showing results that are actually ok, so I need to check into what is going on there. -- JLaTondre (talk) 18:13, 21 October 2017 (UTC)
I'm not sure if "NORMALIZED journal" is a good idea (it seems to work fine for magazines though). There are some false positives, like Circulation (journal) vs Circulation Journal. It does pick up good stuff. Just wondering how to make it avoid bad stuff, if that's possible. Headbomb {t · c · p · b} 22:24, 21 October 2017 (UTC)
Looks like it picked up 21 journal occurrences that otherwise wouldn't have been caught. False positives are always going to occur with any normalization approach. If the second journal is notable enough for an article, creating an article (either at that name or redirect that name to it) would remove it from the list (as it would resolve to its own target). Otherwise, if it's a big enough deal, I can add the ability to suppress false positives via a configuration. -- JLaTondre (talk) 21:37, 22 October 2017 (UTC)

Alternatively, we could have a manually maintained /Distinct subpage, where we could have something like

  • Cell / Cell Journal
  • Circulation / Circulation Journal
  • Fobar / Foobar

that lists entries that are distinct journals and should not be amalgamated. If that's implementable, I leave the exact format to you. Headbomb {t · c · p · b} 21:58, 22 October 2017 (UTC)

Yes, that was what I was thinking when I said via a configuration. -- JLaTondre (talk) 23:41, 22 October 2017 (UTC)

re Invalid: The Invalid list is generated via the common targets processing. This means it has actually always had results that are okay, as they match via the normalization. The recent normalization changes just increased the number of valid normalized results being returned. It would be easy to pull that out & just list only the invalids, but it would lose the article count column. Preference? -- JLaTondre (talk) 21:37, 22 October 2017 (UTC)

For invalids, the counts are nice, but they're not really important. Headbomb {t · c · p · b} 21:59, 22 October 2017 (UTC)
Okay, I will pull them out. -- JLaTondre (talk) 23:41, 22 October 2017 (UTC)
More importantly, the new dump is out. Headbomb {t · c · p · b} 22:01, 22 October 2017 (UTC)
It's been downloading for the last couple hours. Probably won't run against it until tomorrow. -- JLaTondre (talk) 23:41, 22 October 2017 (UTC)
Results from the 1020 dump are saving now. It contains all the changes discussed above except removing false positives from the common targets. I created User:JL-Bot/Citations.cfg to manage the false positives, but I haven't finished implementing it yet (and format of that page may change). I'll finish that off next. -- JLaTondre (talk) 00:21, 24 October 2017 (UTC)
Awesome! Ping me once you have finalized the format (no rush for it, I'll have plenty to fix in the meantime; I can ignore the 20-some corner cases rather easily). Headbomb {t · c · p · b} 00:31, 24 October 2017 (UTC)
The new invalid pages aren't correct. I'll fix that. -- JLaTondre (talk) 02:27, 24 October 2017 (UTC)
Invalid pages fixed. -- JLaTondre (talk) 00:10, 26 October 2017 (UTC)

@Headbomb: False positive ignoring has been implemented. I added the "Circulation (journal)" / "Circulation Journal" case to User:JL-Bot/Citations.cfg and it has been properly ignored. Let me know if you have any questions. Once you add additional ignores, I can re-run the common targets processing. -- JLaTondre (talk) 00:10, 26 October 2017 (UTC)

@JLaTondre: Alright, give it a whirl. Headbomb {t · c · p · b} 00:55, 26 October 2017 (UTC)
Done. -- JLaTondre (talk) 02:54, 26 October 2017 (UTC)

Bluebook

The October 23 code update reverted the bluebook behaviour to something broken. See [5]. Headbomb {t · c · p · b} 18:25, 7 November 2017 (UTC)

Fixed. Latest dump is running. Had an issue that took a while to resolve (problematic citation that was about 2 hours into the dump file; made debugging slow). -- JLaTondre (talk) 00:23, 8 November 2017 (UTC)
Yup, works well now. Headbomb {t · c · p · b} 14:02, 15 November 2017 (UTC)

c't not recognized as a magazine

c't (#806) in Wikipedia:WikiProject_Magazines/Magazines_cited_by_Wikipedia/Popular4 isn't recognized as a magazine for the target. What gives? Headbomb {t · c · p · b} 19:11, 7 November 2017 (UTC)

It's resolving - based on the C't (magazine) redirect - to a target of c't (lowercase c) which doesn't match a page name it knows since Wikipedia page names always start with a capital. It's treating it as a red link. However, when displayed, the Wikimedia software is smart enough to ignore the case of the first letter. I have fixed it to always uppercase the first letter of a redirect target, but that requires re-running the whole database parsing. I'll upload the results based on the current parsing results and then re-parse. -- JLaTondre (talk) 00:44, 8 November 2017 (UTC)
Updated results with this fix are being uploaded now. -- JLaTondre (talk) 23:48, 9 November 2017 (UTC)
Yup, works well now. Headbomb {t · c · p · b} 14:02, 15 November 2017 (UTC)
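For reference, the fix described above amounts to something like this (Python sketch; illustrative only):

  def canonical_first_letter(title):
      # MediaWiki page names are case-insensitive in their first character,
      # so uppercase it before matching against known page titles.
      return title[:1].upper() + title[1:]

  # canonical_first_letter("c't") -> "C't"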

& and redirects

In Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/E19, see the "Epigenetics Chromatin" ISO entry, which says it redirects to Epigenetics & rather than Epigenetics & Chromatin. This is very likely caused by the encoded & in #REDIRECT [[Epigenetics &#38; Chromatin]]. Headbomb {t · c · p · b} 01:30, 21 November 2017 (UTC)

I'll update it to handle that situation. -- JLaTondre (talk) 01:55, 21 November 2017 (UTC)
Should be fixed. Latest dump is processing now. -- JLaTondre (talk) 15:11, 24 November 2017 (UTC)
@JLaTondre: Not 100% fixed. See entry 604 in Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Target7. Headbomb {t · c · p · b} 03:49, 25 November 2017 (UTC)
Should be really fixed now. Updated results have been posted. -- JLaTondre (talk) 19:19, 25 November 2017 (UTC)
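A sketch of handling the encoded ampersand (Python; illustrative only, not the bot's actual parser):

  import html
  import re

  def redirect_target(wikitext):
      # Decode numeric/named entities first so '&#38;' becomes '&',
      # then pull out the #REDIRECT target.
      decoded = html.unescape(wikitext)
      m = re.search(r'#REDIRECT\s*\[\[([^\]|]+)', decoded, re.IGNORECASE)
      return m.group(1).strip() if m else None

  # redirect_target("#REDIRECT [[Epigenetics &#38; Chromatin]]")
  #   -> "Epigenetics & Chromatin"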

Next run

For the next run, could you do the 1000 most popular targets for journals rather than the 500 most popular ones? I found a way to greatly increase my work speed for cleanup. I've nearly gone through all 500 (about 30 remain) in 4-5 days, whereas before I would only be able to do cleanup for 100-ish entries per dump. Headbomb {t · c · p · b} 14:01, 15 November 2017 (UTC)

That's just for journals BTW. Increasing magazines wouldn't yield anything interesting. Headbomb {t · c · p · b} 14:05, 15 November 2017 (UTC)
Do you want to double the number of records per page or double the number of pages? -- JLaTondre (talk) 01:51, 21 November 2017 (UTC)
Went with double the number of pages since that was easier. -- JLaTondre (talk) 15:11, 24 November 2017 (UTC)
Yeah, double the pages is best here. Headbomb {t · c · p · b} 18:58, 24 November 2017 (UTC)
The footers for the /Target page only go up to 5. I've had to fix them manually. Something to keep in mind for the next run. Headbomb {t · c · p · b} 21:13, 26 November 2017 (UTC)
Fixed. -- JLaTondre (talk) 21:25, 1 December 2017 (UTC)
@JLaTondre: Not quite, now they list 250 entries per page! Headbomb {t · c · p · b} 21:34, 1 December 2017 (UTC)
I fixed that before posting the above so if you saw it after, it must have been a Wikipedia caching issue. -- JLaTondre (talk) 00:18, 2 December 2017 (UTC)
Looking forward to seeing the next dump's results. I did a crapton of cleanup on targets. Headbomb {t · c · p · b} 02:26, 2 December 2017 (UTC)

Recognized content / DYK weirdness?

What happened here or here? Headbomb {t · c · p · b} 23:50, 9 December 2017 (UTC)

Looks like the {{WikiProject Academic Journals}} transclusion query failed and not all results were returned. Re-ran those two pages. -- JLaTondre (talk) 03:16, 10 December 2017 (UTC)

WP:RECOG hasn't run in a while...

Any word on when the next run will be? Headbomb {t · c · p · b} 00:11, 3 March 2018 (UTC)

It's running today. I didn't have a chance last weekend. -- JLaTondre (talk) 13:53, 3 March 2018 (UTC)

WP:JCW and non-breaking space

In Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Target2, in entry #156, the same string (Acta Palaeontologica Polonica) is listed twice.

This happens because in some articles, Acta Palaeontologica Polonica is actually Acta_Palaeontologica Polonica, where _ is a non-breaking space (which isn't declared with a &nbsp;). Those should be stripped out and converted to regular spaces.

This also causes a dual listing: Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/A10 (Acta Palaeontologica Polonica) and Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/A11 (Acta_Palaeontologica Polonica). Headbomb {t · c · p · b} 03:17, 7 March 2018 (UTC)

Underscores should already be converted to spaces. I will check what is going on. -- JLaTondre (talk) 22:01, 7 March 2018 (UTC)
@JLaTondre: It's not underscores, it's non-breaking spaces. Just not ones that are declared via &nbsp;, but rather a direct use of a non-breaking space character. I used underscores above simply to display their location in the string. Headbomb {t · c · p · b} 22:50, 7 March 2018 (UTC)
Yes, it's Unicode non-breaking spaces. Put in a fix & re-running. -- JLaTondre (talk) 22:58, 7 March 2018 (UTC)
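The fix is essentially a whitespace normalization along these lines (Python sketch; illustrative only):

  import re

  def normalize_whitespace(text):
      # Fold non-breaking (U+00A0) spaces into plain spaces,
      # then collapse runs of whitespace.
      return re.sub(r'\s+', ' ', text.replace('\u00a0', ' ')).strip()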

Article nominator

Hi, such a great bot! Is it possible to add the editor(s) who nominated the article to the report output? E.g.

Featured articles

Thanks, – Lionel(talk) 02:03, 9 March 2018 (UTC)

Only if the information is available in a template on the article talk page. I don't believe that is the case. -- JLaTondre (talk) 00:20, 10 March 2018 (UTC)
For Featured articles, nom. appears to be the editor who creates the review sub page/the first editor to comment e.g. [6] and Wikipedia:Featured_article_candidates/Ronald_Reagan. DYK is similar. What would it entail to get the FA subpage and scrape off the user name?– Lionel(talk) 00:31, 10 March 2018 (UTC)
For GA it appears that Legobot maintains a table of nominators somewhere.– Lionel(talk) 00:34, 10 March 2018 (UTC)
That's not something I'll take on. This task already takes the better part of a day to run, and the additional look-ups would significantly expand it. In addition, there is a lot of complexity in using the review pages (ex. for 2009 World Series, there is no Wikipedia:Featured article candidates/2009 World Series; instead the correct one is Wikipedia:Featured article candidates/2009 World Series/archive3). You may wish to post at Wikipedia:Bot requests to see if anyone would be willing to take on creating a separate listing of featured articles by nominator (either all or project specific). -- JLaTondre (talk) 15:44, 11 March 2018 (UTC)

Trailing garbage

I've been thinking of ways to find additional entries for WP:JCW/TAR. One of the common patterns in many journals is the inclusion of 'garbage' in the journal parameter. What I meant by that is something like |journal=''[[Microprocessor Report]]''. Volume 13, Number 3.

If the bot could ignore trailing Foobar + (e)digits, where Foobar is

  • Volume/Vol./Vol/V./V
  • Number/Num./Num/No./No/#/Issue/Iss./Iss
  • Pages/Page/Pp./P.
  • Digits

The idea is that things like

  • |journal=''[[Microprocessor Report]]''. Volume 13, Number 3.
  • |journal=Fairplay International Shipping Journal, Volume 178
  • |journal=Qualitative Social Work 7.2
  • |journal=PLoS ONE 11(7): e0157717

would get picked up. Existing publications with trailing stuff should remain as is however (e.g. Journal of Bone and Joint Surgery, American Volume/Back Issue!/Zzap!64 shouldn't be stripped to Journal of Bone and Joint Surgery, American/Back!/Zzap!). Headbomb {t · c · p · b} 14:58, 30 October 2017 (UTC)

Currently, journal parameters are parsed during the database dump processing. At that point, it doesn't have a full listing of all page titles (as titles come from the dump processing as well). This would require either splitting it into two sections or doing follow-up processing. Both are doable. I'll have to think on the best approach. -- JLaTondre (talk) 23:44, 2 November 2017 (UTC)
If you don't have time to think before the next dump, run the existing code on whatever new dump is out there and reprocess it later once you have the new logic in mind. Headbomb {t · c · p · b} 23:54, 2 November 2017 (UTC)
Latest dump is out btw. Headbomb {t · c · p · b} 20:07, 4 November 2017 (UTC)
Latest dump is out BTW, although if you want to skip it, I'm not going to be able to do much work between now and the next dump on January 1st or so. A very merry Christmas and holiday season to you and anyone close to you as well. Headbomb {t · c · p · b} 21:24, 24 December 2017 (UTC)
Thanks! Had a good Christmas; hope you did also. I will process the next dump once it completes. -- JLaTondre (talk) 22:27, 1 January 2018 (UTC)
@Headbomb: As it has been awhile, wanted to give a status update. I haven't been able to work on it as much as I expected. I just finished updating it so that extracting the citation templates from the database dump and then parsing the journal/magazine from the citation templates are separate steps (this will allow me to have all the article titles during the second step). I verified I'm getting the same results as before. Next step will be to add in removing the additional patterns. The new separation of steps will also make future debugging easier as I won't have to re-parse the whole database dump when testing. -- JLaTondre (talk) 01:26, 22 February 2018 (UTC)
Awesome. I'm giving a talk on this Friday morning, so if possible wait to deploy the new logic Friday afternoon or later, just in case something catastrophic happens. Headbomb {t · c · p · b} 03:55, 22 February 2018 (UTC)
Not a problem. It will be a bit before I can complete that. I did download the 02/20 dump, but probably won't get a chance to process it until Sunday. -- JLaTondre (talk) 02:32, 23 February 2018 (UTC)
The presentation went really well. People (librarians) loved it and were really impressed with how we dealt with journals. Headbomb {t · c · p · b} 18:41, 24 February 2018 (UTC)
@Headbomb: This has been implemented and the results are saving now. For the most part, it does seem to improve the results. Though there are some cases where it doesn't (ex. "Supplement to Volume 31" becomes "Supplement to"). I recommend reviewing the changes pretty closely to see if it is a net positive. Unfortunately, since it results in a consolidation in some of the output, it does affect the pagination, so most of the pages did change even though most don't have an actual result change. I did not implement just trailing digits as that had way too many false positives. -- JLaTondre (talk) 18:59, 10 March 2018 (UTC)
I'll take a look. I wonder why there were so many false positives. From what I saw there should have been very few, or at least a manageable amount. Maybe a tweak on how many trailing digits are allowed? Were they mostly years? Headbomb {t · c · p · b} 19:10, 10 March 2018 (UTC)

@JLaTondre: Ah, I see where the confusion happened. I didn't mean for the bot to remove trailing garbage from WP:JCW/ALPHA or WP:JCW/POP, only for WP:JCW/TAR (and only for the purpose of finding matches). Meaning that if you had |journal=Letters to Nature, '''3'''(4):13-34, it would get reported at WP:JCW/TAR in the Nature entry, because it only differs from Letters to Nature (also regrouped under the Nature entry) by 'trailing garbage'. Headbomb {t · c · p · b} 19:22, 10 March 2018 (UTC)

Ah, you just wanted the normalization changed. I'm going to restore the prior version & then will do that. It's the same logic, only applied on a different field. -- JLaTondre (talk) 19:50, 10 March 2018 (UTC)
Okay, how's that? -- JLaTondre (talk) 23:39, 10 March 2018 (UTC)
Looks pretty good. I don't see the bunch of Classic Rock #(\d+) and Dragon #(\d+) entries (see WP:MCW/Invalid) in the WP:MCW target pages, however. Those should be picked up. Headbomb {t · c · p · b} 00:22, 11 March 2018 (UTC)

It also missed |journal=Proceedings of the National Academy of Sciences 112 (28) in WP:JCW/Target1. Headbomb {t · c · p · b} 01:00, 11 March 2018 (UTC)

I made two changes to address the above items:
  1. The target pages were excluding invalid links. I changed it so they are no longer excluded and it is picking up the magazine issues.
  2. I changed the normalization to strip "DIGITS DIGITS" at the end (prior it was just doing " DIGITS"). This occurs after the punctuation is removed so catches the above case. It actually picked up quite a bit as it catches year ranges also (which turned out to be common).
Take a look at the new target results. -- JLaTondre (talk) 15:23, 11 March 2018 (UTC)
Seems perfect! As always, I'm impressed at how quickly/efficiently you turn my ideas into reality. I've cleaned up everything in Target1 to Target6, I'll be tackling Target7-Target10 tonight/later this week. Really looking forward to the next dump to see the cleaned up list. If I find other cases the bot should pick up, I'll let you know :). Headbomb {t · c · p · b} 00:51, 12 March 2018 (UTC)

WP:RECOG DYK counter is off...

In Wikipedia:WikiProject Women in Red/DYK, I asked for the first 20 DYKs to be transcluded, but the <noinclude> section spans 19 entries. This was also true when I asked it to include 10 entries, but it spanned 9. Headbomb {t · c · p · b} 22:32, 11 March 2018 (UTC)

The sixth blurb covers two athletes who are both part of the project, so it actually gets picked up twice. The bot is 'smart' enough to not display the line twice, but it does count it the second time, as the parameter is currently the number of pages for which to display a blurb (not the number of blurbs). Options are:
  1. Continue to count pages and use that for both the noinclude and the count at the end (i.e. no change)
  2. Change the parameter to be the number of blurbs displayed before the noinclude and continue to display the number of pages at the end
  3. Change the parameter to be the number of blurbs displayed before the noinclude and change the count at the end to the total number of blurbs (vs. the number of articles)
If we go with the second one, then we also need new verbiage for the <includeonly>Transcluding 20 of 776 total</includeonly> as the 20 would become blurbs, but the 776 would be articles and they don't necessarily match (for this project, there are only 768 blurbs displayed for the 776 pages as there are 8 duplicates). Both 2 & 3 are easy changes. Preference? -- JLaTondre (talk) 23:38, 12 March 2018 (UTC)
Ah, I see. That IS clever. However, for the purposes of displaying DYK blurbs on pages like WP:WPWIR, I think what most people would expect is the number of distinct blurbs, rather than distinct DYKs. If you have one of these multi-article blurbs, I doubt you'd want the list to be cut down to the first 10 entries / have that one blurb dominate over all the other ones. Headbomb {t · c · p · b} 00:06, 13 March 2018 (UTC)
Option three has been implemented and run against that page. -- JLaTondre (talk) 01:39, 13 March 2018 (UTC)

Trailing garbage, part 2

In Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/P23

We have entries like

PLoS One. 2013;8(3):e59354
PLoS One. 2014 Nov 6;9(11):e112361.
PLoS One. 2016; 11(8): e0160614
PLoS Pathog. 2014 Apr; 10(4): e1004071;

Those could probably be picked up with some code-fu. The base string there is PLoS One, which is recognized as a variant of PLOS One (WP:JCW/TAR #9).

What comes after that only differs by punctuation, digits [or e+digits], or brackets. I didn't think about dates, but those would be easy to include.

Step 1

Strip whitespace
PLoSOne.2013;8(3):e59354
PLoSOne.2014Nov6;9(11):e112361.
PLoSOne.2016;11(8):e0160614

Step 2, strip punctuation, brackets

PLoSOne201383e59354
PLoSOne2014Nov6911e112361
PLoSOne2016118e0160614

Step 3: Strip trailing
(January|Jan|February|Feb|March|Mar|April|Apr|June|Jun|July|Jul|August|Aug|September|Sep|Sept|October|Oct|November|Nov|December|Dec|\d+e\d+|\d+)+

PLoSOne
PLoSOne
PLoSOne

which is within one character of PLoS One (PLoSOne after whitespace stripping), so it should be picked up. Headbomb {t · c · p · b} 13:21, 24 March 2018 (UTC)

I'm not exactly sure how you implemented this in your bot, but there should be a way to get those extra matches. Headbomb {t · c · p · b} 13:40, 24 March 2018 (UTC)
Done. New version uploaded. The prior implementation would remove only a single " \d+ \d+" at the end of the string. I fixed it to remove multiple occurrences as well as the month. It's now picking up all the PLoSOne cases above as well as quite a few other cases. Take a look and let me know if you see any problems. -- JLaTondre (talk) 01:43, 25 March 2018 (UTC)
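For what it's worth, the three steps above boil down to something like this (Python sketch; illustrative only, not the actual implementation, and the month list is roughly the one in the regex above):

  import re

  MONTHS = (r'january|jan|february|feb|march|mar|april|apr|may|june|jun|'
            r'july|jul|august|aug|september|sept|sep|october|oct|'
            r'november|nov|december|dec')

  def squash(citation):
      s = citation.lower()
      # Step 3 first, while token boundaries still exist: peel months, digits,
      # and e-digits (plus separators) off the end of the string.
      s = re.sub(rf'(?:[\s.,;:()\[\]-]+(?:{MONTHS}|e\d+|\d+))+[\s.,;:()\[\]-]*$', '', s)
      # Steps 1 and 2: drop whitespace and punctuation for the comparison.
      return re.sub(r'[^\w]', '', s)

  # Each of the PLoS examples above squashes to the same base string:
  #   squash('PLoS One. 2014 Nov 6;9(11):e112361.') -> 'plosone'
  #   squash('PLoS Pathog. 2014 Apr; 10(4): e1004071;') -> 'plospathog'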

WP:JCW/TAR whitespace tweak/bug?

PLOSOne here was not picked up as a variant of PLOS One in WP:JCW/TAR (entry #9), even though it only differs by whitespace. It should have been. Headbomb {t · c · p · b} 13:24, 24 March 2018 (UTC)

PLOSOne normalizes to "plosone" and PLOS One normalizes to "plo one" (due to the plural removal). The "differs by 1-2 characters" check only applies at the end of a string. Testing for a character difference in the middle of the string is not straightforward and prone to false positives since there are words that differ by one character (ex. accent vs. accept). I did test removing all spaces & s's, but that causes way too many false positives. The normalization is not going to be able to catch everything without letting through a bunch of unwanted stuff. -- JLaTondre (talk) 18:35, 24 March 2018 (UTC)
But if you do the differs-by-1-2 test at the end, wouldn't that cover the plurals implicitly, since they would only differ by 1 character (the s)? Headbomb {t · c · p · b} 19:33, 24 March 2018 (UTC)
Oh, you mean at the end of the string, rather than at the end of the process? I think the extra false positives would be rather straightforward to handle via User:JL-Bot/Citations.cfg. Although I can't really come up with a straightforward way to compare strings that are offset by a character (e.g. abcdefghi vs abdefghi) either. I thought you could do it with arrays, but those would only be useful for strings that differ by substitutions (abcdefghi vs abZdefghi). Headbomb {t · c · p · b} 19:40, 24 March 2018 (UTC)
When I did a quick test, there were a lot of false positives; more than I think you want to weed through. Let me think about the approach you listed below. If I can make that work, I think it would be better. If not, then I can do a trial run with being more aggressive with the spaces & s's and you can evaluate the results. -- JLaTondre (talk) 01:45, 25 March 2018 (UTC)

Middle of the string comparison

I think I have an idea for middle-of-the-string comparison.

Substitutions are easy to do

Candidate string (after all the cleanup and normalization)

ab3de5ghijklmnopqrstuvwxyz

Target string (after normalization)

abcdefghijklmnopqrstuvwxyz

The string lengths differ by 2 or less, so a match is possible. Shove individual characters of both strings in arrays, compare if the values of the arrays match. 0 for match, 1 for mismatch

00100100000000000000000000

Sum the values of the comparison. If the sum is 2 or less, it's a match.

For a string offset, for a missing character, this will yield something like

Candidate string

a0cdfghijklmnopqrstuvwxyz

Target string

abcdefghijklmnopqrstuvwxyz

The string lengths differ by less than 2, so a match is possible. Compare the arrays

01001111111111111111111111

The sum of values is greater than 2. Go back to the original string (because it is shorter than the target string) and insert a dummy character at the position where the first run of consecutive mismatches begins.

Candidate string

a0cd♠fghijklmnopqrstuvwxyz

Target string

abcdefghijklmnopqrstuvwxyz

Compare the arrays

01001000000000000000000000

which sums to 2, and would now be considered a match.

For a string with an extra character, this will yield something like

a0cdeefghijklmnopqrstuvwxyz

Target string

abcdefghijklmnopqrstuvwxyz

The string lengths differ by less than 2, so a match is possible. Compare the arrays

010001111111111111111111111

This sum is greater than 2. Go back to the target string (because it is shorter than the candidate string) and insert a dummy character at the position where the first run of consecutive mismatches begins.

Candidate string

a0cdeefghijklmnopqrstuvwxyz

Target string

abcde♠fghijklmnopqrstuvwxyz

Compare the arrays

010001000000000000000000000

which now sums to 2, and would now be considered a match.

Headbomb {t · c · p · b} 20:08, 24 March 2018 (UTC)

Obviously this should only apply when string lengths are greater than 3, otherwise this is a meaningless comparison. Headbomb {t · c · p · b} 15:18, 26 March 2018 (UTC)
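If it helps, here is a small Python sketch of the comparison as described above (illustrative only; essentially a cheap bounded edit-distance check, and the dummy character and all names are made up):

  def positional_diffs(x, y):
      # Count position-by-position mismatches over the overlapping length.
      return sum(1 for cx, cy in zip(x, y) if cx != cy)

  def first_run(x, y):
      # Start of the first run of two or more consecutive mismatches.
      d = [cx != cy for cx, cy in zip(x, y)]
      return next((i for i in range(len(d) - 1) if d[i] and d[i + 1]), len(x))

  def close_match(a, b, max_diff=2):
      if min(len(a), len(b)) <= 3 or abs(len(a) - len(b)) > max_diff:
          return False
      if positional_diffs(a, b) <= max_diff:           # substitutions only
          return True
      short, long_ = (a, b) if len(a) <= len(b) else (b, a)
      i = first_run(short, long_)                      # offset caused by a missing
      padded = short[:i] + '\u0000' + short[i:]        # or extra character
      return positional_diffs(padded, long_) <= max_diff

  # close_match('ab3de5ghijklmnopqrstuvwxyz', 'abcdefghijklmnopqrstuvwxyz') -> True
  # close_match('a0cdfghijklmnopqrstuvwxyz',  'abcdefghijklmnopqrstuvwxyz') -> True

A full edit-distance (Levenshtein) computation with an early cut-off would cover the same cases more generally; the above just mirrors the array description.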
I have an initial implementation of the comparison algorithm and it seems to work well. I still have to integrate it into the citation processing. I will start with a difference of two, but I feel it may warrant adjustment based on the size of the strings being compared. Two is a pretty big difference for a 3-character string, but maybe not that big for a 30-character string. I'll play around with that to see the impact. -- JLaTondre (talk) 00:43, 27 March 2018 (UTC)
Maybe diff = 0 for <2 length, diff = 1 for 3-10, diff = 2 for 11-30, then add 1 per extra 10 characters of length (diff = 3 for >30 length, diff = 4 for >40 length)? But I think it's best to keep it simple for now with simply diff = 2, and see where that gets us. Legit diffs of 3+ should be pretty rare. Headbomb {t · c · p · b} 02:26, 27 March 2018 (UTC)
@Headbomb: Updated target results using the above comparison have been uploaded. It is currently grouping anything with 2 or less differences (where strings are at least 3 characters long). It has picked up a lot more relevant results (in particular spelling errors, etc.), but looks like a lot of false positives as well at the smaller string sizes (3-5 characters). Take a look at the results. I can change the thresholds if needed. -- JLaTondre (talk) 20:50, 31 March 2018 (UTC)

Awesome, I'm at the gym right now, but I'll take a look when i get home later tonight. Hopefully be able to do some cleanup before the early April dump. Headbomb {t · c · p · b} 21:04, 31 March 2018 (UTC)

With a quick review, I think it's best if you implement the variable threshold as mentioned above. Or at least diff = 1 for length 3-5, and diff = 2 for longer than 5. Headbomb {t · c · p · b} 21:10, 31 March 2018 (UTC)

More detailed review. I can't fathom why some of those are picked up. For instance "Nature S.Rep." or "Nature's Web" is considered a variant of "Nature" or some of its existing redirects. Strip and normalize, you get Naturesrep or Naturesweb, which differ by 4 from Nature or Naturenews. Likewise, it picks up "Nutr Res" which normalized is "Nutrres", which is not a match for anything AFAICT. Headbomb {t · c · p · b} 22:51, 31 March 2018 (UTC)

I think I see what's happening. Looking at #49 (Myconet), it's clear you're still stripping the final s before comparing arrays. The new logic makes this obsolete/detrimental. Headbomb {t · c · p · b} 03:27, 1 April 2018 (UTC)
New version uploaded. There were valid cases that stripping plurals would find for longer strings. I removed stripping plurals, but changed the logic so that 3-5 character strings use 1 delta (which also cuts down on more false positives), 6-20 use 2 deltas, and 21+ use 3 deltas. -- JLaTondre (talk) 17:07, 1 April 2018 (UTC)
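The length-dependent thresholds described here could be expressed as follows (sketch; the cut-offs are the ones listed above, and treating strings under 3 characters as having no fuzzy matching at all is an assumption):

  def allowed_delta(length):
      # 3-5 characters: 1 difference; 6-20: 2; 21 and up: 3.
      if length <= 2:
          return 0    # assumption: no fuzzy matching for very short strings
      if length <= 5:
          return 1
      if length <= 20:
          return 2
      return 3

  # e.g. close_match(a, b, allowed_delta(min(len(a), len(b))))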
Awesome, it looks much cleaner and just as useful, if not more so. I noticed nothing that was dropped that shouldn't have been, and the 3 deltas for 21+ picks up more things. I've added a couple of bypasses in the /config subpages, but what's needed now is the new dump data. I cleaned up pretty much all entries from 1 to ~750 before the middle-of-string lookups, and the first 100 or so entries with the new logic. Headbomb {t · c · p · b} 19:33, 1 April 2018 (UTC)

User:JL-Bot/Citations.cfg issue?

In User:JL-Bot/Citations.cfg, JAMA (journal) has several exclusions (J Ir Med Assoc / J Ky Med Assoc / J Nat Med Assoc / J S C Med Assoc), yet Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Target1 (entry 42) still lists them. What gives? Whitespace issue? Headbomb {t · c · p · b} 03:14, 3 April 2018 (UTC)

Yes, white space. It was expecting | journal and not |journal. I made it more tolerant & updated the results. -- JLaTondre (talk) 22:24, 3 April 2018 (UTC)

Create shortcuts

Could JL-Bot create shortcuts like

And likewise for WP:MCW. This would make it much easier to refer to a specific page. Headbomb {t · c · p · b} 15:50, 3 April 2018 (UTC)

Done. It will also create them when it creates new pages from now on. -- JLaTondre (talk) 22:35, 4 April 2018 (UTC)

20180420 Results

There is something up with the 20180420 results. It is producing far fewer results than it should. I'm looking into it. -- JLaTondre (talk) 20:15, 22 April 2018 (UTC)

It looks like the 20180420 dump is faulty. It is almost 2 GB smaller than the last one & it appears pages are missing. -- JLaTondre (talk) 01:41, 23 April 2018 (UTC)
I got a 13 GB one which is very similar to past ones. Maybe use a different dump file? Headbomb {t · c · p · b} 02:08, 23 April 2018 (UTC)
Nevermind, it's 12GB, not 13.something. Headbomb {t · c · p · b} 10:35, 23 April 2018 (UTC)

@JLaTondre: any way you could finish the bot run, even with the partial dump? Headbomb {t · c · p · b} 16:28, 23 April 2018 (UTC)

Done. -- JLaTondre (talk) 00:30, 24 April 2018 (UTC)

A new dump file was released & I processed it. However, I left in debugging from checking the above & so not everything was processed. Re-running... -- JLaTondre (talk) 00:35, 26 April 2018 (UTC)

WP:JCW/Missing1 doesn't work correctly

All, or nearly all, journals listed have existing entries. Headbomb {t · c · p · b} 23:58, 23 April 2018 (UTC)

Byproduct of the incomplete dump. If the page wasn't in the dump, it will show up as missing. -- JLaTondre (talk) 00:32, 24 April 2018 (UTC)
@JLaTondre: The bot crapped the bed recently: e.g. [7]. Headbomb {t · c · p · b} 00:53, 26 April 2018 (UTC)
See my note in the section above. -- JLaTondre (talk) 01:41, 26 April 2018 (UTC)

User:JL-Bot/Citations.cfg

Could you update the bot to recognize this new format? It would make editing the list so much easier. Headbomb {t · c · p · b} 22:18, 27 April 2018 (UTC)

Yes, that's doable. -- JLaTondre (talk) 23:44, 27 April 2018 (UTC)
Done. It expects a single template per line so as long as that's maintained it should work. -- JLaTondre (talk) 23:42, 29 April 2018 (UTC)
Works pretty well, although I'm trying to exclude Trends (journal) from Trends (journals), even though it redirects there (see WP:JCW/Target1, entry 32, right below TRENDS in Ecology and Evolution). Any suggestions? Something like this? Headbomb {t · c · p · b} 14:26, 3 May 2018 (UTC)
TAR works by combining all items that have the same target (from the regular pages) and then adding other records that have the same normalization (using the fuzzy match). The false-positive exclusions were only being applied in the second step. I've added them to the first step also. The configuration format can remain the same. When there is a [[first|second]] record, be sure to use the first in the exclusions (for both the target & the excluded page). -- JLaTondre (talk) 21:52, 3 May 2018 (UTC)

Also if you could rerun the bot on WP:JCW/TAR, that would be great. We have two new massive catch-all articles (List of Hindawi academic journals/List of MDPI academic journals) with hundreds/thousands of incoming redirects, so that caught a lot of unrelated journals before having the exclusions setup. Headbomb {t · c · p · b} 15:06, 3 May 2018 (UTC)

Done. -- JLaTondre (talk) 00:06, 4 May 2018 (UTC)

Question by User:Pbsouthwood

Moved from Wikipedia talk:Bots/Requests for approval/JL-Bot 5. Headbomb {t · c · p · b} 11:51, 19 May 2018 (UTC)

Is there a way to manually activate the bot to populate a new page and to debug the template on that page? Cheers, · · · Peter (Southwood) (talk): 06:52, 19 May 2018 (UTC)

@Pbsouthwood: New pages are automatically picked up on the next run (usually weekends). I can always manually run against a page upon request. If you're not getting the results you expect, leave me a note on this page & I will check into it. -- JLaTondre (talk) 12:03, 19 May 2018 (UTC)
Not a problem if it runs this weekend, I will have a look on Sunday night or Monday to see if it does what I expect. If so, end of problem, if not I will get back to you. Thanks, · · · Peter (Southwood) (talk): 13:43, 19 May 2018 (UTC)

WP:JCW/TAR issue

In WP:JCW/Target6, entry #550 is listed as 'Botany', even though 'Botany (journal)' exists. What gives? Headbomb {t · c · p · b} 18:41, 4 May 2018 (UTC)

The "PAGE" -> "PAGE (journal)" only applies to the journal column and not the target column per the original design. Phytologist, Plant Sciences, Plant biology, etc. redirect to Botany so that is the correct target. If the target gets 'rewritten', then you will be missing errors & creating others. For example, look at WP:JCW/P19, the Phytologist citation should probably be The Phytologist (which is a journal). If the target was "rewritten" to Botany (journal), that would be incorrect. -- JLaTondre (talk) 22:28, 5 May 2018 (UTC)
I mean that the compilation should consider it as "Botany (journal)" before looking at redirects, since "Botany (journal)" and what links to it is what is of interest to the compilation. Headbomb {t · c · p · b} 00:39, 6 May 2018 (UTC)
The TAR results are the most common targets from the JCW pages. Botany (journal) is a different target (shows up as 352 on WP:JCW/Target4). -- JLaTondre (talk) 12:13, 6 May 2018 (UTC)
Ah, I see then. I think in those cases the Botany link in the entries column should be excluded, since it's intended to be Botany (journal) and therefore not problematic. I'll need to think about what should be done about the other entries (Plant Biology/Plant Science, etc...) but I think simply creating (Plant Biology (journal)/Plant Science (journal) etc.) will be sufficient. Headbomb {t · c · p · b} 12:56, 7 May 2018 (UTC)
If you update the other entries, then the Botany one shouldn't be picked up. I believe it only shows up because the others redirect to Botany and then the normalization process draws it in. Another option would be to add a target exclusion option to User:JL-Bot/Citations.cfg (perhaps using the existing template, but without the entry part filled in) which would cause the bot to drop the specified target from the listing. -- JLaTondre (talk) 11:13, 8 May 2018 (UTC)
"I believe it only shows up because the others redirect to Botany and then the normalization process draws it in." Yup. An exclusion would probably work best though. Whatever syntax you want to use, but you're right that {{JCW-exclude|Botany}} is pretty natural. Headbomb {t · c · p · b} 12:23, 8 May 2018 (UTC)
Let's go with {{JCW-exclude|Botany}}. I'll get it in before the next run. Would you mind updating the template to display something meaningful when there isn't a second argument? Thanks. -- JLaTondre (talk) 21:34, 9 May 2018 (UTC)
Done. Headbomb {t · c · p · b} 23:01, 9 May 2018 (UTC)
Implemented. -- JLaTondre (talk) 16:09, 28 May 2018 (UTC)
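A rough sketch of how whole-target exclusions of the {{JCW-exclude|Botany}} form could be read from User:JL-Bot/Citations.cfg and applied; the regex and the data shapes are illustrative assumptions, not the bot's source.

<syntaxhighlight lang="python">
import re

# Matches {{JCW-exclude|Target}} and {{JCW-exclude|Target|Entry}}; a
# missing second argument is treated as a whole-target exclusion.
EXCLUDE_RE = re.compile(
    r"\{\{\s*JCW-exclude\s*\|\s*([^|}]+?)\s*(?:\|\s*([^|}]*?)\s*)?\}\}")

def parse_exclusions(cfg_text):
    """Return (excluded_targets, excluded_pairs) from the config wikitext."""
    targets, pairs = set(), set()
    for target, entry in EXCLUDE_RE.findall(cfg_text):
        if entry:
            pairs.add((target, entry))   # exclude one entry under a target
        else:
            targets.add(target)          # drop the target entirely
    return targets, pairs

def filter_targets(by_target, excluded_targets, excluded_pairs):
    """Apply the exclusions to a {target: [entries]} mapping."""
    result = {}
    for target, entries in by_target.items():
        if target in excluded_targets:
            continue
        kept = [e for e in entries if (target, e) not in excluded_pairs]
        if kept:
            result[target] = kept
    return result
</syntaxhighlight>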

Template:R from NLM abbreviation

We've got a new type of redirect to keep track of [8], tagged with {{R from NLM abbreviation}} and categorized in Category:Redirects from NLM abbreviations.

These should be prioritized as ISO > MathSciNet > NLM > Bluebook. Headbomb {t · c · p · b} 18:41, 9 May 2018 (UTC)

Ok, will put in for next run. -- JLaTondre (talk) 21:34, 9 May 2018 (UTC)
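For illustration, a small sketch of picking a single type for a redirect that carries more than one abbreviation tag, using the priority order given above. The category names come from the templates mentioned in this thread; the short type codes (especially for ISO 4) are assumptions.

<syntaxhighlight lang="python">
# Priority order from the discussion: ISO 4, then MathSciNet, then NLM,
# then Bluebook.
TYPE_PRIORITY = ["iso4", "math", "nlm", "bb"]

# Assumed mapping from tracking category to the short code reported in
# |d-type= / |t-type=.
CATEGORY_TO_TYPE = {
    "Redirects from ISO 4 abbreviations": "iso4",
    "Redirects from MathSciNet abbreviations": "math",
    "Redirects from NLM abbreviations": "nlm",
    "Redirects from Bluebook abbreviations": "bb",
}

def redirect_type(categories):
    """Return the highest-priority abbreviation type for a redirect,
    given its tracking categories, or None if it has none of them."""
    found = {CATEGORY_TO_TYPE[c] for c in categories if c in CATEGORY_TO_TYPE}
    for code in TYPE_PRIORITY:
        if code in found:
            return code
    return None

# A redirect tagged as both an NLM and a Bluebook abbreviation reports
# as "nlm".
assert redirect_type(["Redirects from NLM abbreviations",
                      "Redirects from Bluebook abbreviations"]) == "nlm"
</syntaxhighlight>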
We also have {{R from MathSciNet abbreviation}}.

Not sure if it matters, but I've updated some pages, and now those are standardized to

Headbomb {t · c · p · b} 13:09, 11 May 2018 (UTC)

BTW, if you don't have time to do these updates right now, just run the bot as is and update it later when you have time. Nothing in this section or the one above is critical. Headbomb {t · c · p · b} 13:38, 24 May 2018 (UTC)
Implemented. The content task is currently running; when it completes, I will kick off processing the latest dump. -- JLaTondre (talk) 16:10, 28 May 2018 (UTC)
@JLaTondre: I think you truncated the input for |d-type= and |t-type= to n rather than nlm, and m rather than math. Headbomb {t · c · p · b} 13:07, 31 May 2018 (UTC)
It was designed not to, but I had a typo. Fixed & validated on the 20180520 data. Currently downloading the 20180601 dump, so you will see the fix when that completes. -- JLaTondre (talk) 12:00, 2 June 2018 (UTC)
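As a toy illustration of the truncation bug described above: a pattern written when every type code was a single character keeps only the first letter of the newer multi-character codes. The field names are taken from this thread; the patterns themselves are assumptions, not the bot's source.

<syntaxhighlight lang="python">
import re

line = "|d-type=nlm |t-type=bb"

# Single-character pattern: captures only "n" and "b".
print(re.findall(r"\|[dt]-type=(\w)", line))   # ['n', 'b']

# Adjusted pattern: captures the whole code, however long it is.
print(re.findall(r"\|[dt]-type=(\w+)", line))  # ['nlm', 'bb']
</syntaxhighlight>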

Catering for lower than A class articles

Wikipedia:Women's Classical Committee - in my view - could do with a recognised content page for the ~200 or so articles currently under its care. However, all of its articles bar one are below the level of the template's interest - see Wikipedia:Women's Classical Committee/Assessment. Might you be able to give consideration to constructing RC pages covering Stubs, Starts, Cs & Bs? - thx --Tagishsimon (talk) 02:21, 24 May 2018 (UTC)

The bot works by comparing category contents & template transclusions. If the data is obtainable that way, it would be doable. However, I believe project assessments are parameters within a common template? If so, that doesn't fit how this bot works. -- JLaTondre (talk) 12:08, 2 June 2018 (UTC)
The parameters do categorize articles in specific categories, e.g. Category:B-Class physics articles. However, I'm not quite sure what the value of this is, since those listings would start to become quite redundant with the category system itself. It might make sense for smaller projects though. Headbomb {t · c · p · b} 12:17, 2 June 2018 (UTC)
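If it were ever implemented, a minimal sketch of reading the per-class assessment categories mentioned above through the MediaWiki API might look like the following; the "<Class>-Class <project> articles" naming follows the Category:B-Class physics articles example, continuation handling is omitted, and none of this reflects how JL-Bot itself is structured.

<syntaxhighlight lang="python">
import requests

API = "https://en.wikipedia.org/w/api.php"

def pages_in_class(project, assessment_class, limit=500):
    """List members of e.g. Category:B-Class physics articles
    (these are talk pages; continuation is omitted for brevity)."""
    category = f"Category:{assessment_class}-Class {project} articles"
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return [m["title"] for m in data["query"]["categorymembers"]]

# Example: the Stub/Start/C/B listings a small project might want.
for cls in ("Stub", "Start", "C", "B"):
    print(cls, len(pages_in_class("physics", cls)))
</syntaxhighlight>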

Wikipedia:WikiProject Chicago/Featured articles

What happened to the date in this edit? -TonyTheTiger (T / C / WP:FOUR / WP:CHICAGO / WP:WAWARD) 02:04, 31 May 2018 (UTC)

It uses 0000-00-00 when it fails to parse the date. I've updated the bot to handle yet another undocumented format for {{ITN talk}}. -- JLaTondre (talk) 14:26, 2 June 2018 (UTC)
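A small sketch of the fallback behaviour described above: try a list of known date formats and emit 0000-00-00 when none parse. The format list is purely illustrative; the actual set of {{ITN talk}} variants the bot handles is not documented here.

<syntaxhighlight lang="python">
from datetime import datetime

# Illustrative formats only; {{ITN talk}} dates appear in several
# undocumented variants, so the real list keeps growing.
KNOWN_FORMATS = ["%d %B %Y", "%B %d, %Y", "%Y-%m-%d"]

def parse_itn_date(text):
    """Return an ISO date string, or the 0000-00-00 placeholder used
    when the date cannot be parsed."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return "0000-00-00"

print(parse_itn_date("31 May 2018"))    # 2018-05-31
print(parse_itn_date("sometime 2018"))  # 0000-00-00
</syntaxhighlight>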

WP:JCW/Target2 not picking up Amer. Math. Monthly?

Entry #151 doesn't list Amer. Math. Monthly/Amer Math Monthly, even though it has linked to American Mathematical Monthly since July 2011. It's also used in articles, e.g. Robin Wilson (mathematician), so I don't know why it's not being reported. Something to do with the new {{R from MathSciNet}}/{{R from NLM}} maybe? Headbomb {t · c · p · b} 14:48, 5 June 2018 (UTC)

Likewise, entry #284 in WP:JCW/Target3 is not picking up Bull. Amer. Math. Soc./Bull Amer Math Soc. Headbomb {t · c · p · b} 14:51, 5 June 2018 (UTC)
Or entry #785 in WP:JCW/Target8 not picking up Spine (Phila Pa 1976). Headbomb {t · c · p · b} 14:54, 5 June 2018 (UTC)
Fixed. I didn't adjust the pattern match to account for types longer than one character when I implemented the two new ones. -- JLaTondre (talk) 02:20, 6 June 2018 (UTC)
Thanks. The fix seems to have worked just fine. Headbomb {t · c · p · b} 02:57, 6 June 2018 (UTC)

Project content - help please

Any clues please as to what I'm doing wrong at Portal:Bangladesh/Recognized content? I think there are 157 relevant articles but I haven't managed to persuade the bot to pick them up. Certes (talk) 13:26, 7 June 2018 (UTC)

Likely the bot needs to run. That's typically on weekends. Headbomb {t · c · p · b} 15:08, 7 June 2018 (UTC)
Thanks Headbomb. I put the request up on Monday, so I'll see what happens in the next few days. Certes (talk) 15:12, 7 June 2018 (UTC)
  Resolved

The bot ran this morning. Sorry for being impatient! Certes (talk) 09:11, 8 June 2018 (UTC)

Feature Request - WP:RECOG

@JLaTondre: regarding the WP:RECOG task, I was wondering if it would be possible to add the total count of each article type to each header. I.e. if there were 8 Good Articles for a project, the header would look like ===Good articles (8)=== instead of just ===Good articles===. It would make it much easier for users to quickly see how many there are of each type, especially for WP:DYK, which doesn't have an associated category. Let me know what you think. Thanks! « Gonzo fan2007 (talk) @ 18:21, 16 July 2018 (UTC)

The bot already has a |display-total option which displays the total count at the end of the list. See Wikipedia:WikiProject Chicago/Featured articles for an example. -- JLaTondre (talk) 00:42, 17 July 2018 (UTC)
  Facepalm, thanks! « Gonzo fan2007 (talk) @ 03:07, 17 July 2018 (UTC)
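For what it's worth, a toy sketch of what the |display-total option mentioned above adds to the generated output, under the assumption that it simply appends a count line after the list; the exact wikitext the bot writes may differ.

<syntaxhighlight lang="python">
def render_section(heading, articles, display_total=False):
    """Render one recognized-content section; with display_total set,
    append the total count after the list (assumed output shape)."""
    lines = ["===" + heading + "==="]
    lines += ["* [[" + title + "]]" for title in sorted(articles)]
    if display_total:
        lines.append("Total: " + str(len(articles)))
    return "\n".join(lines)

print(render_section("Good articles",
                     ["Lambeau Field", "Aaron Rodgers"],
                     display_total=True))
</syntaxhighlight>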
@JLaTondre: do you know what is going on here: Wikipedia:WikiProject Green Bay Packers/Recognized Content? It was working for a while but now just displays the number 3. « Gonzo fan2007 (talk) @ 19:58, 20 July 2018 (UTC)
@Gonzo fan2007: See Template talk:Columns-list. Suffusion of Yellow (talk) 21:37, 20 July 2018 (UTC)
Awesome. Thanks Suffusion of Yellow. I undid the edit, which fixes the issue for now. « Gonzo fan2007 (talk) @ 21:49, 20 July 2018 (UTC)

DYK source

Where does the "Project content" function get its blurb for "Did you know?" entries? Is it from the parameters to the {{DYK talk}} template on each article's talk page, or does it read the numbered or monthly pages in the archives? (I'm currently fixing links to dabs in the monthly archives for use in a portal template, and was wondering whether that would also fix JL-Bot's input data.) Thanks, Certes (talk) 12:36, 3 August 2018 (UTC)

It uses the {{DYK talk}} blurb. -- JLaTondre (talk) 13:18, 4 August 2018 (UTC)
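A hedged sketch of pulling the hook out of a talk page's {{DYK talk}} transclusion: the |entry= parameter name is an assumption based on common usage of that template, and a plain regex will stumble on hooks containing nested templates or piped links, so treat this as illustrative only.

<syntaxhighlight lang="python">
import re

ENTRY_RE = re.compile(
    r"\{\{\s*DYK talk\b.*?\|\s*entry\s*=\s*(.*?)\s*(?:\||\}\})",
    re.DOTALL | re.IGNORECASE)

def dyk_blurb(talk_wikitext):
    """Return the DYK hook from a {{DYK talk}} transclusion, or None."""
    m = ENTRY_RE.search(talk_wikitext)
    return m.group(1) if m else None

sample = "{{DYK talk|5 June|2018|entry=... that this is only an example?}}"
print(dyk_blurb(sample))  # ... that this is only an example?
</syntaxhighlight>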

Misfiring links to journals

The question above reminds me that some time ago I started fixing links which should go to journals but don't, e.g. where someone has consulted Abacus (journal) but claimed to have worked the answer out on an abacus. (I got up to C before finding something more urgent to do.) Does JL-Bot produce a list of links that should lead to journals but link to articles on other topics, or would that be an easy by-product? Certes (talk) 12:48, 3 August 2018 (UTC)

After some further digging, I see that WP:WikiProject Academic Journals/Journals cited by Wikipedia/A1 and friends (filtered by type=? and target=not blank) does the job, at least where there are few enough articles for links to appear. However, there are a lot of false positives from #ifexist: checks in templates, e.g. Dog intelligence correctly quotes an unlinked source AABS but records a wikilink to AABS, which redirects to an unrelated topic. I almost removed my idea as not feasible, but I'll leave it here in case you have some clever insight that I missed. Certes (talk) 13:33, 3 August 2018 (UTC)

You could use AutoWikiBrowser's List comparer function to find articles that link to Abacus & also transclude Template:Citation. It's not a perfect solution (there are still false positives, and there are other citation templates), but it reduces the set to look at significantly. Also, you can use AWB's find and replace to make going through the list, including skipping false positives, much easier. -- JLaTondre (talk) 13:17, 4 August 2018 (UTC)
Thanks for the tips. I'm currently using JWB but should probably look at upgrading to AWB. Certes (talk) 14:10, 4 August 2018 (UTC)
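As an alternative route to the same intersection, a small sketch against the MediaWiki API: fetch articles linking to the suspect title and articles transcluding {{Citation}}, then intersect. Continuation is omitted, so it only sees the first batch of each result (the Template:Citation transclusion list in particular is far larger than one batch).

<syntaxhighlight lang="python">
import requests

API = "https://en.wikipedia.org/w/api.php"

def query_titles(list_name, extra):
    """Run one action=query list request and return the titles
    (continuation omitted for brevity)."""
    params = dict(extra, action="query", list=list_name, format="json")
    data = requests.get(API, params=params).json()
    return {page["title"] for page in data["query"][list_name]}

# Articles linking to the suspect target...
linkers = query_titles("backlinks",
                       {"bltitle": "Abacus", "blnamespace": 0,
                        "bllimit": "max"})
# ...intersected with articles transcluding the citation template.
citers = query_titles("embeddedin",
                      {"eititle": "Template:Citation", "einamespace": 0,
                       "eilimit": "max"})

for title in sorted(linkers & citers):
    print(title)
</syntaxhighlight>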
@Certes: what ifexist check? Headbomb {t · c · p · b} 11:55, 6 August 2018 (UTC)
Sorry Headbomb, I forgot that you'd changed {{Infobox journal}} to use {{Linkless exists}}. The false positives must be coming from somewhere else. Certes (talk) 12:08, 6 August 2018 (UTC)
What false positives are we talking about here? Headbomb {t · c · p · b} 12:10, 6 August 2018 (UTC)
I was wondering why Dog intelligence appeared on Special:WhatLinksHere/AABS, but it no longer does. Perhaps the page has just been purged. AABS appears in a {{Cite journal}} (not Infobox journal, sorry) and I thought this may be related. Unfortunately I don't have enough information to track down any problems properly, and any bogus wikilinks may be transitory. Certes (talk) 12:26, 6 August 2018 (UTC)

WP:RECOG Update

JLaTondre Just FYI, it appears JL-Bot didn't get through the whole list of WP:RECOG updates. Per Special:Contributions/JL-Bot, it looks like it stopped in the L's. « Gonzo fan2007 (talk) @ 16:22, 6 August 2018 (UTC)

Odd, restarted it. Thanks for letting me know. -- JLaTondre (talk) 20:25, 6 August 2018 (UTC)