User talk:JL-Bot/Archive 4


TAR / Questionable & Unicode

Perl is deprecating the use of strings with code points over 0xFF in XOR. This impacts the WP:JCW/TAR and WP:JCW/CRAP matching logic for some citations. I have changed the logic to "de-Unicode" characters prior to doing the string matching. This change is in the results just uploaded for both. In general, it seems to be working for the better, but there are now some additional false positives. For example in TAR3, #240 is now picking up "Палеонтологический Журнал" which is the Russian name, but #299 is now picking up "КМЕ" which is a false positive (while it is only a 1 letter difference from NME, the Cyrillic K was causing the original XOR to miss it). There could also be some unintended oddities. If anyone sees anything weird, please let me know. Thanks. -- JLaTondre (talk) 22:20, 24 August 2018 (UTC)
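A minimal sketch of the "de-Unicode" idea, written in Python rather than the bot's Perl. The lookalike table and both function names are hypothetical, and the position-by-position comparison only stands in for whatever the XOR-based matching actually does.

```python
# Hypothetical lookalike table: Cyrillic capitals mapped to the Latin
# letters they resemble. The bot's real table is not shown in this thread.
HOMOGLYPHS = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
})


def de_unicode(text):
    """Replace lookalike Cyrillic capitals with their Latin counterparts."""
    return text.translate(HOMOGLYPHS)


def char_difference(a, b):
    """Count positions at which two strings differ (padding the shorter one)."""
    length = max(len(a), len(b))
    a = a.ljust(length, "\0")
    b = b.ljust(length, "\0")
    return sum(1 for x, y in zip(a, b) if x != y)


cyrillic_kme = "\u041a\u041c\u0415"  # the Cyrillic string from entry #299
print(char_difference(cyrillic_kme, "NME"))              # every position differs
print(char_difference(de_unicode(cyrillic_kme), "NME"))  # only the first letter differs
```

Before the mapping, none of the three code points equals its Latin lookalike, so the strings look unrelated; after it, only K vs N remains, which is why the false positive now appears.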

Yes, it seems to be working better. Very good at catching Russian/Cyrillic names that are actual matches with journals. False positives are rare and easy to exclude. Headbomb {t · c · p · b} 23:45, 24 August 2018 (UTC)

{{JCW-include}}

Could the bot support this? {{JCW-include|EEC Journal|EEC J.|EEC J}}

The idea would be that instead of declaring something like

and hope nothing creeps up in the future, we could just have {{JCW-include|CAGENA|Cagena}} Headbomb {t · c · p · b} 00:33, 25 August 2018 (UTC)

So in TAR, exclude anything that doesn't match the include? Yes, that would be doable (though it will miss future typos). Did you want to support both? That would take a little more work. -- JLaTondre (talk) 13:34, 28 August 2018 (UTC)
It would be supporting both. The idea is that things like ABA Journal in WP:JCW/Target10 have like 20 exclusions to set up, and as soon as some other 3-letter organization creates a journal it'll also be picked up. It should still pick up redirects as normal though (tweaked the template to make the intent clear), so if something like American Bar Association (1935-) is created and redirects to ABA Journal, we'd see that. It would just be telling the bot "don't bother looking for variants". Headbomb {t · c · p · b} 14:08, 28 August 2018 (UTC)
Though if you have only time to work on either grouping for WP:CRAPWATCH or this, grouping has a higher priority. Headbomb {t · c · p · b} 14:16, 28 August 2018 (UTC)
Actually put this one on pause for a little while. I need to put my thinking hat on. Headbomb {t · c · p · b} 14:18, 28 August 2018 (UTC)

JCW target matching

Lots of journals are named something like

  • Foobar: Official Journal of the Blah Society
  • Foobar: The Official Journal of Blah Society
  • Foobar: Official Organ of the Blah Society
  • Foobar: The Official Journal of thr Blah Society

If you find (The|an)?\s*(Official|International)\s*(Blog|Bulletin|Gazette|Guide|Handbook|Journal|Magazine|Newsletter).* at the end of something, then consider that equivalent to the same string without The Official Whatever.

The idea is that Official Journal of the European Union and The Official Journal of the International Hepato Pancreato Biliary Association are legit, but Obesity Reviews : an Official Journal of the International Association for the Study of Obesity = Obesity Reviews. Headbomb {t · c · p · b} 00:32, 19 August 2018 (UTC)
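A rough Python sketch of the rule described above. The word list is the one from the pattern in this thread; the function name, the case-insensitive flag, and the "leave it alone if nothing precedes the match" guard are assumptions about how the exceptions might be handled, not the bot's code.

```python
import re

# Trailing "(The|an) (Official|International) <publication word> ..." clause.
OFFICIAL_SUFFIX = re.compile(
    r"\b(?:(?:The|an)\s+)?(?:Official|International)\s+"
    r"(?:Blog|Bulletin|Gazette|Guide|Handbook|Journal|Magazine|Newsletter)\b.*$",
    re.IGNORECASE,
)


def strip_official_suffix(title):
    """Return the title without a trailing 'Official Journal of ...' clause.

    If the clause starts the title, there is nothing left to match on,
    so the title is returned unchanged.
    """
    match = OFFICIAL_SUFFIX.search(title)
    if match is None:
        return title
    base = title[:match.start()].rstrip(" :,;-")
    return base if base else title


print(strip_official_suffix(
    "Obesity Reviews : an Official Journal of the International "
    "Association for the Study of Obesity"))
print(strip_official_suffix("Official Journal of the European Union"))
```

The first call reduces to "Obesity Reviews"; the second is returned as is because the clause is the whole title.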

To confirm, this is for the initial processing (i.e. WP:JCW/ALPHA) and not just the TAR processing? -- JLaTondre (talk) 20:23, 19 August 2018 (UTC)
This would be for TAR and CRAP. Headbomb {t · c · p · b} 00:07, 20 August 2018 (UTC)
Implemented. Updated TAR has been uploaded so you can check it out. -- JLaTondre (talk) 00:00, 1 September 2018 (UTC)
Great to hear! I'll take a look and get back to you! Headbomb {t · c · p · b} 00:08, 1 September 2018 (UTC)
Seems to work bang on. Next up, grouping! Headbomb {t · c · p · b} 01:29, 1 September 2018 (UTC)

Fail to pickup an entry?

On WP:JCW/Target8, we have

However, it failed to pick up

for some reason. Headbomb {t · c · p · b} 18:01, 19 September 2018 (UTC)

Looking at B36 shows:
  • display=''[[Bulletin of the Atomic Scientist]]''|d-type=j|target=[[ bulletin of the Atomic Scientists ]]|t-type=?
  • display='''[[Bulletin of the Atomic Scientists]]'''|d-type=j|target=[[Bulletin of the Atomic Scientists]]|t-type=j
The spaces on the first one are the issues as from the software's perspective, they are different targets. It's also why it's not properly resolving the title type. Looking at Bulletin of the Atomic Scientist shows the page is:
  • #redirect [[ bulletin_of_the_Atomic_Scientists ]]
The bot is not properly handling spaces within a redirect. That's an easy fix. I made the change and will re-run to verify it. Unfortunately, it's in the dump parsing which is the first and longest step... -- JLaTondre (talk) 00:13, 20 September 2018 (UTC)
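For illustration, a small Python sketch of the kind of normalization that makes the two targets above compare equal. The function name and regular expression are hypothetical; the rules applied (underscores are equivalent to spaces, surrounding whitespace is ignored, the first letter is capitalized) are the usual title rules on the English Wikipedia.

```python
import re

# "#redirect" in any case, optional whitespace, then the link target up to
# the first "]", "|" or "#".
REDIRECT = re.compile(r"^\s*#redirect\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the normalized redirect target, or None if not a redirect."""
    match = REDIRECT.match(wikitext)
    if match is None:
        return None
    title = match.group(1).replace("_", " ")
    title = " ".join(title.split())        # trim and collapse whitespace
    return title[:1].upper() + title[1:]   # first letter is case-insensitive


print(redirect_target("#redirect [[ bulletin_of_the_Atomic_Scientists ]]"))
```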
Fixed, updated results saving now. -- JLaTondre (talk) 22:19, 20 September 2018 (UTC)
Yup. Not a whole lot of entries changed, but it makes a big difference on selected journals (e.g. Acta Crystallographica went from #206 to #154). Movement in WP:JCW/Target9/WP:JCW/Target10 is mostly due to better exclusions. Thanks. Very much looking forward to the next dump and last few bits of polish on WP:JCW/CRAP. Headbomb {t · c · p · b} 00:58, 21 September 2018 (UTC)

Tweak to JL-Bot?

Thinking out loud here, when it comes to certain things, e.g. half a million redirects to OMICS Publishing Group, things don't quite rise up to the level of being featured on WP:JCW/TAR, but it'd still be useful to have a centralized page to have an idea of what's linked, and what's typoed, and all that.

I'm thinking of having a sort of WP:JCW/CRAP, where the bot compiles things as it would in WP:JCW/TAR, but for specific targets (declared at User:JL-Bot/Questionable.cfg). What's possible here? Headbomb {t · c · p · b} 06:49, 3 August 2018 (UTC)

Yes, generating common targets based on a list would be an easy extension. I don't understand the intention of the TARGET2+ stuff? Thanks. -- JLaTondre (talk) 12:27, 4 August 2018 (UTC)

Basically it's a manual way of creating 'groups' of targets. For instance {{JCW-selected|Bentham Science Publishers|Category:Bentham Science Publishers academic journals}}

would be shorthand for

+

  • Anything that redirects to any of those.
  • Typos and variants

For the simpler case of {{JCW-selected|Baishideng Publishing Group|World Journal of Gastroenterology}}

This would look something like

Rank Publisher/Journal Entries (Citations, Articles) Total Citations Distinct Articles
1 Baishideng Publishing Group 638 350

Headbomb {t · c · p · b} 15:43, 4 August 2018 (UTC)

So the first parameter will always be a page? But the second+ parameters can be pages or categories? -- JLaTondre (talk) 13:48, 5 August 2018 (UTC)
Well I suppose in theory it could be a category, but I can't really conceive of a case where, in practice, you'd have a category without a main article. Headbomb {t · c · p · b} 15:50, 5 August 2018 (UTC)

Break

First cut here. It does not have the hierarchy in the Entries column. How important is that? The existing TAR logic doesn't easily lend itself to that. If needed, I'll figure out a way to squeeze it in, but it will take longer. Other than that, let me know if it is matching what you were looking for. If it does, I'll integrate it into the normal bot running. -- JLaTondre (talk) 23:09, 5 August 2018 (UTC)

@JLaTondre: Do you mean the alphabetical order in "Target"? Not very important. In the second column, it'd be very nice to have a sort of bulleted hierarchy

Typos and redirects being omitted if they aren't used, but the direct links would be included even if they aren't used. Headbomb {t · c · p · b} 00:14, 6 August 2018 (UTC)

Looking at the first cut, a few things. First this [1].

But also entries 10/23/34 (Frontiers in Psychology/Frontiers in Plant Science/Frontiers in Endocrinology) should have been grouped with the first entry Frontiers Media, as declared in

See Mockup (entries #1, #2 and #4). For entry #2, I only merged Abstract and Applied Analysis, Advances in High Energy Physics and BioMed Research International, but you can imagine the other journals of Category:Hindawi Publishing academic journals being done the same way. Headbomb {t · c · p · b} 00:24, 6 August 2018 (UTC)

It should also pickup redlinks + typos of those redlinks. E.g. {{JCW-selected|Asian Journal of Chemistry}} should report Asian Journal of Chemistry (7) (per WP:JCW/A66), and matches for things like Asan Journal of Chemistry if they exist. Headbomb {t · c · p · b} 01:11, 6 August 2018 (UTC)
By hierarchy, I meant the grouping. The existing TAR code isn't set up to handle that so will take me a bit to get that in there. I'll check into the missing page(s). -- JLaTondre (talk) 20:26, 6 August 2018 (UTC)
For the Asian Journal of Chemistry, it doesn't have a target (target of — on A66). Therefore, when searching for common targets, it doesn't return anything. I'm thinking that instead of matching on targets (like TAR does), this should really match on display values. If the display value has a target, then it would search for other pages with the same target. Does that make sense? Or am I missing something? -- JLaTondre (talk) 20:53, 7 August 2018 (UTC)
Well for declared targets that are redlinks, the target is the redlink itself. So {{JCW-selected|Asian Journal of Chemistry}} means match |journal=Asian Journal of Chemistry+variants. Structure that however makes sense, code-wise. Headbomb {t · c · p · b} 21:02, 7 August 2018 (UTC)
Rank Publisher/Journal Entries (Citations, Articles)
1 Asian Journal of Chemistry
(and possibly, via [2])

You could also feed Asian Journal of Chemistry in [3] to get additional variants (Asian J. Chem./Asian J Chem) and typos of said variants (Asian J-Chem). Headbomb {t · c · p · b} 21:08, 7 August 2018 (UTC)

@Tokenzero: would your site be able to handle this extra load without choking on itself? Headbomb {t · c · p · b} 21:09, 7 August 2018 (UTC)

Selected format

I would like to limit the format of User:JL-Bot/Questionable.cfg to not allow "|" within the notes or source fields (i.e. it is only used to separate the parameters of the {{JCW-selected}} template). See this change. That significantly reduces the complexity of parsing that template and reduces the chances for error. -- JLaTondre (talk) 19:30, 12 August 2018 (UTC)

@JLaTondre: sure, make the changes you want, I'm at a family event for the next few hours. In the long run, it will limit what we can put in note/source (e.g. other templates) though, but that shouldn't be too much of a big deal. At least for now. Headbomb {t · c · p · b} 19:38, 12 August 2018 (UTC)
After reflection, there is an easy way to handle it. As long as the notes & sources fields are always at the end (i.e. the bot will ignore anything after whichever one comes first), we will be good. The benefits of sleeping on it. ;-) Thanks. -- JLaTondre (talk) 22:51, 14 August 2018 (UTC)
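One way to read that rule, as a Python sketch: split on "|" only up to the first trailing field and keep the rest verbatim, so pipes inside a note or source no longer matter. The function name and the exact field names ("note=", "notes=", "source=") are assumptions; only "source=" appears in the examples on this page.

```python
TEMPLATE_OPEN = "{{JCW-selected|"
TRAILING_FIELDS = ("note=", "notes=", "source=")  # assumed field names


def parse_selected(line):
    """Split a {{JCW-selected|...}} line into (targets, trailing text).

    Everything from the first note/source field onward is kept as one
    opaque string and never split on "|".
    """
    text = line.strip()
    if not (text.startswith(TEMPLATE_OPEN) and text.endswith("}}")):
        return None
    body = text[len(TEMPLATE_OPEN):-2]
    cut = len(body)
    for name in TRAILING_FIELDS:
        index = body.find("|" + name)
        if index != -1:
            cut = min(cut, index)
    targets = [part.strip() for part in body[:cut].split("|") if part.strip()]
    trailing = body[cut + 1:]
    return targets, trailing


print(parse_selected(
    "{{JCW-selected|Allied Academies|"
    "Category:Allied Academies academic journals|source=BPL}}"))
```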
Sure that works! Do you have an ETA for a prototype? Headbomb {t · c · p · b} 23:33, 14 August 2018 (UTC)

Second Version

I uploaded the next version. It should be catching the cases like "Asian Journal of Chemistry" now. I still need to work on the grouping and having the bot save it to the wiki with paging. I also haven't done anything with LTWA abbreviations. I will continue working on it, but progress will be a bit slow as I have some other things occupying me also. -- JLaTondre (talk) 00:26, 17 August 2018 (UTC)

That's fine, we can do a lot with bad grouping and no-LTWA lookups even if it's not 100% ideal. Looking forward to both these things being implemented though! (Of the two, grouping would be the most beneficial I'd argue). Headbomb {t · c · p · b} 01:13, 17 August 2018 (UTC)
@JLaTondre: btw, even if you haven't done any code improvements, a new upload of WP:CRAPWATCH would still be very useful. Headbomb {t · c · p · b} 14:28, 23 August 2018 (UTC)
I uploaded a new version. -- JLaTondre (talk) 21:59, 24 August 2018 (UTC)

Grouping

@Headbomb: Ran into the following situation: For {{JCW-selected|Allied Academies|Category:Allied Academies academic journals|source=BPL}}, most of the pages within the category redirect to Allied Academies so that is already their target. So how do you want this represented in the table? Just listed under Allied Academies? Listed under Allied Academies and also as its own? In other words, this:

Rank Publisher/Journal Entries (Citations, Articles) Total Citations Distinct Articles
1 Allied Academies 15 4

Or this:

Rank Publisher/Journal Entries (Citations, Articles) Total Citations Distinct Articles
1 Allied Academies 15 4

I would assume the first would be sufficient, but the second is if I take the original description literally. -- JLaTondre (talk) 15:23, 1 September 2018 (UTC)

@JLaTondre: The first, yes. The hierarchy for the 'entries' column would be

Rank Publisher/Journal Entries (Citations, Articles) Total Citations Distinct Articles
1 Allied Academies 15 4

Headbomb {t · c · p · b} 15:36, 1 September 2018 (UTC)

Grouping, take 1

First version of grouping has been posted based on the 20180901 dump. It's not perfect - the target is always getting listed on the entries side even if it has no citations. I'll need to look into that as well as integrating it into the main bot run so the bot uploads the results with pagination. -- JLaTondre (talk) 22:47, 3 September 2018 (UTC)

Looks good. It'll need a few refinements, but it's on the right track. In general, something like the current

should be instead (e.g. if you have hits on the "first level", put the numbers on the "first level")

and

can just be

This, however (no hits on the first level, but hits on the second level)

is correct, since we can see the underlying redirect structure. Whereas

would be useless, since there's no hit to it, or to its redirects. Headbomb {t · c · p · b} 00:11, 4 September 2018 (UTC)

Changes made. -- JLaTondre (talk) 15:24, 15 September 2018 (UTC)
Seems to only be missing this part
which can just be
Headbomb {t · c · p · b} 22:15, 15 September 2018 (UTC)
Fixed. -- JLaTondre (talk) 18:14, 6 October 2018 (UTC)

Moved to /Questionable

I moved whatever was at /Selected# to /Questionable# btw (including /Selected.cfg to /Questionable.cfg). This is a clearer name, but {{JCW-selected}} remains the same. I have some additional plans for that template and the compilation, but those can be done once /Questionable is polished and the last few kinks worked out and the multipage stuff implemented. Headbomb {t · c · p · b} 01:04, 17 September 2018 (UTC)

Could we get a new run of WP:CRAPWATCH (at its new location), even if there aren't any other improvements to the code? Headbomb {t · c · p · b} 15:58, 24 September 2018 (UTC)
20181001 dump results uploaded. This weekend, if all goes well, I hope to complete the remaining items as well as roll it into the regular run. -- JLaTondre (talk) 21:03, 3 October 2018 (UTC)
Fixed the "(journal)" case above & implemented saving as part of the bot run. It is doing 1500 lines per page (actual output will be longer as it keeps adding rows until the last row added makes the total over 1500, example: total is 1450, next row is 75, final total will be 1525). Since the number of lines in a row can vary much more significantly for questionable targets than for the common targets, using lines seemed better than the number of rows. Looking at using redirects in false positives next. -- JLaTondre (talk) 18:21, 6 October 2018 (UTC)
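The paging rule described here can be sketched in Python as follows. The function name and the representation of rows as bare line counts are illustrative, and treating "over 1500" as strictly greater than 1500 is an assumption.

```python
def paginate(row_line_counts, limit=1500):
    """Group rows into pages; a page closes once its line total exceeds limit.

    The row that pushes the total over the limit stays on the current
    page, so a page can be longer than the limit.
    """
    pages, current, total = [], [], 0
    for count in row_line_counts:
        current.append(count)
        total += count
        if total > limit:
            pages.append(current)
            current, total = [], 0
    if current:
        pages.append(current)
    return pages


# The example from the comment: 1450 lines so far, next row has 75 lines.
pages = paginate([1450, 75, 10])
print(pages, sum(pages[0]))
```

The first page ends up with 1525 lines, matching the example, and the remaining row starts the next page.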
Amazing. I'll deep review for little things, but the quick review results look good. Headbomb {t · c · p · b} 18:28, 6 October 2018 (UTC)

WP:CRAPWATCH question

In #6 (OMICS Publishing Group), you have the same false positive that happens in several journals. To setup exclusions, do we need to use

to cover all instances, or individual exclusions like

The former would be much more useful. Headbomb {t · c · p · b} 14:34, 7 September 2018 (UTC)

So ignore all redirects to a target as well as the target? Yeah, that is doable. Just for questionable? Or for TAR also? -- JLaTondre (talk) 22:56, 7 September 2018 (UTC)
Well /TAR works fine as is. /Questionable is the one with the repetitions, so this would be for /Questionable. Headbomb {t · c · p · b} 03:24, 8 September 2018 (UTC)
With respect to the groupings, I assume if the group heading (example: "Biomedical Research" in the "Allied Academies" table example further up the page) is excluded, everything under that group should be excluded as well (even if not matching an exclusion)? -- JLaTondre (talk) 23:57, 13 September 2018 (UTC)
Might as well, yes. Headbomb {t · c · p · b} 21:43, 14 September 2018 (UTC)
This one is still not implemented. Headbomb {t · c · p · b} 18:37, 6 October 2018 (UTC)
Yes, that was what I said below I was going to work on next. ;-) -- JLaTondre (talk) 18:41, 6 October 2018 (UTC)
Exclusions based on redirects has been implemented & uploaded. Please review. -- JLaTondre (talk) 00:17, 7 October 2018 (UTC)
Seems to work! I think everything here on this page can be archived. I'll have something else, but it'll be an easy thing to do. Headbomb {t · c · p · b} 00:57, 7 October 2018 (UTC)

Portal talk:Scotland

Hello JL-Bot / JLaTondre I'm wondering why your recent updates to recognized content at Portal talk:Scotland no longer include the content of "Former featured articles", "Featured Lists", "Former good articles", or "Did you know? articles", amongst others?

I did notice that before the edit by TheTranshumanist these (and the other items) were included. Is this some consequence of a community discussion that I was unaware of? It's a pity because this previous content was of immense value for me in terms of efficiency of working on EN:WP.

I'd appreciate your reply. -Cactus.man 20:12, 7 October 2018 (UTC)

In the edit you link, TheTranshumanist removed those data types from the bot configuration. The bot only provides the types requested by the project. As those types were no longer requested, they were no longer provided. -- JLaTondre (talk) 23:00, 7 October 2018 (UTC)

Up targets to /15?

I've finally processed and cleaned up the first 1000 entries of WP:JCW/TAR last month or so. Going through them once per dump is really quick now, so we can up the entries to 1500 or even 2000 for targets. Headbomb {t · c · p · b} 01:20, 11 October 2018 (UTC)

Expanded to 1500. Easy to go to 2000 if you decide you want that after looking at the 1500. -- JLaTondre (talk) 13:46, 13 October 2018 (UTC)
Thanks. It's going to take me a while to hit the new 500. Lots of exclusions to setup, redirects to create, typos to flag, insource:/Foobar/ searches to do and cleanup... Headbomb {t · c · p · b} 14:33, 13 October 2018 (UTC)

Methods in Molecular Biology doesn't pick up Methods in Molecular Biology (Clifton, N.J.) ?

See Entry #1022 in WP:JCW/Target11. Headbomb {t · c · p · b} 14:44, 13 October 2018 (UTC)

Methods in Molecular Biology (Clifton, N.J.) has #REDIRECT: [[Methods in Molecular Biology]]. That isn't valid syntax and I'm surprised the redirect works. I've updated the redirect parser to catch that case. -- JLaTondre (talk) 16:56, 13 October 2018 (UTC)

WT:CRAPWATCH exception

Could you put an exception for this one. Headbomb {t · c · p · b} 01:10, 14 October 2018 (UTC)

  • ...

Should also all redirect to Wikipedia talk:WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1. Headbomb {t · c · p · b} 01:10, 14 October 2018 (UTC)

Should be working. -- JLaTondre (talk) 13:34, 14 October 2018 (UTC)

Question about disambiguators

In 2011, you wrote "... [the bot] should now should properly detect all the (journal), (magazine), or (newspaper) variants.", referring to cases like Nature (journal) vs Nature, Flight (magazine) vs Flight, etc...

What disambiguators are supported here? Because it would be useful if this was extended to say,

Journal > Magazine > Newspaper > Website > Database > Encyclopedia > Book > Publisher

Headbomb {t · c · p · b} 22:38, 17 October 2018 (UTC)

Wow, 2011! It has been that long? It did just the first three. It was an easy expansion for the remaining types. There were only two changes. Results are uploading now. -- JLaTondre (talk) 23:47, 17 October 2018 (UTC)
Well there's gonna be a few more now that I know this is supported. E.g. eLS (encyclopedia), but that's likely going to be in the next dump. Which hopefully will have the remaining CRAPWATCH things sorted out. Headbomb {t · c · p · b} 00:06, 18 October 2018 (UTC)

Category links

In WP:JCW/C43, you have something like

{{JCW-row|display=''[[Comput Sci Eng]]''|d-type=i|target=[[Category:Scientific & Academic Publishing academic journals]]|t-type=?|citations=2|articles=1|search=Comput%20Sci%20Eng}}

this should be

{{JCW-row|display=''[[Comput Sci Eng]]''|d-type=i|target=[[:Category:Scientific & Academic Publishing academic journals]]|t-type=?|citations=2|articles=1|search=Comput%20Sci%20Eng}} With a : in front of the category. Headbomb {t · c · p · b} 22:01, 25 October 2018 (UTC)

Done. I also did the same with File: links (probably not likely to ever happen, but just in case). -- JLaTondre (talk) 13:32, 27 October 2018 (UTC)
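As an illustration of the fix, a tiny Python helper that adds the leading colon for the two namespaces mentioned. The function name is hypothetical; the colon matters because a bare Category: link categorizes the page and a bare File: link embeds the file instead of linking to it.

```python
def link_target(title):
    """Format a title as a wikilink, escaping Category: and File: targets."""
    if title.startswith(("Category:", "File:")):
        return "[[:" + title + "]]"
    return "[[" + title + "]]"


print(link_target("Category:Scientific & Academic Publishing academic journals"))
print(link_target("Bulletin of the Atomic Scientists"))
```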

Skip the (journal) bypass if it's tagged with {{R from unnecessary disambiguation}} / in Category:Redirects from unnecessary disambiguation

E.g. Evol Dev in WP:JCW/E31 uses ''[[Evol Dev (journal)|Evol Dev]]'', and the bot fetches the information from a pointless page (Evol Dev (journal)), rather than the good one (Evol Dev). Since Evol Dev (journal) is tagged with {{R from unnecessary disambiguation}} / categorized in Category:Redirects from unnecessary disambiguation, the bot should just make use of Evol Dev as if Evol Dev (journal) did not exist. Headbomb {t · c · p · b} 13:35, 8 November 2018 (UTC)
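A Python sketch of the requested behaviour, under stated assumptions: `page_categories` is a hypothetical mapping from existing page titles to their categories, and the list of disambiguators to try is a parameter because the supported set changes over the course of this page.

```python
UNNECESSARY = "Redirects from unnecessary disambiguation"


def resolve(title, page_categories, dabs=("journal", "magazine")):
    """Pick the page to use for a cited title.

    A "Title (dab)" page is preferred when it exists, unless it is
    categorized as an unnecessary disambiguation, in which case it is
    treated as if it did not exist.
    """
    for dab in dabs:
        candidate = f"{title} ({dab})"
        categories = page_categories.get(candidate)
        if categories is None:
            continue  # no such page
        if UNNECESSARY in categories:
            continue  # pointless redirect: skip it
        return candidate
    return title


pages = {
    "Evol Dev (journal)": {UNNECESSARY},
    "Nature (journal)": set(),
}
print(resolve("Evol Dev", pages))
print(resolve("Nature", pages))
```

With this data the first call falls back to "Evol Dev" and the second still uses "Nature (journal)".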

LOL You like to complicate things, don't you? ;-) Shouldn't be too much trouble. -- JLaTondre (talk) 18:40, 10 November 2018 (UTC)
I don't, but people refuse to delete those redirects. Headbomb {t · c · p · b} 22:16, 10 November 2018 (UTC)
Implemented. Couple of comments though:
  • Evol Dev (journal) is currently not tagged with {{R from unnecessary disambiguation}}. I manually edited my parsed data set so that when the bot processed it, it would recognize it as one and set the output correctly. You will need to update the actual page to add the template before the next dump if you wish it to keep doing that.
  • In doing this, I realized that for this request, I had only half implemented the change. There are two places in the code that are impacted (how things are counted & how they are outputted). I had only done the counting. I fixed the output as well which means there are a lot more changes than the original two.
Everything looks good to me, but there are a lot of deltas so I could have missed something. Results are uploading. Let me know if you see anything wrong. -- JLaTondre (talk) 01:18, 12 November 2018 (UTC)

More things that don't count

You know how 'Series' or 'Part' don't count for matching purposes? I thought 'Supplement' was also covered, but turns out it's not.

So here's a few more things that should be ignored for matching purposes

  • Supplementum
  • Supplement
  • Suppl.
  • Suppl
  • Nouvelle Série
  • New Series
  • N.S.
  • NS
  • Neue Folge
  • N.F.
  • NF

i.e, if you find Foobar Supplement, Foobar, New Series, or Foobar (N.F.), they should be grouped with Foobar in WP:JCW/TAR (and WP:JCW/CRAP too). Headbomb {t · c · p · b} 16:40, 5 November 2018 (UTC)
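A Python sketch of how the list above could be ignored when building a matching key. It only handles these words at the end of a title, which is an assumption (the request does not say where they may appear), and the names `IGNORED_SUFFIX` and `match_key` are hypothetical.

```python
import re

# Trailing series/supplement markers, optionally preceded by a comma or
# whitespace and optionally wrapped in parentheses. Case-sensitive so that
# "NS" and "NF" do not match ordinary lowercase word endings.
IGNORED_SUFFIX = re.compile(
    r"[\s,]*\(?\b(?:Supplementum|Supplement|Suppl\.?|Nouvelle Série"
    r"|New Series|N\.S\.|NS|Neue Folge|N\.F\.|NF)\)?\s*$"
)


def match_key(title):
    """Return the title with a trailing ignorable marker removed."""
    stripped = IGNORED_SUFFIX.sub("", title)
    return stripped if stripped else title  # never reduce a title to nothing


for example in ("Foobar Supplement", "Foobar, New Series", "Foobar (N.F.)"):
    print(match_key(example))
```

All three examples reduce to "Foobar", so they would be grouped together.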

Done. Let me know if you see anything odd. -- JLaTondre (talk) 01:11, 7 November 2018 (UTC)
I'll take a look! So far, it looks like it's doing a few nice pickups. Headbomb {t · c · p · b} 01:39, 7 November 2018 (UTC)
You could add also "Monographs/Monograph/Monogr./Monogr". I'm also debating if adding "Letters/Letter/Lett./Lett." would be more helpful than not. A trial with the "Letters" stuff in would yield a lot of insight. If it's not helpful, it could be yanked. Headbomb {t · c · p · b} 14:19, 7 November 2018 (UTC)
Version with both uploaded. -- JLaTondre (talk) 22:10, 7 November 2018 (UTC)
There's some issues. Things like Letters to Nature stop being picked up for Nature (journal) (first diff line in [4]). And things like The Powys Review Letters started being picked up for Physical Review. Headbomb {t · c · p · b} 22:51, 7 November 2018 (UTC)
I don't understand the Letters to Nature case. Since it's a redirect to Nature, it shouldn't be impacted. I'll investigate that. "The Powys Review" was actually added to "Physical Review Letters" which makes sense. "Physical Review Letters" contains various forms of "Phys Rev Lett". If "lett" is stripped from those, you are left with only a two character difference ("rev" & "review" are treated the same, "the" is also stripped). -- JLaTondre (talk) 23:38, 7 November 2018 (UTC)
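The "two character difference" can be reproduced with a small Python sketch. The normalization mirrors the three rules stated above (strip "the", strip the Letters forms, treat "review" as "rev"); the use of Levenshtein edit distance is an assumption, since the comment does not say how the bot counts differences.

```python
STRIP_WORDS = {"the", "letters", "letter", "lett"}


def normalize(title):
    """Lowercase, drop periods, drop ignorable words, fold review -> rev."""
    words = []
    for word in title.lower().replace(".", "").split():
        if word in STRIP_WORDS:
            continue
        words.append("rev" if word == "review" else word)
    return " ".join(words)


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


a = normalize("Phys Rev Lett")
b = normalize("The Powys Review")
print(a, "|", b, "|", edit_distance(a, b))
```

After normalization the strings are "phys rev" and "powys rev", two edits apart, which is why the match fires once "Lett" is ignored.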
Yeah the second one makes sense. The Letters to Nature however doesn't. There's a few cases like that, and it always seem to be with leading stuff, rather than trailing stuff. Could be wrong about that though. Headbomb {t · c · p · b} 23:54, 7 November 2018 (UTC)
Letters to Nature (and similar cases) should be fixed. -- JLaTondre (talk) 18:25, 10 November 2018 (UTC)

Seems to work fine. As a side note, I've cleaned up everything up to WP:JCW/Target15. Could increase it to /20 or even /25. Headbomb {t · c · p · b} 21:58, 13 November 2018 (UTC)

Weirdness

There's weird stuff going on there. Headbomb {t · c · p · b} 17:06, 12 November 2018 (UTC)

In what way? They all match the "Journal > Magazine > Newspaper > Website > Database > Encyclopedia > Book > Publisher" logic. Those forms are the first one in that chain ([7], [8], & [9]) and the two redirects resolve to the correct locations. -- JLaTondre (talk) 21:45, 12 November 2018 (UTC)
Well, for instance, the old version linked to Stylus, a dab page that lists both Stylus Magazine and The Stylus as possible entries. Maybe that simply means that Stylus (magazine) needs to be created and redirected to the dab page though. I'll give it a thought now that I know what is happening. Headbomb {t · c · p · b} 19:26, 13 November 2018 (UTC)
[10] this certainly is an issue though. American Conservative redirects to The American Conservative, but the bot thinks it's meant to refer to American Conservative (book). Likewise for [11] (a magazine, now overruled by a website). Headbomb {t · c · p · b} 19:36, 13 November 2018 (UTC)
I think for this bit of logic, it's best to stop at "Journal > Magazine" and forget "Newspaper > Website > Database > Encyclopedia > Book > Publisher". Headbomb {t · c · p · b} 19:45, 13 November 2018 (UTC)
Reverted the changes & also removed the newspaper case. Re-running now so updated results will be in a bit. -- JLaTondre (talk) 23:44, 14 November 2018 (UTC)
Nothing was uploaded btw. And if you could increase targets to /20 or even /25. that would be great. Headbomb {t · c · p · b} 18:40, 15 November 2018 (UTC)
A day is still a bit, isn't it? ;-) Sorry, was interrupted. Uploading now along with the /25. -- JLaTondre (talk) 23:13, 15 November 2018 (UTC)
Also another weird thing WP:JCW/Target1 lists Science twice. Once with the dab, the other without. Headbomb {t · c · p · b} 02:29, 16 November 2018 (UTC)
Fixed. -- JLaTondre (talk) 15:43, 18 November 2018 (UTC)

@JLaTondre: any word on the new dump being processed? Headbomb {t · c · p · b} 20:16, 26 November 2018 (UTC)

Holiday delayed it. Should be up now. -- JLaTondre (talk) 12:24, 27 November 2018 (UTC)

Dab issue?

In WP:JCW/Target3, entry #293, RNA links to RNA instead of RNA (journal). Headbomb {t · c · p · b} 01:15, 28 November 2018 (UTC)

The problem is caused because "RNA (journal)" and "Rna" both normalize to "rna". However, Rna is a redirect to RNA and there was no Rna (journal) equivalent (though you have since created). When the common normalization is run, it's picking the "Rna" target over the "RNA (journal)" target. I can easily put a workaround for that case in the common output, but I'd rather solve the selection logic instead. However, since you created the redirect, shouldn't be an issue with next dump run. I'll still look at it for future cases. -- JLaTondre (talk) 22:17, 29 November 2018 (UTC)
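To make the collision concrete, a Python sketch in which both titles reduce to the same key. The preference for the disambiguated title is only one hypothetical way the selection logic could break the tie; it is not a description of the fix the bot ended up with.

```python
import re
from collections import defaultdict

DAB = re.compile(r"\s*\((?:journal|magazine)\)$")


def normalize(title):
    """Drop a trailing (journal)/(magazine) disambiguator and lowercase."""
    return DAB.sub("", title).lower()


def choose_targets(titles):
    """Pick one title per normalized key, preferring disambiguated forms."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize(title)].append(title)
    chosen = {}
    for key, members in groups.items():
        dabbed = [title for title in members if DAB.search(title)]
        chosen[key] = dabbed[0] if dabbed else members[0]
    return chosen


print(normalize("RNA (journal)"), normalize("Rna"))
print(choose_targets(["Rna", "RNA (journal)"]))
```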

Is JL-Bot supposed to be updating "Women writers articles by quality and importance"?

Hi. I'm not sure if this is the place to ask the question, so apologies for cross-posting. Regarding WP:WikiProject Women writers, I've noticed that the table on our mainpage, "Women writers articles by quality and importance" isn't updating. JL-Bot is updating other areas of the Articles section, so is it supposed to be updating the table, too? Thank you. --Rosiestep (talk) 22:16, 2 January 2019 (UTC)
@Rosiestep: Do you mean this table? If so that's updated by WP 1.0 bot (talk · contribs), which currently has some issues. There is some information at Wikipedia talk:Version 1.0 Editorial Team/Index#Bot blocked and Wikipedia:Bots/Noticeboard#Reporting article assessment "WP 1.0 bot" misbehaving, but long story short is that people are looking into it (the new operator is Kelson (talk · contribs) if you want to contact them about where things are). Headbomb {t · c · p · b} 22:51, 2 January 2019 (UTC)
Yes, Headbomb, that is the table which needs updating. I'll contact Kelson as per your suggestion. --Rosiestep (talk) 22:57, 2 January 2019 (UTC)

Don't strip final comma?

In this article, we have |journal=The Transactions of the Linnean Society of London, Series 2,. However in WP:JCW/Target10, this is reported as |journal=The Transactions of the Linnean Society of London, Series 2. Not really sure why the final comma is stripped, but it should be kept. Headbomb {t · c · p · b} 00:29, 7 January 2019 (UTC)

There was a specific step to strip trailing commas. I feel like that was requested, but not seeing anything in the archives. Before there was the TAR processing, I believe some cleanup was requested to remove some minor differences and make more entries match (another one is to remove '' at the end). I have removed the comma one and re-ran. Please look at the results and see what you think. I will also create a documentation page that describes all the manipulations (template processing, clean-up, normalization, etc.). It would be good to have a listing for future reference. It might take a couple of days to get to that, though. -- JLaTondre (talk) 02:12, 8 January 2019 (UTC)
I think that was mostly for WP:JCW/POP purposes back when it was our main way of prioritizing work. WP:JCW/TAR could get closer to raw entries to allow for cleanup and standardization. Commas and other garbage should be stripped in the comparison step, but ultimately reported. Whitespace can still be normalized, since the reader wouldn't see that. Headbomb {t · c · p · b} 02:18, 8 January 2019 (UTC)
Anyway, I cleaned all instances of final commas with User:JCW-CleanerBot. Headbomb {t · c · p · b} 02:33, 8 January 2019 (UTC)

It seems to have reprocessed everything but WP:JCW/TAR btw. Headbomb {t · c · p · b} 14:09, 8 January 2019 (UTC)

I only finished the 'regular' output yesterday. The remaining is running now. -- JLaTondre (talk) 23:55, 8 January 2019 (UTC)
Have you been doing some tweaks to WP:JCW/CRAP logic too? It's been a few times when you rerun the bot that the output changes on that page, beyond the 'fresh dump'. Headbomb {t · c · p · b} 04:34, 9 January 2019 (UTC)
No, there have been no logic changes for the questionable processing. Any changes would be the result of upstream changes in the data or your configuration settings. -- JLaTondre (talk) 00:18, 10 January 2019 (UTC)
What causes changes like [12] (see e.g. the bunch of Open journals in Bentham Science Publishers) then? Because I can't see anything in the config settings that would cause that. Headbomb {t · c · p · b} 00:42, 10 January 2019 (UTC)
Not sure. The Open journals that have been removed are the duplicate ones (per the section above). The redirect versions are still there. However, I haven't deployed any changes that should cause that, nor am I seeing any changes in the pages themselves. I'm currently replacing the existing code to improve speed and remove the remaining duplicates, so I'm not going to spend time isolating it on the old version. -- JLaTondre (talk) 19:15, 13 January 2019 (UTC)

/r/science

While [[/r/science]] is fine in mainspace, in wikipedia space that causes issues. [[:/r/science]] would be needed.

See the line that reads

in WP:JCW/Questionable2. Could apply to other places too. Headbomb {t · c · p · b} 00:56, 11 February 2019 (UTC)

Fixed. -- JLaTondre (talk) 23:31, 11 February 2019 (UTC)
It did seem to cause some weird collateral changes too. Two weird edits: [13], [14]. Everything else was fine. Headbomb {t · c · p · b} 07:44, 12 February 2019 (UTC)
Also when the bot creates the Questionable8 talk redirects, there are some issues [15]. Headbomb {t · c · p · b} 07:46, 12 February 2019 (UTC)
Redirects fixed. -- JLaTondre (talk) 20:45, 15 February 2019 (UTC)

Bot seems to be ignoring Unreliable fields and Mirrors and forks in WP:CRAPWATCH/SETUP

For instance, there is a

and a

selection tree in there. But I see no 'Alternative medicine' or 'Wikipedia:Mirrors and forks' sections in WP:CRAPWATCH anywhere, despite many of those journals and publications being cited. Headbomb {t · c · p · b} 16:18, 8 February 2019 (UTC)

Because those sections don't follow the proper syntax. The {{JCW-selected}} needs to be at the start of the line as in User:JL-Bot/Questionable.cfg#Journals. Anything else is ignored (to avoid picking up the documentation earlier in the page, etc.). In those two sections, the entries are * {{JCW-selected}} which will not be picked up. The asterisk is redundant as the template adds one. I will remove them and re-run the questionable processing. -- JLaTondre (talk) 16:33, 9 February 2019 (UTC)
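For illustration only (this is not the bot's actual code), the "start of the line" rule described above could be sketched in Python as follows. The regular expression and the function name selected_lines are assumptions made for this example.

```python
import re

# Assumed pattern: the template must open at the very start of a line.
SELECTED_AT_LINE_START = re.compile(r"^\{\{JCW-selected\b")

def selected_lines(wikitext):
    """Return only the lines that begin with {{JCW-selected."""
    return [line for line in wikitext.splitlines()
            if SELECTED_AT_LINE_START.match(line)]

sample = "\n".join([
    "{{JCW-selected|Example Journal}}",        # picked up
    "* {{JCW-selected|Bulleted Journal}}",     # ignored: leading asterisk
    "Docs mention {{JCW-selected}} inline.",   # ignored: not at line start
])
print(selected_lines(sample))
```

Under this rule, the bulleted form is skipped in the same way as a mention inside documentation text, which is the behaviour described in the comment above.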
Wow, the brain fart on that one. Thanks for finding it! Headbomb {t · c · p · b} 17:49, 9 February 2019 (UTC)
I setup a bunch of new exclusions to deal with the new influx. If you could rerun when you have time, that would be great. Headbomb {t · c · p · b} 21:36, 9 February 2019 (UTC)
Done. -- JLaTondre (talk) 20:19, 10 February 2019 (UTC)
Could you give it another go when you've got the chance? Nothing that's pressing though, so if you've got a few code updates planned in the next few days it can wait until then, but it's good to have a refreshed baseline after big updates to WP:CRAPWATCH/SETUP + User:JL-Bot/Citations.cfg. Headbomb {t · c · p · b} 09:13, 11 February 2019 (UTC)
Running. Will have the /r/science fix. -- JLaTondre (talk) 23:32, 11 February 2019 (UTC)
And another run? This will likely be the last one needed before a code update / next dump. Also feel free to increase targets to /30. Headbomb {t · c · p · b} 06:52, 15 February 2019 (UTC)
In process. -- JLaTondre (talk) 20:46, 15 February 2019 (UTC)

Minor tweak for the next run

Instead of

{{JournalsMain}}
{{JournalsLetter|letter=Questionable}}

just do

{{JCW-Main|letter=Questionable}}

Same for the magazines. I've updated the {{JCW-Main}} to call {{JCW-Letter}} when needed. Headbomb {t · c · p · b} 04:46, 16 February 2019 (UTC)

Also, there seems to be little point in having separate {{JCW-exclude}}/{{MCW-exclude}} templates, so I'd suggest excluding things from both lists using either template (i.e. if {{JCW-exclude}} is used, exclude things from both JCW/TAR and MCW/TAR lists, and if {{MCW-exclude}} is used, exclude things from both JCW/TAR and MCW/TAR lists). There may be corner cases, but I haven't found them yet. Headbomb {t · c · p · b} 07:12, 16 February 2019 (UTC)
Okay. -- JLaTondre (talk) 21:35, 17 February 2019 (UTC)
Updated to 'JCW-Main' after a page move. Headbomb {t · c · p · b} 08:44, 23 February 2019 (UTC)
Likewise, {{JournalsPrevNext}} is now {{JCW-PrevNext}}. The MCW structure has been updated with the same conventions too, with MCW instead of JCW. See Category:Journals Cited by Wikipedia templates. Headbomb {t · c · p · b} 20:20, 23 February 2019 (UTC)
Both of these changes (ignores & templates) were implemented in the last run. -- JLaTondre (talk) 11:33, 26 February 2019 (UTC)

WP:CRAPWATCH tweak

I've given a major, major expansion to WP:CRAPWATCH/SETUP and the list now draws from multiple sources. Could you take the |source= / |note= parameters of {{JCW-selected}} and add it to the target in the list? E.g. something like

Rank Target/Group
(Source)
Entries (Citations, Articles) Total Citations Distinct Articles
24 Pharmacognosy Reviews
[—]
31 31

Headbomb {t · c · p · b} 04:44, 5 February 2019 (UTC)

Yes, that is easy enough to do. I will incorporate with the changes for the prior requests. However, it will still be a bit as I need to find some free time to work on them all. -- JLaTondre (talk) 12:40, 7 February 2019 (UTC)
No worries. I was hoping this would get bumped in the priorities just a bit since it should be a pretty quick thing to do, and crapwatch/setup got a massive expansion, but I'm also thinking it might be time to convert the crapwatch to a template-based solution like we do with {{JCW-row}}, so it might save you time to implement that alongside the new format instead of doing it twice. Headbomb {t · c · p · b} 12:51, 7 February 2019 (UTC)
Note, this request is now redundant with #New format for TAR and CRAP pages below. Headbomb {t · c · p · b} 01:22, 3 March 2019 (UTC)

Duplicate listing in WP:JCW/CRAP

In the MDPI entry, you have

and then later

Only the second one is needed.

Likewise, you have

Which should be best listed as

Headbomb {t · c · p · b} 01:06, 7 October 2018 (UTC)

Just to be clear, this is not the 'easy' thing I had in mind above. Headbomb {t · c · p · b} 01:07, 7 October 2018 (UTC)

I suppose you could summarize the desired structure as

  • Level 1 (non-redirects)
    • Level 2 (2a → redirects to Level 1; 2b → typos and variants of level 1)
      • Level 3 (typos and variants of Level 2a)

Headbomb {t · c · p · b} 01:12, 7 October 2018 (UTC)

For instance, #13 (e-Century Publishing Corporation) should display as

Rank Target/Group Entries (Citations, Articles) Total Citations Distinct Articles
13 e-Century Publishing Corporation 147 136

Roughly speaking. Headbomb {t · c · p · b} 15:08, 9 October 2018 (UTC)

The "Journal of Cardiovascular Development and Disease" case is happening because the configuration is requesting both List of MDPI academic journals (to which it is a redirect) and Category:MDPI academic journals (of which it is a member). I'm still thinking about how to do the revised grouping as described above. This will be a case that will be handled as part of that. -- JLaTondre (talk) 23:23, 10 October 2018 (UTC)
Fetch entries, remove duplicates, process? Headbomb {t · c · p · b} 01:15, 11 October 2018 (UTC)
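As a rough sketch of the "fetch entries, remove duplicates, process" idea, assuming each configured source (a list page, a category) simply yields page titles, the merge could look like this in Python. The function name merge_sources and the sample data are hypothetical.

```python
def merge_sources(*sources):
    """Merge titles from several sources, keeping the first occurrence of each."""
    seen = set()
    merged = []
    for source in sources:
        for title in source:
            if title not in seen:
                seen.add(title)
                merged.append(title)
    return merged

# Hypothetical data: the same journal arrives via the list and via the category.
from_list = ["Journal of Cardiovascular Development and Disease"]
from_category = ["Journal of Cardiovascular Development and Disease",
                 "Biology (journal)"]
print(merge_sources(from_list, from_category))
```

Each title is then processed once, regardless of how many configured sources returned it.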

I believe the duplicate detection is one of the few remaining things that needs to be implemented. Headbomb {t · c · p · b} 18:10, 3 November 2018 (UTC)

Done. -- JLaTondre (talk) 21:29, 11 March 2019 (UTC)

Bulletin of the Natural History Museum

In WP:JCW/Target8, Bulletin of the Natural History Museum is missing quite a few entries.

For example, Bull. Br. Mus. (nat. Hist.) (Ent.) in WP:JCW/B31 isn't picked up, even though it's very close to Bull. Br. Mus. (Nat. Hist.) Ent.. Only differing by punctuation (brackets) and capitalization. Headbomb {t · c · p · b} 06:27, 24 January 2019 (UTC)

That is correct. Bull. Br. Mus. (Nat. Hist.) Ent. is not cited by Wikipedia so it won't match. TAR looks for common targets in the citations. It doesn't pull in non-citations for comparison. -- JLaTondre (talk) 23:44, 24 January 2019 (UTC)
How is it not cited? 'Bull. Br. Mus. (nat. Hist.) (Ent.)' is listed in WP:JCW/B31 as cited, and it was. This is a typo'd variant of Bull. Br. Mus. (Nat. Hist.) Ent. which redirects to Bulletin of the Natural History Museum, so should be regrouped under targets of Bulletin of the Natural History Museum. If that's not how things currently work, then that's how they should work. Headbomb {t · c · p · b} 01:14, 25 January 2019 (UTC)
Bull. Br. Mus. (Nat. Hist.) Ent. is not cited. The fact that it's a redirect is irrelevant to the current processing. The request for TAR was to group common targets among the citations in the WP:JCW/ALPHA pages. If it's not in those pages, it doesn't get included in the TAR processing. If you want to start including redirects, that can be done, but it is currently correct based on the original specs. -- JLaTondre (talk) 02:21, 25 January 2019 (UTC)

Well that's not how I remember making the request back then, but it could just be an ambiguity in the original wording. No matter, the general desired logic is

  1. Find 'targets' (e.g. non-redirects, and landing pages)
  2. For each target, regroup both direct 'matches' (the target entries, plus anything that redirects to the target) and indirect matches (variant and typos of direct matches).
  3. Excluding things from the exclusion list, and things with zero hits.

Headbomb {t · c · p · b} 02:36, 25 January 2019 (UTC)
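A minimal Python sketch of steps 2 and 3 above, for illustration only (the bot's real implementation is not shown on this page). It assumes the target is already known (step 1), that redirects and citation counts are available as plain dictionaries, and that "variant" means "equal after stripping punctuation and case", which is a deliberate simplification. All names and the citation count are made up.

```python
import re

def normalize(name):
    """Toy normalization (assumption): keep only lowercase letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def group_target(target, redirects, citations, excluded):
    # Step 2a: direct matches are the target plus anything redirecting to it.
    direct = {target} | {r for r, t in redirects.items() if t == target}
    direct_keys = {normalize(d) for d in direct}
    # Step 2b: indirect matches are cited names that normalize like a direct match.
    indirect = {c for c in citations
                if c not in direct and normalize(c) in direct_keys}
    # Step 3: drop exclusions and anything with zero citations.
    return {name: citations[name]
            for name in direct | indirect
            if name not in excluded and citations.get(name, 0) > 0}

target = "Bulletin of the Natural History Museum"
redirects = {"Bull. Br. Mus. (Nat. Hist.) Ent.": target}
citations = {"Bull. Br. Mus. (nat. Hist.) (Ent.)": 3}  # count is made up
print(group_target(target, redirects, citations, excluded=set()))
```

In this toy run the uncited redirect acts only as a bridge: it is dropped in step 3 for having zero hits, while the cited typo'd variant is grouped under the target.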

Done. -- JLaTondre (talk) 21:30, 11 March 2019 (UTC)

New format for TAR and CRAP pages

If you could implement these new formats for the TAR (see [16]) and CRAP (see [17]) pages, that would be great. Headbomb {t · c · p · b} 22:42, 27 February 2019 (UTC)

The new date format (20120220 → 2012-02-20) should apply across the board though. Headbomb {t · c · p · b} 23:01, 27 February 2019 (UTC)
No promises, but I should have time this weekend to finish off the updated version. If so, I will include the new format also. -- JLaTondre (talk) 02:36, 28 February 2019 (UTC)
Looking forward to it. The new format should make it easier to tweak appearance without your involvement, but also make it much easier to review diffs. Headbomb {t · c · p · b} 02:39, 28 February 2019 (UTC)
Actually, you can put the date in {{JCW-date}} directly instead. It's no longer required in {{JCW-bottom}}. Headbomb {t · c · p · b} 02:37, 5 March 2019 (UTC)
Do you mean edit the {{JCW-date}} page? Why add an additional page edit? -- JLaTondre (talk) 16:31, 10 March 2019 (UTC)
The idea is that this can be used on other pages and is more easily readable by other bots like User:RonBot. Technically JCW-bottom could still be used as is (in case the bot stops mid-run or something), but JCW-date is more useful since it can be transcluded on other pages, like the main WP:JCW page. Headbomb {t · c · p · b} 20:50, 10 March 2019 (UTC)
Done. -- JLaTondre (talk) 21:37, 11 March 2019 (UTC)
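For reference, the date format change requested in this thread (20120220 → 2012-02-20) is a one-line conversion. A small Python sketch, with the function name to_iso_date chosen for this example:

```python
from datetime import datetime

def to_iso_date(compact):
    """Convert a YYYYMMDD string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(compact, "%Y%m%d").date().isoformat()

print(to_iso_date("20120220"))  # 2012-02-20
```

Parsing through datetime rather than slicing the string also rejects impossible dates such as 20190230.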

African Journal of Traditional, Complementary and Alternative Medicines not picked up?

In WP:CRAPWATCH/SETUP, there is


In Category:Alternative and traditional medicine journals, there is African Journal of Traditional, Complementary and Alternative Medicines.

Afr J Tradit Complement Altern Med redirects to African Journal of Traditional, Complementary and Alternative Medicines, but isn't reported in the 'Alternative medicine' entry in WP:JCW/Questionable1 (#6).

Headbomb {t · c · p · b} 01:30, 11 February 2019 (UTC)

That is because African Journal of Traditional, Complementary and Alternative Medicines is listed under the journal configuration section. It and its redirect show up as 52 on WP:JCW/Questionable2. Since the questionable processing is based on the common target processing, it doesn't have the concept of something resolving to two targets. Is that needed? Or can you live with the unreliable fields configuration being a catch all and the journal configuration section effectively being an override for when you want a specific listing? -- JLaTondre (talk) 23:39, 11 February 2019 (UTC)
Well it's not super critical, as long as it gets picked up, that's the number one priority. But it would be useful to have it listed in both places. Kinda like if an OMICS journal of quack medicine gets a standalone entry, it doesn't stop being an OMICS journal, a quack medicine journal, or something that was individually called to be crap. So people interested in doing OMICS cleanup will find it under the OMICS entry, people interested in quack medicine cleanup will find it under the quack medicine entry, and as a standalone entity, it'll be traceable to whatever organization called it crap. Headbomb {t · c · p · b} 23:48, 11 February 2019 (UTC)
Okay, I need to bite the bullet and just re-write the target processing. That way I can solve all these issues/requests. -- JLaTondre (talk) 20:45, 15 February 2019 (UTC)
Was this (partly) re-written already? It's finding... new things like Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Target10 (Oncology, entry #975). Headbomb {t · c · p · b} 03:47, 16 February 2019 (UTC)
A couple of experimental tweaks, nothing definitive. The full re-write is still in progress. -- JLaTondre (talk) 21:34, 17 February 2019 (UTC)

Just making sure I wasn't crazy or did something weird somewhere. They're good pick ups, so whatever you did, it's a good tweak. Feel free to rerun whenever you tweak, I always check things. Plus nearly all remaining exclusions for CRAPWATCH were set up (not everything everything, but everything obvious). The tail end of false positives should be gone now. Headbomb {t · c · p · b} 22:50, 17 February 2019 (UTC)

Need to check this one (page resolving to multiple questionable targets) once the latest results are uploaded. Based on the updates, I think it should work as desired now, but I didn't have this on my written list when doing the changes so didn't specifically do anything for it. -- JLaTondre (talk) 21:36, 11 March 2019 (UTC)
Done. -- JLaTondre (talk) 23:48, 11 March 2019 (UTC)

Signpost draft: User:Headbomb/Crapwatch

Currently creating some signpost draft article about the crapwatch. Feedback welcome. I don't plan on sending it for publication until at least User talk:JL-Bot#WP:CRAPWATCH tweak is taken care of and you give me the thumbs up, but I could hold off for longer too depending on when the other issues get tackled. Headbomb {t · c · p · b} 10:07, 26 February 2019 (UTC)

Only recommendation I have is to change the sentence that states "Most false positives can however be bypassed manually, and the compilation will be updated accordingly on the next bot run." Maybe to "False positives can be manually identified and added to the configuration so that they will be removed in future bot runs"? Just to clarify that the configuration is manual, but the actual bypassing is not. The original wording could be ambiguous to someone not familiar with it. Other than that, thanks for the shout out. -- JLaTondre (talk) 23:18, 11 March 2019 (UTC)
Changed to "However, false positives can be manually identified, and the compilation will be updated accordingly in future bot runs." Headbomb {t · c · p · b} 02:48, 12 March 2019 (UTC)

March 2019

  Hello. This is a message to let you know that one or more of your recent contributions, such as the edit you made to Portal:Architecture, did not appear constructive and has been reverted. Please take some time to familiarise yourself with our policies and guidelines. You can find information about these at our welcome page which also provides further information about contributing constructively to this encyclopedia. If you only meant to make test edits, please use the sandbox for that. If you think I made a mistake, or if you have any questions, you may leave a message on my talk page. Please do not spam vast indiscriminate lists of links to a portal page BrownHairedGirl (talk) • (contribs) 02:39, 16 March 2019 (UTC)

@BrownHairedGirl: you're aware JL-Bot is a bot that was specifically asked to do this, right? And will keep re-adding that content every week, because, again, it was asked to include it. See WP:RECOG. Headbomb {t · c · p · b} 03:19, 16 March 2019 (UTC)
@Headbomb: yes thanks, I was aware. This is where Twinkle brought me, so I thought I'd leave a note here as a first step. Some project or somewhere needs to be notified to change the bot's instructions, but I'm not sure where. --BrownHairedGirl (talk) • (contribs) 03:25, 16 March 2019 (UTC)
The Transhumanist added it to that page. You will either need to revert that edit and remove the config or discuss it with him. -- JLaTondre (talk) 12:48, 16 March 2019 (UTC)

Weird Image/File issue

Entry #2923 in WP:JCW/Target30 (Animage) has the following entries

There's an interwiki issue there, but the actual entries are Image: The Journal of Nursing Scholarship and Image: Journal of Photography of the George Eastman House. Headbomb {t · c · p · b} 06:51, 15 March 2019 (UTC)

Take a look at the page source. The actual entry listed is "Image: Journal of Photography of the George Eastman House" as per the template. When Image: pages are linked, the wiki software auto-converts them to display File: (legacy from when the Image: namespace was renamed File:). Luckily, they were not actual images or they would have been displayed. I can prefix them with ":" so it will both not display an image (if actually one) and prevent the auto-conversion (ex. Image: The Journal of Nursing Scholarship). -- JLaTondre (talk) 21:08, 15 March 2019 (UTC)
Prefixing seems the way to go. Headbomb {t · c · p · b} 03:17, 16 March 2019 (UTC)
Done. Will reprocess & post. -- JLaTondre (talk) 16:59, 16 March 2019 (UTC)
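A hedged Python sketch of the colon-prefix rule discussed here and in the /r/science section: titles that start with "Image:" or "File:", or with a slash, get a leading ":" inside the link. The function name safe_wikilink and the exact set of prefixes are assumptions for illustration; the bot's real rule may cover other cases.

```python
# Assumed set of prefixes that trigger image embedding or auto-conversion.
RISKY_PREFIXES = {"image", "file"}

def safe_wikilink(title):
    """Wrap a title in [[...]], adding a leading colon where needed."""
    prefix, sep, _ = title.partition(":")
    needs_colon = title.startswith("/") or (
        sep == ":" and prefix.strip().lower() in RISKY_PREFIXES)
    return "[[:" + title + "]]" if needs_colon else "[[" + title + "]]"

print(safe_wikilink("Image: The Journal of Nursing Scholarship"))
print(safe_wikilink("/r/science"))
print(safe_wikilink("Nature"))
```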

Citation New Version Live

The updates to the citation processing are complete. They include:

  • Popular Targets (TAR) processing now also looks for redirects to the target even if the redirect is not a citation
  • TAR & Questionable Targets (CRAP) output now uses templates
  • TAR & CRAP normalization matching has been improved to catch some cases that could have been missed
  • TAR & CRAP pages now list the timestamps of the configuration page(s) used for the run (might want to move this to a template)
  • CRAP duplicates in the output are now removed
  • CRAP output now includes the Source and Notes fields
  • Improved identification of redirects to disambiguation pages
  • Code has been refactored to (hopefully) make future updates easier with less unexpected interactions

The output from the new version is saving now. It was generated yesterday (I wanted to be around when it was uploading the results so I could monitor it) so will be missing the latest configuration file changes. However, the first item above will result in more false positives that need to be suppressed (ex. Nat. redirects to Nature which matches At, MAT, NA, etc.). Once the false positives have been updated based on this run, let me know here and I'll re-run. Also, let me know if you see anything unexpected. -- JLaTondre (talk) 21:27, 11 March 2019 (UTC)

I'll take a look. The first thing that stands out is in diffs like these [18], those extra '(journal)' (and associated searches) seem ... unnecessary and sometimes detrimental. Makes the target column for Molecular Phylogenetics and Evolution look like a redirect, when it shouldn't be, for example. The old behaviour there was better. Headbomb {t · c · p · b} 00:36, 12 March 2019 (UTC)
I changed it to remove the " (journal|magazine)" from the searches. For the target, it ignored redirects tagged as unnecessary. I've updated it to also ignore "TITLE (journal|magazine)" redirects that point back to "TITLE". I think that was the original behavior. -- JLaTondre (talk) 02:23, 12 March 2019 (UTC)
Oh boy, TAR/CRAP is a mess. It seems that it's missing the 'if it exists and doesn't point back to the target, it's not a match' step. For example, Nat. redirects to Nature (journal), which matches At is fine, if At doesn't exist. But At existing and not pointing to Nature (journal) should exclude At from the matches. Likewise in the TAR entry for New Scientist (#6) it matches things that clearly don't point to New Scientist, like NEWSru, Science (journal), Sun Journal and News24. Those should all be filtered out, and relatively early on, so you're not compounding the issue by looking for variants of NEWSru, Sun Journal, etc... for New Scientist. Headbomb {t · c · p · b} 00:40, 12 March 2019 (UTC)
Ugh, I overlooked that check in the prior version. I put it back in and am running it. Hopefully the next version looks better. Sorry about that! -- JLaTondre (talk) 02:23, 12 March 2019 (UTC)
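The check described above ("if it exists and doesn't point back to the target, it's not a match") can be sketched in Python as below. This is only an illustration: the resolves_to lookup table and its contents are hypothetical stand-ins for whatever page data the bot actually uses.

```python
def keep_candidate(candidate, target, resolves_to):
    """Keep redlinks and pages that resolve to the target; drop everything else.

    resolves_to maps each existing page to the page it finally lands on
    (itself for a normal article). A missing key means the page is a redlink.
    """
    final = resolves_to.get(candidate)
    return final is None or final == target

# Hypothetical lookup table for illustration.
resolves_to = {
    "Nat.": "Nature (journal)",   # redirect to the target
    "At": "At",                   # existing page that is not the target
}
candidates = ["Nat.", "At", "Natt"]  # "Natt" is an invented redlink
kept = [c for c in candidates
        if keep_candidate(c, "Nature (journal)", resolves_to)]
print(kept)
```

Applying the filter before searching for further variants avoids compounding the problem, as noted in the comment above.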
It can't not look better haha. I'll have more comments, but at the moment, it's hard to even read the compilation and find out what's good/bad behaviour. The new templated version seems fine, as far as structure goes, although some CRAP matching probably needs a bit of refinement. I'll know more once the new upload is up. Headbomb {t · c · p · b} 02:36, 12 March 2019 (UTC)
In future runs, use ISO format for dates. (https://en.wikipedia.org/w/index.php?title=Template:JCW-date&curid=60140457&diff=887330264&oldid=886911175). Headbomb {t · c · p · b} 02:42, 12 March 2019 (UTC)
It's still appending extra stuff for searches [19]. Not a critical fix that needs another run though. Headbomb {t · c · p · b} 08:23, 12 March 2019 (UTC)
Should really be fixed now, but will wait on another run to upload. -- JLaTondre (talk) 01:25, 13 March 2019 (UTC)
I uploaded a few of Journal/A pages to verify fixed, but not the whole set. Will do that when everything else looks good. -- JLaTondre (talk) 01:47, 14 March 2019 (UTC)
Done. -- JLaTondre (talk) 17:32, 14 March 2019 (UTC)

New version, part 2

Alright, now that the uploaded version makes more sense, we can read things more sanely.

The first thing is that it's obvious that the matching algorithm is too aggressive when 'small' names are concerned, so it's got to be de-aggressivized. For example, The Astrophysical Journal matches

Presumably through one of its small redirects. I'm guessing the algorithm goes something like

  • A) Search for The Astrophysical Journal, fetch redirects such as APJ/APJL/APJS (which form set A)
  • B) Search for normalized variants and expanded variants of set A (which form set B, having an APJ+Journal/APJL+Journal/APJS+Journal in there somewhere)
  • C) Search for typos of set B (which form set C), finding CAN Journal and most others as a typo of one or more of APJ+Journal/APJL+Journal/APJS+Journal
  • D) Exclude articles from set C which don't point back to The Astrophysical Journal, but keep the redlinks (e.g. most of them)

Here by 'normalized variant', I mean normalizing Journal/J, ignoring Series/Supplement/Letters/Trailing garbage etc... By 'expanded variant' I mean artificially adding 'Journal/Magazine' to see if you can get a match.

So for small names, or perhaps in general, what it should do is instead

  • A) Search for The Astrophysical Journal, fetch redirects such as APJ/APJL/APJS (which form set A)
  • B) Search for normalized variants and expanded variants of set A (which form set B)
  • C) Search for typos of set B (which form set C)
  • D) Throw away things that don't point back to The Astrophysical Journal, keep the redlinks (which form set D)
  • E) Search for expanded variants of D (which form set E)
  • F) Throw away things that don't point back to The Astrophysical Journal, keep the redlinks (which form set F)

Headbomb {t · c · p · b} 09:15, 12 March 2019 (UTC)

The processing actually works as follows:
For each target to be processed:
  1. Find all citations that resolve to that target
  2. Find all redirects to the target
  3. For each of the above, use their normalization to find other citations with same pattern. Matches are defined as:
    1. For strings of <3 characters, require an exact match
    2. For strings of 3-5 characters, allow 1 character delta
    3. For strings of 6-20 characters, allow 1-2 character deltas
    4. For strings of 21+ characters, allow 1-3 character deltas
  4. Toss out any false positives
  5. Toss out any that resolve to articles
For CRAP, the logic is the same except that for step 2 it uses the additional parameters in the configuration line.
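The length-based tolerances in step 3 can be illustrated with a short Python sketch. Two assumptions are made loudly here: "character delta" is modelled as Levenshtein edit distance (the bot's actual comparison may differ), and the tolerance is taken from the length of the search term. The function names are invented for this example.

```python
def max_delta(length):
    """Allowed character deltas for a normalized string of this length."""
    if length < 3:
        return 0      # exact match required
    if length <= 5:
        return 1
    if length <= 20:
        return 2
    return 3

def edit_distance(a, b):
    """Levenshtein distance (assumed stand-in for 'character delta')."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_match(term, candidate):
    return edit_distance(term, candidate) <= max_delta(len(term))

print(is_match("apjournal", "canjournal"))  # 9 characters, so 2 deltas allowed
print(is_match("ap", "cap"))                # under 3 characters, exact only
```

Under these assumptions, a 9-character normalization such as "apjournal" tolerates two deltas, which is why short redirects expanded with "journal" pick up so many unrelated names.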
In the "The Astrophysical Journal" case, it has a redirect of "Ap J" which normalizes to "apjournal". That is what causes the above hits. "J" is expanded to "journal" in order that cases like "Nature J." and "Nature Journal" match. This is the logic that has been used for a long time. The problem is that redirects like these are now being pulled in. I can see four solutions:
  1. Switch the normalization from "j" -> "journal" to "journal" -> "j". This would reduce the string size which would reduce the tolerance. If this was done, only "CAP Journal" in the above list would be a hit. However, I don't think this is a good idea as: a) it would stop catching typos of "journal"; and b) it would stop catching cases where there is not a space between the term and "journal".
  2. Continue to normalize as is, but if both normalizations being compared end in "journal", strip the "journal" from both and compare the remainders. Use the existing delta rules. In this case, that would result in "ap" being compared and all of the above would be tossed. It should still catch the a & b case from the above option.
  3. Mark these as false positives and be done with it. This is the first run with the redirect addition so should stabilize.
  4. Revert the inclusion of redirects.
Let me think on it and see if I come up with any other options. If not, I'll give 2 a shot. -- JLaTondre (talk) 01:22, 13 March 2019 (UTC)
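Option 2 above can be sketched in Python as follows. As before, this is an illustration under assumptions rather than the bot's code: edit distance stands in for "character delta", the tolerance is taken from the length of the (possibly stripped) search term, and the function names are invented.

```python
def max_delta(length):
    """Existing delta rules: <3 exact, 3-5 one, 6-20 two, 21+ three."""
    return 0 if length < 3 else 1 if length <= 5 else 2 if length <= 20 else 3

def edit_distance(a, b):
    """Levenshtein distance (assumed stand-in for 'character delta')."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_match_option2(term, candidate, suffix="journal"):
    # If both normalizations end in the suffix, compare only the remainders.
    if term.endswith(suffix) and candidate.endswith(suffix):
        term = term[:-len(suffix)]
        candidate = candidate[:-len(suffix)]
    return edit_distance(term, candidate) <= max_delta(len(term))

print(is_match_option2("apjournal", "capjournal"))      # "ap" vs "cap"
print(is_match_option2("naturejournal", "naturejornal"))  # typo of "journal"
```

With this rule "apjournal" is compared as "ap", which requires an exact match, while a typo inside "journal" itself still falls through to the full-string comparison.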
Of those, #2 seems the best (likewise for 'magazine'), or at least worth giving a try to see how it goes. Not sure how well my suggested algorithm above would perform / how easy to implement it would be. It might be better, it might be worse.
But manually marking them as false positives is... they're just so many of them. Headbomb {t · c · p · b} 01:48, 13 March 2019 (UTC)
I implemented #2 and re-ran for TAR. Take a look at it and see what you think. If good, I will update CRAP as well. I did not do Part 5 (below) yet as wanted to see the changes separately (in case something went weird). -- JLaTondre (talk) 01:46, 14 March 2019 (UTC)
It's brought things down to a much, much, more manageable level. Run it all (although feel free to keep #5 separate). Headbomb {t · c · p · b} 02:30, 14 March 2019 (UTC)
CRAP done. -- JLaTondre (talk) 17:33, 14 March 2019 (UTC)

New version, part 3

The bot stopped at WP:JCW/Questionable10, leaving WP:JCW/Questionable11/WP:JCW/Questionable12/WP:JCW/Questionable13/WP:JCW/Questionable14 useless. They should get CSD'd when no longer needed. Headbomb {t · c · p · b} 09:21, 12 March 2019 (UTC)

Already handled. The bot lets me know which pages are no longer needed & I delete them after the run. Normally, I would be around when it completes. -- JLaTondre (talk) 20:39, 12 March 2019 (UTC)

New version, part 4

In WP:JCW/CRAP, several entries are missing their |source=. E.g. entry #951 in WP:JCW/Questionable10 (Pattern Recognition in Physics) is missing |source=BLJ from WP:CRAPWATCH/SETUP. Headbomb {t · c · p · b} 09:24, 12 March 2019 (UTC)

Fixed, but will wait on another run to upload. -- JLaTondre (talk) 01:25, 13 March 2019 (UTC)
Still have some cases not working. Will look into it more. -- JLaTondre (talk) 17:34, 14 March 2019 (UTC)

This is possibly related to entries like

which choke because of the pipe in the note. I fixed that, although I don't know if that's going to fix these issues. Headbomb {t · c · p · b} 07:44, 15 March 2019 (UTC)

Most cases were due to a typo that I couldn't see for staring at it. I've fixed that, but haven't updated as the content task has been running for the past day (I need to work on making that more efficient also). You are correct that pipes would also create an issue. If they can be avoided, I'd rather not have to deal with them. -- JLaTondre (talk) 21:10, 15 March 2019 (UTC)
Should be avoidable in general, although it'd be useful if they were handled. Not a priority in the least though. Headbomb {t · c · p · b} 03:16, 16 March 2019 (UTC)
Implemented. If you see any cases it doesn't handle, please let me know. -- JLaTondre (talk) 19:13, 16 March 2019 (UTC)
There's nothing that needs it at the moment, but I'll update it later to have shorter notes in certain cases. Headbomb {t · c · p · b} 19:27, 16 March 2019 (UTC)
@JLaTondre:, still many cases it doesn't handle. See WP:JCW/Questionable10, entries 902–905 and 909. There are more (e.g. 31 entries in WP:JCW/Questionable9). Headbomb {t · c · p · b} 06:14, 17 March 2019 (UTC)
Ugh, got wrapped up in the pipe case & forgot to validate the change with other cases. Should be fixed. -- JLaTondre (talk) 16:26, 17 March 2019 (UTC)

New version, part 5

In WP:JCW/CRAP (and in TAR, but mostly in CRAP), there's a lot of ABCD types of acronyms, which match other unrelated ABCD type of acronyms.

So if you've got something which is one single, all caps word, don't look for typos. This way something like IJSSMS doesn't match IJSSMM but only IJSSMS + capitalized variants + IJSSMS Journal.

Alternatively, perhaps simpler, there could be a final step that removes allcaps acronyms that don't match the initial input save in capitalization.

This way if you have APJ in WP:CRAPWATCH/SETUP (or at a redirect to something that would get picked up by WP:CRAPWATCH/SETUP), you'd keep APj if it's found, but would throw away APT.

Headbomb {t · c · p · b} 10:07, 12 March 2019 (UTC)

Doable. I will roll in with changes to Part 2 above. -- JLaTondre (talk) 01:31, 13 March 2019 (UTC)
@JLaTondre: was this part implemented? Because entry #912 in WP:JCW/Questionable10 still matches NELS to JELS, for example. Headbomb {t · c · p · b} 06:09, 17 March 2019 (UTC)
No, I decided to do it separately so that I could verify each change. My next step was to ask for an example to test against. You beat me to that. ;-) -- JLaTondre (talk) 16:27, 17 March 2019 (UTC)
Looking forward to this being implemented. I believe it's the last 'major' thing that needs to be in for false positives to fall back to a manageable level. Could be wrong about that, but very much looking forward to it. Headbomb {t · c · p · b} 16:30, 17 March 2019 (UTC)
Initial version implemented. Generating TAR & CRAP output (local). Probably won't have time to look at it and validate until tomorrow. -- JLaTondre (talk) 18:51, 17 March 2019 (UTC)
Cool. I'll be doing a bunch of additions to the exclusions in the meantime, so if your tests show that nothing's blown up tomorrow, do a fresh run then before uploading. Headbomb {t · c · p · b} 19:00, 17 March 2019 (UTC)
It looks good to me. I'm re-running & will upload. By the way, in those bunch of additions you made, you entered a number that you didn't have to because they would be excluded by this test. -- JLaTondre (talk) 23:03, 18 March 2019 (UTC)
Yup. Hence #Unnecessary exclusions report below. I tried avoiding most that wouldn't get picked up under the new rules, but I didn't go out of my way to triple check I only included the ones that would get picked up, especially since I haven't seen the new rules in action yet. The main concern was to get rid of as many false positives as possible before the new run, at least so that WP:JCW/CRAP wouldn't be so massive. I left a couple of long entries as they were, since the lack of a typo hierarchy made it hard to gauge if they were false positives, or legit pickups. Headbomb {t · c · p · b} 23:17, 18 March 2019 (UTC)

Doesn't seem to work. Taking WP:JCW/Questionable9, you've got entries 802, 818, 823, 829, 832, 833... and many many others, all matching inexact all caps acronyms. Headbomb {t · c · p · b} 07:19, 19 March 2019 (UTC)

It's only ignoring all uppercase words of the same length. For "International Journal on Research Methodologies in Physics and Chemistry", the configuration has "IJRMPC" (6 letters), but the reported result is "IJRAP" (5 letters). I interpreted the request that way based on your examples all being the same length, but on re-reading, I see that wasn't stated. I can change it so that if the search term is all uppercase, it throws out any results that are all uppercase. -- JLaTondre (talk) 03:07, 20 March 2019 (UTC)
Ah I see, my initial request was ambiguous there, yes. Headbomb {t · c · p · b} 08:47, 20 March 2019 (UTC)
Changed to ignore uppercase words of any length (when target is also an uppercase word). Running now. -- JLaTondre (talk) 00:47, 21 March 2019 (UTC)
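The all-caps rule as finally agreed in this thread might look like the following Python sketch. It is an interpretation, not the deployed code: an "acronym" is assumed to be a single word made only of uppercase letters, and an exact match is assumed to be kept. The function names are invented.

```python
def is_acronym(text):
    """Single word consisting only of uppercase letters (assumed definition)."""
    return text.isalpha() and text.isupper()

def keep_result(term, result):
    # When both sides are all-caps words, only an exact match survives,
    # regardless of length.
    if is_acronym(term) and is_acronym(result):
        return result == term
    return True

print(keep_result("APJ", "APj"))              # capitalization variant: kept
print(keep_result("APJ", "APT"))              # different acronym: dropped
print(keep_result("IJRMPC", "IJRAP"))         # different length: dropped
print(keep_result("IJSSMS", "IJSSMS Journal"))  # not a single word: kept
```

Results that pass this filter would still be subject to the usual delta rules and exclusions.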

New version, part 6

To cut down on false positives, some words shouldn't count as far as the "length" of the string is concerned.

  • Bulletins/Bulletin/Bull./Bull
  • Journals/Journal
  • News
  • Newsletter/Newsl.
  • Magazine/Mag.
  • Proceedings/Proceeding/Proc./Proc
  • Reviews/Review/Rev./Rev
  • Online
  • Transactions/Transaction/Trans./Trans

So if you have something like say CA News, then for purpose of comparison, the string length should be 2, rather than 6. Headbomb {t · c · p · b} 17:44, 14 March 2019 (UTC)

Done. Will reprocess & post. -- JLaTondre (talk) 16:02, 16 March 2019 (UTC)
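As an illustration only (not the bot's implementation), the length rule requested above could look something like the following. The word list is taken from the request; the function name, the case-insensitive comparison, and counting only non-space characters are assumptions.

```python
# Generic words that should not count toward the compared length
# (list taken from the request above).
IGNORED_WORDS = {
    "bulletins", "bulletin", "bull.", "bull",
    "journals", "journal",
    "news",
    "newsletter", "newsl.",
    "magazine", "mag.",
    "proceedings", "proceeding", "proc.", "proc",
    "reviews", "review", "rev.", "rev",
    "online",
    "transactions", "transaction", "trans.", "trans",
}


def effective_length(title: str) -> int:
    """Count the characters of a title, skipping ignored words and spaces."""
    kept = [word for word in title.split() if word.lower() not in IGNORED_WORDS]
    return sum(len(word) for word in kept)


# 'CA News' counts as 2 ('CA') rather than 6 ('CANews').
print(effective_length("CA News"))
```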

New version, part 7

These can be shoved in {{JCW-bottom|e-id=887990773|q-id=887859138|r-time=2019-03-17}} to give

Date
Compilation last updated on 17 March 2019. The Wikipedia CiteWatch results are based on the database dump of 8 May, using this configuration, with these exclusions.

Making use of |e-id= and |q-id= on TAR/CRAP pages as relevant. Headbomb {t · c · p · b} 06:54, 17 March 2019 (UTC)

Done. -- JLaTondre (talk) 18:49, 17 March 2019 (UTC)

New version, part 9

In [20], the following exclusion was set up

however, it is not respected in WP:JCW/Questionable9 (entry #804). Headbomb {t · c · p · b} 12:24, 19 March 2019 (UTC)

Likewise, it had

but those were still included in WP:JCW/Questionable1 (entry#3) Headbomb {t · c · p · b} 13:13, 19 March 2019 (UTC)

Fixed. -- JLaTondre (talk) 02:48, 20 March 2019 (UTC)

Seems to work, although some of the associations were lost, for example,

used to suppress 'Biologue' from WP:JCW/Questionable1, because Biologue was a match for Biology (journal), which is under MDPI. It's not really the end of the world, since they can be re-declared, but it would be useful to have those exclusions back (especially if the 3-level hierarchy is implemented) since they were already done/working before. Headbomb {t · c · p · b} 09:03, 20 March 2019 (UTC)

To be clear, it's not that Biologue couldn't get picked up. If there was a different MDPI journal (named 'Biologia'), then it would be a match for that and should be reported as is. Just that if matching Biology (journal) is the only reason it's included under MDPI, then it should be excluded. Headbomb {t · c · p · b} 09:10, 20 March 2019 (UTC)
Changed to ignore at both levels -- the main questionable entry (the behavior just implemented) or the additional targets (the original behavior). Running now. -- JLaTondre (talk) 00:50, 21 March 2019 (UTC)


New JCW-exclude format

I've added a crapton of exclusions (so a rerun would do wonders, even if you don't have the new dump yet). However, we're hitting the template expansion limit, so in addition to the 'normal' format

{{JCW-exclude|The Wire (magazine)|The Wave Magazine}}
{{JCW-exclude|The Wire (magazine)|The WILD Magazine}}
{{JCW-exclude|The Wire (magazine)|The Wild Magazine}}
{{JCW-exclude|The Wire (magazine)|WHERE Magazine}}
{{JCW-exclude|The Wire (magazine)|WHERE magazine}}
{{JCW-exclude|The Wire (magazine)|WILD Magazine}}
{{JCW-exclude|The Wire (magazine)|Wild Magazine}}

could you support

{{JCW-exclude|The Wire (magazine)|The Wave Magazine|The WILD Magazine|The Wild Magazine|WHERE Magazine|WHERE magazine|WILD Magazine|Wild Magazine}}

? Headbomb {t · c · p · b} 13:01, 21 March 2019 (UTC)

The new format isn't implemented yet, but once it's supported, we could likely get User:RonBot to merge and sort entries. Headbomb {t · c · p · b} 13:13, 21 March 2019 (UTC)
Actually put this on hold. The 'new' format blows up post-template expansion, so it doesn't make pages easier to load/edit. Maybe later, but for now this isn't needed. Headbomb {t · c · p · b} 00:57, 23 March 2019 (UTC)
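Should the combined form be revisited, a parser could plausibly treat both forms identically: the first parameter is the target and every later parameter is an excluded name. The sketch below is hypothetical (the regex and function name are not from the bot) and assumes exclusion parameters contain no nested templates.

```python
import re

# Matches {{JCW-exclude|target|name1|name2|...}} with no nested braces.
TEMPLATE_RE = re.compile(r"\{\{\s*JCW-exclude\s*\|([^{}]*)\}\}")


def parse_exclusions(wikitext: str) -> dict[str, set[str]]:
    """Map each target to the set of names excluded for it."""
    exclusions: dict[str, set[str]] = {}
    for match in TEMPLATE_RE.finditer(wikitext):
        parts = [part.strip() for part in match.group(1).split("|")]
        target, excluded = parts[0], parts[1:]
        # One-per-template and many-per-template calls merge into one set.
        exclusions.setdefault(target, set()).update(name for name in excluded if name)
    return exclusions


old_style = (
    "{{JCW-exclude|The Wire (magazine)|The Wave Magazine}}\n"
    "{{JCW-exclude|The Wire (magazine)|WILD Magazine}}\n"
)
new_style = "{{JCW-exclude|The Wire (magazine)|The Wave Magazine|WILD Magazine}}"
print(parse_exclusions(old_style) == parse_exclusions(new_style))
```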

New run

Thanks for the refresh. I just set up a bunch of exclusions for the expanded crapwatch, so another run would be rather productive at the moment. Headbomb {t · c · p · b} 09:58, 26 February 2019 (UTC)

Will re-run later today. There were a couple of new templates in new citations that I will update the code to handle. -- JLaTondre (talk) 11:34, 26 February 2019 (UTC)
There's some weird stuff going on with WP:MCW/TAR, first Spin (magazine) ≠ Scan Magazine doesn't seem to work (entry #18). There are other examples where the most recent exclusions didn't kick in on the WP:MCW/TAR pages. Many seem related to (magazine) entries, but Pacific RailNews ≠ RiaNews also didn't seem to work on WP:MCW/Target2. Second, the counts are a bit different for similar entries. E.g. Billboard (4365 in 1844) going up to Billboard (4445 in 1880). Headbomb {t · c · p · b} 06:54, 27 February 2019 (UTC)
For the exclusions, I believe the issue was the processing was from before those entries were added to the configuration page. While the save timestamp is after they were added, the run had actually occurred before that and was uploaded later (normally doesn't happen, but sometimes the processing gets broken up). I re-ran the target processing and it is excluding them. I will update both the target and questionable output to include the timestamps of the configuration page for future reference. For the Billboard case, the diff is comparing the results of the 02/01 dump with those from the 02/20 dump so they should differ. -- JLaTondre (talk) 02:34, 28 February 2019 (UTC)

Been a while since the new dump is out. Even if you don't have time to implement the latest tweaks, a new run would be useful. Headbomb {t · c · p · b} 09:52, 8 March 2019 (UTC)

The old version is running. I hope to have the new version wrapped up this weekend. -- JLaTondre (talk) 03:30, 9 March 2019 (UTC)
Bot seems to have choked on the Crapwatch. Did everything else fine though. Headbomb {t · c · p · b} 12:33, 9 March 2019 (UTC)
Questionable pages saved. -- JLaTondre (talk) 13:43, 9 March 2019 (UTC)
See [21], although this will be superseded by the new format below eventually. I fixed the pages, so no need to re-run. Headbomb {t · c · p · b} 14:00, 9 March 2019 (UTC)
I did set up a bunch of new exclusions, but the changes would be relatively minimal (mostly affecting TAR25 to TAR30). We're approaching an asymptotically stable set here (at least with the current algorithms). A new run would be nice if you want to run this overnight while you sleep, but it's certainly not critical (and could wait until after the new format is implemented). Headbomb {t · c · p · b} 16:16, 11 March 2019 (UTC)
Current version uploading pre-dates the new exclusions, but you will need more (see User talk:JL-Bot#Citation New Version Live). -- JLaTondre (talk) 21:39, 11 March 2019 (UTC)

When could we expect a new run? I was hoping to review the latest logic with a new dump, and have a final exclusion pass before publishing that Signpost piece before the end of the month. Headbomb {t · c · p · b} 09:12, 24 March 2019 (UTC)

Saving now. -- JLaTondre (talk) 00:54, 26 March 2019 (UTC)
@JLaTondre: I've added a bunch of crapwatch exclusions. If you rerun now, I can submit the Signpost piece for publication. Headbomb {t · c · p · b} 15:13, 26 March 2019 (UTC)
Running. It will take a couple of hours. -- JLaTondre (talk) 21:16, 26 March 2019 (UTC)
Hmmm, seems that when the number of citations is the same, the order is random. I will add an additional sort level (number of articles) to avoid this type of flip flop. -- JLaTondre (talk) 01:09, 27 March 2019 (UTC)
Done. Should be consistent from here on out. -- JLaTondre (talk) 01:26, 27 March 2019 (UTC)
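For illustration, a deterministic ordering of this kind can be expressed as a compound sort key. This is a sketch rather than the bot's code: the Entry type and the sample figures are hypothetical, descending order is assumed, and the final tie-break on name is an extra assumption beyond the article count mentioned above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    citations: int
    articles: int


def sort_entries(entries: list[Entry]) -> list[Entry]:
    """Sort by citations, then articles (both descending), then name."""
    return sorted(entries, key=lambda e: (-e.citations, -e.articles, e.name))


# Made-up sample data: two entries tie on citations and are
# separated by their article counts.
sample = [
    Entry("Journal B", citations=10, articles=4),
    Entry("Journal A", citations=10, articles=7),
    Entry("Journal C", citations=12, articles=1),
]
for entry in sort_entries(sample):
    print(entry.name, entry.citations, entry.articles)
```

With a full key like this, two runs over the same data yield the same order whatever the input order was.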
Cool beans. Things look good, so I've submitted my article to the Signpost. I've updated some notes and I'll be polishing some more stuff, but nothing major that would require reruns. Feel free to do one on the 30th since the Signpost will be published on the 31st, hopefully with my article in it. Headbomb {t · c · p · b} 01:33, 27 March 2019 (UTC)
Actually, now that List of Dove Medical Press academic journals was expanded, there is an opportunity for another run prior to the 'final' one on the 30th. Headbomb {t · c · p · b} 22:51, 27 March 2019 (UTC)
Also with a bunch of additions from updates to Beall's lists, it'd be worth a run. Headbomb {t · c · p · b} 15:26, 29 March 2019 (UTC)

I just finalized the latest exclusions and notes to deal with the Dove journals and latest expansion of the list. One final run before the Signpost publication would take care of every little nitpicky thing. By the time the new dump gets around, the traffic should have died down on the page, and there won't be as much freaking out if we have false positives. Headbomb {t · c · p · b} 04:37, 31 March 2019 (UTC)

Running. Results will be up in a couple hours. -- JLaTondre (talk) 17:45, 31 March 2019 (UTC)

Signpost piece

I'd appreciate some support here (concerning the publication of User:Headbomb/Crapwatch) if you think this is a good initiative. Headbomb {t · c · p · b} 12:10, 28 March 2019 (UTC)

Looks like it is moving forward. -- JLaTondre (talk) 17:48, 31 March 2019 (UTC)

Need help

I tried to set-up JL-Bot, but I'm not sure if it's correct. There are two pages:

I created both pages on 22 March 2019‎, but nothing has appeared. I thought the bot ran once a week. Please correct me if I'm incorrect. Mitchumch (talk) 04:50, 1 April 2019 (UTC)

Nominally once a week. It did not run this weekend due to a conflict. I have run it against those two pages. I'll now have it do a normal run. -- JLaTondre (talk) 22:38, 1 April 2019 (UTC)
It looks good. Thank you. Mitchumch (talk) 23:03, 1 April 2019 (UTC)

Dump finally up

It took a while, but there's finally a useable April dump. Headbomb {t · c · p · b} 09:06, 10 April 2019 (UTC)