User talk:The Transhumanist/OutlineDedupeHolding.js

This is the workshop support page for the user script OutlineDedupeHolding.js. Comments and requests concerning the program are most welcome. Please post discussion threads below the section titled Discussions. Thank you. By the way, the various scripts I have written are listed at the bottom of the page.
This script is under development, and is not yet functional

When completed, this script will remove duplicate list items from a section (e.g., See also, holding bin, general concepts, list section, etc.). That is, it will remove from the current section all topics that exist anywhere else in the body of the page (not in templates).

Script's workshop

This is the work area for developing the script and its documentation. The talk page portion of this page starts at #Discussions, below.

Description / instruction manual for OutlineDedupeHolding.js

This is useful for culling lists of links gathered to a holding area (in an outline) awaiting placement into the outline. It can also help cleaning up the General concepts sections, List sections, and See also sections of outlines.

How to install this script

Important: this script was developed for use with the Vector skin (it's Wikipedia's default skin), and might not work with other skins. See the top of your Preferences appearance page, to be sure Vector is the chosen skin for your account.

To install this script, add this line to your vector.js page:

importScript("User:The Transhumanist/OutlineDedupeHolding.js");

Save the page and bypass your cache to make sure the changes take effect. By the way, only logged-in users can install scripts.

Explanatory notes (source code walk-through)

This section explains the source code, in detail. It is for JavaScript programmers, and for those who want to learn how to program in JavaScript. Hopefully, this will enable you to adapt existing source code into new user scripts with greater ease, and perhaps even compose user scripts from scratch.

You can only use so many comments in the source code before you start to choke or bury the programming itself. So, I've put short summaries in the source code, and have provided in-depth explanations here.

My intention is threefold:

  1. to thoroughly document the script so that even relatively new JavaScript programmers can understand what it does and how it works, including the underlying programming conventions. This is so that the components and approaches can be modified, or used again and again elsewhere, with confidence. (I often build scripts by copying and pasting code that I don't fully understand, which often leads to getting stuck.) To prevent getting stuck, the notes below include extensive interpretations, explanations, instructions, examples, and links to relevant documentation and tutorials. Hopefully, this will help both you and me grok the source code and the language it is written in (JavaScript).
  2. to refresh my memory of exactly how the script works, in case I don't look at the source code for weeks or months.
  3. to document my understanding, so that it can be corrected. If you see that I have a misconception about something, please let me know!

In addition to plain vanilla JavaScript code, this script relies heavily on the jQuery library.

If you have any comments or questions, feel free to post them at the bottom of this page under Discussions. Be sure to {{ping}} me when you do.

General approach

(general approach goes here)

More specifically, starting at the beginning...

Aliases

An alias is one string defined to mean another. Another term for "alias" is "shortcut". In the script, the following aliases are used:

$ is the alias for jQuery (the jQuery library)

mw is the alias for mediaWiki (the MediaWiki library)

These two aliases are set up like this:

( function ( mw, $ ) {}( mediaWiki, jQuery ) );

That also happens to be a "bodyguard function", which is explained in the section below...

Bodyguard function

The bodyguard function assigns an alias for a name within the function, and reserves that alias for that purpose only. For example, you could use it if you want "t" to be interpreted only as "transhumanist".

Since the script uses jQuery, we want to defend jQuery's alias, the "$". The bodyguard function makes it so that "$" means only "jQuery" inside the function, even if it means something else outside the function. That is, it prevents other JavaScript libraries from overwriting the $() shortcut for jQuery within the function. It does this via scoping: the function's parameter named "$" shadows any global "$".

The bodyguard function is used like a wrapper, with the alias-containing source code inside it, typically, wrapping the whole rest of the script. Here's what a jQuery bodyguard function looks like:

( function($) {
    // you put the body of the script here
} ) ( jQuery );

See also: bodyguard function solution.

To extend that to lock in "mw" to mean "mediaWiki", use the following (this is what the script uses):

( function(mw, $) {
    // you put the body of the script here
} ) (mediaWiki, jQuery);

For the best explanation of the bodyguard function I've found so far, see: Solving "$(document).ready is not a function" and other problems   (Long live Spartacus!)
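The scoping effect can be seen in a small sketch that runs outside a browser. The names fakeJQuery and otherLibrary are stand-ins invented for this illustration; neither is the real jQuery, and the real script passes the actual jQuery object instead.

```javascript
// Stand-ins for illustration only; neither is the real jQuery.
var fakeJQuery = function ( selector ) {
    return 'jQuery selected: ' + selector;
};
var otherLibrary = function ( selector ) {
    return 'other library selected: ' + selector;
};

// Suppose another library has claimed the global "$".
var $ = otherLibrary;

// Inside the wrapper, the parameter "$" shadows the global "$".
var insideResult = ( function ( $ ) {
    return $( '#content' );
}( fakeJQuery ) );

// Outside the wrapper, "$" still refers to the other library.
var outsideResult = $( '#content' );

console.log( insideResult );
console.log( outsideResult );
```

The wrapper never touches the outer "$"; it only gives the code inside it a private name for whatever was passed in.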

The ready() event listener/handler

The ready() event listener/handler makes the rest of the script wait until the page (and its DOM) is loaded and ready to be worked on. If the script tries to do its thing before the page is loaded, there won't be anything there for the script to work on, and the script will fail. For example, a script that adds a menu item with mw.util.addPortletLink would have nowhere to place it.

In jQuery, it looks like this: $( document ).ready(function() {});

You can do that in jQuery shorthand, like this:

$().ready( function() {} );

Or even like this:

$(function() {});

The part of the script that is being made to wait goes inside the curly brackets. But you would generally start that on the next line, and put the ending curly bracket, closing parenthesis, and semicolon on a line of their own, like this:

$(function() {
    // Body of function (or even the rest of the script) goes here, such as a click handler.
});

This is all explained further at the jQuery page for .ready()

For the plain vanilla version see: http://docs.jquery.com/Tutorials:Introducing_$(document).ready()
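For comparison, here is a rough sketch of waiting for the DOM without jQuery. It relies on the standard document.readyState property and the DOMContentLoaded event. The helper name onDomReady is hypothetical, and it takes the document as a parameter only so that the sketch can be tried outside a browser; the script itself uses the jQuery form shown above.

```javascript
// Hypothetical helper: run "callback" once the DOM of "doc" is ready.
function onDomReady( doc, callback ) {
    if ( doc.readyState === 'loading' ) {
        // Still parsing: wait for the standard DOMContentLoaded event.
        doc.addEventListener( 'DOMContentLoaded', function () {
            callback();
        } );
    } else {
        // "interactive" or "complete": the DOM is already available.
        callback();
    }
}

// In a browser, it would be used like this:
if ( typeof document !== 'undefined' ) {
    onDomReady( document, function () {
        // Body of the script goes here.
    } );
}
```

The readyState check matters because a user script can load after the page has finished parsing, in which case DOMContentLoaded has already fired and would never trigger the callback.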

Only activate for vector skin

Initially, each script I write is made to work only on the Vector skin, the skin under which I developed it and, by default, the only skin on which it is initially tested. To limit the script to working for Vector only, I use the following if control structure:

if ( mw.config.get( 'skin' ) === 'vector' ) {
}

To test it with another skin, remove or comment out the above code from the script.
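As an alternative to commenting the check out, the skin test could be driven by a list. This is only a sketch: SUPPORTED_SKINS and isSupportedSkin are hypothetical names, not part of the script, and the only MediaWiki call assumed is mw.config.get( 'skin' ), which is already used above.

```javascript
// Hypothetical allow-list; add a skin name here once the script has been tested with it.
var SUPPORTED_SKINS = [ 'vector' ];

// "config" is expected to behave like mw.config (it needs a get() method).
function isSupportedSkin( config ) {
    return SUPPORTED_SKINS.indexOf( config.get( 'skin' ) ) !== -1;
}

// On a wiki page the global mediaWiki object exists; elsewhere this block is skipped.
if ( typeof mediaWiki !== 'undefined' && isSupportedSkin( mediaWiki.config ) ) {
    // Body of the script goes here.
}
```

Testing another skin then means adding its name to the list rather than editing the control structure.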

Change log for OutlineDedupeHolding.js

Task list

Bug reports

Desired/completed features

Completed features are marked with   Done
  • Check for existence of holding sections
  • Check each holding section itself for duplicates, and remove them
  • Remove duplicates in holding sections that are in rest of outline
  • Remove empty holding sections

Development notes for OutlineDedupeHolding.js

Holding sections

I was discussing using "Place these" as the name of a holding section, but that would violate SRTA. A subheading under See also called "Other" would be less obtrusive, but it might already exist elsewhere in the outline; I'll deal with that when I encounter it.

This script will run on several holding sections:

  • Other (under See also)
  • See also
  • General concepts
    • General concepts
    • General "subject name" concepts
  • Lists
    • Lists
    • "Subject name" lists

It should remove a holding section if it is empty.
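A rough sketch of the "remove if empty" step, working on the page's wikitext, might look like the following. It is not the script's code: the function name removeEmptyHoldingSections, the holdingTitles parameter, and the rule that a section counts as empty when it holds only blank lines and no subsections are all assumptions made for this illustration.

```javascript
// Matches level 2 to 6 wikitext headings such as "== See also ==" or "=== Other ===".
var HEADING_PATTERN = /^(={2,6})\s*(.+?)\s*\1\s*$/;

function isBlank( line ) {
    return line.trim() === '';
}

// Removes every section named in holdingTitles that has no content and no subsections.
function removeEmptyHoldingSections( wikitext, holdingTitles ) {
    var lines = wikitext.split( '\n' );
    var changed = true;
    // Repeat, because removing an empty subsection can leave its parent empty.
    while ( changed ) {
        changed = false;
        for ( var i = 0; i < lines.length; i++ ) {
            var heading = HEADING_PATTERN.exec( lines[ i ] );
            if ( !heading || holdingTitles.indexOf( heading[ 2 ] ) === -1 ) {
                continue;
            }
            // Find the next heading of any level (or the end of the page).
            var end = i + 1;
            while ( end < lines.length && !HEADING_PATTERN.test( lines[ end ] ) ) {
                end++;
            }
            var next = end < lines.length ? HEADING_PATTERN.exec( lines[ end ] ) : null;
            var hasSubsection = next !== null && next[ 1 ].length > heading[ 1 ].length;
            var bodyIsBlank = lines.slice( i + 1, end ).every( isBlank );
            if ( bodyIsBlank && !hasSubsection ) {
                lines.splice( i, end - i ); // drop the heading and its blank lines
                changed = true;
                break;
            }
        }
    }
    return lines.join( '\n' );
}
```

For example, calling removeEmptyHoldingSections( text, [ 'See also', 'Other' ] ) on a page whose "See also" section contains only an empty "Other" subsection would remove "Other" on the first pass and "See also" on the second.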

Relevant scripts

This script should process the see also section, the general concepts section, the list section, and the "Other" section.

See User:Ucucha/duplinks (highlights duplicate links, which means it must find them).

See User:Evad37/duplinks-alt (highlights duplicate links, which means it must find them).

See User talk:The Transhumanist/RedlinksRemover.js (edits an article to delete something - adapt it to delete duplicate list entries that don't have an annotation).

Based on discussion below, RedlinksRemover.js probably has all the technology in it that this script needs: regex applied to removing list items, in a nested loop.
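To make the "regex plus nested loop" idea concrete, here is a minimal sketch that dedupes the list items of a single section. It is not taken from RedlinksRemover.js. The names BARE_LINK_ITEM, linkTarget, and dedupeWithinList are hypothetical, and the title normalization (underscores treated as spaces, first letter capitalized) is a simplification assumed for the example. Items with an annotation after the link are deliberately left alone.

```javascript
// Matches a bulleted item that is a single wikilink and nothing else,
// e.g. "* [[Culture of Chicago]]" or "** [[Chicago|the city]]".
var BARE_LINK_ITEM = /^\*+\s*\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]\s*$/;

// Returns the normalized link target of a bare list item, or null for any other line.
function linkTarget( line ) {
    var match = BARE_LINK_ITEM.exec( line );
    if ( !match ) {
        return null;
    }
    var title = match[ 1 ].replace( /_/g, ' ' ).trim();
    return title.charAt( 0 ).toUpperCase() + title.slice( 1 );
}

// Keeps the first occurrence of each bare link item and drops later repeats.
function dedupeWithinList( lines ) {
    var kept = [];
    for ( var i = 0; i < lines.length; i++ ) {
        var target = linkTarget( lines[ i ] );
        var isDuplicate = false;
        if ( target !== null ) {
            // Inner loop: compare against every line kept so far.
            for ( var j = 0; j < kept.length; j++ ) {
                if ( linkTarget( kept[ j ] ) === target ) {
                    isDuplicate = true;
                    break;
                }
            }
        }
        if ( !isDuplicate ) {
            kept.push( lines[ i ] );
        }
    }
    return kept;
}
```

Because annotated items never match the pattern, they are neither removed nor used as evidence that a bare item is a duplicate; whether that is the desired behavior is a design decision left open here.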

Rough rough talk-through

This conducts semi-automated editing, and therefore needs to be on a menu item. (Should not run by default).
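One possible way to put the script on a menu item is sketched below. It assumes the mediawiki.util module is requested through mw.loader.using before mw.util.addPortletLink is called. The function name addDedupeMenuItem, the link label, and the element id are hypothetical, and the mw object is passed in as a parameter only so the sketch can be exercised outside a wiki page.

```javascript
// "mwObject" is expected to behave like the global mw (mediaWiki) object.
function addDedupeMenuItem( mwObject, onClick ) {
    // mediawiki.util is not guaranteed to be loaded, so request it first.
    return mwObject.loader.using( [ 'mediawiki.util' ] ).then( function () {
        var link = mwObject.util.addPortletLink(
            'p-tb',              // the Tools portlet in the sidebar
            '#',                 // no real destination; the click handler does the work
            'Dedupe holding',    // hypothetical label
            't-dedupe-holding',  // hypothetical element id
            'Remove duplicate list items from the holding section'
        );
        if ( link ) {
            link.addEventListener( 'click', function ( event ) {
                event.preventDefault(); // do not follow the "#" link
                onClick();
            } );
        }
        return link;
    } );
}

// On a wiki page the global mediaWiki object exists; elsewhere this block is skipped.
if ( typeof mediaWiki !== 'undefined' ) {
    addDedupeMenuItem( mediaWiki, function () {
        // The dedupe routine would be called from here.
    } );
}
```

Nothing is edited until the user clicks the menu item, which matches the requirement that the script should not run by default.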

Script dependencies

Discussions

This is where the actual talk page starts for OutlineDedupeHolding.js. Please post your discussion threads below...

Loading dependencies

By the way, do I need to load any dependencies for the following code?

// ============== activation filters ==============
// Only activate on Vector skin
if ( mw.config.get( 'skin' ) === 'vector' ) {

Do all "mw." lines have dependencies? The Transhumanist 13:40, 1 January 2018 (UTC)

mw.config is always available, and contains lots of other useful stuff – see mw:mw.config. Other mw modules, detailed at mw:ResourceLoader/Core modules, aren't necessarily available unless you load them – see mw:ResourceLoader/Migration_guide_(users)#mw.loader for further details. - Evad37 [talk] 14:10, 1 January 2018 (UTC)
Thank you. Reading them now. Also, I've added these links to User:The Transhumanist/Outline of scripts, for future reference. The Transhumanist 22:12, 5 January 2018 (UTC)

Leveraging TrueMatch data mining

[To User:Evad37]

[Referring to TrueMatch] This is very exciting. That's one more step in one of the city outline building approaches that gets sped up...

Step 1: Create city outline using template Template:Outline city   Done
Step 2: Find more links using TrueMatch   Done and/or various ViewAsOutline scripts   Done
Step 3: Transfer links to outline (its holding section) via copy/paste   Done or Send (planned set of scripts)
Step 4: Dedupe the links in outline's holding section, using OutlineDedupeHolding.js (planned script)
Step 5: Use TopicPlacerFromBin.js (planned script) to move the links from the holding section to their final resting places in the outline.
Step 6: Process the outline with RedlinksRemover   Done


Steps 3, 4, and 5 are currently done manually, but 3 (copy/paste) isn't as tedious, so 4 and 5 have priority.

Since deduping is more complicated after links are placed, developing this first makes the most sense.

Example of using the tools so far...

The Outline of Chicago is currently being drafted, using the above steps...

Step 1: See all the redlinks? Those are from the Template:Outline city. Many of the links in the template do not apply to Chicago, and so they turn red, but you never know what all is going to turn red when you first start a city outline, and they are time consuming to remove by hand. So, what we do is populate the outline with all the topics we can find, and then strip out the redlinks in step 6. The RedlinksRemover doesn't remove red entries that have children, it just delinks them. But, when the outline first starts out, most of the redlinks don't have children. If we strip them out too soon, we'll wind up having to type many of them back in when we find children topics for them.

Step 2: Here's where StripSearchSorted with TrueMatch comes in. You do some intitle searches, such as "in Chicago". Increase the limit in the URL to 5,000 to get the maximum results you can at once. That produces the results here: https://en.wikipedia.org/w/index.php?title=Special:Search&limit=5000&offset=0&profile=default&search=intitle%3A%22of+Chicago%22&searchToken=bvc5dp6q7ldd4ayxph2a6ixg

Step 3: We copy and paste them to the "Place these" section in the outline (under See also). We repeat step 2 with further intitle searches (such as "of Chicago") and other gathering methods and send them all to "Place these" until we have all the topics we can find.

Step 4: The problem we have now is that many of the links in the "Place these" section are already in the body of the outline, like Culture of Chicago, Demographics of Chicago, and so on. And links may be duplicated in the "Place these" section itself. Therefore, we need to dedupe (remove the duplicates from) this section. That is, each link in "Place these" that is also found in the body of the outline (not navigation templates), or elsewhere in "Place these", gets removed.

Step 5: In this step, you take each of the topics one-by-one from the "Place these" section, after the duplicates have been removed, and put them into the body of the outline. Currently done by hand.

Step 6: Clean it all up with the RedlinksRemover. This tool is quick and painless – just click on the menu item. Without this tool, it is mind-numbingly tedious.

Design considerations for dedupe

Which brings us to the design of OutlineDedupeHolding.js.

Eventually, this will dedupe more than one section, but its initial version will just process entries in the "Place these" section.

For each item, it needs to check the rest of the outline, excluding templates, and including the rest of "Place these", for a matching entry. If a match is found, that item is deleted from "Place these". If no match, go on to the next item.
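The check described above could be sketched on wikitext roughly as follows. This is an illustration under stated assumptions, not the planned implementation: the function names are hypothetical, only bare link items (a bullet followed by one wikilink and nothing else) are candidates for removal, titles are normalized in a simplified way, and a lookup object stands in for an inner loop. Since the contents of transcluded templates do not appear in a page's wikitext, they are not matched.

```javascript
var HEADING = /^(={2,6})\s*(.+?)\s*\1\s*$/;
var BARE_LINK_ITEM = /^\*+\s*\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]\s*$/;

// Simplified normalization: underscores become spaces, first letter is capitalized.
function normalizeTitle( title ) {
    var t = title.replace( /_/g, ' ' ).trim();
    return t.charAt( 0 ).toUpperCase() + t.slice( 1 );
}

// Removes bare link items from the named holding section when their target
// is linked elsewhere on the page or earlier in the same section.
function dedupeHoldingSection( wikitext, holdingTitle ) {
    var lines = wikitext.split( '\n' );
    var start = -1, end = lines.length, level = 0, i, match;

    // Locate the holding section: from its heading to the next heading
    // of the same or a higher level.
    for ( i = 0; i < lines.length; i++ ) {
        match = HEADING.exec( lines[ i ] );
        if ( !match ) {
            continue;
        }
        if ( start === -1 ) {
            if ( match[ 2 ] === holdingTitle ) {
                start = i;
                level = match[ 1 ].length;
            }
        } else if ( match[ 1 ].length <= level ) {
            end = i;
            break;
        }
    }
    if ( start === -1 ) {
        return wikitext; // no holding section, nothing to do
    }

    // Record every link target that appears outside the holding section.
    var seen = Object.create( null );
    var anyLink = /\[\[([^\]|#]+)/g;
    lines.forEach( function ( line, index ) {
        if ( index > start && index < end ) {
            return;
        }
        var found;
        while ( ( found = anyLink.exec( line ) ) !== null ) {
            seen[ normalizeTitle( found[ 1 ] ) ] = true;
        }
    } );

    // Rebuild the page, skipping duplicate items inside the holding section.
    var result = [];
    for ( i = 0; i < lines.length; i++ ) {
        if ( i > start && i < end ) {
            match = BARE_LINK_ITEM.exec( lines[ i ] );
            if ( match ) {
                var target = normalizeTitle( match[ 1 ] );
                if ( seen[ target ] ) {
                    continue; // already present elsewhere: drop this item
                }
                seen[ target ] = true; // first occurrence within the section
            }
        }
        result.push( lines[ i ] );
    }
    return result.join( '\n' );
}
```

A call such as dedupeHoldingSection( pageWikitext, 'Place these' ) returns the revised wikitext; saving it back to the page is a separate step not covered here.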

My question for you

I think this one may be within my ability level to write. I just need a little guidance...

How would you go about programming it? The Transhumanist 07:37, 22 January 2018 (UTC)

Detecting duplicate links is a problem that has already been solved (for prose): User:Evad37/duplinks-alt. So I would suggest starting from there, and see if you can follow the approach that script takes – but you'll need to adapt it to look at wikitext rather than html, and to actually remove duplicated links. - Evad37 [talk] 08:28, 22 January 2018 (UTC)
Come to think of it, it's not duplicate links that I need to remove, but list items (<li>) with a duplicate link in them.
Hmmm. Wikitext. That's it! Transcluded templates' contents don't show up in an outline's wikitext. And RedlinksRemover.js already strips out entire list items from the wikitext of outlines via regex, and it uses nested loops to do it. This one requires a nested loop solution too, I think. Looping through all the list items in "Place me", applying each as a search string in a nested loop processing all the list items in the rest of the outline, ought to handle the bulk of it. Then use a similar process to remove the duplicates within "Place me" itself (or do this step first). Thank you for the clue I needed. I'll let you know how it turns out. The Transhumanist 12:42, 22 January 2018 (UTC)