User:Trappist the monk/IL2LDR – In-line to list-defined referencing converter

IL2LDR is an AWB script that attempts to convert an article using in-line references to an article using list-defined referencing (LDR). In-line references are scraped out of article text and placed inside the article's {{reflist}} template. The purpose is to make it easier for editors to work with article text without the clutter of the necessary referencing getting in the way.

While it is possible to use this tool to bulk-convert many articles, it is not intended for that purpose. Rather, it is intended to do the bulk of the work necessary when the decision has been taken, following proper consideration, to change an article from in-line to list-defined referencing.

What it does

The first thing the tool does is to look for a {{reflist}} template or a <references /> tag. The tool does not support {{reflist}} templates with the |group= parameter. At the end of the process, the tool will replace the existing {{reflist}} template with {{reflist|refs=...}} where the |refs= parameter lists all of the in-line references scraped from the article text. If an acceptable {{reflist}} template or a <references /> tag can't be found, the tool abandons the edit with a status message.
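
As a rough illustration of this first check, the sketch below runs a few wikitext fragments against the three patterns the script defines (copied from §Script). The class name ReflistCheck and the method HasUsableReflist are invented for this example and do not appear in the tool.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReflistCheck
{
    // Patterns copied from the script's "defines" section.
    const string IS_REFLIST = @"\{\{\s*[Rr]eflist\s*\}\}";
    const string IS_REFLIST_COL = @"\{\{\s*[Rr]eflist\s*\|\s*(?:colwidth\s*=\s*)?(\d+[A-Za-z]*)\}\}";
    const string IS_REFERENCES = @"\<references\s*/\s*\>";

    // Hypothetical helper: true when the text holds a reference list the tool can replace.
    public static bool HasUsableReflist(string text)
    {
        return Regex.IsMatch(text, IS_REFLIST)
            || Regex.IsMatch(text, IS_REFLIST_COL)
            || Regex.IsMatch(text, IS_REFERENCES);
    }

    public static void Main()
    {
        Console.WriteLine(HasUsableReflist("{{reflist}}"));             // True
        Console.WriteLine(HasUsableReflist("{{Reflist|30em}}"));        // True
        Console.WriteLine(HasUsableReflist("<references />"));          // True
        Console.WriteLine(HasUsableReflist("{{reflist|group=note}}"));  // False: |group= is not supported
        Console.WriteLine(HasUsableReflist("{{reflist|refs=...}}"));    // False: already list-defined
    }
}
```

None of the three patterns matches a {{reflist}} that carries |group= or |refs=, which is why such articles are reported as having no {{reflist}}.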

Next follow several steps that make later processing easier and improve style consistency throughout the article:

  1. the tool standardizes the format of the article's <ref>...</ref> tags. It does this by removing extraneous spaces, ordering attributes, and quoting attribute values. Here, the tool does support the group attribute but only so far as to make <ref>...</ref> tags that contain it consistent with those that do not.
  2. HTML and HTML-like tags, except <ref> tags and any other tag whose name begins with r or R, are hidden by replacing the opening < and closing > characters with the specific text strings __0P3N__ and __CL0S3__
  3. vertical format references are flattened to horizontal format
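
The three preparatory steps can be sketched as follows. The regular expressions are taken from §Script, but only two of the many tag-standardizing rules are shown, and the class and method names (Preprocess, TidyRefTags, HideHtml, Flatten) are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Preprocess
{
    // Step 1 (two rules of many): strip stray spaces from plain opening and closing ref tags.
    public static string TidyRefTags(string text)
    {
        text = Regex.Replace(text, @"\<\s*ref\s*\>", "<ref>");
        return Regex.Replace(text, @"\<\s*/\s*ref\s*\>", "</ref>");
    }

    // Step 2: hide tags whose names do not begin with r or R, so <ref> and </ref> stay visible.
    public static string HideHtml(string text)
    {
        text = Regex.Replace(text, @"\<([^/Rr][^\>]*)\>", "__0P3N__$1__CL0S3__");
        return Regex.Replace(text, @"\<(/[^Rr][^\>]*)\>", "__0P3N__$1__CL0S3__");
    }

    // Step 3: remove line breaks inside each <ref>...</ref> pair.
    public static string Flatten(string text)
    {
        return Regex.Replace(text, @"\<ref[^\<]+\</ref\>",
            delegate(Match m) { return m.Value.Replace("\n", "").Replace("\r", ""); });
    }

    public static void Main()
    {
        string text = "Text.< ref >{{cite book\n|title=A<sub>2</sub>\n|isbn=123}}< / ref >";
        Console.WriteLine(Flatten(HideHtml(TidyRefTags(text))));
        // prints: Text.<ref>{{cite book|title=A__0P3N__sub__CL0S3__2__0P3N__/sub__CL0S3__|isbn=123}}</ref>
    }
}
```

The order matters: the flattening pattern stops at the first < it meets, so a reference that contains markup such as <sub> can only be flattened after that markup has been hidden.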

List-defined referencing makes use of reference definitions and reference instances. A reference definition has the form:

<ref name="name">citation templates or other source identification</ref>

where name is unique to that reference. Reference instances have the form:

<ref name="name" />

For unnamed reference definitions, the tool looks at the reference content and, if the content is a cs1|2 template, attempts to extract something that can be used as a unique name. Unique names can be taken from several of the common identifier parameters supported by Citation Style 1. In its current configuration, the tool will extract the parameter value from one of |pmc=, |pmid=, |doi=, or |isbn=, in that order. The tool then adds a name attribute to the <ref> tag using the identifier value: if the tool used the value from |pmc=12345 then the reference tag becomes <ref name="PMC_12345">, and so on. Forward slashes in a |doi= value are replaced with underscores.

If the tool is unable to extract an identifier from the reference content, it creates a name that takes the form __IL2LDR__ddd where ddd is a three-digit number beginning at 001 and increasing by one for every unnamed reference that uses a created name: <ref name="__IL2LDR__001">, etc. After the tool has run successfully, editors should consider renaming definitions and instances that use these automated names because they are contextually meaningless.
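
The naming rules in the two paragraphs above can be condensed into a small sketch. It uses the same identifier pattern as the script, but loops over the identifiers instead of repeating the code four times; RefNamer, NameFor, and the counter argument (a stand-in for the script's G_counter) are names assumed for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RefNamer
{
    // Identifier parameters in the order the tool tries them.
    static readonly string[] Ids = { "pmc", "pmid", "doi", "isbn" };

    // Return a name for the content of an unnamed reference.
    // 'counter' is incremented only when a placeholder name has to be created.
    public static string NameFor(string content, ref int counter)
    {
        foreach (string id in Ids)
        {
            string upper = id.ToUpperInvariant();
            string pattern = @"\|\s*(?:" + id + "|" + upper + @")\s*=\s*([^\s\|\}]+)";
            Match m = Regex.Match(content, pattern);
            if (m.Success)
            {
                string value = m.Groups[1].Value;
                if (id == "doi") value = value.Replace("/", "_");   // slashes in DOIs become underscores
                return upper + "_" + value;
            }
        }
        counter++;
        return "__IL2LDR__" + counter.ToString("D3");               // three-digit placeholder name
    }

    public static void Main()
    {
        int counter = 0;
        Console.WriteLine(NameFor("{{cite journal |doi=10.1000/xyz |pmid=67890}}", ref counter)); // PMID_67890
        Console.WriteLine(NameFor("{{cite journal |doi=10.1000/xyz}}", ref counter));             // DOI_10.1000_xyz
        Console.WriteLine(NameFor("{{cite web |url=http://example.org}}", ref counter));          // __IL2LDR__001
    }
}
```

The first call shows the precedence: the citation has both |doi= and |pmid=, and |pmid= wins because it is tried earlier.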

Once named, reference definitions are moved into a list and sorted by name, leaving behind a reference instance to mark the definition's original location: <ref name="__IL2LDR__001" />. Because reference definitions must be unique, the tool can now check for duplicate definitions. If duplicate definitions are found, the tool abandons the edit with a status message. If definitions couldn't be moved into the list, the tool abandons the edit with a status message.
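
A minimal sketch of the move-and-check step, assuming the text has already been standardized and flattened. The pattern is the script's IS_NAMED_REF_DEF; the class RefMover, the method Move, and the out parameter used in place of the script's G_mailbox are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RefMover
{
    const string IS_NAMED_REF_DEF = @"\<ref\sname=""([^>/]+)""\>(.*?)\</ref\>";

    // Move each named definition into 'list', leaving <ref name="..." /> behind.
    // 'duplicate' receives the most recently seen repeated name, or null if there was none.
    public static string Move(string text, SortedDictionary<string, string> list, out string duplicate)
    {
        string dup = null;      // anonymous methods cannot capture an out parameter, so use a local
        string result = Regex.Replace(text, IS_NAMED_REF_DEF,
            delegate(Match m)
            {
                string name = m.Groups[1].Value;
                if (list.ContainsKey(name))
                {
                    dup = name;
                    return m.Value;                         // leave the duplicate definition in place
                }
                list.Add(name, m.Groups[2].Value);
                return "<ref name=\"" + name + "\" />";     // instance marks the original location
            });
        duplicate = dup;
        return result;
    }

    public static void Main()
    {
        SortedDictionary<string, string> list = new SortedDictionary<string, string>();
        string dup;
        string text = Move("A.<ref name=\"b\">B</ref> C.<ref name=\"a\">A</ref> D.<ref name=\"b\" />", list, out dup);
        Console.WriteLine(text);                            // A.<ref name="b" /> C.<ref name="a" /> D.<ref name="b" />
        foreach (string key in list.Keys) Console.WriteLine(key);   // a, then b
        Console.WriteLine(dup == null);                     // True
    }
}
```

Existing reference instances such as <ref name="b" /> are not matched by the definition pattern, so they pass through unchanged.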

With a complete list of reference definitions, the tool replaces the {{reflist}} with {{reflist|refs= ...}} where ... is the list of reference definitions, each on its own line, in name="name" ascending order.

The last step is to restore the hidden HTML and HTML-like tags.
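
The last two steps might look like the sketch below, which mirrors the layout the script produces (each definition is preceded and followed by a newline). ReflistWriter, Build, and Unhide are names invented for this example, and the use of StringBuilder is a choice made here; the script concatenates strings instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ReflistWriter
{
    // Build {{reflist |refs=...}} from definitions that are already sorted by name.
    public static string Build(SortedDictionary<string, string> definitions)
    {
        StringBuilder sb = new StringBuilder("{{reflist |refs=");
        foreach (KeyValuePair<string, string> pair in definitions)
        {
            sb.Append("\n<ref name=\"").Append(pair.Key).Append("\">");
            sb.Append(pair.Value).Append("</ref>\n");
        }
        return sb.Append("}}\n").ToString();
    }

    // Restore the tags that were hidden before processing.
    public static string Unhide(string text)
    {
        return text.Replace("__0P3N__", "<").Replace("__CL0S3__", ">");
    }

    public static void Main()
    {
        SortedDictionary<string, string> definitions = new SortedDictionary<string, string>();
        definitions.Add("PMID_67890", "{{cite journal |title=H__0P3N__sub__CL0S3__2__0P3N__/sub__CL0S3__O |pmid=67890}}");
        definitions.Add("DOI_10.1000_xyz", "{{cite journal |doi=10.1000/xyz}}");
        Console.Write(Unhide(Build(definitions)));   // the DOI_ entry is written before the PMID_ entry
    }
}
```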

Status messages

The tool emits several status messages that are available in AWB's Logs tab. There is one success message:

Converted in-line references to list-defined references – the tool thinks that it was able to move all references from article text into the {{reflist}} template. Editors should, of course, review the results before clicking the Save button.

When things don't go quite right, the tool reports that and abandons the edit. In some cases the article is skipped; in others the incomplete edit is shown so that editors can find and fix the problem in the source. Incomplete edits should not be saved.

article has no {{reflist}} – the tool requires a {{reflist}} template, with or without column parameters, or <references /> tag; the tool does not support {{reflist|group=...}} nor does it support articles that already have a list-defined referencing structure ({{reflist|refs=...}})
duplicate ref name: <name> – all reference names must be unique; this message identifies one that is not

The code also contains messages that, because of later improvements, should no longer be displayed. If you encounter any of these messages, please report them.

unable to name all unnamed references – the tool was unable to extract any of the cs1|2 identifiers |pmc=, |pmid=, |isbn=, and |doi= as unique names from the reference content for use as a name; this message also occurs when the reference content contains HTML markup; because of automatic name creation and because HTML is hidden, we should never see this message
unable to move reference: <name> – the tool was unable to move one or more reference definitions; because vertical format citations are flattened to horizontal format, we should never see this message
no references to move – all named references are in vertical format; because vertical format citations are flattened to horizontal format, we should never see this message

How to make it work

  1. Start AWB
  2. Log in, load your default settings file or manually set AWB's settings – because the changes that the tool makes are substantial, when first using this tool editors should limit AWB functionality to just this tool
  3. From the AWB Tools menu select Make module
  4. In Module make sure that the Enabled checkbox is checked
  5. In Module make sure that C# 2.0 is selected in the dropdown box
  6. Come back to this page; copy everything from the shaded box in §Script to your clipboard
  7. In Module, replace the content of the large text box with the content of your clipboard
  8. In Module, click the Make module button. If the module makes, you should see Module compiled and loaded in green and the current time under the dropdown box. If the module didn't make, make sure that the whole of §Script replaced the whole of the default content of the large text box.
  9. Close Module
  10. Add the page or pages to be edited to the page list
  11. Click start
  12. If the tool reports problems, fix them outside of AWB and try again

Script

// Attempt to convert in-line references to list defined reference (LDR) format.
//
// 1. if page doesn't have a recognized {{reflist}} or <references/> then quit.  We need to replace that with the
// {{reflist|refs=...}}
//
// 2. standardize ref tag form for appearance and consistency
//
// 3. hide html tags
//
// 4. flatten vertical format citations
//
// 5. if there are nameless references (<ref>...</ref>) that can be named from identifiers present in a cs1|2 citation,
//		 |pmc=, |pmid=, |doi=, etc. then make a named ref tag using the identifier value: <ref name="DOI_...">...</ref>
//
// 6. add named references to a sorted dictionary where the name from <ref name="name"> is the key and everything between
//		the ref tags is the value.  At the same time replace the named ref definition with <ref name="name" />.  Also check
//		for duplicate definitions (same key).  If we try to add a duplicate to the dictionary, set the global G_mailbox to
//		an error message.
//
// 7. if we set the global G_mailbox then add it to the summary message, set Skip to true and abandon this edit
//
// 8. check to see if there is anything in our dictionary.  If nothing there, add a message to the summary message, set Skip
//		to true and abandon this edit
//
// 9. replace the reflist template with {{reflist |refs= ... }} where ... is the list of key/value pairs from the
//		dictionary reformatted as <ref name="key">value</ref>, each on its own line, in key alpha ascending order
//
// 10. unhide html
//

// known weaknesses:
// 1. for doi identifiers, any double quotes should be replaced, perhaps with an underscore
// 2. needs support for more identifiers?
// 3. does not support reference groups

// status messages and what they mean:
//
// Status messages are available in AWB's Logs tab.  There is one success message:
//	Converted in-line references to list defined references – the tool thinks that it was able to move all references from article text into the {{reflist}} template
//
// If any of these messages are emitted, the edit has been abandoned; if a name is provided, that name may or may not exist in the
// article text.  Names that the tool creates from cs1|2 identifiers prefix the identifier's assigned value
// with an uppercase version of the identifier name: DOI_, PMC_, etc.:
//	1. article has no {{reflist}} – the tool requires {{reflist}} with or without column parameters or <references />; the tool does not support {{reflist |group=...}}
//	2. unable to name all unnamed references – the tool was unable to extract any of the cs1|2 identifiers |pmc=, |pmid=, |isbn=, and |doi=
//		as unique names from the reference content for use as a name; this message also occurs when the reference content contains html markup
//	3. duplicate ref name: <name> – all reference names must be unique; this message identifies one that is not
//	4. unable to move reference: <name> – with flattening, should we ever see this message? – the tool cannot move references
//		that are intentionally placed on multiple lines; distinguish from line-wrap at the edge of the editing window
//	5. no references to move – with flattening, should we ever see this message? – all named references are in vertical format

//---------------------------< G L O B A L S >----------------------------------------------------------------

string G_mailbox = "";	// used only if we can't add a reference to the dictionary; must start as "" (not null) or the "" != G_mailbox test skips the first page
int G_counter = 0;		// used for names the tool makes up

//--------------------------< P R O C E S S _ A R T I C L E >-------------------------------------------------

public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
	{
	Skip = false;
	Summary = "Converted in-line references to list-defined references";
	
//---------------------------< D E F I N E S >----------------------------------------------------------------

	string IS_REFLIST = @"\{\{\s*[Rr]eflist\s*\}\}";		// {{reflist}}
	string IS_REFLIST_COL = @"\{\{\s*[Rr]eflist\s*\|\s*(?:colwidth\s*=\s*)?(\d+[A-Za-z]*)\}\}";		// {{reflist |30em}} $1 is column width
	string IS_REFERENCES = @"\<references\s*/\s*\>";		// <references />
	string IS_NAMELESS_REF = @"\<ref\>";					// <ref>
	string IS_NAMED_REF_DEF = @"\<ref\sname=""([^>/]+)""\>(.*?)\</ref\>";			// $1 is ref name; $2 is the reference content
	
// IDENTIFIERS:
// find identifier parameters in cs1|2 templates; these have two captures: $1 is ref tag content left of the last identifier value character; $2 is the identifier value
	string IS_DOI = IS_NAMELESS_REF + @"([^\}]*\|\s*(?:doi|DOI)\s*=\s*([^\s\|\}]+)[^\<]*\</ref>)";
	string IS_ISBN = IS_NAMELESS_REF + @"([^\}]*\|\s*(?:isbn|ISBN)\s*=\s*([\d-]+X?)[^\|\}]*)";
	string IS_PMC = IS_NAMELESS_REF + @"([^\}]*\|\s*(?:pmc|PMC)\s*=\s*(\d+)[^\|\}]*)";
	string IS_PMID = IS_NAMELESS_REF + @"([^\}]*\|\s*(?:pmid|PMID)\s*=\s*(\d+)[^\|\}]*)";

//---------------------------< D I C T I O N A R Y >----------------------------------------------------------

	SortedDictionary<string, string> reference_list = new SortedDictionary<string, string>();
	Match regex_match;

//---------------------------< R E F L I S T >----------------------------------------------------------------

//REFLIST: be sure that there is a {{reflist}} or <references /> template to replace;
	if (!Regex.Match (ArticleText, IS_REFLIST).Success && !Regex.Match (ArticleText, IS_REFLIST_COL).Success && !Regex.Match (ArticleText, IS_REFERENCES).Success)
		{
		Skip = true;											// no reflist so skip this page
		Summary = "article has no {{reflist}}";				// but say why we skipped
		return ArticleText;
		}

// convert <references /> tag to {{reflist}} before we hide html (and html-like) tags
	ArticleText = Regex.Replace(ArticleText, IS_REFERENCES, "{{reflist}}");

//---------------------------< S T A N D A R D I Z E   R E F   T A G S >--------------------------------------
// standardize the form of ref tags both for appearance and because it simplifies later code
//DEFINITIONS:	
// simple ref tag
	ArticleText = Regex.Replace(ArticleText, @"\<\s*ref\s*\>", "<ref>");
// closing ref tag
	ArticleText = Regex.Replace(ArticleText, @"\<\s*/\s*ref\s*\>", "</ref>");

//quoted group and quoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*""([^>/]+)""\s*name\s*=\s*""([^>/]+)""\s*\>", "<ref group=\"$1\" name=\"$2\">");
//quoted name and quoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*""([^>/]+)""\s*group\s*=\s*""([^>/]+)""\s*\>", "<ref group=\"$2\" name=\"$1\">");
//quoted group and unquoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*""([^>/]+)""\s*name\s*=\s*([^>""/]+\b)\s*\>", "<ref group=\"$1\" name=\"$2\">");
//quoted name and unquoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*""([^>/]+)""\s*group\s*=\s*([^>""/]+\b)\s*\>", "<ref group=\"$2\" name=\"$1\">");
// unquoted group and quoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*([^>""/]+)\s+name\s*=\s*""([^>/]+)""\s*\>", "<ref group=\"$1\" name=\"$2\">");
// unquoted name and quoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*([^>""/]+)\s+group\s*=\s*""([^>/]+)""\s*\>", "<ref group=\"$2\" name=\"$1\">");
// unquoted group and name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*([^>/]+)\s+name\s*=\s*([^>/]+\b)\s*\>", "<ref group=\"$1\" name=\"$2\">");
// unquoted name and group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*([^>/]+)\s+group\s*=\s*([^>/]+\b)\s*\>", "<ref group=\"$2\" name=\"$1\">");
// quoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*""([^>/]+)""\s*\>", "<ref group=\"$1\">");
// quoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*""([^>/]+)""\s*\>", "<ref name=\"$1\">");
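// unquoted group (lazy match so that spaces before the closing > are not captured into the value)
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*([^>""/]+?)\s*\>", "<ref group=\"$1\">");
// unquoted name (lazy match so that spaces before the closing > are not captured into the value)
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*([^>""/]+?)\s*\>", "<ref name=\"$1\">");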


// INSTANCES:
//quoted group and quoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*""([^>/]+)""\s*name\s*=\s*""([^>/]+)""\s*/\s*\>", "<ref group=\"$1\" name=\"$2\" />");
//quoted name and quoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*""([^>/]+)""\s*group\s*=\s*""([^>/]+)""\s*/\s*\>", "<ref group=\"$2\" name=\"$1\" />");
//quoted group and unquoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*""([^>/]+)""\s*name\s*=\s*([^>""/]+\b)\s*/\s*\>", "<ref group=\"$1\" name=\"$2\" />");
//quoted name and unquoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*""([^>/]+)""\s*group\s*=\s*([^>""/]+\b)\s*/\s*\>", "<ref group=\"$2\" name=\"$1\" />");
// unquoted group and quoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*([^>""/]+)\s+name\s*=\s*""([^>/]+)""\s*/\s*\>", "<ref group=\"$1\" name=\"$2\" />");
// unquoted name and quoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*([^>""/]+)\s+group\s*=\s*""([^>/]+)""\s*/\s*\>", "<ref group=\"$2\" name=\"$1\" />");
// unquoted group and name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*([^>/]+)\s+name\s*=\s*([^>/]+\b)\s*/\s*\>", "<ref group=\"$1\" name=\"$2\" />");
// unquoted name and group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*([^>/]+)\s+group\s*=\s*([^>/]+\b)\s*/\s*\>", "<ref group=\"$2\" name=\"$1\" />");
// quoted group
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*""([^>/]+)""\s*/\s*\>", "<ref group=\"$1\" />");
// quoted name
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*""([^>/]+)""\s*/\s*\>", "<ref name=\"$1\" />");
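// unquoted group (lazy match so that spaces before the closing /> are not captured into the value)
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+group\s*=\s*([^>""/]+?)\s*/\s*\>", "<ref group=\"$1\" />");
// unquoted name (lazy match so that spaces before the closing /> are not captured into the value)
	ArticleText = Regex.Replace(ArticleText, @"<\s*ref\s+name\s*=\s*([^>""/]+?)\s*/\s*\>", "<ref name=\"$1\" />");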


//---------------------------< H I D E >----------------------------------------------------------------------
// HIDE HTML: find html opening tags that are not <ref>; replace the opening < with __0P3N__ and the closing > with __CL0S3__
	while (Regex.Match (ArticleText, @"\<([^/Rr][^\>]*)\>").Success)
		{
		ArticleText = Regex.Replace(ArticleText, @"\<([^/Rr][^\>]*)\>", "__0P3N__$1__CL0S3__");
		}

// HIDE HTML: find html closing tags that are not </ref>; replace the opening < with __0P3N__ and the closing > with __CL0S3__
	while (Regex.Match (ArticleText, @"\<(/[^Rr][^\>]*)\>").Success)
		{
		ArticleText = Regex.Replace(ArticleText, @"\<(/[^Rr][^\>]*)\>", "__0P3N__$1__CL0S3__");
		}
	
//---------------------------< F L A T T E N   R E F E R E N C E S >------------------------------------------
// So that the article wikitext has a consistent look, all vertical format citations are flattened to horizontal format.
	ArticleText = Regex.Replace(ArticleText, @"\<ref[^\<]+\</ref\>",
		delegate(Match match)
			{
			string reference = match.Groups[0].Value.Replace("\n", "").Replace("\r", "");
			return reference;
			});


//---------------------------< N A M E L E S S _ R E F S >----------------------------------------------------
// give names to nameless references using available identifiers; use simpler more readable first.  If that isn't
// possible, assign what is hoped to be a unique name __IL2LDR__xxx where xxx is an incrementing three-digit
// number that begins at 001.

	ArticleText = Regex.Replace(ArticleText, @"\<ref\>([^\<]+)\</ref\>",
		delegate(Match match)
			{
			string content = match.Groups[1].Value;							// 1 - <ref> content </ref>
			string name;
			string base_name = @"__IL2LDR__";
			
//PMC: if nameless reference has |pmc=#### name the reference <ref name=PMC_####>
			Match identifier = Regex.Match (content, @"\|\s*(?:pmc|PMC)\s*=\s*([^\s\|\}]+)");
			if (identifier.Success)
				{
				name = identifier.Groups[1].Value;
				return @"<ref name=" + @"""PMC_" + name.Trim() + @""">" + content +@"</ref>";
				}
//PMID: if nameless reference has |pmid=#### name the reference <ref name=PMID_####>
			identifier = Regex.Match (content, @"\|\s*(?:pmid|PMID)\s*=\s*([^\s\|\}]+)");
			if (identifier.Success)
				{
				name = identifier.Groups[1].Value;
				return @"<ref name=" + @"""PMID_" + name.Trim() + @""">" + content +@"</ref>";
				}
//DOI: if nameless reference has |doi=#### name the reference <ref name=DOI_####> after replacing the forward slash(es) with underscores
			identifier = Regex.Match (content, @"\|\s*(?:doi|DOI)\s*=\s*([^\s\|\}]+)");
			if (identifier.Success)
				{
				name = identifier.Groups[1].Value.Replace("/", "_");		// replace forward slash in doi identifier with underscore
				return @"<ref name=" + @"""DOI_" + name.Trim() + @""">" + content +@"</ref>";
				}
//ISBN: if nameless reference has |isbn=#### name the reference <ref name=ISBN_####>
			identifier = Regex.Match (content, @"\|\s*(?:isbn|ISBN)\s*=\s*([^\s\|\}]+)");
			if (identifier.Success)
				{
				name = identifier.Groups[1].Value;
				return @"<ref name=" + @"""ISBN_" + name.Trim() + @""">" + content +@"</ref>";
				}
			
			G_counter ++;
			return @"<ref name=""" + base_name + G_counter.ToString("D3") + @""">" + content +@"</ref>";
			});

	G_counter = 0;															// reset because it is remembered page to page
	
	regex_match = Regex.Match (ArticleText, @"<ref\>");
	if (regex_match.Success)												// did we name all unnamed refs?
		{
		Summary = "unable to name all unnamed references";					// no, emit error message
		Skip = true;														// and abandon the edit
		return ArticleText;
		}
			

//---------------------------< A D D   T O   D I C T I O N A R Y >--------------------------------------------
// add named reference definitions (everything between <ref name="..."> and </ref>) to a dictionary; replace
// reference definition with closed ref tag (<ref name="..." />)

	ArticleText = Regex.Replace(ArticleText, IS_NAMED_REF_DEF,
		delegate(Match match)
			{
			string raw_match = match.Groups[0].Value;
			string name = match.Groups[1].Value;							// 1 - ref definition name
			string definition = match.Groups[2].Value;						// 2 - ref definition
			
			if (!reference_list.ContainsKey(name))
				{
				reference_list.Add(name, definition);						// key not found so add to reference list dictionary
				return @"<ref name=" + @"""" + name + @""" />";
				}
			else
				G_mailbox = "duplicate ref name: " + name;					// add message to "global" mailbox that there is a duplicate reference with this name
			return raw_match;
			});
	
	if ("" != G_mailbox)
		{
		Summary = G_mailbox;												// change summary message
		G_mailbox = "";														// set to empty string for the next article
		Skip = true;														// and abandon this edit
		return ArticleText;
		}

	regex_match = Regex.Match (ArticleText, @"<ref\sname=""([^>/]+)""\>");
	if (regex_match.Success)												// did we move all named refs to the dictionary?
		{
		Summary = "unable to move reference: " + regex_match.Groups[1].Value;	// no, emit error message
//		Skip = true;						// don't skip so we can find references that the tool named		// and abandon the edit
		return ArticleText;
		}

//---------------------------< W R I T E   L D R   R E F L I S T >--------------------------------------------
// make sure that there is something in the dictionary to move; this is possible when all named references occupy
// multiple text lines (vertical format)

	if (1 > reference_list.Count)											// if nothing in the dictionary then we're done
		{
		Summary = "no references to move";									// add message to summary
		Skip = true;														// and abandon this edit
		return ArticleText;
		}

// replace {{reflist}} with {{reflist|refs=...}} where ... are the reference definitions contained in the reference_list dictionary
	ArticleText = Regex.Replace(ArticleText, IS_REFLIST,
		delegate(Match match)
			{
			string list = "";

			foreach (KeyValuePair<string, string> pair in reference_list)
				{
				list = list + '\n' + @"<ref name=" + @"""" + pair.Key + @""">" + pair.Value + @"</ref>" + '\n';
				}
		
			return @"{{reflist |refs=" + list +@"}}" + '\n';
			});

// replace {{reflist|30em}} with {{reflist|30em|refs=...}} where ... are the reference definitions contained in the reference_list dictionary
	ArticleText = Regex.Replace(ArticleText, IS_REFLIST_COL,
		delegate(Match match)
			{
			string list = "";
			
			foreach (KeyValuePair<string, string> pair in reference_list)
				{
				list = list + '\n' + @"<ref name=" + @"""" + pair.Key + @""">" + pair.Value + @"</ref>" + '\n';
				}
			
			return @"{{reflist |" + match.Groups[1].Value + @" |refs=" + list +@"}}" + '\n';
			});

// replace <references /> with {{reflist|refs=...}} where ... are the reference definitions contained in the reference_list dictionary
	ArticleText = Regex.Replace(ArticleText, IS_REFERENCES,
		delegate(Match match)
			{
			string list = "";

			foreach (KeyValuePair<string, string> pair in reference_list)
				{
				list = list + '\n' + @"<ref name=" + @"""" + pair.Key + @""">" + pair.Value + @"</ref>" + '\n';
				}
		
			return @"{{reflist |refs=" + list +@"}}" + '\n';
			});

//---------------------------< U N H I D E >------------------------------------------------------------------

// UNHIDE: replace __0P3N__ with <
	ArticleText = Regex.Replace(ArticleText, @"__0P3N__", "<");

// UNHIDE: replace __CL0S3__ with >
	ArticleText = Regex.Replace(ArticleText, @"__CL0S3__", ">");

	return ArticleText;
	}