Wikipedia talk:Wikipedia Signpost/2014-10-15/Technology report

Discuss this story

First of all, it's great to see progress on making it easier to edit our content using a variety of tools.

That said, I think that it's worth looking more closely at Parsoid. It also provides a well-defined tree structure, but covers basically every aspect of wikitext. It even marks up multi-template content in a way that makes it easy to replace the entire block of templated content.

The DOM structure it provides can be edited by bots, gadgets or external services like content translation (see a list of current users). There is no limitation to manual editing; any method of manipulating HTML will work. A combination of several algorithms (video) is used to avoid dirty diffs (unintended changes in the wikitext).

We are very interested in improving Parsoid further for bots and other uses. Let us know about your needs. You can find us on IRC in #mediawiki-parsoid. -- GWicke (talk) 14:36, 17 October 2014 (UTC)

  • Great to see mwpfh getting some attention; it's been tremendously useful to the code behind SuggestBot. Keep up the great work! Cheers, Nettrom (talk) 14:57, 17 October 2014 (UTC)
  • To add to what GWicke said, Parsoid's express goal is to be a bidirectional converter (wikitext -> html; html -> wikitext), be a clean wikitext roundtripper (wikitext -> html -> wikitext without introducing dirty diffs in unedited portions of wikitext), and also be a semantically identical HTML roundtripper (html -> wikitext -> html -- we don't handle arbitrary HTML yet). But yes, Parsoid and mwpfh provide different representations -- Parsoid's representation is HTML5 with some RDF annotations, whereas mwpfh's representation is probably more along the lines of an Abstract Syntax Tree? However, both are well-structured tree representations which can be manipulated fairly easily depending on what representation is found suitable / useful for the application at hand. SSastry (WMF) (talk) 15:32, 17 October 2014 (UTC)
  • @GWicke: Thanks for your comments. For what it's worth, Parsoid wasn't as usable or stable when work on mwpfh started, but I digress. I'm still a bit unclear on how Parsoid handles multi-level template nesting. I tried to parse "{{foo|{{bar|{{baz|abc=123}}}}}}" and got <span about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,31,null,null],"pi":[[{"k":"1","spc":["","","",""]}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"foo","href":"./Template:Foo"},"params":{"1":{"wt":"{{bar|{{baz|abc=123}}}}"}},"i":0}}]}'></span>, and I'm not sure how I could, say, use this to read the value of the "abc" parameter in {{baz}}. Would I need to use Parsoid again on the value of that "wt" key or am I missing something? Part of mwpfh's usefulness for bots is that the trees it generates have methods for common wikicode manipulation – there are simple functions for adding template parameters and the like, modifying and traversing the tree, etc. As far as I know, Parsoid is focused solely on the parsing aspect and doesn't support this kind of stuff directly, but it raises the question of whether it could be useful as an alternate backend for mwpfh. Would be annoying to have to deal with outsourcing queries from Python to a node.js subprocess, but it could be an interesting experiment. — Earwig talk 17:41, 17 October 2014 (UTC)
  • @The Earwig: Thank you for your response as well! The Parsoid and mwpfh projects did indeed start at around the same time. Back then there were no good parsing options for editing, and it wasn't even clear whether full editing of typical wiki content would be technically feasible. Both projects have independently done pioneering work.
    You bring up a good use case where the usability of the Parsoid DOM is not optimal for bots interested in nested parameters. Parsoid actually supports exposing the templated parameters as HTML (using the 'html' key instead of 'wt'), but this is not currently enabled in production. We should be ready to switch this on in a month or two.
    Generally the idea with Parsoid is to do all the parsing on the server, so that bots don't have to deal with it. The workflow is basically: retrieve HTML from the API, edit it, and send the modified HTML back to the API for saving. Convenient APIs for this workflow (especially the saving part) are being worked on right now. I agree that having a more specialized client-side interface / library for specific tasks like template editing is very useful. Your idea of using Parsoid as a backend for mwpfh sounds very promising to me, and could even expand beyond templates into other content. -- GWicke (talk) 21:35, 17 October 2014 (UTC)
  • I tried mwpfh once, and I must say I was disappointed. It does not distinguish parser extension tags (like <ref> or <source>) from tags like <b> and <table> which are (mostly) passed through to HTML. If I recall correctly, there is no support for noinclude/includeonly/onlyinclude. Sufficiently tricky markup can get mwpfh really confused. It seems the authors developed it by trial and error instead of actually looking up how MediaWiki parses markup (which is not that hard, really: just read the source). By the way, I just tried the <ref>foo{{close ref}} thing… it does not work as described here (as I expected, because I know how the parser works). You will be better off using mw:API:Expandtemplates with the generatexml option instead. (I would avoid Parsoid too, it has similar warts.) Keφr 21:33, 20 October 2014 (UTC)
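
For anyone following the nested-template exchange above, here is a rough sketch of what a bot might do with the span The Earwig quoted. It uses only the Python standard library and reads the top-level template's parameters out of the data-mw attribute. The names DataMwCollector, template_params and SAMPLE are made up for this illustration and are not part of Parsoid or mwpfh. It assumes the data-mw layout is exactly the one shown in the quoted output.

```python
import json
from html.parser import HTMLParser


class DataMwCollector(HTMLParser):
    """Collects the decoded data-mw JSON of every mw:Transclusion element."""

    def __init__(self):
        super().__init__()
        self.transclusions = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        typeof = attr_map.get("typeof") or ""
        if "mw:Transclusion" in typeof.split() and attr_map.get("data-mw"):
            self.transclusions.append(json.loads(attr_map["data-mw"]))


def template_params(html):
    """Return a list of (template name, {param name: wikitext}) pairs."""
    collector = DataMwCollector()
    collector.feed(html)
    collector.close()
    result = []
    for data_mw in collector.transclusions:
        for part in data_mw.get("parts", []):
            # Skip anything that is not a template entry.
            if isinstance(part, dict) and "template" in part:
                tpl = part["template"]
                params = {
                    name: value.get("wt")
                    for name, value in tpl.get("params", {}).items()
                }
                result.append((tpl["target"]["wt"], params))
    return result


# The span from the comment above, copied as-is.
SAMPLE = (
    '<span about="#mwt1" typeof="mw:Transclusion" '
    'data-parsoid=\'{"dsr":[0,31,null,null],"pi":[[{"k":"1","spc":["","","",""]}]]}\' '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"foo","href":"./Template:Foo"},'
    '"params":{"1":{"wt":"{{bar|{{baz|abc=123}}}}"}},"i":0}}]}\'></span>'
)

if __name__ == "__main__":
    for name, params in template_params(SAMPLE):
        print(name, params)
```

With the 'wt' form, parameter 1 of {{foo}} comes back as the raw string "{{bar|{{baz|abc=123}}}}", so reaching "abc" in {{baz}} would still need a second parsing pass over that string, which is the gap discussed above.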