User talk:DataflowBot/output/Popular low quality articles (id-2)

See also the BOTREQ bot request and Jimbo Wales' talk page discussion.
This page is within the scope of WikiProject Wikipedia, a collaborative effort to improve Wikipedia's encyclopedic coverage of itself. If you would like to participate, please visit the project page. Please remember to avoid self-references and maintain a neutral point of view, even on topics relating to Wikipedia.
This page is rated NA-class: it does not require a rating on Wikipedia's content assessment scale.

New filtered pageviews data for all articles

@Bamyers99: the raw pageview data include hourly data sets of 220+ MB (uncompressed) with human-only filtered pageviews for all articles. Is it feasible to uncompress those and sort the /^en .*/ view counts for use here? Would using the top 20,000 yield the right number of Stub- and Start-class predicted articles? I guess we should try to list around 1,000 of those. EllenCT (talk) 12:27, 28 May 2016 (UTC)

@EllenCT: I don't think that processing the raw page view data would be a good use of computing resources, especially since it would duplicate the work of the Analytics team API and WP:5000. I have coded DataflowBot to use the WP:5000 page as input. I have used the same exclusions as Wikipedia:Top 25 Report#Exclusions for mobile data view percentages. The percentages are configurable. I have excluded C-class articles. --Bamyers99 (talk) 19:47, 28 May 2016 (UTC)
@Bamyers99: how do you feel about combining the Stub and Start percentage predictions with the numeric popularity, similar to how we combine edits and editors in WP:MOSTEDITED? I would gladly work out a formula for that if you can tell me whether there is a way to use zlib on a 220+ MB HTTPS stream without excessive memory overhead. Can zlib.decompressobj() do that? EllenCT (talk) 15:44, 30 May 2016 (UTC)
@EllenCT: Any formula you want to provide is fine with me. I only had 2 statistics courses in college, and that was a while ago, so.... Regarding zlib, I am not a Python programmer; DataflowBot is coded in PHP (source on GitHub). If the stream is a .gz file, then gzip would need to be used because, even though gzip uses zlib, gzip has header and checksum information wrapped around the zlib compressed data. It looks like the 4th parameter (fileobj) of Python's gzip.GzipFile might work if you can get an "object which simulates a file" for the HTTPS stream. --Bamyers99 (talk) 20:11, 30 May 2016 (UTC)
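For reference, a minimal Python sketch of the low-memory approach asked about above. It assumes the compressed data arrives as an iterable of byte chunks (how those chunks are fetched over HTTPS is left out), and the function name `iter_gzip_lines` is made up for illustration. Python's `zlib.decompressobj()` can be told to expect the gzip header and trailer by passing `16 + zlib.MAX_WBITS` as `wbits`, so the gzip module is not strictly required for a stream.

```python
import zlib
from typing import Iterable, Iterator


def iter_gzip_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text lines from an iterable of gzip-compressed byte chunks.

    Hypothetical helper: memory use depends on the chunk size and the
    longest line, not on the total size of the stream.
    """
    # wbits = 16 + MAX_WBITS makes zlib parse the gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    for chunk in chunks:
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.decode("utf-8", errors="replace")
    pending += decompressor.flush()
    if pending:
        # The final line may lack a trailing newline.
        for line in pending.rstrip(b"\n").split(b"\n"):
            yield line.decode("utf-8", errors="replace")
```

Any source of byte chunks works here, for example an HTTP client that reads the response body in fixed-size pieces.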
I am looking at http://php.net/manual/en/function.gzinflate.php and can't tell whether that or gzuncompress will work, and I have no idea whether either will work with 220+ MB files. Do you know what the PHP memory_limit parameter is where you are running? In the meantime I've grabbed [1] and will try to plot the confidence percentage for both stub and start predictions for the top ... I don't know how many, and see what those look like to try to make a formula. I'll ping you when that's ready. EllenCT (talk) 23:01, 1 June 2016 (UTC)
I generally don't try to read .gz files directly. If a .gz file is too big to gunzip to disk, then I gunzip to a pipe and have the program read from stdin a line at a time: gunzip -c file.gz | program --Bamyers99 (talk) 23:25, 1 June 2016 (UTC)
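A sketch of what the reading side of such a pipe could look like in Python. It assumes each line is space-separated as project code, page title, view count, byte count (check this against the actual dump format), and the name `top_en_pages` is hypothetical. Only the running totals for "en" titles are held in memory, not the file itself.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_en_pages(lines: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Tally view counts for 'en' lines and return the n most viewed titles."""
    totals: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: project code, page title, view count, byte count.
        if len(fields) < 3 or fields[0] != "en":
            continue
        try:
            totals[fields[1]] += int(fields[2])
        except ValueError:
            continue  # skip lines whose count is not a number
    # nlargest avoids sorting the full table when n is small.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])
```

In a script fed by `gunzip -c file.gz | python tally.py`, the call would be `top_en_pages(sys.stdin, 20000)`. Feeding several hourly files through the same pipe sums a title's counts across them.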
We'll figure out the lowest memory utilization solution after we figure out how many we need to get a decent list and whatever is eliminating the temporary popularity spikes. I haven't looked at how you did the spike elimination yet. In the meantime I am still working on the formula and have a data set at /snapshot-20160531-230000 that others might want to use too. EllenCT (talk) 03:25, 2 June 2016 (UTC)

Try this: top 100 stub predictions, sorted by their start class confidence

@Bamyers99: after a day of looking at this, it's clear that we can't use the same kind of formula that worked for MOSTEDITED. Start class-predicted articles are almost always pretty good, in that it's hard to find obvious and easy ways to improve them, so we should focus only on the stub predictions and sort them by their start class probability. Sorting stub-predicted articles by their stub class confidence doesn't do what we want at all, but using their start class confidence sorts them in the way people will expect. So, here is what I think will work best: once per day, unzip the HTTPS stream of a full day's snapshots[2][3][4] to tally the top 20,000 raw /^en / articles by pageviews across all 24 hourly files, ignoring redirects, disambiguation pages, list articles, non-article namespace pages, temporary popularity spikes, and whatever else you already ignore. Get the ORES predictions for those to find the 100 most popular stub predictions, save their start class probability confidences, and use those values to sort the list of 100 from smallest to largest. Is that doable? EllenCT (talk) 23:40, 2 June 2016 (UTC)
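The selection and ordering proposed above can be sketched as follows. This is an illustration, not DataflowBot's code: the `ScoredArticle` record, its field names, and the class labels "Stub" and "Start" are assumptions about how the pageview totals and ORES predictions might be held once fetched.

```python
from typing import Dict, List, NamedTuple


class ScoredArticle(NamedTuple):
    """Hypothetical record combining a pageview total with a quality prediction."""
    title: str
    views: int
    prediction: str                # predicted class label, e.g. "Stub"
    probability: Dict[str, float]  # per-class probabilities from the model


def popular_stubs_by_start_confidence(
    articles: List[ScoredArticle], limit: int = 100
) -> List[ScoredArticle]:
    """Keep the `limit` most viewed stub predictions, ordered by Start probability."""
    stubs = [a for a in articles if a.prediction == "Stub"]
    most_viewed = sorted(stubs, key=lambda a: a.views, reverse=True)[:limit]
    # Smallest Start probability first, as proposed in the comment above.
    return sorted(most_viewed, key=lambda a: a.probability.get("Start", 0.0))
```

Popularity decides which articles make the list, and the Start probability decides only their order within it.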

@Bamyers99: we can use the same kind of formula with that! Please see /Top 594 stub predictions from 20160531-230000. Sorry I couldn't filter redirects and disambiguation pages. EllenCT (talk) 05:32, 4 June 2016 (UTC)
I also asked for help with PHP at [5]. EllenCT (talk) 22:24, 5 June 2016 (UTC)

How's it going?

@Bamyers99: how are things going with you? I noticed there hasn't been an update to this list in a couple of weeks, even though WP:POPULARLOWQUALITY has had several times more pageviews than WP:MOSTEDITED. Could you please have it run every week? Please see the discussion here. Would you please tell User:EpochFail and User:JAllemandou (WMF) how to filter out redirects and disambiguation pages? EllenCT (talk) 15:16, 30 June 2016 (UTC)

@EllenCT: I had only been running it when the Signpost was published. I have just set it up to run early Tuesday mornings. The Top 5000 views report is sometimes re-run on Monday for some reason. I'm sure that they know about the redirect table in the database. For disambiguation pages, there is a 'disambiguation' pp_propname record in the page_props table. --Bamyers99 (talk) 15:46, 30 June 2016 (UTC)
Thanks again. If you want to post a link to where your code accesses that table and property in the User talk:EpochFail thread, that would likely help, too. Do you think PHP can unzip a full day of pageview data for a top N list in sufficiently small memory? EllenCT (talk) 16:48, 30 June 2016 (UTC)
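A sketch of the filtering described above, using the redirect table and the 'disambiguation' record in page_props. To stay runnable it builds a toy in-memory SQLite database whose tables carry only a few MediaWiki-style columns. The sample rows, the helper names, and the use of SQLite are all assumptions for illustration, and this is not how DataflowBot itself queries the database.

```python
import sqlite3
from typing import List, Sequence


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny stand-in for the page, redirect and page_props tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE redirect (rd_from INTEGER, rd_title TEXT);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT);
        -- Made-up sample rows.
        INSERT INTO page VALUES (1, 0, 'Hyphen-minus'), (2, 0, '-'),
                                (3, 0, 'Mercury'), (4, 1, 'Hyphen-minus');
        INSERT INTO redirect VALUES (2, 'Hyphen-minus');
        INSERT INTO page_props VALUES (3, 'disambiguation');
        """
    )
    return conn


def plain_articles(conn: sqlite3.Connection, titles: Sequence[str]) -> List[str]:
    """Return the given titles that are neither redirects nor disambiguation pages."""
    if not titles:
        return []
    placeholders = ", ".join("?" for _ in titles)
    query = f"""
        SELECT p.page_title
        FROM page AS p
        LEFT JOIN redirect AS r ON r.rd_from = p.page_id
        LEFT JOIN page_props AS pp
               ON pp.pp_page = p.page_id AND pp.pp_propname = 'disambiguation'
        WHERE p.page_namespace = 0      -- article namespace only
          AND r.rd_from IS NULL         -- no row in redirect: not a redirect
          AND pp.pp_page IS NULL        -- no disambiguation property
          AND p.page_title IN ({placeholders})
    """
    keep = {row[0] for row in conn.execute(query, list(titles))}
    return [title for title in titles if title in keep]
```

With the toy rows above, `plain_articles(build_toy_db(), ["-", "Mercury", "Hyphen-minus"])` keeps only "Hyphen-minus".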

Unlikely pageview count

How did the bot count over eight million views for Hyphen-minus on its 31 January run? This shows a more modest 1,213. Noyster (talk), 10:38, 19 February 2017 (UTC)

@Noyster: The 5000 popular pages report that was used for this run lists "-", which is a redirect to Hyphen-minus, at 8,449,402 views. Here is the Pageviews Analysis for "-". --Bamyers99 (talk) 15:19, 19 February 2017 (UTC)
@Bamyers99: FYI, I came across this note in the PageView API, and there's also a mention of how they removed it from the top list in the changelog. Might be worth considering doing something similar here. Cheers, Nettrom (talk) 18:54, 9 March 2017 (UTC)
@Nettrom: Thanks for the explanation. I have excluded the dash page from the results. --Bamyers99 (talk) 19:32, 9 March 2017 (UTC)

Duplicate Listings

The list which was generated on 2017-04-04 03:27 (UTC) contains the entry Noodle (Gorillaz) twice with different values for Rank and Views:

Rank Article ORES prediction Views
143 Noodle (Gorillaz) Start 26,622
144 Jason Orange Start 26,617
145 Noodle (Gorillaz) Start 26,358

Thatsquareguy (talk) 13:00, 18 June 2017 (UTC)

@Thatsquareguy: #145 is for the redirect Noodle (character) found in this 5000 popular pages report. Redirect totals are not consolidated with their targets. Given the fact that the ORES prediction retrieval has failed every week since April 4, I am not inclined to work on consolidation. --Bamyers99 (talk) 13:52, 18 June 2017 (UTC)
Hey Bamyers99, I had a look at DataflowBot's source code for fetching ORES scores (because I recently worked on SuggestBot and had some ORES issues). The URL (line 144) points to the Labs ORES instance, which AFAIK is more experimental at this point. Try switching to the production URL ("https://ores.wikimedia.org/v2/scores" in this case)? Cheers, Nettrom (talk) 16:56, 18 June 2017 (UTC)
@Nettrom: Thanks for the tip. I switched URLs and it is working again. --Bamyers99 (talk) 17:34, 18 June 2017 (UTC)
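For what the consolidation mentioned in this thread could look like, here is a minimal sketch. It assumes a mapping from redirect title to target title is already available (for instance from the redirect table). The function name and the data shapes are hypothetical, and the bot does not currently do this.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def consolidate_redirects(
    views: Dict[str, int], redirect_targets: Dict[str, str]
) -> List[Tuple[str, int]]:
    """Fold each redirect's views into its target and rank by the combined total."""
    totals: Dict[str, int] = defaultdict(int)
    for title, count in views.items():
        # Titles that are not redirects map to themselves.
        totals[redirect_targets.get(title, title)] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Applied to the three rows quoted above, with Noodle (character) mapped to Noodle (Gorillaz), the two Noodle entries merge into one row of 52,980 views, ahead of Jason Orange at 26,617.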

Main page

The Main Page probably shouldn't appear in this list, for all sorts of reasons. Best Regards, Barbara 12:49, 12 May 2019 (UTC)

It's still showing up. --Nessie (talk) 19:07, 30 October 2019 (UTC)
Still there. --awkwafaba (📥) 02:58, 1 April 2020 (UTC)

COVID-19?

How is it that there are no articles on COVID-19 on this list? There are plenty of stubs popping up all over for that. Seems like the bot is kinda stuck. The same few articles seem to be just bouncing around the list. Typically the Netflix original of the week gets replaced every week or so, but the same ones have been on there for at least a month. Are the hamsters low on food? --awkwafaba (📥) 01:33, 25 March 2020 (UTC)

@Bamyers99 and DataflowBot: It's still stuck. How is the page currently listed with the most views, UFC 246, about an event from January? The table says it has 704,144 views for the week, but it only has 50,383 for the month. 2020 coronavirus pandemic in Uttar Pradesh, a stub, has 47,765 pageviews for the week supposedly covered by the current list, which would put it about halfway down. Start-class Favipiravir has 645,891 pageviews. Check out others here and here. It's clearly not updating properly. --awkwafaba (📥) 02:50, 1 April 2020 (UTC)
@Awkwafaba: Thanks for the ping. The previous source for popular pages, User:West.andrew.g/Popular pages, has been retired. I have switched it to use User:HostBot/Top 1000 report. --Bamyers99 (talk) 21:35, 1 April 2020 (UTC)

Move to project space?

This is an extremely useful list, and it's linked directly from the Community portal. I see from the above that it's been having some issues, but still, is it time to graduate this to a more proper title in projectspace (WP space) rather than userspace? Sdkb (talk) 08:17, 22 April 2020 (UTC)

I was just about to suggest this when I saw your post. I support this entirely, because there's really no official one of these in the WP: space. It would make the list easier for others to find, and projectspace is a better home for a list like this. Heyoostorm (talk) 21:19, 16 July 2020 (UTC)

Duplicate Person

I found a duplicate listing for Joy Philbin. Should one of these be removed?

Rank Article ORES prediction Views
14 Joy Philbin Start 162,257
15 Joy Philbin Start 156,972

LuckyMiner01 | I'm new here, so if I make a mistake, please tell me, here, so I can learn from it. 22:52, 29 July 2020 (UTC)