User:Enterprisey/AIV analysis/Appendix

This page contains random details about the AIV analysis so that you can more thoroughly check my work.

Trimming the overlap

The September 2023 analysis generated two files, https://apersonbot.toolforge.org/aiv-analysis/2022-09-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json and https://apersonbot.toolforge.org/aiv-analysis/2023-02-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json. These overlapped by about a month because I started the second job on February 1 to catch the change made to {{IPvandal}}. I removed the overlapping cases from the first file and uploaded the resulting file, which is linked at the bottom of this section. Here's the Python session where I did the filtering:

aiv-analysis $ python
Python 3.10.10 (main, Mar  5 2023, 22:26:53) [GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> a=json.load(open('2022-09-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json'))
>>> len(a)
17826
>>> b=json.load(open('2023-02-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json'))
>>> len(b)
22559
>>> next(case['report']['aiv_removal_revid'] for case in a)
1107803846
>>> next(case['report']['aiv_removal_revid'] for case in b)
1136759073
>>> a[-1]['report']['aiv_removal_revid']
1142517023
>>> b_revids = set(case['report']['aiv_removal_revid'] for case in b)
>>> a2=[case for case in a if case['report']['aiv_removal_revid'] not in b_revids]
>>> len(a2)
14728
>>> a2[-1]['report']['aiv_removal_revid']
1136750374
>>> json.dump(a2, open('2022-09-01T00:00:00Z--2023-02-01T00:00:00Z--cases.0.json', 'w'))

As you can see, the task was straightforward: I built a set of AIV removal revids from b, then kept only the cases in a whose removal revid was not in that set. The result, a2, has 14,728 cases (3,098 fewer than the 17,826 in a), and I wrote it into the new file.
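For anyone who would rather rerun the filtering as a script than retype the session, here is a minimal sketch of the same logic. The function names (removal_revid, drop_overlap, trim_files) are made up for illustration and are not part of the analysis code. The only assumption about the data is the case['report']['aiv_removal_revid'] field used in the session above.

```python
import json


def removal_revid(case):
    """Return the AIV removal revision ID stored on a case."""
    return case["report"]["aiv_removal_revid"]


def drop_overlap(earlier, later):
    """Return the cases in `earlier` whose removal revid does not appear in `later`."""
    later_revids = {removal_revid(case) for case in later}
    return [case for case in earlier if removal_revid(case) not in later_revids]


def trim_files(earlier_path, later_path, out_path):
    """File-based wrapper (hypothetical helper); `with` closes each file handle."""
    with open(earlier_path) as f:
        earlier = json.load(f)
    with open(later_path) as f:
        later = json.load(f)
    trimmed = drop_overlap(earlier, later)
    with open(out_path, "w") as f:
        json.dump(trimmed, f)
    # Return the before/after counts so they can be compared with len(a) and len(a2).
    return len(earlier), len(trimmed)


# Tiny made-up example; these revids are illustrative, not real data.
a = [{"report": {"aiv_removal_revid": r}} for r in (10, 20, 30, 30, 40)]
b = [{"report": {"aiv_removal_revid": r}} for r in (30, 40, 50)]
a2 = drop_overlap(a, b)
print([removal_revid(case) for case in a2])  # [10, 20]
```

Note that several cases can share one removal revid (one edit can remove more than one report), which is why the filter compares against a set of revids rather than counting revids.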

Note that the resulting two files have no gap between them. To verify this, start at the last diff that I printed for a2, which is Special:Diff/1136750374, and step forward to the next instance of removed text, which is Special:Diff/1136759073. That is, as expected, the first revid that I printed for b.
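A partial offline sanity check is also possible. The sketch below uses a hypothetical helper, check_boundary, which is not part of the analysis code. It confirms that the two case lists share no removal revids and that every revid in the trimmed file precedes every revid in the later file. It assumes that revision IDs increase over time. It cannot detect a removal that is missing from both files; that still requires stepping through the page history as described above.

```python
def check_boundary(trimmed, later):
    """Return (last revid in trimmed, first revid in later), or raise ValueError."""
    trimmed_revids = {case["report"]["aiv_removal_revid"] for case in trimmed}
    later_revids = {case["report"]["aiv_removal_revid"] for case in later}

    # No removal revid may appear in both files.
    shared = trimmed_revids & later_revids
    if shared:
        raise ValueError(f"{len(shared)} removal revids appear in both files")

    # Assumption: larger revision IDs are later edits.
    last_trimmed = max(trimmed_revids)
    first_later = min(later_revids)
    if last_trimmed >= first_later:
        raise ValueError("trimmed file extends past the start of the later file")
    return last_trimmed, first_later


# Minimal stand-in cases using the two boundary revids printed in the session.
a2_tail = [{"report": {"aiv_removal_revid": 1136750374}}]
b_head = [{"report": {"aiv_removal_revid": 1136759073}}]
print(check_boundary(a2_tail, b_head))  # (1136750374, 1136759073)
```

With the real files, the two lists would come from json.load on the trimmed file and on the February-to-September file.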

The resulting file is https://apersonbot.toolforge.org/aiv-analysis/2022-09-01T00:00:00Z--2023-02-01T00:00:00Z--cases.0.json.