Wikipedia talk:WikiProject External links/Webcitebot2/Archive 1


Judging importance

This Task Force is in the beginning phase. So before it goes any further we should evaluate its importance. Please indicate below if you feel this task is Important or Not important. Thank you. - Hydroxonium (H3O+) 14:26, 3 February 2011 (UTC)

  • Important - Obviously I feel the task is important. - Hydroxonium (H3O+) 14:27, 3 February 2011 (UTC)
  • Important – obviously WP:LINKROT highlights all the problems caused by not archiving web-pages. The bot in a single run through the {{cite web}}'ed pages tagged over 80k dead links in a span of a month. My favourite statistic being – for every {{Dead link}} removed, 14.5 new ones were added. —  HELLKNOWZ  ▎TALK 17:26, 3 February 2011 (UTC)
  • Important - WP:LINKROT is a major problem for Wikipedia that directly threatens its reliability and content diversity. A bot helping with this problem frees up work hours that can be spent on improving article content instead of combating link rot. Thus I consider this an important bot. Toshio Yamaguchi (talk) 18:25, 3 February 2011 (UTC)
  • Important - But not sufficient. However good the bot is, it still needs the associated archive(s) to be up to the job. We've got over 810 thousand transclusions of {{cite web}} to support somehow, and of course this number is only growing. LeadSongDog come howl! 05:48, 4 February 2011 (UTC)
  • Very important. I suggest that both a bot and a semi-automatic tool should be created to Webcite pages. Linkrot is a huge problem and there are few better solutions.   Will Beback  talk  09:23, 7 February 2011 (UTC)
  • Important I'm always seeing links removed by editors because they are dead, rather than replaced with archived links. It takes a significant amount of time to do ourselves and is repetitive and boring - therefore making this an ideal job for a bot. SmartSE (talk) 12:18, 7 February 2011 (UTC)
  • Extremely Important. Not only in the context of Wikipedia, but in scholarship in general - see http://www.webcitation.org/5wLW5GaNU - I'd be very eager to work with the Wikipedia community and/or foundation to use WebCite (which is an open source project), and also to help address the broader societal goals discussed in that paper. Disclosure: I am the initiator of the WebCite project. I'll comment below to clarify some issues/questions raised, without necessarily repeating this disclosure in my comments below. WebCite is an academic, non-profit project, currently hosted at the University of Toronto --Eysen (talk) 14:57, 8 February 2011 (UTC)
  • Important Required to maintain verifiability. Doc James (talk · contribs · email) 13:41, 9 February 2011 (UTC)
  • Important. I agree. Dead links are so frustrating. Axl ¤ [Talk] 10:38, 10 February 2011 (UTC)
  • Extremely Important: a dead link used as a source is not a source anymore 81.64.104.59 (talk) 10:56, 12 February 2011 (UTC)
  • All of the above. -- œ 11:58, 12 February 2011 (UTC)
  • Very important - as others have said, verifiability is more important and a source that cannot be verified is no source. FT2 (Talk | email) 02:46, 17 February 2011 (UTC)

Input from software engineers

Creating a replacement WebCiteBOT is an extremely difficult task. Monitoring the IRC feed, submitting links to WebCitation.org and updating templates sounds simple. But there are a lot of other things to take into account in order to do it properly. Several of our best software engineers have looked into this issue and some have started working on bots. I will ask them to stop by to give their input. - Hydroxonium (H3O+) 14:28, 3 February 2011 (UTC)

General comments and next steps

First, I would like to encourage others to edit the task force page as this project is just getting started and we are still defining things. I would also like to get comments from people interested in this issue, including what we want to accomplish and how we should proceed. Thanks. - Hydroxonium (H3O+) 14:30, 3 February 2011 (UTC)

Is webcitation.org still functional?

  • So far as I am able to discern, the webcitation.org site is itself moribund. I see nothing that's changed there for years now. It appears that Gunther's efforts have moved on. Are there actions we can take to revive it?
WebCite has been running for almost 10 years now. It is true that it is not the flashiest website around, as funding is tight, but there have been some backend developments (see slide 45 at http://www.slideshare.net/eysen/preserving-the-scholarly-record-with-webcite-wwwwebcitationorg-an-archiving-system-for-longterm-digital-preservation-of-cited-webpages ). It is used by over 200 scholarly journals, including my own, see for example the reference list at http://www.jmir.org/2011/1/e14/ --Eysen (talk) 15:25, 8 February 2011 (UTC)
  • Can we simply verify that it has a scalable future before we pour too much effort into further using it?
  • Is the initial limit on input rate of new content to webcitation.org still relevant? After all this time, I'd expect it to have increased significantly.
We introduced the rate limit to prevent spamming and other forms of abuse. The Wikipedia WebCiteBot can obviously be whitelisted. --Eysen (talk) 15:25, 8 February 2011 (UTC)
  • If it is not viable, can we explore alternatives with WMF, at least for the freely-licensed cases? LeadSongDog come howl! 23:06, 3 February 2011 (UTC)
Unlikely.
I don't know how to verify it has a future. Perhaps ask whether there are plans to hand over the reins to Internet Archive?
The site is still operational in the sense that it both continues to supply previously archived pages and also continues to accept archival requests. For example, I have receipt emails for today from 13 minutes ago and 5:58 PM.
The rate-limiting is still in effect. My own experimenting with my tools indicates that the rate-limiting in practice has to be around 20 or 30 seconds per submission. Obviously this is utterly infeasible for Wikipedia-scale archiving (see the rough arithmetic after this comment).
I previously emailed the Internet Archive people, since they have a for-fee on-demand archiving service. I got a pretty cold shoulder. I'm sure the WMF could afford the service; but if I may wax cynical, the WMF has demonstrated that it is more interested in hiring more staff and discussing gender issues and African children than in improving technical aspects of Wikipedia. --Gwern (contribs) 23:36, 3 February 2011 (UTC)
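To put that rate limit in perspective, here is a back-of-the-envelope check; the link count is the rough total discussed further down this page, and 25 seconds is just the midpoint of the range observed above, so both figures are assumptions rather than measurements:

```python
# Rough feasibility check: sequential submission under the observed rate limit.
# Both inputs are assumptions, not measurements.
links = 17_000_000              # approximate number of external links (see below)
seconds_per_submission = 25     # midpoint of the 20-30 s observed in practice
years = links * seconds_per_submission / (60 * 60 * 24 * 365.25)
print(f"{years:.1f} years for a single submitter")   # roughly 13.5 years
```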
I spoke by telephone to the technical contact listed for webcitation.org today. He confirmed that they are still running but are not funded at present to support further development. There is room there for WMF or others to fund or volunteer technical resources to support their efforts. In a sense, we've had a free ride off them for a while now. We really should try and help out their effort somehow, if we want to continue using them. LeadSongDog come howl! 05:04, 4 February 2011 (UTC)
It sounds like creating something independent is the best way to go. The thing is, it shouldn't cost that much money to do (relatively speaking): all in all it should be under 100 million links, which would probably take under 30 TB of HD space. It's not like we are archiving the entire web, just what is referenced in Wikipedia. It is, however, a bit expensive for any one person to afford. All in all you could probably build something like WebCite able to scale up to Wikipedia for maybe $10-15k for equipment, plus whatever it costs to rent the relevant space in a data center (I'd guess anywhere from $100-500 a month). If the WMF could do something that would be awesome. The one downside to doing something like that is you have to deal with the legal aspects of DMCA takedowns. --nn123645 (talk) 05:58, 4 February 2011 (UTC)
User:EdoDodo just ran a query for me on the toolserver and came up with the total number of external links in pages as 17,462,350 as of (about) the time of this post including duplicates. Assuming about 350 KB per page (this page had some stats on the average size of a web page) you'd be looking at about 6 TB storage (including duplicates), which with a backup you'd be looking at 12 TB. Excluding duplicates the total amount of space needed might be closer to 10 TB, which isn't nearly as bad as I originally thought. --nn123645 (talk) 07:37, 4 February 2011 (UTC)
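For reference, that estimate can be reproduced directly from the thread's own figures (the 350 KB average page size is an assumption, as noted above):

```python
# Reproducing the storage estimate above from the thread's own figures.
links = 17_462_350          # external links counted on the toolserver, duplicates included
avg_page_kb = 350           # assumed average size of an archived page, in KB
total_tb = links * avg_page_kb / 1024**3    # KB -> TB
print(f"{total_tb:.1f} TB, or {2 * total_tb:.1f} TB with one full backup copy")
# ~5.7 TB and ~11.4 TB, in line with the ~6 TB / 12 TB quoted above
```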
How about a merge? WMF could provide disk space and connectivity in exchange for heavy usage of the service. Archiving just the text would save space and be a good first step. - Hydroxonium (H3O+) 08:31, 4 February 2011 (UTC)
It's not the money that's such an issue, it's the software and maintenance. $10k is way overestimating how much the hardware costs; a 2-terabyte drive costs $60, so hard drives will run you (let's be generous and ask for 20 TB of space, or 10 drives) $600, and then let's say the server is another $1000, for a total of roughly a tenth of your low-end estimate.
And as I said before, you guys are assuming the Foundation really cares. If it did, it would have gotten access to the Internet Archive's ArchiveIt service and done this as a MediaWiki plugin. Unless you have a firm public commitment, don't expect any help from the WMF. --Gwern (contribs) 15:35, 4 February 2011 (UTC)
Yeah, I was going to revise my estimate. That was when I was thinking the links were over 100 million, not the 17.5 million that they actually are. So I vastly overestimated the number of links that exist, which I guess does make sense, since most of the pages in the mainspace are stubs with very few or no links. If every page in the mainspace was an FA it'd be a different story entirely, but obviously that is not the case. Also it's been a while since I looked at the prices for storage space. It's pretty amazing that you can get a 2 TB drive for $60; last time I was looking at them they were around $250. --nn123645 (talk) 15:47, 4 February 2011 (UTC)

┌─────────────────────────────┘
So answering the question: webcitation.org is still functional and can be used for small-scale archiving, but is not able to handle full-scale archiving of all Wikipedia's reference links. Where do we go from here? - Hydroxonium (H3O+) 17:45, 4 February 2011 (UTC)

Perhaps you or someone else could write up a proposal for WMF, expressing the need for such a service, and suggest that there's one ready made over that-a-way? Huntster (t @ c) 05:48, 5 February 2011 (UTC)
I am not sure why the conclusion is that WebCite "is not able to handle full-scale archiving of all Wikipedia's reference links". Is this because of the rate limit? Again, the rate limit was introduced because of abuse for spamming purposes. The rate limit can be lifted for Wikipedia, i.e. we can whitelist any legitimate use/bot. WebCite has about 50 million snapshots in the database. How many references does Wikipedia have? One guesstimate would be 3.5 million articles times an average of 10 references = 35 million? --Eysen (talk) 15:25, 8 February 2011 (UTC)
Again, WebCite is open source, not-for-profit, and very interested in working with the Wikipedia community on this. If this means formally donating the code, handing over control of the domain and the trademark "WebCite" to WMF, then I would even be prepared to do this, if WMF guarantees sustainability. But I doubt WMF wants to deal with the liability and operating issues. There is some value in keeping this as a separate entity, not only for liability reasons, but also because we want to encourage a universal standard for citing webreferences and want to solve broader problems, see e.g. http://www.webcitation.org/5wLW5GaNU. I would argue that supporting WebCite is interesting for Wikipedia/WMF not only from the perspective of a "citing" author/entity, but also from the perspective of a "cited" entity. Many journals discourage citing webreferences, including Wikipedia articles, due to fear that they disappear. WebCite was created to instill trust and to create a mechanism for journal editors/publishers so that webreferences are seen as legitimate references. --Eysen (talk) 15:25, 8 February 2011 (UTC)
In any case, WebCite supports a proposal to WMF being written, pointing out the need for such a service, but also pointing out that WebCite is an existing open source service that is eager to work with the Foundation either in a formal or in an informal way to address this problem. To me, a memorandum of understanding between WebCite and WMF is the way forward. The Internet Archive could be involved as a "guarantor" for mirroring WebCite content (which it has already agreed to do). WMF and WebCite could also look for funding from third parties (granting agencies etc.) together. --Eysen (talk) 15:28, 8 February 2011 (UTC)
Somewhere on this page is an estimate of 17 million external links that need to be archived, which would be a significant increase on top of the existing 50 million. But I'd like to work with WebCite too - I was just unsure whether you guys would work with us.
So, I think the thing to discuss is how to whitelist some bots to do archive requests, and how to get the initial 17 million URLs archived.
One suggestion is that someone experienced with Pywikipedia or something could download a database dump, extract the 17m URLs, and send you a multi-megabyte text file of the URLs; I remember from the FAQ you were supposed to develop a feature for bulk archiving where one would simply FTP up a big text file, so I assume you could manually do a bulk archive given a text file. --Gwern (contribs) 15:45, 8 February 2011 (UTC)
The ~17.5 million (17,462,350 was the exact number) was not an estimate, but the actual number as of about 5 days ago (edit: as mentioned above note that this includes duplicates, so the actual number of unique URLs might be significantly less). Extraction isn't an issue, as a dump already exists with a list of every URL here (844 MB download). --nn123645 (talk) 09:15, 9 February 2011 (UTC)
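As a sketch of the bulk-list idea, something along these lines would do the deduplication; it assumes the dump is (or has been converted to) a gzipped plain-text file with one URL per line, and both filenames are placeholders:

```python
# Deduplicate a one-URL-per-line dump and write the unique URLs for bulk hand-off.
import gzip

seen = set()
with gzip.open("enwiki-external-links.txt.gz", "rt", encoding="utf-8", errors="replace") as src, \
        open("unique-urls.txt", "w", encoding="utf-8") as dst:
    for line in src:
        url = line.strip()
        if url and url not in seen:     # keep the first occurrence only
            seen.add(url)
            dst.write(url + "\n")

print(len(seen), "unique URLs")         # ~17 million entries fit in a few GB of RAM
```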

"Print mode" archiving

Just as a consideration for the bot: it should be aware of sites that employ a URL-based parameter that enables a "print mode" collapsing multipage articles into a single page, so that this print version is the one that is archived. We would need advance knowledge of which sites do this and how the functionality works, but it should be easy to work from that list once it's compiled. --MASEM (t) 18:28, 4 February 2011 (UTC)
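One possible shape for such a list, purely as a sketch (the two domains and their rewrite rules are hypothetical examples, not vetted entries):

```python
# Map from hostname to a rewrite applied before archiving, so the single-page
# "print" version of an article is the one submitted to the archive.
from urllib.parse import urlparse

PRINT_MODE_RULES = {
    "news.example.com": lambda u: u + ("&" if "?" in u else "?") + "printable=yes",
    "magazine.example.org": lambda u: u.replace("/article/", "/article/print/"),
}

def printable_version(url):
    """Return the print-mode variant of a URL if a rule is known, else the URL unchanged."""
    rule = PRINT_MODE_RULES.get(urlparse(url).netloc.lower())
    return rule(url) if rule else url
```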

You're talking about literally hundreds of thousands, possibly millions of unique domains. Manually going through each one and finding out whether it supports a printable view is impractical and would likely have only a negligible impact on space. You could do this by looking for synonyms of the word "print", but it would introduce unneeded complexity. --nn123645 (talk) 09:36, 9 February 2011 (UTC)

Using Archive-It

It seems to me that one of the preliminaries to any attempt to use IA's Archive-It is figuring out how much they would charge. The obvious criticism of a proposal that the WMF subscribe is that 'it would cost too much and we'd rather hire some more staffers', which is a hollow defense if the IA would only charge a few thousand dollars. (It would be nice, given the shared mission and ideals and shared advisors, if the IA would provide Archive-It for free; but that's unrealistic.)

They probably charge per URL; more important might be the number of new URLs. We already have a figure for raw URLs - 17 million. But how many of those have already been archived? That might be more important. (Even if Wikipedians were to set up their own archive service, we might still want to know this so one might be able to economize and omit those URLs.) A good estimate could probably be gotten by getting a list of URLs from 100 or 1000 random articles, rewriting them into IA searches (easily done by prepending the appropriate string), and then seeing what fraction is unavailable. I don't have the facility with Pywikipedia to get such a list and do the checking myself. Any volunteers? --Gwern (contribs) 00:37, 5 February 2011 (UTC)
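One rough way to get that fraction, sketched under the assumption that sample_urls.txt holds one URL per line (e.g. links pulled from a few hundred random articles); it uses the Wayback Machine's availability endpoint, which returns JSON describing the closest archived snapshot:

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

def is_archived(url):
    """True if the Wayback Machine reports any snapshot for the URL."""
    api = "https://archive.org/wayback/available?url=" + quote(url, safe="")
    with urlopen(api, timeout=30) as resp:
        return bool(json.load(resp).get("archived_snapshots"))

with open("sample_urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

hits = 0
for url in urls:
    try:
        hits += is_archived(url)
    except OSError:
        pass                  # treat network errors as "unknown" rather than archived
    time.sleep(1)             # stay polite to the API

print(f"{hits}/{len(urls)} already archived ({hits / len(urls):.0%})")
```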

Internet Archive

Regarding Internet Archive, I thought the major limiting factor in using it as an archive-on-demand service was that there was a minimum of six months between submission and publication, often more. I couldn't find anything to indicate that Archive-It worked differently, though I'd hope it would. (I admit as well, I can't stand IA's setup, and vastly prefer WebCitation). Huntster (t @ c) 05:36, 5 February 2011 (UTC)
I don't think the six month limit really matters. If your timescale is that short, you might as well say, "But most links don't bitrot in just 6 months so why bother archiving at all?". And you may not like the IA's setup, but that we're having this discussion at all is evidence that they have at least got the longevity/stability thing right; a good solution that exists is better than a perfect solution that does not exist. --Gwern (contribs) 14:38, 5 February 2011 (UTC)
A major problem with the Internet Archive is that they retroactively apply robots.txt, so that when a website dies and is taken over by a porn site all the old archived content becomes unavailable. We would need to negotiate some change with the Archive so that a snapshot made within a given period was marked as okay, plus have a mechanism for site owners to still ask for retraction. Overall I'd like it if we could use Archive-It, but this, combined with the problem of checking that the right version is archived when marking it, makes it look rather problematic to me. Dmcq (talk) 22:04, 6 February 2011 (UTC)
How large a problem would you estimate this is? Does it affect 90% of dead URLs? 10%? 1%? 0.001%? --Gwern (contribs) 22:22, 6 February 2011 (UTC)
Sorry, I see now the title is Archive-It rather than the Internet Archive. Archive-It is for people archiving their own stuff, I believe; what we want is something different. I was thinking of the Internet Archive. I've put in a section heading above where the topic changed. My experience of http://www.archive.org/ is that it seems to affect somewhere around half the sites I check it for. It depends on how popular the site was before it died. It sometimes even affects sites that haven't been taken over, in that a change of design means old pages are now redacted, but I don't suppose there's too much that can be done about that. I wonder if archiving specifically nominated pages should really be counted as robot activity if done at the time of nomination. I can see of course that repeating the archiving unsupervised, or looking at pages that aren't specifically nominated, would count as a robot trawling, but a specific request from a user isn't trawling. Is there a robots.txt way of saying "I don't want this archived", as opposed to just stopping robots putting pressure on my site or indexing it? Dmcq (talk) 09:38, 8 February 2011 (UTC)

fr.wiki archiving method

I've noticed that fr.wiki uses http://www.wikiwix.com/ to archive references (example). Judging by the number of times it is used, I assume that it is automated. Does anyone know of anyone to ask, or who could ask, how they do it? I found this proposal where it was decided to implement it, but I'm not sure about the exact mechanics of it. If their system works, it would make sense to me for us to try and use it as well, rather than duplicate efforts. SmartSE (talk) 12:59, 7 February 2011 (UTC)

I'm on it. - Hydroxonium (H3O+) 14:07, 7 February 2011 (UTC)
I have contacted fr:Utilisateur:Dodoïste who was involved with the wikiwix project and asked them for help. - Hydroxonium (H3O+) 14:44, 7 February 2011 (UTC)
Cool, I found this blog post by them in January which sounds like they are interested in helping other projects (and speak English!). SmartSE (talk) 14:53, 7 February 2011 (UTC)

┌────────────────────────┘
Looks like Wikiwix is run by fr:Utilisateur:Pmartin, who is the owner of the linterweb.fr search engine. I think he and fr:Utilisateur:Dodoïste wrote a javascript tool for monobook (see fr:Utilisateur:Pmartin/Cache and the google translation). It seems that users can add fr:Utilisateur:Pmartin/cache.js to their monobook and it will add a green "cache" link to external links. I can't read javascript, so I don't know what it does. Any of our software engineers want to take a look at it? Thanks. - Hydroxonium (H3O+) 16:41, 7 February 2011 (UTC)

I'm not a JS coder, but I have to read it on occasion. It looks pretty straightforward; if it's configured to run, then on any article in mainspace, it will examine every off-site URL, and if it doesn't point to Wikipedia or Internet Archive or Wikiwix already, will create a new link next to it (consisting of "http://wikiwix.com/cache/?url="+old-link).
On the topic of Wikiwix, it's unclear to me what kind of project they are. Commercial? It would seem so. --Gwern (contribs) 16:58, 7 February 2011 (UTC)
Most of what has been said is quite accurate, except that Pmartin wrote the script; I am not a JavaScript coder either (or at least, not yet). Let's start from the beginning.
In summer 2008 I was researching Web archives with another Wikipedian. We quickly saw several annoying limitations in them. Internet Archive is slow, crawls the web in an unpredictable fashion and lacks continuity. IA is also missing an impressive number of pages from its archive. We did not have a bot to use WebCite as you did, and we also saw that you had trouble using WebCite (the servers went down frequently, and WebCite could not archive as quickly as you needed it to).
Around that time Pmartin jumped in and offered to provide an archive dedicated to Wikipedia. For free. LinterWeb is indeed an enterprise that must make money. But when they can afford it, they promote open source software and are glad to give a hand for free. LinterWeb has a history of collaboration with the Wikimedia movement and the WMF, although Pmartin sometimes has trouble interacting with Wikimedians because he can't speak English.
Here is how the archive works. An instant after an external link is added to Wikipedia for the first time, it is archived by Wikiwix, because Wikiwix is synchronized with Wikipedia's recent changes (a rough sketch of that kind of feed-watching follows this comment). The currently existing external links are archived too, and if they are already dead Wikiwix tries to copy them from another archive service (Internet Archive and Google cache, for example).
The archive of a page does not change thereafter; there is no update. At least, that is what the community chose in 2008. There have been several debates among the French community about improving this system, but we just couldn't agree on something else. So Pmartin did not make any significant change to the archive system. However, if the English Wikipedia wants a different archive system, the community simply has to agree on something and Pmartin will do his best. For example, I guess you could have several archives for several dates, like the Internet Archive. Or get a way to update the archive if needed. Etc.
Pmartin has been wishing to extend Wikiwix to other Wikimedia wikis for a long time. We tried to reach the English Wikipedia once or twice. Last time I discussed adding en.wikipedia.org with him, everything was ready for it. The servers and the software were ready; he could have started the synchronization with the English Wikipedia's recent changes almost instantly. The links already inside the articles are added at a rate of 60 links per second (which makes 2,500,000 links per day). Basically, the archive would have been completely ready after two weeks. I can't know for sure if it is still the case, but I bet it is. The archiving system could begin in a few days (depending on Pmartin's free time), and two weeks later we would have a working beta version.
Wikiwix is hosted on the Renater network, hence their ability to store very large amounts of data and their robust servers.
I guess I've said the most important things. If you have any other questions, feel free to ask. I will do my best to answer them. Yours, Dodoïste (talk) 19:35, 7 February 2011 (UTC)
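For the curious, the kind of recent-changes synchronization described above can be approximated with the public MediaWiki API; this is only a minimal polling sketch, not Wikiwix's actual mechanism:

```python
# Poll recent main-namespace edits and list the external links on each touched page.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def api_get(**params):
    params["format"] = "json"
    with urlopen(API + "?" + urlencode(params), timeout=30) as resp:
        return json.load(resp)

rc = api_get(action="query", list="recentchanges", rcnamespace=0,
             rctype="edit|new", rclimit=25)
titles = {change["title"] for change in rc["query"]["recentchanges"]}

for title in titles:
    data = api_get(action="query", prop="extlinks", titles=title, ellimit="max")
    for page in data["query"]["pages"].values():
        for link in page.get("extlinks", []):
            print(title, link["*"])   # "*" holds the URL in the default JSON format
            # a real archiver would queue this URL for archiving here
```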
Thanks very much, Dodoïste. I have a couple questions
  1. Are most of fr.wikipedia's external links already archived on Wikiwix?
  2. Have most of fr.wikipedia's external link problems been solved?
  3. What would en.wikipedia need to do to get this started?
  4. Would you be availble to help coordinate this with Pmartin?
Thanks very much. I really appreciate your help. - Hydroxonium (H3O+) 20:57, 7 February 2011 (UTC)
You're welcome, I enjoy doing this. :-)
  1. Yes. The archive itself was completed ages ago. Each external link added to fr.wikipedia after August 2008 is automatically and reliably archived in Wikiwix. In short, the new external links are not a problem. The old ones, however, are quite tricky to retrieve.
  2. Yes, most have been solved. I don't have any stats right now but Pmartin could surely provide them. However, it's quite tricky and sometimes impossible to recover an archive of external links that rotted long ago. For example, several dead websites were not in IA, and thus couldn't be retrieved. In the case of the English Wikipedia, the results might be much better since WebCite has been used extensively. Wikiwix will basically be a combination of IA, WebCite, other archives if needed, and Wikiwix's own archive. Some links will always be impossible to retrieve. But one can hardly do any better than Wikiwix.
  3. To get the archive started, Pmartin just needs some time to set it up. I've just sent him an email to ask him if he is still ready to pull the switch anytime soon. The archive itself and the display of the "[archive]" links on Wikipedia are two different things entirely. Once the archive is ready, users simply need to add fr:Utilisateur:Pmartin/cache.js to their vector.js to beta-test it. If you wish to deploy it to the whole Website, add this JS to MediaWiki:Common.js.
  4. Sure, I could spare a few hours a day to do the translations.
Cheers, Dodoïste (talk) 21:47, 7 February 2011 (UTC)
Thanks very much. I think this is the way to go since fr.wikipedia has been using it for close to 2 years. - Hydroxonium (H3O+) 22:09, 7 February 2011 (UTC)
Since this system is ready-made for our purposes, and is for all intents and purposes ready to go, this is probably the best option. I would suggest three upgrades, though. First would be a textbox at Wikiwix to allow for easy look-up of archived sites, as with IA and WebCite. Second would definitely be to allow for multiple versions of an archived site, since pages can and do change but still remain relevant (or become newly relevant). Third, and if it exists I'm sorry, but please allow for manual archiving of a site (yes, I know it is supposed to be automatic, but I don't trust bots to always and flawlessly work).
Now, are you saying that when the en.wiki database is being built, existing archives from IA and WebCite would be pulled so everything is held in one place? If so, this would suggest an even stronger need for allowing versions at Wikiwix, since we wouldn't want to pull and apply a version from the wrong date. Huntster (t @ c) 00:34, 8 February 2011 (UTC)
This option seems almost too good to be true. The best part is it can be implemented without support from the foundation, which turns this from a discussion about a proposal into a discussion about the actual implementation. Should we reboot this as an RFC listed at {{cent}}? (will start a separate discussion below about this).Yoenit (talk) 10:03, 8 February 2011 (UTC)
Personally I already did it a few months ago: I was satisfied with Wikiwix because it gives all the versions of the sites, and I have installed it on the French Wiktionary, Wikibooks, and Wikiversity without ANY problem. JackPotte (talk) 19:12, 8 February 2011 (UTC)
@Huntster:
  1. This already exists. See, Wikiwix is originally a search engine dedicated to Wikimedia content. I don't remember how to look up archived sites, but I tried it about a year ago. It could become more visible if some users are interested in it.
  2. I've just discussed allowing multiple versions of an archived webpage with Pmartin. He agreed that it is a good idea and we should implement it. The community should define their needs clearly on this topic. How should new archives be created? Manually, by the author of an article using the said webpage in Wikipedia? Or automatically, every x months? We believe that automatic updates may not be relevant most of the time, and manual updates would make more sense. But it's up to the community to decide.
  3. I've never heard of such cases happening at fr.wikipedia. I'm not sure if it's needed really. But I guess it could be implemented anyway, just to be sure.
As Wikiwix currently works, IA and WebCite are only used if the external link is already dead when Wikiwix tries to archive it. So there is no risk of an archive being overwritten by an older archive. Cheers, Dodoïste (talk) 20:26, 8 February 2011 (UTC)
I was going to come and suggest this; I am glad to see you have already been in touch with Dodoïste! I hope to catch Pascal on IRC soon; we've been playing tag. Please note that the WikiScholar proposal does not specify where original cited pages are stored -- only that there should be a single page to discuss all cites of that source (and to offer a "what cites here" list), which should include a link to an archive of the source. SJ+ 19:44, 9 February 2011 (UTC)

Proposal

Per Huntster's suggestion above, I have started a proposal at strategy:Proposal:Have WMF involved in archiving online citations. There is a basic framework, but nothing under the "proposal" section yet. Everybody is encouraged to update/change it as they see fit. Please note that you don't automatically get logged in there — even if you have a unified login on the other projects. So your IP address will be shown if you don't specifically log in. Thanks. - Hydroxonium (H3O+) 18:22, 5 February 2011 (UTC)

Below are the proposals so far. Please discuss and add to these. Thanks. - Hydroxonium (H3O+) 18:53, 5 February 2011 (UTC)

WMF start its own archival project

  • Support - If the Wikimedia Foundation starts its own archiving project, this will eliminate the need to rely on external services and will be beneficial for Wikipedia's long-term reliability, as WP:LINKROT will always be a threat to Wikipedia. It should also be checked whether the Wikimedia Foundation could take over WebCite, as this would enable the Wikimedia Foundation to use an already developed and efficiently working archiving technology, but I don't know if such a move would be possible. Toshio Yamaguchi (talk) 20:06, 5 February 2011 (UTC) switched vote to WebCitation below. Toshio Yamaguchi (talk) 23:59, 9 February 2011 (UTC)
Comment: I am strongly opposed to Wikipedia creating its own archival project without trying to build on what's already there (WebCite). WebCite is actually not opposed to any form of takeover or absorption by WMF - WebCite is a non-profit open source project - as long as long-term sustainability is guaranteed by WMF, and the broader goals of the WebCite project are being pursued, which is to create a universal standard on how webreferences should be cited and archived - not just within Wikipedia, but also in scholarly journals, books, etc. Pursuing these broader goals (apart from solving the short-term problem of linkrot within wikipedia) - thus supporting WebCite - should also be in the interest of Wikipedia, as Wikipedia not only "cites", but is also (and should be more) "cited" (but as long as editors don't have trust in the stability of webreferences, it is not cited enough, as webreferences are generally discouraged). Thus, there is some value in keeping WebCite a stand-alone project, supporting WebCite (intellectually and with volunteers), and to aim for a close cooperation between WebCite and WMF, rather than reinventing the wheel or absorbing WebCite into WMF - not least for legal reasons (decreasing WMF's liability) but also for operational reasons (who is going to deal with DMCA takedown requests?) --Eysen (talk) 16:17, 8 February 2011 (UTC)
With "takeover of WebCite by WMF" I mean considering letting some of WMF's money flow into WebCite to keep it alive. And if WMF can afford that, why not for example provide an option in the toolbox on the left side to directly archive references when working on an article? I don't think that would harm WebCite in any way (besides, it oviously seems to be more harmful to WebCite to run out of money and cease operation). I don't intend WebCite to become a WP-only tool. Instead, I think it should keep operating as it has done in the past, simply with the additional option for WP editors to use it directly from the toolbox. Toshio Yamaguchi (talk) 21:43, 8 February 2011 (UTC)
  • Support. Wonderful idea. It's time to put some of that donation money to good use. With the mass reliance on online reference links it's time we take responsibility for making sure our articles stay verifiable in the long term. I'd also like to go on record as supporting the other two proposals below, as any solution would be better than the status quo, but this option is the most ideal IMO. -- œ 00:02, 6 February 2011 (UTC)
  • Support - This is the most reliable solution. Also agree with OlEnglish, I would support any improvement. What we have now is failing and getting worse by the day. - Hydroxonium (H3O+) 00:22, 6 February 2011 (UTC) switched !vote to Wikiwix below. Hydroxonium 22:20, 7 February 2011 (UTC)
  • Support - Most ideal good solution, if it's cost-effective. Mr.Z-man 01:35, 6 February 2011 (UTC) modified Mr.Z-man 21:48, 9 February 2011 (UTC)
  • Support This is what should happen, I don't know if it will though. --nn123645 (talk) 02:44, 6 February 2011 (UTC)
  • Support Incidentally, this would dovetail nicely with the Wikidata idea; the sources for such data need citing, and the bibliographic data for the citations is just another type of data to keep track of. And 2 different articles citing the same source could point to the same Wikicite/Wikidata entry, making updating easier, etc. --Cybercobra (talk) 05:06, 6 February 2011 (UTC)
  • Support. Excellent idea. SlimVirgin TALK|CONTRIBS 15:07, 6 February 2011 (UTC)
  • Comment I agree that this is an excellent idea, but as the whole Wikimedia project is based on free content, I'm not sure that they would be able to archive copyrighted webpages. AFAIK no one has brought internet archives to court, but in the same way that we couldn't create our own archive at, say, Article/Source archive, the WMF probably can't host the archive itself. SmartSE (talk) 12:23, 7 February 2011 (UTC)
There have been a few (see Internet Archive#Controversies and legal disputes). They usually remove pages if the owner asks. This will certainly be a tricky area if we do it ourselves. - Hydroxonium (H3O+) 12:54, 7 February 2011 (UTC)
Under law (I'm paraphrasing and simplifying a lot here), you can by default archive/cache anything that is not explicitly marked as noarchive/nocache or otherwise requested not to be archived. In short, legally an archiving service acts as a library, thus not in breach of copyrights. —  HELLKNOWZ  ▎TALK 14:23, 7 February 2011 (UTC)
Thanks for the info, it certainly shows it is a bit of a legal grey area. My main point though is that the WMF's mission is to provide free information, rather than hosting archives. SmartSE (talk) 14:51, 7 February 2011 (UTC)
Considering that the Foundation is aware that non-free content is needed to build an encyclopedia, per the Licensing policy, I can easily see the same approach being applied to the equivalent of a caching process. --MASEM (t) 15:07, 7 February 2011 (UTC)
  • Support, although I'm a bit skeptical that WMF will go this route in the near future. This does seem like the best long-term option though. —  HELLKNOWZ  ▎TALK 14:23, 7 February 2011 (UTC)
  • Support (conditionally, as I would like to hear if there are legal reasons why it might be better to use an independent service) --SPhilbrickT 18:15, 7 February 2011 (UTC)
  • Comment. It is surely a good idea, but the WMF already has way too many things to do. I'm not sure it would be their priority. Cheers, Dodoïste (talk) 19:38, 7 February 2011 (UTC)
    • The WMF is funded by the community. And their main goal is to operate the sites. This is something that the community wants and is necessary to maintain the integrity of the site in the future. If this weren't a priority, then their priorities are hugely misaligned. Mr.Z-man 21:05, 7 February 2011 (UTC)
      • There are tons of things that the community wants and the WMF can't provide. I shouldn't have spoken in terms of "priority". Of course supporting the community is among their priorities. I only meant that they have an incredibly small amount of staff and money for a website in the top ten most visited. Facebook has 1,400 employees, and Wikimedia doesn't even have 50 employees, I guess. Wikimedia barely has what it takes to keep its website up and running. That's why I wouldn't count on them to set up an archive. Cheers, Dodoïste (talk) 21:58, 7 February 2011 (UTC)
  • Oppose if we try to reinvent the wheel. A formal collaboration with, or a takeover of, an already existing service will solve the problem more quickly and efficiently. Archiving is not the WMF's core business; this needs to be outsourced. Steven Fruitsmaak (Reply) 19:58, 8 February 2011 (UTC)
  • Oppose The one concern I see is WRT legality. Keeping it separate would provide some protection. A collaboration with WebCite seems like a good idea.Doc James (talk · contribs · email) 13:46, 9 February 2011 (UTC)
  • Unsure, best keep separate - as media move to "paid view" we can expect an eventual confrontation over links kept for verification. The "search engine" defense or s.230 may be valid. Really we need a WMF legal opinion on how best to tackle the issue that the verifiable material we rely on is owned by others, who will often wish to monetize it behind paywalls, or it may have link-rotted behind those paywalls when needed in 50 years' time. And what about sources that were not publicly reachable without subscription in the first place? FT2 (Talk | email) 02:46, 17 February 2011 (UTC)

WMF partner with WebCitation.org

  • Support The lead developer has expressed interest in this. It's already widely used and already built. Why rebuild something that already works? Tim1357 talk 21:09, 5 February 2011 (UTC)
  • Weak oppose - I'm not a huge fan of the WMF investing in another project, especially one that has already failed once. If we go this route, publishing the source code for the service should be a requirement. Mr.Z-man 22:43, 5 February 2011 (UTC)
  • Weak support. Quite like the principle of using an established solution which is pretty good, but it does depend on how it would be done. If in the end the option (whatever its details turn out to be) appeals to WMF, that's good enough for me. Rd232 talk 03:50, 6 February 2011 (UTC)
  • Support since it would mean less duplication of effort, and would also preserve the already giant database at WebCitation. Of course, "Any of the Above" is ultimately my support...something must be done, this is just the most efficient. Huntster (t @ c) 20:23, 6 February 2011 (UTC)
  • Support it would seem relatively simple to get a bot to archive new links that are added and then add |archiveurl= and |archivedate= parameters to templates, or add them as hidden comments if the references are not suitably formatted. It may be sensible, though, to exclude sites that we know are either archived at archive.org or remain live for years (e.g. BBC, Guardian, DOIs etc.) so as not to waste resources. SmartSE (talk) 12:27, 7 February 2011 (UTC)
    • The bot isn't the problem, we actually had a working bot. The problem was that the WebCitation infrastructure couldn't cope with the load. Mr.Z-man 21:07, 7 February 2011 (UTC)
WebCite had some hosting/provider issues when the first WebCitebot was tested, and has since moved to a new server --Eysen (talk) 15:50, 8 February 2011 (UTC)
  • Strong Support: A memorandum of understanding between WebCite and WMF is the way to go. This doesn't necessarily have to cost WMF a penny - WebCite and WMF could jointly go for third-party funding (granting agencies etc, which have funded the work of WebCite so far), if needed. WebCite is open source and very interested in working with the Wikipedia community on this, to the point that WebCite offers office space in Toronto for any volunteers willing to take this on. WebCite has archived 50 million snapshots, and feeds content into IA and other long-term preservation partners. --Eysen (talk) 15:50, 8 February 2011 (UTC)
  • Support If it does the trick and is under a similar license why not. Seems like a fast solution to our problems. And they are Canadian so are definitely good :-) I wonder if someone from Wiki Canada would be interested in working with WebCite? I know we have a number of member in the area. Doc James (talk · contribs · email) 13:47, 9 February 2011 (UTC)
  • Strong Support WebCite already has a well functioning archiving technology, thus WMF doesn't need to develop or buy new, untested technology. A close collaboration would be beneficial for both projects. Wikipedia gains access to WebCite archiving service and WebCite gets a strong partner. I would like to see the ability to integrate the WebCite archiving interface directly into Wikipedia (perhaps into the toolbox). Toshio Yamaguchi (talk) 23:59, 9 February 2011 (UTC)
  • Support. This looks like the best solution. Axl ¤ [Talk] 10:46, 10 February 2011 (UTC)
  • Strong support due to clear synergies - Webcitation benefits from WMF support, WMF benefits from an external citation system. Could we ask Webcitation to use a live backup server in a different location, if we paid for it? FT2 (Talk | email) 02:46, 17 February 2011 (UTC)

WMF use Internet Archive's Archive-It service

  • Conditional support - Iff this is cheaper than running our own service or the WMF isn't willing to host it. If we're going to pay someone else to do it, we should at least go with a proven solution. Mr.Z-man 01:35, 6 February 2011 (UTC)
  • As above. SlimVirgin TALK|CONTRIBS 15:07, 6 February 2011 (UTC)
  • What Z-man said --Cybercobra (talk) 05:18, 7 February 2011 (UTC)
  • The Wayback Machine crawls the web itself and only misses obscure or weird URLs. I don't think they actually do on-demand instant archiving, like WebCitation. There is a weird URL selection method based on Alexa's site list. Archive-It seems to be aimed at specific collections, and random sites would probably not qualify. I really doubt that WMF will get involved with this directly. Wayback is also very unlikely to change their 6+ month listing period, due to technical reasons. —  HELLKNOWZ  ▎TALK 14:33, 7 February 2011 (UTC)
  • Wouldn't this be an argument against using Archive-It, rather than a support? Huntster (t @ c) 19:42, 7 February 2011 (UTC)
  • Well, I never actually explicitly said I either support or oppose this; just some personal comments. —  HELLKNOWZ  ▎TALK 10:48, 9 February 2011 (UTC)
Before you say things like 'random sites would probably not qualify', shouldn't you actually find out? I didn't receive any encouragement when I was emailing with Kristine Hanna, but neither did she say that Wikipedia's external links were obviously inappropriate for Archive-It. In fact, she said "Yes, based on the initial requirements you included in this email, Archive-It could be a fit." (But then went on to say this should be discussed at 'a high, strategic level', at which point I basically gave up.) --Gwern (contribs) 19:49, 7 February 2011 (UTC)
That's why I said "probably", based on their site's info -- "Subscribers develop their own collections" -- I am unsure how this is to be interpreted and whether all reference external links on WP would qualify as a collection. I was under the impression they archived collections, such as, "state archives, university libraries, federal institutions, state libraries, etc..." which to me seemed like WP external links are too different in nature to all immediately qualify. That's why I said "probably", and it seems this may not be the case after all. 10:48, 9 February 2011 (UTC)
  • Comment - Update from Kristine. She says Archive-It would be happy to work with WMF in some capacity. The archived websites are available 24 hours after they crawl the site. The 6 month lag is when archived sites are added from Archive-It to the Wayback Machine. Pricing depends on the total amount of data stored per year; they run test crawls to get an estimate, then price it for 5+ years. She says they run best when crawling 500 sites at a time and Wikipedia might be stretching their service, but they have other options we could explore. They are looking for somebody to be a contact between Archive-It and Wikipedia.

@Gwern, you have a HUGE amount of experience here and know the inner workings and such, and seem to be the obvious choice. Would you mind being the contact, please? Or could you suggest somebody else if you'd rather not be the contact? I don't have the experience needed to take this on. Thanks. - Hydroxonium (H3O+) 06:44, 8 February 2011 (UTC)

Use Wikiwix like fr.wikipedia does

  • Support - fr.wikipedia has been using Wikiwix for almost 2 years and has most of their WP:LINKROT problems solved (see #fr.wiki archiving method above). This seems to be a proven solution, so I am going with this. - Hydroxonium (H3O+) 22:15, 7 February 2011 (UTC)
  • Support - I (fairly obviously) agree that this is the best option, since it has a proven track record and should require relatively little work to get running. SmartSE (talk) 22:30, 7 February 2011 (UTC)
  • Support as a seemingly ready-made option. Huntster (t @ c) 09:11, 8 February 2011 (UTC)
  • Support - This now seems like the best solution, especially since it's essentially available now. Mr.Z-man 21:35, 9 February 2011 (UTC)
  • Support as it's proven and in place today. FT2 (Talk | email) 02:46, 17 February 2011 (UTC)
  • Support - A simple (and proven) solution. [[CharlieEchoTango]] 05:46, 17 February 2011 (UTC)

Any of the above

  • By all means take into account pros and cons of options as discussed above, but please do something, WMF. Anything is better than nothing (the status quo). Rd232 talk 03:50, 6 February 2011 (UTC)
  • Strong support I am not informed enough to decide which of the proposals is the best, but I strongly agree WMF should take action. Yoenit (talk) 18:13, 6 February 2011 (UTC)
  • Anything that works would be OK; whatever is most efficient. However -- do we know the copyright implications of archiving other people's websites? Short Brigade Harvester Boris (talk) 22:51, 6 February 2011 (UTC)
    • Google has been sued for page caching and won (Field v. Google), so it appears to be allowed. I would suggest we leave the copyright concerns to the foundation lawyers, who get the final say about this anyway. Yoenit (talk) 23:06, 6 February 2011 (UTC)
WebCite is getting takedown requests on a daily basis and acts on all of them, immediately. Takedown means they are no longer displayed publicly, but the snapshots are still in a dark archive, for scholars or for legal purposes, if they need access to the material. DMCA provides a safe harbor for providers such as WebCite, but having one full time person dealing with DMCA takedown requests is a requirement (another reason for why WebCite should be used, instead of reinventing the wheel). --Eysen (talk) 16:05, 8 February 2011 (UTC)
Wikiwix also has an easy system to request the takedown of a website. Dodoïste 16:45, 8 February 2011 (UTC)

Nearly Finished

Hi there. I'm nearly done rewriting a bot to do WebCiteBOT's old job. Just about everything is written, and I was going to finish up today, but they took down the database server (that was housing the links to archive) and it won't be back up until tomorrow. I'll be sure to post a link to some source code when I'm done fooling around with it. Here are the settings I've been using (they are not necessarily the settings I'll use when the bot goes live; a rough sketch of this flow follows below):

  • After the link is inserted, wait 48 hours to archive it (to make sure spam links have time to be removed)
  • Check to make sure the link is used in a reference. If not, do not proceed.
  • If there exists an archive that is less than 200 days old, use that instead of making a new archive.
  • Else, request a new archive.
  • Wait one hour to let WebCitation archive the link.
  • Verify that the link has been successfully archived.
  • Insert the archive into the article in question.

-- Tim1357 talk 01:56, 8 February 2011 (UTC)
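A minimal sketch of that workflow, using the thresholds from the list above; the Link class and the three helper stubs are placeholders for the real wiki and WebCite plumbing, not the actual bot code:

```python
import time
from dataclasses import dataclass
from datetime import datetime, timedelta

INSERTION_DELAY = timedelta(hours=48)    # let spam links get reverted first
ARCHIVE_MAX_AGE = timedelta(days=200)    # reuse archives younger than this
WEBCITE_WAIT = 60 * 60                   # give WebCite an hour to take the snapshot

@dataclass
class Link:
    url: str
    added_at: datetime
    in_reference: bool                   # is the link used inside <ref>...</ref>?

def find_existing_archive(url):          # placeholder: look up a prior snapshot
    return None                          # -> (archive_url, archive_date) or None

def request_new_archive(url):            # placeholder: submit the URL to WebCite
    pass

def insert_archive_into_article(link, archive_url):  # placeholder: edit the article
    pass

def process_link(link):
    if datetime.utcnow() - link.added_at < INSERTION_DELAY:
        return                           # 1. too fresh; revisit later
    if not link.in_reference:
        return                           # 2. only archive links used as references
    existing = find_existing_archive(link.url)
    if existing and datetime.utcnow() - existing[1] < ARCHIVE_MAX_AGE:
        archive_url = existing[0]        # 3. reuse a recent snapshot
    else:
        request_new_archive(link.url)    # 4. ask for a new snapshot
        time.sleep(WEBCITE_WAIT)         # 5. give WebCitation time to archive it
        existing = find_existing_archive(link.url)
        if not existing:
            return                       # 6. archiving failed; try again another day
        archive_url = existing[0]
    insert_archive_into_article(link, archive_url)    # 7. write the archive link back
```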

From what I recall, the problem was not getting a working bot, but getting a working archive service. Mr.Z-man 02:44, 8 February 2011 (UTC)
Was it that the archive service was down too much, or that it could not handle the load? Tim1357 talk 03:54, 8 February 2011 (UTC)
I have the same recollection - it was slow or down enough of the time to be unreliable. I don't know how well it would scale. If we're going to use an outside service, wikiwix seems more reliably guaranteed to be remirrorable and available for improvements through lots of connections in the Wikipedia community. SJ+ 19:50, 9 February 2011 (UTC)
Step 2 sounds dangerous and useless. All external links are of value, not just links which wikignomes have wrapped in the subset of templates you have whitelisted. It introduces complexity. Further, it kills motivation to use it by introducing uncertainty and value judgment ('Were my URLs archived? I have no idea, so why bother checking? It's not like I really want to spend time being a good doobie and checking anyway.') All to save a few bucks of disk space? --Gwern (contribs) 03:27, 8 February 2011 (UTC)
The bot is compatible with all citation templates and also bracketed links. Therefore a link just needs to be between two <ref> tags. It's easy to change that if we decide to include external links that aren't in references. Tim1357 talk 03:54, 8 February 2011 (UTC)
OK, I think it should be changed now before inertia locks in the choice. No previous bot I am aware of has done such a thing; there has been no clamor to discriminate against non-<ref> links; and it comes with clear costs and unclear benefits. — Preceding unsigned comment added by Gwern (talkcontribs)
The original webcite bot worked in the same fashion if I recall correctly, from the original BRFA " After 24 hours have passed it will go back and check the article to make sure the new link is still in place and that the link is used as a reference (i.e. not as an external link)." --nn123645 (talk) 09:33, 9 February 2011 (UTC)

Reboot this as an RFC and list it at {{cent}}

This discussion started with the goal of drafting a proposal for the Wikimedia Foundation to set up or fund an external link archiving project. The method used for the French Wikipedia does not require the Foundation to become (financially) involved, and I honestly can't imagine the Foundation would be willing to invest time and money into something when a ready-made and free alternative is available. I therefore propose we reboot this discussion as an RFC, listed on {{cent}}, on whether we should start implementing the French archiving system and how. Yoenit (talk) 10:13, 8 February 2011 (UTC)

I guess that's a good idea as implementing the system would be a pretty major change and so it is better to get a strong consensus to implement it first. Shall we wait until we have confirmation from Pmartin though, that they will actually be able to do it? SmartSE (talk) 12:42, 8 February 2011 (UTC)
I have received confirmation from Pmartin that he can set up the English archive soon. He can technically begin the crawl anytime, and about two or three weeks after that the archive should be up and running. However, Pmartin requests a strong consensus from the community regarding the implementation of Wikiwix before he sets up the archive. Which is understandable, considering the amount of maintenance it requires over the long term. Dodoïste (talk) 19:22, 8 February 2011 (UTC)
This is a good idea. SJ+ 19:47, 9 February 2011 (UTC)

Judging from the willingness of Wikiwix, WebCitation and IA to collaborate with Wikipedia, I think you can fairly assume that any of them would be interested in becoming Wikipedia's preferred archiving service. Wikiwix and IA have commercial interests, and WebCitation might benefit from improved sustainability and credibility. In any case, I think the WMF needs to be involved and the wider community needs to give input, because this would be an enormous change with an impact on every single article of Wikipedia, giving whichever service we finally choose a dramatic increase in visibility. Steven Fruitsmaak (Reply) 19:55, 8 February 2011 (UTC)

My English is bad, but perhaps someone could translate for me.
Voici les statistiques de consultation du cache wikipedia francophone. Cela ne représente pas énormément de visiteurs ([1]) donc la visibilité de wikiwix n'est pas notre critère. En fait nos serveurs sont hébergés dans un datacenter universitaire ( http://crihan.fr ) car nous faisons partie des sociétés innovante de ma région. Mes aspirations ne sont pas d'ordre financière sinon en aucun cas nous aurions monter le projet pour la partie francophone et encore moins ( http://okawix.com cf : http://www.elearning-africa.com/newsportal/english/news217.php ) , Linterweb rentrant de l'argent par le biais de la prestation. En fait lorsque j'ai du monde qui ne travaille pas pour des clients je l'ai fait travailler sur les outils pour la communauté wikipedienne. Cela permet de collaborer à wikipedia tout en permettant à mes salariés de voire des aspects comme python, javascript, en attendant de trouver un client. Cordialement Pmartin (talk) 21:41, 8 February 2011 (UTC)
(free) translation:
Please find thereafter the statistics concerning the Wikiwix cache system for the external links of the French speaking Wikipedia. As you may see, the figures (numbers of visitors, of clicks...) are not that important. Promoting Wikiwix is not our goal, in fact. Our servers are hosted in an university data center (http://crihan.fr), as in our region we are considered to be an innovative company. We're not motivated by an aspiration of a financial nature, else we could not have run the project for the French speaking Wikipedia archive system, and even less the Okawix project (http://okawix.com, cf.: http://www.elearning-africa.com/newsportal/english/news217.php). Actually, Linterweb makes money by placing its employees (usually programmers) with businesses on a temporary basis. But, when some of my programmers are not working for any customer, I have them work on tools of the Wikipedia community. It's a win-win activity: it's good for Wikipedia, it's good for my employees (and therefore for the firm), who can thus train and develop their skills on technologies like Python, JavaScript... while they are waiting to be assigned a new task by a new customer. Kind regards.
:-) 78.250.226.62 (talk) 15:58, 9 February 2011 (UTC)
  Done - I've created the RfC at Wikipedia:Requests for comment/Archived citations and listed it at {{cent}} - Hydroxonium (H3O+) 14:58, 11 February 2011 (UTC)