Someone asked for my thoughts on this early draft of “Citation needed? Wikipedia and the COVID-19 pandemic” at the preprint repository bioRxiv. I look at lots of such papers and often have the same thoughts about them, so I thought I would share my views here and outline how I think about medical research on Wikipedia nowadays.
This is an up-to-date May 2021 scholarly review of how Wikipedia presents information about COVID-19. Such information matters because Wikipedia is an extremely popular source of medical information; because throughout this pandemic Wikipedia has been the most requested, published, accessed, and consulted source of information on COVID-19 in particular; and because collectively we public health educators need to coordinate our response on how we are going to use Wikipedia to achieve communication goals. I like this paper, and it is as good as it gets for these kinds of medical reviews from health authorities who are not insiders to the Wikipedia editorial community.
The Wikipedia community does not itself publish good status reports giving the basic facts which researchers would need to put their evaluations into context. The lack of that factual context is in my view the biggest barrier to making sense of this COVID paper, and of course it is no fault of the authors of the COVID paper, because that data does not exist in publication. While for the entirety of Wikipedia it is possible to get reports of massive numbers, such as the entirety of Wikipedia having 6 million articles and 200,000 active editors, more commonly researchers would like to run a report for some subset of Wikipedia. In this case the researchers would have benefited from a medical status report explaining how many medical articles Wikipedia has, how many citations they use, how many people are editing them, and how much traffic those articles get. No such report exists. The authors do the best they can with available information.

Some problematic statements in this preprint include the claim that “Wikipedia has over 130,000 different articles relating to health and medicine”, which is accurate, but the breakdown would surprise most people: half of that count is non-English, much of it is local social topics like biographies or organizational profiles, and quite a lot of it is alternative names or definitions. For this study I expect the researchers would prefer to separate medical science from medicine in society. Another statement is that “medical professionals are active consumers of Wikipedia and make up roughly half of those involved in editing these articles”, which cites casual research from 2011 and 2013. While I believe this anecdotally, and while I recognize the validity of past small surveys, I am uncomfortable with how frequently and seriously researchers and policymakers emphasize these reports, and I want another, more robust survey.
In Figure S4 the authors present data visualizations about Wikipedia’s audience traffic and editorial change rates, which I know are very important, but Wikipedia does not publish baseline norms establishing what kind of traffic and editing patterns are normal in various situations. The effect of this missing context is that the researchers’ very good measurements and insights are difficult to interpret for anyone without experience looking at Wikipedia metrics.
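Anyone who wants to inspect this kind of traffic data for themselves can pull it from the public Wikimedia REST API. The sketch below only builds the documented “pageviews/per-article” request URL; the article title and date range are my own examples, not values from the paper.

```python
def pageviews_url(article: str, start: str, end: str,
                  project: str = "en.wikipedia") -> str:
    """Build the Wikimedia REST API URL for daily pageviews of one article.

    Dates use YYYYMMDD format; spaces in article titles become
    underscores in this API.
    """
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    title = article.replace(" ", "_")
    return f"{base}/{project}/all-access/all-agents/{title}/daily/{start}/{end}"

# Example: daily traffic to the main pandemic article for March 2020.
url = pageviews_url("COVID-19 pandemic", "20200301", "20200331")
print(url)
```

Fetching that URL (for instance with any HTTP client) returns a JSON list of daily view counts, which is the raw material behind visualizations like the authors’ Figure S4 — but without published baselines, each reader still has to judge for themselves what counts as normal.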
This paper describes a plan to assess the quality of Wikipedia’s articles on COVID by considering how many citations to scientific sources they have. This is a workable idea, but I think the authors anticipated that Wikipedia’s perspective on COVID would be medical rather than social. The authors found 1,500 articles which Wikipedia categorized as COVID-related and remarked that only 149 of them cited scientific sources. The information I am missing to put this in context is some assessment of how many of the 1,500 articles ought to be citing a science paper; perhaps this 10% is the appropriate amount. My expectation is that the majority of Wikipedia’s COVID content would be non-medical and non-scientific, and instead sociological. I recognize that there is a traditional societal expectation that disease is the field of health experts, but nowadays matters of citizenship like public health are more participatory. At the start of this pandemic I wrote an article in The Signpost describing how public health in COVID meant publishing information about cruise ships which were sites of outbreaks. The cruise ship article I mentioned cites nautical reports; newspapers in Chinese, English, and Japanese; vacation and leisure magazines; and heavy machinery trade journals related to the class of engine in the boat. This is a solid article which is fundamental to the COVID story, and it is an example of a Wikipedia article where the best available information sources to cite are not scientific.
I like the way the authors talk about certain Wikipedia articles being key to the development of other Wikipedia articles. This makes sense to me, but Wikipedia editors have no tools for surfacing this other than the intuition of experienced editors. When the authors identified the articles for the pandemic, the disease, and the virus as centers of development, I expected those; but when they named COVID-19 drug repurposing research, COVID-19 drug development, and Severe acute respiratory syndrome coronavirus, I was surprised. The drug development articles get the least traffic of all of those, which I expect because they are not of general interest. I looked at this content a lot when I set up the article for one of the vaccines, and now I realize that I was like lots of other editors who were also setting up COVID drug sub-articles by starting with what we found there. I remember how difficult it was for me to make sense of that content. I originally named the article for the vaccine “BNT162b2” because at the time that was the name, and right now its name is “Pfizer–BioNTech COVID-19 vaccine”. Sorting through the kind of research that gets published before drugs and concepts even have names is so tedious. Since the research shows the migration of citations from these articles to elsewhere on Wikipedia, my interpretation is that lots of wiki editors also sorted this mess independently and came to our own private conclusions, without knowing how to make it more orderly. Wikipedia is a great place for collaboration when everyone is editing and proposing ideas; I am seeing again that Wikipedia does not have good collaboration support when lots of people are quietly baffled alone in the same spaces. If those drug development articles really were hubs, then other people must have looked at the talk pages and been speechless in confusion just as I was, even while I was publishing other content.
And for the last hub – the article about a sort of coronavirus which was not the COVID-19 virus – I cannot immediately think of a reason why that article would be a source of other citations, unless it simply cited the same sources as the SARS-CoV-2 article. Elsewhere the authors talked about how new Wikipedia articles rely on the existence of older, well-established articles which frequently cite older, well-respected academic papers. I believe that also, but that kind of insight would need to come from data analysis, because it is much too difficult for any human to recognize gradual reuse of content over hundreds of articles through several years.
The authors talk about the difficulty of organizing all the publications which Wikipedia articles on COVID were citing, saying “no better solution could be found” for issues like recognizing links between preprints and their later publication; mapping the citation graph among papers; categorizing sorts of publications; and making comparisons between academic literature and either respected traditional media or non-academic expert health sources like the World Health Organization. The Wikipedia community is developing the WikiCite project to address these issues, and projects like Scholia should become tools to help users navigate sources. There is a major research deficiency in access to open and machine-readable source metadata. When humanity sorts that out, the problem will be solved forever and a new generation of research can begin, but we are still in a state of transition from paper to digital publication where fundamental problems like this regularly consume hours of research labor from researchers at a global scale.
The authors’ explanation of why Wikipedia editors lock articles is what people say, but in my view it is a misunderstanding, and not how Wikipedia editors themselves think about the issue. The description the authors give is the one that non-Wikipedia editors expect: that Wikipedia editors lock certain articles to protect quality. The reality – which non-Wikipedia editors rarely recognize – is that Wikipedia has much better systems for protecting article quality than locking out editors, and that the locks are an attempt to prevent one of a few much more boring problems. Wikipedia editors themselves would give a few reasons for applying locks, such as new users edit warring, a crowd of new users suddenly surging to edit a Wikipedia article, or sometimes a trickle of new users coming steadily over time to add content; typically the unacceptable behavior that all these cases have in common is users failing to cite a source. If I had my wish there would be much less article locking, and instead we could have software applied to certain articles which rejects the addition of claims without citations. The lock is not supposed to discourage editing exactly; instead it is supposed to make people aware that Wikipedia has a quality control process, and that if they, for example, fail to cite sources, then they fundamentally fail to recognize the nature of Wikipedia as a quality-controlled publication. Bad sources are not the issue, because Wikipedia reviewers have good processes in place for checking sources and training editors to identify good ones. Wikipedia’s locking system has always been very easy to bypass. For most editing privileges in Wikipedia, the distinction is between new editors and the trusted, experienced, mature editors who have accounts which are 4 days old and have spent about 1 hour making 10 edits.
The point to recognize is that when the seniority needed to bypass locks comes after about 1 hour of experience, the locks are not there to improve the quality of citations. Also, even for mass vandalism we have other systems which eliminate those edits, so nowadays, if there were no locks on Wikipedia, we could probably configure other systems to provide comparable protection against totally misguided edits. I will mention that we do have some rare higher levels of locks which many editors never encounter. These are in place on Wikipedia articles known to cause insanity, like those related to Israel, girls+video games, classifications of certain Hindu castes, and certain unpopular rap music albums with inexplicably devoted followings.
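The threshold I describe above can be written down as a one-line predicate. This is only a sketch of the default “autoconfirmed” rule on English Wikipedia (at least 4 days of account age and at least 10 edits); other wikis configure different numbers, and higher lock levels use stricter criteria.

```python
def can_bypass_semi_protection(account_age_days: int, edit_count: int) -> bool:
    """True if an account meets the default English Wikipedia
    'autoconfirmed' threshold and can edit most locked (semi-protected)
    articles. Assumed defaults: 4 days of age and 10 edits."""
    return account_age_days >= 4 and edit_count >= 10

# A prolific brand-new account still cannot edit locked articles...
print(can_bypass_semi_protection(1, 500))
# ...while a quiet week-old account with a handful of edits can.
print(can_bypass_semi_protection(7, 10))
```

Seeing the rule stated this plainly makes my point: a bar this low cannot be what protects citation quality, so the locks must exist for other reasons.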
This paper has my imprimatur, and my position is nihil obstat.