Arrival at university
Talked with Abhinav, Indrajit, and ?? from Bhopal, Kolkata, and Delhi. Lane committed to join the South Asian Meetup.
Talked with Mike Peel and Joseph Seddon re GLAM and Wikibase. Both of them want federated Wikibase instances to completely catalog the collections of all museums. Lane asked about the “limits of Wikidata” issue – both said that there were enough projects to do for now without committing to any particular technical solution.
Talked with Doug Taylor re infoboxes. Doug said we will talk later. Doug was recently elected secretary of Wikimedia Medicine after Shani stepped down to join the Wikimedia Foundation board.
Talked with Ad re Google’s expansion into India and Wikimedia Foundation funding. Ad said that previously Google gave stock to the Wikimedia Foundation, which the Foundation sold. Lane asked Ad to post any references about this to the English Wikipedia article titled “Wikipedia and Google”. We reviewed planning for the ASBS process and decided that we were ready with the posted agenda.
MediaWiki performance enthusiast re: limits of Wikidata
I talked with someone who has contributed to the technical development of MediaWiki’s backend performance. He insisted that I not quote him on anything. There was nothing strange, unusual, or controversial in the conversation; the issue is that, even within Wikipedia, I agree it makes sense for specialized professional project teams, with Wikimedia Foundation funding or otherwise, to have control of their own project and development. If conversations are endless and meandering then they can distract from achieving goals. Please do not take this conversation as technical insight; rather, consider it a record of my own state of understanding. I asked questions knowing nothing about this, so these notes reflect what someone like me might ask, not the views of the developer teams.
How does someone identify the limits of Wikidata? “There is not a single number which anyone can get on this because MediaWiki is not so simple.”
We can look at current performance and count the number of items. The number of statements per item varies. Also, we keep information on revisions, so just adding and removing things will fill up Wikidata eventually. There are many external factors. Various parts can break independently.
The query service has its own capacity planning, which is independent. Unlike MediaWiki, which keeps a historical index, it is just a snapshot, so it should be able to grow easily. Many factors determine its performance.
One is storage capacity independent of use – we should be able to store more just by adding more hardware. Another issue is querying.
Q: if we double the hardware, then do we double the capacity for people to use Wikidata?
Suppose we have one unit of content. We can determine the space for this. Now we have to have a server for a certain number of users, maybe 1000. The server determines the number of users, and this is irrespective of the storage capacity. Every server will have to have a mirror of the storage. It is like books in a library – if people in various places all want a copy of a book, then they each need a library and the libraries each need a copy of the book. As the query service becomes more popular we will need more space.
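To make the arithmetic of that library analogy concrete, here is a minimal sketch with made-up numbers (both figures below are hypothetical, not numbers from the conversation): adding servers multiplies user capacity, but every new server must mirror the full dataset.

```python
# A rough sketch of the "library" analogy with hypothetical numbers.

DATASET_TB = 1.0          # size of one full copy of the query service data (assumed)
USERS_PER_SERVER = 1000   # concurrent users one server can handle (assumed)

def capacity(servers: int) -> dict:
    """Each server mirrors the full dataset and serves a fixed pool of users."""
    return {
        "concurrent_users": servers * USERS_PER_SERVER,
        "total_storage_tb": servers * DATASET_TB,  # every 'library' needs its own 'book'
    }

print(capacity(10))  # {'concurrent_users': 10000, 'total_storage_tb': 10.0}
print(capacity(20))  # doubling servers doubles user capacity and doubles mirrored storage
```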
Ad H.: How many servers does the WMF have now?
Enthusiast – Between 500 and 1,000, but 20 of those servers handle 90% of the traffic. This is through Varnish, which keeps a copy from the last few minutes. The rest of the servers are all handling up-to-date edits, which then get transferred to those 20 servers. This system works because only a few articles trend globally; if all the articles were equally popular then this system would not scale or work in this way.
There was an incident – the “Michael Jackson incident” – which prepared us for various challenges. When he died, it was a big event which many people learned about in real time. Before Twitter and Facebook, people would find out the next day through the newspaper; now people immediately want to look for information in Wikipedia. Typically with news there is a 24-hour news cycle where each time zone gets the news at its own usual time. With Michael Jackson’s death everyone wanted news in Wikipedia at one time. Perhaps at the time there were 1 million hits a second. Now in 2019 we get 7-8 million hits a second at peak times every day. There had been traffic surges before, but not like this. Also, it was not just a surge of people looking at the article, but also a surge of people editing, with many saves submitted before any one was resolved, creating many conflicts. There is an edit conflict protocol where we can detect this and treat it in a certain way, but if many people make save requests at the same time for the same article we could not resolve it. We were still using PHP5, which took 10 seconds for a save, and since then we have switched to Lua. Before Michael Jackson we would process every edit in full – the search index, the category tree, and everything else. Now we have a “pool counter” where everyone who saves at the same time gets logged in the history, but only the latest edit gets fully processed for the search index and the wikitext.
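As a rough illustration of the pool-counter idea as I understood it (a toy Python sketch, not MediaWiki’s actual PoolCounter implementation), a burst of saves to one page all land in the history, but only the newest pending revision triggers the expensive reparse and reindex work:

```python
import threading
from collections import defaultdict

# Toy sketch: bursts of saves to the same page are collapsed so that only the
# latest revision is reparsed and reindexed, while every save still enters history.

history = defaultdict(list)   # page -> list of saved revisions
pending_latest = {}           # page -> newest revision awaiting processing
lock = threading.Lock()

def save_edit(page: str, revision: str) -> None:
    with lock:
        history[page].append(revision)    # every save lands in the history immediately
        pending_latest[page] = revision   # older pending revisions are superseded

def process_pending(page: str) -> None:
    """Run the expensive pipeline once, for the newest pending revision only."""
    with lock:
        revision = pending_latest.pop(page, None)
    if revision is not None:
        reparse_wikitext(page, revision)
        update_search_index(page, revision)

def reparse_wikitext(page, revision):     # stand-in for the heavy parsing work
    print(f"reparsing {page} at {revision}")

def update_search_index(page, revision):  # stand-in for reindexing
    print(f"reindexing {page} at {revision}")

# A burst of conflicting saves: all three enter the history, one reparse runs.
for rev in ("r1", "r2", "r3"):
    save_edit("Michael Jackson", rev)
process_pending("Michael Jackson")        # reparses only r3
```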
We were processing too many edits simultaneously and we could not handle the traffic. At that time, when this happened we would delete the cached entry and try to rewrite it, and since many people were requesting that same article, the backend was suddenly exposed directly to the world with no mitigation. Nowadays we keep the old version around until a new version is in place. When someone saves an edit it is technically in the history immediately, but we only show it to readers after the cascade of changes is fully resolved. The first measure we took was cascade prevention. The second measure was disabling editing entirely during the surge of edits.
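The stale-until-rebuilt mitigation can be sketched the same way (again a toy illustration, not the production Varnish or MediaWiki code): the cached render is only replaced after the new one is fully built, so readers keep getting the old copy instead of stampeding the backend.

```python
import time

# Toy sketch of "keep the old version around until the new one is in place".

cache = {}   # page -> rendered HTML

def render(page: str) -> str:
    time.sleep(0.1)                      # stand-in for an expensive rerender
    return f"<html>{page} rendered at {time.time()}</html>"

def on_edit(page: str) -> None:
    new_html = render(page)              # build the new version first...
    cache[page] = new_html               # ...and only then swap out the old one

def read(page: str) -> str:
    if page in cache:
        return cache[page]               # readers keep getting the existing copy
    html = render(page)                  # cold start: render once and cache
    cache[page] = html
    return html
```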
Tim Starling did this at the time; the enthusiast said that he read about it afterwards. Tim had to take an initial measure, and everything was resolved in 6 hours.
With Wikipedia a very small number of articles account for almost all traffic. With Wikidata, a larger diversity of items gets called. Hard drive storage space is the easiest thing to add. The reason the query service goes down is the number of queries and the complexity of queries; the storage space should not matter so much.
The complexity of what can be in the system is a factor in its efficiency: the more complex the things which have to exist, the more burden on the system. However, the number of items in storage and their complexity, such as the number of statements per item, should not be a significant factor in the performance of the system.
With MySQL a person has a database on disk somewhere. The database will try to keep in memory what is commonly used and will write back to disk what is updated.
When will Wikidata have capacity for 500 million items? Enthusiast’s thought – do not worry about scaling WDQS, it will scale easily. Instead worry about MediaWiki infrastructure.
The answer depends on how MediaWiki tracks dependencies between Wikidata items and Wikipedia articles. German engineers who work on Wikibase will probably know more about that. Someone at the Wikimedia Foundation would be able to comment on the expected growth rate. People who know more might be Alaa Sarhan and Adam Shoreland.
Jouni Tuomisto re Wikibase for health data
Jouni Tuomisto participates in Open Knowledge Finland. He has environmental health data from NIH and a MediaWiki installation, and wants to format his content into a Wikibase installation. He is seeking to join discussions on the structure of Wikibase.
I first met Jouni in 2016, at the 15th Wikipedia anniversary, when he visited New York with his son and we met at an art museum. His son, who was about 15 at the time and has now started university, was not at Wikimania.
Joe Sutherland re: WMF Trust and Safety
Joe is Scottish but has been living in San Francisco for some time.
He is on the Wikimedia Foundation Trust and Safety team. We talked about the goal of “personal care”, where somehow the Wikimedia Foundation can develop online resources which users can find to assist themselves without additional human labor intervention.
Joe is the Wikimedia community point of contact for OTRS agents, of which I am one. I had addressed an OTRS ticket from a particular individual who was in the media for a scandal and who had many objections to how Wikipedia presented them. This general circumstance is common, as many public figures in scandals object to Wikipedia’s summary of the scandal and its cited sources much more than they object to the original media coverage of the scandal itself. Joe had looked at this case, which is an unusual move. He did not comment on my activity, and anyway it would be inappropriate for him to do so. If he talks to me at all then he has to feel safe that I am not crazy, so that is the baseline for context here.
I mentioned the universal code of conduct, which is a T&S ambition, but that is not his team. I mentioned the suicide project with Sherwin, and again that is not his team.
Lane asked that someone from the Trust and Safety team join Wiki LGBT+ meetups at this Wikimania and in general. Lane remarked that safety issues are a continual topic of discussion for LGBT+ editors and that this community wants representation in policy making regarding safety and support. I am sure of his awareness and understanding.
chat with Tim Moody regarding IIAB
Tim Moody, the electrical engineer who manages Internet in a Box, told me about the first electrical generation station, which was at Niagara Falls. Originally electricity was corporate rather than public infrastructure, ran on DC, and everyone had their own standards. At Niagara Falls someone set up a generation station to provide power generally.
Tim and James presented IIAB in Geneva. They had a booth at a conference. The goal was evangelism for the concept, and James gave a talk.
The thing which happened there that is most relevant to Wikimania 2019 is that they were on a panel with the WHO and Doctors Without Borders – in the end none of the organizations would share content because of copyright. Tim advocates putting more pressure on organizations to give stuff. Hesperian, for example, seemed like a good partner, but they insist that no money can be generated because of their donor structure: even selling devices at cost or below cost has to be prohibited as a prerequisite of partnership. All of these organizations will say that anyone can have their content, but none of them will say that we can distribute the content, and distributing it in the open is an additional taboo.
“I was more exposed to this in Geneva than I had been before, and I am hoping to discuss this more at Wikimania 2019.”
The news for Internet in a Box is that there will be a new version, 7.0, released soon. Previous versions are documented on GitHub. The previous product was called SE, which came from OLPC; when the rename happened, version 0.5 of that system became version 5.0. Anyone who goes to the GitHub repository for Internet in a Box will see that these are formal releases with documentation.
For version 7, new services will come in a couple of different areas. It will include a phone system. A member of the network who works in mesh networking wants to add a closed-system mesh network service with phone and email. The team is working with the Internet Archive; the contact is Mitra Arnab. They are seeking to provide an offline collection based on their private paid archive service. Two people are working on this project. The mesh people are not part of an organization and are long-time IIAB contributors going back to the OLPC days. Anish is in Himachal Pradesh, in a mountainous region north of Dharamshala where these devices are useful. He did graduate studies in social entrepreneurship in Michigan.
Tim said that he feels that content is the most important part of a project. Anish has the philosophy of getting the devices and products and content to people as part of social empowerment.
On the content side there is that Internet Archive offline service.
There is a PBX phone service similar to Asterisk, and the team is developing a sort of local YouTube.
Tim Moody is working on the admin console for improved menus. The service for providing OpenStreetMap is now more easily accessible: the menu has a user interface that lets the user choose a region to get a map, rather than having the user get the content directly and install it on the device. The admin console is primarily for the agent who is doing deployment. They operate the admin console to select the content they want in a package, and the menu guides a person to choose all the kinds of content that they want. Previously the philosophy was one-size-fits-all IIAB packages without much personalization; with the admin console the idea is to select content appropriate to where the box is going. Tim says that this has been his main focus. The first function of the admin console was to turn services on and off. After the install process, which would offer various services, all services would be offered to everyone. There is an “Ansible playbook” which configures these services, including on/off. In about 2017 Tim moved to developing content. Now the admin console can install ZIM files, get modules from RACHEL (also known as OER2Go), and the menu system itself is easier to edit, for example to translate it into different languages.
Tim said that lots of people from Kiwix are here and had a pre-hackathon. Emmanuel is here.
chat with Emmanuel re: Kiwix
Emmanuel says there are 10 people from Kiwix at this conference. Emmanuel says Kiwix is a software suite with many things happening.
The IIAB does wifi hotspots, and now Kiwix is developing services to assist with this. The idea is that a person puts an SD card in their computer, browses to a website and chooses content offerings, then puts the SD card into the device to deploy. This means that a person without technical knowledge can set up their software package. This solution is entirely online and a person can do it from their laptop or smartphone. They go to the website, set up an image in the cloud, and either download the image onto their own SD card or save the image specification so they can order an SD card through a fulfillment service and receive it by post.
Many people here at Wikimania are working on Kiwix Android. Some people are working on the Zim Farm. This is an automatic system to create regular updates of ZIM files so that there can be up-to-date downloads of the various language editions of Wikipedia.
There is a Wikipedia WikiProject tool, the WP 1.0 bot, which helps evaluate the quality and importance of various Wikipedia articles. Right now there is a 10-year-old bot which publishes all these evaluations for English Wikipedia. The team is rewriting this tool.
This tool is fundamental to us for content package selection.
There is a Kiwix Android version, and “Kiwix serve” runs this content. With Kiwix Android there is an option to “start server” which turns a phone into a hotspot, like a Raspberry Pi. This will distribute the Kiwix installation on the phone. Then people with their own devices can connect to the phone’s wifi in the same way that people connect to the Raspberry Pi. They do not need to have Kiwix or any download; they just go to the served web page, choose an option, and get access.
The host phone must have Kiwix Android and a content package. From there, the connecting users do not need a download (this makes sense).
SJ and data science and localizing Scholia templates
Regarding the tickets for Scholia and the hackathon goal – the answer to localizing templates is ???
Ask C Scott’s team ??? Ask Amir?
I invited SJ to Charlottesville to see Wikipedia contributors at the School of Data Science at the University of Virginia. SJ and Peter Meyer talked about the upcoming 2019 Boston WikiConference and its journalism focus. SJ is in the Knowledge Futures Group at MIT. They do public knowledge graphs and PubPub (an open publishing framework). I asked about MIT’s interest in supporting other schools in their educational and research setup for software and infrastructure. SJ said that this is exactly what they do.
Amir / Ladsgroup joined Finn to assist with the localization of a template from English Wikipedia to Danish Wikipedia. Finn said there is a problem with the view function.
Amir … “it goes through all of the arguments…” “it does some formatting of the value…” “Lua is niche shaped and returns it cleaner…” “What is the problem with moving the function over? If you move it then it should work. You could also move it to another module.” “I personally think that you should copypaste everything.” “If you have write access then you can preview pages which use this.” “Go to the template page, try to edit it. Ah, you do not have the right to edit it. If you had the right, then you could have a preview and see how it works.” “You can put it in the sandbox.”
Finn: “Ah, this does not work!” “You are not an admin?” “No.” “Me neither, someone might ask me to do something.” “Yes, we are the lazy guys.”
Max Klein and research ethics
Max is at CivilServant, a nonprofit organization which supports “citizen behavioral science” and has a special interest in helping Wikipedia community members organize their own research. He told me that he was developing a version of user:HostBot on English Wikipedia by cloning it and varying it to do other services which are practical and also serve as research case studies.
Using ORES-powered machine learning, the bot invites cohorts of 300 people to the Teahouse. The Teahouse wants the best people to be invited. HostBot originally did this with heuristics, like number of edits. With machine learning, Max has HostBot-AI. The bot does A/B testing, the results of which Max will present at the CivilServant research event right after the main conference.
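A minimal sketch of what ORES-scored cohort selection with an A/B split could look like. The public ORES v3 scoring endpoint is real, but the choice of the “goodfaith” model, the threshold, and the cohort handling here are my own guesses for illustration, not HostBot-AI’s actual configuration.

```python
import random
import requests

# Sketch: score candidate newcomers with ORES, keep the likely good-faith ones,
# and split them into invite (A) and control (B) arms for an experiment.
# Model choice, threshold, and cohort size are assumptions for illustration.

ORES = "https://ores.wikimedia.org/v3/scores/enwiki/"

def goodfaith_scores(revids):
    """Return {revid: probability the edit was made in good faith}."""
    resp = requests.get(ORES, params={
        "models": "goodfaith",
        "revids": "|".join(str(r) for r in revids),
    })
    scores = resp.json()["enwiki"]["scores"]
    return {
        int(rid): s["goodfaith"]["score"]["probability"]["true"]
        for rid, s in scores.items()
        if "score" in s.get("goodfaith", {})   # skip revisions ORES could not score
    }

def build_cohort(candidates, cohort_size=300, threshold=0.8):
    """candidates: {username: first_revid}. Keep likely good-faith newcomers,
    then split them randomly into invite and control arms."""
    scores = goodfaith_scores(candidates.values())
    eligible = [u for u, rid in candidates.items() if scores.get(rid, 0) >= threshold]
    random.shuffle(eligible)
    cohort = eligible[:cohort_size]
    midpoint = len(cohort) // 2
    return {"invite": cohort[:midpoint], "control": cohort[midpoint:]}
```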
For the CivilServant meetup the community is setting the agenda. It is not pre-defined and based on what people want to know.
Heikki Kastemaa and mapping Stockholm art in Finnish language
Wikimedia Finland is mapping public art in Wikiprojekti:Suomen julkiset. For example, for a given municipality there is a list (“luettelo”) of the art. The current target audience for this is the many tourists who come from Finland to Stockholm. This project is listing all Finnish artworks and memorials in Wikipedia and later in Wikidata.
About 60-70% of the roughly 7,000 works in their collection have been translated. They get photographs of public artworks which are more than 70 years old and therefore in the public domain. They are seeking public art and memorials.
It has been a lot of work to fit Wikidata into our workflows. The problem is changing the present project activities to incorporate posting the same information into Wikidata. If the information were in Wikidata then it would be easier to maintain and translate, but many people who contribute to this as a Wikipedia project do not know how to edit Wikidata. The plan for the near future is to have special Wikidata workshops for project participants.
Cyberpower678 and InternetArchiveBot
Maintaining user:InternetArchiveBot is a massive undertaking. The bot has an automated process which seeks out dead links in Wikipedia and replaces each with a link to the Internet Archive’s copy of the website. The bot currently runs on 20 languages of Wikipedia and on “Miraheze”, a wiki farm driven by MediaWiki. The newest undertaking is linking book references. Now one can click on a link in a book reference and get to the given book, and one can even link to individual pages. This works even for books in copyright by showcasing only the page cited. A person can also check out the book through the Open Library.
After 14 days the book self-destructs. To check out books one needs an Archive.org account. Books come in various formats, like ePub. The books are completely searchable. Mark Graham speaks for the program.
The books that Archive.org has are in Open Library. Not all the books in the Open Library are indexed in Archive.org. To check out a book the Internet Archive has to physically have a copy.
The English Wikipedia cites about 1.2 million unique ISBNs, or 3.2 million non-unique ISBNs, as of August 2019. The InternetArchiveBot links to Archive.org by adding new links to the citations.
Currently the InternetArchiveBot is at 2.0beta, and at the end of the Hackathon, 15 August 2019, it will be 2.0. InternetArchiveBot 2.1alpha will have a book-linking feature to share a few pages of the books.
Dan W. (Skalman) assists and is a Wiktionary contributor. Dan is doing front-end development. Users can invoke the tool to call its attention to any page where they want to run the bot. Previously there were many errors in the operator interface: the interface displayed an error banner, but that banner itself functioned in error. Dan fixed the error banner so that it correctly displays the many errors.
Maximilian Doerr has a list of these books mapped to the Internet Archive’s physical collection for the Open Library. He has a list of all the ISBNs which English Wikipedia cites – the approximately 1.2 million unique ISBNs mentioned above. The Internet Archive physically owns about 120,000 of these books. Because they physically own them, they are able to provide previews of the cited pages of these books to English Wikipedia.
Max offered to share this list of ISBNs mapped to the unique identifiers which Archive.org applies to them, perhaps to upload this dataset to Wikidata as a pilot set of books which anyone can check out online from the Open Library.
Joe Seddon and the value of the fundraising banner
Joe Seddon from Wikimedia Foundation fundraising commented on the value of the Wiki Loves Monuments banner campaign.
The two things to calculate the maximum worth on English Wikipedia are the “cost per impression”, typically counted by the thousand or million, and the total impression count. This assumes that there is one advertisement, when some campaigns have multiple or repeating advertisements. Medicine, for example, gets 300,000,000 pageviews on English Wikipedia every year.
Let’s try a hypothetical exercise. There are 200,000 photos for Wiki Loves Monuments. If we were running this outside of Wikipedia projects, how much would it cost to advertise all these images to audiences? We could consider the number of people to whom we advertise this campaign, how much of their attention we direct to the pictures, and our success rate in getting them to interact, and measure all this engagement against the going market rates at which other web platforms sell this kind of interaction. We know the traffic and engagement numbers, but translating their value into cash money is debatable. However, at any reasonable price, the Wiki Loves Monuments campaign attracts what anywhere else would be a media campaign of extraordinarily high cost at any vendor.
One estimate for the value of Wikipedia editor time could be to look at market rates for paid translation services. Perhaps in the United States English to almost any other language would be $0.15/word. If the labor of Wikipedia engagement were comparable in financial value to that, then we could estimate the value of Wikipedia engagement campaigns.
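To show the arithmetic rather than any real figure, here is a back-of-envelope sketch. The CPM and word-count numbers are hypothetical; only the 300,000,000 pageviews and the $0.15/word translation rate come from the conversation above.

```python
# Back-of-envelope value estimates in the spirit of the conversation above.
# The CPM and the word count are made up to show the arithmetic only.

# 1. Banner-impression value: impressions x market rate per thousand (CPM).
impressions = 300_000_000        # e.g. annual pageviews of a topic area (from the notes)
hypothetical_cpm_usd = 5.00      # assumed market price per 1,000 impressions
impression_value = impressions / 1000 * hypothetical_cpm_usd
print(f"impression value: ${impression_value:,.0f}")   # $1,500,000

# 2. Editor-labor value: words contributed x paid-translation market rate.
words_contributed = 2_000_000    # hypothetical output of a campaign
rate_per_word_usd = 0.15         # the translation rate mentioned above
labor_value = words_contributed * rate_per_word_usd
print(f"labor value: ${labor_value:,.0f}")              # $300,000
```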
I have thought about being a mathematician for Wikipedia value guesses. For example, what would be the cost of producing Wikipedia outside of Wikipedia, and what would be the cost of finishing Wikipedia. Maybe it would cost several trillion dollars to complete it? What are our goals, and how do we explain to sponsors and funders the value of what we are able to attract and offer to our audience?
Lodewijk and Wiki Loves Monuments
I have notes in a separate post.
Slashme and the Parliament Diagram Tool
I have notes in a separate post.
Nahid Sultan and Wikimedia Bangladesh
I asked Nahid to tell me about the projects of Wikimedia Bangladesh. He told me that they were collaborating with the biggest online news site to hold an annual Wikipedia editing event.
This outreach was previously a part of Mother Tongue Day but now runs on demand, whenever partners want it and the chapter has the ability to support it. Also, because that holiday has so many other events around it, they run this project later in the year so as not to conflict with other events. In their program they run a contest which also gives certificates and prizes to participants.
1000 people edit every year. A few people become regular editors and a few people only edit during this event.
Someone gave the chapter a referral to the Bangladesh Military and asked them for historic military photos. They made a request. There was no reply or follow up for six months. One day an email came saying that they had escalated the request higher and higher, and at the top they got permission to share 250 photos. Most of the photos are of the air force. Some of them show the founders of the country with the military.
A radio station in the capital routinely does interviews with celebrities. It is a radio station so they do not emphasize visual media, but they do photograph the people who come to speak on the radio. We asked them for photographs from their internal collection, many of which they had never published and only kept in their internal archives. They gave us photos and licenses to put this content on Commons.
The Director General of Museums in Bangladesh was a Wikipedia editor but did not make his position known. He was editing articles and we came to see his activities. We reached out to him and eventually he said that he was at the museum and he invited us to visit. We arrived at the museum and there was a long line. When we arrived someone was there to receive us and they escorted us to the front of the line and into the museum. We told them we were here to meet a wiki colleague and we asked them what he did at the museum. They laughed and said that he was the director of all the government museums in Bangladesh, including being the top person at this museum. It takes time to build a relationship and it was nice to have this conversation start.
At the hotel party at Clarion
There was a hotel party at the Clarion. I talked with Mina about Wikidata, the Greek language, and neuroscience. I talked with Avi about ABBA, since we were in Sweden and since he always likes discussing local music culture. I told him that I lived in Charlottesville, and he recognized the name from the news and asked me about American racism. I talked with Finn to plan our presentations related to Scholia. I talked with James and Shani, both formerly of Wikimedia Medicine and now on the Wikimedia Foundation Board of Trustees. They asked me to contribute to Wikimedia Medicine as secretary, and I said that I would do more documentation there but did not wish to be an officer of the organization.