I tried explaining something in a webcam video, which is new for me, and I want to try it more often. I sorted the notes in my head, rehearsed out loud once, then recorded. I watched it and thought it was fine, so I published it.
I wanted to start a conversation on demographic profiling in the Wikipedia platform.
I explain that Wikipedia has biographical encyclopedic articles. Complementary to Wikipedia, in the Wikipedia platform, we have Wikidata, which is a database of information about the subjects of all Wikipedia articles and many other topics related to building a general reference resource. Wikidata imports lots of datasets, and the Wikidata project which has been largest in terms of content and contributor community is WikiCite. WikiCite is an effort to collect scholarly source metadata, that is, the citations to academic and research literature. Beyond merely copying that data, we also enrich it in the wiki platform. Now we get to Scholia, which is my project and the place where demographic profiling can happen.
WikiCite imported citation data from lots of papers, and this data includes the names of lots of researchers. When we have researcher names we can associate them with their institutions, such as by reading their institutional affiliations in the author byline of their paper. When we know a name and some additional information, often we can connect to other data sources and get much more information, like a full publication history from other profiling services, or research history from any project database, or funding history from government or foundation databases, or event participation history from conference programs. By and by we have more complete data-based biographies of many people.
Eventually we may get demographic data. Maybe a person has published papers in multiple languages, which might be supporting evidence of membership in a language community. Maybe they have a few decades of publication history with nothing before or after, so we can place them into a generational cohort. In writings about a person we can observe the use of gender pronouns and mark those. In other ways we sometimes find race, ethnicity, religion, or LGBT+ status. In the end, if we have a list of people for a university, institution, research project, community organization, conference, grant recipient cohort, or any other group of scientists, then potentially we could demographically profile the entire group and gain insights.
The benefits include identifying and showcasing the accomplishments of underrepresented demographics, addressing bias with data-based evidence of its existence, improving the quality of research by enabling administrators to more easily recruit diverse collaborators, and better assessing impact and benefit for user communities.
The risks of demographic profiling are numerous and serious. They include increasing the visibility and vulnerability of demographics at risk for harassment, violating the privacy of individuals, creating a scalable, chaotic new global paradigm in human interaction without the infrastructure and planning to mitigate harm, compelling individuals and institutions to either participate in labeling or be left out, applying temporal and cultural labels at scale where they are certain to be misinterpreted across time and culture, solidifying strong labels of personal identification when labeling also causes harm, and creating a massive new dataset which can be co-opted by corporate bad actors to enact perpetual shenanigans.
I do not know where this can go or who, if anyone, has delineated the discourse on demographic labeling.
After publishing the video I thought about everything else I could have said and did not. I have to start somewhere. I felt relieved to talk.