Issues with data access and why they hinder transparency, accountability, and policy-oriented research
Kalina Bontcheva
University of Sheffield
Abstract:
Researchers have shown that online disinformation is cross-platform by nature. This raises concerns, because discussions around the EU Digital Services Act (DSA) data access provisions for researchers are very much focused on actions by individual VLOPs (Very Large Online Platforms) and VLOSEs (Very Large Online Search Engines). Additional measures need to be taken to encourage cross-platform data access, as well as improved data access tools and datasets for all independent eligible researchers.
Key words: Data access, APIs, DSA, VLOPs, VLOSEs
1. Data Access: Why do UK researchers need it?
Researchers have shown that online disinformation is cross-platform by nature. This raises concerns, because discussions around the EU Digital Services Act (DSA) data access provisions for researchers are very much focused on actions by individual VLOPs (Very Large Online Platforms) and VLOSEs (Very Large Online Search Engines). In my opinion, additional measures therefore need to be taken to encourage cross-platform data access, including the creation of detailed, shared research datasets on elections, aggregated from contributions by the companies and provided under DSA provisions to eligible researchers.
Moreover, I advocate that the effectiveness of the companies’ reported measures to combat online disinformation needs to be verified by independent researchers, and thus that the data these companies provide to researchers should be sufficiently detailed to allow for that.
In the interest of preserving trust in science and research, I advocate that online platforms and search engines should be discouraged from providing data and/or funding only to a small group of researchers selected by the companies themselves in a non-transparent manner. Instead, all eligible researchers in the UK, the EU, and worldwide should be included, in order to ensure comprehensive, language- and country-specific research, transparency, and accountability.
I also believe that it is particularly important for researchers to have access to comprehensive datasets of moderated content deemed to be disinformation by the platforms’ moderation teams and their AI algorithms. In the case of election-related content, these datasets should also include ads, posts, and groups belonging to political parties, candidates, and other key public organisations in each election. This is important for two reasons. First, such datasets serve as an archive and enable historical research into disinformation, election debates, campaigning, and advertising, as has already been done on datasets released previously by Twitter relating to the 2016 UK referendum on EU membership (Brexit) and the 2016 US presidential elections. Second, comprehensive election-related datasets spanning multiple platforms are necessary so that independent researchers can study the effectiveness, coverage, comprehensiveness, and fairness of the disinformation moderation measures put in place by VLOPs and VLOSEs. The data needs to be multilingual and multi-platform to enable independent investigations and comparisons.
Moreover, I believe that when an online platform or search engine declines an academic researcher’s (or their team’s) request for data access to study online disinformation, it should report the refusal and the reasons behind it to the respective national media regulator and Digital Services Coordinator. There are already examples of researchers being denied data access even though they are funded under the Horizon 2020 research programme to study disinformation. I therefore believe there is an urgent need for the European Commission (EC) to agree with online platforms and search engines that they will not deny data access to independent researchers studying disinformation, especially those funded by the EC.
2. Data Access Issues with Meta (Facebook and Instagram)
In July 2023 I compared the data access provisions for researchers offered by Instagram and TikTok, since they are competing VLOPs of a similar kind. It is unclear why TikTok can offer a comprehensive research API, which allows researchers to analyse disinformation in their own computational environments, while Meta has opted for a highly restrictive clean-room approach for Instagram (called the Researcher Platform). I have found this approach very limiting with respect to repeatability, open science, and the kinds of research projects and questions that EU-funded research projects can investigate, e.g. in relation to election integrity and the effectiveness of platform measures against disinformation during elections. Moreover, Meta’s clean-room approach does not allow researchers to download data arising from their disinformation research (e.g. URLs shared) or to upload datasets for cross-platform research on disinformation (e.g. to study the spread on Instagram of a given set of TikTok or Twitter URLs that have been debunked by fact-checkers).
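To illustrate the kind of cross-platform analysis that a clean-room model blocks, here is a minimal sketch of checking how a set of fact-checker-debunked URLs from one platform surfaces in another platform’s posts. All data, field names, and helper functions here are hypothetical and for illustration only; in practice such an analysis requires exactly the download/upload capabilities discussed above.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Common tracking parameters that vary by platform but do not change the target page.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def normalize_url(url: str) -> str:
    """Lower-case the host and drop tracking parameters and fragments,
    so the same link shared on different platforms compares equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse((parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"),
                       "", urlencode(query), ""))

def cross_platform_spread(debunked_urls, platform_posts):
    """Count, per debunked URL, how many posts on the second platform share it."""
    debunked = {normalize_url(u) for u in debunked_urls}
    counts = {}
    for post in platform_posts:
        for u in post.get("urls", []):
            nu = normalize_url(u)
            if nu in debunked:
                counts[nu] = counts.get(nu, 0) + 1
    return counts

# Hypothetical example: a URL debunked on Twitter, matched against Instagram posts.
debunked = ["https://example.com/fake-story?utm_source=twitter"]
posts = [
    {"id": "1", "urls": ["https://EXAMPLE.com/fake-story/"]},
    {"id": "2", "urls": ["https://example.com/other"]},
]
print(cross_platform_spread(debunked, posts))
```

The URL normalisation step matters because each platform decorates shared links with its own tracking parameters, so naive string matching would undercount cross-platform spread.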
Therefore, I believe that all image- and video-sharing VLOPs need to offer equivalent, fully comprehensive, flexible API-based data access under the DSA.
The more general point is that VLOPs and VLOSEs of a similar kind need to provide similar API-based data access in terms of coverage, comprehensiveness, real-time updates, etc.
3. Data Access Issues with X/Twitter
A vast array of research into online disinformation (including my own and my team’s) was made possible by Twitter’s generous provision of elevated researcher API access. This access was unfortunately revoked when Twitter’s ownership changed and the platform was rebranded as X.
Under the DSA, EU researchers are eligible for free research API access on X. However, when EU-funded UK researchers apply, they are rejected as the company regards them as not covered by the DSA.
However, as many political parties and elected politicians use X/Twitter for their election campaigns and to communicate with citizens, there is clearly a very strong societal and governmental need for in-depth research into online disinformation on the platform. The situation is deeply disappointing: we now have proven technology for monitoring disinformation, and policy makers and media organisations want to know how effective X/Twitter’s moderation measures are, yet none of this is possible because the X API is not affordable for researchers (e.g. costing $42,000 a month).
4. Data Access for Google Search
Recently I carried out a comparative analysis of the July 2023 reports on Code of Practice implementation from two VLOSEs, Microsoft Bing and Google Search, and found a major disparity in the depth, quality, and range of data they currently offer to researchers. Given that Bing and Google Search are both VLOSEs facing the same privacy, user, and data sharing challenges, I strongly believe that the research community and the EC should reasonably expect them to offer similar data provision arrangements. This is unfortunately far from being the case. In its July 2023 report, Google Search detailed access only to Google Trends data and the Fact-Check API. When compared with Microsoft Bing’s data access provisions reported in its respective July 2023 report, it becomes clear that Google Search data access is highly limited and in urgent need of expansion. More specifically, I argue that Google needs to adopt Microsoft’s practice of providing thematic datasets to researchers (cf. Bing’s COVID and click-analysis datasets). In addition, I believe that Google Search needs to provide researchers with comprehensive APIs for news, web, image, video, and other searches, to match those provided by Bing.
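As a concrete illustration of what Google currently offers, its Fact Check Tools API exposes a `claims:search` endpoint for querying published fact-checks. The sketch below builds a request URL and parses the documented response shape without sending any network traffic; the endpoint and parameter names are taken from Google’s public API documentation rather than the July 2023 reports, the API key is a placeholder, and the sample response is invented for illustration.

```python
from urllib.parse import urlencode

FACT_CHECK_ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_claim_search_url(query: str, language: str = "en",
                           api_key: str = "YOUR_API_KEY") -> str:
    """Construct a claims:search request URL for the Google Fact Check Tools API.
    No request is sent here; api_key is a placeholder."""
    params = {"query": query, "languageCode": language, "key": api_key}
    return f"{FACT_CHECK_ENDPOINT}?{urlencode(params)}"

def extract_ratings(response_json: dict) -> list:
    """Pull (claim text, publisher name, textual rating) triples out of a
    claims:search response body."""
    out = []
    for claim in response_json.get("claims", []):
        for review in claim.get("claimReview", []):
            out.append((claim.get("text"),
                        review.get("publisher", {}).get("name"),
                        review.get("textualRating")))
    return out

# Invented response fragment, following the documented response shape:
sample = {"claims": [{"text": "Example claim",
                      "claimReview": [{"publisher": {"name": "ExampleFact"},
                                       "textualRating": "False"}]}]}
print(build_claim_search_url("election claims"))
print(extract_ratings(sample))
```

Useful as this API is for retrieving fact-check verdicts, it returns only published fact-checks, not search results, trends, or moderation data, which is why I argue above that it cannot substitute for the broader search APIs that Bing provides.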