
After the closure of the Academic API on 24 June 2023, this collection has stopped. If you cannot rehydrate tweets because of the API closure, feel free to contact us via the issues or by email.

# ukraine_twitter_data


Twitter (and maybe, later, other social media) data around the invasion of Ukraine in February 2022.

Data reaches back to 1 February 2022 and was updated daily while the collection was running.

The updates were automated and were usually available every afternoon for the preceding day. Let us know if you notice any problems!

Please cite as:

Münch, F. V., & Kessling, P. (2022, March 1). ukraine_twitter_data. https://doi.org/10.17605/OSF.IO/RTQXN

Other citation styles can be found on OSF:

https://doi.org/10.17605/OSF.IO/RTQXN

## FAQ

### What data is available?

Right now, we provide data on all tweets since 1 February 2022 that contain the hashtag or word 'ukraine' (see the queries below) in different languages, as collected via the Twitter Academic API.

Furthermore, data on tweets containing the term 'bucha' (and its German and Cyrillic variants) is now available.

Collections:

| Language | Query | Data nearly complete from |
| --- | --- | --- |
| English | `#ukraine AND lang:en` | 1 February 2022 |
| German | `ukraine AND lang:de` | 1 February 2022 |
| Russian | `Украина AND lang:ru` | 1 February 2022 |
| Ukrainian | `Україна AND lang:uk` | 1 February 2022 |
| English | `bucha AND lang:en` | 1 March 2022 |
| German | `(bucha OR butscha) AND lang:de` | 1 March 2022 |
| Russian | `(Бу́ча OR bucha) AND lang:ru` | 1 March 2022 |
| Ukrainian | `(Бу́ча OR bucha) AND lang:uk` | 1 March 2022 |

To comply with Twitter's TOS and to protect people who have decided to delete their tweets, we share only tweet IDs, creation dates, and metadata about our collection methods and dates.

If you are eligible for Twitter's Academic API access and want to add further languages, let us know and we will be happy to support you.

### How was this data collected?

With the focalevents tool by @ryanjgallagher, using our Twitter Academic API access.

We query tweets that contain the keywords stated above and filter for languages detected by Twitter.

We started collecting tweets on 24 February 2022 and backfilled tweets since 1 February.

The data itself tells you whether an ID was collected via search ('backfilled') or via the stream. Backfilled data will not contain tweets that were deleted before collection time.

### How is the data structured?

We share the data in language-specific folders.

The filenames indicate the date of the tweets.

Furthermore, every file is available in two CSV versions:

- one with the IDs only, for easy hydration with the tools mentioned below;
- one with metadata on how the data was collected in every line, for you to filter to your needs (see the sketch below).
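
As a minimal sketch, this is how the metadata version could be filtered in Python. The file name and the column names `tweet_id` and `collection_method` are assumptions; check the actual CSVs in the repository.

```python
import pandas as pd

# Hypothetical file and column names -- verify against the actual CSVs.
meta = pd.read_csv("en/2022-03-01_meta.csv")

# Keep only IDs that arrived via the stream (i.e. were not backfilled).
streamed = meta[meta["collection_method"] == "stream"]

# Write an IDs-only file, e.g. for hydration.
streamed["tweet_id"].to_csv("streamed_ids.csv", index=False)
```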

### How many tweets are in each collection?

This many:

#### (#)Ukraine

[Charts: daily tweet counts for the English (hashtag), German, Ukrainian, Russian, and combined 'ukraine' collections]

#### Bucha/Butscha/Бу́ча

[Charts: daily tweet counts for the English, German, Ukrainian, and Russian 'bucha' collections]

These figures were updated periodically while the collection was running.

### How can I get the content of the tweets?

Via the Twitter API, e.g. with twarc or, if you prefer a graphical user interface, with the Hydrator by @DocNow.

For this purpose, we provide files that contain only the tweet IDs.
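
For example, a minimal hydration sketch with the twarc2 Python client; the bearer token and the input file name are placeholders:

```python
from twarc import Twarc2

# Placeholder token -- you need your own Twitter API credentials.
client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

# One of the IDs-only CSVs from this repository (placeholder name);
# isdigit() skips a possible header row.
with open("ids.csv") as f:
    ids = [line.strip() for line in f if line.strip().isdigit()]

# tweet_lookup batches the IDs into API requests and yields response
# pages; deleted or protected tweets are simply absent from the results.
for page in client.tweet_lookup(ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"])
```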

If you need any data that is not available this way, we might be able to help you, pending an ethical evaluation of your research goals.

### How do you ensure the quality of the data?

Due to connection and other problems, there always will be gaps in such a large-scale data collection. We meticulously backfill any gaps that we discover in our data.

Here we compare our data with the estimated counts returned by the API (number of collected tweets per hour divided by Twitter API count estimates).
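
A minimal sketch of this check, assuming hypothetical CSVs holding our collected tweets per hour and Twitter's estimates for the same hours (all file and column names below are assumptions):

```python
import pandas as pd

collected = pd.read_csv("collected_per_hour.csv")  # assumed columns: hour, n_collected
estimates = pd.read_csv("twitter_estimates.csv")   # assumed columns: hour, tweet_count

merged = collected.merge(estimates, on="hour")
merged["ratio"] = merged["n_collected"] / merged["tweet_count"]

# Hours below the 95% target deserve a closer look.
print(merged[merged["ratio"] < 0.95])
```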

We aim for 95% of Twitter's hourly count estimates. As you can see, this is not always possible, most likely due to tweet deletions, account bans, account protections, or wrong estimates by Twitter.

In the English and Bucha datasets, our count for one hour is 16-18 times higher than the Twitter estimate we got. We will have a closer look at that as soon as possible, but more is usually better than less. Maybe it is a glitch caused by daylight saving time (even though we should then see it in other languages as well), or by the 'spikiness' of the event 🤷. Most counts are >= 95%, and fewer than 10 hours fall below 90% of the estimated count.

#### (#)Ukraine

[Charts: hourly ratio of collected tweets to Twitter's count estimates for the English (hashtag), German, Ukrainian, and Russian 'ukraine' collections]

#### Bucha/Butscha/Бу́ча

[Charts: hourly ratio of collected tweets to Twitter's count estimates for the English, German, Ukrainian, and Russian 'bucha' collections]

### Is this ethical/allowed?

Because we publish tweet IDs only, we comply with the Twitter TOS.

Given the public interest in this data, and given that it will be indispensable for presenting research findings on events in contemporary history, we also comply with the GDPR, especially its German implementation, the DSGVO (Art. 6 (1) f GDPR in connection with Art. 85 GDPR, § 27 BDSG).

From an ethical standpoint, we do not share any data the conflict parties would not have collected or be able to collect anyway.

As we share only Tweet IDs, accounts can protect themselves at any time.

We think sharing this collection contributes to the cause of open science.

Furthermore, while much of the information contained in the tweets will be dis- and misinformation, this dataset at least provides transparency by enabling researchers and OSINT experts to analyse it independently, which is in the public interest of democratic states.

However, we still ask you to assess your respective use of this data with your ethical review board, and/or with our ethical and legal guidance questionnaire SOCRATES.

### I don't want all of the data

You can use Git's sparse checkout feature: https://dev.to/kiwicopple/quick-tip-clone-a-single-folder-from-github-44h6

If you are just interested in single days, the easiest way is to download single files manually via the GitHub interface, or automatically via their URLs with curl/wget, as sketched below.
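
For instance, a sketch of an automated download in Python; the raw URL pattern, branch, folder, and file name are assumptions to be adapted to the collection and day you need:

```python
import requests

# Assumed raw.githubusercontent.com URL -- verify folder and file names
# against the repository before use.
url = (
    "https://raw.githubusercontent.com/Leibniz-HBI/ukraine_twitter_data/"
    "main/en/2022-03-01.csv"
)

response = requests.get(url, timeout=30)
response.raise_for_status()

with open("2022-03-01.csv", "wb") as f:
    f.write(response.content)
```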

## Limitations

This data is mainly limited by the fidelity of the Twitter API and by data degradation over time:

- Tweet IDs of tweets that were deleted, suspended, hidden, or protected before collection time will not be in the dataset.
- Tweets that have been deleted or otherwise taken down after collection time will not be returned during hydration.
- The collection depends on Twitter's language detection, which is known to be far from perfect, but good enough for large-scale assessments:
  - Tweets that have not been detected as being in one of the collected languages will not be in the collection.
  - There will also be mislabelled tweets (e.g. Dutch as German, or maybe even Ukrainian as Russian) in the collection.
  - Tweets that do not contain any text (e.g. links or pictures only) might be missing from the collection.

Furthermore, while we have backfilled any gaps occurring in the data so far, there might be gaps in the future due to system failures or errors in our code or the software we use. We plan to automatically publish Twitter's count estimates alongside the data in the near future so that researchers can double-check themselves. In the meantime, researchers with access to the Twitter Academic API can query the count endpoints themselves and compare the counts, as sketched below. Please let us know in the issues if you see any major deviations.
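
As a sketch, the hourly estimates for one of the queries above can be fetched with the twarc2 Python client; the bearer token is a placeholder and the date range is only an example:

```python
from datetime import datetime, timezone

from twarc import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")  # placeholder token

# Hourly count estimates for the German collection query over one day.
for page in client.counts_all(
    "ukraine lang:de",
    start_time=datetime(2022, 3, 1, tzinfo=timezone.utc),
    end_time=datetime(2022, 3, 2, tzinfo=timezone.utc),
    granularity="hour",
):
    for bucket in page.get("data", []):
        print(bucket["start"], bucket["tweet_count"])
```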

We do not guarantee any ongoing collection, mainly because Twitter limits the number of tweets we can collect per month. So please do not count on anything beyond what is already here, e.g. in project plans or grant proposals. (Or approach us and we will help you apply for Academic access to the Twitter API yourself and set up your own collection.)