the-deep/humset

HumSet is a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. HumSet was curated by humanitarian analysts and covers various disasters around the globe that occurred from 2018 to 2021 across 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages: English, French, and Spanish, originally taken from publicly available sources. For each document, analysts identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry. See our paper for details.

Paper: HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response

@inproceedings{fekih-etal-2022-humset,
    title = "{H}um{S}et: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response",
    author = "Fekih, Selim  and
      Tamagnone, Nicolo{'}  and
      Minixhofer, Benjamin  and
      Shrestha, Ranjan  and
      Contla, Ximena  and
      Oglethorpe, Ewan  and
      Rekabsaz, Navid",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.321",
    pages = "4379--4389",
}



Dataset

The main dataset is shared in CSV format (humset_data.csv), where each row represents an entry with the following features:

entry_id, lead_id, project_id, sectors, pillars_1d, pillars_2d, subpillars_1d, subpillars_2d, lang, n_tokens, project_title, created_at, document, excerpt
  • entry_id: unique identification number for a given entry. (int64)
  • lead_id: unique identification number for the document to which the corresponding entry belongs. (int64)
  • sectors, pillars_1d, pillars_2d, subpillars_1d, subpillars_2d: labels assigned to the corresponding entry. Since this is a multi-label dataset (each entry may have several annotations belonging to the same category), they are reported as arrays of strings. For a detailed description of these categories, see the paper. (list)
  • lang: language. (str)
  • n_tokens: number of tokens (tokenized using NLTK v3.7 library). (int64)
  • project_title: the name of the project where the corresponding annotation was created. (str)
  • created_at: date and time of creation of the annotation in standard ISO 8601 format. (str)
  • document: document URL source of the excerpt. (str)
  • excerpt: excerpt text. (str)

Note:

  • subpillars_1d and subpillars_2d tags are reported as strings in the format {PILLAR}->{SUBPILLARS}, to underline the hierarchical structure of the 1D and 2D categories.
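When the CSV is loaded with pandas, list-valued columns arrive as strings. A minimal sketch of parsing them back into Python lists, assuming the label columns are serialized as Python-style list literals; the file path and label values below are fabricated for illustration, not actual HumSet content:

```python
import ast

import pandas as pd

# Hypothetical path: humset_data.csv must be requested via nlp@thedeep.io.
# df = parse_labels(pd.read_csv("humset_data.csv"))

LABEL_COLUMNS = ["sectors", "pillars_1d", "pillars_2d", "subpillars_1d", "subpillars_2d"]

def parse_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Convert string-serialized label columns into real Python lists."""
    for col in LABEL_COLUMNS:
        df[col] = df[col].apply(ast.literal_eval)
    return df

# Demonstration on a fabricated two-row frame mimicking the schema:
sample = pd.DataFrame({
    "entry_id": [1, 2],
    "lang": ["en", "fr"],
    "sectors": ["['Health', 'Protection']", "['Food Security']"],
    "pillars_1d": ["['Covid-19']", "[]"],
    "pillars_2d": ["[]", "[]"],
    "subpillars_1d": ["['Covid-19->Cases']", "[]"],
    "subpillars_2d": ["[]", "[]"],
})
sample = parse_labels(sample)
print(sample.loc[0, "sectors"])  # ['Health', 'Protection']
```

`ast.literal_eval` is preferred over `eval` here because it only accepts Python literals, which keeps parsing safe against arbitrary code in the CSV.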

Additional data

In addition to the main dataset, the full texts of the documents (leads) are also provided (documents.tar.gz). Each source text is a JSON-formatted file ({lead_id}.json) with the following structure:

[
  [
    "paragraph 1 - page 1",
    "paragraph 2 - page 1",
    ...
    "paragraph N - page 1"
  ],
  [
    "paragraph 1 - page 2",
    "paragraph 2 - page 2",
    ...
    "paragraph N - page 2"
  ],
  [
    ...
  ],
  ...
]

Each document is a list of lists of strings, where each element is the text of a page, divided into its paragraphs. This format preserves the original textual subdivision: as indicated in the paper, over 70% of the sources are PDF documents. In the case of HTML web pages, the text is reported as if it belonged to a single-page document.
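Reading one of these files is a plain `json.load`; a minimal sketch, where the file name and page contents are illustrative placeholders rather than actual HumSet data:

```python
import json

# Hypothetical lead_id; the real files come from documents.tar.gz,
# e.g. with open("documents/12345.json") as f: pages = json.load(f)
pages = [
    ["First paragraph of page 1.", "Second paragraph of page 1."],
    ["Only paragraph of page 2."],
]

# pages is a list of pages; each page is a list of paragraph strings.
n_pages = len(pages)
n_paragraphs = sum(len(page) for page in pages)

# Flatten into the full document text, one paragraph per line:
full_text = "\n".join(par for page in pages for par in page)
print(n_pages, n_paragraphs)  # 2 3
```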

Additionally, a train/validation/test split of the dataset is shared. The repository also contains the code used to process the full dataset; note that this code includes random components, so re-running it may produce a slightly different result.
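One way to make such a split reproducible is to fix the random seed and partition at the document level, so all entries of a lead land in the same split. This is a hypothetical sketch, not the repository's actual split code, and the ratios are assumptions:

```python
import random

def split_leads(lead_ids, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Deterministically assign each unique lead_id to train/val/test."""
    unique = sorted(set(lead_ids))
    rng = random.Random(seed)  # fixed seed -> same split on every run
    rng.shuffle(unique)
    n_train = int(ratios[0] * len(unique))
    n_val = int(ratios[1] * len(unique))
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),
    }

# Illustration with 100 fake lead_ids:
splits = split_leads(range(100))
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))  # 80 10 10
```

Splitting by lead_id rather than by entry avoids leaking excerpts from the same document across the train and test sets.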

Request access

To gain access to HumSet, please contact us at nlp@thedeep.io

Contact

For any technical questions, please contact Selim Fekih or Nicolò Tamagnone.

Terms and conditions

For a detailed description of the terms and conditions, refer to the DEEP Terms of Use and Privacy Notice.
