Corpus de novelas hispanoamericanas del siglo XIX (conha19)

The corpus conha19 consists of 256 novels written by Argentine, Cuban, and Mexican authors or published in the respective countries between 1830 and 1910. Of these novels, 234 are published in this repository, as they are in the public domain.

Conha19 was prepared for the dissertation "Genre Analysis and Corpus Design: Nineteenth Century Spanish-American Novels (1830–1910)" by Ulrike Henny-Krahmer. The dissertation was written within the junior research group "Computational Literary Genre Stylistics" (CLiGS), which was funded by the German Federal Ministry of Education and Research (BMBF) and hosted at the University of Würzburg from 2015 to 2020.

The corpus has been prepared primarily to allow for the analysis of subgenres, especially thematic subgenres (historical novel, sentimental novel, etc.) and literary currents (such as romantic, realist, and naturalistic novels). Some background information about the contents and preparation of the corpus is given in this README file. For further information see the list of related publications.

Overview of the novels in the corpus

The overviews given here cover all 256 novels, including those that are still under copyright. In total, the texts amount to 18.3 million tokens. There are 108 Mexican, 99 Argentine, and 49 Cuban novels. The following figures show the distribution of novels per decade, first by country, then by thematic subgenre, and finally by literary current.

Novels by decade and country

Novels by decade and thematic subgenre

Novels by decade and literary current

The novels were written by 121 different authors. Authors represented by five or more works are listed below:

author name                    | country   | number of novels in the corpus
-------------------------------|-----------|-------------------------------
de Cuéllar, José Tomás         | Mexico    | 9
Gutiérrez, Eduardo             | Argentina | 9
Gamboa, Federico               | Mexico    | 8
Ocantos, Carlos María          | Argentina | 8
Gómez de Avellaneda, Gertrudis | Cuba      | 7
Calcagno, Francisco            | Cuba      | 6
Paz, Ireneo                    | Mexico    | 6
Altamirano, Ignacio Manuel     | Mexico    | 5
Ancona, Eligio                 | Mexico    | 5
Holmberg, Eduardo Ladislao     | Argentina | 5
Sicardi, Francisco             | Argentina | 5
Villaverde, Cirilo             | Cuba      | 5

Structure and contents of the repository

This section lists the data contained in the repository. The novels are provided in three main formats: TEI, plain text, and linguistically annotated files:

  • tei: the TEI master files of the novels
  • txt: plain text files, extracted from the TEI master files
  • annotated: linguistically annotated files (in TEI)
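To illustrate how plain text can be derived from a TEI master file, here is a minimal Python sketch. It is not the extraction script used for the corpus: it simply assumes that the running text sits in `<p>` elements inside the TEI `<body>`, and the sample document and the function name `tei_body_text` are made up for the example. Only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; element names in TEI files are bound to it.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def tei_body_text(tei_xml: str) -> str:
    """Return the text of all <p> elements in the TEI <body>, one per line.

    Assumption: paragraphs are encoded as <p>; the header is ignored.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    paragraphs = []
    for p in body.iter(f"{{{TEI_URI}}}p"):
        # itertext() also collects text nested in inline elements such as <hi>;
        # split/join collapses runs of whitespace into single spaces.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Invented sample document, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <p>Era una <hi rend="italic">noche</hi> oscura.</p>
    <p>Nadie   hablaba.</p>
  </div></body></text>
</TEI>"""

if __name__ == "__main__":
    print(tei_body_text(SAMPLE))
```

For a file on disk, the same function can be applied to the file's contents read as a string. Headings, verse lines, or notes would need additional handling, depending on how they are encoded.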

There is additional material accompanying the novels' files:

  • metadata_free.csv: basic metadata about the 234 novels which are in the public domain and which are published here, in tabular format, including for example the CLiGS identifiers, shortcuts for authors and titles, publication years, and information about the subgenres of the texts
  • metadata_all.csv: basic metadata for all the 256 novels, including the ones which are not in the public domain yet
  • schema: a folder containing an external TEI keywords file and a Schematron file, which serve to control the metadata keywords used in the text classification section of the TEI header. The TEI schemas for the basic and the linguistically annotated TEI files are not included here because they correspond to the general CLiGS schemas, which are available in the CLiGS reference repository
  • bib/biblibography.xml: bibliography file (in TEI), holding full bibliographic references of literary historical works cited in the corpus files
  • spellcheck: lists with exception words and results of the spell check in CSV format, for the whole corpus and per novel
  • travelogues: three TEI files with travelogues which were not included in the corpus as novels, but were compared with the novels during the selection process
  • scripts: scripts used to check, clean, or summarize corpus data
  • plots: plots with summaries of corpus metadata
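The metadata tables can be summarized with a few lines of Python, for example to reproduce counts of novels per country or per decade. The following is a sketch only: the column names `idno`, `country`, and `year`, the comma delimiter, and the sample rows are assumptions made for illustration, so the header of `metadata_free.csv` should be checked and the names adapted before use.

```python
import csv
import io
from collections import Counter


def load_metadata(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count rows per distinct value of the given column."""
    return Counter(row[column] for row in rows)


def decade(year):
    """Map a publication year such as '1882' to its decade, e.g. 1880."""
    return int(year) // 10 * 10


# Invented rows with hypothetical column names, not actual corpus metadata.
SAMPLE_CSV = """idno,country,year
id1,Mexico,1869
id2,Argentina,1880
id3,Cuba,1882
id4,Mexico,1903
"""

if __name__ == "__main__":
    rows = load_metadata(io.StringIO(SAMPLE_CSV))
    print(count_by(rows, "country"))
    print(Counter(decade(row["year"]) for row in rows))
```

With the real file, `load_metadata` would be called on a handle opened with `open("metadata_free.csv", newline="", encoding="utf-8")`.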

In addition, there are further formats that were derived from the three main formats for specific analyses:

  • tei_ns: "tei no speech", subset of 92 files without direct speech mark-up (in TEI)
  • tei_ds: "tei direct speech", subset of 92 files with direct speech annotation based on a regular expression approach
  • tei_tokenized_ds: subset of 92 files as tokenized text with two stand-off direct speech annotations (DS_gold: semi-automatically created gold standard, DS_reg: automatically created RegExp-based annotation), in TEI
  • annotated_corr: linguistically annotated files (in TEI) with corrected POS annotation for verb forms with enclitic pronouns
  • txt_annotated: plain text files, extracted from the corrected linguistically annotated TEI files (annotated_corr); named entities are replaced with the token ENTITY
  • txt_annotated_corr: plain text files derived from txt_annotated; converted to lower case; blank spaces that precede punctuation marks (comma, full stop, etc.) are removed
  • txt_annotated_nouns: plain text files derived from the corrected linguistically annotated TEI files (annotated_corr); only nouns are kept
  • txt_annotated_stop: plain text files derived from txt_annotated_corr; stop words are removed
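The last derivation steps in the list above (lower-casing, removing blank spaces before punctuation marks, and removing stop words) can be sketched as follows. This is an illustration under assumptions, not the script that produced the published files: the set of punctuation marks, the whitespace tokenization, the stop word list, and the sample sentence are all invented for the example.

```python
import re

# One or more blank spaces directly before a punctuation mark.
# Assumption: this set of marks; the corpus scripts may use a different one.
SPACE_BEFORE_PUNCT = re.compile(r" +([,.;:!?])")


def normalize(text):
    """Lower-case the text and drop blank spaces that precede punctuation."""
    return SPACE_BEFORE_PUNCT.sub(r"\1", text.lower())


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose bare form is a stop word."""
    kept = [
        token
        for token in text.split()
        # Strip attached punctuation before the lookup, keep it in the output.
        if token.strip(",.;:!?¡¿") not in stop_words
    ]
    return " ".join(kept)


# Invented sentence in the style of txt_annotated (tokens separated by spaces,
# a named entity replaced with ENTITY); not taken from the corpus.
SAMPLE = "La casa de ENTITY estaba vacía , y nadie respondió ."
STOP_WORDS = {"la", "de", "y"}  # hypothetical mini list

if __name__ == "__main__":
    normalized = normalize(SAMPLE)
    print(normalized)
    print(remove_stop_words(normalized, STOP_WORDS))
```

Note that in this sketch the placeholder ENTITY is lower-cased together with the rest of the text.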

Related resources

Repositories

This repository is related to three other GitHub repositories:

Bib-ACMé is a digital bibliography containing information about the novels published in Argentina, Cuba, and Mexico between 1830 and 1910. This bibliography constitutes the sampling frame for the corpus conha19, so it aims to represent the whole population of 19th-century novels published in the three countries.

scripts-nh contains XSLT and Python scripts which were used for the creation, annotation, and documentation of the corpus and for the analysis of the novels in the corpus.

data-nh holds research data that resulted from applying the scripts of scripts-nh to the corpus files.

Datasets

Here, links to the corpus in other formats (published elsewhere and not as part of this repository) are given.

Publications

This corpus or parts of it have been described and/or used for analyses in the following publications:

Reference publication:

  • Henny-Krahmer, Ulrike (2023). Genre Analysis and Corpus Design: Nineteenth Century Spanish-American Novels (1830–1910). Dissertation, Universität Würzburg. https://doi.org/10.25972/OPUS-31999.

Other publications:

  • Calvo Tello, José, Ulrike Henny-Krahmer, and Christof Schöch (2018): "Textbox: análisis del léxico mediante corpus literarios". In Historia del léxico español y Humanidades digitales. Edited by Dolores Corbella, Alejandro Fajardo, and Jutta Langenbacher. Berlin: Peter Lang, 225-253. https://dialnet.unirioja.es/servlet/articulo?codigo=7081640.
  • Calvo Tello, José, Daniel Schlör, Ulrike Henny-Krahmer, and Christof Schöch (2017): "Neutralising the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels". In Digital Humanities 2017. Conference Abstracts. Montréal: McGill University & Université de Montréal, 181-184. https://dh2017.adho.org/abstracts/037/037.pdf.
  • Henny-Krahmer, Ulrike (forthcoming): "Family Resemblance in Genre Stylistics: A Case Study with 19th Century Spanish American Novels." In Digital Stylistics in Romance Studies and Beyond. Edited by Robert Hesselbach, José Calvo Tello, Ulrike Henny-Krahmer, Daniel Schlör, and Christof Schöch. Heidelberg: heiUP.
  • ___ (2022): "Novelas originales y americanas. A Digital Analysis of References to Identity in Subtitles of Spanish American 19th Century Novels." apropos [Perspektiven auf die Romania] 9: 14-36. https://doi.org/10.15460/apropos.9.1893.
  • ___ (2021): "Time for Genre. Temporal Expressions as Features for the Classification of Literary Subgenres." EADH2021. https://eadh2021.culintec.de/HENNY_KRAHMER_Ulrike_Time_for_Genre__Temporal_Expressions_as.html.
  • ___ (2018): "Exploration of Sentiments and Genre in Spanish American Novels." In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Ciudad de México: Red de Humanidades Digitales, 399-403. https://dh2018.adho.org/exploration-of-sentiments-and-genre-in-spanish-american-novels/.
  • Henny-Krahmer, Ulrike, Katrin Betz, Daniel Schlör, and Andreas Hotho (2018): "Alternative Gattungstheorien. Das Prototypenmodell am Beispiel hispanoamerikanischer Romane." In DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts. Köln: Universität zu Köln, 105-112. http://doi.org/10.5281/zenodo.4622413.
  • Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp (2019): "The CLiGS textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in XML-TEI." Journal of the Text Encoding Initiative. https://journals.openedition.org/jtei/2085.
  • Schöch, Christof, Ulrike Henny, José Calvo Tello, Daniel Schlör, and Stefanie Popp (2016): "Topic, Genre, Text. Topics im Textverlauf von Untergattungen des spanischen und hispanoamerikanischen Romans (1880-1930)." In DHd 2016. Modellierung, Vernetzung, Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Leipzig: Universität Leipzig, 235-239. http://doi.org/10.5281/zenodo.4645381.
  • Zehe, Albin, Daniel Schlör, Ulrike Henny-Krahmer, Martin Becker, and Andreas Hotho (2018): "A White-Box Model for Detecting Author Nationality by Linguistic Differences in Spanish Novels." In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Ciudad de México: Red de Humanidades Digitales, 519-522. https://dh2018.adho.org/a-white-box-model-for-detecting-author-nationality-by-linguistic-differences-in-spanish-novels/.

Rights and citation suggestions

The works contained in this public corpus are in the public domain. They are provided here with the Public Domain Mark Declaration and can be re-used without restrictions. The XML-TEI markup is also considered to be free of any copyright and is provided with the same declaration. If you use texts from this collection for your research or teaching, we kindly ask you to reference this repository using the citation suggestion below and/or cite the reference publication indicated below.

Handling of works protected by copyright

Under German copyright law, some of the works that are part of the full corpus accompanying the dissertation are still under general copyright because their authors died less than 70 years ago. Furthermore, some of the source editions used are protected by ancillary copyright because they were published less than 25 years ago and their editors claimed copyright for them. This applies to 19 texts.

The corpus files for these works will be added to the public corpus as soon as the copyright expires. A table summarizing information that is relevant for the copyright status of all the files in the corpus, including the ones that are not published in this repository yet, can be viewed here. The entire corpus has been archived on Zenodo (see http://doi.org/10.5281/zenodo.4447468) with restricted access.

Citation suggestions

If you use this corpus, I kindly ask you to cite it either directly or by indicating the reference publication, as suggested below.

Citation suggestion for the corpus:

Henny-Krahmer, Ulrike (ed.) (2021). Corpus de novelas hispanoamericanas del siglo XIX (conha19). Version 1.0.1. Github.com. URL: https://github.com/cligs/conha19. DOI: https://doi.org/10.5281/zenodo.4766987.

Citation suggestion for the reference publication:

Henny-Krahmer, Ulrike (2023). Genre Analysis and Corpus Design: Nineteenth Century Spanish-American Novels (1830–1910). Dissertation, Universität Würzburg. https://doi.org/10.25972/OPUS-31999.

Contact

If you have any comments or suggestions on the corpus or would like to contribute to it, please open an issue or contact:

Ulrike Henny-Krahmer, ulrike.henny@web.de