Corpus de novelas hispanoamericanas del siglo XIX (conha19)

The corpus conha19 consists of 256 novels written by Argentine, Cuban, and Mexican authors or published in the respective countries between 1830 and 1910. Of these novels, 234 are published in this repository, as they are in the public domain.

Conha19 was prepared for the dissertation "Genre Analysis and Corpus Design: Nineteenth Century Spanish-American Novels (1830–1910)", written by Ulrike Henny-Krahmer. The dissertation project was carried out as part of the junior research group "Computational Literary Genre Stylistics" (CLiGS), which was funded by the German Federal Ministry of Education and Research (BMBF) and hosted at the University of Würzburg between 2015 and 2020.

The corpus has been prepared primarily to allow for the analysis of subgenres, especially thematic subgenres (historical novel, sentimental novel, etc.) and literary currents (such as romantic, realist, and naturalistic novels). Some background information about the contents and preparation of the corpus is given in this README file. For further information see the list of related publications.

Overview of the novels in the corpus
Structure and contents of the repository
Related resources
Rights and citation suggestions
- Handling of works protected by copyright
- Citation suggestions
Contact

Overview of the novels in the corpus

The overviews given here apply to all 256 novels (including those that are still under copyright). In total, the texts amount to 18.3 million tokens. There are 108 Mexican, 99 Argentine, and 49 Cuban novels. The following figures show the distribution of novels per decade, first by country, then by thematic subgenre, and finally by literary current.

The novels were written by 121 different authors. Authors who are represented with 5 or more works are listed below:

author name	country	number of novels in the corpus
de Cuéllar, José Tomás	Mexico	9
Gutiérrez, Eduardo	Argentina	9
Gamboa, Federico	Mexico	8
Ocantos, Carlos María	Argentina	8
Gómez de Avellaneda, Gertrudis	Cuba	7
Calcagno, Francisco	Cuba	6
Paz, Ireneo	Mexico	6
Altamirano, Ignacio Manuel	Mexico	5
Ancona, Eligio	Mexico	5
Holmberg, Eduardo Ladislao	Argentina	5
Sicardi, Francisco	Argentina	5
Villaverde, Cirilo	Cuba	5

Structure and contents of the repository

This section lists the data contained in this repository. The novels are provided in three main formats: TEI, plain text, and linguistically annotated files:

tei: the TEI master files of the novels
txt: plain text files, extracted from the TEI master files
annotated: linguistically annotated files (in TEI)
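
As an illustration of how the txt files relate to the TEI master files, the following minimal sketch extracts paragraph text from a TEI document using only the Python standard library. It is not the script used to build the corpus: the function name `tei_body_to_text`, the restriction to `<p>` elements inside `<body>`, and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_to_text(tei_xml: str) -> str:
    """Return the text of all <p> elements in the TEI body, one per line.

    Hypothetical helper: the real extraction may treat headings, verse,
    notes, and other elements differently.
    """
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # itertext() also collects text nested in child elements such as <hi>;
        # split/join collapses runs of whitespace into single blanks.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Made-up sample document, not taken from the corpus.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Ejemplo</title></titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <p>Era una <hi rend="italic">noche</hi> oscura.</p>
    <p>Nadie   hablaba.</p>
  </div></body></text>
</TEI>"""

print(tei_body_to_text(sample))
```

The header is skipped because only elements below `<body>` are visited. To process a file from the tei folder, the file contents can be read and passed to the same function.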

There is additional material accompanying the novels' files:

metadata_free.csv: basic metadata about the 234 novels which are in the public domain and which are published here, in tabular format, including for example the CLiGS identifiers, shortcuts for authors and titles, publication years, and information about the subgenres of the texts
metadata_all.csv: basic metadata for all the 256 novels, including the ones which are not in the public domain yet
schema: a folder containing an external TEI keywords file and a Schematron file, which serve to control the metadata keywords used in the text classification section of the TEI header. The TEI schemas for the basic and the linguistically annotated TEI files are not given here because they correspond to the general CLiGS schemas, which are available in the CLiGS reference repository
bib/bibliography.xml: bibliography file (in TEI), holding full bibliographic references of literary historical works cited in the corpus files
spellcheck: lists with exception words and results of the spell check in CSV format, for the whole corpus and per novel
travelogues: three TEI files with travelogues which were not considered novels for the corpus, but were compared with the novels during the selection process
scripts: scripts used to check, clean, or summarize corpus data
plots: plots with summaries of corpus metadata
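
The metadata tables can be summarized with a few lines of Python, for example to count novels per decade and country. The sketch below is only an illustration: the column names `year` and `country`, the comma delimiter, and the sample rows are assumptions and need to be adapted to the actual header of metadata_free.csv or metadata_all.csv.

```python
import csv
import io
from collections import Counter


def novels_per_decade(csv_text, year_column="year", group_column="country"):
    """Count rows per (decade, group) pair.

    The default column names are hypothetical placeholders.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        # Integer division maps e.g. 1867 to the decade 1860.
        decade = int(row[year_column]) // 10 * 10
        counts[(decade, row[group_column])] += 1
    return counts


# Made-up rows for demonstration only; not taken from the corpus metadata.
sample_csv = """idno,year,country
x001,1867,Mexico
x002,1862,Mexico
x003,1885,Argentina
"""

counts = novels_per_decade(sample_csv)
for (decade, country), n in sorted(counts.items()):
    print(decade, country, n)
```

Passing a different `group_column` would give the same kind of summary for another metadata field, such as a subgenre column.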

Besides, there are further formats that were derived from the three main formats for specific analyses:

tei_ns: "tei no speech", subset of 92 files without direct speech mark-up (in TEI)
tei_ds: "tei direct speech", subset of 92 files with direct speech annotation based on a regular expression approach
tei_tokenized_ds: subset of 92 files as tokenized text with two stand-off direct speech annotations (DS_gold: semi-automatically created gold standard, DS_reg: automatically created RegExp-based annotation), in TEI
annotated_corr: linguistically annotated files (in TEI) with corrected POS annotation for verb forms with enclitic pronouns
txt_annotated: plain text files, extracted from the corrected linguistically annotated TEI files (annotated_corr); named entities are replaced with the token ENTITY
txt_annotated_corr: plain text files derived from txt_annotated; converted to lower case; blank spaces that precede punctuation marks (comma, full stop, etc.) are removed
txt_annotated_nouns: plain text files derived from the corrected linguistically annotated TEI files (annotated_corr); only nouns are kept
txt_annotated_stop: plain text files derived from txt_annotated_corr; stop words are removed
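
The derivation steps described for txt_annotated_corr and txt_annotated_stop can be sketched as follows. This is a minimal approximation under stated assumptions, not the code used for the corpus: the regular expression, the set of punctuation marks, the function names, and the tiny stop word list are all chosen for this example.

```python
import re

# Assumed set of closing punctuation marks; the original scripts may differ.
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")


def normalize(text: str) -> str:
    """Lower-case the text and remove blanks that precede punctuation marks."""
    return SPACE_BEFORE_PUNCT.sub(r"\1", text.lower())


def remove_stop_words(text: str, stop_words: set) -> str:
    """Drop whitespace-separated tokens whose bare form is a stop word."""
    kept = [
        token
        for token in text.split()
        # Attached punctuation is ignored for the comparison only.
        if token.strip(",.;:!?") not in stop_words
    ]
    return " ".join(kept)


# Hypothetical stop word list and made-up sentence for demonstration.
stop_words = {"la", "el", "y"}
normalized = normalize("La casa , el perro y la noche .")
print(normalized)
print(remove_stop_words(normalized, stop_words))
```

Opening marks such as ¿ and ¡ are left untouched by this pattern, since a blank in front of them is regular Spanish spacing.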

Related resources

Repositories

This repository is related to three other GitHub repositories:

Bib-ACMé is a digital bibliography containing information about the novels published in Argentina, Cuba, and Mexico between 1830 and 1910. This bibliography constitutes the sampling frame for the corpus Conha19, so it aims to represent the whole population of 19th-century novels published in the three countries.

scripts-nh contains XSLT and Python scripts which were used for the creation, annotation, and documentation of the corpus and for the analysis of the novels in the corpus.

data-nh holds research data that resulted from applying the scripts of scripts-nh to the corpus files.

Datasets

The following links point to versions of the corpus in other formats, which are published elsewhere and are not part of this repository.

TXM corpus: Conha19 in a binary format suitable for the text analysis tool TXM (see http://textometrie.ens-lyon.fr/).

Publications

This corpus or parts of it have been described and/or used for analyses in the following publications:

Reference publication:

Henny-Krahmer, Ulrike (2023). Genre Analysis and Corpus Design: Nineteenth Century Spanish-American Novels (1830–1910). Dissertation, Universität Würzburg. https://doi.org/10.25972/OPUS-31999.

Other publications:

Calvo Tello, José, Ulrike Henny-Krahmer, and Christof Schöch (2018): "Textbox: análisis del léxico mediante corpus literarios". In Historia del léxico español y Humanidades digitales. Edited by Dolores Corbella, Alejandro Fajardo, and Jutta Langenbacher. Berlin: Peter Lang, 225-253. https://dialnet.unirioja.es/servlet/articulo?codigo=7081640.
Calvo Tello, José, Daniel Schlör, Ulrike Henny-Krahmer, and Christof Schöch (2017): "Neutralising the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels". In Digital Humanities 2017. Conference Abstracts. Montréal: McGill University & Université de Montréal, 181-184. https://dh2017.adho.org/abstracts/037/037.pdf.
Henny-Krahmer, Ulrike (forthcoming): "Family Resemblance in Genre Stylistics: A Case Study with 19th Century Spanish American Novels." In Digital Stylistics in Romance Studies and Beyond. Edited by Robert Hesselbach, José Calvo Tello, Ulrike Henny-Krahmer, Daniel Schlör, and Christof Schöch. Heidelberg: heiUP.
___ (2022): "Novelas originales y americanas. A Digital Analysis of References to Identity in Subtitles of Spanish American 19th Century Novels." apropos [Perspektiven auf die Romania] 9: 14-36. https://doi.org/10.15460/apropos.9.1893.
___ (2021): "Time for Genre. Temporal Expressions as Features for the Classification of Literary Subgenres." EADH2021. https://eadh2021.culintec.de/HENNY_KRAHMER_Ulrike_Time_for_Genre__Temporal_Expressions_as.html.
___ (2018): "Exploration of Sentiments and Genre in Spanish American Novels." In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Ciudad de México: Red de Humanidades Digitales, 399-403. https://dh2018.adho.org/exploration-of-sentiments-and-genre-in-spanish-american-novels/.
Henny-Krahmer, Ulrike, Katrin Betz, Daniel Schlör, and Andreas Hotho (2018): "Alternative Gattungstheorien. Das Prototypenmodell am Beispiel hispanoamerikanischer Romane." In DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts. Köln: Universität zu Köln, 105-112. http://doi.org/10.5281/zenodo.4622413.
Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp (2019): "The CLiGS textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in XML-TEI." Journal of the Text Encoding Initiative. https://journals.openedition.org/jtei/2085.
Schöch, Christof, Ulrike Henny, José Calvo Tello, Daniel Schlör, and Stefanie Popp (2016): "Topic, Genre, Text. Topics im Textverlauf von Untergattungen des spanischen und hispanoamerikanischen Romans (1880-1930)." In DHd 2016. Modellierung, Vernetzung, Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Leipzig: Universität Leipzig, 235-239. http://doi.org/10.5281/zenodo.4645381.
Zehe, Albin, Daniel Schlör, Ulrike Henny-Krahmer, Martin Becker, and Andreas Hotho (2018): "A White-Box Model for Detecting Author Nationality by Linguistic Differences in Spanish Novels." In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Ciudad de México: Red de Humanidades Digitales, 519-522. https://dh2018.adho.org/a-white-box-model-for-detecting-author-nationality-by-linguistic-differences-in-spanish-novels/.

Rights and citation suggestions

The works contained in this public corpus are in the public domain. They are provided here with the Public Domain Mark Declaration and can be re-used without restrictions. The XML-TEI markup is also considered to be free of any copyright and is provided with the same declaration. If you use texts from this collection for your research or teaching, we kindly ask you to reference this repository using the citation suggestion below and/or cite the reference publication indicated below.

Handling of works protected by copyright

According to German copyright law, some of the works that are part of the full corpus accompanying the dissertation are still under general copyright because their authors died less than 70 years ago. Furthermore, some of the source editions used are protected by ancillary copyright because they were published less than 25 years ago and the editors claimed copyright for them. This applies to 19 texts.

The corpus files for these works will be added to the public corpus as soon as the copyright expires. A table summarizing information that is relevant for the copyright status of all the files in the corpus, including the ones that are not published in this repository yet, can be viewed here. The entire corpus has been archived on Zenodo (see http://doi.org/10.5281/zenodo.4447468) with restricted access.

Citation suggestions

If you use this corpus, I kindly ask you to cite it either directly or by indicating the reference publication, as suggested below.

Citation suggestion for the corpus:

Henny-Krahmer, Ulrike (ed.) (2021). Corpus de novelas hispanoamericanas del siglo XIX (conha19). Version 1.0.1. Github.com. URL: https://github.com/cligs/conha19. DOI: https://doi.org/10.5281/zenodo.4766987.

Citation suggestion for the reference publication:

Henny-Krahmer, Ulrike (2023). Genre Analysis and Corpus Design: Nineteenth Century Spanish-American Novels (1830–1910). Dissertation, Universität Würzburg. https://doi.org/10.25972/OPUS-31999.

Contact

If you have any comments or suggestions on the corpus or would like to contribute to it, please leave an issue or contact:

Ulrike Henny-Krahmer, ulrike.henny@web.de
