
Added TAB-separated csv version of DICTLINE.GEN #92

Open · wants to merge 2 commits into master
Conversation

@gfranzini

Hi! First of all, thank you so much for making this data available on GitHub!

We've been working with the DICTLINE.GEN file in your repository, and for our research purposes we needed to transform it into a TAB-separated csv spreadsheet. In converting the space-separated fields of your original file, my colleague @Stormur noticed that the spacing is not consistent; this may or may not be relevant to you, but we thought we'd flag it up just in case. We also found some problematic characters (e.g., on lines 6883 and 37500) that prevented us from loading the csv file into a SQL table. Anyway, everything is working nicely now, and we thought we'd share the csv file with you. :)

Thanks again!

@mk270 (Owner) commented Feb 5, 2019

OK, the original DICTLINE.GEN is not in UTF-8; IIRC it's in Latin-1. You can convert it to UTF-8 with something like:

```
iconv -f LATIN1 -t UTF-8 DICTLINE.GEN
```

Alternatively, many SQL systems afford on-the-fly encoding conversion; I doubt the encoding prevents loading into SQL as long as you tell the loader which encoding the file is in.
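For instance, a minimal PostgreSQL sketch (the dictline table name and file path are hypothetical; the ENCODING option tells COPY to transcode on the fly):

```sql
-- Load a tab-separated file, declaring its source encoding;
-- PostgreSQL converts to the database encoding during the COPY.
COPY dictline FROM '/path/to/dictline.tsv'
    WITH (FORMAT csv, DELIMITER E'\t', ENCODING 'LATIN1');
```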

@Stormur - I'd be very interested to know how the spacing is inconsistent: the file contains fixed-width fields, and IIRC the WORDS programme depends on this, so any inconsistency may be introducing errors.

@ids1024 (Contributor) commented Feb 5, 2019

It's good to know about any spacing issues.

I don't think a csv file duplicating DICTLINE.gen is a good thing to add to this repository; it would go stale whenever DICTLINE.gen is updated. A script that automates the conversion might be more useful.

@mk270 (Owner) commented Feb 6, 2019

I think we'll go with @ids1024's suggestion unless there's a good reason to the contrary.

@mk270 closed this Feb 6, 2019
@Stormur commented Feb 7, 2019

Hi!

I am referring, of course, to the file I was able to download from here. The fields may well be fixed-width originally, but since only spaces appear in this version, conversion to a tsv/csv format was laborious and far from obvious, as there was no way

  • to tell a priori what the widths of the fields are;
  • to know whether certain elements belong in the same field or not, as appears to be the case for the morphological or semantic notation. The former does not always have the same number of elements, and in any case only a space serves as separator;
  • to separate the fields in the many lines where some of them are empty (see the sketch after this list).
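To illustrate the empty-field problem, a small sketch with a made-up record (fields "abc", "def", "", "xyz", each 5 characters wide): whitespace splitting silently drops the empty field, while slicing works only if the widths are already known.

```python
# Hypothetical fixed-width record; the third field is empty.
line = "abc  def       xyz"

print(line.split())
# -> ['abc', 'def', 'xyz']  (the empty third field is silently lost)

print([line[i:i + 5].strip() for i in (0, 5, 10, 15)])
# -> ['abc', 'def', '', 'xyz']  (recoverable only with known widths)
```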

Using tabs between fields and spaces only inside fields (e.g., in definitions) would instantly solve this representation problem and be more user-friendly. Do you think there would be any technical difficulty in adapting Whitaker's Words to this tab/space paradigm?

The inconsistencies I found were mostly double spaces where single ones were expected, mainly inside definitions; they broke the regular expressions I used for the tsv conversion. There was also, for example, an obviously wrong "in satiable".
Other inconsistencies were misplaced or badly formatted lines; e.g., one line for Jud is merged with exquisit.
Unfortunately, I did not keep a log of all such inconsistencies, but they are corrected in the new tsv file.

Regarding special characters, my suggestion, from personal experience, is to move directly to UTF-8 (or perhaps even "downgrade" to ASCII, if the special characters are not a big issue, since ASCII is still compatible with UTF-8). That would be more user-friendly and would not rely on external, possibly incorrect conversions (two conversions in total, counting the one to tsv).
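One quick way to locate such characters (a sketch assuming GNU grep; -P enables the Perl-style byte class):

```
grep -nP '[^\x00-\x7F]' DICTLINE.GEN
```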

Thank you!

@mk270 (Owner) commented Feb 7, 2019

@Stormur thanks very much for this clear explanation of what you're up to.

There are a couple of pieces of information that you need to have:

  • the file is consumed by a morphological analyser programme (that is the whole point)
  • the programme internally knows what the field widths are
  • it is very expensive to adapt the programme's behaviour
  • it is also (I believe) expensive to make the programme consume UTF-8

As @ids1024 says, there is also a latent objection to having two copies of the same data in the repository; if the original file is ever updated, we need to remember to update the CSV version as well.

I'm not sure what you mean by "external, possibly wrong conversions" - please expand on this.

I think having a TSV/CSV form of the data is genuinely useful for the broader uses the digital humanities community might put it to, but I'm not going to compromise basic software engineering good practice for it. Instead, we'll make a conversion tool that knows the exact field widths and outputs the data in exactly the form you need.

Once we have got this tool working, we can compare its output with your TSV file; that should enable us to pin down the inconsistencies you found. This is really useful, thank you!

Do you have a preference for the language used for such a tool, e.g., Python?

@mk270 reopened this Feb 7, 2019
@ids1024 (Contributor) commented Feb 7, 2019

Here's an initial conversion script, using the same format.

It produces the same output as the csv until the 186th entry, where differences begin, but the diffs can be compared.

```python
import csv
import sys

# Column titles and the fixed-width field offsets within each DICTLINE.GEN line.
titles = ("id", "word", "other-1", "other-2", "other-3", "pos",
          "morphology", "number", "code", "definitions")
field_idx = (0, 19, 38, 57, 76, 83, 97, 100, 110)

# Emit tab-separated values on stdout.
writer = csv.writer(sys.stdout, delimiter='\t', lineterminator="\n")
writer.writerow(titles)
with open('DICTLINE.GEN', encoding="latin-1") as dictline:
    for id_, l in enumerate(dictline, start=1):
        # Slice each line at the known offsets and strip the right-padding.
        fields = (l[start:end].rstrip()
                  for start, end in zip(field_idx, field_idx[1:] + (None,)))
        writer.writerow([id_] + list(fields))
```
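To try it (assuming the script is saved under a hypothetical name like dictline_to_tsv.py, in the same directory as DICTLINE.GEN):

```
python3 dictline_to_tsv.py > dictline.tsv
```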

Another reason to have an automated solution: otherwise it's essentially impossible to verify the correctness of this csv file. If there's a mistake in the conversion somewhere in the middle, how would we know?

Edit: The diff is very long when the id column is included, but using cut to remove it makes it shorter. Here's a diff (from the generated one, to the csv in this PR): https://gist.github.com/ids1024/f9e49bb6bf0d564bfe198f872de40d70
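For example, with hypothetical file names, bash process substitution and cut -f2- (which drops the first tab-separated column) give the shorter diff:

```
diff <(cut -f2- generated.tsv) <(cut -f2- pr-version.tsv)
```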

Edit: Updated to use python's csv module.

ids1024 added a commit to ids1024/whitakers-words that referenced this pull request Feb 7, 2019
@ids1024 (Contributor) commented Feb 7, 2019

Other inconsistencies were misplaced or badly formatted lines; e.g., one line for Jud is merged with exquisit.

#93 should fix this.

That makes the id numbers generated by my script the same as the csv here, so comparisons can be made more easily.

ids1024 added a commit to ids1024/whitakers-words that referenced this pull request Feb 7, 2019
@mk270 (Owner) commented Feb 10, 2019

OK, we have fixed the Jud/exquisit issue (I think); neither term shows up in Book 4 of the Aeneid, though... ;)

Thanks for reporting the error in the file.

ids1024 added three commits to ids1024/whitakers-words that referenced this pull request Feb 10, 2019
@asarhaddon (Contributor)

Hello.
DICTLINE.GEN contains only two non-ASCII characters, and both are errors (French equivalents where English words are expected).
#126 fixes this part of the issue.
