
Added TAB-separated csv version of DICTLINE.GEN #92

Open · wants to merge 2 commits into master
Conversation

@gfranzini

Hi! First of all, thank you so much for making this data available on GitHub!

We've been working with the DICTLINE.GEN file in your repository, and for our research purposes we needed to transform it into a TAB-separated csv spreadsheet. In converting the space-separated fields of your original file, my colleague @Stormur noticed that the spacing is not consistent; this may or may not be relevant to you, but we thought we'd flag it up just in case. We also found some problematic characters (e.g., on lines 6883 and 37500) that prevented us from loading the csv file into a SQL table. Anyway, everything is working nicely now, and we thought we'd share the csv file with you. :)

Thanks again!

@mk270 (Owner) commented Feb 5, 2019

OK, the original DICTLINE.GEN is not in UTF-8; IIRC it's in Latin-1. You can convert it to UTF-8 with something like:

```
iconv -f LATIN1 -t UTF-8 DICTLINE.GEN
```

Alternatively, many SQL systems afford on-the-fly encoding conversion; I doubt the encoding prevents loading into SQL as long as you tell the loader which encoding the file is in.
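For instance, a minimal PostgreSQL sketch (the dictline table name and file path are hypothetical; the ENCODING option tells COPY to transcode on the fly):

```sql
-- Load a tab-separated file, declaring its source encoding;
-- PostgreSQL converts to the database encoding during the COPY.
COPY dictline FROM '/path/to/dictline.tsv'
    WITH (FORMAT csv, DELIMITER E'\t', ENCODING 'LATIN1');
```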

@Stormur - I'd be very interested to know how the spacing is inconsistent: the file contains fixed-width fields, and IIRC the WORDS programme depends on this, so any inconsistency may be introducing errors.

@ids1024 (Contributor) commented Feb 5, 2019

It's good to know about any spacing issues.

I don't think a csv file duplicating DICTLINE.gen is a good thing to add to this repository; it would go stale whenever DICTLINE.gen is updated. A script that automates the conversion might be more useful.

@mk270 (Owner) commented Feb 6, 2019

I think we'll go with @ids1024's suggestion unless there's a good reason to the contrary.

@mk270 closed this Feb 6, 2019
@Stormur commented Feb 7, 2019

Hi!

I am referring, of course, to the file I was able to download from here. The fields may well be fixed-width originally, but since only spaces appear in this version, conversion to a tsv/csv format was laborious and far from obvious, as there was no way

  • to tell a priori what the widths of the fields are;
  • to know whether certain elements belong in the same field or not, as appears to be the case for the morphological or semantic notation. The former does not always have the same number of elements, and in any case only a space serves as separator;
  • to separate the fields in the many lines where some of them are empty (see the sketch after this list).
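To illustrate the empty-field problem, a small sketch with a made-up record (fields "abc", "def", "", "xyz", each 5 characters wide): whitespace splitting silently drops the empty field, while slicing works only if the widths are already known.

```python
# Hypothetical fixed-width record; the third field is empty.
line = "abc  def       xyz"

print(line.split())
# -> ['abc', 'def', 'xyz']  (the empty third field is silently lost)

print([line[i:i + 5].strip() for i in (0, 5, 10, 15)])
# -> ['abc', 'def', '', 'xyz']  (recoverable only with known widths)
```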

Using tabs between fields and spaces only inside fields (e.g., in definitions) would instantly solve this representation problem and be more user-friendly. Do you think there would be any technical difficulty in adapting Whitaker's Words to this tab/space paradigm?

The inconsistencies I found were mostly double spaces where single ones were expected, mainly inside definitions; they broke the regular expressions I used for the tsv conversion. There was also, for example, an obviously wrong "in satiable".
Other inconsistencies were misplaced or badly formatted lines; e.g., one line for Jud is merged with exquisit.
Unfortunately, I did not keep a log of all such inconsistencies, but they are corrected in the new tsv file.

Regarding special characters, my suggestion, from personal experience, is to move directly to UTF-8 (or perhaps even "downgrade" to ASCII, if the special characters are not a big issue, since ASCII is still compatible with UTF-8). That would be more user-friendly and would not rely on external, possibly incorrect conversions (two conversions in total, counting the one to tsv).
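One quick way to locate such characters (a sketch assuming GNU grep; -P enables the Perl-style byte class):

```
grep -nP '[^\x00-\x7F]' DICTLINE.GEN
```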

Thank you!

@mk270 (Owner) commented Feb 7, 2019

@Stormur thanks very much for this clear explanation of what you're up to.

There are a couple of pieces of information that you need to have:

  • the file is consumed by a morphological analyser programme (that is the whole point)
  • the programme internally knows what the field widths are
  • it is very expensive to adapt the programme's behaviour
  • it is also (I believe) expensive to make the programme consume UTF-8

As @ids1024 says, there is also a latent objection to having two copies of the same data in the repository; if the original file is ever updated, we need to remember to update the CSV version as well.

I'm not sure what you mean by "external, possibly wrong conversions" - please expand on this.

I think having a TSV/CSV form of the data is genuinely useful for the broader uses the digital humanities community might put it to, but I'm not going to compromise basic software engineering good practice for it. Instead, we'll make a conversion tool that knows the exact field widths and outputs the data in exactly the form you need.

Once we have got this tool working, we can compare its output with your TSV file; that should enable us to pin down the inconsistencies you found. This is really useful, thank you!

Do you have a preference for the language used for such a tool, e.g., Python?

@mk270 reopened this Feb 7, 2019
@ids1024 (Contributor) commented Feb 7, 2019

Here's an initial conversion script, using the same format.

It produces the same output as the csv until the 186th entry, where differences begin, but the diffs can be compared.

```python
import csv
import sys

# Column titles and the fixed-width field offsets within each DICTLINE.GEN line.
titles = ("id", "word", "other-1", "other-2", "other-3", "pos",
          "morphology", "number", "code", "definitions")
field_idx = (0, 19, 38, 57, 76, 83, 97, 100, 110)

# Emit tab-separated values on stdout.
writer = csv.writer(sys.stdout, delimiter='\t', lineterminator="\n")
writer.writerow(titles)
with open('DICTLINE.GEN', encoding="latin-1") as dictline:
    for id_, l in enumerate(dictline, start=1):
        # Slice each line at the known offsets and strip the right-padding.
        fields = (l[start:end].rstrip()
                  for start, end in zip(field_idx, field_idx[1:] + (None,)))
        writer.writerow([id_] + list(fields))
```
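To try it (assuming the script is saved under a hypothetical name like dictline_to_tsv.py, in the same directory as DICTLINE.GEN):

```
python3 dictline_to_tsv.py > dictline.tsv
```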

Another reason to have an automated solution: otherwise it's essentially impossible to verify the correctness of this csv file. If there's a mistake in the conversion somewhere in the middle, how would we know?

Edit: The diff is very long when the id column is included, but using cut to remove it makes it shorter. Here's a diff (from the generated one, to the csv in this PR): https://gist.github.com/ids1024/f9e49bb6bf0d564bfe198f872de40d70
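For example, with hypothetical file names, bash process substitution and cut -f2- (which drops the first tab-separated column) give the shorter diff:

```
diff <(cut -f2- generated.tsv) <(cut -f2- pr-version.tsv)
```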

Edit: Updated to use python's csv module.

ids1024 added a commit to ids1024/whitakers-words that referenced this pull request Feb 7, 2019
@ids1024 (Contributor) commented Feb 7, 2019

Other inconsistencies were misplaced or badly formatted lines; e.g., one line for Jud is merged with exquisit.

#93 should fix this.

That makes the id numbers generated by my script the same as the csv here, so comparisons can be made more easily.

ids1024 added a commit to ids1024/whitakers-words that referenced this pull request Feb 7, 2019
@mk270 (Owner) commented Feb 10, 2019

OK, we have fixed the Jud/exquisit issue (I think); neither term shows up in Book 4 of the Aeneid, though... ;)

Thanks for reporting the error in the file.

ids1024 added three commits to ids1024/whitakers-words that referenced this pull request Feb 10, 2019
@asarhaddon (Contributor)

Hello.
DICTLINE.GEN contains only two non-ASCII characters, and both are errors (French equivalents where English words are expected).
#126 fixes this part of the issue.
