Added TAB-separated csv version of DICTLINE.GEN #92
base: master
Conversation
Ok, the original DICTLINE.GEN is not in UTF-8; IIRC it's in Latin-1. You can convert to UTF-8 by doing something like
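A minimal sketch of what that decode/re-encode step looks like (the sample byte string below is mine, purely illustrative; for the whole file, `iconv -f LATIN1 -t UTF-8 DICTLINE.GEN` or opening it in Python with `encoding="latin-1"` and writing it back out as UTF-8 does the same job):

```python
# The Latin-1 byte 0xE6 is "æ"; after re-encoding, it becomes the
# two-byte UTF-8 sequence 0xC3 0xA6.
raw = b"pr\xe6da"             # "præda" encoded in Latin-1
text = raw.decode("latin-1")  # decode using the source encoding
utf8 = text.encode("utf-8")   # re-encode as UTF-8
print(utf8)                   # b'pr\xc3\xa6da'
```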
Alternatively, many SQL systems afford on-the-fly encoding conversion; I doubt the encoding prevents loading into SQL if you tell the loader what encoding it's supposed to be in. @Stormur - I'd be very interested to know how the spacing is inconsistent: the file contains fixed-width fields, and IIRC the WORDS programme depends on this, so inconsistency may be introducing errors.
It's good to know about any spacing issues. I don't think a csv file duplicating the same data belongs in the repository, though.
I think we'll go with @ids1024's suggestion unless there's a good reason to the contrary.
Hi! I refer of course to the file that I was able to download from here. There might be fixed-width fields originally, but since only spaces are found in this version, conversion to a tsv/csv format was laborious and not obvious, as there was no way to tell which spaces separated fields and which belonged inside a field.
Using tab characters between fields and spaces only inside fields (e.g. in definitions) would instantly solve this representation problem and be more user-friendly. Do you think there would be any technical difficulty in readapting Whitaker's Words to this tab/space paradigm?

The inconsistencies I found were lots of double spaces instead of single ones, mostly inside definitions; they broke the regular expressions that I used for the tsv conversion. There was e.g. also an obviously wrong "in satiable".

Regarding special characters, my suggestion, coming from personal experience, is to pass directly to UTF-8 (or maybe even "downgrade" to ASCII, if those special characters are not a big issue, as it would still be compatible with UTF-8), which is more user-friendly and wouldn't rely on external, possibly wrong conversions (totalling two conversions, counting the one to tsv). Thank you!
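To illustrate why those double spaces bite (a simplified, made-up row; real DICTLINE.GEN rows are much wider): a regex that treats runs of two or more spaces as field separators will also split inside a definition that happens to contain a double space.

```python
import re

# Hypothetical four-field row, space-padded; the definition field
# contains a stray double space ("not  satiable").
row = "satio      satiare    V   not  satiable; fill"

# Naive approach: treat runs of 2+ spaces as field separators.
fields = re.split(r" {2,}", row)
print(fields)
# Five pieces instead of four: the double space inside the
# definition was mistaken for a field boundary.
```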
@Stormur thanks very much for this clear explanation of what you're up to. There are a couple of pieces of information that you need to have:
As @ids1024 says, there is also a latent objection to having two copies of the same data in the repository: if the original file is ever updated, we need to remember to update the CSV version as well. I'm not sure what you mean by "external, possibly wrong conversions" - please expand on this.

I think having a TSV/CSV form of the data is genuinely useful, for the broader uses that the digital humanities community might put it to, but there is no way I'm going to compromise basic software engineering good practice for that. Instead, we'll make a conversion tool that knows the exact field widths and outputs the data in the exact form you need. Once we have got this tool working, we can compare its output with your TSV file, which should let us pin down the inconsistencies you found.

This is really useful, thank you! Do you have a preference for the language used for such a tool, e.g., Python?
Here's an initial conversion script, using the same format. It produces the same output as the csv here:

```python
import csv
import sys

# Output column titles, and the fixed start offset of each field
# in DICTLINE.GEN.
titles = ("id", "word", "other-1", "other-2", "other-3", "pos",
          "morphology", "number", "code", "definitions")
field_idx = (0, 19, 38, 57, 76, 83, 97, 100, 110)

writer = csv.writer(sys.stdout, delimiter='\t', lineterminator="\n")
writer.writerow(titles)

with open('DICTLINE.GEN', encoding="latin-1") as dictline:
    id_ = 1
    for l in dictline:
        # Slice each line at the fixed offsets and strip the space padding.
        fields = (l[start:end].rstrip()
                  for start, end in zip(field_idx, field_idx[1:] + (None,)))
        writer.writerow([id_] + list(fields))
        id_ += 1
```
Another reason to have an automated solution: otherwise it's essentially impossible to verify the correctness of this csv file. If there's a mistake in the conversion somewhere in the middle, how would we know?
Edit: Updated to use Python's csv module.
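One way to make that verification concrete (a sketch; the `first_difference` helper and both file paths are hypothetical): compare the script's output against the manually converted file line by line and report the first row where they diverge.

```python
def first_difference(path_a, path_b):
    """Return (line number, line_a, line_b) at the first mismatch, or None."""
    with open(path_a, encoding="utf-8") as a, open(path_b, encoding="utf-8") as b:
        for lineno, (la, lb) in enumerate(zip(a, b), start=1):
            if la != lb:
                return lineno, la, lb
    return None
```

Something like `first_difference("generated.tsv", "pr_version.tsv")` would then point straight at any row where the two conversions disagree.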
#93 should fix this. That makes the id numbers generated by my script the same as the csv here, so comparisons can be made more easily.
Ok, we have fixed the Jud/exquisit issue (I think) - neither term shows up in Book 4 of the Aeneid though ... ;) Thanks for reporting the error in the file.
Fix from csv in PR mk270#92
Hello.
Hi! First of all, thank you so much for making this data available on GitHub!
We've been working with the `DICTLINE.GEN` file in your repository and, for our research purposes, needed to transform it into a TAB-separated csv spreadsheet. In converting the space-separated fields in your original file, my colleague @Stormur noticed that the spacing is not consistent; this issue may or may not be relevant to you, but we thought we'd flag it up just in case. We also found some problematic characters (e.g., in lines 6883 and 37500), which prevented us from successfully loading the csv file into a SQL table. Anyway, everything is working nicely now and we thought we'd share the csv file with you. :) Thanks again!
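For what it's worth, loading such a TSV into a SQL table can be sketched with Python's standard library alone (the sample row, table name, and column list below are illustrative assumptions, not the real dictline schema):

```python
import csv
import io
import sqlite3

# Illustrative TAB-separated data with a title row, standing in for
# the converted DICTLINE.GEN export.
sample = "id\tword\tpos\tdefinitions\n1\tsatio\tV\tsatisfy, sate\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictline (id INTEGER, word TEXT, pos TEXT, definitions TEXT)")

reader = csv.reader(io.StringIO(sample), delimiter="\t")
next(reader)  # skip the title row
conn.executemany("INSERT INTO dictline VALUES (?, ?, ?, ?)", reader)

print(conn.execute("SELECT word FROM dictline").fetchall())  # [('satio',)]
```

Telling the loader the delimiter and the (correct) encoding up front is usually enough; stray characters only break the import when the file is read with the wrong encoding.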