Add dictline_csv.py script, based on PR #92 #96

Open: wants to merge 1 commit into master

Contributor

@ids1024 ids1024 commented Feb 10, 2019

There are still a couple of differences from the csv file in the other PR.

One is fixed in #95, so ignoring that...

The csv in the PR replaces some double spaces in definitions with single spaces, which is probably mostly a good thing. Some of the extra spaces look clearly unintentional, while others may be a matter of style. This is worth looking into.

Ignoring whitespace changes with diff -b, the difference is small enough I'll just paste it here:

249c249
< 248	absit				INTERJ			E E X C E	"god forbid, ""let it be far from the hearts of the faithful"";"
---
> 248	absit				INTERJ			E E X C E	"""god forbid"", ""let it be far from the hearts of the faithful"";"
973c973
< 972	adgnit	adgnit			N	4 1 M T		X D X F O	recognition (drama);
---
> 972	adgnit	adgnit			N	4 1 M T		X D X F O	"""recognition"" (drama);"
2337c2337
< 2336	agnit	agnit			N	4 1 M T		X D X F O	recognition (drama);
---
> 2336	agnit	agnit			N	4 1 M T		X D X F O	"""recognition"" (drama);"
4271c4271
< 4270	apsit				INTERJ			E E X C E	"god forbid, ""let it be far from the hearts of the faithful"";"
---
> 4270	apsit				INTERJ			E E X C E	"""god forbid"", ""let it be far from the hearts of the faithful"";"
6884c6884
< 6883	bovid	bovid			N	1 1 F T		G X X E K	bovid;
---
> 6883	bovid	bovid			N	1 1 F T		G X X E K	bovidé;
7032c7032
< 7031	bu	bu			N	1 1 F T		X X X F S	bubbub; (natural sound made by infants asking for drink);
---
> 7031	bu	bu			N	1 1 F T		X X X F S	"""bubbub""; (natural sound made by infants asking for drink);"
11559c11559
< 11558	commosis	commos			N	3 3 F T		X A X N O	gumming; (said to be first layer in construction of honeycombs);
---
> 11558	commosis	commos			N	3 3 F T		X A X N O	"""gumming""; (said to be first layer in construction of honeycombs);"
17651c17651
< 17650	Didym	Didym			N	2 1 M P		E E H E E	twin, apostle Thomas;
---
> 17650	Didym	Didym			N	2 1 M P		E E H E E	"""twin"", apostle Thomas;"
20310c20310
< 20309	FALSO				ADV	POS		F X X E E	falsely; deceptively; spuriously;
---
> 20309	falso				ADV	POS		F X X E E	falsely; deceptively; spuriously;
22922c22922
< 22921	implacat	implacat			ADJ	1 1 POS		X X X D X	not appeased, insatiable;
---
> 22921	implacat	implacat			ADJ	1 1 POS		X X X D X	not appeased, in  satiable;
30538c30538
< 30537	pil	pil			N	2 1 M P		X W X E O	chief; [primipilus/primi pili centurio => first/primary centurion of legion];
---
> 30537	pil	pil			N	2 1 M P		X W X E O	"""chief""; [primipilus/primi pili centurio => first/primary centurion of legion];"
37501c37501
< 37500	trabuc	trabuc			N	2 1 M T		G W X E K	trebuchet (machine of war);
---
> 37500	trabuc	trabuc			N	2 1 M T		G W X E K	trébuchet (machine of war);
38698c38698
< 38697	VERO				ADV	POS		X X X A X	yes; in truth; certainly; truly, to be sure; however;
---
> 38697	vero				ADV	POS		X X X A X	yes; in truth; certainly; truly, to be sure; however;

The csv in the PR strips the non-ASCII characters, while my script re-encodes them correctly. This seems clearly better.
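
To illustrate the difference, here is a minimal sketch, assuming the source file is Latin-1 encoded. The sample bytes and variable names are made up for illustration and are not taken from dictline_csv.py:

```python
# Hypothetical sample: the bytes of one definition as they might appear in a
# Latin-1 encoded DICTLINE.GEN (assumption: the file is Latin-1).
raw = b"bovid\xe9;"

# Decoding as Latin-1 keeps the accented character.
text = raw.decode("latin-1")

# Dropping the non-ASCII bytes instead loses it, as in the other csv.
stripped = raw.decode("ascii", errors="ignore")

# Re-encode as UTF-8 for the output file.
utf8_bytes = text.encode("utf-8")

print(stripped)    # the accent is gone
print(utf8_bytes)  # the accent survives as a two-byte UTF-8 sequence
```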

The csv file in the PR seems to strip some " characters in definitions, while the Python csv library wraps the definition on those lines in quotes and escapes the quotes inside as "". This seems clearly better, assuming most csv libraries support it.
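
A small sketch of that quoting behaviour, using the Python standard library csv module with a tab delimiter. The row is trimmed to three columns for brevity and the variable names are illustrative only:

```python
import csv
import io

# One row, using the "absit" definition from the diff above (columns trimmed
# for brevity; the real file has more fields).
row = ["248", "absit",
       '"god forbid", "let it be far from the hearts of the faithful";']

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")  # default quoting is csv.QUOTE_MINIMAL
writer.writerow(row)
line = buf.getvalue()

# Reading the line back recovers the original definition unchanged.
parsed = next(csv.reader(io.StringIO(line), delimiter="\t"))

print(line, end="")
print(parsed == row)
```

Only the field that contains a quote character gets wrapped; the other fields are written bare.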

For some reason DICTLINE.GEN.csv from the PR has FALSO and VERO capitalized, although they aren't in DICTLINE.GEN. I don't know if this is a mistake or done for some reason.

I don't know whether it would be better to use commas instead of tabs, to change or remove the column titles, or whether we really need an id column, etc. I've kept all of that the same for now to make diffing easy.

@gfranzini @Stormur any comments on this?

@mk270
Owner

mk270 commented Feb 10, 2019

I presume FALSO and VERO are Italian for false and true?

@mk270
Owner

mk270 commented Feb 10, 2019

OK, so it looks like the changes fall into the following categories:

  • spurious spaces in the original ("in satiable")
  • text in Latin1 encoding
  • text involving double quotes
  • Latin stems that coincide with modern Italian words for "true" and "false", which doubtless confused some naive spreadsheet or csv library

I think we want to fix the first of these, and ignore all the rest.
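
As a rough sketch of how the spurious spaces could be located for review (the function names and sample list are hypothetical, not part of the script in this PR):

```python
import re

# Two or more consecutive spaces inside a definition.
MULTI_SPACE = re.compile(r" {2,}")

def find_multi_space(definitions):
    """Return (index, definition) pairs whose text has a run of 2+ spaces."""
    return [(i, d) for i, d in enumerate(definitions) if MULTI_SPACE.search(d)]

def collapse_spaces(definition):
    """Collapse each run of spaces to a single space."""
    return MULTI_SPACE.sub(" ", definition)

# Sample definitions taken from the diff earlier in this thread.
samples = ["not appeased, in  satiable;", "bovid;"]
flagged = find_multi_space(samples)
print(flagged)
print(collapse_spaces(samples[0]))
```

Collapsing alone turns "in  satiable" into "in satiable", not "insatiable", so entries like that one would still need a manual fix in the source data.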

@ids1024
Contributor Author

ids1024 commented Feb 10, 2019

> Latin stems that coincide with modern Italian words for "true" and "false", which doubtless confused some naive spreadsheet or csv library

Ah, yeah, that's probably it. I hadn't thought of that. Wrapping those in quotes might fix it; but that's not necessarily something we want to do.
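
For reference, a sketch of what quoting every field would look like with the standard library csv module. Whether a given spreadsheet then treats the quoted value as plain text is an assumption that has not been checked here:

```python
import csv
import io

# The "falso" row from the diff, trimmed to a few columns for brevity.
row = ["20309", "falso", "ADV", "falsely; deceptively; spuriously;"]

buf = io.StringIO()
# csv.QUOTE_ALL wraps every field in quotes, not just the ones that need it.
csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_ALL).writerow(row)
quoted_line = buf.getvalue()

print(quoted_line, end="")
```

This makes every line of the output differ from the existing csv, which is one reason it may not be worth doing.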

> spurious spaces in the original ("in satiable")

Note that there are far more of these; the others are omitted from the diff above because of the -b argument.

> I think we want to fix the first of these, and ignore all the rest

Yep. Or more precisely, the script already handles the encoding and the quotes properly, provided the csv loader can handle quoting and UTF-8.
