Add dictline_csv.py script, based on PR #92 #96

ids1024 · 2019-02-10T19:33:34Z

There are still a couple differences from the csv file in the other PR.

One is fixed in #95, so ignoring that...

The csv in the PR replaces some double spaces in definitions with single spaces, which is probably at least mostly good. Some of the extra spaces look clearly unintentional, while others may be a matter of style. This is worth looking into.
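A minimal sketch of how entries with runs of spaces could be listed for review. The helper name `find_double_spaces` and the row layout (id first, definition last) are assumptions for illustration, not part of dictline_csv.py.

```python
import re

# Two or more consecutive spaces inside a definition.
DOUBLE_SPACE = re.compile(r" {2,}")


def find_double_spaces(rows):
    """Yield (id, definition) for rows whose definition has a run of spaces.

    `rows` is assumed to be an iterable of field lists with the id first
    and the definition last (hypothetical layout, for illustration only).
    """
    for row in rows:
        if DOUBLE_SPACE.search(row[-1]):
            yield row[0], row[-1]


sample = [
    ["22921", "implacat", "not appeased, in  satiable;"],
    ["6883", "bovid", "bovid;"],
]
flagged = list(find_double_spaces(sample))
print(flagged)
```

Each flagged entry would still need a human decision, since some double spaces may be deliberate.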

Ignoring whitespace changes with diff -b, the difference is small enough I'll just paste it here:

249c249
< 248	absit				INTERJ			E E X C E	"god forbid, ""let it be far from the hearts of the faithful"";"
---
> 248	absit				INTERJ			E E X C E	"""god forbid"", ""let it be far from the hearts of the faithful"";"
973c973
< 972	adgnit	adgnit			N	4 1 M T		X D X F O	recognition (drama);
---
> 972	adgnit	adgnit			N	4 1 M T		X D X F O	"""recognition"" (drama);"
2337c2337
< 2336	agnit	agnit			N	4 1 M T		X D X F O	recognition (drama);
---
> 2336	agnit	agnit			N	4 1 M T		X D X F O	"""recognition"" (drama);"
4271c4271
< 4270	apsit				INTERJ			E E X C E	"god forbid, ""let it be far from the hearts of the faithful"";"
---
> 4270	apsit				INTERJ			E E X C E	"""god forbid"", ""let it be far from the hearts of the faithful"";"
6884c6884
< 6883	bovid	bovid			N	1 1 F T		G X X E K	bovid;
---
> 6883	bovid	bovid			N	1 1 F T		G X X E K	bovidé;
7032c7032
< 7031	bu	bu			N	1 1 F T		X X X F S	bubbub; (natural sound made by infants asking for drink);
---
> 7031	bu	bu			N	1 1 F T		X X X F S	"""bubbub""; (natural sound made by infants asking for drink);"
11559c11559
< 11558	commosis	commos			N	3 3 F T		X A X N O	gumming; (said to be first layer in construction of honeycombs);
---
> 11558	commosis	commos			N	3 3 F T		X A X N O	"""gumming""; (said to be first layer in construction of honeycombs);"
17651c17651
< 17650	Didym	Didym			N	2 1 M P		E E H E E	twin, apostle Thomas;
---
> 17650	Didym	Didym			N	2 1 M P		E E H E E	"""twin"", apostle Thomas;"
20310c20310
< 20309	FALSO				ADV	POS		F X X E E	falsely; deceptively; spuriously;
---
> 20309	falso				ADV	POS		F X X E E	falsely; deceptively; spuriously;
22922c22922
< 22921	implacat	implacat			ADJ	1 1 POS		X X X D X	not appeased, insatiable;
---
> 22921	implacat	implacat			ADJ	1 1 POS		X X X D X	not appeased, in  satiable;
30538c30538
< 30537	pil	pil			N	2 1 M P		X W X E O	chief; [primipilus/primi pili centurio => first/primary centurion of legion];
---
> 30537	pil	pil			N	2 1 M P		X W X E O	"""chief""; [primipilus/primi pili centurio => first/primary centurion of legion];"
37501c37501
< 37500	trabuc	trabuc			N	2 1 M T		G W X E K	trebuchet (machine of war);
---
> 37500	trabuc	trabuc			N	2 1 M T		G W X E K	trébuchet (machine of war);
38698c38698
< 38697	VERO				ADV	POS		X X X A X	yes; in truth; certainly; truly, to be sure; however;
---
> 38697	vero				ADV	POS		X X X A X	yes; in truth; certainly; truly, to be sure; however;

The csv in the PR strips the non-ASCII characters, while my script re-encodes them correctly. This seems clearly better.
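A rough sketch of what re-encoding can look like, assuming the source bytes are Latin-1 (ISO-8859-1). That encoding is an assumption here, and this is not necessarily how dictline_csv.py does it.

```python
# Assumption: the input bytes are Latin-1 encoded; 0xE9 is "é" in Latin-1.
raw = b"tr\xe9buchet (machine of war);"

# Decode the bytes to text, then encode the text as UTF-8 for the csv output.
text = raw.decode("latin-1")
utf8_bytes = text.encode("utf-8")

print(text)
```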

The csv file in the PR seems to be stripping some " characters in definitions, while the Python csv library wraps the definition on those lines in quotes, then escapes the quotes inside as "". This seems clearly better, assuming most csv libraries support it.
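To illustrate the quoting behaviour described above, here is a small sketch using the standard `csv` module with a tab delimiter. The three-column row is a shortened, hypothetical version of the real entry.

```python
import csv
import io

definition = '"god forbid", "let it be far from the hearts of the faithful";'

# Write one tab-separated row; the default QUOTE_MINIMAL setting quotes
# only fields that contain the delimiter, the quote character or a newline.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["248", "absit", definition])
line = buf.getvalue()

# Reading the line back with the same dialect recovers the definition.
row = next(csv.reader(io.StringIO(line), delimiter="\t"))
print(line, end="")
```

A reader that does not understand doubled quotes would see the escaped form, which is why this depends on the csv library used downstream.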

For some reason DICTLINE.GEN.csv from the PR has FALSO and VERO capitalized, although they aren't in DICTLINE.GEN. I don't know if this is a mistake or done for some reason.

I don't know if it would be better to use commas instead of tabs, or change/remove the column titles, if we really need an id column, etc. I've kept that the same for now to make diffing easy.

@gfranzini @Stormur any comments on this?

mk270 · 2019-02-10T21:45:46Z

I presume FALSO and VERO are Italian for false and true?

mk270 · 2019-02-10T21:54:14Z

OK, so it looks like the changes fall into the following categories:

- spurious spaces in the original ("in satiable")
- text in Latin1 encoding
- text involving double quotes
- Latin stems that coincide with modern Italian words for "true" and "false", which doubtless confused some naive spreadsheet or csv library

I think we want to fix the first of these, and ignore all the rest.

ids1024 · 2019-02-10T21:59:29Z

> Latin stems that coincide with modern Italian words for "true" and "false", which doubtless confused some naive spreadsheet or csv library

Ah, yeah, that's probably it. I hadn't thought of that. Wrapping those in quotes might fix it; but that's not necessarily something we want to do.
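For reference, a sketch of what wrapping every field in quotes looks like with the standard `csv` module. Whether a spreadsheet would then leave `falso` and `vero` alone is untested here, and the short row is hypothetical.

```python
import csv
import io

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes, not only those that need it.
writer = csv.writer(
    buf, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n"
)
writer.writerow(["20309", "falso", "ADV"])
quoted_line = buf.getvalue()
print(quoted_line, end="")
```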

> spurious spaces in the original ("in satiable")

And note that there are far more of these, but the others are omitted from the diff above by using the -b argument.

> I think we want to fix the first of these, and ignore all the rest

Yep. Or more precisely, the script already handles the encoding and the quotes properly (provided the csv loader can handle quoting and UTF-8).

Add dictline_csv.py script, based on PR mk270#92

6bba79f
