Change ticks --> apostrophes #101

Open
wants to merge 4 commits into
base: master

artie-inc

Many tick marks in English Common Voice should be apostrophes.

This is part of a larger problem that also involves quotation marks / double quotes.

Artie added 4 commits October 4, 2019 15:52
There are a lot of tick marks that occur in English Common Voice that should be apostrophes

This is part of a larger problem which involves quotation marks / double quotes
This is also an issue with respect to hyphens: should hyphens and dashes be collapsed?
There are 160 utterances with C++ spoken in Common Voice English, and this user is in the `test.csv` file after `test.tsv` is passed through `import_cv2.py`; I verified that it is spoken this way.
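
Counts like the 160 figure above can be reproduced with a small audit before deciding which characters to collapse. The following is a minimal sketch, not part of `import_cv2.py`: the function name `count_utterances_with` and the `CANDIDATES` table are hypothetical, and the sentences are assumed to be already loaded into a list of strings.

```python
from collections import Counter

# Characters discussed in this PR; the table itself is an illustrative assumption.
CANDIDATES = {
    "\u2018": "left single quotation mark",
    "\u2019": "right single quotation mark",
    "\u2013": "en dash",
    "\u2014": "em dash",
}


def count_utterances_with(sentences, candidates=CANDIDATES):
    """Count how many sentences contain each candidate character at least once."""
    counts = Counter()
    for sentence in sentences:
        for char, name in candidates.items():
            if char in sentence:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "\u2018I\u2019m not a serpent!\u2019",
        "Nelly, come here \u2014 is it morning?",
    ]
    for name, n in count_utterances_with(sample).items():
        print(name, n)
```

Counting utterances rather than raw character occurrences keeps the numbers comparable to the per-utterance figure quoted in the commit message.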

kdavis-mozilla (Contributor) left a comment

A few things may need clarification; see the comments.

For pure Deep Speech use, everything you do makes sense. But Common Voice is used outside of Deep Speech.

sentence = sentence.replace("C++", "C plus plus")
## collapse all apostrophe-like marks
## e.g. common_voice_en_18441344.mp3 ‘I’m not a serpent!’ --> 'I'm not a serpent!'
sentence = sentence.replace("’","'") # right-ticks --> apostrophes

Is this always the right thing to do? For instance see here.

## collapse all apostrophe-like marks
## e.g. common_voice_en_18441344.mp3 ‘I’m not a serpent!’ --> 'I'm not a serpent!'
sentence = sentence.replace("’","'") # right-ticks --> apostrophes
sentence = sentence.replace("‘","'") # left-ticks --> apostrophes

I guess I'd pose a similar question here.
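
One possible way to address both questions is to convert only the curly marks that sit between two letters, where they can only be apostrophes, and leave marks that open or close a quotation untouched. This is a minimal sketch under that assumption; `normalize_apostrophes` is a hypothetical helper, not something in `import_cv2.py`.

```python
import re

# A curly single quote with a letter on both sides is treated as an apostrophe.
# [^\W\d_] matches any Unicode letter (a word character that is not a digit or underscore).
_WORD_INTERNAL_TICK = re.compile(r"(?<=[^\W\d_])[\u2018\u2019](?=[^\W\d_])")


def normalize_apostrophes(sentence):
    """Replace word-internal curly single quotes with ASCII apostrophes.

    Quotation marks at the start or end of a quoted span are left as they are.
    """
    return _WORD_INTERNAL_TICK.sub("'", sentence)


if __name__ == "__main__":
    print(normalize_apostrophes("\u2018I\u2019m not a serpent!\u2019"))
```

With this rule the outer quotation marks in the `common_voice_en_18441344.mp3` example survive and only the mark in "I’m" changes. A known gap: word-final apostrophes such as plural possessives are not converted, because they cannot be told apart from closing quotation marks by position alone.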

sentence = sentence.replace("‘","'") # left-ticks --> apostrophes
## Change em-dash to dash
## e.g. common_voice_en_18607891.mp3 Nelly, come here — is it morning? --> Nelly, come here – is it morning?
sentence = sentence.replace("—","–")

The same applies here: em-dash is used in cases where en-dash is not, and vice versa. For instance see here.
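
Since Common Voice is used outside of Deep Speech, one option is to make each replacement opt-in so that other consumers keep the original punctuation by default. A minimal sketch, assuming a hypothetical wrapper `normalize_punctuation` with illustrative flag names (none of this exists in `import_cv2.py`):

```python
def normalize_punctuation(sentence, collapse_ticks=False, collapse_dashes=False):
    """Apply the replacements from this PR only when the caller asks for them."""
    if collapse_ticks:
        sentence = sentence.replace("\u2019", "'")  # right-ticks --> apostrophes
        sentence = sentence.replace("\u2018", "'")  # left-ticks --> apostrophes
    if collapse_dashes:
        sentence = sentence.replace("\u2014", "\u2013")  # em-dash --> en-dash
    return sentence


if __name__ == "__main__":
    text = "Nelly, come here \u2014 is it morning?"
    print(normalize_punctuation(text))                        # unchanged by default
    print(normalize_punctuation(text, collapse_dashes=True))  # em-dash collapsed
```

A Deep Speech import could then pass both flags, while other users of the data would see no change unless they opt in.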

2 participants