This package contains all components related to phonology. It contains features, feature extractors, and transformers.
These features can be passed either as a dict mapping symbols to feature arrays, as tuples of strings, or as tuples of tuples of strings. As features, we expect a tuple of dicts, the first containing vowel features and the second containing consonant features. In many cases, feature extraction can be postponed by passing a feature extractor. This is usually the easiest method, as it guarantees a parsimonious representation.
The CVTransformer is based on patpho, and uses a consonant-vowel (CV) grid to align consonant and vowel features. This ensures that words with different numbers of vowels and consonants in their segments are still aligned. The words spat and pat, for example, would not be aligned using a purely linear encoding, but are aligned when using a CV grid.
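The vowel-alignment idea can be illustrated with a toy sketch. Here phonemes are assigned to separate consonant and vowel slots, so the vowel of pat and spat lands in the same slot regardless of how many consonants precede it. The vowel inventory and slot scheme below are stand-ins for illustration only, not wordkit's actual grid logic.

```python
# Toy illustration of consonant-vowel slot assignment. The vowel
# inventory here is a stand-in, not wordkit's actual one.
VOWELS = {'æ', 'ɔ'}

def cv_slots(word):
    slots = {}
    c_idx = v_idx = 0
    for phoneme in word:
        if phoneme in VOWELS:
            slots[f"V{v_idx}"] = phoneme
            v_idx += 1
        else:
            slots[f"C{c_idx}"] = phoneme
            c_idx += 1
    return slots

pat = cv_slots(('p', 'æ', 't'))
spat = cv_slots(('s', 'p', 'æ', 't'))
# In a linear encoding, 'æ' sits at index 1 in "pat" but at index 2 in
# "spat"; on the grid, it occupies slot "V0" in both words.
```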
When using the CVTransformer, please consider citing:
@article{li2002patpho,
title={PatPho: A phonological pattern generator for neural networks},
author={Li, Ping and MacWhinney, Brian},
journal={Behavior Research Methods, Instruments, \& Computers},
volume={34},
number={3},
pages={408--415},
year={2002},
publisher={Springer}
}
As it is a phonological transformer, the CVTransformer expects a tuple of features, one for vowels and another for consonants.
from wordkit.features import CVTransformer, OneHotPhonemeExtractor
p_words = [('p', 'æ', 't'), ('s', 'p', 'æ', 't')]
c = CVTransformer(OneHotPhonemeExtractor, field=None)
X = c.fit_transform(p_words)
c.features
The ONCTransformer is similar to the CVTransformer above, but uses syllable information to group phonemes into onsets, nuclei, and codas. This makes it more data-intensive, because it requires syllable information, whereas the CVTransformer only requires phonological information; it is, however, also more accurate. In contrast to the CVTransformer, whose grid must be manually specified, the grid of the ONCTransformer is completely data-driven, and is determined during fitting.
from wordkit.features import ONCTransformer, OneHotPhonemeExtractor
# Syllables are represented as tuples of tuples.
p_words = [(('p', 'æ', 't'),), (('s', 'p', 'æ', 't'),)]
o = ONCTransformer(OneHotPhonemeExtractor, field=None)
X = o.fit_transform(p_words)
o.features
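The grouping the ONCTransformer performs can be sketched for a single syllable: the onset is everything before the first vowel, the nucleus is the vowel run, and the coda is whatever follows. This is an illustration of the grouping only, not wordkit's actual implementation, and the vowel inventory is a stand-in.

```python
# Rough sketch of onset/nucleus/coda splitting for one syllable.
# The vowel inventory is a stand-in for illustration.
VOWELS = {'æ', 'ɔ', 'i'}

def split_onc(syllable):
    onset, nucleus, coda = [], [], []
    for phoneme in syllable:
        if phoneme in VOWELS:
            nucleus.append(phoneme)
        elif not nucleus:
            onset.append(phoneme)   # consonant before the first vowel
        else:
            coda.append(phoneme)    # consonant after the vowel run
    return tuple(onset), tuple(nucleus), tuple(coda)

# ('s', 'p', 'æ', 't') -> onset ('s', 'p'), nucleus ('æ',), coda ('t',)
```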
The OpenNGramTransformer featurizes words using open ngrams: the set of ordered combinations of n characters in a word. Taking bigrams as an example, the OpenNGramTransformer turns the word "salt" into {"sa", "sl", "st", "al", "at", "lt"}. The extracted features are similar to what is known as "character skipgrams". The main motivation for using open ngram features is transposition resilience.
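The construction can be sketched directly with itertools, since combinations() yields in-order character pairs; this reproduces the "salt" example above. This is a sketch of the idea, not wordkit's code.

```python
from itertools import combinations

# Open ngrams are the in-order combinations of n letters of a word;
# combinations() preserves input order, so this matches the "salt"
# example from the text. A sketch of the idea, not wordkit's code.
def open_ngrams(word, n=2):
    return {"".join(gram) for gram in combinations(word, n)}

salt = open_ngrams("salt")   # {"sa", "sl", "st", "al", "at", "lt"}
slat = open_ngrams("slat")
# "salt" and "slat" share five of their six open bigrams, which is what
# makes the encoding resilient to letter transpositions.
```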
If you use the OpenNGramTransformer, please consider citing the following sources:
@article{schoonbaert2004letter,
title={Letter position coding in printed word perception: Effects of repeated and transposed letters},
author={Schoonbaert, Sofie and Grainger, Jonathan},
journal={Language and Cognitive Processes},
volume={19},
number={3},
pages={333--367},
year={2004},
publisher={Taylor \& Francis}
}
@article{whitney2001brain,
title={How the brain encodes the order of letters in a printed word: The SERIOL model and selective literature review},
author={Whitney, Carol},
journal={Psychonomic Bulletin \& Review},
volume={8},
number={2},
pages={221--243},
year={2001},
publisher={Springer}
}
The example below shows how "salt" and "slat" lead to similar encodings.
from wordkit.features import OpenNGramTransformer
words = ["salt", "slat"]
o = OpenNGramTransformer(n=2, field=None)
X = o.fit_transform(words)
print(o.features)
# Normalized hamming distance
dist = (X.shape[1] - (X[0].dot(X[1]))) / X.shape[1]
The ConstrainedOpenNGramTransformer is similar to the OpenNGramTransformer, above, with the added constraint that the ngrams can only skip up to a specific number of letters. If you use this transformer, please cite the sources listed under the OpenNGramTransformer heading, above.
from wordkit.features import ConstrainedOpenNGramTransformer
words = ["photography", "graphically"]
c = ConstrainedOpenNGramTransformer(n=2, window=2)
c.fit_transform(words)
c.features
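The window constraint can be sketched as a filter over ordinary open bigrams: keep only pairs whose letters are separated by at most a given number of intervening letters. The exact window semantics below are an assumption for illustration, not wordkit's verified definition.

```python
from itertools import combinations

# Keep only bigrams whose letters have at most `window` intervening
# letters. The window semantics are an assumption, not wordkit's
# verified definition.
def constrained_open_bigrams(word, window=2):
    pairs = combinations(range(len(word)), 2)
    return {word[i] + word[j] for i, j in pairs if j - i - 1 <= window}

# window=0 reduces to ordinary adjacent bigrams.
constrained_open_bigrams("salt", window=0)  # {"sa", "al", "lt"}
```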
The WeightedOpenBigramTransformer only extracts bigrams, and assigns each bigram a weight depending on the distance between its letters. If you use this transformer, please cite the sources listed under the OpenNGramTransformer heading, above.
from wordkit.features import WeightedOpenBigramTransformer
words = ["photography", "graphically"]
# Bigrams with no intervening letters get weight 1,
# bigrams with a single intervening letter get weight .8, and so on.
w = WeightedOpenBigramTransformer(weights=(1., .8, .2))
X = w.fit_transform(words)
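The weighting scheme can be sketched as follows: weights[k] is the weight assigned to bigrams with k intervening letters, mirroring the comment in the example above. The mapping of weights to distances is an assumption about the scheme, not wordkit's verified implementation.

```python
from itertools import combinations
from collections import defaultdict

# weights[k] is the weight for bigrams with k intervening letters.
# An illustrative sketch, not wordkit's verified implementation.
def weighted_open_bigrams(word, weights=(1., .8, .2)):
    scores = defaultdict(float)
    for i, j in combinations(range(len(word)), 2):
        gap = j - i - 1
        if gap < len(weights):
            scores[word[i] + word[j]] += weights[gap]
    return dict(scores)

# For "salt": adjacent pairs ("sa", "al", "lt") get weight 1.0,
# one-letter skips ("sl", "at") get 0.8, and "st" gets 0.2.
```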
The NGramTransformer turns words into character ngrams. Every word is padded with n - 1 dummy characters ("#" by default). Padding can be turned off by setting use_padding to False, but this removes the option of featurizing words that are shorter than n characters.
from wordkit.features import NGramTransformer
words = ["dog", "fog", "hotdog", "colddog"]
w = NGramTransformer(n=3)
X = w.fit_transform(words)
w_2 = NGramTransformer(n=3, use_padding=False)
X_2 = w_2.fit_transform(words)
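The padding scheme can be sketched in a few lines: n - 1 dummy characters are attached to both ends before a sliding window extracts the ngrams. This is an illustration of the scheme described above, not wordkit's code.

```python
# Sketch of padded character-ngram extraction: n - 1 dummy characters
# are attached to both ends, so "dog" with n=3 yields five trigrams.
# An illustration of the scheme, not wordkit's code.
def char_ngrams(word, n=3, use_padding=True, pad="#"):
    if use_padding:
        word = pad * (n - 1) + word + pad * (n - 1)
    return [word[i:i + n] for i in range(len(word) - n + 1)]

char_ngrams("dog")                     # ['##d', '#do', 'dog', 'og#', 'g##']
char_ngrams("dog", use_padding=False)  # ['dog']
```

Note that without padding, a word shorter than n produces no ngrams at all, which is the caveat mentioned above.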
This subpackage contains a number of pre-specified features, and is mainly aimed at replicating already existing papers.
from wordkit.features import (PredefinedFeatureExtractor,
dislex_features,
plunkett_phonemes)
# Use a feature extractor with predefined features.
phon = CVTransformer(PredefinedFeatureExtractor(dislex_features))
# Use a predefined phoneme set.
phon = CVTransformer(plunkett_phonemes)
This subpackage contains all the functions and objects involved in feature extraction. In general, feature extraction denotes the process of extracting features from a set of objects. Keep in mind that feature extraction is distinct from transformation. Feature extraction merely determines the set of features, which are then passed to a transformer for further use.
import numpy as np
from wordkit.features import PhonemeFeatureExtractor, CVTransformer
p_words = [('k', 'æ', 't'), ('d', 'ɔ', 'ɡ')]
p = PhonemeFeatureExtractor()
v_feats, c_feats = p.extract(p_words)
# These can then be added to a transformer
c = CVTransformer((v_feats, c_feats))
transformed = c.fit_transform(p_words)
# Feature extractors can be directly added to transformers.
c = CVTransformer(PhonemeFeatureExtractor)
transformed_2 = c.fit_transform(p_words)
# Both methods are equivalent
assert np.all(transformed == transformed_2)
# using a dictionary
words = [{"orthography": "cat", "phonology": ('k', 'æ', 't')},
{"orthography": "dog", "phonology": ('d', 'ɔ', 'ɡ')}]
# field must be set because we use a dictionary
c = CVTransformer(PhonemeFeatureExtractor, field="phonology")
transformed_3 = c.fit_transform(words)
All phonological feature extractors return tuples, where the first item of each tuple is a dictionary of vowel-to-feature mappings, and the second item of each tuple is a dictionary of consonant-to-feature mappings.
This extractor turns a set of phoneme strings into one-hot encoded phonemes. Each phoneme is assigned an orthogonal vector; the extractor thus assumes that all phonemes are maximally dissimilar.
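The orthogonality assumption amounts to giving each phoneme a row of an identity matrix, which can be sketched as follows (a minimal illustration, not the extractor's actual code):

```python
import numpy as np

# Each phoneme receives a row of an identity matrix, so all phoneme
# vectors are mutually orthogonal (maximally dissimilar).
phonemes = sorted({'p', 'æ', 't', 's'})
one_hot = {ph: vec for ph, vec in zip(phonemes, np.eye(len(phonemes)))}

# Distinct phonemes have a dot product of zero; identical ones, of one.
```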
This extractor extracts IPA distinctive features from the input, and assigns each unique value of an IPA feature a one-hot encoded representation. Note that only features within a feature group are orthogonal. For example, if there are only three values for a given feature group, the features within that feature group will have a three-bit one-hot encoded vector.
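The per-group coding can be sketched like this: each value of a feature group gets a one-hot vector whose length equals the number of values in that group, so orthogonality only holds within a group. The group name and its values below are made up for illustration; they are not wordkit's actual IPA feature inventory.

```python
# Each value of a feature group gets a one-hot vector as long as the
# group has values. The group and values here are illustrative only.
def encode_group(values):
    size = len(values)
    return {v: tuple(1 if i == j else 0 for j in range(size))
            for i, v in enumerate(values)}

place = encode_group(["labial", "coronal", "dorsal"])
# Three values in the group -> three-bit one-hot vectors.
```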
This extractor first extracts IPA distinctive features, and assigns each of the values of these features a predefined vector. Predefined phonological features can be found in wordkit.features.
There is no limit on the type of features which can be passed into this extractor, although all values from the same feature group need to have the same dimensionality.
from wordkit.features import (CVTransformer,
PredefinedFeatureExtractor,
dislex_features)
words = [{"orthography": "dog", "phonology": ('d', 'ɔ', 'ɡ')}]
p = PredefinedFeatureExtractor(dislex_features,
field="phonology")
v_feats, c_feats = p.extract(words)
This feature set was defined by us, and is an expansion of the patpho feature set. It can encode a wide variety of phonemes, but not all of the phonemes in the International Phonetic Alphabet. It assigns each of the distinctive features an overlapping binary code, and therefore leads to a more parsimonious representation than the one extracted by the PhonemeFeatureExtractor.
This feature set assigns each single feature a number between 0 and 1. It cannot distinguish between all phonemes in the International Phonetic Alphabet. Like the orthographic features with the same name, it was defined and used by Miikkulainen in the context of the DISLEX model.
Phoneme sets are sets which directly map phoneme characters to features. We currently include only two of these because of their limited utility; they usually only apply to a subset of available phonemes, and rarely work for the featurization of an entire corpus.
The phoneme sets are all tuples of dictionaries. The first item of each tuple is a mapping from vowels to features, while the second item is a mapping from consonants to features.
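The tuple-of-dicts layout described above can be sketched directly. The symbols and feature values below are made up for illustration; they are not the actual values of plunkett_phonemes or any other included set.

```python
# Sketch of the phoneme-set layout: a tuple whose first element maps
# vowels to features and whose second maps consonants to features.
# Symbols and feature values are illustrative, not actual set contents.
vowel_features = {'æ': (1, 0), 'ɔ': (0, 1)}
consonant_features = {'p': (1, 0, 0), 't': (0, 1, 0), 'd': (0, 0, 1)}
phoneme_set = (vowel_features, consonant_features)

# Unpacking follows the convention: vowels first, consonants second.
vowels, consonants = phoneme_set
```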
These are taken from Plunkett and Marchman (1993):
@article{plunkett1993rote,
title={From rote learning to system building: Acquiring verb morphology in children and connectionist nets},
author={Plunkett, Kim and Marchman, Virginia},
journal={Cognition},
volume={48},
number={1},
pages={21--69},
year={1993},
publisher={Elsevier}
}
Like patpho_real, below, this has been taken from the patpho paper.
@article{li2002patpho,
title={PatPho: A phonological pattern generator for neural networks},
author={Li, Ping and MacWhinney, Brian},
journal={Behavior Research Methods, Instruments, \& Computers},
volume={34},
number={3},
pages={408--415},
year={2002},
publisher={Springer}
}
Like patpho_binary above, this feature set has been taken from the patpho paper.