Given the name of a property or attribute like 'BrandName' or 'AmountReceived', try to predict a data type like String, Boolean, Integer...
python3 -m pip install --upgrade parameterized==0.7.5 levenshtein==0.20.8 Flask==2.2.2
python3 ./src/predict-type-from-name.py <name> [--help --fuzzy]
python3 ./src/predict-type-from-name.py Actve --fuzzy
Output:
Actve=Boolean
python3 ./src/predict-type-from-name-repl.py
Output:
Enter a property name like 'Color' or 'BrandName' or 'CreatedOn'
(just press ENTER to exit) ->ExportedOn
ExportedOn=Date
(just press ENTER to exit) ->ItemWidth
ItemWidth=Integer
./go.api.sh
Open a URL with a property-name at the end:
http://127.0.0.1:5000/predict_type/Branded
Output:
property_name: Branded -> predicted type=Boolean
python3 ./src/evaluate.py <path or glob to JSON data file(s)> [--help --fuzzy]
python3 ./src/evaluate.py ./data/names-and-types.small.1.json
Output:
# Accuracy:
45% correctly predicted
5% incorrectly predicted
50% not predicted
Data set size: 66 words
A small element of Machine Learning is used to optimize the prediction parameters for a given data set.
The measure reported as Accuracy is TP/(TP+FP), which is conventionally called precision: names that receive no prediction are not counted. The cost function is defined simply to maximise this accuracy.
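A minimal sketch of the TP/(TP+FP) measure, to show that names with no prediction do not enter the calculation. The function name and the counts are illustrative assumptions, not taken from `evaluate.py`.

```python
def accuracy(correct: int, incorrect: int) -> float:
    """Return TP / (TP + FP) as a percentage.

    Names that received no prediction are deliberately not an input here.
    """
    predicted = correct + incorrect
    if predicted == 0:
        # Nothing was predicted, so there is nothing to be right or wrong about.
        return 0.0
    return 100.0 * correct / predicted


# Hypothetical counts for a 66-name data set:
# 30 correct, 3 incorrect, 33 not predicted.
print(round(accuracy(correct=30, incorrect=3)))
```

With these assumed counts the result rounds to 91, even though only about 45% of all names were predicted correctly.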
python3 ./src/train.py ./data/ip-xxx-big.json
Training...
[done]
Optimal config:
is_fuzzy=False, max_distance=0, min_length=2, cost=29, accuracy=71
Unfortunately, Machine Learning indicates that the optimal configuration can be achieved WITHOUT fuzzy matching! However, for UX reasons, fuzzy matching still seems useful, given that the accuracy against the data is the same.
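One simple way to find such a configuration is an exhaustive search over the three parameters shown in the output above. The sketch below is an assumption about the general shape of such a search, not a description of `train.py`: the `Config` type, the candidate ranges, and `toy_score` are all hypothetical.

```python
from itertools import product
from typing import Callable, NamedTuple, Optional, Tuple


class Config(NamedTuple):
    is_fuzzy: bool
    max_distance: int
    min_length: int


def search(score: Callable[[Config], float]) -> Tuple[Optional[Config], float]:
    """Score every candidate config and keep the best one."""
    best_config, best_score = None, float("-inf")
    for is_fuzzy, max_distance, min_length in product(
        [False, True], range(0, 4), range(2, 8)
    ):
        # An edit distance only makes sense when fuzzy matching is on,
        # and fuzzy matching with distance 0 is just exact matching.
        if is_fuzzy != (max_distance > 0):
            continue
        config = Config(is_fuzzy, max_distance, min_length)
        current = score(config)
        # Strict '>' means ties go to the candidate tried first.
        if current > best_score:
            best_config, best_score = config, current
    return best_config, best_score


def toy_score(config: Config) -> float:
    # Stand-in for evaluating a config against a real data set.
    return 70.0 - 10.0 * config.max_distance


best, best_score = search(toy_score)
print(best, best_score)
```

Because non-fuzzy candidates are tried first and ties keep the earlier candidate, this sketch prefers a non-fuzzy configuration whenever fuzzy matching brings no gain in score.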
- The property name (the word) is split into smaller tokens, assuming camelCase or PascalCase.
- Heuristics are run to try and recognise the first or last token. Example: 'is' or 'can' indicates Boolean. If a match is found, exit.
- [If fuzzy matching is enabled] Levenshtein distance is then allowed on the longer tokens, to try to get a fuzzy match.
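The steps above can be sketched roughly as follows. This is a simplified illustration under assumptions: the heuristic tables, the function names, and the default parameter values are hypothetical, and a small pure-Python edit distance stands in for the `levenshtein` package so that the block is self-contained.

```python
import re
from typing import List, Optional

# Hypothetical heuristic tables: lower-case token -> predicted type.
FIRST_TOKEN_TYPES = {"is": "Boolean", "can": "Boolean"}
LAST_TOKEN_TYPES = {"name": "String", "on": "Date", "width": "Integer",
                    "active": "Boolean"}


def tokenize(name: str) -> List[str]:
    """Split a camelCase or PascalCase name into tokens."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def predict_type(name: str, fuzzy: bool = False,
                 max_distance: int = 1, min_length: int = 5) -> Optional[str]:
    tokens = [t.lower() for t in tokenize(name)]
    if not tokens:
        return None
    first, last = tokens[0], tokens[-1]
    # Exact heuristics on the first, then the last token.
    if first in FIRST_TOKEN_TYPES:
        return FIRST_TOKEN_TYPES[first]
    if last in LAST_TOKEN_TYPES:
        return LAST_TOKEN_TYPES[last]
    # Optional fuzzy pass, restricted to longer tokens.
    if fuzzy:
        known = {**FIRST_TOKEN_TYPES, **LAST_TOKEN_TYPES}
        for token in (first, last):
            if len(token) < min_length:
                continue
            for candidate, predicted in known.items():
                if levenshtein(token, candidate) <= max_distance:
                    return predicted
    return None


print(predict_type("ExportedOn"))
print(predict_type("Actve"))
print(predict_type("Actve", fuzzy=True))
```

Returning `None` when nothing matches corresponds to the "not predicted" outcome, which keeps the exact heuristics "safe": a misspelled name such as 'Actve' only receives a type once fuzzy matching is enabled.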
Approach | Accuracy | Correctly predicted | Incorrectly predicted | Not predicted | Data set | Comment |
---|---|---|---|---|---|---|
Heuristics, no fuzzy match | - | 45% | 5% | 50% | 66 words | 'Safe' predictions. |
Heuristics, with fuzzy match (min length 3, max distance 5) | - | 47% | 48% | 5% | 66 words | 'Unsafe' fuzzy predictions: small gain in true positives at the cost of many more false positives. |
Heuristics, with fuzzy match (min length 2, max distance 2) | - | 50% | 14% | 36% | 66 words | 'Safer' fuzzy predictions. |
Heuristics, with fuzzy match (min length 5, max distance 2) | 91% | 47% | 5% | 48% | 66 words | 'Safer' fuzzy predictions. |
ML Optimized Heuristics, with NO fuzzy match (min length 2) | 91% | 45% | 5% | 50% | 66 words | Machine Learning optimized the 5600 item data set -> Fuzzy is OFF. |
Approach | Accuracy | Correctly predicted | Incorrectly predicted | Not predicted | Data set | Comment |
---|---|---|---|---|---|---|
Heuristics, no fuzzy match | - | 16% | 7% | 77% | 5640 words | 'Safe' predictions. |
Heuristics, with fuzzy match (min length 2, max distance 2) | - | 24% | 30% | 46% | 5640 words | Fuzzy predictions. |
Heuristics, with fuzzy match (min length 5, max distance 2) | 68% | 17% | 8% | 75% | 5640 words | Fuzzy predictions. |
ML Optimized Heuristics, with forced fuzzy match (min length 6, max distance 1) | 71% | 16% | 7% | 77% | 5640 words | Machine Learning optimized THIS data set -> Fuzzy is forced ON, learned optimal token length. |
ML Optimized Heuristics, with NO fuzzy match | 71% | 16% | 7% | 77% | 5640 words | Machine Learning optimized THIS data set -> Fuzzy is OFF. |