Exposing lower level model evaluation data #35

Open
phillbaker opened this issue Feb 7, 2017 · 2 comments
Comments

@phillbaker

Thanks for all the hard work on this! Parserator has definitely made it easy to create a model with crfsuite. As I dig into fine-tuning my model, I'd like to have access to the metrics provided by crfsuite (accuracy, precision, recall).

It looks like the Python wrapper does provide access to this data (scrapinghub/python-crfsuite#42 (comment)). What do you think of a PR that exposes this as a return value of trainModel?
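
For readers who want these numbers before any such PR exists, here is a minimal sketch of computing token-level accuracy and per-label precision and recall from gold and predicted label sequences. It is plain Python and does not reproduce crfsuite's own evaluation output or any parserator API; the function name `token_level_metrics` and the example labels are hypothetical.

```python
from collections import Counter


def token_level_metrics(gold_sequences, predicted_sequences):
    """Return token-level accuracy plus per-label precision and recall.

    Hypothetical helper: both arguments are lists of label sequences,
    one sequence per training string, aligned token by token.
    """
    true_positive = Counter()
    predicted_count = Counter()
    gold_count = Counter()
    correct = total = 0

    for gold, predicted in zip(gold_sequences, predicted_sequences):
        if len(gold) != len(predicted):
            raise ValueError('gold and predicted sequences differ in length')
        for gold_label, predicted_label in zip(gold, predicted):
            gold_count[gold_label] += 1
            predicted_count[predicted_label] += 1
            total += 1
            if gold_label == predicted_label:
                true_positive[gold_label] += 1
                correct += 1

    per_label = {}
    for label in sorted(set(gold_count) | set(predicted_count)):
        tp = true_positive[label]
        per_label[label] = {
            # Precision: share of predictions of this label that were right.
            'precision': tp / predicted_count[label] if predicted_count[label] else 0.0,
            # Recall: share of gold tokens with this label that were found.
            'recall': tp / gold_count[label] if gold_count[label] else 0.0,
        }
    return {'accuracy': correct / total if total else 0.0, 'per_label': per_label}


# Made-up labels for illustration only.
gold = [['Street', 'Street', 'City'], ['City', 'State']]
predicted = [['Street', 'City', 'City'], ['City', 'State']]
metrics = token_level_metrics(gold, predicted)
```

The predicted sequences would come from running the trained tagger over held-out labeled strings; how those are loaded depends on the module and is left out here.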

@fgregg
Member

fgregg commented Feb 7, 2017 via email

@phillbaker
Author

phillbaker commented Nov 10, 2017

Just following up on this. We ended up using modified versions of the parse and tag functions:

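# Assumes TAGGER, MODEL_FILE, tokenize and tokens2features are already defined
# in the parserator-generated module these functions live in.
from collections import OrderedDict


def parse(raw_string, verbose=False):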
    if not TAGGER:
        raise IOError(
            '\nMISSING MODEL FILE: %s\nYou must train the model before you can '
            'use the parse and tag methods\nTo train the model and create the '
            'model file, run:\nparserator train [traindata] [modulename]' % MODEL_FILE)

    tokens = tokenize(raw_string)
    if not tokens:
        return []

    features = tokens2features(tokens)

    tags = TAGGER.tag(features)

    if verbose:
        probabilities = []
        for index, tag in enumerate(tags):
            probabilities.append(TAGGER.marginal(tag, index))
        return list(zip(tokens, tags, probabilities))

    return list(zip(tokens, tags))


def tag(raw_string, probability_cutoff=None):
    tagged = OrderedDict()
    # Compare against None so that a cutoff of 0.0 is still honored.
    if probability_cutoff is not None:
        tagged_probability = OrderedDict()
        for token, label, probability in parse(raw_string, verbose=True):
            # Score each label by the product of its tokens' marginal
            # probabilities; starting at 1.0 keeps a 0.0 marginal from
            # being overwritten by the next token.
            entry = tagged_probability.setdefault(
                label, {'tokens': [], 'probability': 1.0})
            entry['probability'] *= probability
            entry['tokens'].append(token)

        for label, token_probabilities in tagged_probability.items():
            if token_probabilities['probability'] > probability_cutoff:
                tagged[label] = token_probabilities['tokens']
    else:
        for token, label in parse(raw_string):
            tagged.setdefault(label, []).append(token)

    # Join each label's tokens into one string and trim stray punctuation.
    for label in tagged:
        component = ' '.join(tagged[label])
        tagged[label] = component.strip(' ,;')

    return tagged
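
To make the cutoff behavior easier to see without a trained model, the sketch below isolates the filtering step from `tag` and runs it on hand-written `(token, label, probability)` triples of the shape `parse(raw_string, verbose=True)` returns. The helper name `filter_by_marginal_product`, the labels, and the probabilities are made up for illustration.

```python
from collections import OrderedDict


def filter_by_marginal_product(parsed, probability_cutoff):
    """Keep labels whose product of token marginals exceeds the cutoff.

    Hypothetical stand-in for the cutoff branch of tag(); `parsed` is a
    list of (token, label, probability) triples.
    """
    scores = OrderedDict()
    for token, label, probability in parsed:
        entry = scores.setdefault(label, {'tokens': [], 'probability': 1.0})
        entry['probability'] *= probability
        entry['tokens'].append(token)

    return OrderedDict(
        (label, ' '.join(entry['tokens']).strip(' ,;'))
        for label, entry in scores.items()
        if entry['probability'] > probability_cutoff
    )


# Made-up parse output: one single-token label, one two-token label.
parsed = [
    ('123', 'AddressNumber', 0.9),
    ('Main', 'StreetName', 0.5),
    ('St,', 'StreetName', 0.5),
]
kept = filter_by_marginal_product(parsed, 0.4)
# AddressNumber scores 0.9 and is kept; StreetName scores 0.5 * 0.5 = 0.25
# and is dropped.
```

Note that the product of per-token marginals is a heuristic score, not the joint probability of the labeled span, and it shrinks as a label covers more tokens, so multi-token fields are more likely to fall below a fixed cutoff.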
