Exposing lower level model evaluation data #35

phillbaker · 2017-02-07T04:21:20Z

Thanks for all the hard work on this! Parserator has definitely made it easy to create a model with crfsuite. As I dig into fine-tuning my model, I'd like to have access to the metrics provided by crfsuite (accuracy, precision, recall).

It looks like the Python wrapper does provide access to this data (scrapinghub/python-crfsuite#42 (comment)). What do you think of a PR that exposes this as a return value of trainModel (https://github.com/datamade/parserator/blob/master/parserator/training.py#L29)?
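For reference, the kind of numbers in question (item accuracy plus per-label precision and recall) can be computed by hand from gold and predicted label sequences. The following is a minimal sketch in plain Python, not crfsuite's own evaluation code; the function name `label_metrics` and the labels in the usage note are hypothetical.

```python
from collections import Counter


def label_metrics(gold_sequences, predicted_sequences):
    """Return (item accuracy, {label: {'precision': ..., 'recall': ...}})."""
    true_pos = Counter()         # tokens where the prediction matches gold
    predicted_count = Counter()  # how often each label was predicted
    gold_count = Counter()       # how often each label appears in gold
    correct = 0
    total = 0

    for gold, predicted in zip(gold_sequences, predicted_sequences):
        if len(gold) != len(predicted):
            raise ValueError('gold and predicted sequences differ in length')
        for g, p in zip(gold, predicted):
            gold_count[g] += 1
            predicted_count[p] += 1
            total += 1
            if g == p:
                true_pos[g] += 1
                correct += 1

    per_label = {}
    for label in sorted(set(gold_count) | set(predicted_count)):
        n_pred = predicted_count[label]
        n_gold = gold_count[label]
        per_label[label] = {
            'precision': true_pos[label] / n_pred if n_pred else 0.0,
            'recall': true_pos[label] / n_gold if n_gold else 0.0,
        }

    accuracy = correct / total if total else 0.0
    return accuracy, per_label
```

Calling `label_metrics([['A', 'A', 'B']], [['A', 'B', 'B']])` scores one sequence of three tokens, two of which are labeled correctly.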

fgregg · 2017-02-07T04:26:31Z

that sounds interesting. would like to see a PR, yes.

phillbaker · 2017-11-10T16:43:41Z

Just following up on this. We ended up using modified versions of the parse and tag functions:

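from collections import OrderedDict

# TAGGER, MODEL_FILE, tokenize and tokens2features are expected to come from
# the surrounding parser module.


def parse(raw_string, verbose=False):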
    if not TAGGER:
        raise IOError(
            '\nMISSING MODEL FILE: %s\nYou must train the model before you can '
            'use the parse and tag methods\nTo train the model and create the '
            'model file, run:\nparserator train [traindata] [modulename]' % MODEL_FILE)

    tokens = tokenize(raw_string)
    if not tokens:
        return []

    features = tokens2features(tokens)

    tags = TAGGER.tag(features)

    if verbose:
        # Marginal probability of each predicted tag at its position
        probabilities = [TAGGER.marginal(tag, index)
                         for index, tag in enumerate(tags)]
        return list(zip(tokens, tags, probabilities))

    return list(zip(tokens, tags))


def tag(raw_string, probability_cutoff=None):
    tagged = OrderedDict()
    if probability_cutoff is not None:
        tagged_probability = OrderedDict()
        for token, label, probability in parse(raw_string, verbose=True):
            tagged_probability.setdefault(label, {'tokens': []})
            # Test for the key itself so a stored probability of 0.0 is not
            # mistaken for "missing" and overwritten
            if 'probability' in tagged_probability[label]:
                tagged_probability[label]['probability'] *= probability
            else:
                tagged_probability[label]['probability'] = probability

            tagged_probability[label]['tokens'].append(token)

        for label, token_probabilities in tagged_probability.items():
            if token_probabilities['probability'] > probability_cutoff:
                tagged[label] = token_probabilities['tokens']
    else:
        for token, label in parse(raw_string):
            tagged.setdefault(label, []).append(token)

    for label in tagged:
        tagged[label] = ' '.join(tagged[label]).strip(' ,;')

    return tagged
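
The cutoff logic in tag can be tried in isolation, without a trained model. This is a minimal sketch, assuming the input is a list of (token, label, probability) triples shaped like the output of parse(raw_string, verbose=True). The name `filter_by_joint_probability` and the labels in the usage note are hypothetical.

```python
from collections import OrderedDict


def filter_by_joint_probability(parsed, probability_cutoff):
    """Group tokens by label and keep only the labels whose product of
    per-token marginal probabilities exceeds the cutoff."""
    grouped = OrderedDict()
    for token, label, probability in parsed:
        entry = grouped.setdefault(label, {'tokens': [], 'probability': 1.0})
        entry['tokens'].append(token)
        entry['probability'] *= probability

    kept = OrderedDict()
    for label, entry in grouped.items():
        if entry['probability'] > probability_cutoff:
            # Join the tokens and trim stray punctuation, as tag() does
            kept[label] = ' '.join(entry['tokens']).strip(' ,;')
    return kept
```

With `[('123', 'AddressNumber', 0.99), ('Main', 'StreetName', 0.9), ('St,', 'StreetName', 0.5)]`, a cutoff of 0.6 keeps only AddressNumber, because the StreetName product is 0.9 * 0.5 = 0.45. Because the score is a product, components with more tokens tend to score lower at the same per-token confidence, which is worth keeping in mind when choosing a cutoff.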
