
How to create training data for NER task using Snorkel? #1254

Open
thak123 opened this issue Jul 10, 2019 · 17 comments
Labels
feature request, no-stale (auto-stale bot skips this issue)

Comments

thak123 commented Jul 10, 2019

I want to create a dataset using Snorkel labeling functions, but I am not able to find any links.
I want to train an NER model using the above data.

Can anyone tell me how to proceed?

@paroma paroma added the Q&A label Jul 18, 2019
@Mageswaran1989

@thak123 Follow the link in #838
You will find the following notebooks:
1. Crowdsourced_Sentiment_Analysis
2. Categorical_Classes

But I am unsure about tagging table data from PDFs/receipts.


hpeiyan commented Aug 20, 2019

  1. Crowdsourced_Sentiment_Analysis
  2. Categorical_Classes

Hi Mageswaran. The link you posted leads to a page that cannot be found.

ajratner (Contributor) commented Sep 1, 2019

Hi @thak123, while you can hopefully look at some of the existing tutorials to help you in the interim, we're actually planning to release an NER-specific tutorial soon! Marking as "feature request" and will leave this open until it's done.


marctorsoc commented Oct 28, 2019

Hi @ajratner, I'm quite interested in this feature. Do you have an expected timeline for the release of those tutorials? Not a hard deadline, just to know whether it's weeks, months, or years away...

@christopheratfarmjournal

Hi @ajratner, I'm very interested in this feature. Any idea when the tutorial may be released? Here we are two months after your previous mention... does it still look months away?

@vincentschen vincentschen added the no-stale Auto-stale bot skips this issue label Nov 18, 2019
@maciejbiesek

Any update on this issue?


pfllo commented Nov 21, 2019

I found two papers on the Snorkel resources page that tackle the NER task.
The SwellShark paper handles the overlapping-candidate problem in NER using a maximum marginal likelihood approach.
The MeTaL paper uses a matrix completion-style approach, but I can't find any details on how it handles the overlapping-candidate problem in NER.
@ajratner Could you give some hints on how to handle the overlapping candidate problem in the matrix completion-style approach, so that we can try out the NER task before the tutorial comes out?

thak123 (Author) commented Feb 19, 2020

Any update on this issue?

@blah-crusader

Also interested... C'mon guys! :D

jason-fries (Contributor) commented Feb 27, 2020

The simplest way to do NER/sequence labeling using the off-the-shelf Snorkel label model is to assume each token is independent and define your label matrix L as tokens × LFs. Mechanistically, you can materialize this matrix however you like, but conceptually it's cleaner to define your LFs as accepting sequences as input and returning a vector of token labels. These sequence LFs can apply regular expressions, dictionary matching, arbitrary heuristics, etc., as per typical Snorkel labeling functions.

For example, you could build a very simple NER LOCATION model (using binary/IO tagging) as follows:

import numpy as np
from scipy.sparse import dok_matrix, vstack, csr_matrix

ABSTAIN = -1
LOCATION = 1
NOT_LOCATION = 0

# helper functions
def dict_match(sentence, dictionary, max_ngrams=4):
    """Return {token_index: 1} for every token covered by a dictionary n-gram match."""
    m = {}
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + max_ngrams) + 1):
            term = ' '.join(sentence[i:j])
            if term in dictionary:
                # tokens i..j-1 make up the matched n-gram
                m.update({idx: 1 for idx in range(i, j)})
    return m

def create_token_L_mat(Xs, Ls, num_lfs):
    """
    Create token-level LF matrix from LFs indexed by sentence
    """
    Yws = []
    for sent_i in range(len(Xs)):
        ys = dok_matrix((len(Xs[sent_i]), num_lfs))
        for lf_i in range(num_lfs):
            for word_i, y in Ls[sent_i][lf_i].items():
                ys[word_i, lf_i] = y
        Yws.append(ys)
    return csr_matrix(vstack(Yws))

# labeling functions
def LF_is_location(s):
    locations = {"Big Apple", "Cupertino", "Cupertino, California", "California"}
    matches = dict_match(s, locations)
    return {i: LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_company(s):
    companies = {"Apple", "Apple, Inc."}
    matches = dict_match(s, companies)
    return {i: NOT_LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_titlecase(s):
    return {i: LOCATION if s[i][0].isupper() else ABSTAIN for i in range(len(s))}

# training set
sents = [
    "Apple, Inc. is headquartered in Cupertino, California .".split(),
    "Explore the very best of the Big Apple .".split(),
]

lfs = [
    LF_is_location,
    LF_is_company,
    LF_is_titlecase,
]

# apply labeling functions and transform label matrix
L = [[lf(s) for lf in lfs] for s in sents]
L = create_token_L_mat(sents, L, len(lfs))

# train your Snorkel label model
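
A minimal sketch of that last step, assuming snorkel 0.9.x (where LabelModel is importable from snorkel.labeling.model) and the L matrix built above; the names L_dense and token_probs and the fit hyperparameters are just illustrative:

from snorkel.labeling.model import LabelModel

# the label model expects a dense integer array with -1 (ABSTAIN) marking abstains
L_dense = np.asarray(L.todense()).astype(int)

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_dense, n_epochs=500, seed=123)

# per-token probabilities over {NOT_LOCATION, LOCATION}, shape (num_tokens, 2)
token_probs = label_model.predict_proba(L_dense)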

Generating weakly labeled sequence data then just requires some bookkeeping to split your predicted token probabilities back into their original sequences.
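
For instance, a small sketch of that bookkeeping, assuming a flat (num_tokens, num_classes) probability array (token_probs above) stacked in the same sentence order as sents:

# split the flat token-probability array back into one array per sentence
lengths = [len(s) for s in sents]
offsets = np.cumsum([0] + lengths)
probs_by_sent = [token_probs[offsets[k]:offsets[k + 1]] for k in range(len(sents))]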

When training the end model, you can either mask tokens that don't have any LF coverage or assume some prior (e.g., all tags are equally likely) and train a BERT, BiLSTM, etc. model.
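
Both options can be sketched with numpy under the same assumptions as the snippets above (L_dense and token_probs are illustrative names):

# a token is "covered" if at least one LF did not abstain on it
covered = (L_dense != ABSTAIN).any(axis=1)

# option 1: mask -- train the end model only on covered tokens
masked_probs = token_probs[covered]

# option 2: prior -- keep every token, backing off to a uniform prior where no LF voted
num_classes = token_probs.shape[1]
soft_targets = np.where(covered[:, None], token_probs, 1.0 / num_classes)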

As @pfllo pointed out, there are dependencies between tokens that we would also like to capture in the label model. The papers Multi-Resolution Weak Supervision for Sequential Data and Weakly Supervised Sequence Tagging from Noisy Rules do handle this (both have code available). In practice, however, treating tokens independently and using the default Snorkel label model works surprisingly well, especially if you come from a domain with rich knowledge base and dictionary resources, such as biomedicine or geography.

@ajratner (Contributor)

@jason-fries thanks so much! Just to make sure it's clear: this repo has been generously maintained primarily by researchers like @jason-fries, and we are in general very capacity limited in terms of major changes to the repo. As such, we currently don't have a timeline on an NER tutorial. Contributions are very welcome though!

To additionally be clear: our policy for the issues page is that questions and comments are great, but demands such as "cmon guys" are not appropriate usage. Thanks for your understanding!

@ajratner (Contributor)

And also just to be very clear: we all really want to put more stuff out here... we're working on it, and so grateful to all of you on the issues page for your patience, enthusiasm, and support in trying Snorkel out in the meantime!!! :)

@blah-crusader

Thanks a lot for this response @jason-fries! @ajratner apologies for coming across as impatient/rude; I've been really amazed by the current release and the corresponding research papers, and did not mean anything other than: "I'm also super interested in staying up to date on the topic".

Thanks!

@marctorsoc

(quoting @jason-fries' reply above in full)

Thanks for this. I already experimented with a similar approach in the past, but it's really useful to have confirmation that it works quite well and that there's not much difference (given enough resources) compared to something specific to sequence data 👍


raj5287 commented Jul 20, 2020

(quoting @jason-fries' reply above in full)

@jason-fries thanks for this, but could you please explain how to train the MajorityLabelVoter or LabelModel? I am getting errors with both of these methods, and even with LFAnalysis(L=L_, lfs=lfs).lf_summary(). I am guessing this may be because of the sparse matrix, since the error is NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported. Could you please help me out here, what should I do next?

@alvin-c-shih

@raj5287 MajorityLabelVoter requires that L be of integer type. LFAnalysis requires the matrix to be dense. Other operations prefer np.array over np.matrix.

Try this as a tactical fix:

L = np.asarray(L.astype(np.int8).todense())
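
For example, after that conversion the usual snorkel 0.9.x calls should work; a rough sketch (assuming the L built in the earlier snippet; LFAnalysis is called without lfs here because the example LFs are plain functions rather than LabelingFunction objects):

import numpy as np
from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter

L = np.asarray(L.astype(np.int8).todense())   # dense integer np.ndarray

print(LFAnalysis(L).lf_summary())             # per-LF coverage/overlap/conflict stats

mv = MajorityLabelVoter(cardinality=2)
token_preds = mv.predict(L)                   # hard majority-vote label per token (-1 where all LFs abstain or tie)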


rjurney commented Jul 13, 2022

The thing to do here is to use skweak, not Snorkel. Snorkel is a commercial tool now, and investment in this area is going into other projects.
