Sep and fastai_tokenizer #4

Open · jrlinton opened this issue Jun 24, 2020 · 0 comments

jrlinton commented Jun 24, 2020

Brilliant work here, Morgan - really looking forward to using this with my students on a project. Deepest apologies if I'm not doing this right - I'm very new to GitHub and also not a particularly good programmer.

It looks like the fastai v2 team may have made a change to Tokenizer that makes it choke on the sep argument when instantiating your custom tokenizer in the fasthugs_language_model notebook.

class MLMTokenizer(Tokenizer): 
    def __init__(self, tokenizer, rules=None, counter=None, lengths=None, mode=None, **kwargs):  # removed sep=' '
        super().__init__(tokenizer, rules, counter, lengths, mode)  # removed sep
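
If it's useful: rather than deleting sep outright, a version-tolerant variant could forward it only when the installed Tokenizer still accepts it. A minimal sketch, assuming the fastai v2 Tokenizer signature and the fastai2-era import path - I haven't tested this against the notebook:

import inspect
from fastai2.text.all import Tokenizer  # `from fastai.text.all import Tokenizer` on later releases

class MLMTokenizer(Tokenizer):
    def __init__(self, tokenizer, rules=None, counter=None, lengths=None, mode=None, **kwargs):
        # Forward sep=' ' only if the installed Tokenizer.__init__ still accepts it,
        # so the subclass works on either side of the upstream signature change
        if 'sep' in inspect.signature(Tokenizer.__init__).parameters:
            kwargs.setdefault('sep', ' ')
        super().__init__(tokenizer, rules=rules, counter=counter, lengths=lengths, mode=mode, **kwargs)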

Taking the sep argument out seemed to fix the issue at first, but then fastai_tokenizer kept the datasets from being created. I checked the various other components and isolated the issue to the tokenizer, but wasn't able to parse the resulting error message.

tfms=[attrgetter("text"), fastai_tokenizer, AddSpecialTokens(tokenizer), MLMTokensLabels(tokenizer)]
dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)

Here are the head and tail of the resulting ten or so pages of error message (again, apologies if I'm not following protocol here):

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-100-070c60545587> in <module>
     11 
     12 #dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)
---> 13 dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)
     14 
     15 dsets[0][0][:20], dsets[0][1][:20]

<ipython-input-99-0553a9fb405f> in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
      4     "Doesn't create a tuple in __getitem__ as x is already a tuple"
      5     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
----> 6         super().__init__(items=items, tfms=tfms, tls=tls, n_inp=n_inp, dl_type=dl_type, **kwargs)
      7 
      8     def __getitem__(self, it):

.
.  (Pages later)
.

~\.conda\envs\fastai2\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'parallel_gen.<locals>.f'
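
In case it helps with triage: the tail points at multiprocessing rather than the transforms themselves. parallel_gen's worker function is a local closure, and on Windows (the ~\.conda path) child processes are spawned, so that closure has to be pickled - and it can't be. If that's right, forcing the tokenization to run in-process should sidestep the error. A minimal sketch - this assumes the notebook's tokenizer picks up fastcore's defaults.cpus for its worker count, which I haven't verified:

from fastcore.utils import defaults  # `defaults` lives in fastcore.foundation/basics depending on version

# fastcore's parallel helpers (including parallel_gen) default n_workers to
# defaults.cpus; with 0 workers they fall back to a plain serial loop in the
# main process, so nothing needs to be pickled for a spawned child
defaults.cpus = 0

dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)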

Anyway, I hope this is helpful. Please keep up the amazing work!
