Data Out: Moving text from Hub to NLP models #492

Another approach (as suggested by @AbhinavTuli) is to convert text to tokens during assignment.

This happens if the user passes a tokenizer while instantiating a Dataset:

schema = {'sentence': Text(shape=(None,), max_shape=(500,))}  # we still rely on the Text schema

ds = hub.Dataset(tag, shape=(10,), schema=schema, mode="w", tokenizer=some_tokenizer)  # user specifies a tokenizer

for i, sentence in enumerate(sentences):
    ds['sentence', i] = sentence  # words are converted to tokens during assignment

In this case, tokenization occurs before the data is pushed into the Dataset, not after it is pulled out.

The user-provided tokenizer is then invoked by str_to_int:

def str_to_int(assign…
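The snippet above is truncated, so here is a minimal, self-contained sketch of the idea: a column object that routes string assignments through a user-supplied tokenizer inside a `str_to_int`-style helper. All names (`TextColumn`, the fallback behavior, the toy vocabulary) are illustrative assumptions, not Hub's actual API.

```python
# Hypothetical sketch of tokenize-on-assignment; names are illustrative.

class TextColumn:
    """Stores sentences as integer token lists, tokenizing at write time."""

    def __init__(self, max_shape, tokenizer=None):
        self.max_shape = max_shape
        self.tokenizer = tokenizer  # user-provided callable: str -> list[int]
        self.rows = {}

    def str_to_int(self, value):
        # Invoke the user-provided tokenizer; fall back to character codes
        # when none was given (assumed fallback, not Hub's documented one).
        if self.tokenizer is not None:
            return self.tokenizer(value)
        return [ord(c) for c in value]

    def __setitem__(self, i, sentence):
        # Tokenization happens here, before the data is stored.
        tokens = self.str_to_int(sentence)
        if len(tokens) > self.max_shape:
            raise ValueError("sentence exceeds max_shape after tokenization")
        self.rows[i] = tokens

    def __getitem__(self, i):
        # Reads return tokens directly; no NLP-side conversion is needed.
        return self.rows[i]


# Usage: a toy whitespace tokenizer standing in for a real one.
vocab = {"hello": 0, "world": 1}
col = TextColumn(max_shape=500, tokenizer=lambda s: [vocab[w] for w in s.split()])
col[0] = "hello world"
print(col[0])  # [0, 1]
```

The design point is that the conversion cost is paid once at ingest, so every downstream read hands the NLP model ready-to-use token ids.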

Answer selected by mikayelh