
Move Vocab & Embedding setup code into readers #337

Open
undwie opened this issue Nov 12, 2017 · 8 comments

@undwie
Collaborator

undwie commented Nov 12, 2017

I wrote some code that uses a jack reader (its model, pre- and post-processing), but with its own training loop; see below. This should be useful for others as a starting point for, say, multi-task learning objectives etc. Personally, I think it's a good idea to treat this as the core way of interfacing with jack. Everything else (a CLI training script etc.) should be a light wrapper around this.

    # training data: jack (QASetting, Answer) pairs
    jack_train = ...some jack QASetting, Answer pairs

    # load pre-trained GloVe vectors
    emb_file = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'
    embeddings = load_embeddings(emb_file, 'glove')

    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50

    # build the vocab and shared resources by hand, then wire up the three modules
    vocab = Vocab(emb=embeddings, init_from_embeddings=True)
    shared_resources = SharedResources(vocab=vocab, config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

    tf.reset_default_graph()
    reader.setup_from_data(jack_train, is_training=True)

    # custom training loop: minimise the reader's loss tensor directly
    loss = reader.model_module.tensors[Ports.loss]
    optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
    min_op = optimizer.minimize(loss)

    sess = model_module.tf_session
    sess.run(tf.global_variables_initializer())

    for epoch in range(0, 10):
        for batch in reader.input_module.batch_generator(jack_train, 1, False):
            feed_dict = reader.model_module.convert_to_feed_dict(batch)
            loss_value, _ = sess.run((loss, min_op), feed_dict=feed_dict)
        if epoch % 5 == 1:
            print(loss_value)

I think we should work on streamlining this. What I dislike about the above is the handling of vocabs and embeddings. This is reader-specific, and should be handled internally. So I'd rather it looked like:

    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'
 
    shared_resources = SharedResources(config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

and then the input module (or model module) takes care of setting up vocabs and embeddings. Because we don't have this separation right now, vocab and embedding code creeps into learning code, even though it's specific to readers. The above would fix this.
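A rough sketch of what "the input module takes care of it" could mean, reusing load_embeddings and Vocab from the snippet above; the hook name and the build_vocab helper are assumptions for illustration, not jack's actual API:

    # Sketch only: vocab/embedding setup moved into the input module's
    # data-dependent setup. build_vocab is a hypothetical helper.
    class XQAInputModule:

        def __init__(self, shared_resources):
            self.shared_resources = shared_resources

        def setup_from_data(self, data):
            config = self.shared_resources.config
            embeddings = None
            if 'emb_file' in config:
                # the input module decides whether it needs embeddings at all
                embeddings = load_embeddings(config['emb_file'], 'glove')
            # collect tokens from the QASettings and build the vocab
            vocab = build_vocab(data, embeddings=embeddings)
            # publish the result so the model and output modules can use it
            self.shared_resources.vocab = vocab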

To fix this, the reader-owners need to move embedding/vocab setup code into their modules.

Any thoughts or objections?

@pminervini
Collaborator

This is also useful for writing more fine-grained tests that check things during training.

I quickly turned your code here into a test: https://github.com/uclmr/jack/blob/master/tests/jack/readers/test_fastqa_loop.py

@dirkweissenborn
Collaborator

I think vocabs and embeddings could simply be created and loaded in the shared resources rather than in the modules. Doing it in the modules would introduce unnecessary overhead for most implementations, because embedding handling is quite reader-agnostic.
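For contrast, a sketch of this alternative, where SharedResources itself resolves embeddings from the config; the constructor signature shown here is an assumption for illustration, not the real class:

    # Sketch only: SharedResources loads embeddings from the config once,
    # so individual modules never touch embedding files themselves.
    class SharedResources:

        def __init__(self, config=None, vocab=None):
            self.config = config or {}
            self.vocab = vocab
            self.embeddings = None
            if 'emb_file' in self.config:
                # loaded once here and shared by every module of the reader
                self.embeddings = load_embeddings(self.config['emb_file'], 'glove')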

@riedelcastro
Contributor

I'd disagree. I have already seen readers that a) may or may not need character embeddings, b) use several vocabs for different parts (questions, answers, support), c) only use pre-trained embeddings, etc. All of these seem like reasonable decisions that are reader-specific (or input-module-specific).

This doesn't stop us from factoring out a lot of functionality, so that readers/input modules only need a single call to some create_vocab-style function in cases where they really do share functionality.
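One way such a shared helper could look, sketched under assumptions: QASetting exposes question and support, and Vocab can be called to add symbols, which is an assumption about its API:

    # Sketch only: a shared helper so reader-specific input modules can set up
    # their vocab with one call. Signature and Vocab usage are assumed.
    def create_vocab(data, embeddings=None):
        vocab = Vocab(emb=embeddings, init_from_embeddings=embeddings is not None)
        for qa_setting, _answers in data:
            for text in [qa_setting.question] + list(qa_setting.support):
                for token in text.split():
                    vocab(token)  # add (or look up) the symbol
        return vocab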

@dirkweissenborn
Collaborator

I agree with the vocab part, but the embeddings can already be provided separately through the shared resources.

@riedelcastro
Contributor

I'd prefer it if the reader were in charge of that too. This makes setting up readers more straightforward for the user. Right now I have to think about what I need to prepare (the embeddings) and about what the reader does to populate the shared resources. And I need to look into the reader to decide whether it needs embeddings or not (if it doesn't, I shouldn't load embeddings). I would like the API usage to be (see the sketch after this list):

  1. Set up a configuration dictionary
  2. Load the reader with this configuration
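Concretely, something like the sketch below, where reader_from_config is a hypothetical factory that hides all vocab/embedding handling; only the config keys are taken from the example above:

    # Sketch only: the desired two-step usage; reader_from_config is hypothetical.
    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'

    reader = reader_from_config(config)
    reader.setup_from_data(jack_train, is_training=True)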

I also feel that even embedding loading can get convoluted (for example, if you decide to preload the embeddings for the training set, or if you want to load other types of pre-trained models, etc.).
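For instance, restricting the pre-trained vectors to the symbols that actually occur in the training vocab is exactly the kind of logic that would otherwise leak into user code; a sketch with assumed Vocab/embeddings accessors (vocab.sym2id and embeddings.get are not necessarily jack's real API):

    import numpy as np

    # Sketch only: keep just the pre-trained vectors for symbols in the vocab.
    def filter_embeddings(embeddings, vocab, dim):
        matrix = np.zeros([len(vocab), dim], dtype=np.float32)
        for symbol, idx in vocab.sym2id.items():   # assumed mapping attribute
            vector = embeddings.get(symbol)        # assumed lookup method
            if vector is not None:
                matrix[idx] = vector
        return matrix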

@dirkweissenborn
Collaborator

dirkweissenborn commented Nov 13, 2017

I don't have a strong opinion here and understand your point. It also makes it clear to the user where these resources come from when they are not auto-magically created somewhere else first. We had exactly this problem before, when people did not understand where embeddings and vocabs actually come from and how they are created. Anyway, this is a major change, so we should push it sooner rather than later.

@dirkweissenborn
Collaborator

@riedelcastro @pminervini, is anyone working on this?

@dirkweissenborn
Collaborator

Vocab and embeddings are now decoupled. The setup code should still move into the readers themselves, though.
