
Guided alignments in Sockeye. #1105

Open
tomsbergmanis opened this issue Feb 19, 2024 · 6 comments

@tomsbergmanis

Hi!
We finally have time to reopen work on guided alignments for Sockeye 3.
To recap: guided alignments are handy for formatted document translation, non-translatable entity and placeholder handling, and variations of automatic post-editing. Guided alignments are described in the paper Jointly Learning to Align and Translate with Transformer Models.

Previously, we were advised to start from the metadata branch. Would it still be the best starting point? If so, would bringing it up to date be complicated?

Cheers!
Toms

@mjdenkowski (Contributor)

Hi Toms,

At this point, the metadata branch is somewhat out of sync with main, but it could still be helpful as a reference. One path forward would be to follow how metadata is woven through data preparation and training in the metadata branch and add alignment tracking in similar places in the main branch.

Best,
Michael

@iPRET commented Mar 14, 2024

Hi Michael,
I am the developer at Tilde implementing guided alignments in Sockeye 3. Things are going well, but I have a question: the sockeye.layers.MultiHeadAttention class uses torch.nn.functional.multi_head_attention_forward, which applies dropout to the attention weights after the softmax. That breaks the cross-entropy loss's assumption that its inputs are valid probability distributions, and it makes training a lot worse ༼ つ ◕_◕ ༽つ.
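To make this concrete, here is a minimal plain-PyTorch illustration (not Sockeye code, just the effect we mean): after post-softmax dropout, the attention rows no longer sum to 1, so they are no longer valid inputs for a cross-entropy-style alignment loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy attention scores with shape (target_len, source_len)
scores = torch.randn(4, 6)

probs = F.softmax(scores, dim=-1)                 # each row sums to 1
dropped = F.dropout(probs, p=0.1, training=True)  # post-softmax dropout, as in the fused kernel

print(probs.sum(dim=-1))    # tensor([1., 1., 1., 1.])
print(dropped.sum(dim=-1))  # zeroed entries and 1/(1-p) scaling: rows no longer sum to 1
```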
So, we currently see two options:

  • To reimplement (mostly copy and modify) torch.nn.functional.multi_head_attention_forward
  • To turn off attention dropout for the entire layer used to learn guided alignments

Do you have any preference? Or do you see another way forward?

Thanks,
Ingus Jānis Pretkalniņš

P.S. We were surprised that dropout on the attention weights is applied after the softmax rather than before it, yet post-softmax dropout seems to be the standard in transformer implementations. Do you know why that is?

@mjdenkowski (Contributor)

Hi Ingus,

I'm not familiar with the internals of torch.nn.functional.multi_head_attention_forward. I believe we use it during training because it is faster than our inference implementation (layers.py#L544-L570, layers.py#L655-L678). When we switch between implementations, we need to either interleave or separate the parameters to match what different layers expect (layers.py#L455-L510).

If the inference implementation doesn't have the dropout issue, one option would be to also use that implementation during training when guided alignments are enabled. This may be a shorter path than the reimplementation you mentioned.
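Roughly the shape I have in mind (a sketch only, with assumed names, not the actual Sockeye code): compute attention manually on the guided-alignment path and hand the pre-dropout probabilities to the alignment loss.

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v, dropout_p: float = 0.0, training: bool = True):
    """q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    probs = F.softmax(scores, dim=-1)                          # valid distribution per row
    weights = F.dropout(probs, p=dropout_p, training=training)  # dropout only affects the output path
    out = torch.matmul(weights, v)
    # Return the pre-dropout probabilities; these are what a guided-alignment
    # loss (e.g. cross-entropy against reference alignments) would consume.
    return out, probs
```

You would still want to confirm that the manual path is fast enough for training.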

Best,
Michael

@iPRET commented Apr 15, 2024

Hello Michael,

We're doing some final internal checks on the changes we've made (it's about 1000 lines of changes (づ。◕‿‿◕。)づ), and we'll probably open the pull request very soon.
Apart from the developer requirements (https://awslabs.github.io/sockeye/development.html), are there any graphs/checks/experiments you would like to see before investing time in a code review?

Thanks,
IP

@mjdenkowski (Contributor)

It sounds like you've made a lot of progress toward your goal. If you're primarily making these changes to enable your own work, you could keep them on a fork of Sockeye without the need to go through a full code review.

If you're interested in merging your changes into Sockeye's main branch, you could run additional experiments to verify the following:

  • The feature works at the scale of model it would be used with (according to your measure of success).
  • The changes do not negatively impact baseline training (and inference, if changed). This includes speed, accuracy, and memory usage; a rough way to measure speed and memory is sketched below.
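One possible way to compare a baseline run against a guided-alignment run on step time and peak GPU memory (an illustrative helper, not part of Sockeye):

```python
import time
import torch

def profile_steps(step_fn, n_steps: int = 100):
    """step_fn: a callable that runs one training step; returns timing and memory stats."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return {
        "sec_per_step": elapsed / n_steps,
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```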

@iPRET commented May 15, 2024

Hello Michael,

We've prepared a report reviewing the ups and downs of adding alignment matrices to Sockeye:
Sockeye_Alignment_Matrix_Report-6.pdf

I will open a pull request promptly. ٩(◕‿◕)۶

Thanks,
IP
