
What is the current model #94

Open
datistiquo opened this issue Mar 30, 2020 · 6 comments

Comments

@datistiquo

datistiquo commented Mar 30, 2020

Hey,

What is the current model used for https://covid-staging.deepset.ai/answers? :-)
I find it very accurate for my questions and single words (in German). Have you fine-tuned on the German Corona QAs? Do you have any trained deep learning matching algorithm in use? I cannot imagine that the model just uses cosine similarity with BERT, because in my case that does not perform as well as the model behind the bot right now.

I experimented with my own questions and the pretrained deepset model (German) using cosine similarity. I wonder why queries with words like "hallo" or "die" have only a marginally lower similarity than real, Corona-specific questions when using just the deepset German model. So those irrelevant words have a high similarity of around 90%...
Do you know any reason why this is the case?

Since QA pairs in German are rare, do you have any idea what other methods could be tried for text matching without training, maybe something like Word Mover's Distance matching with BERT embeddings?

I am very new to using BERT.

@tholor
Member

tholor commented Mar 31, 2020

Hey @datistiquo,

Yep, cosine similarity on plain BERT embeddings usually doesn't work very well. That's why we use a sentence-bert model for English questions, which was trained on an NLI task with a siamese network (see https://github.com/UKPLab/sentence-transformers). For all other languages (incl. German) we are currently just using plain BM25 (the default in Elasticsearch). This works well if you have token matches between your query and the FAQ questions (e.g. "symptome" or "schwanger"), but fails for synonyms / related words. We are currently collecting data via our feedback mechanism and hope to have enough question pairs at some point to also train a sentence-bert for German.
You can find most of our experiments (on English Covid questions) here:
https://public-mlflow.deepset.ai/#/experiments/55
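
For illustration, here is a minimal sketch of that sentence-bert setup with the sentence-transformers library (the checkpoint name and the FAQ texts are just placeholders, not our production configuration):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# NLI-trained siamese BERT; placeholder checkpoint for illustration
model = SentenceTransformer("bert-base-nli-mean-tokens")

faq_questions = [
    "What are the symptoms of COVID-19?",
    "Is it safe to be pregnant during the pandemic?",
]
query = "Which signs of infection should I look out for?"

faq_emb = np.asarray(model.encode(faq_questions))
query_emb = np.asarray(model.encode([query]))[0]

# Cosine similarity between the query and every FAQ question
scores = faq_emb @ query_emb / (
    np.linalg.norm(faq_emb, axis=1) * np.linalg.norm(query_emb)
)
best = int(np.argmax(scores))
print(faq_questions[best], float(scores[best]))
```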

Since QA pairs in German are rare, do you have any idea what other methods could be tried for text matching without training, maybe something like Word Mover's Distance matching with BERT embeddings?

There's a whole bunch of unsupervised techniques which try to do better than simple averaging of word embeddings. We are currently exploring S3E in another context and are implementing it in FARM. Their paper also includes a good overview of some other techniques - maybe that's interesting for you.

@datistiquo
Author

datistiquo commented Mar 31, 2020

Thank you!

Yep, cosine similarity on plain BERT embeddings usually doesn't work very well.

Yes, but I cannot imagine why stopwords or words like "Hallo" have such a high similarity with the Covid-related answers. Maybe it is due to noise in the high-dimensional space of the tokenization, see below?

I am thinking right now about a combination of metrics like cosine, Euclidean, WMD and TF-IDF. Have you tried fastText for text matching?

Do you also have the experiments for German public on MLflow? Did you already try fine-tuning on the German Covid QAs?

We are currently collecting data via our feedback mechanism and hope to have enough question pairs at some point to also train a sentence-bert for German.

Are they public besides the QAs I already saw?

Maybe it is also a problem that with the deepset German BERT model you get very tiny pieces from the tokenization, for example:

atemschutzmasken gives 'at', '##em', '##schutz', '##mas', '##ken'

I feel that introduces a lot of noise when you want to work on word level for similarity, like with WMD or BERTScore?

https://arxiv.org/pdf/1904.09675.pdf

I want to try to focus on word level and use only the highest semantic matches of an answer instead of a whole sentence embedding. I think this is also the idea of the S3E paper.

Maybe someone should fine-tune BERT on all Corona-related text? :-)

EDIT: For example, I checked the word "hallo" just for fun against other words:


```
your question: hallo
similarity:
> 0.8680158     hausgeburt
> 0.8613737     bezahlen
> 0.86055136    übernimmt
> 0.85608506    arbeitgeber
```

Since I am new to BERT, I hope I did everything right. I am using BERT as a service right now for testing.
I encode with the last 2 layers using the REDUCE_MAX pooling strategy and your pretrained deepset model. I thought BERT should discriminate a word like "hallo" more, even if it is just pretrained. That could possibly imply that actually every word of the vocab has a high similarity against every other word...? That is very curious. It sounds like "overfitting".
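
For reference, this is roughly how I run the check (a sketch; it assumes a bert-serving-start server is already running with the pretrained deepset model, pooling over the last 2 layers with REDUCE_MAX, and the model directory is a placeholder):

```python
# Assumed server start (placeholder path):
#   bert-serving-start -model_dir /path/to/bert-base-german-cased \
#       -pooling_layer -2 -1 -pooling_strategy REDUCE_MAX
from bert_serving.client import BertClient
import numpy as np

bc = BertClient()
words = ["hallo", "hausgeburt", "bezahlen", "übernimmt", "arbeitgeber"]
vecs = bc.encode(words)  # shape: (len(words), hidden_dim)

query, candidates = vecs[0], vecs[1:]
sims = candidates @ query / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
)
for word, sim in zip(words[1:], sims):
    print(f"{sim:.4f}\t{word}")
```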

@tholor
Member

tholor commented Apr 3, 2020

Do you also have the experiments for German public on MLflow? Did you already try fine-tuning on the German Covid QAs?

No, we focused our experiments on English and wanted to take the learnings from there to other languages. If you do more experiments on German, it would of course be great if you could share them here (or in MLflow). We haven't done any fine-tuning on questions yet, as the number we have collected is still very low.

Are they public besides the QAs I already saw?

The questions themselves and the eval dataset are public. The data collected via feedback is still very little, but if we get to a decent number here, we will, of course, publish it.

Have you tried FastText for text matching?

Yes. Usually it's a good benchmark. For English question similarity it didn't work that well, though: https://public-mlflow.deepset.ai/#/experiments/55/runs/fef67da295f2454dacf7c247369c8d0b
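
Just to illustrate what such a fastText baseline looks like, a small sketch (the pretrained vectors file is a placeholder; any fastText model can be plugged in):

```python
import fasttext
import numpy as np

# Placeholder: pretrained German fastText vectors
model = fasttext.load_model("cc.de.300.bin")

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two fastText sentence vectors."""
    va = model.get_sentence_vector(a)
    vb = model.get_sentence_vector(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(similarity("Welche Symptome hat Corona?",
                 "Was sind typische Krankheitszeichen?"))
```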

Maybe someone should fine-tune BERT on all Corona-related text? :-)

We did that for English ;)
See #12

For example, I checked the word "hallo" just for fun against other words [...]. Since I am new to BERT, I hope I did everything right. I am using BERT as a service right now for testing.

One thing that you should check: Are you really extracting the token embeddings from BERT-as-a-service, or are the padding tokens also included in your reduce_mean operation? If you just feed in one word like "hello" and most of your sequence therefore consists of padding tokens, this might bias your pooled embeddings and produce high similarities. Also, the [CLS] token is probably something you don't want in there. As I am not using BERT-as-service, I am not sure how this is handled there.
We have it implemented like this in FARM, if that helps you (Minimal Example, Pooling method in the back)
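
To make the masking point concrete, here is a small sketch (not FARM's actual code) of mean pooling that excludes padding and special tokens, using Hugging Face transformers with the German BERT model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")

texts = ["hallo", "Wer zahlt bei einer Hausgeburt?"]
enc = tokenizer(texts, padding=True, return_tensors="pt",
                return_special_tokens_mask=True)
special = enc.pop("special_tokens_mask")

with torch.no_grad():
    token_emb = model(**enc).last_hidden_state  # (batch, seq_len, hidden)

# Mask out padding AND special tokens ([CLS], [SEP]) before averaging
mask = (enc["attention_mask"].bool() & ~special.bool()).unsqueeze(-1)
sentence_emb = (token_emb * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_emb.shape)  # (2, hidden_dim)
```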

More generally: Plain BERT embeddings are really not the best for semantic similarity. I can really recommend the sentence-bert paper https://arxiv.org/abs/1908.10084 .

I want to try to focus on word level to use only the highest semantic matching of an answer instead of whole sentence embedding. I think this is also the idea of the S3E paper.

Sounds interesting. Keep us updated on your progress & results :)
As mentioned, we have a German eval dataset, which might be helpful for you.

@datistiquo
Author

We did that for English ;)

I meant something like scraping all Corona-related text (in German).

If you just feed in one word like "hello" and most of your sequence therefore consists of padding tokens this might bias your pooled embeddings and produce high similarities.

Yes, that sounds like an issue.

As I am not using BERT-as-service, I am not sure how this is handled there.
We have it implemented like this in FARM, if that helps you (Minimal Example, Pooling method in the back)

How do you ignore the padding and the CLS token there? Is your FARM something like doing downstream tasks with BERT?

@tholor
Member

tholor commented Apr 6, 2020

I meant something like scraping all Corona-related text (in German).

Ah okay. Yes, this could be helpful. However, it would require quite a substantial number of texts, and for the goal of semantic similarity it might not give the biggest boost (as we saw in the English case).

How do you ignore the padding and the CLS token there?

In the minimal example I linked, both are ignored automatically (default setting in FARM). If you are interested in how it's done behind the scenes, you can check it out here.

Is your FARM something like doing downstream tasks with BERT?

You can do both: extract embeddings (like bert-as-a-service) and train on downstream tasks (e.g. QA). It supports most of the transformer architectures out there.
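
A rough sketch of the embedding extraction path (the argument names follow FARM's embeddings_extraction example; exact parameters may differ between versions):

```python
from farm.infer import Inferencer

basic_texts = [
    {"text": "Welche Symptome hat Corona?"},
    {"text": "Ist eine Hausgeburt derzeit möglich?"},
]

model = Inferencer.load(
    "bert-base-german-cased",
    task_type="embeddings",
    gpu=False,
    batch_size=8,
    extraction_strategy="reduce_mean",  # padding / [CLS] are excluded by default
    extraction_layer=-1,
)
result = model.inference_from_dicts(dicts=basic_texts)
print(result[0])
```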

@datistiquo
Author

Hey,

I want to get the embeddings from BERT for a BoW model or for just calculating the cosine distance.

Are you really extracting the token embeddings from BERT-as-a-service or are the padding tokens also included in your reduce_mean operation?

What do you mean by that? Does it make a difference if padding is removed before pooling? But anyway, the output of BERT is a vector of max_length length, possibly with padding again.

Are the pooled token vectors per word or per token from the tokenizer? Would I then need to average to get the vector for each word?

I meant something like scraping all Corona-related text (in German).

Ah okay. Yes, this could be helpful. However, it would require quite a substantial number of texts

But for fastText I think a text file of a few megabytes should be fine? What experience do you have with the right hyperparameters for fastText on small data?

You said that the number of question pairs is still too small for BERT fine-tuning. As I saw, you have 1000, but that should be enough for fine-tuning on a downstream task like sentence pair classification. I haven't tried it yet, but I have seen several examples where people fine-tuned for classification on a few thousand or even 500 examples and got better results than with fastText. I mean, that is the whole point of using BERT for transfer learning with small data.
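
For example, a minimal sketch of what fine-tuning on question pairs could look like with sentence-transformers (the pairs and labels below are made up; the real training data would be the collected German question pairs):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder starting checkpoint; any multilingual / German model could be used
model = SentenceTransformer("distiluse-base-multilingual-cased")

train_examples = [
    InputExample(texts=["Welche Symptome hat Corona?",
                        "Was sind typische Krankheitszeichen?"], label=0.9),
    InputExample(texts=["Welche Symptome hat Corona?",
                        "Wer zahlt bei einer Hausgeburt?"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```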
