Extract Contextual Word Embeddings #151

Open

Hazoom wants to merge 7 commits into master

Conversation

Hazoom

@Hazoom Hazoom commented Jul 11, 2019

Add the ability to extract contextual word embeddings from a given list of sentences using XLNet, just like in BERT.
The script extracts a fixed-length vector for each token in the sentence.

First, one needs to create an input text file as follows:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'I love New York. ||| New York is a city' > data/corpus.txt

After that, the script extract_features.py can be used like this; --max_seq_length=64 pads or truncates each input to 64 tokens, and the script emits a contextual vector for each token:

INIT_CKPT_DIR=models/xlnet_cased_L-24_H-1024_A-16
OUTPUT_DIR=data
MODEL_DIR=experiment/extract_features

python extract_features.py \
    --input_file=data/corpus.txt \
    --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
    --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
    --use_tpu=False \
    --num_core_per_host=1 \
    --output_file=${OUTPUT_DIR}/output.json \
    --model_dir=${MODEL_DIR} \
    --num_hosts=1 \
    --max_seq_length=64 \
    --eval_batch_size=8 \
    --predict_batch_size=8 \
    --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
    --summary_type=mean

Alternatively, use the scripts/gpu_extract_features.sh script to run it more easily.

This will create a JSON file (one line per line of input) containing the contextual word embeddings from XLNet.
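
A minimal sketch of reading that output (this assumes a BERT-style layout where each line is a JSON object holding a "features" list of {"token", "values"} entries; the exact key names written by this PR's extract_features.py may differ):

import json

with open("data/output.json") as f:
    for line in f:
        example = json.loads(line)
        for feature in example["features"]:   # "features" key assumed, as in BERT's script
            token = feature["token"]
            vector = feature["values"]        # one contextual embedding per token
            print(token, len(vector))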

#39

Hi @zihangdai @kimiyoung, can you please take a look?

@Hazoom Hazoom mentioned this pull request Jul 11, 2019
@3NFBAGDU

Add the ability to extract contextual word embeddings from a given list of sentences using XLNet, just like in BERT.
The script extracts a fixed-length vector for each token in the sentence and one pooled vector over all the word embeddings, according to the given pooling strategy parameter.

First, one needs to create an input text file as follows:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'I love New York. ||| New York is a city' > data/corpus.txt

After that, the script extract_features.py can be used like this; --max_seq_length=64 pads or truncates each input to 64 tokens, and the script emits one contextual vector per token plus one pooled vector using the mean strategy:

INIT_CKPT_DIR=models/xlnet_cased_L-24_H-1024_A-16
OUTPUT_DIR=data
MODEL_DIR=experiment/extract_features

python extract_features.py \
    --input_file=data/corpus.txt \
    --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
    --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
    --use_tpu=False \
    --num_core_per_host=1 \
    --output_file=${OUTPUT_DIR}/output.json \
    --model_dir=${MODEL_DIR} \
    --num_hosts=1 \
    --max_seq_length=64 \
    --eval_batch_size=8 \
    --predict_batch_size=8 \
    --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
    --summary_type=mean

Alternatively, use the scripts/gpu_extract_features.sh script to run it more easily.

This will create a JSON file (one line per line of input) containing the contextual word embeddings from XLNet, including one pooled vector.

#39

Hello, I wrote one sentence into corpus.txt. When I run the extract_features.py command above several times, the sentence features are different on each run. I think they should be the same.

@Hazoom
Author

Hazoom commented Jul 11, 2019

@3NFBAGDU Thanks for pointing this out.
From what I saw, it happens only in the pooled vector, and I only use XLNet's original pooling code. I think it is caused by dropout, which is of course random.
I will fix the script so that it does not apply dropout, which is the expected behavior in prediction mode, and in addition I will remove the pooled vector from the output. I think it is better for the client to perform the pooling on the client side.
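
A minimal sketch of that kind of change, assuming the repo's xlnet.RunConfig accepts dropout and dropatt arguments as in xlnet.py (the actual wiring inside extract_features.py may differ):

import xlnet

# Build the run config for prediction with dropout disabled,
# so repeated runs produce identical embeddings.
run_config = xlnet.RunConfig(
    is_training=False,
    use_tpu=False,
    use_bfloat16=False,
    dropout=0.0,    # no dropout on hidden states
    dropatt=0.0)    # no dropout on attention probabilities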

Moshe Hazoom added 2 commits July 11, 2019 14:22
@3NFBAGDU

Hi, thank you for answering. As I have tested it, Euclidean distance works better than cosine distance for words; the cosine distance is always > 0.89. I have trained my model on 1.6M sentences.

And do you have any idea how to get a sentence embedding vector from here?

@Hazoom
Author

Hazoom commented Jul 11, 2019

Hi, thank you for answering. As I have tested it, Euclidean distance works better than cosine distance for words; the cosine distance is always > 0.89. I have trained my model on 1.6M sentences.

And do you have any idea how to get a sentence embedding vector from here?

Thanks for sharing the results.
In order to get a sentence embedding, you can apply one of the existing pooling strategies, such as max pooling, mean pooling, max-mean pooling, attention pooling, etc.
For example, to perform mean pooling, you just need to average all word vectors into one vector of dimension 1024; for max pooling, take the element-wise maximum instead.

Please note that some of the tokens are padding tokens (actually most of them), so you should ignore them and perform the pooling over the real tokens only, as in the sketch below.
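
A minimal sketch of mean/max pooling over the per-token vectors while skipping padding (the pool function, the pad_tokens tuple, and the "<pad>" string are illustrative assumptions; use whatever padding marker the output actually contains):

import numpy as np

def pool(features, strategy="mean", pad_tokens=("<pad>", "")):
    # Keep only real tokens; drop padding / empty tokens before pooling.
    vectors = np.array([f["values"] for f in features
                        if f["token"] not in pad_tokens])
    if strategy == "mean":
        return vectors.mean(axis=0)   # 1024-dim sentence embedding
    if strategy == "max":
        return vectors.max(axis=0)    # element-wise maximum
    raise ValueError("unknown strategy: %s" % strategy)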

@3NFBAGDU

Hi, if I give 'Hello, how are you', there should be output like this:
{"token": "Hello", "values": [...]}, {"token": "how", "values": [...]}, ...
but I get {"token": "", "values": []}; the token is always empty. Is this my sentencepiece model's fault?

@Hazoom
Author

Hazoom commented Jul 12, 2019

Hi, if I give 'Hello, how are you', there should be output like this:
{"token": "Hello", "values": [...]}, {"token": "how", "values": [...]}, ...
but I get {"token": "", "values": []}; the token is always empty. Is this my sentencepiece model's fault?

Apparently, when given the sentence Hello, how are you?, the sentence piece model tokenizes it such that the first token is empty.
I added code that ignores those empty tokens.
Thanks for noticing.

@3NFBAGDU

3NFBAGDU commented Jul 25, 2019

estimator.predict() works too slowly. I want to run prediction on some text every 2 seconds, but every time I call the estimator.predict() function, it loads the model all over again. I want to load the model just once and then call estimator.predict() on that same model every 2 seconds to get faster predictions. Can you help me?
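
A common workaround (a sketch only, assuming the script uses a standard tf.estimator.Estimator; the FastPredict name and the output_types argument are illustrative) is to keep one predict() generator alive and feed it from a Python generator via tf.data.Dataset.from_generator, so the graph and checkpoint are loaded only once:

import tensorflow as tf

class FastPredict:
    def __init__(self, estimator, output_types):
        self.estimator = estimator
        self.output_types = output_types
        self.current = None
        self.predictions = None   # lazily created predict() generator

    def _generator(self):
        while True:
            yield self.current    # always hand over the latest example

    def _input_fn(self):
        dataset = tf.data.Dataset.from_generator(
            self._generator, output_types=self.output_types)
        return dataset.batch(1)   # the model expects a batch dimension

    def predict(self, features):
        self.current = features
        if self.predictions is None:
            # Graph building and checkpoint loading happen only once, here.
            self.predictions = self.estimator.predict(input_fn=self._input_fn)
        return next(self.predictions)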

@Hazoom
Author

Hazoom commented Sep 15, 2019

Hi @zihangdai @kimiyoung, since issue #39 was closed, can you please merge this into master?
Thanks.

@JxuHenry

@Hazoom Hi sir, how can I modify the vector dimensions?

@Hazoom
Author

Hazoom commented Oct 25, 2019

@Hazoom Hi sir, how can I modify the vector dimensions?

@JxuHenry I don't think that's possible; the dimension is set by the network's architecture.

@JxuHenry

@Hazoom Hi sir, how can I modify the vector dimensions?

@JxuHenry I don't think that's possible; the dimension is set by the network's architecture.

OK, thank you very much.

@frank-lin-liu

Hi Hazoom,
I followed your instructions and ran extract_features.py. Does this program need a GPU to run?

@Hazoom
Author

Hazoom commented Feb 16, 2020

Hi Hazoom,
I followed your instructions and ran extract_features.py. Does this program need a GPU to run?

No, it can also run on a CPU, just a little bit slower than on a GPU.

@frank-lin-liu

Thank you, Hazoom. I use TensorFlow v1.15. Is that the TensorFlow version you used?

@Hazoom
Author

Hazoom commented Feb 16, 2020

Thank you, Hazoom. I use TensorFlow v1.15. Is that the TensorFlow version you used?

I used TensorFlow v1.14, but it should work the same, I hope.

@frank-lin-liu

It seems that I don't get the expected results. I copied some messages below. Could you please take a look and let me know what the problem is?


2020-02-16 15:03:13.502591: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-16 15:03:13.523791: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2712000000 Hz
2020-02-16 15:03:13.524337: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c394737b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-16 15:03:13.524391: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-16 15:03:13.527100: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-16 15:03:13.527146: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-16 15:03:13.527179: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (pl-00193583): /proc/driver/nvidia/version does not exist
2020-02-16 15:03:16.775053: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 131072000 exceeds 10% of system memory.
INFO:tensorflow:Running local_init_op.
I0216 15:03:18.755628 140438614927168 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0216 15:03:18.981006 140438614927168 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Predicting submission for example_cnt: 0
I0216 15:03:26.952376 140438614927168 extract_features.py:427] Predicting submission for example_cnt: 0

@mqhe

mqhe commented Jun 19, 2020

Hi Hazoom,
I succeeded in using the scripts to get an output.json for the single sentence "Hello World". I observed that the embedding has 6 tokens, "he", "ll", "o", "world", and the last two tokens are <sep> and <cls>. Is this tokenization normal? If we use a pooling strategy to calculate the sentence embedding, do we need to remove the <sep> and <cls> embeddings?
