Extract Contextual Word Embeddings #151

Open

Hazoom wants to merge 7 commits into master

Conversation

Hazoom

@Hazoom Hazoom commented Jul 11, 2019

Add the ability to extract contextual word embeddings from a given list of sentences using XLNet, just like in BERT.
The script extracts a fixed-length vector for each token in the sentence.

First, one needs to create an input text file as follows:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'I love New York. ||| New York is a city' > data/corpus.txt

After that, the script extract_features.py can be used like this; --max_seq_length=64 pads or truncates each input to 64 tokens, and the script emits a contextual vector for each token:

INIT_CKPT_DIR=models/xlnet_cased_L-24_H-1024_A-16
OUTPUT_DIR=data
MODEL_DIR=experiment/extract_features

python extract_features.py \
    --input_file=data/corpus.txt \
    --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
    --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
    --use_tpu=False \
    --num_core_per_host=1 \
    --output_file=${OUTPUT_DIR}/output.json \
    --model_dir=${MODEL_DIR} \
    --num_hosts=1 \
    --max_seq_length=64 \
    --eval_batch_size=8 \
    --predict_batch_size=8 \
    --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
    --summary_type=mean

Alternatively, use the scripts/gpu_extract_features.sh script to run it more easily.

This will create a JSON file (one line per line of input) containing the contextual word embeddings from XLNet.
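
A minimal sketch of reading that output (this assumes a BERT-style layout where each line is a JSON object holding a "features" list of {"token", "values"} entries; the exact key names written by this PR's extract_features.py may differ):

import json

with open("data/output.json") as f:
    for line in f:
        example = json.loads(line)
        for feature in example["features"]:   # "features" key assumed, as in BERT's script
            token = feature["token"]
            vector = feature["values"]        # one contextual embedding per token
            print(token, len(vector))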

#39

Hi @zihangdai @kimiyoung, can you please take a look?

@Hazoom Hazoom mentioned this pull request Jul 11, 2019
@3NFBAGDU

Add the ability to extract contextual word embeddings from a given list of sentences using XLNet, just like in BERT.
The script extracts a fixed-length vector for each token in the sentence and one pooled vector over all the word embeddings, according to the given pooling strategy parameter.

First, one needs to create an input text file as follows:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'I love New York. ||| New York is a city' > data/corpus.txt

After that, the script extract_features.py can be used like this; --max_seq_length=64 pads or truncates each input to 64 tokens, and the script emits one contextual vector per token plus one pooled vector using the mean strategy:

INIT_CKPT_DIR=models/xlnet_cased_L-24_H-1024_A-16
OUTPUT_DIR=data
MODEL_DIR=experiment/extract_features

python extract_features.py \
    --input_file=data/corpus.txt \
    --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
    --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
    --use_tpu=False \
    --num_core_per_host=1 \
    --output_file=${OUTPUT_DIR}/output.json \
    --model_dir=${MODEL_DIR} \
    --num_hosts=1 \
    --max_seq_length=64 \
    --eval_batch_size=8 \
    --predict_batch_size=8 \
    --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
    --summary_type=mean

Alternatively, use the scripts/gpu_extract_features.sh script to run it more easily.

This will create a JSON file (one line per line of input) containing the contextual word embeddings from XLNet, including one pooled vector.

#39

Hello, I wrote one sentence into corpus.txt. When I run the extract_features.py command above several times, the sentence features are different on each run. I think they should be the same.

@Hazoom
Author

Hazoom commented Jul 11, 2019

@3NFBAGDU Thanks for pointing this out.
From what I saw, it happens only in the pooled vector, and I only use XLNet's original pooling code. I think it is caused by dropout, which is of course random.
I will fix the script so that it does not apply dropout, which is the expected behavior in prediction mode, and in addition I will remove the pooled vector from the output. I think it is better for the client to perform the pooling on the client side.
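
A minimal sketch of that kind of change, assuming the repo's xlnet.RunConfig accepts dropout and dropatt arguments as in xlnet.py (the actual wiring inside extract_features.py may differ):

import xlnet

# Build the run config for prediction with dropout disabled,
# so repeated runs produce identical embeddings.
run_config = xlnet.RunConfig(
    is_training=False,
    use_tpu=False,
    use_bfloat16=False,
    dropout=0.0,    # no dropout on hidden states
    dropatt=0.0)    # no dropout on attention probabilities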

Moshe Hazoom added 2 commits July 11, 2019 14:22
@3NFBAGDU

Hi, thank you for answering. As I have tested it, Euclidean distance works better than cosine distance for words; the cosine distance is always > 0.89. I have trained my model on 1.6M sentences.

And do you have any idea how to get a sentence embedding vector from here?

@Hazoom
Author

Hazoom commented Jul 11, 2019

Hi, thank you for answering. As I have tested it, Euclidean distance works better than cosine distance for words; the cosine distance is always > 0.89. I have trained my model on 1.6M sentences.

And do you have any idea how to get a sentence embedding vector from here?

Thanks for sharing the results.
In order to get a sentence embedding, you can apply one of the existing pooling strategies, such as max pooling, mean pooling, max-mean pooling, attention pooling, etc.
For example, to perform mean pooling, you just need to average all word vectors into one vector of dimension 1024; for max pooling, take the element-wise maximum instead.

Please note that some of the tokens are padding tokens (actually most of them), so you should ignore them and perform the pooling over the real tokens only, as in the sketch below.
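
A minimal sketch of mean/max pooling over the per-token vectors while skipping padding (the pool function, the pad_tokens tuple, and the "<pad>" string are illustrative assumptions; use whatever padding marker the output actually contains):

import numpy as np

def pool(features, strategy="mean", pad_tokens=("<pad>", "")):
    # Keep only real tokens; drop padding / empty tokens before pooling.
    vectors = np.array([f["values"] for f in features
                        if f["token"] not in pad_tokens])
    if strategy == "mean":
        return vectors.mean(axis=0)   # 1024-dim sentence embedding
    if strategy == "max":
        return vectors.max(axis=0)    # element-wise maximum
    raise ValueError("unknown strategy: %s" % strategy)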

@3NFBAGDU

Hi, if I give 'Hello, how are you', there should be output like this:
{"token": "Hello", "values": [...]}, {"token": "how", "values": [...]}, ...
but I get {"token": "", "values": []}; the token is always empty. Is this my sentencepiece model's fault?

@Hazoom
Author

Hazoom commented Jul 12, 2019

Hi, if I give 'Hello, how are you', there should be output like this:
{"token": "Hello", "values": [...]}, {"token": "how", "values": [...]}, ...
but I get {"token": "", "values": []}; the token is always empty. Is this my sentencepiece model's fault?

Apparently, when given the sentence Hello, how are you?, the sentence piece model tokenizes it such that the first token is empty.
I added code that ignores those empty tokens.
Thanks for noticing.

@3NFBAGDU

3NFBAGDU commented Jul 25, 2019

estimator.predict() works too slowly. I want to run prediction on some text every 2 seconds, but every time I call the estimator.predict() function, it loads the model all over again. I want to load the model just once and then call estimator.predict() on that same model every 2 seconds to get faster predictions. Can you help me?
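
A common workaround (a sketch only, assuming the script uses a standard tf.estimator.Estimator; the FastPredict name and the output_types argument are illustrative) is to keep one predict() generator alive and feed it from a Python generator via tf.data.Dataset.from_generator, so the graph and checkpoint are loaded only once:

import tensorflow as tf

class FastPredict:
    def __init__(self, estimator, output_types):
        self.estimator = estimator
        self.output_types = output_types
        self.current = None
        self.predictions = None   # lazily created predict() generator

    def _generator(self):
        while True:
            yield self.current    # always hand over the latest example

    def _input_fn(self):
        dataset = tf.data.Dataset.from_generator(
            self._generator, output_types=self.output_types)
        return dataset.batch(1)   # the model expects a batch dimension

    def predict(self, features):
        self.current = features
        if self.predictions is None:
            # Graph building and checkpoint loading happen only once, here.
            self.predictions = self.estimator.predict(input_fn=self._input_fn)
        return next(self.predictions)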

@Hazoom
Author

Hazoom commented Sep 15, 2019

Hi @zihangdai @kimiyoung, since issue #39 was closed, can you please merge this into master?
Thanks.

@JxuHenry

@Hazoom Hi sir, how can I modify the vector dimensions?

@Hazoom
Author

Hazoom commented Oct 25, 2019

@Hazoom Hi sir, how can I modify the vector dimensions?

@JxuHenry I don't think that's possible; the dimension is set by the network's architecture.

@JxuHenry

@Hazoom Hi sir, how can I modify the vector dimensions?

@JxuHenry I don't think that's possible; the dimension is set by the network's architecture.

OK, thank you very much.

@frank-lin-liu

Hi Hazoom,
I followed your instructions and ran extract_features.py. Does this program need a GPU to run?

@Hazoom
Author

Hazoom commented Feb 16, 2020

Hi Hazoom,
I followed your instructions and ran extract_features.py. Does this program need a GPU to run?

No, it can also run on a CPU, just a little bit slower than on a GPU.

@frank-lin-liu

Thank you, Hazoom. I use TensorFlow v1.15. Is that the TensorFlow version you used?

@Hazoom
Author

Hazoom commented Feb 16, 2020

Thank you, Hazoom. I use TensorFlow v1.15. Is that the TensorFlow version you used?

I used TensorFlow v1.14, but it should work the same, I hope.

@frank-lin-liu

It seems that I don't get the expected results. I copied some messages below. Could you please take a look and let me know what the problem is?


2020-02-16 15:03:13.502591: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-16 15:03:13.523791: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2712000000 Hz
2020-02-16 15:03:13.524337: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c394737b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-16 15:03:13.524391: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-16 15:03:13.527100: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-16 15:03:13.527146: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-16 15:03:13.527179: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (pl-00193583): /proc/driver/nvidia/version does not exist
2020-02-16 15:03:16.775053: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 131072000 exceeds 10% of system memory.
INFO:tensorflow:Running local_init_op.
I0216 15:03:18.755628 140438614927168 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0216 15:03:18.981006 140438614927168 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Predicting submission for example_cnt: 0
I0216 15:03:26.952376 140438614927168 extract_features.py:427] Predicting submission for example_cnt: 0

@mqhe

mqhe commented Jun 19, 2020

Hi Hazoom,
I succeeded in using the scripts to get an output.json for the single sentence "Hello World". I observed that the embedding has 6 tokens, "he", "ll", "o", "world", and the last two tokens are <sep> and <cls>. Is this tokenization normal? If we use a pooling strategy to calculate the sentence embedding, do we need to remove the <sep> and <cls> embeddings?
