LargeEmbedding

PyTorch module for large-scale word embedding.

When you want to embed a very large vocabulary (for example, 10M+ words at 128 dimensions), the stock Embedding layer may not work well: it allocates one big tensor on a single device, and the optimizer needs additional space on top of that. LargeEmbedding simply partitions the lookup table (LUT) across multiple devices.
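As a rough illustration of the scale involved (a back-of-the-envelope estimate of mine, assuming fp32 weights and an Adam-style optimizer; these numbers are not from the repository):

# Illustrative memory estimate only; not part of the library.
n_words, dim_word, bytes_per_float = 10_000_000, 128, 4
weights_gib = n_words * dim_word * bytes_per_float / 1024**3     # ~4.8 GiB for the LUT alone
optimizer_gib = 2 * weights_gib                                   # Adam keeps two extra buffers per parameter
print('%.1f GiB weights, %.1f GiB with Adam state' % (weights_gib, weights_gib + optimizer_gib))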

Currently, we have two partitioning options:

  1. The LUT is evenly partitioned into pages of size [page_size] and distributed round-robin across [num_devices] GPUs: page[i] is placed on device i % num_devices.
  2. The LUT is evenly partitioned into pages of size [page_size], but each page can be assigned to a specific device or set of devices (see the examples below). page[i] is mapped to device_list[i], which can be an integer (a device id) or a tuple (several device ids). If a tuple is given for page[i], that page is wrapped in torch.nn.DataParallel, though its efficiency has not been tested yet. In my case, with 8 GPUs, partitioning a 10M-word LUT into 8 pages of 1.25M words each was faster (at least for training) than mirroring the full 10M-word LUT on all 8 GPUs and data-parallelizing it. The sketch after this list shows how a word index resolves to a page and a device under both options.
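A minimal sketch of the index-to-page/device mapping described above (the locate helper is illustrative only, not part of the library; page_size, num_devices, and device_list follow the description above):

# Illustrative only: how a word index maps to a page, an offset, and a device.
def locate(word_idx, page_size, num_devices=None, device_list=None):
    page = word_idx // page_size              # which page holds the word
    offset = word_idx % page_size             # row within that page's LUT
    if device_list is not None:               # option 2: explicit page-to-device mapping
        device = device_list[page]            # an int (one GPU) or a tuple (DataParallel group)
    else:                                     # option 1: round-robin placement
        device = page % num_devices
    return page, offset, device

print(locate(30100, page_size=10000, device_list=[(0, 1), (0, 3), 4, 5, 6]))  # -> (3, 100, 5)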

Note: I have not tried PyTorch's new sparse operations for this case yet.
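For reference, the untried alternative in the note above would be sparse embedding gradients, which still keep the whole LUT on one device (a minimal sketch, assuming the standard nn.Embedding / SparseAdam API):

# Untried alternative from the note above: sparse gradients, but still one big LUT on a single device.
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10_000_000, embedding_dim=128, sparse=True)
opt = torch.optim.SparseAdam(emb.parameters())   # an optimizer that accepts sparse gradients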

How to use

import torch
from torch.autograd import Variable
from largeEmbedding import LargeEmbedding

'''
A 50000 x 4 LUT will be partitioned into five 10000 x 4 LUTs.
Since device_list is specified here, num_devices is ignored.
The first page will be data-parallelized across devices 0 and 1.
The second page will be data-parallelized across devices 0 and 3.
The third, fourth, and fifth pages will be on devices 4, 5, and 6, respectively.
'''
embedding = LargeEmbedding(n_words=50000, dim_word=4, page_size=10000,
                           num_devices=-1, use_cuda=True,
                           device_list=[(0, 1), (0, 3), 4, 5, 6])

x = Variable(torch.LongTensor([[0, 1, 10000, 30100], [10000, 1, 30100, 0]]).cuda())
print('embedding:', embedding(x))

'''
Sanity check of the embedding above.
Since the page size is 10000, word[10000] is the first element of the second page
(= embedding.embeddings[1].weight[0]), so it is accessed with Variable(torch.LongTensor([0])).
Also, since that page (embedding.embeddings[1]) is mirrored on devices 0 and 3, the index variable
(torch.LongTensor([0])) must live on device 0: nn.DataParallel requires it to be on the first
device in its device list, i.e. 0, not 3.
word[30100] is at index 100 of the fourth page (= embedding.embeddings[3].weight[100]).
'''
print('test1:', embedding.embeddings[0](Variable(torch.LongTensor([0, 1]).cuda(embedding.page_devices[0]))))
print('test2:', embedding.embeddings[1](Variable(torch.LongTensor([0]).cuda(embedding.page_devices[1]))))
print('test3:', embedding.embeddings[3](Variable(torch.LongTensor([100]).cuda(embedding.page_devices[3]))))

'''
Another sanity check. Here the LUT is kept as a single page mirrored across 4 GPUs,
so direct access to elements of a specific page differs from the example above.
'''
embedding = LargeEmbedding(n_words=50000, dim_word=4, page_size=50000,
                           num_devices=-1, use_cuda=True,
                           device_list=[(0, 1, 2, 3)])
x = Variable(torch.LongTensor([[0, 1, 10000, 30100], [10000, 1, 30100, 0]]).cuda())
print('embedding:', embedding(x))

print('test1:', embedding.embeddings[0](Variable(torch.LongTensor([0, 1]).cuda(embedding.page_devices[0]))))
print('test2:', embedding.embeddings[0](Variable(torch.LongTensor([10000]).cuda(embedding.page_devices[0]))))
print('test3:', embedding.embeddings[0](Variable(torch.LongTensor([30100]).cuda(embedding.page_devices[0]))))
