
Question about random sampling and consecutive sampling in Section 6.3.3 #160

Open
jianli-Alex opened this issue Oct 31, 2020 · 0 comments
Comments


jianli-Alex commented Oct 31, 2020

Bug description

import random
import torch

# This function is saved in the d2lzh_pytorch package for later use
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because the label index Y is the corresponding input index X plus 1
    num_examples = (len(corpus_indices) - 1) // num_steps
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)

    # Return the sequence of length num_steps starting at pos
    def _data(pos):
        return corpus_indices[pos: pos + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(epoch_size):
        # Read batch_size random examples each time
        i = i * batch_size
        batch_indices = example_indices[i: i + batch_size]
        X = [_data(j * num_steps) for j in batch_indices]
        Y = [_data(j * num_steps + 1) for j in batch_indices]
        yield torch.tensor(X, dtype=torch.float32, device=device), torch.tensor(Y, dtype=torch.float32, device=device)

The above is the book's random-sampling implementation, but I think it has two problems. First, because of `for i in range(epoch_size)`, sampling effectively always starts from index 0. In the test below, X can only produce batches drawn from 0-23; that is, 24, 25, 26, 27, 28 never appear in any X batch.

# Test
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

# Output produced
X:  tensor([[18., 19., 20., 21., 22., 23.],
        [12., 13., 14., 15., 16., 17.]]) 
Y: tensor([[19., 20., 21., 22., 23., 24.],
        [13., 14., 15., 16., 17., 18.]]) 

X:  tensor([[ 0.,  1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10., 11.]]) 
Y: tensor([[ 1.,  2.,  3.,  4.,  5.,  6.],
        [ 7.,  8.,  9., 10., 11., 12.]]) 
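The claim about unreachable tail indices can be checked directly. Below is a pure-Python sketch of the book's sampler (torch-free and list-based, assuming the same indexing logic), run over many epochs so every shuffle order gets a chance:

```python
import random

def data_iter_random_plain(corpus_indices, batch_size, num_steps):
    # Pure-Python sketch of the book's sampler, for checking index coverage.
    num_examples = (len(corpus_indices) - 1) // num_steps
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)
    for i in range(epoch_size):
        batch = example_indices[i * batch_size: i * batch_size + batch_size]
        X = [corpus_indices[j * num_steps: j * num_steps + num_steps] for j in batch]
        Y = [corpus_indices[j * num_steps + 1: j * num_steps + 1 + num_steps] for j in batch]
        yield X, Y

my_seq = list(range(30))
seen = set()
for _ in range(100):  # many epochs, each with a fresh shuffle
    for X, Y in data_iter_random_plain(my_seq, batch_size=2, num_steps=6):
        for row in X:
            seen.update(row)

print(max(seen))  # 23: values 24..28 never appear in X
```

No matter how the examples are shuffled, X only ever covers `corpus_indices[0:24]` here, because every window starts at a multiple of `num_steps` counted from 0.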

Q1:
So when implementing random sampling, shouldn't we make sure that some epochs contain batches that include 24, 25, 26, 27, 28 (please correct me if I have misunderstood)? The same situation also arises in consecutive sampling.
Q2:
Moreover, the implementation above only ever yields batches of exactly batch_size=2; when leftover data remains but amounts to less than a full batch, no batch is generated for it. Yet with fully connected networks and CNNs, the last mini-batch we read is often smaller than batch_size. So here, suppose the test above leaves at least batch_size=1 worth of data unsampled (e.g. with my_seq = list(range(32)), 8 values are never drawn): should we continue and yield one more batch of size 1? I'd appreciate clarification.
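To make Q2 concrete, here is the arithmetic for my_seq = list(range(32)) (a sketch of the truncation in the book's code, not official code):

```python
corpus_len, num_steps, batch_size = 32, 6, 2
num_examples = (corpus_len - 1) // num_steps   # 5 candidate windows
epoch_size = num_examples // batch_size        # 2 full batches
used = epoch_size * batch_size                 # 4 windows actually yielded
print(num_examples - used)                     # 1 window silently dropped per epoch
```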

# The Q2 situation is as follows:
# Test
my_seq = list(range(32))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

# Output (this includes a batch with 0 < batch_size <= 2)
X:  tensor([[18., 19., 20., 21., 22., 23.],
        [ 0.,  1.,  2.,  3.,  4.,  5.]]) 
Y: tensor([[19., 20., 21., 22., 23., 24.],
        [ 1.,  2.,  3.,  4.,  5.,  6.]]) 

X:  tensor([[12., 13., 14., 15., 16., 17.],
        [24., 25., 26., 27., 28., 29.]]) 
Y: tensor([[13., 14., 15., 16., 17., 18.],
        [25., 26., 27., 28., 29., 30.]]) 

X:  tensor([[ 6.,  7.,  8.,  9., 10., 11.]]) 
Y: tensor([[ 7.,  8.,  9., 10., 11., 12.]]) 

Below is my own rewrite of random sampling, which guarantees the behavior described above:

import numpy as np
import torch

def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because the label index Y is the corresponding input index X plus 1
    num_examples = (len(corpus_indices) - 1) // num_steps
    # Random starting position for sampling
    sample_start = np.random.randint((len(corpus_indices) - 1) % num_steps + 1)
    example_indices = np.arange(sample_start, len(corpus_indices), num_steps)[:num_examples]
    np.random.shuffle(example_indices)

    # Move tensors to GPU if available
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Read batch_size random examples per iteration
    for idx in np.arange(0, len(example_indices), batch_size):
        batch_example = example_indices[idx:(idx + batch_size)]
        x = [corpus_indices[pos:(pos + num_steps)] for pos in batch_example]
        y = [corpus_indices[(pos + 1):(pos + 1 + num_steps)] for pos in batch_example]
        yield torch.tensor(x, dtype=torch.float32, device=device), torch.tensor(y, dtype=torch.float32, device=device)

Test output:

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

# Output:
X:  tensor([[14., 15., 16., 17., 18., 19.],
        [ 8.,  9., 10., 11., 12., 13.]], device='cuda:0') 
Y: tensor([[15., 16., 17., 18., 19., 20.],
        [ 9., 10., 11., 12., 13., 14.]], device='cuda:0') 

X:  tensor([[ 2.,  3.,  4.,  5.,  6.,  7.],
        [20., 21., 22., 23., 24., 25.]], device='cuda:0') 
Y: tensor([[ 3.,  4.,  5.,  6.,  7.,  8.],
        [21., 22., 23., 24., 25., 26.]], device='cuda:0')
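As a quick sanity check on the rewrite above, here is a pure-Python equivalent (using the random module instead of numpy/torch, an assumption made for portability) verifying that the tail of the sequence can now appear in X and that a final partial batch is yielded:

```python
import random

def data_iter_random_alt(corpus_indices, batch_size, num_steps):
    # Pure-Python sketch of the proposed sampler: random start offset,
    # and a trailing partial batch instead of truncation.
    num_examples = (len(corpus_indices) - 1) // num_steps
    sample_start = random.randrange((len(corpus_indices) - 1) % num_steps + 1)
    example_indices = list(range(sample_start, len(corpus_indices), num_steps))[:num_examples]
    random.shuffle(example_indices)
    for idx in range(0, len(example_indices), batch_size):
        batch = example_indices[idx: idx + batch_size]
        X = [corpus_indices[pos: pos + num_steps] for pos in batch]
        Y = [corpus_indices[pos + 1: pos + 1 + num_steps] for pos in batch]
        yield X, Y

my_seq = list(range(32))
seen = set()
sizes = []
for _ in range(200):  # many epochs, varying the random start offset
    for X, Y in data_iter_random_alt(my_seq, batch_size=2, num_steps=6):
        sizes.append(len(X))
        for row in X:
            seen.update(row)

print(1 in sizes)       # True: a trailing batch of size 1 is emitted (5 windows, batch_size 2)
print(max(seen) >= 24)  # True: the tail of the sequence can now appear in X
```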

Version information
pytorch: 1.6.0
torchvision: 0.7.0
torchtext:
...
