I've been using it with other students for a project for my master's degree. The problem is that, as is usual for students, we don't have very powerful machines, and we had to use laptops with 2 or 4 GB of VRAM.
I know these kinds of machines don't offer the best possible experience, and even simple runs will take a long time, but they should at least work, even if the quality has to be lowered a lot.
Most of the grids in the library require octrees, however, and the kaolin implementation for creating them is highly inefficient in terms of space.
Digging a bit deeper, I think the cause is in ops/conversions/mesh_to_spc/mesh_to_spc.cpp from kaolin,
which initializes the buffer tensors at the maximum size regardless of the size of the input and of the available memory, making our low-memory GPUs crash with an OOM error (it tries to allocate a few GB of memory).
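As a back-of-envelope sketch of why a maximum-size allocation overwhelms a 2–4 GB card (the proportionality to the full voxel grid is my assumption for illustration, not a statement about the exact buffers in mesh_to_spc.cpp):

```python
# Back-of-envelope sketch. Assumption: the conversion allocates buffers
# proportional to the dense voxel grid at the finest level, regardless of input.
def finest_level_voxels(level: int) -> int:
    """Number of voxels in a dense grid at the leaf level of the octree."""
    return (2 ** level) ** 3

level = 10
voxels = finest_level_voxels(level)
print(f"level {level}: {voxels:,} voxels -> "
      f"{voxels / 2**30:.0f} GiB per 1-byte-per-voxel buffer")
```

At level 10 a single byte-per-voxel buffer is already 1 GiB, so a handful of such buffers matches the "few GB" allocation we observed.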
To make it work on our laptops, we made two changes in wisp/accelstructs/octree_as.py:
We modified the make_dense function (so that the octree can be created at the beginning). It was calling points_to_octree for no reason, since we already know the size of a dense octree and that every byte is 255:
```python
@classmethod
def make_dense(cls, level) -> OctreeAS:
    """ Builds the acceleration structure and initializes full occupancy of all cells.

    Args:
        level (int): The depth of the octree.
    """
    # A dense octree has (2**i)**3 == 8**i nodes at level i,
    # and every byte is 255 (all eight children occupied).
    t = 0
    for i in range(level):
        t += (2 ** i) ** 3
    octree = torch.ones(t, dtype=torch.uint8, device='cuda') * 255
    # octree = wisp_spc_ops.create_dense_octree(level)
    return OctreeAS(octree)
```
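As a quick sanity check (a CPU-only sketch, swapping the CUDA tensor above for NumPy purely so it runs anywhere), the node count accumulated by the loop matches the geometric series 1 + 8 + ... + 8**(level-1) = (8**level - 1) / 7:

```python
import numpy as np

# CPU sketch of the same construction as make_dense above: one byte per node,
# summed over all levels, every byte 255 (all eight children occupied).
def dense_node_count(level: int) -> int:
    return sum((2 ** i) ** 3 for i in range(level))

level = 5
octree = np.full(dense_node_count(level), 255, dtype=np.uint8)
assert dense_node_count(level) == (8 ** level - 1) // 7  # closed-form geometric sum
print(octree.size)  # 4681 nodes for level 5
```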
We also modified from_quantized_points (in order to be able to prune the structure); here we simply had to rewrite the function in a way that didn't crash the GPU:
```python
import numpy as np

def quantized_to_octree(quantized_points, level):
    # Work on the CPU to avoid allocating large temporaries on the GPU.
    if quantized_points.device.type != 'cpu':
        quantized_points = quantized_points.detach().cpu()
    pts = np.array(torch.unique(quantized_points, dim=0), dtype=np.uint8)[:, None]
    # Unpack each coordinate into bits and flatten: this interleaves the
    # x/y/z bits, yielding the Morton index of every occupied voxel.
    bits = np.unpackbits(pts, 1)[..., [0, 1, 2]]
    bits = bits.reshape(len(bits), -1)
    m_idx = bits @ np.power(2, np.arange(24)[::-1])
    # Dense occupancy mask over the finest level, then pack level by level.
    oct = np.zeros((2 ** level) ** 3, dtype=bool)
    oct[m_idx] = 1
    octree = []
    for _ in range(level):
        octL = np.packbits(oct.reshape(-1, 8), -1, bitorder='little')
        oct = octL > 0
        octree.insert(0, octL[oct])  # keep only the occupied (non-empty) nodes
    return torch.tensor(np.concatenate(octree), dtype=torch.uint8, device='cuda')

...

@classmethod
def from_quantized_points(cls, quantized_points, level) -> OctreeAS:
    """ Builds the acceleration structure from quantized (integer) point coordinates.

    Args:
        quantized_points (torch.LongTensor): 3D coordinates of shape [num_coords, 3] in
            integer coordinate space [0, 2**level]
        level (int): The depth of the octree.
    """
    # octree = spc_ops.unbatched_points_to_octree(quantized_points, level, sorted=False)
    octree = quantized_to_octree(quantized_points, level)
    return OctreeAS(octree)
```
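The key trick in quantized_to_octree is that np.unpackbits followed by the flatten-and-dot computes a Morton (bit-interleaved) index for each point. A standalone sketch of just that step (the pure-Python reference below is mine, written to check the mechanism; it doesn't claim to match kaolin's exact bit-order convention):

```python
import numpy as np

def morton_reference(x: int, y: int, z: int) -> int:
    """Pure-Python bit interleave: bit b of x/y/z lands at position 3b+2 / 3b+1 / 3b."""
    code = 0
    for b in range(8):  # uint8 coordinates -> 8 bits each, 24-bit code
        code |= ((x >> b) & 1) << (3 * b + 2)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b)
    return code

# Same computation as in quantized_to_octree: unpack bits along a new axis,
# flatten (which interleaves the x/y/z bits), then dot with descending powers of 2.
pt = np.array([[3, 1, 2]], dtype=np.uint8)[:, None]          # one point (x=3, y=1, z=2)
bits = np.unpackbits(pt, 1)[..., [0, 1, 2]].reshape(1, -1)   # shape [1, 24]
m_idx = int(bits @ np.power(2, np.arange(24)[::-1]))

assert m_idx == morton_reference(3, 1, 2) == 46
print(m_idx)  # 46
```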
We had to use NumPy because PyTorch doesn't have an equivalent of packbits and unpackbits.
The implementation of make_dense should be pretty close to optimal (the loop could even be replaced by the closed-form geometric sum (8**level - 1) // 7, but the improvement is marginal).
The implementation of from_quantized_points is definitely not optimal; a custom CUDA kernel would do the job much better, but I don't have time at the moment to write one, and I probably don't have enough CUDA experience to optimize it properly.
I'm not sure whether the original implementation was written that way to achieve maximum speed at the expense of memory, but there should be an option to use a low-memory algorithm when there is not enough VRAM.
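A hypothetical sketch of what such a fallback could look like (everything here is illustrative: fast_fn and low_mem_fn stand in for kaolin's CUDA path and the NumPy path above, and the threshold is arbitrary; torch.cuda.mem_get_info is a real PyTorch call):

```python
import torch

def build_octree(quantized_points, level, fast_fn, low_mem_fn,
                 min_free_bytes=2 << 30):
    """Hypothetical dispatcher: use the fast GPU builder only when enough
    VRAM is free, otherwise fall back to a low-memory builder.
    fast_fn / low_mem_fn are placeholders, not real kaolin APIs."""
    if torch.cuda.is_available():
        free_bytes, _ = torch.cuda.mem_get_info()  # (free, total) in bytes
        if free_bytes >= min_free_bytes:
            return fast_fn(quantized_points, level)
    return low_mem_fn(quantized_points, level)

# With an unreachable threshold, the low-memory path is always chosen:
path = build_octree(None, 3,
                    fast_fn=lambda p, l: 'fast',
                    low_mem_fn=lambda p, l: 'low-mem',
                    min_free_bytes=float('inf'))
print(path)  # low-mem
```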
I don't know if you are in touch with the kaolin team; it would be nice to report the problem to them too, and a flag to turn on low-memory octree computation would be a nice addition.
Hope this helps somebody.
Also, I'm not really used to GitHub yet. While we were working on the project, we noticed several small things we would have changed to make the library more flexible or more efficient. Should I open a separate issue for each of them, or can I dump them all in a single issue? Or should I open pull requests?
Thanks again for your time
Hi @samuele-bortolato , thanks for the useful feedback!
Your issue is relevant to kaolin, but kaolin's maintainers overlap with kaolin-wisp's, so I've forwarded your message.
Thanks again for the amazing library!