generate dataset for torchani #622

Open
MichailDanikas opened this issue Jul 17, 2022 · 1 comment

Comments

@MichailDanikas

Hi,
I have a problem creating my own dataset to use later for training. I'm a beginner with h5py and I don't understand how the datasets should be formatted. I am trying to use the last part of #611, where my species look like this:
array([['O', 'C', 'O'], ['O', 'C', 'O'],... ['O', 'C', 'O']])
for one molecule. The coordinates are in the form:
[array([[[ 0. , 0. , 1.237479], [ 0. , 0. , -0.3 ], [ 0. , 0. , -1.237479]]]),...]
and the energies:
[array(226.56324331), array(208.34163576), array(191.23083335),...]
I've also tried other formats, which I saved using:
torchani.data._pyanitools.datapacker('./path_to_file', mode = 'w')
After loading them with torchani.data.load('./path_to_file'), they were transformed into dictionaries, like the examples in ani_gdb_s01.h5. However, the training step raises the following error:
[image: screenshot of the error]
If you have any suggestions, please let me know.
Thank you in advance.

@jvita

jvita commented Aug 30, 2023

Probably a bit late for the original poster, but here's what I do to convert from a list of ASE.Atoms objects. I'm not sure if it's 100% correct, but it seems to work fine.

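import h5py
import numpy as np

# `train` is a list of ASE.Atoms objects that carry an energy in
# atoms.info['energy'] and forces in atoms.arrays['forces']
with h5py.File('train.hdf5', 'w') as hdf5:
    for i, atoms in enumerate(train):
        natoms = len(atoms)

        # one group per structure, each holding a single conformation
        g = hdf5.create_group(str(i))

        g.create_dataset('energies', data=np.atleast_1d(atoms.info['energy']))
        g.create_dataset('cell', data=np.array(atoms.cell).reshape((1, 3, 3)))
        g.create_dataset('coordinates', data=atoms.positions.reshape((1, natoms, 3)))
        g.create_dataset('force', data=atoms.arrays['forces'].reshape((1, natoms, 3)))
        # element symbols stored as byte strings, e.g. b'C'
        symbols = [s.encode('ascii') for s in atoms.get_chemical_symbols()]
        g.create_dataset('species', data=symbols)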
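
For data shaped like the lists in the original question (one species row per conformer, a list of (1, natoms, 3) coordinate arrays, a list of 0-d energy arrays), a similar layout could be built without ASE. The sketch below is only that: `stack_conformers`, `write_molecule`, the group name `CO2` and the file name `co2.h5` are hypothetical names chosen for illustration, and the assumption that one group per molecule with `species` of length natoms, `coordinates` of shape (nconf, natoms, 3) and `energies` of shape (nconf,) matches what ani_gdb_s01.h5 stores has not been checked against `torchani.data.load`.

```python
import os
import tempfile

import h5py
import numpy as np


def stack_conformers(species_rows, coordinates, energies):
    """Collapse per-conformer lists into one array per property."""
    species_rows = np.asarray(species_rows)
    # All conformers of one molecule must list their atoms in the same order.
    if not (species_rows == species_rows[0]).all():
        raise ValueError("conformers list atoms in different orders")
    species = [s.encode("ascii") for s in species_rows[0]]
    natoms = len(species)
    # Each entry is reshaped to (k, natoms, 3), then joined along the first axis.
    coords = np.concatenate([np.reshape(c, (-1, natoms, 3)) for c in coordinates])
    ener = np.array([float(e) for e in energies])
    if not (len(coords) == len(ener) == len(species_rows)):
        raise ValueError("species, coordinates and energies disagree on nconf")
    return species, coords, ener


def write_molecule(path, name, species, coords, ener):
    """Append one molecule group to an HDF5 file (assumed layout, see above)."""
    with h5py.File(path, "a") as f:
        g = f.create_group(name)
        g.create_dataset("species", data=species)
        g.create_dataset("coordinates", data=coords)
        g.create_dataset("energies", data=ener)


if __name__ == "__main__":
    # Only the first conformer quoted in the question is used here.
    rows = [["O", "C", "O"]]
    xyz = [np.array([[[0.0, 0.0, 1.237479],
                      [0.0, 0.0, -0.3],
                      [0.0, 0.0, -1.237479]]])]
    en = [np.array(226.56324331)]
    species, coords, ener = stack_conformers(rows, xyz, en)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "co2.h5")  # hypothetical file name
        write_molecule(path, "CO2", species, coords, ener)
        with h5py.File(path, "r") as f:
            print({k: v.shape for k, v in f["CO2"].items()})
```

Printing the dataset shapes after writing is a cheap way to compare the file against a known-good one such as ani_gdb_s01.h5 before starting a training run.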
