Problems in using FC dataset #61

Yangqy-16 opened this issue Mar 30, 2024 · 2 comments

Hello! Thank you for your great work on TorchDrug, GearNet, and ESM-GearNet!
Sorry to bother you. I'm trying to extract feature embeddings with GearNet (as discussed in several earlier issues) on the EC, GO, and FC datasets (as provided at https://zenodo.org/records/7593591). Unlike EC and GO, where the proteins come in PDB format, the proteins in FC are stored in HDF5 format, so I use your Fold3D class in GearNet (https://github.com/DeepGraphLearning/GearNet/blob/main/gearnet/dataset.py) to preprocess the data.
However, when I pass the resulting Protein objects into the GearNet network following the TorchDrug instructions, I get the following errors when running on GPU:

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

and then

RuntimeError: Error building extension 'torch_ext':
...

...           ...site-packages/torchdrug/utils/extension/torch_ext.cpp:1:
/usr/include/features.h:424:12: fatal error: sys/cdefs.h: No such file or directory
  424 | #  include <sys/cdefs.h>
      |            ^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.

When running on CPU, I met:

NotImplementedError: Could not run 'aten::view' with arguments from the 'SparseCPU' backend
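
(For context, and I may be wrong about this, that message looks like the generic PyTorch dispatcher error for an operator that has no sparse implementation. A minimal reproduction outside TorchDrug, with a made-up tensor, would be something like:)

import torch

# Illustration only; this is not the exact call TorchDrug makes internally.
# view() has no SparseCPU kernel, so the dispatcher is expected to raise the
# same "Could not run 'aten::view' ... 'SparseCPU' backend" error.
s = torch.sparse_coo_tensor(indices=torch.tensor([[0, 1], [1, 0]]),
                            values=torch.tensor([1.0, 2.0]),
                            size=(2, 2))
s.view(4)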

I searched for the cause of these errors online but couldn't solve them, since they seem to be environment-related. What puzzles me is that I don't hit any of these problems when I use Protein.from_pdb() directly on EC and GO, but I do hit them on FC, where I use your Fold3D class, which also produces data.Protein instances.
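
For what it's worth, here is a minimal check I can run that exercises the same C++ extension machinery outside TorchDrug (just a sketch with a throwaway extension name; as far as I understand, torch.utils.cpp_extension is the same mechanism TorchDrug uses to JIT-compile torch_ext). If this also stops at sys/cdefs.h, the problem is the compiler/glibc headers in my environment rather than the FC dataset:

import torch
from torch.utils.cpp_extension import load_inline

# Throwaway extension that does nothing useful; it only exists to drive
# ninja and the C++ compiler the same way a TorchDrug extension build would.
cpp_source = """
#include <torch/extension.h>
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

ext = load_inline(
    name="toolchain_check",   # arbitrary module name for this test
    cpp_sources=cpp_source,
    functions=["add_one"],    # auto-generates the pybind11 binding
    verbose=True,             # print the full compiler command lines
)
print(ext.add_one(torch.zeros(3)))  # tensor([1., 1., 1.]) if the toolchain is fine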

For reference, my code is as follows:

...
# graph
graph_construction_model = layers.GraphConstruction(node_layers=[geometry.AlphaCarbonNode()], 
                                                    edge_layers=[geometry.SpatialEdge(radius=10.0, min_distance=5),
                                                                 geometry.KNNEdge(k=10, min_distance=5),
                                                                 geometry.SequentialEdge(max_distance=2)],
                                                    edge_feature="gearnet")

# model
gearnet_edge = models.GearNet(input_dim=21, hidden_dims=[512, 512, 512, 512, 512, 512],
                              num_relation=7, edge_input_dim=59, num_angle_bin=8,
                              batch_norm=True, concat_hidden=True, short_cut=True, readout="sum")
pthfile = 'models/mc_gearnet_edge.pth'
net = torch.load(pthfile, map_location=torch.device(device))
#print('torch succesfully load model')
gearnet_edge.load_state_dict(net)
gearnet_edge.eval()
print('successfully load gearnet')


def get_subdataset_rep(pdbs: list, proteins: list, subroot: str):
    for idx in range(0, len(pdbs), bs):  # reformulate to batches
        pdb_batch = pdbs[idx : idx + bs]          # Python slices clamp at the end, so no min() is needed
        protein_batch = proteins[idx : idx + bs]
        # protein
        _protein = data.Protein.pack(protein_batch)
        _protein.view = "residue"
        print(_protein)
        final_protein = graph_construction_model(_protein)
        print(final_protein)

        # output
        with torch.no_grad():
            output = gearnet_edge(final_protein, final_protein.node_feature.float(), all_loss=None, metric=None)
        print(output['graph_feature'].shape, output['node_feature'].shape)

        counter = 0
        for i in range(len(final_protein.num_residues)):  # i: protein/graph id in this batch (don't shadow the outer idx)
            this_graph_feature = output['graph_feature'][i]
            this_node_feature = output['node_feature'][counter : counter + final_protein.num_residues[i], :]
            print(this_graph_feature.shape, this_node_feature.shape)
            torch.save((this_graph_feature, this_node_feature), f"{subroot}/{os.path.splitext(pdb_batch[i])[0].split('/')[-1]}.pt")
            counter += final_protein.num_residues[i]
            
        break


# get representations
if args.task not in ['FC', 'fc']:
    for root in roots:
        pdbs = [os.path.join(root, i) for i in os.listdir(root)]

        proteins = []
        for pdb_file in pdbs:
            try:
                protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
                protein.view = "residue"
                proteins.append(protein)
            except Exception:  # record files that fail to parse
                error_fn = os.path.basename(root) + '_' if args.task in ['EC', 'ec', 'GO', 'go'] else ''
                with open(f"{error_path}/{args.task}_{error_fn}error.txt", "a") as f:
                    f.write(os.path.splitext(pdb_file)[0].split('/')[-1] + '\n')
                # the with-block already closes the file, so no explicit f.close() is needed
            
            if len(proteins) == bs:  # for debug
                break
        
        subroot = os.path.join(output_dir, root.split('/')[-1]) if args.task in ['EC', 'ec', 'GO', 'go'] else output_dir
        get_subdataset_rep(pdbs, proteins, subroot)

        break
else:
    transform = transforms.Compose([transforms.ProteinView(view='residue')])
    dataset = Fold3D(root, transform=transform)  #, atom_feature=None, bond_feature=None
    
    split_sets = dataset.split()  # train_set, valid_set, test_fold_set
    print('There are', len(split_sets), 'sets in total.')

    for split_set in split_sets:
        print(split_set.indices)
        this_slice = slice(list(split_set.indices)[0], (list(split_set.indices)[-1] + 1))
        this_pdbs, this_datas = dataset.pdb_files[this_slice], dataset.data[this_slice]
        #for fn, protein in zip(this_pdbs, this_datas):
        #    print(fn, protein)
        #    break
        get_subdataset_rep(this_pdbs, this_datas, os.path.join(output_dir, this_pdbs[0].split('/')[0]))

Is there any way to solve this problem, or is my understanding of TorchDrug wrong? Sincerely looking forward to your help. Thank you very much!

Oxer11 (Collaborator) commented Apr 2, 2024

Hi, I don't think this is a dataset-specific problem. It seems that you failed to build the torch_ext extension in TorchDrug. Could you check this?

Yangqy-16 (Author) commented

Hi! Thank you for your reply! I checked my torch extension setup based on DeepGraphLearning/torchdrug#8 and DeepGraphLearning/torchdrug#238. I'm sure that torch_ext.cpp sits in the right place under torchdrug/utils/extension, and I also tried deleting the torch_extensions folder under /home/your_user_name/.cache, but it doesn't help.
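
In case it helps, this is roughly the environment check I plan to try next (just a sketch, nothing TorchDrug-specific): sys/cdefs.h comes from the glibc development headers, so I want to see which compiler the extension build actually picks up and which include directories it searches.

import os
import shutil
import subprocess

# Which compiler will torch.utils.cpp_extension / ninja invoke?
cxx = os.environ.get("CXX") or shutil.which("c++") or shutil.which("g++")
print("compiler:", cxx)

# Ask that compiler for its header search paths (printed to stderr);
# sys/cdefs.h should live under one of the listed include directories.
if cxx is not None:
    subprocess.run([cxx, "-E", "-x", "c++", "-", "-v"], input=b"", check=False)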
