Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement of gap-filling in refineGEMs #52

Open
15 of 20 tasks
GwennyGit opened this issue Jan 19, 2023 · 27 comments · Fixed by #61, #66 or #73
Open
15 of 20 tasks

Improvement of gap-filling in refineGEMs #52

GwennyGit opened this issue Jan 19, 2023 · 27 comments · Fixed by #61, #66 or #73
Assignees
Labels
enhancement New feature or request

Comments

@GwennyGit
Copy link
Collaborator

GwennyGit commented Jan 19, 2023

In this issue all current gap-filling tools implemented in refineGEMs are summarised and possible enhancements explored.

Current gap-filling modules:

  • genecomp (now: kegg_analysis):
    ⇾ Extracts KEGG gene identifiers from model
    ⇾ Compares KEGG genes in model with the strain-specific ones in KEGG
    ⇾ Extracts RefSeq IDs (GPR) from the .gff file
    ⇾ Maps BiGG to KEGG IDs
    ⇒ Returns a table containing missing reactions with locus tag, EC number, KEGG ID, BiGG ID and RefSeq ID (GPR)
  • curate:
    1. With option gapfill
      ⇾ Adds reactions with the corresponding IDs, stoichiometric coefficients, educts, products, upper & lower bound to the model from a manually obtained table
    2. With option metabs
      ⇾ Adds metabolites with the corresponding IDs, formulae, and name to the model from a manually obtained table
      ⇾ Synchronises the metabolite information over all compartments

Creation of gapfill module for BioCyc (& Adjustment of genecomp to gapfill):

  • Add modules gapfill, entities & analysis_biocyc
  • Rename genecomp module to analysis_kegg
  • Adjust code in analysis_kegg
  • Adjust code in curate
  • Extract all functions that could be used for analysis_kegg & analysis_biocyc into analysis_db
  • Add the following functions to gapfill:
    • Function I: Obtains missing genes, metabolites & reactions (→ Function gapfill_analysis)
      • By comparing genes in model with genes from BioCyc (→ Module analysis_biocyc)
        • Adds charges and chemical formulas from the CHEBI API to the metabolites
        • Retrieves stoichiometric values from the BiGG API
      • By comparing genes in the model with KEGG genes (→ Module analysis_kegg, only obtains reactions & genes/proteins)
      • By comparing genes in the model with KEGG, then with BioCyc and combining the results (Currently, no complete merge!)
      • Filter missing reactions by metabolite compartments & missing metabolites by compartment
    • Function II: Adds the missing genes, metabolites & reactions (→ Module entities, See script: https://github.com/draeger-lab/py4gems/blob/main/Reihaneh/1.%20BioCyc_Comparison.ipynb)

Further improvements:

  • Retrieve missing metabolites via 'bigg_reaction' column from reactions table instead of from the 'Reactants' and 'Products' columns
  • Add functionality to apply BioCyc comparison to models where organism does not occurr in any database:
    • From the GFF & FASTA files with DIAMOND & BioCyc SmartTables (→ lab_strain) obtain missing genes
  • Adjust kegg_analysis to also return tables with missing genes & metabolites
    → Then the result from kegg_analysis can also be added to a model.
  • Do a complete merge on all BioCyc & KEGG tables if db_to_compare: KEGG+BioCyc
  • Add a check similar to verifyGapfilledReactions to gapfill_model
@GwennyGit GwennyGit added the enhancement New feature or request label Jan 19, 2023
@famosab
Copy link
Member

famosab commented Jan 31, 2023

Regarding Function I: we already have a parser for gff files integrated (it is in the function get_locus_gpr from genecomp). Maybe we can expand from there - the only obstacle would be to make sure we have similar IDs that can be compared to each other. At the moment I have the problem that the GPRs in my models cannot be found in my gff file because the naming is completely different.

@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Feb 1, 2023

Regarding function I: For strains that are not in KEGG but in BioCyc I think it will be better to use the BioCyc SmartTables as reference. However, for lab strains this function could be still useful. 🤔 For the BioCyc option I will add a comment to Function I. Maybe for that the script from Reihaneh (@Biomathsys) could be used (or maybe adjusted), see Code here: https://github.com/draeger-lab/py4gems/blob/main/Reihaneh/1.%20BioCyc_Comparison.ipynb.

@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Feb 2, 2023

The current module curate does only add genes and reactions to the model. However, for gapfill the metabolites should be added too. In Reihaneh's script COBRA is used and the implementation to add reactions and metabolites is already properly implemented. Thus, I will use her approach for that. In addition, I will use the function to add the missing genes from curate and extend the tables from KEGG/BioCyc/GFF file with the BiGG identifiers for each reaction and metabolite, respectively.


As the function to add reactions will not be used from curate this module will be kept as such. However, the following three new modules will be generated in addition to the gapfill module: entities, analysis_kegg and analysis_biocyc.

GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
Removed a function that was generalised to work with KEGG and BioCyc and is now in gapfill.
@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Feb 2, 2023

For the comparison between already existing metabolites & reactions I realised that if I add the BiGG identifiers to the table the checks from Reihaneh's script are not necessary. Thus, I will extend the functionality of entities.


In analysis_kegg the function get_locus_gpr (line 167) can be adjusted to get the protein IDs from the Genbank GFF/FASTA (CarveMe input) file. The currently retrieved GPRs are basically the RefSeq identifiers from the RefSeq GFF file. Additionally, the function should be moved to gapfill as it can be used for analysis_kegg and analysis_biocyc.
Note to myself: Are there maybe more functions that can be used for both modules? Potentially the function retrieving the BiGG IDs?

@GwennyGit
Copy link
Collaborator Author

Extracting the functions required for analysis_kegg and analysis_biocyc from analysis_kegg into gapfill created a cycle as these functions were called in each of the other two modules and these modules in return were called within the gapfill module. To still reduce redundancy another module analysis_db is created that now contains the overlapping functions for analysis_kegg and analysis_biocyc.

@famosab
Copy link
Member

famosab commented Feb 3, 2023

Maybe the following publication / code is of interest NICEgame. They mention in their manuscript that they also worked with Python, however in the gh repo I only found Matlab scripts.

@GwennyGit
Copy link
Collaborator Author

From the paper I understand that the authors use media for which it is known that the bacterium should grow on to fill gaps in the model. This approach would be similar to the one from the CarveMe documentation or also the gap filling approach from COBRApy. This would be a nice addition to the gap filling via the genes I think. I already considered adding the call for the CarveMe gap filling after using the gap filling from the genes. However, as far as I understood these programs the user needs to know exactly on which media the bacterium would grow. Thus, I find it rather difficult to use any of the tools as we have strain-specific models. For which I suppose that not every strain of e.g. Staphylococcus haemolyticus grows on the same media, especially, if microbiome media are used like SNM3. 🤔

@famosab
Copy link
Member

famosab commented Feb 7, 2023

We can use requests to access the BiGG database.

Here is an example how to use it with BiGG:

import requests
import refinegems as rg

reac_url = 'http://bigg.ucsd.edu/api/v2/universal/reactions/'
metab_url = 'http://bigg.ucsd.edu/api/v2/universal/metabolites/'

mod = rg.load.load_model_cobra('../../models/Cstr_14.xml')

# requests.get(metab_url+'o2').json()['charges']

for metab in mod.metabolites:
    id = metab.id[:-2]
    print(id, requests.get(metab_url+id).json()['name'])

For metabolites these field can be accessed ['database_links', 'bigg_id', 'formulae', 'old_identifiers', 'charges', 'name', 'compartments_in_models']. Metabolites need to be entered without the compartment information and the beginning M so instead of M_o2_c use o2.

For reactions ['models_containing_reaction', 'reaction_string', 'metabolites', 'database_links', 'bigg_id', 'old_identifiers', 'name', 'pseudoreaction'].

@GwennyGit
Copy link
Collaborator Author

To have all parsing functions combined the module parse was created. However, not all functions that would potentially fit into this module have been added yet.


The function add_charges_chemical_formulae_to_metabs in the module analysis_biocyc currently causes a KeyError which should be solved in the next commit.

@GwennyGit GwennyGit linked a pull request Mar 5, 2023 that will close this issue
@GwennyGit
Copy link
Collaborator Author

Further possible improvements were added to the task list for now.

GwennyGit added a commit that referenced this issue Mar 6, 2023
GwennyGit added a commit that referenced this issue Mar 7, 2023
@famosab famosab changed the title Summary & Enhancement of gap-filling in refineGEMs Omprovement of gap-filling in refineGEMs Mar 7, 2023
@famosab famosab changed the title Omprovement of gap-filling in refineGEMs Improvement of gap-filling in refineGEMs Mar 7, 2023
GwennyGit added a commit that referenced this issue Mar 8, 2023
The filenames were not generated due to including the variable name instead of model_libsbml.getId().
GwennyGit added a commit that referenced this issue Mar 9, 2023
Added `update_annotations_from_others` for the metabolites to enhance the annotations for the newly added metabolites.
@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Mar 13, 2023

After filling the gaps in two of my models (ATCC29970 & JCSC1435) and analysing the result I detected that a lot of orphan, dead-end and disconnected metabolites were added to both models. Neither of my models had any orphan, dead-end or disconnected metabolites before.
To improve the algorithm I decided to get the set of missing metabolites from the 'bigg_reaction' column in the reactions table instead of from the 'Reactants' and 'Products' columns originating from the BioCyc reaction SmartTable. These columns are after the next commit only used to obtain the amount of missing BioCyc metabolites for the statistics table. This modification renders the get_missing_metabolites_wo_BiGG function obsolete along with the resulting table. Thus, this function and all occurrences of the result are removed from the code with the next commit.

GwennyGit added a commit that referenced this issue Mar 14, 2023
GwennyGit added a commit that referenced this issue Mar 15, 2023
Compartments were missing from BiGG IDs which have no BioCyc IDs. This issue is resolved with this commit now.
GwennyGit added a commit that referenced this issue Mar 16, 2023
Updated gapfill for BioCyc #52
@GwennyGit GwennyGit self-assigned this Jun 29, 2023
GwennyGit added a commit that referenced this issue Aug 21, 2023
…#52

The additional handling of SEED identifiers was added as the function get_bigg2other_db could be used in user-written scripts if a mapping from BiGG to for example the SEED namespace is required.
@GwennyGit
Copy link
Collaborator Author

The additional handling of SEED identifiers was added as the function get_bigg2other_db could be used in user-written scripts if a mapping from BiGG to another database is required. However, currently, the mapping would need to be to BioCyc, KEGG or SEED. in the future, more databases could be added.

@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Aug 25, 2023

A check in the gapfill_model function to verify that the added reactions are necessary could be very useful. In the DEMETER part of the CORBA Toolbox a function verifyGapfilledReactions exists. This function could be used as a template. 🤔

@f3rryman
Copy link
Collaborator

f3rryman commented Nov 8, 2023

Enhancing the run-time of analysis_kegg.py and adding a progress bar could be useful for the user.
The bottleneck is probably get_locus_ec() and get_locus_ec_kegg().

Maybe the same solution like analysis_db.py in the multi_get_reaction_compartment() function is feasible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants