Improvement of gap-filling in refineGEMs #52

GwennyGit · 2023-01-19T09:40:52Z

In this issue all current gap-filling tools implemented in refineGEMs are summarised and possible enhancements explored.

Current gap-filling modules:

genecomp (now: kegg_analysis):
⇾ Extracts KEGG gene identifiers from model
⇾ Compares KEGG genes in model with the strain-specific ones in KEGG
⇾ Extracts RefSeq IDs (GPR) from the .gff file
⇾ Maps BiGG to KEGG IDs
⇒ Returns a table containing missing reactions with locus tag, EC number, KEGG ID, BiGG ID and RefSeq ID (GPR)
curate:
1. With option gapfill
  ⇾ Adds reactions with the corresponding IDs, stoichiometric coefficients, educts, products, upper & lower bound to the model from a manually obtained table
2. With option metabs
  ⇾ Adds metabolites with the corresponding IDs, formulae, and name to the model from a manually obtained table
  ⇾ Synchronises the metabolite information over all compartments

Creation of gapfill module for BioCyc (& Adjustment of genecomp to gapfill):

Further improvements:

Retrieve missing metabolites via 'bigg_reaction' column from reactions table instead of from the 'Reactants' and 'Products' columns
Add functionality to apply the BioCyc comparison to models whose organism does not occur in any database:
- From the GFF & FASTA files with DIAMOND & BioCyc SmartTables (→ lab_strain) obtain missing genes
Adjust kegg_analysis to also return tables with missing genes & metabolites
→ Then the result from kegg_analysis can also be added to a model.
Do a complete merge on all BioCyc & KEGG tables if db_to_compare: KEGG+BioCyc
Add a check similar to verifyGapfilledReactions to gapfill_model

famosab · 2023-01-31T15:21:59Z

Regarding Function I: we already have a parser for gff files integrated (it is in the function get_locus_gpr from genecomp). Maybe we can expand from there - the only obstacle would be to make sure we have similar IDs that can be compared to each other. At the moment I have the problem that the GPRs in my models cannot be found in my gff file because the naming is completely different.

GwennyGit · 2023-02-01T14:49:09Z

Regarding function I: For strains that are not in KEGG but in BioCyc, I think it will be better to use the BioCyc SmartTables as reference. However, for lab strains this function could still be useful. 🤔 For the BioCyc option I will add a comment to Function I. Maybe the script from Reihaneh (@Biomathsys) could be used (or adjusted) for that, see code here: https://github.com/draeger-lab/py4gems/blob/main/Reihaneh/1.%20BioCyc_Comparison.ipynb.

GwennyGit · 2023-02-02T09:45:52Z

The current module curate only adds genes and reactions to the model. However, for gapfill the metabolites should be added too. Reihaneh's script uses COBRA and already implements adding reactions and metabolites properly. Thus, I will use her approach for that. In addition, I will use the function from curate to add the missing genes, and extend the tables from KEGG/BioCyc/GFF file with the BiGG identifiers for each reaction and metabolite, respectively.

As the function to add reactions from curate will not be used, this module will be kept as is. However, the following three new modules will be created in addition to the gapfill module: entities, analysis_kegg and analysis_biocyc.

Removed a function that was generalised to work with KEGG and BioCyc and is now in gapfill.

GwennyGit · 2023-02-02T16:38:01Z

For the comparison between already existing metabolites & reactions I realised that if I add the BiGG identifiers to the table the checks from Reihaneh's script are not necessary. Thus, I will extend the functionality of entities.

In analysis_kegg the function get_locus_gpr (line 167) can be adjusted to get the protein IDs from the GenBank GFF/FASTA (CarveMe input) file. The currently retrieved GPRs are basically the RefSeq identifiers from the RefSeq GFF file. Additionally, the function should be moved to gapfill as it can be used for both analysis_kegg and analysis_biocyc.
Note to myself: Are there maybe more functions that can be used for both modules? Potentially the function retrieving the BiGG IDs?
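
A minimal sketch of how locus tags could be mapped to protein IDs from a GFF3 file. It assumes that the CDS rows carry `locus_tag` and `protein_id` attributes in the ninth column, as NCBI GFF files usually do. The function name `map_locus_to_protein_id` and the sample rows are hypothetical; this is not the existing get_locus_gpr.

```python
from urllib.parse import unquote


def map_locus_to_protein_id(gff_text):
    """Map locus_tag -> protein_id for all CDS features in GFF3 text."""
    mapping = {}
    for line in gff_text.splitlines():
        # Skip empty lines, directives and comments
        if not line or line.startswith('#'):
            continue
        columns = line.split('\t')
        if len(columns) != 9 or columns[2] != 'CDS':
            continue
        # Column 9 holds 'key=value' pairs separated by ';'
        attributes = {}
        for pair in columns[8].split(';'):
            if '=' in pair:
                key, value = pair.split('=', 1)
                attributes[key.strip()] = unquote(value.strip())
        if 'locus_tag' in attributes and 'protein_id' in attributes:
            mapping[attributes['locus_tag']] = attributes['protein_id']
    return mapping


# Hand-written sample rows (hypothetical identifiers)
example_gff = (
    '##gff-version 3\n'
    'chr1\tGenbank\tgene\t1\t900\t.\t+\t.\tID=gene-X_0001;locus_tag=X_0001\n'
    'chr1\tGenbank\tCDS\t1\t900\t.\t+\t0\t'
    'ID=cds-ABC12345.1;locus_tag=X_0001;protein_id=ABC12345.1\n'
)
print(map_locus_to_protein_id(example_gff))
```

Keying the table by locus tag keeps it comparable across GenBank and RefSeq files, provided both annotate the same locus tags, which would need to be checked per genome.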

GwennyGit · 2023-02-03T08:01:39Z

Extracting the functions required by both analysis_kegg and analysis_biocyc from analysis_kegg into gapfill created a circular dependency: these functions were called in each of the two analysis modules, and these modules were in turn called within the gapfill module. To still reduce redundancy, another module, analysis_db, was created that now contains the functions shared by analysis_kegg and analysis_biocyc.
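
The dependency structure described in this comment can be sketched as a small graph. The edges below are my reading of the comment, not taken from the code base, and `has_cycle` is a hypothetical helper.

```python
def has_cycle(dependencies):
    """Return True if the module dependency graph contains a cycle (DFS)."""
    visiting, done = set(), set()

    def visit(module):
        if module in done:
            return False
        if module in visiting:
            return True  # module reached again while still on the DFS stack
        visiting.add(module)
        found = any(visit(dep) for dep in dependencies.get(module, ()))
        visiting.discard(module)
        done.add(module)
        return found

    return any(visit(module) for module in dependencies)


# Shared helpers placed in gapfill: the analysis modules import gapfill,
# while gapfill imports the analysis modules.
before = {
    'gapfill': ['analysis_kegg', 'analysis_biocyc'],
    'analysis_kegg': ['gapfill'],
    'analysis_biocyc': ['gapfill'],
}

# Shared helpers placed in analysis_db instead.
after = {
    'gapfill': ['analysis_kegg', 'analysis_biocyc'],
    'analysis_kegg': ['analysis_db'],
    'analysis_biocyc': ['analysis_db'],
    'analysis_db': [],
}

print(has_cycle(before), has_cycle(after))  # True False
```

Moving the shared functions into a module that imports neither of the others is what turns the graph into a tree-like structure.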

famosab · 2023-02-03T13:18:10Z

Maybe the following publication / code is of interest: NICEgame. They mention in their manuscript that they also worked with Python; however, in the GitHub repo I only found MATLAB scripts.

GwennyGit · 2023-02-03T14:56:12Z

From the paper I understand that the authors fill gaps in the model using media on which the bacterium is known to grow. This approach would be similar to the one from the CarveMe documentation or the gap-filling approach from COBRApy. I think this would be a nice addition to the gap filling via the genes. I already considered calling the CarveMe gap filling after the gene-based gap filling. However, as far as I understand these programs, the user needs to know exactly on which media the bacterium grows. Thus, I find it rather difficult to use any of these tools, as we have strain-specific models, and I suppose that not every strain of e.g. Staphylococcus haemolyticus grows on the same media, especially if microbiome media like SNM3 are used. 🤔

famosab · 2023-02-07T11:10:50Z

We can use requests to access the BiGG database.

Here is an example of how to use it with BiGG:

```python
import requests
import refinegems as rg

# Base URLs of the BiGG API for universal reactions and metabolites
reac_url = 'http://bigg.ucsd.edu/api/v2/universal/reactions/'
metab_url = 'http://bigg.ucsd.edu/api/v2/universal/metabolites/'

mod = rg.load.load_model_cobra('../../models/Cstr_14.xml')

# A single field can be read like this:
# requests.get(metab_url+'o2').json()['charges']

for metab in mod.metabolites:
    # Strip the two-character compartment suffix, e.g. 'o2_c' -> 'o2'
    bigg_id = metab.id[:-2]
    print(bigg_id, requests.get(metab_url+bigg_id).json()['name'])
```

For metabolites these fields can be accessed: ['database_links', 'bigg_id', 'formulae', 'old_identifiers', 'charges', 'name', 'compartments_in_models']. Metabolites need to be entered without the compartment information and without the leading M_, so instead of M_o2_c use o2.

For reactions these fields can be accessed: ['models_containing_reaction', 'reaction_string', 'metabolites', 'database_links', 'bigg_id', 'old_identifiers', 'name', 'pseudoreaction'].
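
As a rough sketch, the 'metabolites' field of a reaction could be used to find the metabolites a model would still need for that reaction. The code assumes that each entry of 'metabolites' carries 'bigg_id' and 'compartment_bigg_id' keys; the sample dictionary is hand-written in that assumed shape and not a recorded API response, and `missing_metabolites` is a hypothetical helper.

```python
def missing_metabolites(reaction_json, model_metabolite_ids):
    """Return compartmentalised metabolite IDs of a reaction absent from the model."""
    required = {
        f"{m['bigg_id']}_{m['compartment_bigg_id']}"
        for m in reaction_json['metabolites']
    }
    return sorted(required - set(model_metabolite_ids))


# Hand-written sample in the assumed response shape; in practice this
# would come from requests.get(reac_url + 'ADA').json()
sample_reaction = {
    'bigg_id': 'ADA',
    'metabolites': [
        {'bigg_id': 'adn', 'compartment_bigg_id': 'c', 'stoichiometry': -1.0},
        {'bigg_id': 'h2o', 'compartment_bigg_id': 'c', 'stoichiometry': -1.0},
        {'bigg_id': 'h', 'compartment_bigg_id': 'c', 'stoichiometry': -1.0},
        {'bigg_id': 'ins', 'compartment_bigg_id': 'c', 'stoichiometry': 1.0},
        {'bigg_id': 'nh4', 'compartment_bigg_id': 'c', 'stoichiometry': 1.0},
    ],
}

print(missing_metabolites(sample_reaction, ['adn_c', 'h2o_c', 'h_c']))
# ['ins_c', 'nh4_c']
```

Keeping the compartment in the ID avoids treating a metabolite as present when it only exists in another compartment of the model.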

GwennyGit · 2023-02-08T20:56:28Z

To have all parsing functions combined the module parse was created. However, not all functions that would potentially fit into this module have been added yet.

The function add_charges_chemical_formulae_to_metabs in the module analysis_biocyc currently causes a KeyError which should be solved in the next commit.

GwennyGit · 2023-03-05T18:19:24Z

Further possible improvements were added to the task list for now.

The filenames were not generated due to including the variable name instead of model_libsbml.getId().

Added `update_annotations_from_others` for the metabolites to enhance the annotations for the newly added metabolites.

GwennyGit · 2023-03-13T17:44:06Z

After filling the gaps in two of my models (ATCC29970 & JCSC1435) and analysing the result I detected that a lot of orphan, dead-end and disconnected metabolites were added to both models. Neither of my models had any orphan, dead-end or disconnected metabolites before.
To improve the algorithm I decided to get the set of missing metabolites from the 'bigg_reaction' column in the reactions table instead of from the 'Reactants' and 'Products' columns originating from the BioCyc reaction SmartTable. After the next commit, these columns are only used to obtain the number of missing BioCyc metabolites for the statistics table. This modification renders the get_missing_metabolites_wo_BiGG function and its resulting table obsolete. Thus, this function and all occurrences of its result will be removed from the code with the next commit.
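
A minimal sketch of how such problem metabolites could be spotted after gap filling. It assumes the common definitions (orphan: only consumed, dead-end: only produced, disconnected: in no reaction), which is how I understand them from memote. The function `classify_metabolites` and the toy reactions are hypothetical, and real models would also have exchange reactions that supply or drain some of these metabolites.

```python
def classify_metabolites(metabolites, reactions):
    """Classify metabolites as orphan, dead-end or disconnected.

    reactions: iterable of (stoichiometry dict, reversible flag); negative
    coefficients mark consumed, positive coefficients produced metabolites.
    """
    produced, consumed = set(), set()
    for stoichiometry, reversible in reactions:
        for metabolite, coefficient in stoichiometry.items():
            # A reversible reaction can both produce and consume
            if coefficient > 0 or reversible:
                produced.add(metabolite)
            if coefficient < 0 or reversible:
                consumed.add(metabolite)
    return {
        'orphans': sorted(consumed - produced),
        'dead_ends': sorted(produced - consumed),
        'disconnected': sorted(set(metabolites) - produced - consumed),
    }


# Toy network: a_c -> b_c -> c_c, plus an unused metabolite d_c
toy_reactions = [
    ({'a_c': -1, 'b_c': 1}, False),
    ({'b_c': -1, 'c_c': 1}, False),
]
print(classify_metabolites(['a_c', 'b_c', 'c_c', 'd_c'], toy_reactions))
```

Running such a check before and after adding reactions would show directly which of the new metabolites are not connected to the rest of the network.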

Compartments were missing from BiGG IDs which have no BioCyc IDs. This issue is resolved with this commit now.

GwennyGit · 2023-08-21T14:33:49Z

The additional handling of SEED identifiers was added as the function get_bigg2other_db could be used in user-written scripts if a mapping from BiGG to another database is required. However, currently, the mapping would need to be to BioCyc, KEGG or SEED. In the future, more databases could be added.

GwennyGit · 2023-08-25T13:59:21Z

A check in the gapfill_model function to verify that the added reactions are necessary could be very useful. In the DEMETER part of the COBRA Toolbox a function verifyGapfilledReactions exists. This function could be used as a template. 🤔
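
One way such a check could look is a single-removal test: remove each added reaction in turn and see whether the model still grows. The sketch below is not a port of verifyGapfilledReactions, whose exact behaviour I have not verified. The names `find_redundant_reactions` and `growth_without` are hypothetical, and the simulation is replaced by a toy function.

```python
def find_redundant_reactions(added_reaction_ids, growth_without, threshold=1e-6):
    """Return added reactions whose individual removal still allows growth.

    growth_without: callable taking a reaction ID and returning the growth
    rate of the model with that one reaction disabled.
    """
    redundant = []
    for reaction_id in added_reaction_ids:
        if growth_without(reaction_id) > threshold:
            redundant.append(reaction_id)
    return redundant


# Toy stand-in for a simulation: growth needs R1 and at least one of R2 or R3.
def toy_growth_without(removed_id):
    active = {'R1', 'R2', 'R3'} - {removed_id}
    return 1.0 if 'R1' in active and ({'R2', 'R3'} & active) else 0.0


print(find_redundant_reactions(['R1', 'R2', 'R3'], toy_growth_without))
# ['R2', 'R3']
```

With COBRApy, `growth_without` could be written with the model context manager, `model.reactions.get_by_id(reaction_id).knock_out()` and `model.slim_optimize()`. Note that the toy result also shows the limit of single removals: R2 and R3 are each dispensable on their own, but not both together.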

f3rryman · 2023-11-08T10:25:05Z

Improving the run-time of analysis_kegg.py and adding a progress bar could be useful for the user.
The bottleneck is probably get_locus_ec() and get_locus_ec_kegg().

Maybe the same solution as in the multi_get_reaction_compartment() function in analysis_db.py is feasible.
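
One possible direction, sketched with the standard library only: run the per-entry lookups in a thread pool and report progress as they finish. This is not necessarily what multi_get_reaction_compartment() does; `fetch_all` and `fake_lookup` are hypothetical names, and the lookup is a stand-in for a database request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(keys, fetch_one, max_workers=8, report=print):
    """Run fetch_one(key) for all keys in parallel and report progress."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which key each future belongs to
        futures = {executor.submit(fetch_one, key): key for key in keys}
        for finished, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            report(f'{finished}/{len(futures)} done')
    return results


# Stand-in for a slow per-locus database request (hypothetical).
def fake_lookup(locus_tag):
    return locus_tag.upper()


print(fetch_all(['a_001', 'a_002', 'a_003'], fake_lookup))
```

Threads fit here because the work is dominated by waiting for network responses. The number of workers should stay small, since database servers may limit the request rate. For a proper progress bar, the `as_completed(futures)` iterator could be wrapped in `tqdm(..., total=len(futures))`.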

GwennyGit added the enhancement (New feature or request) label Jan 19, 2023

famosab added this to the New functions towards a version 1.1 milestone Jan 19, 2023

famosab mentioned this issue Feb 1, 2023

Add functionality to add more identifiers to GeneProducts in polish #53

Closed

5 tasks

GwennyGit added a commit that referenced this issue Feb 2, 2023

Created analysis_biocyc.py #52

7a75107

GwennyGit added a commit that referenced this issue Feb 2, 2023

Created entities.py #52

9c4d9f5

GwennyGit added a commit that referenced this issue Feb 2, 2023

Created gapfill.py #52

875e4eb

GwennyGit added a commit that referenced this issue Feb 2, 2023

Renamed genecomp to analysis_kegg.py #52

96eda8f

GwennyGit added a commit that referenced this issue Feb 2, 2023

Updated __init__.py due to module changes #52

0326be3

GwennyGit added a commit that referenced this issue Feb 2, 2023

Adjusted code in curate #52

0c58e02

GwennyGit added a commit that referenced this issue Feb 2, 2023

Adjusted code in analysis_kegg #52

810f9d9

GwennyGit added a commit that referenced this issue Feb 8, 2023

Added code for function gapfill_analysis #52

1f95746

GwennyGit added a commit that referenced this issue Feb 8, 2023

Updated analysis_biocyc #52

890f2fa

GwennyGit added a commit that referenced this issue Feb 8, 2023

Updated analysis_kegg due to refactoring #52

e94fe4d

GwennyGit added a commit that referenced this issue Feb 8, 2023

Updated entities due to refactoring #52

e947d42

GwennyGit added a commit that referenced this issue Feb 8, 2023

Added InChI-Key to metabol_db_dict in cvterms #52 #59

8eeb1ed

GwennyGit added a commit that referenced this issue Feb 8, 2023

Created analysis_db due to refactoring #52

624131b

GwennyGit added a commit that referenced this issue Feb 8, 2023

Created parse due to refactoring #52

78c4fba

GwennyGit added a commit that referenced this issue Feb 8, 2023

Updated __init__ due to new modules #52

8f65d0d

GwennyGit added a commit that referenced this issue Feb 8, 2023

Updated main due to new function gapfill_analysis #52

fac0df6

GwennyGit added a commit that referenced this issue Feb 8, 2023

Added bigg_models_metabolites.txt for gapfill #52

f7238e7

GwennyGit added a commit that referenced this issue Feb 8, 2023

Updated config due to new function gapfill_analysis #52

0b5c6f4

GwennyGit added a commit that referenced this issue Mar 5, 2023

Changed code syntax highlighting in the documentation #52

0d4312c

GwennyGit linked a pull request Mar 5, 2023 that will close this issue

Gapfill update #73

Merged

GwennyGit added a commit that referenced this issue Mar 6, 2023

Adjusted doc strings for gapfill #52 #74

fe322f3

GwennyGit added a commit that referenced this issue Mar 7, 2023

Resolved merge conflicts for dev #52

4cf61e6

GwennyGit added a commit that referenced this issue Mar 7, 2023

Integrated gapfill into the new main.py #52

f181acf

GwennyGit added a commit that referenced this issue Mar 7, 2023

Integrated gapfill into the new config.yaml #52

55cb972

GwennyGit added a commit that referenced this issue Mar 7, 2023

Integrated gapfill into io.save_user_input #52

3d80ff0

GwennyGit added a commit that referenced this issue Mar 7, 2023

Adjusted gapfill function names to new main & config #52

cfc1d2f

GwennyGit added a commit that referenced this issue Mar 7, 2023

Reduced main doc string in gapfill #52 #74

735c9ee

famosab changed the title ~~Summary & Enhancement of gap-filling in refineGEMs~~ Omprovement of gap-filling in refineGEMs Mar 7, 2023

famosab changed the title ~~Omprovement of gap-filling in refineGEMs~~ Improvement of gap-filling in refineGEMs Mar 7, 2023

GwennyGit added a commit that referenced this issue Mar 8, 2023

Changed filenames for gapfill #52

dc99517

GwennyGit added a commit that referenced this issue Mar 8, 2023

Adjusted statistics_df print for gapfill #52

98a92d4

GwennyGit added a commit that referenced this issue Mar 8, 2023

Removed debugging code in analysis_biocyc #52

3c3f9bb

GwennyGit added a commit that referenced this issue Mar 9, 2023

Resolved issues with gapfill_model in save_user_input #52

6e136e0

GwennyGit added a commit that referenced this issue Mar 9, 2023

Fixed ChEBI ID handling in analysis_biocyc.py #52

059ada9

GwennyGit added a commit that referenced this issue Mar 9, 2023

Fixed ChEBI & InChIKEy handling in gapfill.py #52

8acca6a

GwennyGit added a commit that referenced this issue Mar 14, 2023

Updated gapfill for BioCyc #52

91a682d

GwennyGit added a commit that referenced this issue Mar 15, 2023

Fixed compartment issue for metabolites #52

54838d7

GwennyGit added a commit that referenced this issue Mar 16, 2023

Merge pull request #78 from draeger-lab/dev

4bdf88a

Updated gapfill for BioCyc #52

GwennyGit added a commit that referenced this issue Mar 21, 2023

Added Dr. Reihaneh Mostolizadeh as author #52

003af59

GwennyGit added a commit that referenced this issue Mar 21, 2023

Changed default GPR in gapfill from and to or #52

02b5871

GwennyGit added a commit that referenced this issue Apr 16, 2023

Fixed improper compiled entries in gapfill documentation #52

b6cbf47

GwennyGit self-assigned this Jun 29, 2023
