Potential Bottlenecks in ProtGraph #46

Open
Luxxii opened this issue Oct 13, 2021 · 0 comments
Luxxii commented Oct 13, 2021

I ran a line profiler on the generate_graph_consumer function, using the human_review dataset (about 20k proteins).

It produced the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   114                                           def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
   115                                               """
   116                                               TODO
   117                                               describe kwargs and consumer until a graph is generated and digested etc ...
   118                                               """
   119                                               # Set proc id
   120         1          3.0      3.0      0.0      kwargs["proc_id"] = proc_id
   121                                           
   122                                               # Set feature_table dict boolean table
   123         1          1.0      1.0      0.0      ft_dict = dict()
   124         1          1.0      1.0      0.0      if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
   125         1          6.0      6.0      0.0          ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
   126                                               else:
   127                                                   for i in kwargs["feature_table"]:
   128                                                       ft_dict[i] = True
   129                                           
   130                                               # Initialize the exporters for graphs
   131         1         58.0     58.0      0.0      graph_exporters = Exporters(**kwargs)
   132                                           
   133                                               while True:
   134                                                   # Get next entry
   135     20387    7568810.0    371.3      0.7          entry = entry_queue.get()
   136                                           
   137                                                   # Stop if entry is None
   138     20387      18156.0      0.9      0.0          if entry is None:
   139                                                       # --> Stop Condition of Process
   140         1          3.0      3.0      0.0              break
   141                                           
   142                                                   # Beginning of Graph-Generation
   143                                                   # We also collect interesting information here!
   144                                           
   145                                                   # Generate canonical graph (initialization of the graph)
   146     20386    4416910.0    216.7      0.4          graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
   147                                           
   148                                                   # FT parsing and appending of Nodes and Edges into the graph
   149                                                   # The amount of isoforms, etc.. can be retrieved on the fly
   150     20386      22188.0      1.1      0.0          num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   152                                           
   153                                                   # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
   154     20386      83272.0      4.1      0.0          replace_aa(graph, kwargs["replace_aa"])
   155                                           
   156                                                   # Digest graph with enzyme (unlimited miscleavages)
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   158                                           
   159                                                   # Merge (summarize) graph if wanted
   160     20386      29893.0      1.5      0.0          if not kwargs["no_merge"]:
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
   162                                           
   163                                                   # Collapse parallel edges in a graph
   164     20386      29694.0      1.5      0.0          if not kwargs["no_collapsing_edges"]:
   165     20386   10804029.0    530.0      1.0              collapse_parallel_edges(graph)
   166                                           
   167                                                   # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
   168     20386     948172.0     46.5      0.1          annotate_weights(graph, **kwargs)
   169                                           
   170                                                   # Calculate statistics on the graph:
   171     20386      11921.0      0.6      0.0          (
   172     20386      12094.0      0.6      0.0              num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
   173     20386       9768.0      0.5      0.0              num_paths_var, num_path_mut, num_path_con
   174     20386     297176.0     14.6      0.0          ) = get_statistics(graph, **kwargs)
   175                                           
   176                                                   # Verify graphs if wanted:
   177     20386      11624.0      0.6      0.0          if kwargs["verify_graph"]:
   178                                                       verify_graph(graph)
   179                                           
   180                                                   # Persist or export graphs with speicified exporters
   181     20386      38415.0      1.9      0.0          graph_exporters.export_graph(graph, common_out_queue)
   182                                           
   183                                                   # Output statistics we gathered during processing
   184     20386      10500.0      0.5      0.0          if kwargs["no_description"]:
   185                                                       entry_protein_desc = None
   186                                                   else:
   187     20386      37338.0      1.8      0.0              entry_protein_desc = entry.description.split(";", 1)[0]
   188     20386      37142.0      1.8      0.0              entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
   189                                           
   190     40772     312422.0      7.7      0.0          graph_queue.put(
   191     20386      12337.0      0.6      0.0              (
   192     20386      11818.0      0.6      0.0                  entry.accessions[0],  # Protein Accesion
   193     20386      10432.0      0.5      0.0                  entry.entry_name,  # Protein displayed name
   194     20386       9196.0      0.5      0.0                  num_isoforms,  # Number of Isoforms
   195     20386       9231.0      0.5      0.0                  num_initm,  # Number of Init_M (either 0 or 1)
   196     20386       9244.0      0.5      0.0                  num_signal,  # Number of Signal Peptides used (either 0 or 1)
   197     20386       9232.0      0.5      0.0                  num_variant,  # Number of Variants applied to this protein
   198     20386       9227.0      0.5      0.0                  num_mutagens,  # Number of applied mutagens on the graph
   199     20386       9231.0      0.5      0.0                  num_conficts,  # Number of applied conflicts on the graph
   200     20386       9274.0      0.5      0.0                  num_of_cleavages,  # Number of cleavages (marked edges) this protein has
   201     20386       9240.0      0.5      0.0                  num_nodes,  # Number of nodes for the Protein/Peptide Graph
   202     20386       9269.0      0.5      0.0                  num_edges,  # Number of edges for the Protein/Peptide Graph
   203     20386       9311.0      0.5      0.0                  num_paths,  # Possible (non repeating paths) to the end of a graph. (may conatin repeating peptides)
   204     20386       9318.0      0.5      0.0                  num_paths_miscleavages,  # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
   205     20386       9288.0      0.5      0.0                  num_paths_hops,  # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
   206     20386       9363.0      0.5      0.0                  num_paths_var,  # Num paths of feture variant
   207     20386       9519.0      0.5      0.0                  num_path_mut,  # Num paths of feture mutagen
   208     20386       9508.0      0.5      0.0                  num_path_con,  # Num paths of feture conflict
   209     20386       9476.0      0.5      0.0                  entry_protein_desc,  # Description name of the Protein (can be lenghty)
   210                                                       )
   211                                                   )
   212                                           
   213                                               # Close exporters (maybe opened files, database connections, etc... )
   214         1         13.0     13.0      0.0      graph_exporters.close()

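The bottlenecks are three calls that together account for roughly 98% of the runtime:

  • Digestion (digest, ~43%)
  • Apply features (_include_ft_information, ~29%)
  • Merge amino acids (merge_aminoacids, ~25%)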
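
The ranking above can be reproduced directly from the profiler text. The following is a minimal sketch, not part of ProtGraph: the names PROFILE_ROWS, ROW, rank_hot_lines and min_percent are hypothetical, and the sample rows are copied from the output above. It assumes the column layout shown there (line number, hits, time, per hit, % time, line contents) and skips any line that was never executed.

```python
import re

# A few rows copied from the profiler output above (columns: line number,
# hits, time, per hit, % time, line contents).
PROFILE_ROWS = """\
   135     20387    7568810.0    371.3      0.7          entry = entry_queue.get()
   146     20386    4416910.0    216.7      0.4          graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
   165     20386   10804029.0    530.0      1.0              collapse_parallel_edges(graph)
"""

# Matches only rows that carry all five numeric columns, i.e. executed lines.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.*)$")


def rank_hot_lines(text, min_percent=5.0):
    """Return (percent, line_no, source) tuples at or above min_percent, hottest first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or a line without timing data
        line_no, _hits, _time, _per_hit, percent, source = match.groups()
        if float(percent) >= min_percent:
            rows.append((float(percent), int(line_no), source.strip()))
    return sorted(rows, reverse=True)


hot = rank_hot_lines(PROFILE_ROWS)
for percent, line_no, source in hot:
    print(f"{percent:5.1f}%  line {line_no}: {source}")
print(f"combined: {sum(p for p, _, _ in hot):.1f}%")
```

With the default threshold of 5%, only the three calls listed above remain, which makes it easy to re-check the ranking after each optimization attempt.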
Luxxii added the enhancement (New feature or request) label on Oct 13, 2021
Luxxii self-assigned this on Oct 13, 2021