
ARAX responding with 400 errors with TRAPI validation errors when valid Knowledge Graphs are included in the query #2106

Open
CaseyTa opened this issue Aug 18, 2023 · 40 comments

@CaseyTa

CaseyTa commented Aug 18, 2023

Issue: When submitting a query to ARAX that includes a message with a knowledge graph, ARAX responds with a 400 error with the following data:

{
    "detail": "{...} is not valid under any of the given schemas - 'message.knowledge_graph'",
    "status": 400,
    "title": "Bad Request",
    "type": "about:blank"
}

where the contents of {...} in the detail is the content of the knowledge graph.

However, when I run these queries through the Reasoner Validator (version 3.8.5) against TRAPI 1.4.2, the query passes validation.

Here's a sample query that will generate the error. The knowledge graph has been stripped down to a single edge and 2 nodes with minimal content.
fill_e1_reduced-kg.txt

If we strip out the contents of the knowledge graph nodes and edges completely and also remove "auxiliary_graphs": null (auxiliary_graphs is nullable in the spec, but ARAX also issues a warning about it being None), ARAX no longer returns TRAPI validation errors (different errors are returned, but those are expected).
fill_e1_no_arax_validation_error.txt

This is not a priority issue for September.

Context: The Translator Clinical Data Committee is working on developing some workflows with the Workflow Runner. In our "Path A" workflow, we want to issue a fill operation on the first edge with a specific allowlist of clinical KPs, followed by a second fill operation for the remaining edges using any KP.

@edeutsch
Collaborator

Does it possibly work any better if you set auxiliary_graphs to {} in your original query? This was one of the TRAPI 1.4.0 / 1.4.2 problems, I think...

@CaseyTa
Author

CaseyTa commented Aug 18, 2023

@edeutsch Thanks for the suggestion.

In the first query (fill_e1_reduced-kg.txt), changing it to "auxiliary_graphs": {} made no difference. Similarly, removing it made no difference.

In the second query (fill_e1_no_arax_validation_error.txt), adding "auxiliary_graphs": {} back in does allow it to get past the validation.

@karafecho

@edeutsch : Perhaps we can tag this issue as "post 2023 code freeze", just so it gets some visibility? This issue and #2075 are kind of creating blockers for the TCDC's efforts on the CQS.

@edeutsch
Collaborator

Hi @karafecho, I think the issue is just that source_record_urls and auxiliary_graphs can't be null in TRAPI 1.4.0. This is the TRAPI 1.4.0 vs. 1.4.2 problem that we discovered after release and decided wasn't worth a whole new release. If you just change:

                            "source_record_urls": null

to:

                            "source_record_urls": []

and

        "auxiliary_graphs": null

to:

        "auxiliary_graphs": {}

then it will work fine. Can you do that? Fixing the query will be a lot easier than fixing the code for the moment, and it should allow you to proceed. Or is that not feasible?
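The null-to-empty substitutions described above can also be applied programmatically before submitting the query. A minimal sketch (the function name is hypothetical, not part of any Translator library):

```python
def patch_trapi_nulls(message: dict) -> dict:
    """Replace nulls that TRAPI 1.4.0 rejects with the empty
    containers that TRAPI 1.4.2 treats as equivalent."""
    kg = message.get("knowledge_graph") or {}
    for edge in (kg.get("edges") or {}).values():
        for source in edge.get("sources") or []:
            if source.get("source_record_urls") is None:
                source["source_record_urls"] = []
    if message.get("auxiliary_graphs") is None:
        message["auxiliary_graphs"] = {}
    return message
```

This mutates the message in place and returns it, so it can be dropped into an existing request pipeline just before the POST.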

This does not generate errors:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "MONDO:0009061"
                    ],
                    "categories": null,
                    "is_set": false,
                    "constraints": []
                },
                "n1": {
                    "ids": null,
                    "categories": [
                        "biolink:MolecularEntity",
                        "biolink:EnvironmentalExposure"
                    ],
                    "is_set": true,
                    "constraints": []
                },
                "n2": {
                    "ids": null,
                    "categories": [
                        "biolink:Gene",
                        "biolink:Protein"
                    ],
                    "is_set": true,
                    "constraints": []
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "knowledge_type": null,
                    "predicates": [
                        "biolink:correlated_with",
                        "biolink:associated_with_likelihood_of"
                    ],
                    "attribute_constraints": [],
                    "qualifier_constraints": [],
                    "exclude": false
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2",
                    "knowledge_type": null,
                    "predicates": [
                        "biolink:affects"
                    ],
                    "attribute_constraints": [],
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:qualified_predicate",
                                    "qualifier_value": "biolink:causes"
                                }
                            ]
                        },
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "expression"
                                },
                                {
                                    "qualifier_type_id": "biolink:qualified_predicate",
                                    "qualifier_value": "biolink:causes"
                                }
                            ]
                        },
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "secretion"
                                },
                                {
                                    "qualifier_type_id": "biolink:qualified_predicate",
                                    "qualifier_value": "biolink:causes"
                                }
                            ]
                        },
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "degradation"
                                },
                                {
                                    "qualifier_type_id": "biolink:qualified_predicate",
                                    "qualifier_value": "biolink:causes"
                                }
                            ]
                        }
                    ],
                    "exclude": false
                }
            }
        },
        "knowledge_graph": {
            "nodes": {
                "MONDO:0009061": {
                    "categories": [
                        "biolink:Disease",
                        "biolink:DiseaseOrPhenotypicFeature"
                    ]
                },
                "PUBCHEM.COMPOUND:1068": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ]
                }
            },
            "edges": {
                "090bb1f96d3c": {
                    "subject": "PUBCHEM.COMPOUND:1068",
                    "object": "MONDO:0009061",
                    "predicate": "biolink:biomarker_for",
                    "sources": [
                        {
                            "resource_id": "infores:ctd",
                            "resource_role": "primary_knowledge_source",
                            "upstream_resource_ids": [],
                            "source_record_urls": [
                                "http://ctdbase.org/detail.go?type=chem&acc=C004784&view=disease"
                            ]
                        },
                        {
                            "resource_id": "infores:arax",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:molepro"
                            ],
                            "source_record_urls": []
                        },
                        {
                            "resource_id": "infores:molepro",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:ctd"
                            ],
                            "source_record_urls": []
                        }
                    ]
                }
            }
        },
        "results": null,
        "auxiliary_graphs": {}
    },
    "workflow": [
        {            
            "id": "fill",
            "parameters": {
                "allowlist": [],
                "qedge_keys": [
                    "e1"
                ]
            }
        }
    ]
}

@karafecho

karafecho commented Sep 28, 2023

I'm not sure fixing the query will work. When I ran your query, it indeed passed the TRAPI validation test, but it did not return any results. [As an aside, when I sent the query to the ARS, I received a 598 timeout error, which shouldn't be the case, but that appears to be an ARS issue that @MarkDWilliams will need to explore.]

@maximusunc explored the issue a bit. From what he reported, I think there are additional issues beyond the TRAPI validation error that are preventing the query from both passing TRAPI validation in ARAX and returning results.

That said, Max and Abrar are re-engineering the Path A CQS query that drove this issue, and so, we may be able to independently resolve it.

Will keep you posted ...

@karafecho

I'll add one other issue that Max identified. Specifically, I think ARAX is considering infores:automat-icees-kg as the primary endpoint for icees-kg; however, infores:automat-icees-kg is the aggregator knowledge source and infores:icees-kg is the primary knowledge source.

Just thought I'd point that out ...

@edeutsch
Collaborator

Well, I was just resolving the validation errors. I don't know anything about the query itself or what is expected of it.

I'm not sure I understand about the infores. I think we would go by how it's registered in SmartAPI. So if the SmartAPI name is different than what you specify, then it probably won't work. I think you need to specify the SmartAPI key? I'm not certain.

@maximusunc

Here's a stack trace that I get when trying to send a fill allowlist:

  1. From Workflow Runner: Requesting operation '{'id': <fill.fill: 'fill'>, 'parameters': {'allowlist': ['infores:cohd', 'infores:automat-icees-kg'], 'qedge_keys': ['e0']}}' from ARAX Translator Reasoner - TRAPI 1.4.0...
  2. From ARAX: Invalid user-specified KP(s): {'infores:automat-icees-kg'}. Valid options are: {'infores:molepro', 'infores:rtx-kg2', 'infores:text-mining-provider-cooccurrence', 'infores:spoke', 'infores:service-provider-trapi', 'infores:openpredict', 'infores:arax-normalized-google-distance', 'infores:cohd', 'infores:genetics-data-provider', 'infores:knowledge-collaboratory', 'infores:connections-hypothesis', 'infores:arax-drug-treats-disease'}

This list of KPs doesn't include any of the automat services, but they are all valid KPs. Could this be looked into please?
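The error in item 2 looks like a straightforward set-difference check between the requested allowlist and the cached set of known KPs. A sketch of that pattern (the function name is illustrative, not ARAX's actual code):

```python
def validate_allowlist(requested_kps, valid_kps):
    """Reject any user-specified KP that is not in the cached set of
    known KPs, mirroring the error message in the stack trace above."""
    invalid = set(requested_kps) - set(valid_kps)
    if invalid:
        raise ValueError(f"Invalid user-specified KP(s): {invalid}. "
                         f"Valid options are: {set(valid_kps)}")
```

If the valid set comes from a stale cache, as discussed below in this thread, a KP that was registered recently will be rejected even though it is live.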

@edeutsch
Collaborator

@amykglen would you be able to look into item 2 above?

@maximusunc

Sorry, we did just discover a bug where the Automats on dev weren't registered on SmartAPI correctly. Could you just try refreshing your KP list please?

@edeutsch
Collaborator

ah yes, I forgot about that. We reported it in NCATSTranslator/Feedback#557

Refreshing should be automatic within 10 minutes. Is it resolved now?

https://arax.ci.transltr.io/?smartapi=1

@edeutsch
Collaborator

> @amykglen would you be able to look into item 2 above?

Nevermind, it was likely a SmartAPI registration issue.

@maximusunc

I'm seeing green on the ARAX page for the Automats, but I'm still getting the same error when sending a query to ARAX.

@edeutsch
Collaborator

edeutsch commented Sep 29, 2023

okay, thanks for checking and reporting.

@maximusunc can you remind us which endpoint you're testing against?

@amykglen would you look into this, specifically:
From ARAX: Invalid user-specified KP(s): {'infores:automat-icees-kg'}. Valid options are: {'infores:molepro', 'infores:rtx-kg2', 'infores:text-mining-provider-cooccurrence', 'infores:spoke', 'infores:service-provider-trapi', 'infores:openpredict', 'infores:arax-normalized-google-distance', 'infores:cohd', 'infores:genetics-data-provider', 'infores:knowledge-collaboratory', 'infores:connections-hypothesis', 'infores:arax-drug-treats-disease'}

@maximusunc

We're hitting your dev instance: https://arax.ncats.io/beta/api/arax/v1.4/query

@amykglen
Member

yep, I'll look into it @edeutsch!

@maximusunc - can you provide an example query that produces the issue you're seeing? that would be super helpful. (I've tried running some of the queries linked earlier in this issue, but none are producing the issue you're reporting)

@maximusunc

Sure:

{
    "workflow": [
        {
            "id": "fill",
            "parameters": {
                "allowlist": [
                    "infores:cohd",
                    "infores:automat-icees-kg"
                ],
                "qedge_keys": [
                    "e0"
                ]
            }
        }
    ],
    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "exclude": false,
                    "predicates": [
                        "biolink:correlated_with",
                        "biolink:associated_with_likelihood_of"
                    ],
                    "subject": "n0",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "MONDO:0009061"
                    ],
                    "is_set": false
                },
                "n1": {
                    "categories": [
                        "biolink:MolecularEntity",
                        "biolink:EnvironmentalExposure"
                    ],
                    "is_set": true
                }
            }
        }
    },
    "log_level": "DEBUG"
}

@amykglen
Member

thanks @maximusunc!

well, that query just completed successfully for me (https://arax.ncats.io/beta/?r=172033), but it appears that it might only have worked this time because it happened to trigger a cache refresh on our end:

2023-09-29T23:48:28.321841 INFO: KP info cache(s) have not been updated for 24+ hours. Refreshing stale cache(s).

I thought the ARAX background tasker was supposed to update the KP info cache on something like an hourly basis, so I'm not sure why it was so out of date (this is on /beta)... suggests there's a bug with either the background tasker or the KP info cacher...
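The stale-cache fallback that fired in the log above presumably amounts to a timestamp comparison along these lines (names and the 24-hour threshold are illustrative, taken from the log message rather than from ARAX's actual code):

```python
import time

STALE_AFTER_SECS = 24 * 3600  # the 24-hour fallback threshold seen in the log

def cache_is_stale(last_refresh_ts, now=None):
    """True when the KP info cache has not been refreshed recently
    enough and should be rebuilt on the query path as a fallback."""
    now = time.time() if now is None else now
    return (now - last_refresh_ts) > STALE_AFTER_SECS
```

The point of the surrounding discussion is that this fallback should rarely trigger: a healthy background tasker refreshing hourly keeps the cache well under the threshold.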

@saramsey
Member

Yes, it's possible there is a bug. The Background Tasker and the kp_info_cacher.py were substantially changed in the issue-2114 branch (recently merged to master).

@saramsey
Member

One thing to check is whether the Background Tasker process is currently running for /beta?

@edeutsch
Collaborator

yeah, apparently not:

grep Tasker /tmp/RTX_OpenAPI_beta.elog
...
2023-09-27T14:34:51.057912: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.16, 0.4, 0.4), n_clients=0, n_ongoing_queries=0
2023-09-27T14:35:51.239649: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.45, 0.46, 0.43), n_clients=0, n_ongoing_queries=0
2023-09-27T14:36:51.462935: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.16, 0.38, 0.4), n_clients=0, n_ongoing_queries=0
2023-09-27T14:37:51.704246: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.3, 0.39, 0.4), n_clients=0, n_ongoing_queries=0
2023-09-27T14:38:51.942820: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.5, 0.41, 0.41), n_clients=0, n_ongoing_queries=0

Died on the 27th

@saramsey
Member

Is that about when we did the merge to master?

@saramsey
Member

I wonder if we should stop and restart /beta

@saramsey
Member

restarting /beta now

@edeutsch
Collaborator

I'm looking through the elog file for clues. Will you save that first?

@saramsey
Member

Shoot, just restarted, sorry.

@saramsey
Member

saramsey commented Sep 30, 2023

My bad, Eric; I should have thought to save a copy of the elog file.

@saramsey
Member

Eric, if you still have it open in less, you can probably save it to a new file by hitting the s key

@edeutsch
Collaborator

Here's what I see in the elog:

2023-09-27T14:37:51.704246: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.3, 0.39, 0.4), n_clients=0, n_ongoing_queries=0
2023-09-27T14:38:51.942820: INFO: ARAXBackgroundTasker status: waiting. Current load is (0.5, 0.41, 0.41), n_clients=0, n_ongoing_queries=0
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:07] "GET /beta/api/arax/v1.4/response/2b3177ad-64ab-4c90-a16e-a0c0082421bc HTTP/1.1" 200 -
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO: Launching validator via multiprocessing
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO: Launching validator via multiprocessing
WARNING: tried to convert the response to JSON and it did not work
It was: {'level': 'INFO', 'message': 'ARS info: &lt;!DOCTYPE html&gt;&lt;html&gt;&lt;head&gt;&lt;meta charSet=&quot;utf-8&quot;/&gt;&lt;meta name=&quot;viewport&quot; content=&quot;width=device-width&quot;/&gt;&lt;title&gt;4
04: This page could not be found&lt;/title&gt;&lt;meta name=&quot;next-head-count&quot; content=&quot;3&quot;/&gt;&lt;noscript data-n-css=&quot;&quot;&gt;&lt;/noscript&gt;&lt;script defer=&quot;&quot; nomodule=&quot;&quot; s
rc=&quot;/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/webpack-3864d442a525cf32.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next
/static/chunks/framework-8883d1e9be70c3da.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/main-d1cb916d1fed4c3b.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_nex
t/static/chunks/pages/_app-b555d5e1eab47959.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/pages/_error-d79168f986538ac0.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&
quot;/_next/static/jUzYuJn7pMobhm-zdAIud/_buildManifest.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/jUzYuJn7pMobhm-zdAIud/_ssgManifest.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;/h
ead&gt;&lt;body&gt;&lt;div id=&quot;__next&quot;&gt;&lt;div style=&quot;font-family:system-ui,&amp;quot;Segoe UI&amp;quot;,Roboto,Helvetica,Arial,sans-serif,&amp;quot;Apple Color Emoji&amp;quot;,&amp;quot;Segoe UI Emoji&amp;
quot;;height:100vh;text-align:center;display:flex;flex-direction:column;align-items:center;justify-content:center&quot;&gt;&lt;div&gt;&lt;style&gt;body{color:#000;background:#fff;margin:0}.next-error-h1{border-right:1px soli
d rgba(0,0,0,.3)}@media (prefers-color-scheme:dark){body{color:#fff;background:#000}.next-error-h1{border-right:1px solid rgba(255,255,255,.3)}}&lt;/style&gt;&lt;h1 class=&quot;next-error-h1&quot; style=&quot;display:inline-
block;margin:0 20px 0 0;padding-right:23px;font-size:24px;font-weight:500;vertical-align:top;line-height:49px&quot;&gt;404&lt;/h1&gt;&lt;div style=&quot;display:inline-block;text-align:left&quot;&gt;&lt;h2 style=&quot;font-s
ize:14px;font-weight:400;line-height:49px;margin:0&quot;&gt;This page could not be found&lt;!-- --&gt;.&lt;/h2&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;script id=&quot;__NEXT_DATA__&quot; type=&quot;application
/json&quot;&gt;{&quot;props&quot;:{&quot;pageProps&quot;:{&quot;statusCode&quot;:404}},&quot;page&quot;:&quot;/_error&quot;,&quot;query&quot;:{},&quot;buildId&quot;:&quot;jUzYuJn7pMobhm-zdAIud&quot;,&quot;nextExport&quot;:tr
ue,&quot;isFallback&quot;:false,&quot;gip&quot;:true,&quot;scriptLoader&quot;:[]}&lt;/script&gt;&lt;/body&gt;&lt;/html&gt;'}
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:10] "GET /beta/api/arax/v1.4/response/6cf20dc2-2639-4031-ac36-521775e2afb7 HTTP/1.1" 200 -
INFO:root:Importing linkml:types as /mnt/data/python/Python-3.9.16/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source https://raw.githubusercontent.com/biolink/biolink-model/v3.5.4/biolink
-model.yaml; base_dir=None
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:10] "GET /beta/api/arax/v1.4/response/b10650e7-91ad-402c-a7c3-ee8fd330edfd HTTP/1.1" 200 -
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO: Launching validator via multiprocessing
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:11] "GET /beta/api/arax/v1.4/response/627fd9eb-732a-4c0f-912b-cfe5e8611f00 HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
INFO:root:Importing linkml:types as /mnt/data/python/Python-3.9.16/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source https://raw.githubusercontent.com/biolink/biolink-model/v3.5.4/biolink
-model.yaml; base_dir=None
INFO:root:Importing linkml:types as /mnt/data/python/Python-3.9.16/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source https://raw.githubusercontent.com/biolink/biolink-model/v3.5.4/biolink
-model.yaml; base_dir=None
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO: Launching validator via multiprocessing
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:12] "GET /beta/api/arax/v1.4/response/af00ded4-a88e-4813-ac73-ae76276cab64 HTTP/1.1" 200 -
WARNING: tried to convert the response to JSON and it did not work
It was: {'level': 'INFO', 'message': 'ARS info: &lt;html&gt;\n  &lt;head&gt;\n    &lt;title&gt;Internal Server Error&lt;/title&gt;\n  &lt;/head&gt;\n  &lt;body&gt;\n    &lt;h1&gt;&lt;p&gt;Internal Server Error&lt;/p&gt;&lt;/
h1&gt;\n    \n  &lt;/body&gt;\n&lt;/html&gt;\n'}
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:12] "GET /beta/api/arax/v1.4/response/bbe08b74-b079-4916-b429-ac9d79768fff HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:13] "GET /beta/api/arax/v1.4/response/f5ee6748-1a97-432a-98b6-ed173940abb4 HTTP/1.1" 200 -
INFO:root:Importing linkml:types as /mnt/data/python/Python-3.9.16/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source https://raw.githubusercontent.com/biolink/biolink-model/v3.5.4/biolink
-model.yaml; base_dir=None
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:14] "GET /beta/api/arax/v1.4/response/4b874670-a355-4871-a21a-194f5cd5dde5 HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:14] "GET /beta/api/arax/v1.4/response/2a0685a0-7230-4cfa-b9fb-1f83cd494850 HTTP/1.1" 200 -
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO: Launching validator via multiprocessing
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:16] "GET /beta/api/arax/v1.4/response/738623d0-91fc-49cc-95b3-13fb28697f24 HTTP/1.1" 200 -
INFO:root:Importing linkml:types as /mnt/data/python/Python-3.9.16/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source https://raw.githubusercontent.com/biolink/biolink-model/v3.5.4/biolink
-model.yaml; base_dir=None
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:16] "GET /beta/api/arax/v1.4/response/efa2f249-4e25-4794-8f63-ffb32cf12930 HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:16] "GET /beta/api/arax/v1.4/response/b3a2e253-d1eb-46fc-a2a8-2ed22791270e HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:16] "GET /beta/api/arax/v1.4/response/35c870e2-c7b2-4913-a642-5b4b6b0c8e77 HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:18] "GET /beta/api/arax/v1.4/response/593a2193-a513-43fd-92cf-23c0c6b30f66 HTTP/1.1" 200 -
WARNING:bmt.toolkit:no biolink class found for the given curie: metatype:Datetime, try get_element_by_mapping?
WARNING:bmt.toolkit:no biolink class found for the given curie: bts:sentence, try get_element_by_mapping?
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:39:33] "GET /beta/api/arax/v1.4/response/c5a226ff-a5c8-4396-8686-81d92ca062ce HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:41:20] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -
INFO:reasoner_validator.biolink:Resolved Biolink Model Version: '3.5.4'
INFO:reasoner_validator.trapi:TRAPISchemaValidator set to TRAPI Version: 'v1.4.2'
INFO: Launching validator via multiprocessing
INFO:root:Importing linkml:types as /mnt/data/python/Python-3.9.16/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source https://raw.githubusercontent.com/biolink/biolink-model/v3.5.4/biolink
-model.yaml; base_dir=None
WARNING:bmt.toolkit:no biolink class found for the given curie: metatype:Datetime, try get_element_by_mapping?
WARNING:bmt.toolkit:no biolink class found for the given curie: bts:sentence, try get_element_by_mapping?
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:44:23] "GET /beta/api/arax/v1.4/response/c5a226ff-a5c8-4396-8686-81d92ca062ce HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Sep/2023 14:46:06] "GET /beta/api/arax/v1.4/response/17cbecc1-4c30-46db-9499-1cc1c1b50df5 HTTP/1.1" 200 -

The next ping from the BackgroundTasker should have been at 14:39:51, but it was not heard from again.

@saramsey
Member

Eric, I note that the call to query_tracker.check_ongoing_queries() is not in a try block. Maybe something threw an exception and the child process died?

@edeutsch
Collaborator

yes, possible. Although it really should have spewed a stack trace to the elog then?
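A defensive pattern for the periodic loop would log the stack trace instead of letting one failing task kill the tasker silently. A sketch under the assumption that the loop calls query_tracker.check_ongoing_queries() each cycle (the function and parameter names here are stand-ins, not ARAX's actual code):

```python
import logging
import time
import traceback

logger = logging.getLogger("ARAXBackgroundTasker")

def run_tasker_loop(query_tracker, interval_secs=60, max_iterations=None):
    """Periodic loop that survives exceptions in individual tasks,
    logging a stack trace rather than dying without a trace in the elog."""
    n = 0
    while max_iterations is None or n < max_iterations:
        try:
            query_tracker.check_ongoing_queries()
        except Exception:
            logger.error("check_ongoing_queries failed:\n%s",
                         traceback.format_exc())
        n += 1
        if max_iterations is None or n < max_iterations:
            time.sleep(interval_secs)
```

With this shape, a crash in the task produces an ERROR entry in the log and the next cycle still runs; a hang inside the task, on the other hand, would still stall the loop, which matches the "maybe it's hung" hypothesis later in the thread.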

@edeutsch
Collaborator

> Eric if you still have it open in less you can probably save it to a new file by hitting the s key

I still have the inode open in more; s does not seem to save. If you have another idea on how to relink the inode, I can try, but I don't think there's anything to see here. It just silently disappeared.

@edeutsch
Collaborator

I mused about the possibility of BackgroundTasker dying before, but had not seen it until now.

@saramsey
Member

> yes, possible. Although it really should have spewed a stack trace to the elog then?

Good point. Yes, I would expect that.

@edeutsch
Collaborator

I can confirm later in the log that Expander noticed the problem and refreshed after 24 hr:

2023-09-28T20:23:33.999936 DEBUG: (4148) [] Parsing action: resultify()
2023-09-28T20:23:33.999994 DEBUG: (4148) [] Parsing action: filter_results(action=limit_number_of_results, max_results=500)
2023-09-28T20:23:34.027160 INFO: (4148) [] KP info cache(s) have not been updated for 24+ hours. Refreshing stale cache(s).
The process with process ID 4148 has STARTED refreshing the KP info caches
  - Getting meta info from infores:cohd
  - Getting meta info from infores:molepro
  - Getting meta info from infores:genetics-data-provider
  - Getting meta info from infores:service-provider-trapi
  - Getting meta info from infores:rtx-kg2
  - Getting meta info from infores:spoke
  - Getting meta info from infores:knowledge-collaboratory
  - Getting meta info from infores:text-mining-provider-cooccurrence
  - Getting meta info from infores:openpredict
  - Getting meta info from infores:connections-hypothesis
The process with process ID 4148 has FINISHED refreshing the KP info caches
2023-09-28T20:23:39.672192 DEBUG: (4148) [] Loading cached Smart API info
2023-09-28T20:23:39.672374 DEBUG: (4148) [] Loading cached meta map
2023-09-28T20:23:39.681576: DEBUG: In BiolinkHelper init

@edeutsch
Collaborator

edeutsch commented Sep 30, 2023

So maybe query_tracker.check_ongoing_queries() is hung?

Can you tell if the child process is still running or gone?

e.g. do you know the PID?

@saramsey
Member

saramsey commented Sep 30, 2023

I did look at the process table, but there were a bunch of child processes (i.e., running queries), so it was hard to determine if any of them was the background tasker child process. I think maybe we should patch the background tasker module to print its PID in status messages.
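Printing the PID in the status line is a small change; the key part is os.getpid(). A sketch whose message format mimics the elog entries quoted earlier in the thread (the function name is hypothetical):

```python
import os

def tasker_status_message(load, n_clients, n_ongoing_queries):
    """Build a status line that includes the tasker's PID so the
    process can later be matched against the process table."""
    return (f"ARAXBackgroundTasker (pid {os.getpid()}) status: waiting. "
            f"Current load is {load}, n_clients={n_clients}, "
            f"n_ongoing_queries={n_ongoing_queries}")
```

With the PID in every status line, a `grep Tasker` of the elog immediately identifies which process-table entry to inspect when the pings stop.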

@saramsey
Member

saramsey commented Sep 30, 2023

At this point, if a remnant background tasker process was still running, it would (I assume) no longer show up as a "child process" in the process table, but just as a regular process. I think it is unlikely to still be hanging around (in a stuck state) given the SIGTERM handler code in there, but I guess it is possible.

@saramsey
Member

saramsey commented Sep 30, 2023

OK, let's take a look for an orphaned, stuck background tasker process. Here is the current list of python processes:

 4085 ?        Sl   375:06 python3 -u -m openapi_server
27297 ?        S      0:00  \_ python3 -u -m openapi_server
32263 ?        S      0:00  \_ python3 -u -m openapi_server
 4100 ?        S     15:14 python3 -u -m openapi_server
 4405 ?        S      1:12  \_ python3 -u -m openapi_server
 4116 ?        S      0:23 python3 -u -m openapi_server
 4374 ?        S      1:12  \_ python3 -u -m openapi_server
 4133 ?        Sl    37:05 python3 -u -m openapi_server
 4163 ?        S      1:32 python3 -u -m openapi_server
 4453 ?        S      1:12  \_ python3 -u -m openapi_server
 4178 ?        Sl     3:05 python3 -u -m openapi_server
16398 ?        S      0:00  \_ python3 -u -m openapi_server
16438 ?        S      0:00  \_ python3 -u -m openapi_server
 6730 ?        S      0:11 python3 -u -m openapi_server
 6767 ?        S      0:13  \_ python3 -u -m openapi_server
 8743 ?        S      0:17 python3 -u -m openapi_server
 8823 ?        S      0:12  \_ python3 -u -m openapi_server

The only PID without a child process is 4133.
And here is the list of parent process IDs of the running services:

tcp        0      0 0.0.0.0:5011            0.0.0.0:*               LISTEN      1025       784660705   4116/python3
tcp        0      0 0.0.0.0:5000            0.0.0.0:*               LISTEN      1025       784651216   4133/python3
tcp        0      0 0.0.0.0:5001            0.0.0.0:*               LISTEN      1025       784656534   4178/python3
tcp        0      0 0.0.0.0:5002            0.0.0.0:*               LISTEN      1025       784659696   4163/python3
tcp        0      0 0.0.0.0:5003            0.0.0.0:*               LISTEN      1025       788463881   8743/python3
tcp        0      0 0.0.0.0:5004            0.0.0.0:*               LISTEN      1025       788429291   6730/python3
tcp        0      0 0.0.0.0:5008            0.0.0.0:*               LISTEN      1025       784654128   4085/python3
tcp        0      0 0.0.0.0:5010            0.0.0.0:*               LISTEN      1025       784659686   4100/python3

So process 4133 is listening on port 5000:

tcp        0      0 0.0.0.0:5000            0.0.0.0:*               LISTEN      1025       784651216   4133/python3

So that is the ARAX production server, which is still running the old thread code and wouldn't be expected to have a background tasker child process.

Conclusion: I didn't find an orphaned background tasker process in the process table.
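The manual cross-check above can be scripted. A sketch that parses `ps -e -o pid,ppid` style output and reports top-level processes with no children (the assumed input format, PID then PPID per line, is an assumption; the real listing in this thread came from a forest-style ps):

```python
def parentless_leaders(ps_lines):
    """Given lines of 'PID PPID' output, return top-level processes
    (PPID 1) that have no children of their own -- candidates for a
    worker whose background child has died."""
    rows = []
    for line in ps_lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip the header and any malformed lines
        rows.append((int(parts[0]), int(parts[1])))
    parents = {ppid for _, ppid in rows}
    return [pid for pid, ppid in rows if ppid == 1 and pid not in parents]
```

Run against the listing above, this would single out 4133, matching the manual conclusion, though as noted that particular process is the production server and is not expected to have a tasker child.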

@amykglen
Member

amykglen commented Dec 20, 2023

I think this issue can be closed since the problems with the BackgroundTasker/KP Info cacher (#2170) have been fixed, right @saramsey?
