Adding New Databases #181

elifcevrim · 2022-02-14T17:16:27Z

Hi Denes,

We are working on adding new databases to pypath and have completed the first one, DrugCentral. How would you like us to merge new scripts or updates into the repo? Can we open a pull request directly, or would you prefer to discuss them in issues first, as we did before?

DrugBank requires a username and password. We tried to implement this by looking at similar databases in pypath, but I think we are missing something. Could you help us with this? DrugBank's programmatic access is described here (https://go.drugbank.com/releases/help), and I guess the -L option is part of the authentication procedure. We are not sure how to add these options to the current curl script in pypath.

deeenes · 2022-02-14T23:28:52Z

Hi Elif,

Thanks, sounds really great!

I see you have write access to this repo. Feel free to merge directly to master. New modules in pypath.inputs don't break the existing code, and the day after merging you can check in the report whether the new functions also run without error on the test server: https://status.omnipathdb.org/inputs/latest/

Alternatively, you can open pull requests, in case you want me to review the code first.

About DrugBank: I would suggest checking the legal notes and the license first; we should know whether it's okay to redistribute the data. As for the -L option of curl, it corresponds to CURLOPT_FOLLOWLOCATION, which means following HTTP 30x redirects, so it is not related to authentication. This is enabled by default in pypath.share.curl. Downloads that require cookies, custom HTTP headers or password authentication are often tricky to implement. It is often guesswork to find out which headers matter to the server. Here are a few examples:

https://github.com/saezlab/pypath/blob/master/pypath/inputs/cell.py
This function downloads supplementary files from journals of the publisher Cell Press. The logic is the following:

1. Create a Curl instance which we do not execute, but only use to obtain the cache path.
2. Check whether the cache path exists. If it does, go ahead and use the final Curl instance to access the cached content.
3. Otherwise, create a Curl instance with another URL, init_url, where a user-agent header must be present. This Curl instance must bypass the cache, because we need a valid cookie from the server.
4. Process the cookies and include them in the request headers of the final request.
5. Finally, create a Curl object for the URL of the supplementary file that we want to download, using the headers that contain the cookie.
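
The steps above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the code in cell.py and not the pypath.share.curl API: the names download_with_cookie, final_request_headers and cookie_header are made up for this example, and fetch is a hypothetical callable that takes a URL and a dict of request headers and returns the response body together with a list of (name, value) response headers.

```python
import os
from http.cookies import SimpleCookie


def cookie_header(set_cookie_values):
    """Merge raw Set-Cookie header values into one Cookie header value."""
    jar = SimpleCookie()
    for raw in set_cookie_values:
        jar.load(raw)
    # Only name=value pairs go back to the server; attributes such as
    # Path or HttpOnly are dropped.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def final_request_headers(init_response_headers, user_agent):
    """Build the headers of the final request from the first response."""
    set_cookies = [
        value
        for name, value in init_response_headers
        if name.lower() == "set-cookie"
    ]
    headers = {"User-Agent": user_agent}
    cookies = cookie_header(set_cookies)
    if cookies:
        headers["Cookie"] = cookies
    return headers


def download_with_cookie(url, init_url, cache_path, fetch, user_agent):
    """Return the file content, from the cache if it is already there.

    `fetch` is a hypothetical callable: fetch(url, headers) -> (body, headers).
    """
    # The cache path is known before any request is made.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fp:
            return fp.read()
    # The first request is never served from the cache: we need a fresh cookie.
    _, init_headers = fetch(init_url, {"User-Agent": user_agent})
    headers = final_request_headers(init_headers, user_agent)
    # The final request carries the cookie obtained above.
    body, _ = fetch(url, headers)
    with open(cache_path, "wb") as fp:
        fp.write(body)
    return body
```

Passing the fetch function in as an argument keeps the cache and cookie logic separate from the HTTP client, so the same flow could be tried with any client before porting it to Curl objects.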

https://github.com/saezlab/pypath/blob/master/pypath/inputs/cosmic.py
This is an example of password authentication. The user has several ways to provide their password: in a file in the config directory or somewhere else, through the pypath.share.settings module, or by passing it directly to the function.
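
A minimal sketch of this kind of lookup order, with several assumptions: the function names, the settings keys "user" and "password", and the two-line file format are invented for illustration and are not what cosmic.py or pypath.share.settings use. It also assumes the server accepts HTTP Basic authentication (what curl's -u option sends), which may or may not hold for a given database.

```python
import base64
import os


def resolve_credentials(user=None, password=None, settings=None, secrets_file=None):
    """Return (user, password) from the first source that provides both.

    Order: explicit arguments, then a settings mapping, then a file with
    the user name on the first line and the password on the second
    (a hypothetical format).
    """
    if user and password:
        return user, password
    settings = settings or {}
    if settings.get("user") and settings.get("password"):
        return settings["user"], settings["password"]
    if secrets_file and os.path.exists(secrets_file):
        with open(secrets_file) as fp:
            lines = [line.strip() for line in fp if line.strip()]
        if len(lines) >= 2:
            return lines[0], lines[1]
    raise ValueError("No credentials found in arguments, settings or file.")


def basic_auth_header(user, password):
    """Build the Authorization header of HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Raising an error when nothing is found makes a missing password visible immediately, instead of letting the server answer with a login page that then gets parsed as data.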

https://github.com/saezlab/pypath/blob/master/pypath/inputs/innatedb.py
Here we only have to add a browser user-agent header; otherwise the server doesn't respond properly.
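
For illustration, this is how a browser-like user-agent header can be set with the standard library alone. It is a sketch, not the code in innatedb.py: the function name and the user-agent string are arbitrary examples, and which string a given server accepts has to be found by trial.

```python
import urllib.request

# An example browser user-agent string; any realistic one may do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a request object that identifies itself as a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The request is only built here, not sent; passing it to urllib.request.urlopen would perform the download. Without the header, urllib identifies itself as Python-urllib, which some servers reject or answer differently.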

https://github.com/saezlab/pypath/blob/master/pypath/inputs/protmapper.py
Sometimes it's quite difficult to find out why a request fails. Here, for example, disabling ALPN (Application-Layer Protocol Negotiation, a TLS extension) turned out to be the solution.

A very useful tool is your browser's Inspector: on the Network tab you can inspect the request and response headers of each request and, by right-clicking a request, copy it as a curl command-line call.
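
Once a request has been copied as a curl command, its headers can be pulled out and tried one by one from Python. The helper below is a hypothetical convenience for that, not part of pypath; it assumes the bash flavour of the copied command, with headers given as -H 'Name: value', and it does not handle bash $'...' quoting.

```python
import shlex


def headers_from_curl(command):
    """Extract the -H/--header values of a copied curl command as a dict."""
    # Browsers split the command over several lines with backslashes.
    tokens = shlex.split(command.replace("\\\n", " "))
    headers = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            # Split on the first colon only: values may contain colons.
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers
```

Starting from the full set of headers and removing them one at a time is a simple way to find out which ones the server actually requires.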

I hope this helps. If you experience any difficulties, just let me know.

Best,

Denes

elifcevrim added the "bug" (Problem in the code) and "help wanted" (User needs help) labels and removed the "bug" (Problem in the code) label · Feb 14, 2022

deeenes self-assigned this Feb 14, 2022
