A translator for GEO, Gene Expression Omnibus #3299

Open
wants to merge 2 commits into
base: master


@mangost mangost commented Apr 12, 2024

No description provided.

Comment on lines +40 to +44
return "dataset";
// if (url.includes("acc.cgi?acc")) {
// return "dataset";
// }
// return false;
Member

We need to actually check if the page matches here (and ideally support search pages, if this site has them)
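A minimal sketch of what such a check could look like, assuming dataset pages are the ones served from `acc.cgi` with an `acc` query parameter (as in the example URL in the diff). The helper name `detectGeoPage` is hypothetical, and search-page support is left out because this thread does not establish how GEO search pages are structured.

```javascript
// Hypothetical helper: returns "dataset" only for GEO accession pages,
// and false for everything else (including strings that are not URLs).
function detectGeoPage(url) {
	let parsed;
	try {
		parsed = new URL(url);
	}
	catch (e) {
		return false;
	}
	// Assumption: accession pages live at .../acc.cgi?acc=<accession>
	const isAccessionPage = parsed.pathname.endsWith("/acc.cgi");
	const acc = parsed.searchParams.get("acc");
	if (isAccessionPage && acc) {
		return "dataset";
	}
	return false;
}
```

In the translator, `detectWeb(doc, url)` could return the result of a check like this instead of an unconditional `"dataset"`.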

Comment on lines +50 to +63
newItem.title = text(doc, '#ui-ncbiexternallink-1 > table > tbody > tr > td > table:nth-child(6) > tbody > tr:nth-child(3) > td:nth-child(2) > table > tbody > tr > td > table > tbody > tr > td > table:nth-child(6) > tbody > tr > td > table:nth-child(1) > tbody > tr:nth-child(3) > td:nth-child(2)');
newItem.abstractNote = text(doc, '#ui-ncbiexternallink-1 > table > tbody > tr > td > table:nth-child(6) > tbody > tr:nth-child(3) > td:nth-child(2) > table > tbody > tr > td > table > tbody > tr > td > table:nth-child(6) > tbody > tr > td > table:nth-child(1) > tbody > tr:nth-child(6) > td:nth-child(2)');
newItem.url = url;
// url is of format: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE251923
// subset the last part of the url to get the accession number
const acc = url.split("acc=")[1];
newItem.identifier = acc;
// status is of format: Public on Dec 27, 2023
// subset it to get the date
const status_str = text(doc, '#ui-ncbiexternallink-1 > table > tbody > tr > td > table:nth-child(6) > tbody > tr:nth-child(3) > td:nth-child(2) > table > tbody > tr > td > table > tbody > tr > td > table:nth-child(6) > tbody > tr > td > table:nth-child(1) > tbody > tr:nth-child(2) > td:nth-child(2)');
newItem.date = status_str.split("on")[1].trim();
// authors is of format: Chen J, Song Y, Huang J, Wan X, Li Y
// push into newItem.creators
const author_str = text(doc, '#ui-ncbiexternallink-1 > table > tbody > tr > td > table:nth-child(6) > tbody > tr:nth-child(3) > td:nth-child(2) > table > tbody > tr > td > table > tbody > tr > td > table:nth-child(6) > tbody > tr > td > table:nth-child(1) > tbody > tr:nth-child(10) > td:nth-child(2)');
Member

These aren't reasonable - I'm guessing output from Chrome/Firefox devtools? We need selectors that will remain stable between pages. It's possible that we'll have to walk through cells in the table and look at the labels ("Status", "Title") to figure out which field is which.
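One way the label-walking approach might be sketched, assuming each field sits in a table row with exactly two cells (label, then value). That layout is an assumption, not something verified against the live page, and `fieldsByLabel` is a hypothetical helper name.

```javascript
// Hypothetical helper: build a lowercase-label -> value map from
// two-cell table rows, so fields are found by label rather than position.
function fieldsByLabel(doc) {
	const fields = {};
	for (const row of doc.querySelectorAll("tr")) {
		// Only direct child cells, so nested tables are not flattened in
		const cells = Array.from(row.children);
		if (cells.length !== 2) continue;
		const label = cells[0].textContent.trim().toLowerCase();
		const value = cells[1].textContent.trim();
		// Keep the first occurrence of each label
		if (label && !(label in fields)) {
			fields[label] = value;
		}
	}
	return fields;
}
```

With a map like this, the translator could read `fields.title` or `fields.status` instead of depending on `nth-child` positions.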

newItem.url = url;
// url is of format: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE251923
// subset the last part of the url to get the accession number
const acc = url.split("acc=")[1];
Member

Breaks if there's more in the query string/an anchor on the URL.

Suggested change
- const acc = url.split("acc=")[1];
+ const acc = new URL(url).searchParams.get("acc");
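To illustrate the failure mode, here is a small comparison of the two approaches. The extra `view=full` parameter and `#top` anchor are made up for the example, and both function names are hypothetical.

```javascript
// Original approach: everything after "acc=" is returned, including
// any further query parameters and the fragment.
function extractAccNaive(url) {
	return url.split("acc=")[1];
}

// Suggested approach: let the URL parser isolate the "acc" parameter.
function extractAcc(url) {
	return new URL(url).searchParams.get("acc");
}

// Hypothetical URL with an extra parameter and an anchor
const sampleUrl = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE251923&view=full#top";
console.log(extractAccNaive(sampleUrl)); // "GSE251923&view=full#top"
console.log(extractAcc(sampleUrl)); // "GSE251923"
```

Note that `searchParams.get` returns `null` when the parameter is absent, so the caller may want to handle that case.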

// status is of format: Public on Dec 27, 2023
// subset it to get the date
const status_str = text(doc, '#ui-ncbiexternallink-1 > table > tbody > tr > td > table:nth-child(6) > tbody > tr:nth-child(3) > td:nth-child(2) > table > tbody > tr > td > table > tbody > tr > td > table:nth-child(6) > tbody > tr > td > table:nth-child(1) > tbody > tr:nth-child(2) > td:nth-child(2)');
newItem.date = status_str.split("on")[1].trim();
Member

I think ZU.strToISO(status_str) will extract the date fine, no need to split.

(Also rename to use camelCase, not snake_case.)

@@ -0,0 +1,141 @@
{
"translatorID": "5a325508-cb60-42c3-8b0f-d4e3c6441059",
"label": "GEO",
Member

Let's call this NCBI GEO for clarity.

Author

Thanks for your detailed review. I will refine the code when I have time.
