
Optimize PRS C+T preprocess for performance. #493

Merged
Merged 5 commits into bystrogenomics:master on May 13, 2024

Conversation

@cristinaetrv (Collaborator) commented May 8, 2024

Optimizes the C+T preprocess by performing clumping and thresholding on the GWAS summary statistic scores using only the loci that overlap with the locus column of the dosage matrix, rather than including the entire datasets in the preprocessing steps.

if column != "locus":
    condition = ds.field(column) != -1
    conditions.append(condition)
combined_condition = conditions[0]
@akotlar (Collaborator) commented May 10, 2024

This makes sense as a first step; could @austinTalbot7241993 and you work together on an imputation strategy (not sure what is right here; SoftImputeCV, I believe, is intended only for continuous values) for future iterations?

@cristinaetrv (Collaborator, Author) replied

Maybe it makes sense to do imputation on the dosage matrix upstream of all analyses, rather than per analysis, since we're imputing for ancestry as well? Would it make sense to produce an imputed version of the dosage matrix at the time the dosage matrix is generated?

    return scores[scores["P"] < p_value_threshold]


def read_feather_in_chunks(file_path, columns=None, chunk_size=1000):
    """Read a Feather file in chunks as pandas DataFrames."""
    table = feather.read_table(file_path, columns=columns)
@akotlar (Collaborator) commented

Neat! This is way better than what I was doing before (which was lower level, using the dataset API).

# TODO: Add customizable p value threshold and option for multiple thresholds
thresholded_scores = filter_scores_by_p_value(scores, 0.05)
p_value_threshold = 0.05
@akotlar (Collaborator) commented

Reading in row and column chunks, very nice

@@ -171,38 +187,6 @@ def test_filter_scores_by_p_value(mock_processed_scores_df: pd.DataFrame):
), "Filtered scores should contain expected SNP(s)."


@cristinaetrv (Collaborator, Author) commented

Planning on adding back a new test for this function once the dust has settled on optimizing this code. For now, the changes made it really difficult to get the mock files to work for this test, so I opted to take it out while changes are still being made. I did verify that all of this code works on a small dataset, so the only remaining problems are related to mocking it correctly.

@cristinaetrv cristinaetrv changed the title [WIP] Optimize PRS C+T preprocess for performance. Optimize PRS C+T preprocess for performance. May 13, 2024
set_A = set(thresholded_scores.index)
set_B = set(dosage_loci_nomiss["locus"])

overlap_snps = set_A.intersection(set_B)
@akotlar (Collaborator) commented May 13, 2024

Instead of doing this, add an optional argument to _extract_nomiss_dosage_loci: desired_loci. Pass in the set of loci, which is set(thresholded_scores.index), and make one of the filter conditions the selection of just those loci. You will further reduce memory usage, to a constant amount.

A collaborator replied

This would be good, but not important for now.

for col in row.index:
    if col != "allele_comparison" and row[col] != -1:
@akotlar (Collaborator) commented May 13, 2024

I would leave this a bit more general, so it also works with null missing values, for backward compatibility: https://github.com/bystrogenomics/bystro/blob/master/python/python/bystro/ancestry/inference.py#L276
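A minimal sketch of that generalization (the helper name `is_present` is mine; the linked inference.py code is the actual reference): treat a value as missing when it is either the -1 sentinel or a null (None/NaN), so older null-encoded matrices still work.

```python
import pandas as pd


def is_present(value, missing_sentinel=-1):
    """True when a dosage value is usable: neither a null (None/NaN)
    nor the -1 missing sentinel."""
    return not pd.isna(value) and value != missing_sentinel
```

The loop condition then becomes `if col != "allele_comparison" and is_present(row[col]):`.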

    return clean_scores_for_analysis(max_effect_per_bin, "ID_effect_as_ref")


def extract_clumped_thresholded_genos(
@akotlar (Collaborator) commented May 13, 2024

This will not work as written for large dataframes: even for the subset of loci, we will not always be able to load all samples into memory. We'll want to perform PRS on groups of samples and save out just the PRS scores per sample.

I'm ok with waiting until the next PR to update this.
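The per-sample-group approach suggested above could be sketched as follows. This is an illustration only (function name, argument shapes, and the dot-product formulation are my assumptions, not the PR's code): PRS for a sample is the sum over loci of dosage times effect weight, so samples can be processed in column chunks and only the scores retained.

```python
import pandas as pd


def prs_in_sample_chunks(dosages, weights, chunk_size=1000):
    """Sketch: per-sample PRS computed over groups of sample columns.

    dosages: loci x samples DataFrame (rows indexed by locus).
    weights: effect weights indexed by the same loci.
    Only chunk_size sample columns are densified at a time.
    """
    aligned = weights.reindex(dosages.index).fillna(0).to_numpy()
    scores = {}
    samples = list(dosages.columns)
    for start in range(0, len(samples), chunk_size):
        group = samples[start:start + chunk_size]
        # PRS per sample = sum over loci of dosage * effect weight
        block = dosages[group].to_numpy()
        scores.update(dict(zip(group, aligned @ block)))
    return pd.Series(scores)
```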

@akotlar (Collaborator) left a comment

See comments; primarily, we need to adjust _extract_nomiss_dosage_loci.

@akotlar akotlar dismissed their stale review May 13, 2024 21:22

Missed that we actually do keep only the locus column.

@akotlar (Collaborator) left a comment

Approved; see comment on extract_clumped_thresholded_genos.

@akotlar akotlar merged commit c6c0f93 into bystrogenomics:master May 13, 2024
3 checks passed