Optimize PRS C+T preprocess for performance. #493
Conversation
if column != "locus":
    condition = ds.field(column) != -1
    conditions.append(condition)
combined_condition = conditions[0]
This makes sense as a first step; could @austinTalbot7241993 and you work together on an imputation strategy (not sure what is right here; SoftImputeCV I believe is intended only for continuous values) for future iterations?
Maybe it makes sense to do imputation on the dosage matrix upstream of all analyses, rather than per analysis, since we're imputing for ancestry as well? Would it make sense to produce an imputed version of the dosage matrix at the time the dosage matrix is generated?
return scores[scores["P"] < p_value_threshold]

def read_feather_in_chunks(file_path, columns=None, chunk_size=1000):
    """Read a Feather file in chunks as pandas DataFrames."""
    table = feather.read_table(file_path, columns=columns)
Neat! This is way better than what I was doing before (which was lower level, using the dataset API).
# TODO: Add customizable p value threshold and option for multiple thresholds
thresholded_scores = filter_scores_by_p_value(scores, 0.05)
p_value_threshold = 0.05
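For context, the thresholding step being discussed is a one-line pandas filter; the function body matches the diff above, while the mock scores table here is purely illustrative:

```python
import pandas as pd

def filter_scores_by_p_value(scores: pd.DataFrame, p_value_threshold: float) -> pd.DataFrame:
    """Keep only rows whose GWAS p-value is below the threshold."""
    return scores[scores["P"] < p_value_threshold]

# Illustrative scores table indexed by locus.
scores = pd.DataFrame(
    {"P": [0.001, 0.2, 0.04]},
    index=["chr1:100", "chr1:200", "chr1:300"],
)
thresholded_scores = filter_scores_by_p_value(scores, 0.05)
```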
Reading in row and column chunks, very nice
@@ -171,38 +187,6 @@ def test_filter_scores_by_p_value(mock_processed_scores_df: pd.DataFrame):
), "Filtered scores should contain expected SNP(s)."
Planning on adding back a new test for this function once the dust has settled on optimizing this code. For now, the changes made it really difficult to get the mock files to work for this test, so I opted to take it out while changes are still being made. I did verify that all of this code works on a small dataset, though, so the only problems are related to mocking it correctly.
set_A = set(thresholded_scores.index)
set_B = set(dosage_feather["locus"])
set_B = set(dosage_loci_nomiss["locus"])

overlap_snps = set_A.intersection(set_B)
Instead of doing this, add an optional argument to _extract_nomiss_dosage_loci: desired_loci. Pass in the set of loci, which is set(thresholded_scores.index), and make one of the filter conditions the selection of just those loci. You will further reduce memory usage, to a constant amount.
This would be good, but not important for now.
for col in row.index:
    if col != "allele_comparison":
    if col != "allele_comparison" and row[col] != -1:
I would leave this a bit more general so it also works with null missing values, for backward compatibility: https://github.com/bystrogenomics/bystro/blob/master/python/python/bystro/ancestry/inference.py#L276
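A sketch of the more general check the reviewer is suggesting: treat both the -1 sentinel and null (NaN/None) as missing, so older dosage matrices still work. The helper name and row data are illustrative:

```python
import pandas as pd

def is_missing(value) -> bool:
    """Treat both the -1 sentinel and null (NaN/None) as missing."""
    return pd.isna(value) or value == -1

# Illustrative row: one metadata column plus per-sample dosages.
row = pd.Series({"allele_comparison": "ref", "s1": 2, "s2": -1, "s3": None})
present = [
    col for col in row.index
    if col != "allele_comparison" and not is_missing(row[col])
]
```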
return clean_scores_for_analysis(max_effect_per_bin, "ID_effect_as_ref")

def extract_clumped_thresholded_genos(
This will not work as written for large dataframes because, even for the subset of loci, we will not always be able to load all samples into memory; we'll want to perform PRS on groups of samples and save out just the PRS scores per sample.
I'm ok with waiting until the next PR to update this.
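The per-sample-group PRS the reviewer describes could look roughly like this. Everything here is a hypothetical sketch: the function name, arguments, and data are illustrative, and in practice each chunk of sample columns would be read from the dosage matrix rather than sliced from an in-memory frame:

```python
import pandas as pd

def prs_in_sample_chunks(genos: pd.DataFrame, effect_weights: pd.Series,
                         sample_chunk_size: int = 2) -> pd.Series:
    """Compute PRS as a weighted dosage sum, a chunk of sample columns at a time.

    genos: loci x samples dosage matrix; effect_weights: indexed by locus.
    Only one chunk of sample columns is held as a dense array at a time.
    """
    weights = effect_weights.reindex(genos.index).to_numpy()
    scores = {}
    samples = list(genos.columns)
    for start in range(0, len(samples), sample_chunk_size):
        chunk = samples[start:start + sample_chunk_size]
        # Weighted sum of dosages for just this chunk of samples.
        scores.update(dict(zip(chunk, genos[chunk].to_numpy().T @ weights)))
    return pd.Series(scores)

# Illustrative usage: 2 loci x 3 samples.
genos = pd.DataFrame({"A": [1, 2], "B": [0, 1], "C": [2, 0]},
                     index=["l1", "l2"])
effect_weights = pd.Series({"l1": 0.5, "l2": 1.0})
prs_scores = prs_in_sample_chunks(genos, effect_weights, sample_chunk_size=2)
```

Only the final per-sample scores are accumulated, so memory stays bounded by the chunk size regardless of the total sample count.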
See comments; primarily we need to adjust _extract_nomiss_dosage_loci
Missed that we actually do keep only the locus column.
Approved; see comment on extract_clumped_thresholded_genos
Optimizes the C+T preprocess by performing clumping and thresholding on the GWAS summary-statistic scores using only the loci that overlap with the locus column of the dosage matrix, rather than including the entire datasets in the preprocessing steps.