Optimize PRS C+T preprocess for performance. #493
Conversation
if column != "locus":
    condition = ds.field(column) != -1
    conditions.append(condition)
combined_condition = conditions[0]
This makes sense as a first step; could @austinTalbot7241993 and you work together on an imputation strategy (not sure what is right here; SoftImputeCV I believe is intended only for continuous values) for future iterations?
Maybe it makes sense to do imputation on the dosage matrix upstream of all analyses, rather than per analysis, since we're imputing for ancestry as well? Would it make sense to produce an imputed version of the dosage matrix at the time the dosage matrix is generated?
return scores[scores["P"] < p_value_threshold]

def read_feather_in_chunks(file_path, columns=None, chunk_size=1000):
    """Read a Feather file in chunks as pandas DataFrames."""
    table = feather.read_table(file_path, columns=columns)
Neat! This is way better than what I was doing before (which was lower level, using the dataset API).
# TODO: Add customizable p value threshold and option for multiple thresholds
thresholded_scores = filter_scores_by_p_value(scores, 0.05)
p_value_threshold = 0.05
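For context, the thresholding step being discussed is a one-line pandas filter; the function body matches the diff above, while the mock scores table here is purely illustrative:

```python
import pandas as pd

def filter_scores_by_p_value(scores: pd.DataFrame, p_value_threshold: float) -> pd.DataFrame:
    """Keep only rows whose GWAS p-value is below the threshold."""
    return scores[scores["P"] < p_value_threshold]

# Illustrative scores table indexed by locus.
scores = pd.DataFrame(
    {"P": [0.001, 0.2, 0.04]},
    index=["chr1:100", "chr1:200", "chr1:300"],
)
thresholded_scores = filter_scores_by_p_value(scores, 0.05)
```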
Reading in row and column chunks, very nice
@@ -171,38 +187,6 @@ def test_filter_scores_by_p_value(mock_processed_scores_df: pd.DataFrame):
), "Filtered scores should contain expected SNP(s)."
Planning on adding back a new test for this function once the dust has settled on optimizing this code. For now, the changes made it really difficult to get the mock files to work for this test, so I opted to take it out while changes are still being made. I did verify that all of this code works on a small dataset, though, so the only problems are related to mocking it correctly.
set_A = set(thresholded_scores.index)
set_B = set(dosage_feather["locus"])
set_B = set(dosage_loci_nomiss["locus"])

overlap_snps = set_A.intersection(set_B)
Instead of doing this, add an optional argument to _extract_nomiss_dosage_loci: desired_loci. Pass in the set of loci, which is set(thresholded_scores.index), and make one of the filter conditions the selection of just those loci. You will further reduce memory usage, to a constant amount.
This would be good, but not important for now.
for col in row.index:
    if col != "allele_comparison":
    if col != "allele_comparison" and row[col] != -1:
I would leave this a bit more general so it also works with null missing values, for backward compatibility: https://github.com/bystrogenomics/bystro/blob/master/python/python/bystro/ancestry/inference.py#L276
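A sketch of the more general check the reviewer is suggesting: treat both the -1 sentinel and null (NaN/None) as missing, so older dosage matrices still work. The helper name and row data are illustrative:

```python
import pandas as pd

def is_missing(value) -> bool:
    """Treat both the -1 sentinel and null (NaN/None) as missing."""
    return pd.isna(value) or value == -1

# Illustrative row: one metadata column plus per-sample dosages.
row = pd.Series({"allele_comparison": "ref", "s1": 2, "s2": -1, "s3": None})
present = [
    col for col in row.index
    if col != "allele_comparison" and not is_missing(row[col])
]
```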
return clean_scores_for_analysis(max_effect_per_bin, "ID_effect_as_ref")

def extract_clumped_thresholded_genos(
This will not work as written for large dataframes because, even for the subset of loci, we will not always be able to load all samples into memory; we'll want to perform PRS on groups of samples and save out just the PRS scores per sample.
I'm ok with waiting until the next PR to update this.
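The per-sample-group PRS the reviewer describes could look roughly like this. Everything here is a hypothetical sketch: the function name, arguments, and data are illustrative, and in practice each chunk of sample columns would be read from the dosage matrix rather than sliced from an in-memory frame:

```python
import pandas as pd

def prs_in_sample_chunks(genos: pd.DataFrame, effect_weights: pd.Series,
                         sample_chunk_size: int = 2) -> pd.Series:
    """Compute PRS as a weighted dosage sum, a chunk of sample columns at a time.

    genos: loci x samples dosage matrix; effect_weights: indexed by locus.
    Only one chunk of sample columns is held as a dense array at a time.
    """
    weights = effect_weights.reindex(genos.index).to_numpy()
    scores = {}
    samples = list(genos.columns)
    for start in range(0, len(samples), sample_chunk_size):
        chunk = samples[start:start + sample_chunk_size]
        # Weighted sum of dosages for just this chunk of samples.
        scores.update(dict(zip(chunk, genos[chunk].to_numpy().T @ weights)))
    return pd.Series(scores)

# Illustrative usage: 2 loci x 3 samples.
genos = pd.DataFrame({"A": [1, 2], "B": [0, 1], "C": [2, 0]},
                     index=["l1", "l2"])
effect_weights = pd.Series({"l1": 0.5, "l2": 1.0})
prs_scores = prs_in_sample_chunks(genos, effect_weights, sample_chunk_size=2)
```

Only the final per-sample scores are accumulated, so memory stays bounded by the chunk size regardless of the total sample count.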
See comments; primarily we need to adjust _extract_nomiss_dosage_loci
Missed that we actually do keep only the locus column.
Approved; see comment on extract_clumped_thresholded_genos
Optimizes the C+T preprocess by performing clumping and thresholding on the GWAS summary-statistic scores using only the loci that overlap with the locus column of the dosage matrix, rather than including the entire datasets in the preprocessing steps.