Sprint 11 Task List #456

akotlar · 2024-04-01T19:00:49Z

Due date for Sprint 11 - May 16th.

General

Improve installation instructions so that the annotator can be installed without issue on Amazon Linux 2023 - @dlin30 - 2024/05/08

Proteomics

Datasets: https://www.synapse.org/#!Synapse:syn53420674.1/datasets/, https://www.synapse.org/#!Synapse:syn31822992/wiki/617907

Add support for somascan upload to bystro webapp @dlin30 - April 8/9 - Currently under review by @akotlar
Add support for somascan upload through api - @akotlar - April 5th
Jupyter notebook demonstrating adjusting for batch effects using Domain adaptation and imputation on neuroscience data, and in simulation - @austinTalbot7241993
Jupyter notebook demonstrating adjusting for batch effects using Domain adaptation and imputation on ~300 samples TMT + SomaScan data - @akotlar - 2024-05-08
SPPCA version for low sample scenarios - @austinTalbot7241993, Ilha - 2024/05/08
Generate network analysis results using SPPCA on ~300 sample dataset @akotlar - 2024/05/15
Improve protein abundance filtering using genetic data with the goal of 1) supporting any annotation features that Bystro outputs, (stretch) 2) outputting arrays of structs instead of struct of arrays for multi-field annotations that are requested @akotlar - 2024-05-10
API endpoint to submit filtering prot jobs @dlin30 - 2024/05/06 for initial PR, 2024/05/08 for merged PR
Finish queue listener code v1 Feature/streaming proteomics #434 @dlin30 - April 14
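The "arrays of structs instead of struct of arrays" stretch goal above can be illustrated with a small sketch. This is only an illustration of the two layouts, not Bystro's output code; the function name and the example field names (`id`, `AF`) are hypothetical.

```python
from typing import Any


def struct_of_arrays_to_array_of_structs(soa: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Convert {"field": [v0, v1, ...]} into [{"field": v0, ...}, {"field": v1, ...}]."""
    if not soa:
        return []
    lengths = {len(values) for values in soa.values()}
    if len(lengths) != 1:
        raise ValueError(f"fields have unequal lengths: {sorted(lengths)}")
    fields = list(soa)
    # zip(*columns) walks the columns in lockstep, yielding one row per record.
    return [dict(zip(fields, row)) for row in zip(*soa.values())]


# Hypothetical multi-field annotation for one variant (field names are illustrative).
gnomad_soa = {"id": ["rs1", "rs2"], "AF": [0.01, 0.20]}
gnomad_aos = struct_of_arrays_to_array_of_structs(gnomad_soa)
```

In the array-of-structs form each record keeps its fields together, which makes it harder to pair a value from one field with the wrong index in another.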

PRS

Goal for Sprint 11: Have a PRS C+T running through the webapp (with display of results in webapp potentially in sprint 12)
Provisional for Sprint 13: Have this deployed to IBDGC (we'll need information from them for what they'll find useful in terms of GWAS summary stats)

Important to IBDGC (and likely other consortiums).
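For readers unfamiliar with PRS C+T (clumping + thresholding), here is a minimal sketch of the idea, assuming a small in-memory dosage matrix and per-variant GWAS effect sizes and p-values. The function names, default thresholds, and toy data are assumptions for illustration, not the Bystro implementation; real pipelines typically clump within genomic windows and tune the thresholds.

```python
import numpy as np


def clump_and_threshold(genotypes, pvalues, p_threshold=5e-8, r2_threshold=0.1):
    """Return indices of variants kept by p-value thresholding plus greedy LD clumping.

    genotypes: (n_samples, n_variants) dosage matrix used to estimate LD.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    pvalues = np.asarray(pvalues, dtype=float)
    # Squared Pearson correlation between variants, used here as the LD measure.
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    # Thresholding: only variants passing the p-value cutoff, most significant first.
    candidates = [int(i) for i in np.argsort(pvalues) if pvalues[i] <= p_threshold]
    kept = []
    for i in candidates:
        # Clumping: drop a variant if it is in LD with an already kept, stronger one.
        if all(r2[i, j] < r2_threshold for j in kept):
            kept.append(i)
    return kept


def polygenic_score(genotypes, betas, kept):
    """Weighted sum of dosages over the kept variants, one score per sample."""
    genotypes = np.asarray(genotypes, dtype=float)
    betas = np.asarray(betas, dtype=float)
    return genotypes[:, kept] @ betas[kept]


# Toy data: variants 0 and 1 are in perfect LD, variant 2 is independent of both,
# variant 3 does not pass the p-value threshold.
G = np.array([
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [2, 2, 2, 1],
    [0, 0, 2, 0],
    [1, 1, 1, 2],
    [2, 2, 0, 2],
])
betas = np.array([0.5, 0.4, -0.3, 0.1])
pvalues = np.array([1e-9, 1e-8, 1e-10, 0.2])

kept = clump_and_threshold(G, pvalues)
scores = polygenic_score(G, betas, kept)
```

Here variant 1 is clumped away because it duplicates variant 0, and variant 3 is removed by the threshold, so the score uses variants 2 and 0 only.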

Covariance Matrix Estimation

Overall goal: improve network analysis, regressions, clustering, and anything else that relies on a covariance matrix. The empirical covariance matrix is not a good estimator, especially in small sample sizes.

Incorporate singular value shrinkers - Ilha - 2024/05/08
Start testing performance of these shrinkers on gaussian data / rank 1 covariance matrices - Ilha - 2024/05/08
Implement a non-negative covariance matrix estimator, which will be useful for genetic methods where we can assume no negative correlations (like rare variant analysis) - Ilha - 2024/05/16
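To make the small-sample problem concrete, the sketch below compares the empirical covariance with a simple linear shrinkage toward a scaled identity. This is a stand-in chosen for brevity: it is not the singular value shrinkers or the non-negative estimator listed above, and the fixed `alpha` is an assumption (estimators in the Ledoit and Wolf family choose the shrinkage intensity from the data).

```python
import numpy as np


def linear_shrinkage_covariance(X, alpha=0.5):
    """Shrink the empirical covariance of X (n_samples, n_features) toward mu * I."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)
    mu = np.trace(S) / S.shape[0]  # average variance, sets the scale of the target
    return (1.0 - alpha) * S + alpha * mu * np.eye(S.shape[0])


rng = np.random.default_rng(0)
n_samples, n_features = 20, 50  # fewer samples than features
X = rng.standard_normal((n_samples, n_features))  # true covariance is the identity

S_emp = np.cov(X, rowvar=False)
S_shrunk = linear_shrinkage_covariance(X, alpha=0.5)

# With n_samples < n_features the empirical covariance is singular,
# while the shrunk estimate has all eigenvalues bounded away from zero.
min_eig_emp = np.linalg.eigvalsh(S_emp).min()
min_eig_shrunk = np.linalg.eigvalsh(S_shrunk).min()

# Frobenius distance to the true covariance (known here because the data are simulated).
err_emp = np.linalg.norm(S_emp - np.eye(n_features))
err_shrunk = np.linalg.norm(S_shrunk - np.eye(n_features))
```

The singular empirical estimate cannot be inverted, which is what breaks downstream steps such as whitening or regression.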

Infrastructure

Improve Python wheels so that they're deployable on ARM Mac, X86 Mac, Linux - @akotlar - 2024/05/16 for ability to install on Mac.

Post IBDGC Tasks


cristinaetrv · 2024-04-03T19:29:48Z

Notes on "Add automated report generation for the highly consequential variants (with far fewer annotations)"

Gene name, clinvar, exonicAlleleFunction, siteType, cadd, sampleMaf, gnomad.genomes/exomes.AF, gnomad.genomes/exomes id, amino acid substitution, codon number
VEP-like CONSEQUENCE - add that
- most severe and canonical - just the most severe

Needs to be easily digestible - either a saved subset of the full,

The processing will be done by an analyst; the output for the end user (Judy) will need to be very digestible.


akotlar · 2024-04-19T19:28:24Z

2024-04-19 Sprint 10 Retro

Overall what has been accomplished:

IBDGC deployment has taken most of @akotlar's time, resulting in a reworked upload system and bug fixes (submissionid / job failure)
@akotlar's proteomics tasks will roll over as a result
@cristinaetrv PRS tasks on hold until she is back from sick leave
@austinTalbot7241993 Proteomics: 3 steps that need to happen:
1. Imputing missing values is important (the SomaScan & TMT datasets had many missing values). Normally we would mean-impute, but we have a better strategy. This has been done and needs testing. It was done using Soft Impute (a rank-based matrix decomposition method): Soft impute CV #467. We have a cross validation scheme, and Austin has shown that we can explain 70% of variance on the 300 sample dataset. So we can now impute missing values, and that is a requirement for domain adaptation.
2. Domain adaptation: We want covariates to stay in their original space and we want to project new data in. Austin's solution is to make sure the 1st and 2nd moments align. That means we need to estimate the covariance matrix; it turns out that empirical covariance estimators are bad. He has focused on making respectable covariance estimation in Bystro. It will be used for our FAIRE machine learning method, POE, and for domain adaptation. He has shown that if you don't do this covariance estimation, domain adaptation makes things worse; otherwise it makes things better, reducing discrepancy between datasets by 25%. Future improvements will come from collaboration with Ilha. So we now have a harmonization scheme. Covariance Module Improvements #465. It may not be as good as TAMPOR, but it will be more interpretable, because the original covariate space is preserved.
3. Remaining (Sprint 11): Try to do outer join on TMT + SomaScan, rather than just inner join on TMT and inner join on SomaScan.
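The imputation step in item 1 can be sketched as follows. This is a minimal NumPy version of the Soft-Impute idea (iteratively fill missing entries from a singular-value soft-thresholded reconstruction) with a single hold-out split to pick the regularization parameter. It is not the code in Soft impute CV #467; the function names, the hold-out scheme, and the simulated data are assumptions for illustration. An outer join as in item 3 would produce exactly this kind of matrix with missing blocks.

```python
import numpy as np


def soft_impute(X, lam, n_iters=200, tol=1e-6):
    """Fill NaNs in X with a low-rank estimate via iterative soft-thresholded SVD."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)  # current low-rank estimate
    for _ in range(n_iters):
        filled = np.where(observed, X, Z)  # observed entries stay, others come from Z
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # soft-threshold the singular values
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return np.where(observed, X, Z)


def choose_lambda(X, lambdas, holdout_fraction=0.2, seed=0):
    """Pick lam by hiding some observed entries and scoring how well they are recovered."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    obs_idx = np.argwhere(~np.isnan(X))
    n_hold = max(1, int(holdout_fraction * len(obs_idx)))
    held = obs_idx[rng.choice(len(obs_idx), size=n_hold, replace=False)]
    rows, cols = held[:, 0], held[:, 1]
    X_train = X.copy()
    X_train[rows, cols] = np.nan
    errors = []
    for lam in lambdas:
        X_hat = soft_impute(X_train, lam)
        errors.append(float(np.mean((X_hat[rows, cols] - X[rows, cols]) ** 2)))
    return lambdas[int(np.argmin(errors))], errors


# Simulated rank-2 data with noise and roughly 20% of entries missing.
rng = np.random.default_rng(1)
truth = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 8))
X_missing = truth + 0.1 * rng.standard_normal(truth.shape)
X_missing[rng.random(truth.shape) < 0.2] = np.nan

lambdas = [0.1, 1.0, 5.0]
best_lam, cv_errors = choose_lambda(X_missing, lambdas)
X_imputed = soft_impute(X_missing, best_lam)
```

Observed entries are returned unchanged; only the missing positions are filled from the low-rank estimate.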

Summary for Sprint 11 Work

Proteomics Statistical Methods

Write up summary of performance of domain adaptation (with soft impute) vs TAMPOR, or domain adaptation followed by TAMPOR.
Run network analysis on TMT and SomaScan data
Run QTL analysis
Explore Stanford technique for improved logistic regression performance via matrix decomposition

This is an ambitious list. If they roll over, they roll over to the next sprint

Deliverable that we're aiming at over next 2 sprints: get the work/results in Erik Johnson's and Thomas Wingo's hands.

Proteomics API

SomaScan support (API upload is in)
Improved filtering api function will roll over @akotlar
Re-introduce file labeling @dlin30
Re-introduce FragPipe support @dlin30
Add SomaScan upload support @dlin30
API endpoint support for filtering will roll over @dlin30

This involves making a submission plugin for proteomic filtering (and a listener on the bystro side). Goal is by end of sprint, you can use the Bystro protein filtering API from a machine that is not on the cluster, routing the API command through the bystro api server @dlin30

PRS

Nothing was achieved, all work rolls over. @akotlar will take over until Cristina is back, best effort. Expecting that initial PRS solution is done by Sprint 11 end; so delay 3 weeks.

PRS excitement is high from Dave Cutler, Elizabeth Leslie's group (potentially, as informed by Julien, her lead bioinformatic analyst), and IBDGC.

Infrastructure and bystro webapp

Further improvements on hold with the possible exception of migrating from zip file downloads to either tar downloads, an improved/fixed zip download, or individual file downloads rather than zipping

We currently have an issue with unzipping the big_daly result, on HGCC (but not Mac, other Linux machines), complaining about a possible zip file "bomb". This may be a result of the zip file being large and the unzip program being compiled on x86 not x86-64. "error: invalid zip file with overlapped components (possible zip bomb)". TBD

What went well

Learned a lot: covariance matrix estimation is finicky in finite samples (finance guys: Ledoit and Wolf)
Learned a lot on deployment and worked through important large upload issues leading to massively improved upload system. As a result of forcing ourselves to deploy our work to IBDGC, we have pushed the project forward by months.
Adding more tests to webapp to make future improvements to upload system less error prone / us more confident in them.
Got SPPCA paper rough draft to Jarvis Chen at Harvard and proved that it outperforms L1 regularization and is competitive with L2 regularization, increasing the value and breadth of people that will be interested in this. Jarvis has also expressed interest in bringing this to a wide range of students at HSPH.
We sat with 2 users (Chris Tasted at IBDGC and Julien at Emory) as they tried to use our product

What didn't

The tests didn't get done quickly enough. A lot of learning on how to write tests for async code in javascript
It's never great to have bugs, and the upload system simply had not been tested enough.
Whenever there is learning, it means things are harder than expected and there are delays.

What is 1 thing that we will do differently this sprint?

Use our own code more often. Example: initialization scheme turned out to be critical for supervised PPCA; upload system was undercooked; bug was introduced that prevented jobs from being marked failed leading to "stuck" jobs.
As part of this, @akotlar will sit with more users.

akotlar · 2024-04-23T17:14:28Z

2024-04-23

Proteomics Topic Meeting

Austin:

Domain adaptation needs complete data, so we need to impute missing values; the 330 sample TMT/SomaScan data has relatively large missingness. Our Soft Impute CV module does well, giving around 70% variance explained in imputed data.
- The SoftImpute CV module selects a regularization parameter based on the observed data
- This simplifies our lives because this means that any statistical module we make can assume no missing data
- You need >30 samples
Nicole (Austin's wife) is a proteomicist; her thesis was on 6 samples. She separated proteins into transmembrane vs non-transmembrane and dropped missing values.
POE & Domain Adaptation: We need matrix methods; deep neural networks aren't the best bet. He has been working on the fact that empirical covariance matrix estimation is not good. Ilha will estimate 15 or so covariance matrix estimation methods. You take a whole bunch of experiments, create a mapping to a common mean and covariance matrix, then future experiments can also be projected into that space.
- He is also making methods to characterize performance. Will be useful for diagnoses as well.
He is also looking at singular value shrinkage methods.
Why he did PPCA: You could put in an option to either plot to the first 2 principal components or the first 2 that have nothing to do with race. This would be useful for Erik Johnson's denoising work.

akotlar · 2024-04-30T17:23:19Z

2024-04-30

Proteomics Topic Meeting

Domain Adaptation:

Goal is to learn a function that adjusts 1 dataset to match the mean and covariance matrix of the group
Austin recommends that we make sure the means and covariances of all batches align; most outlier detection depends on the first 2 moments. Estimating the variance in high dimensions is difficult, so we need to regularize the covariance matrix. The problem is we don't have enough samples to do the mapping and evaluate performance. There is no way to evaluate on real data.
- What would the minimum size be? Several thousand samples.
We have demonstrated that we get good performance on synthetic data.
The data we have is ~330 samples, the same samples. We should ask them for their 900 TMT dataset. 9000 proteins, all brain. Accessing this data is a bit easier because it is less identifiable. Thomas will email Nick and ask to get this.
Have we compared to TAMPOR?
- There is not a good way to compare to TAMPOR. The way to do this in ML is to use cross validation.
How was TAMPOR evaluated? Not very rigorously; e.g. they look at the differential expression signal, and see whether it seems right.
We could intentionally put some outlier point and see if we can detect it. There are 2 modes of running the mass spec, MS2 and MS3. We have a dataset, where they generated the data that had a mixture of MS2 and MS3 (400 samples, 93 or so were MS3). They re-ran the entire dataset, in just MS2
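The stated goal, a function that adjusts one dataset to match the mean and covariance of a reference group, can be sketched as a whiten-then-recolor linear map. This is an illustration under assumptions, not Austin's implementation: the function names and the `ridge` parameter are hypothetical, and the plain `np.cov` used here would be replaced by a regularized estimator in the high-dimensional setting discussed above.

```python
import numpy as np


def _symmetric_power(S, power, floor=1e-12):
    """Raise a symmetric positive semi-definite matrix to a real power via eigh."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, floor, None)  # guard against tiny or negative eigenvalues
    return (vecs * vals ** power) @ vecs.T


def fit_moment_matching(source, reference, ridge=0.0):
    """Return a function mapping source-like rows onto the reference mean and covariance."""
    source = np.asarray(source, dtype=float)
    reference = np.asarray(reference, dtype=float)
    p = source.shape[1]
    mu_s, mu_r = source.mean(axis=0), reference.mean(axis=0)
    # `ridge` is a placeholder for proper regularization (assumption of this sketch).
    cov_s = np.cov(source, rowvar=False) + ridge * np.eye(p)
    cov_r = np.cov(reference, rowvar=False) + ridge * np.eye(p)
    # Whiten with cov_s^(-1/2), then recolor with cov_r^(1/2).
    A = _symmetric_power(cov_r, 0.5) @ _symmetric_power(cov_s, -0.5)

    def transform(X):
        return (np.asarray(X, dtype=float) - mu_s) @ A.T + mu_r

    return transform


# Simulated batches with different location, scale, and correlation structure.
rng = np.random.default_rng(0)
reference = rng.standard_normal((150, 5)) @ rng.standard_normal((5, 5))
source = 3.0 * rng.standard_normal((200, 5)) + 10.0

transform = fit_moment_matching(source, reference)
adapted = transform(source)
```

With `ridge=0.0` and more samples than features, the adapted batch reproduces the reference mean and empirical covariance exactly (up to floating point), while each feature stays in the original covariate space.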

akotlar · 2024-05-04T17:15:16Z

2024-05-03

Proteomics topic meeting

ProteomicsPipelineDemonstration.ipynb.zip

@akotlar is working on adapting this to proteomics data

akotlar · 2024-05-07T17:19:30Z

2024-05-07

Proteomics Topic Meeting

Dennis got blocked by annotator installation (to create dev instance); running into installation issues, which are being documented and fixed.

Ilha is working this week on covariance estimation methods:

This week's work: constraining the covariance matrix to have non-negative entries. The expected correlation is -1/sqrt(mutation_rate_product), so slightly negative. Constraining entries to be non-negative therefore sets many of them to 0, which introduces sparsity.

Alex - on track for proteomics data; initial analysis on 300 sample CSF TMT + SomaScan, then 400 and 900 sample datasets that Thomas/Nick shared.

Austin - will share the Jupyter notebook demonstrating SPPCA on neuroscience data.

Common variant topic meeting

Austin/POE:
People have created hypothesis testing for detecting spikes in isotropic covariance matrices.
We whiten homozygotes, apply to heterozygotes.
We will implement a hypothesis test for detecting a single spike; we know that after you whiten heterozygotes, your covariance matrix will be isotropic with a single spike. This will result in a call and p-value.
Then we will focus on singular value shrinkers that give good estimates.
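A hypothesis test for a single spike in an otherwise isotropic covariance matrix could look like the sketch below. It assumes the data have already been whitened, uses the ratio of the largest eigenvalue to the mean eigenvalue as the test statistic, and calibrates it by Monte Carlo simulation under a Gaussian isotropic null. These choices, and the function names, are assumptions for illustration; the planned Bystro test may use a different statistic or an analytic null distribution.

```python
import numpy as np


def top_eigenvalue_ratio(X):
    """Largest eigenvalue of the sample covariance divided by the mean eigenvalue."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(X.T @ X / (X.shape[0] - 1))  # ascending order
    return eigvals[-1] / eigvals.mean()


def spike_test(X, n_null=200, seed=0):
    """Monte Carlo p-value for H0: the covariance is isotropic (no spike)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = top_eigenvalue_ratio(X)
    # The ratio is scale invariant, so unit-variance Gaussian noise serves as the null.
    null = np.array([
        top_eigenvalue_ratio(rng.standard_normal((n, p))) for _ in range(n_null)
    ])
    p_value = (1 + np.sum(null >= observed)) / (n_null + 1)
    return float(observed), float(p_value)


# Simulated example: isotropic noise plus one strong direction (the spike).
rng = np.random.default_rng(42)
n, p = 100, 10
direction = np.ones(p) / np.sqrt(p)
noise = rng.standard_normal((n, p))
spiked = noise + 3.0 * rng.standard_normal((n, 1)) * direction

stat_spiked, p_spiked = spike_test(spiked)
stat_null, p_null = spike_test(rng.standard_normal((n, p)))
```

The function returns both the statistic and a p-value, matching the "call and p-value" output described above; the call would follow from comparing the p-value to a chosen significance level.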

Rare variant topic meeting

Austin is trying to prove rare variant analysis is inherently impossible outside mendelian traits. He is showing that if you have many rare variants, and bound their effects (in terms of P(Disease|variant))...when having any mutation has a tiny effect, the population variance in having disease goes to 0; which is to say everyone has identical risk for having disease.

akotlar · 2024-05-08T16:56:32Z

2024-05-08

Austin - working on NeurIPS paper
Dennis - wrapping up installation guide
Cristina - PR'ing PRS today
Ilha - close to completing the singular value shrinkers; working on operator norm shrinker that is well suited for large n; genotyping data will use the non-negative covariance matrix estimator
Alex - Got the 300 sample data decompressed (required 7zip to avoid the "corruption" and refusal to decompress). In communication with Eric Dammers, who has explained what the files mean (same naming scheme as the olink/tmt/somascan paper)

akotlar · 2024-05-08T20:32:15Z

2024-05-10 Weekly Meeting

Agenda

Create an item in the task list if the work being undertaken is over 1/2 days of work; this helps us track new and necessary work that comes up post-sprint planning.

Discussion

Singular value shrinkers is still WIP - working on a version that handles any sample size
PRS - on track
Proteomics - behind a few days but will come back on track
Infrastructure - CVXPY & scikit-allel in particular presented issues during install on Arm Mac, need to follow up and find a solution (according to cvxpy/cvxpy#2075 this is now resolved)

akotlar added the ".task list" label (a checklist of smaller tasks) Apr 1, 2024

akotlar added this to the Sprint 10 milestone Apr 1, 2024

akotlar added a commit that referenced this issue Apr 18, 2024

#456: Add SomaScan API support (#466)

97fa0b5

* Adds support for SomaScan files in the API, and adds tests for the api endpoint.

akotlar changed the title ~~Sprint 10 Task List~~ Sprint 11 Task List Apr 30, 2024

cristinaetrv modified the milestones: Sprint 10, Sprint 11 Apr 30, 2024

akotlar added a commit to akotlar/bystro that referenced this issue May 17, 2024

issue bystrogenomics#456: filter proteomic data and join on any field

1ab2b6f

cristinaetrv closed this as completed May 21, 2024
