Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

no SNPs in common between prior studies and the conventional GWAS #7

Open
Xuemin-Wang opened this issue Jun 22, 2020 · 9 comments
Open

Comments

@Xuemin-Wang
Copy link

Dear author,

Another issue came along as I was using bGWAS (as in the title and below). I have checked the two ZMatrices files manually. There are large amount of SNPs in common among the gwas studies.

Adding data from the conventional GWAS :

"allEC_unmunged"
Done!
0 SNPs in common between prior studies and the conventional GWAS

Thresholding...

0 SNPs left after thresholding
breast cancer - Age of Menarche - PCOS - uterine fibroids - epithelial ovarian cancer - T2D - HDL cholesterol - hypertension - Pulse wave Arterial Stiffness index - Impedance of whole body - High light scatter retic ulocyte percentage - Creatinine (enzymatic) in urine - Alanine aminotransferase (U/L) - C-reactive protein (mg/L) - SHBG (nmol/L) - Ankle spacing width - Length of menstrual cycle - Vascular/heart problems diagnosed by doctor: Heart attack : removed (less than 2 instrument after thresholding)
0 studies left after thresholding
Pruning MR instruments...
distance : 500Kb

Then it stayed in this status for hours without progressing.

I read the original code, but didn't find a clue.

Do you have any idea what's going on?

Regards,
xuemin

@n-mounier
Copy link
Owner

n-mounier commented Jul 6, 2020

Hello,
Sorry again for the late answer. I think the problem comes from the fact that all the studies are removed because there are not enough strong instruments after thresholding:
0 studies left after thresholding
So after removing all studies, there are no SNP left in the Z-matrix, so no SNP in common with the prior GWAS (and that's why it gets stuck).

My guess is that most of the prior GWASs you're using (in the Z-matrix) have a quite low sample size, and therefore a small number of SNPs are reaching the threshold used to select instruments for MR so the package can't use them for the analysis. You could try two things:

  1. I noticed that you already changedMR_ninstruments = 3 to MR_ninstruments = 2 (that is the minimum number of instruments per prior GWASs needed to perform the analysis), you could also try using a less stringent threshold to select instruments for MR (by default, MR_threshold = 1e-06, try to set it to 1e-05 - note that this is the larger threshold allowed because larger thresholds could lead to the selection of invalid instruments, that are not strongly associated with the exposures)
  2. to check what is actually going on and make sure that there is indeed a lack of strong instruments (and not a problem with the selection of the instrument in the package), could you please have a look at the summary statistics of the prior GWASs included in the Z-matrix? Looking at the Manhattan Plot for each trait, for example, seeing how many loci are reaching the threshold used to select instruments?

Thank you.

@Xuemin-Wang
Copy link
Author

Dear Ninon,

Thanks very much for your info.

Sample sizes of the gwas sumstats are actually large, varying from 62k to hundreds of thousands. Some of the phenotypes have a causal effect on the focal phenotype (the conventional gwas) based on previous studies; so there should be some SNPs left after thresholding.

Is there a column of p-value included in the ZMatrix? I didn't see it in your example files. Does the gwas sumstats have to be imputed (by i.e. ssimp) before running bGWAS? I didn't impute gwas sumstats, which resulted in missing values for some SNPs in the full ZMatrix. Do you think this caused the issue? I tried to impute with ssimp, but an error occurred. I'm still working on the imputation to see if the above issue can be solved.

Many thanks,
xuemin

@n-mounier
Copy link
Owner

n-mounier commented Jul 21, 2020

Hello,

Thank you for the clarifications. Indeed, it seems that there should be (enough) instruments to perform the analysis.

Is there a column of p-value included in the ZMatrix?
We are working directly with z-scores : the matrix should contain the corresponding z-score for each SNP-trait association. I have updated the doc to make it clearer (https://github.com/n-mounier/bGWAS/blob/master/doc/ZMatrices.md#description). Then, the p-value threshold used to select instrument is converted into a z-score threshold during the analysis, and we're only working with z-scores.

Does the gwas sumstats have to be imputed (by i.e. ssimp) before running bGWAS? I didn't impute gwas sumstats, which resulted in missing values for some SNPs in the full ZMatrix. Do you think this caused the issue
It does not necessarily have to be imputed, assuming that there are enough SNPs in common in the different prior GWASs, and that SNPs that are instruments for one of these prior GWASs are also present in the other GWASs. Missing values in the full ZMatrix are not a problem (they're set to 0 when calculating the prior, i.e. if we don't know the effect on a SNP on a specific risk factor, the contribution of this risk factor to the prior will be null). SNPs with missing values should not be included in the Z-matrix containing instruments ("ZMatrix_MR") that is used to identify the relevant risk factors.
Could you please use this piece of code (applied to your own ZMatrix_MR) to check how many instruments you have for each risk factor (before pruning) and let me know how it looks like?

library(dplyr)
D <- data.table::fread("~/ZMatrices/ZMatrix_MR.csv.gz")

# keep only z-scores columns
D %>%
  dplyr::select(-c("rs", "chrm", "pos", "alt", "ref")) -> Zscores
nrow(Zscores)  
              
# are there any NA?
sum(is.na(Zscores)) # should be zero!
Zscores %>%
  tidyr::drop_na() -> Zscores_noNA
nrow(Zscores_noNA)

# how many instruments / prior GWASs after removing NA
Zlimit = -qnorm(1e-5/2)
Zscores_noNA %>%
  apply(MARGIN = 2, FUN = function(col){
    sum(abs(col)>Zlimit)})

@Xuemin-Wang
Copy link
Author

Xuemin-Wang commented Aug 17, 2020

Dear Ninon,

Thanks very much for your comments. Finally got bGWAS running without an error. The previous issue was indeed due to incorrect format of the input gwas summary statistics files.

Will imputation make big differences to the results? Currently I run bGWAS without imputing the gwas sumstats, so there are just over 1m SNPs in ZMatrix_Full.csv.gz. There are about 91k SNPs in ZMatrix_MR.csv.gz. However, no significant SNP was detected (please find attached the log file bGWAS_logfile.log), which does not seem right to me.

Besides "MR_threshold" and "MR_ninstruments", which parameters would you recommend to adjust to get reliable results?

Many thanks,
patrick

@Xuemin-Wang
Copy link
Author

There was also a Warning message at the end of the anlysis.
"Warning message
In stats::cor(all.priors[, 6:7]) : the standard deviation is zero"

Regards,
patrick

@Xuemin-Wang
Copy link
Author

Hi Ninon,

I tried to only keep the common SNPs across all gwas sumstats in the ZMatrix_Full.csv.gz, and bGWAS ran successfully without any error or warning. The previous failure might be due to large amount of missingness of the Z score in the full matrix. I'll try to impute the sumstats to see if things can be improved.

Regards,
patrick

@n-mounier
Copy link
Owner

n-mounier commented Sep 11, 2020

Hi Patrick,

I am glad that you (finally) managed to perform the analysis.
I don't really understand why removing NAs from the full ZMatrix solved the problem. The one I am using contains a fraction of NAs (about 2.5% of the values are missing), and I never ran into similar issue.

Looking at the log file, it seems that when you used your ZMatrix with NAs, the missing values were not correctly replaced by 0 before estimating the prior, that would explain why the correlation between prior effects and observed effects is NA (does not explain the warning message though) and also why you only have 291,355 SNPs left (when trying
I did not figure out a way to reproduce this behaviour, would be nice to know what you get when doing this:

D <- data.table::fread("~/ZMatrices/ZMatrix_Full.csv.gz", data.table=F)

# keep only z-scores columns
D %>%
  dplyr::select(-c("rs", "chrm", "pos", "alt", "ref")) -> Zscores
# fraction of missing values
sum(is.na(Zscores))/(nrow(Zscores)*ncol(Zscores)

# replace missing values by 0 (like in the bGWAS pipeline)
Zscores %>% 
  mutate_all(~replace(., is.na(.), 0)) -> Zscores0
sum(is.na(Zscores0))/(nrow(Zscores0)*ncol(Zscores0))

Thank you.

@Xuemin-Wang
Copy link
Author

Hi Ninon,

Thanks very much for your comments.

Sorry for the late update. I had been working on other stuff.

It started with 291,355 SNPs during "calculation of Bayes Factors and p-vales". Would you expect more SNPs for a total of 12,465,486 SNPs in the full ZMatrix?

I tried your scripts as suggested. There were around 15% of missing values in the full ZMatrix.

sum(is.na(Zscores))/(nrow(Zscores)*ncol(Zscores))
[1] 0.1462719

To investigate if the issue was actually caused by the incorrect replacement of NAs by 0. I've converted NAs to 0 as suggested and saved the new ZMatrix. Then bGWAS was conduted using the newly saved full ZMatrix. Correlation between prior and observed effects for SNPs was still NA. Additionally, the same warning message was still produced ("In stats::cor(all.priors[, 6:7]) : the standard deviation is zero").

Below is the 2nd part of the log file:

Out-of-sample R-squared for MR instruments across all chromosomes is NaN

Out-of-sample squared correlation for MR instruments across all chromosome is NA

Correlation between prior and observed effects for all SNPs is NA

Correlation between prior and observed effects for SNPs with GWAS p-value < 0.001 is NA

The file CoefficientsByChromosome.csv has been successfully written.
Done!
The file Prior.csv has been successfully written.
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<<< Calculation of Bayes Factors and p-values >>>

Calculating them for all SNPs

Computing observed Bayes Factor for all SNPs...

Done!

Computing BF p-values...

using a distribution approach:
... getting approximated p-values using non-linear quantiles
... checking p-values near significance threshold

Estimating p-values for posterior effects...

Done!

Estimating p-values for direct effects...

Done!
The file Prior.csv has been updated into Prior_BFp.csv

Pruning and identifying significant SNPs
Identification based on BFs
Starting with 291,355 SNPs

Selecting significant SNPs according to p-values...

0 SNPs left
Done!
Identification based on posterior effects
Starting with 291,355 SNPs

Selecting significant SNPs according to p-values...

4 SNPs left
Done!

Pruning significant SNPs...

distance : 500Kb
1 SNPs left
Done!
Identification based on direct effects
Starting with 291,355 SNPs

Selecting significant SNPs according to p-values...

0 SNPs left
Done!
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Time of the analysis: 10 minute(s) and 49 second(s).
Warning messages:
In stats::cor(all.priors[, 6:7]) : the standard deviation is zero

Here are the scripts used for the above analysis:
full <- data.table::fread("ZMatrix_Full.csv.gz", data.table=F)

keep only z-scores columns

full %>%
dplyr::select(-c("rs", "chrm", "pos", "alt", "ref")) -> Zscores

fraction of missing values

sum(is.na(Zscores))/(nrow(Zscores)*ncol(Zscores))
Zscores %>%
mutate_all(~replace(., is.na(.), 0)) -> Zscores0
sum(is.na(Zscores0))/(nrow(Zscores0)*ncol(Zscores0))
full %>%
dplyr::select(c('rs', 'chrm', 'pos', 'alt', 'ref')) -> snp_info
full_new <- cbind(snp_info, Zscores0)
fwrite(full_new, 'ZMatrix_Full.csv.gz', sep='\t', row.names=FALSE, quote=FALSE)

#use the newly saved full matrix in bGWAS
AllStudies = list_priorGWASs(Z_matrices = "/working/lab_mandys/xueminW/bGWAS/BMI_adj/traits_25")
MyGWAS = 1
B = bGWAS(Z_matrices = "/directory/to/ZMatrix",
name = "bGWAS",
GWAS = MyGWAS,
prior_studies = NULL,
MR_threshold = 1e-6,
MR_ninstruments = 3,
MR_pruning_dist = 500,
MR_pruning_LD = 0,
MR_shrinkage = 1,
stepwise_threshold = NULL,
prior_shrinkage = NULL,
sign_method = "p",
sign_thresh = 5e-8,
use_permutations= FALSE,
res_pruning_dist = 500,
res_pruning_LD = 0,
save_files = TRUE,
verbose = TRUE)

Do you have any idea of what might causing the issues?

Many thanks,
patrick

@Xuemin-Wang
Copy link
Author

Xuemin-Wang commented Oct 8, 2020

Just ran bGWAS with the ZMatrix_Full including only common SNPs across all traits (7,954,416 SNPs vs. 10,754,303 the union of SNPs across all traits). The correlation between prior and observed effects was no longer NA (0.1277 for all SNPs and 0.4098 for SNPs with GWAS p-value < 0.001) and the number of SNPs to start was 7,954,416. So I assume bGWAS ran successfully. Still no significant SNP was detected according to BFs p-values, though more than a dozen of SNPs were genome-wide significant based on the p value of direct or posterior effects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants