Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to adjust by multiple variables using ComBat-Seq? #33

Open
evaesquinas opened this issue Jul 26, 2022 · 3 comments
Open

How to adjust by multiple variables using ComBat-Seq? #33

evaesquinas opened this issue Jul 26, 2022 · 3 comments

Comments

@evaesquinas
Copy link

evaesquinas commented Jul 26, 2022

Dear @zhangyuqing ,

I am trying to adjust my RNA data using ComBat-Seq since I realised that there are 3 batches that I need to adjust for:

  • Place (2 levels: place 1, place 2)
  • Library Preparation Date (16 levels - different dates)
  • Type of tube (2 levels: A, B)

I have 960 samples and around 62000 genes.

In my biological matrix, I have: Age, Sex, Group (cases,controls..) and WBC counts.

biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + LYMPH + MONO + NEUT, data=phenotype)

In the tutorial of Combat-Seq appears how to adjust by 1 variable but it doesn't tell you how to adjust by more than 1.

I have seen a lot of posts using combat that the only way is to adjust by 1 variable and then, with those results, adjust again by the 2nd variable and so on.

That would be:

Adjust by library prep.

raw.cts_adjustedLibPrep <- ComBat_seq(counts = raw_cts_matrix, batch=batch_libraryprep, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube.

raw.cts_adjusted_LibPrep_TypeTube <- ComBat_seq(counts = raw.cts_adjustedLibPrep, batch=batch_type_tube, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube + place

 raw.cts_adjusted_LibPrep_TypeTube_Place <- ComBat_seq(counts = raw.cts_adjusted_LibPrep_TypeTube, batch=batch_place, group=NULL, covar_mod = biol_mat)

For the first adjustment (library prep) it takes around 15min, but for the second... it has been running for more than 2 days.. I stopped it and launch it again, changing the order of the adjustment but it is taking too much time and it seems that something is wrong...

Here I attach a similar example of my data:

Covariates (variables to adjust + biological information)

set.seed(11344)
covariates_df <- data.frame(
  Sample = paste0("A", seq(1,960)),
  Age = sample(seq(18, 70), size = 960, replace = TRUE),
  Group = sample(seq(1, 3), size = 960, replace = TRUE),
  Sex = sample(seq(1, 2), size = 960, replace = TRUE),
  WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
  Place = sample(seq(1, 2), size = 960, replace = TRUE),
  Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
  LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample

categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)


>str(covariates_df)

'data.frame':	960 obs. of  8 variables:
 $ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
 $ Age    : int  40 66 37 40 25 42 46 51 45 61 ...
 $ Group  : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
 $ Sex    : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
 $ WBC    : num  12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
 $ Place  : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
 $ Tubes  : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
 $ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...

The 3 variables that I want to adjust for will be: Place, Tubes and LibPrep. On the other hand, the biological information that I want to preserve would be Age, Group, Sex and WBC.

RNA raw counts

exp_df <- data.frame(replicate(960,sample(0:5216979,66000,rep=TRUE)))
colnames(exp_df) <- covariates_df$Sample
rownames(exp_df) <- paste0("Gene", rownames(exp_df))
exp_df_mat <- as.matrix(exp_df)

> str(exp_df)
'data.frame':	66000 obs. of  960 variables:
 $ A1  : int  3687756 4638393 4315502 316404 2209492 831342 4261323 1283200 1522808 1088673 ...
 $ A2  : int  4645779 3495763 4782397 2869628 2402288 1012125 1267979 1361720 1919367 4859438 ...
 $ A3  : int  2883976 4770076 228011 915940 19038 4650368 112632 1316246 1926498 484384 ...
 $ A4  : int  3496840 3676764 2763012 2723528 944224 3809955 5054054 1122139 116722 4090191 ...
 $ A5  : int  3140122 650422 2075888 2987814 1656462 2863317 155120 1086391 3163073 625809 ...
 $ A6  : int  4084796 1932277 3491059 4898410 4183070 4470479 4193882 928271 3282841 4418068 ...
 $ A7  : int  765797 3153554 5075853 4775395 3582194 4274642 1530455 1433179 4757168 2209519 ...
 $ A8  : int  1652346 3505656 111027 3170748 5087383 912180 2693545 1907366 3135340 548296 ...
 $ A9  : int  441053 5132021 1857530 2413007 1218852 3614836 4388581 106105 3270886 3840996 ...
 $ A10 : int  995597 2076013 1576689 1857073 1435772 3788330 3983860 816969 733090 1357226 ...
 $ A11 : int  4944929 5182067 4415296 3224862 145068 3533346 4174175 2406340 4572529 4820674 ...
 $ A12 : int  2731408 1896439 5063233 4621405 2686720 1507796 4764591 887296 2257893 384470 ...

The biological matrix would be:
biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + WBC, data=covariates_df)

head(biol_mat)

(Intercept) Age as.factor(Sex)2 as.factor(Group)2 as.factor(Group)3  WBC
A1           1  40               0                 1                 0 12.8
A2           1  66               1                 1                 0 29.9
A3           1  37               1                 1                 0 25.7
A4           1  40               0                 0                 1  0.8
A5           1  25               0                 1                 0 30.8
A6           1  42               1                 1                 0  5.9

Could you kindly help me, please? I have tried all the possibilities and I do not really know if it is something wrong my code or it is because combat_seq cannot handle too much amount of data.

Thanks very much in advance
Regards

@punyung
Copy link

punyung commented Dec 9, 2022

@evaesquinas Hi, I've tried your example data and it took me roughly 13 hours to finish the adjustment of 3 batches. I hope this can help you.

@maxnest
Copy link

maxnest commented Jan 11, 2023

@evaesquinas hi, did you manage to figure out what is the best way to deal with the adjustment of multiple batches? Have you tried any other programs for this task? Faced the same problem on my data and still can't continue the analysis as the adjustment takes too long and seems to be endless. I would be grateful for any help and advice

@evaesquinas
Copy link
Author

@maxnest Hi! No, I did not sorry. I ended adjusting only by 1 batch. And I did not find any other way to solve the problem. Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants