Error: Assigned data dfdc.final[[x]] - dfdc.final[[y]] must be compatible with existing data #190

Open
jfertaj opened this issue Sep 13, 2021 · 11 comments

Comments

@jfertaj

jfertaj commented Sep 13, 2021

Hi,

I am trying to run a quantification analysis using artMS and get the following error:
Error: Assigned data `dfdc.final[[x]] - dfdc.final[[y]]` must be compatible with existing data

The chunk of my code that triggers the error is this:

artmsAnalysisQuantifications(log2fc_file = "results.txt",
                              modelqc_file = "results_ModelQC.txt",
                              species = "human",
                              enrich = TRUE,
                              output_dir = "AnalysisQuantifications_followUP")

My sessionInfo() output is the following:

R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1   stringr_1.4.0   purrr_0.3.4     readr_2.0.1    
 [5] tidyr_1.1.3     tibble_3.1.4    ggplot2_3.3.5   tidyverse_1.3.1
 [9] dplyr_1.0.7     vroom_1.5.4     artMS_1.8.3    

loaded via a namespace (and not attached):
  [1] readxl_1.3.1               snow_0.4-3                
  [3] backports_1.2.1            circlize_0.4.13           
  [5] corrplot_0.90              BiocFileCache_1.14.0      
  [7] plyr_1.8.6                 lazyeval_0.2.2            
  [9] splines_4.0.5              digest_0.6.27             
 [11] foreach_1.5.1              htmltools_0.5.2           
 [13] fansi_0.5.0                magrittr_2.0.1            
 [15] memoise_2.0.0              cluster_2.1.2             
 [17] tzdb_0.1.2                 openxlsx_4.2.4            
 [19] limma_3.46.0               ComplexHeatmap_2.6.2      
 [21] modelr_0.1.8               matrixStats_0.60.1        
 [23] xts_0.12.1                 askpass_1.1               
 [25] prettyunits_1.1.1          colorspace_2.0-2          
 [27] rvest_1.0.1                blob_1.2.2                
 [29] rappdirs_0.3.3             ggrepel_0.9.1             
 [31] haven_2.4.3                crayon_1.4.1              
 [33] jsonlite_1.7.2             org.Mm.eg.db_3.12.0       
 [35] lme4_1.1-27.1              survival_3.2-13           
 [37] zoo_1.8-9                  iterators_1.0.13          
 [39] glue_1.4.2                 gtable_0.3.0              
 [41] UpSetR_1.4.0               seqinr_4.2-8              
 [43] GetoptLong_1.0.5           shape_1.4.6               
 [45] BiocGenerics_0.36.1        scales_1.1.1              
 [47] futile.options_1.0.1       pheatmap_1.0.12           
 [49] DBI_1.1.1                  Rcpp_1.0.7                
 [51] viridisLite_0.4.0          progress_1.2.2            
 [53] clue_0.3-59                flashClust_1.01-2         
 [55] bit_4.0.4                  preprocessCore_1.52.1     
 [57] stats4_4.0.5               DT_0.19                   
 [59] htmlwidgets_1.5.4          httr_1.4.2                
 [61] getopt_1.20.3              gplots_3.1.1              
 [63] RColorBrewer_1.1-2         ellipsis_0.3.2            
 [65] factoextra_1.0.7           farver_2.1.0              
 [67] pkgconfig_2.0.3            XML_3.99-0.7              
 [69] dbplyr_2.1.1               utf8_1.2.2                
 [71] labeling_0.4.2             tidyselect_1.1.1          
 [73] rlang_0.4.11               reshape2_1.4.4            
 [75] AnnotationDbi_1.52.0       cellranger_1.1.0          
 [77] munsell_0.5.0              tools_4.0.5               
 [79] cachem_1.0.6               cli_3.0.1                 
 [81] generics_0.1.0             RSQLite_2.2.8             
 [83] ade4_1.7-17                broom_0.7.9               
 [85] fastmap_1.1.0              ggdendro_0.1.22           
 [87] yaml_2.2.1                 fs_1.5.0                  
 [89] org.Hs.eg.db_3.12.0        bit64_4.0.5               
 [91] zip_2.2.0                  caTools_1.18.2            
 [93] nlme_3.1-153               formatR_1.11              
 [95] leaps_3.1                  xml2_1.3.2                
 [97] biomaRt_2.46.3             rstudioapi_0.13           
 [99] compiler_4.0.5             plotly_4.9.4.1            
[101] curl_4.3.2                 png_0.1-7                 
[103] marray_1.68.0              reprex_2.0.1              
[105] statmod_1.4.36             stringi_1.7.4             
[107] futile.logger_1.4.3        lattice_0.20-44           
[109] Matrix_1.3-4               nloptr_1.2.2.2            
[111] vctrs_0.3.8                pillar_1.6.2              
[113] lifecycle_1.0.0            MSstats_3.22.1            
[115] BiocManager_1.30.16        GlobalOptions_0.1.2       
[117] data.table_1.14.0          bitops_1.0-7              
[119] R6_2.5.1                   KernSmooth_2.23-20        
[121] gridExtra_2.3              IRanges_2.24.1            
[123] codetools_0.2-18           lambda.r_1.2.4            
[125] boot_1.3-28                MASS_7.3-54               
[127] gtools_3.9.2               assertthat_0.2.1          
[129] gProfileR_0.7.0            openssl_1.4.5             
[131] rjson_0.2.20               withr_2.4.2               
[133] minpack.lm_1.2-1           S4Vectors_0.28.1          
[135] PerformanceAnalytics_2.0.4 parallel_4.0.5            
[137] doSNOW_1.0.19              hms_1.1.0                 
[139] quadprog_1.5-8             VennDiagram_1.6.20        
[141] grid_4.0.5                 minqa_1.2.4               
[143] Cairo_1.5-12.2             lubridate_1.7.10          
[145] scatterplot3d_0.3-41       Biobase_2.50.0            
[147] FactoMineR_2.4

My keys.txt is attached

Thanks a lot
Juan

keys.txt

@biodavidjm
Owner

Hi @jfertaj
this is an easy fix. The issue is that you are not following the guidelines for the BioReplicate notation, i.e., this part:

Condition: The conditions names must follow these rules:
Use only letters (A - Z, both uppercase and lowercase) and numbers (0 - 9). The only special character allowed is underscore (_).
Very important: A condition name cannot begin with a number (R limitation).

BioReplicate: biological replicate number. It is based on the condition name. Use as prefix the corresponding Condition name, and add as suffix dash (-) plus the biological replicate number. For example, if condition H1N1_06H has two biological replicates, name them H1N1_06H-1 and H1N1_06H-2.

which means that, for example, for your condition NV1, instead of these bioreplicate names...

2N_V1
1N_V1
3N_V1
4N_V1
5N_V1
6N_V1
7N_V1
8N_V1
9N_V1
10N_V1
11N_V1
12N_V1
13N_V1
14N_V1
15N_V1

you should have these ones instead

NV1-1
NV1-2
NV1-3
NV1-4
NV1-5
NV1-6
NV1-7
NV1-8
NV1-9
NV1-10
NV1-11
NV1-12
NV1-13
NV1-14
NV1-15

The same goes for all the other conditions. Once you have the new keys file ready, you will have to run the Quantification step again and use the new results files with the new notation.
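As a rough illustration of the naming rules quoted above, the following sketch flags keys rows whose Condition or BioReplicate does not follow them, and builds suggested replicate names. `check_keys_naming` and the example data frame are hypothetical and not part of artMS; only the `Condition` and `BioReplicate` column names are taken from the keys file.

```r
# Hypothetical helper (NOT an artMS function): applies the naming rules quoted above.
check_keys_naming <- function(keys) {
  cond   <- as.character(keys$Condition)
  biorep <- as.character(keys$BioReplicate)

  # Condition: only letters, digits and underscore, and no leading digit.
  cond_ok <- grepl("^[A-Za-z0-9_]+$", cond) & !grepl("^[0-9]", cond)

  # BioReplicate: "<Condition>-<replicate number>".
  biorep_ok <- startsWith(biorep, paste0(cond, "-")) &
    grepl("^[0-9]+$", substring(biorep, nchar(cond) + 2))

  data.frame(Condition = cond, BioReplicate = biorep,
             condition_ok = cond_ok, bioreplicate_ok = biorep_ok,
             stringsAsFactors = FALSE)
}

# Made-up example rows, for illustration only.
keys_example <- data.frame(
  Condition    = c("NV1", "NV1", "NV1"),
  BioReplicate = c("NV1-1", "2N_V1", "NV1-x"),
  stringsAsFactors = FALSE
)

res <- check_keys_naming(keys_example)
print(res[!(res$condition_ok & res$bioreplicate_ok), ])

# Suggested names: condition, a dash, then a running number within each condition.
suggested <- paste0(
  keys_example$Condition, "-",
  ave(seq_along(keys_example$Condition), keys_example$Condition, FUN = seq_along)
)
print(suggested)
```

The suggested names simply number the rows within each condition in file order, so check that the numbering still maps to the intended samples before overwriting the keys file.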

The next version of artMS will warn the user about this.

Hope it helps
David

@jfertaj
Author

jfertaj commented Sep 13, 2021

Hi David,

I have corrected the keys.txt file, and then I realised that my version of artMS was outdated, so I have updated to version v1.10.2. However, when running artmsQuantification I get an error that I have posted about before:

  Join results in 10010711 rows; more than 764921 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Any idea why?

@biodavidjm
Owner

Hi,

I guess you are referring to this issue. I thought it had been resolved. The problem was not only the wrong notation for conditions and bioreplicates: bioreplicates must also be unique, and so must the Run column. Please make sure that is the case. See the attached "corrected" version of your keys.txt file (please give it a try):
keys-fixed.txt
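A minimal sketch of how the uniqueness requirement could be checked before running the quantification. The helper `duplicated_values`, the example values, and the `RawFile` column name are assumptions made for illustration; `duplicated()` and `setdiff()` are base R.

```r
# Hypothetical helper (NOT an artMS function): values that occur more than once.
duplicated_values <- function(x) unique(x[duplicated(x)])

# Made-up keys table; the RawFile column name is an assumption.
keys_example <- data.frame(
  RawFile      = c("file_a", "file_b", "file_c", "file_d"),
  Condition    = c("NV1", "NV1", "NV4", "NV4"),
  BioReplicate = c("NV1-1", "NV1-2", "NV4-1", "NV4-1"),  # NV4-1 repeated on purpose
  Run          = c(1, 2, 3, 3),                          # run 3 repeated on purpose
  stringsAsFactors = FALSE
)

dup_bioreps <- duplicated_values(keys_example$BioReplicate)
dup_runs    <- duplicated_values(keys_example$Run)
print(dup_bioreps)
print(dup_runs)

# Raw files present in the evidence file but absent from the keys file
# (the vector below stands in for unique(evidence$Raw.file)).
evidence_raw_files <- c("file_a", "file_b", "file_c", "file_d", "file_e")
missing_in_keys <- setdiff(evidence_raw_files, keys_example$RawFile)
print(missing_in_keys)
```

Note that `unique(sort(x))` only lists the distinct values; comparing `length(unique(x))` with the number of rows, or calling `anyDuplicated(x)`, is what actually reveals repeats.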

@jfertaj
Author

jfertaj commented Sep 13, 2021

Hi David,

I have tried with your version but the error is still there. Here is the full output:

---------------------------------------------------
artMS: BASIC QUALITY CONTROL (evidence.txt based)
---------------------------------------------------
>> MERGING FILES 
--(-) Raw.files in evidence not found in keys file:
 9-7-2021_40APP_V4_2575

-- Plot: correlation matrices
---- by Biological replicates 
---- by Conditions 
-- Plot: intensity stats
---- AB PROCESSED 
<< Basic quality control analysis completed!
---------------------------------------------
artMS: EXTENDED QUALITY CONTROL (-evidence.txt based)
---------------------------------------------
>> MERGING FILES 
--(-) Raw.files in evidence not found in keys file:
 9-7-2021_40APP_V4_2575

>> GENERATING QC PLOTS 
--- Plot PSM done 
--- Plot IONS done 
--- Plot TYPE done 
--- Plot PEPTIDES done 
--- Plot PEPTIDE OVERLAP done 
--- Plot PROTEINS done 
--- Plot PROTEIN OVERLAP done 
--- Plot Plot Ion Oversampling done 
--- Plot Charge State done 
--- Plot Mass Error done 
--- Plot Mass-over-Charge distribution done 
--- Plot Peptide Intensity CV done 
--- Plot Peptide Detection (using modified.sequence) done 
--- Plot Protein Intensity CV done 
--- Plot Protein Detection done 
--- Plot ID overlap done 
--- Plot PCA and Inter-Correlation (WARNING: it might take a long time. Please, be patient)
	(-) Skip peptide-based correlation matrix (too many samples)
	(-) Skip Protein-based correlation matrix (too many samples)
--- Plot Sample Preparation... done
>> QC extended completed
--------------------------------------------
artMS: Relative Quantification using MSstats
--------------------------------------------
>> Reading the configuration file
>> LOADING DATA 
>> MERGING FILES 
--(-) Raw.files in evidence not found in keys file:
 9-7-2021_40APP_V4_2575

>> CONVERT Intensity values < 1 to NA
>> FILTERING 
-- Contaminants CON__|REV__ removed
-- Removing protein groups
-- Use <Leading.razor.protein> as Protein ID
-- PROCESSING AB
>> CONVERTING THE DATA TO MSSTATS FORMAT 
-- Selecting Sequence Type: MaxQuant 'Modified.sequence' column
	(+) <Fraction> column added (with value 1, MSstats requirement)
-- Adding NA values for missing values (required by MSstats) 
-- Write out the MSstats input file (-mss.txt) 
>> RUNNING MSstats (it usually takes a 'long' time: please, be patient)
-- Normalization method: equalizeMedians
INFO  [2021-09-13 22:11:26] ** Features with one or two measurements across runs are removed.
INFO  [2021-09-13 22:11:27] ** Fractionation handled.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 8506154 rows; more than 757484 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
In addition: Warning messages:
1: ggrepel: 174 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
2: In RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

3: In RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

4: ggrepel: 32 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
5: ggrepel: 48 unlabeled data points (too many overlaps). Consider increasing max.overlaps 

Sorry for bothering you so much

@jfertaj
Author

jfertaj commented Sep 13, 2021

Also, I have created a new keys.txt from scratch, loaded it in R, and tested for unique values like this:

> unique(sort(keys$Run))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
> unique(sort(keys$BioReplicate))
 [1] "10N-V1" "10N-V4" "10P-V1" "10P-V4" "10S-V1" "10S-V4" "11N-V1" "11N-V4"
 [9] "11P-V1" "11P-V4" "11S-V1" "11S-V4" "12N-V1" "12N-V4" "12P-V1" "12P-V4"
[17] "12S-V1" "12S-V4" "13N-V1" "13N-V4" "13P-V1" "13P-V4" "13S-V1" "13S-V4"
[25] "14N-V1" "14N-V4" "14P-V1" "14P-V4" "14S-V1" "14S-V4" "15N-V1" "15N-V4"
[33] "15P-V1" "15P-V4" "15S-V1" "15S-V4" "1N-V1"  "1N-V4"  "1P-V1"  "1P-V4" 
[41] "1S-V1"  "1S-V4"  "2N-V1"  "2N-V4"  "2P-V1"  "2P-V4"  "2S-V1"  "2S-V4" 
[49] "3N-V1"  "3N-V4"  "3P-V1"  "3P-V4"  "3S-V1"  "3S-V4"  "4N-V1"  "4N-V4" 
[57] "4P-V1"  "4P-V4"  "4S-V1"  "4S-V4"  "5N-V1"  "5N-V4"  "5P-V1"  "5P-V4" 
[65] "5S-V1"  "5S-V4"  "6N-V1"  "6N-V4"  "6P-V1"  "6P-V4"  "6S-V1"  "6S-V4" 
[73] "7N-V1"  "7N-V4"  "7P-V1"  "7P-V4"  "7S-V1"  "7S-V4"  "8N-V1"  "8N-V4" 
[81] "8P-V1"  "8P-V4"  "8S-V1"  "8S-V4"  "9N-V1"  "9N-V4"  "9P-V1"  "9P-V4" 
[89] "9S-V1"  "9S-V4" 
> 

The number of unique elements is equal to 90, the original number of samples. I need to remove one sample after QC, but to start from scratch I have re-run it with the whole dataset.

@biodavidjm
Owner

Hi there,

Remember the important rules:

Condition: The conditions names must follow these rules:

  • Use only letters (A - Z, both uppercase and lowercase) and numbers (0 - 9). The only special character allowed is underscore (_).
  • Very important: A condition name cannot begin with a number (R limitation). @jfertaj You are not showing the condition names, but if they match the bioreplicates, you are breaking this rule.

BioReplicate: biological replicate number. It is based on the condition name. Use as prefix the corresponding Condition name, and add as suffix dash (-) plus the biological replicate number. For example, if condition H1N1_06H has two biological replicates, name them H1N1_06H-1 and H1N1_06H-2.

Have you tried the keys file that I included in my previous response?

@jfertaj
Author

jfertaj commented Sep 14, 2021 via email

@biodavidjm
Owner

Thanks! I'll take a look and get back to you soon

@jfertaj
Author

jfertaj commented Sep 19, 2021 via email

@biodavidjm
Owner

Hi Juan,

Sorry for the late response. I've been carefully debugging the issue, and I am sorry to report that this is not an artMS issue but rather an MSstats/data.table one. According to the error message:

INFO  [2021-09-22 08:15:39] ** Features with one or two measurements across runs are removed.
INFO  [2021-09-22 08:15:39] ** Fractionation handled.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 8512292 rows; more than 765628 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

and, based on similar errors found online, the error could be solved if the merge function that is called somewhere internally included the option allow.cartesian=TRUE (data.table does not enable it by default).
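For context, here is a small, self-contained sketch of how data.table raises this error and what `allow.cartesian = TRUE` changes. The two toy tables are invented for illustration and have nothing to do with the MSstats internals; whether forcing the option is the right fix there is an open question, since repeated key values often point to a problem in the input.

```r
library(data.table)

# Toy tables (made up): every row shares the same key value, so each of the
# 10 rows in `i` matches all 10 rows in `x`.
x <- data.table(k = rep(1L, 10), a = 1:10)
i <- data.table(k = rep(1L, 10), b = 1:10)

# Key values that occur more than once in `i` (the usual culprit).
print(i[, .N, by = k][N > 1])

# Default behaviour: the join would return 100 rows, which exceeds
# nrow(x) + nrow(i) = 20, so data.table stops with the "Join results in ..." error.
res <- tryCatch(x[i, on = "k"], error = function(e) conditionMessage(e))
print(res)

# Explicitly allowing the large result lets the join proceed.
joined <- x[i, on = "k", allow.cartesian = TRUE]
print(nrow(joined))
```

Listing the repeated key values first, as in the `.N` step, is usually more informative than enabling the option straight away.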

I strongly encourage you to report this error in the MSstats Google group. Specifically, it fails when running this MSstats function (normalization step):

mssquant = dataProcess(
  raw = dmss,
  logTrans = 2,
  normalization = "equalizeMedians",
  nameStandards = NULL,
  featureSubset = "all",
  remove_uninformative_feature_outlier = FALSE,
  min_feature_count = 2,
  n_top_feature = 3,
  summaryMethod = "TMP",
  equalFeatureVar = TRUE,
  censoredInt = "NA",
  MBimpute = 1,
  remove50missing = FALSE,
  fix_missing = NULL,
  maxQuantileforCensored = 0.999,
  use_log_file = FALSE,
  append = FALSE,
  verbose = TRUE,
  log_file_path = NULL
)

dmss is the evidence-mss.txt file generated by artMS; you could include it if they ask you for it.

However, let me point something out. The number of proteins identified is remarkably low:

In the evidence file:

> evidence %>% summarise_all(n_distinct)
  Sequence Length Modifications Modified.sequence Oxidation..M..Probabilities Oxidation..M..Score.Diffs Acetyl..Protein.N.term.
1     3658     37             6              3970                        1238                      5903                       2
  Oxidation..M. Missed.cleavages Proteins Leading.proteins Leading.razor.protein Gene.names Protein.names Type Raw.file Experiment
1             4                3      455              383                   340        383           381

After contaminants and protein group removal:

> dmss %>% summarise_all(n_distinct)
  ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate Run Fraction
1         308            3519               4           1             1                1         6           90  90        1
  Intensity
1    144852

Barely 308 proteins. Is this expected? Did you search with the right database? You should include this when asking in the MSstats group.

Please, let us know how it goes.

Thanks!

@J-Sha

J-Sha commented Nov 11, 2021

Hi Juan and David,

I just ran into the same issue as Juan, and I also have a relatively small dataset (~300 proteins). Did you find a solution for it?

Actually, I realized that this error happens right after the message "--- Number of +/- INF values: 344 ", which I think is during the imputeMissingValue step and the merging of the original log2FC values with the imputed ones. Here is the full error:
" Error: Assigned data dfdc.final[[x]] - dfdc.final[[y]] must be compatible with existing data.
x Existing data has 156 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled. "
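One generic way an "Assigned data has 0 rows" failure can arise in R is sketched below in base R; this is an assumption about the mechanism, not a confirmed diagnosis of the artMS code. If one of the two column names does not exist, `[[` returns NULL, the subtraction yields a zero-length vector, and assigning that to a new column fails. The data frame and column names are hypothetical; tibble reports the same kind of size mismatch with the wording quoted above.

```r
# Made-up data frame standing in for a results table.
df <- data.frame(a = c(1, 2, 3), b = c(3, 2, 1))
x <- "a"
y <- "missing_column"   # hypothetical: a column name that is not in df

# df[[y]] is NULL, so the difference has length 0 instead of nrow(df).
diff_xy <- df[[x]] - df[[y]]
print(length(diff_xy))

# Assigning the zero-length vector as a new column raises an error.
msg <- tryCatch({
  df[["d"]] <- diff_xy
  "assigned"
}, error = function(e) conditionMessage(e))
print(msg)

# A simple guard: confirm both columns exist before subtracting.
missing_cols <- setdiff(c(x, y), names(df))
print(missing_cols)
```

If this is what happens here, checking which expected columns are absent from the table would show which names fail to match.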

Would it be possible to separate the imputation and stats steps? Then maybe we could skip the imputation errors and feed the imputed data directly into MSstats to perform the stats.

Looking forward to your response. I really appreciate it!

Best,
Jihui
