Errors and differences in results while running the CLI version for ICGS2 #49

Open
therealgenna opened this issue Dec 16, 2020 · 1 comment

@therealgenna

I used the latest GitHub version (Dec 14, 2020), and my goal is to run ICGS2 clustering. I created two versions of the input counts: one is a plain TSV file (called 'tsv' below) and the other is a directory with matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz (called 'mtx') that emulates the 10X output (I don't know how to make a compatible h5 file). I also had to replace ':' with '_' in gene symbols/names, as the colons seemed to cause some issues. The counts (and gene names) are identical in both cases, but the results are not. In both cases ICGS seems to have finished: the last line in the log files is "ICGS run complete... halted prior to full differential comparison analysis".
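For anyone reproducing the gene-name workaround described above, here is a minimal Python 3 sketch of the ':' to '_' replacement. It assumes the gene symbol sits in the first column of a tab-delimited table with one header row; the function name `sanitize_gene_names` and the example symbol are hypothetical and not part of AltAnalyze.

```python
import csv
import io


def sanitize_gene_names(tsv_text, bad=":", good="_"):
    """Replace `bad` with `good` in the first column of every non-header row.

    Assumption: genes are rows and the gene symbol is the first column.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row_index, row in enumerate(reader):
        # Leave the header row (cell barcodes) untouched.
        if row_index > 0 and row:
            row[0] = row[0].replace(bad, good)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    example = "gene\tcell1\tcell2\nsi:dkey-1\t3\t0\n"
    print(sanitize_gene_names(example), end="")
```

Applying the same function to both the TSV and the features file keeps the gene names identical across the two inputs.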

These are the commands I've issued:

tsv:

python altanalyze-master_github12.14.2020/AltAnalyze.py --platform RNASeq --species Dr --expname test1a --output test1a --runICGS yes --expdir 56hpf-LTA-counts/56hpf-LTA-counts.tsv --dataFormat counts

mtx:

python altanalyze-master_github12.14.2020/AltAnalyze.py --platform RNASeq --species Dr --expname test2 --output test2 --runICGS yes --ChromiumSparseMatrix 56hpf-LTA-counts/ --dataFormat counts

I had 4884 cells. For the tsv counts I have 4585 lines (cells) in ICGS-NMF/FinalGroups.txt while for mtx there are 4281.
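To see which cells were dropped in each run, and whether the two runs dropped the same ones, the cell IDs can be compared as sets. This is a rough Python 3 sketch. It assumes FinalGroups.txt is tab-delimited with the cell ID in the first column and that IDs are written identically in both runs (no added prefix or suffix); `read_cell_ids` and `compare_runs` are hypothetical helper names.

```python
def read_cell_ids(path):
    """Return the set of values in column 1 of a tab-delimited file.

    Assumption: each non-empty line starts with a cell ID.
    """
    ids = set()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                ids.add(line.split("\t")[0])
    return ids


def compare_runs(all_cells, run_a, run_b):
    """Summarize which cells each run dropped and where the runs disagree."""
    return {
        "missing_from_a": all_cells - run_a,
        "missing_from_b": all_cells - run_b,
        "only_in_a": run_a - run_b,
        "only_in_b": run_b - run_a,
    }


if __name__ == "__main__":
    # Toy data; in practice, build the sets with read_cell_ids().
    everything = {"c1", "c2", "c3", "c4"}
    summary = compare_runs(everything, {"c1", "c2", "c3"}, {"c1", "c2"})
    for key, cells in summary.items():
        print(key, sorted(cells))
```

If `only_in_a` and `only_in_b` are both non-empty, the two runs are not simply nested subsets of one another.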

I have these questions:

  • can I trust the clustering (ICGS2) results despite the multiple logged errors (please see below)? It seems that most of them are related to gene biotype annotations which are missing ...
  • why are some cells missing from the ICGS-NMF/FinalGroups.txt files?
  • why do the two versions give different results? Is there some random component here?

Thank you for your help!

Here are the errors:

try({hopg<-hopach(data,dmat=distmatg,ord="own")})
Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions
In addition: Warning message:
In collap(data, level, d, dmat, newmed) :
  Not enough medoids to use newmed='medsil' in collap() - 
 using newmed='nn' instead 


Traceback (most recent call last):
  File "/path/to/altanalyze-master_github12.14.2020/visualization_scripts/clustering.py", line 261, in heatmap
    newFilename, Z1, Z2 = R_interface.remoteHopach(inputFilename,cluster_method,metric_gene,metric_array)
  File "/path/to/altanalyze-master_github12.14.2020/R_interface.py", line 106, in remoteHopach
    z.Hopach(cluster_method,metric_gene,force_gene,metric_array,force_array)
  File "/path/to/altanalyze-master_github12.14.2020/R_interface.py", line 626, in Hopach
    if 'clustering' in hopach_run:
UnboundLocalError: local variable 'hopach_run' referenced before assignment

hopach failed... continue with an alternative method
Traceback (most recent call last):
  File "/path/to/altanalyze-master_github12.14.2020/RNASeq.py", line 4122, in correlateClusteredGenesParameters
    except Exception: TFs = importGeneSets('BioTypes',filterType='transcription regulator',geneAnnotations=gene_to_symbol_db)
  File "/path/to/altanalyze-master_github12.14.2020/RNASeq.py", line 2826, in importGeneSets
    for line in open(fn,'rU').xreadlines():
IOError: [Errno 2] No such file or directory: '/path/to/altanalyze-master_github12.14.2020/AltDatabase/EnsMart72/goelite/Dr/gene-mapp/Ensembl-BioTypes.txt'
Traceback (most recent call last):
  File "/path/to/altanalyze-master_github12.14.2020/GO_Elite.py", line 1357, in runGOElite
    try:go_to_mod_genes, mapp_to_mod_genes, timediff, mappfinder_input, resource = mappfinder.generateMAPPFinderScores(species,species_code,source_data,mod,system_codes,permute,resources,file_dirs,root,Multi=mlp)
  File "/path/to/altanalyze-master_github12.14.2020/mappfinder.py", line 462, in generateMAPPFinderScores
    if PoolVar: q.put([print_out]); return None
AttributeError: 'NoneType' object has no attribute 'put'
gene associations assigned
Traceback (most recent call last):
  File "/path/to/altanalyze-master_github12.14.2020/stats_scripts/ICGS_NMF.py", line 1295, in CompleteICGSWorkflow
    annotatedGroupsFile = RNASeq.predictCellTypesFromClusters(finalgrpfile, goelite_path)
  File "/path/to/altanalyze-master_github12.14.2020/RNASeq.py", line 5583, in predictCellTypesFromClusters
    for line in open(goelite_path,'rU').xreadlines():
IOError: [Errno 2] No such file or directory: '/path/to/test1a/NMF-SVM/SVMOutputs/GO-Elite/clustering/MarkerFinder-subsampled-ordered/GO-Elite_results/pruned-results_z-score_elite.txt'

Unable to export annotated groups file with predicted cell type names.
Parent directory not found locally for ['/DataPlots/', '/DataPlots/exp.56hpf-LTA-counts-ICGS-UMAP_scores.txt']
@nsalomonis
Owner

My apologies for the long delay in responding. Running AltAnalyze on the .mtx input with the option "--dataFormat counts" may cause unexpected issues, because the program will likely try to adjust an already normalized/log2 file (it will scale it again and take the log2 of the log2 values). I think this is likely the issue you encountered. We will see what we can do on our end to detect and prevent this (we can check that the data is not already log2 converted).
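As an illustration of the kind of check mentioned here, the following Python 3 sketch guesses whether a sample of matrix values is already log-transformed. It is only a heuristic under stated assumptions, not AltAnalyze's implementation: raw counts are treated as non-negative integers, and log2 data as fractional values below an arbitrary threshold. The function name and the default threshold of 50 are hypothetical.

```python
def looks_log_transformed(values, max_raw_threshold=50.0):
    """Guess whether `values` were already log-transformed.

    Heuristic assumptions:
      * raw counts are non-negative integers;
      * log2 values are mostly fractional and rarely exceed the threshold.
    """
    numbers = [float(v) for v in values]
    nonzero = [v for v in numbers if v != 0]
    if not nonzero:
        # An all-zero (or empty) sample carries no evidence either way.
        return False
    has_negative = any(v < 0 for v in nonzero)
    all_integers = all(v.is_integer() for v in nonzero)
    small_and_fractional = (not all_integers) and max(nonzero) < max_raw_threshold
    return has_negative or small_and_fractional


if __name__ == "__main__":
    print(looks_log_transformed([0, 3, 120, 7]))
    print(looks_log_transformed([0.0, 1.58, 6.9, 3.3]))
```

Normalized but unlogged values (for example, large fractional CPM values) would not be flagged, so a negative result does not prove the input is raw counts.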

I should also note that, for Danio rerio, we made some relatively recent changes (I believe in the last several months) that prevent ID errors with Ensembl IDs for this species (which can have chromosome IDs that are incompatible with the supporting gene ID naming conventions).

To address your other questions:

  • can I trust the clustering (ICGS2) results despite the multiple logged errors (please see below)? It seems that most of them are related to gene biotype annotations which are missing ...
    -- Yes. Some of these errors occur because this species lacks the supporting files for cell-type prediction; we will add support for Dr in future versions. The HOPACH error can be resolved by ensuring HOPACH is installed in the version of R that runs when you type "R" at the command line. AltAnalyze used to be able to install it automatically on most operating systems, but stricter gatekeeper functions, depending on the admin profile, can now prevent the software from installing such packages.
  • why are some cells missing from the ICGS-NMF/FinalGroups.txt files?
    -- This is because ICGS2 identifies transcriptionally robust clusters by classifying all cells (via SVM) against the surviving NMF cluster centroids (the clusters with sufficiently unique marker genes). In versions after 2.1.4, cells that fail to align sufficiently are excluded.
  • why do the two versions give different results? Is there some random component here?
    -- Without the "--dataFormat counts" option, in principle, the results should be identical.
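
Regarding the HOPACH point above, one quick way to confirm that the R on your PATH can load the package is sketched below in standalone Python 3 (it is not part of AltAnalyze). It assumes that `Rscript` ships with, and belongs to, the same R installation that starts when you type "R"; the function name `hopach_available` is hypothetical.

```python
import shutil
import subprocess


def hopach_available(rscript="Rscript"):
    """Return True if `rscript` is on PATH and can load the hopach package."""
    executable = shutil.which(rscript)
    if executable is None:
        # No R installation is reachable from this shell.
        return False
    # library() raises an R error, and Rscript exits non-zero,
    # when the package is not installed.
    result = subprocess.run(
        [executable, "-e", "library(hopach)"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("hopach loadable:", hopach_available())
```

If this reports False while R itself is installed, installing hopach manually from Bioconductor in that same R should let the HOPACH step run.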
