Release v0.2.4 · zktuong/dandelion

What's Changed

slicing and check contigs by @zktuong in #159
add new functions and rework github actions by @zktuong in #161

New features

slicing functionality

The Dandelion object can now be sliced like an AnnData object or a pandas DataFrame!

vdj[vdj.data['productive'] == 'T']
Dandelion class object with n_obs = 38 and n_contigs = 94
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

vdj[vdj.metadata['productive_VDJ'] == 'T']
Dandelion class object with n_obs = 17 and n_contigs = 36
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

vdj[vdj.metadata_names.isin(['cell1', 'cell2', 'cell3', 'cell4', 'cell5'])]
Dandelion class object with n_obs = 5 and n_contigs = 20
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

vdj[vdj.data_names.isin(['contig1','contig2','contig3','contig4','contig5'])]
Dandelion class object with n_obs = 2 and n_contigs = 5
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

I'm not sure that implementing it like adata[:, adata.var.something] makes sense, as the metadata is not really row information for the data slot.
Also, the base slot in Dandelion is .data, so it doesn't make sense for .metadata to be the 'row'.
Maybe scverse/scirpy#327 will come up with a better strategy that I can adopt later on.
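The masks in the examples above are indexed either by contig (built from .data) or by cell (built from .metadata), and slicing at one level carries over to the other. Below is a toy pandas sketch of that idea. It is not Dandelion's implementation: the `MiniVDJ` class, its `n_contigs` metadata column and the example values are all hypothetical.

```python
import pandas as pd


class MiniVDJ:
    """Hypothetical two-level container: contigs (.data) plus cells (.metadata)."""

    def __init__(self, data: pd.DataFrame):
        # data: one row per contig, indexed by sequence_id, with a cell_id column
        self.data = data
        # metadata: one row per cell; here just the number of contigs per cell
        self.metadata = data.groupby("cell_id").size().to_frame("n_contigs")

    def __getitem__(self, mask: pd.Series) -> "MiniVDJ":
        if mask.index.equals(self.data.index):
            # contig-level mask: keep the selected contigs
            kept = self.data[mask]
        elif mask.index.equals(self.metadata.index):
            # cell-level mask: keep every contig belonging to the selected cells
            cells = self.metadata.index[mask.to_numpy()]
            kept = self.data[self.data["cell_id"].isin(cells)]
        else:
            raise IndexError("mask must be indexed by contigs or by cells")
        # rebuilding the object recomputes the cell-level table from the kept contigs
        return MiniVDJ(kept)

    def __repr__(self) -> str:
        return (
            f"MiniVDJ with n_obs = {len(self.metadata)} "
            f"and n_contigs = {len(self.data)}"
        )


data = pd.DataFrame(
    {
        "cell_id": ["cell1", "cell1", "cell2", "cell2", "cell3"],
        "productive": ["T", "F", "T", "T", "F"],
    },
    index=pd.Index(
        ["contig1", "contig2", "contig3", "contig4", "contig5"], name="sequence_id"
    ),
)
vdj = MiniVDJ(data)

productive = vdj[vdj.data["productive"] == "T"]  # contig-level mask
multi = vdj[vdj.metadata["n_contigs"] > 1]       # cell-level mask
print(productive)
print(multi)
```

In this sketch the index of the mask decides which level is being sliced, which is one way to avoid the row/column question raised above.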

`ddl.pp.check_contigs`

Created a new function, ddl.pp.check_contigs, as a way to just check whether contigs are ambiguous rather than outright removing them. I envisage that this will eventually replace simple mode in ddl.pp.filter_contigs.
- new column in .data: ambiguous, T/F to indicate whether a contig is considered ambiguous or not (different from cell-level ambiguous).
- the .metadata and several other functions ignore any contigs marked as T, to maintain the same behaviour
- The largest difference between ddl.pp.check_contigs and ddl.pp.filter_contigs is that the onus is on the user to remove any 'bad' cells from the GEX data (illustrated in the tutorial) with check_contigs whereas this happens semi-automatically with filter_contigs.
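As a rough illustration of flagging rather than removing, the pandas sketch below keeps every contig, skips the flagged ones only when summarising per cell, and leaves the removal of cells from a stand-in GEX table to the user. The tables, the `n_usable_contigs` column and the rule for what counts as a 'bad' cell are assumptions made for this example; this is not the ddl.pp.check_contigs implementation.

```python
import pandas as pd

# Hypothetical contig table in the style of `.data`; values are made up.
contigs = pd.DataFrame(
    {
        "sequence_id": ["c1", "c2", "c3", "c4", "c5"],
        "cell_id": ["cellA", "cellA", "cellB", "cellC", "cellC"],
        "ambiguous": ["F", "T", "T", "F", "F"],
    }
).set_index("sequence_id")

# Flagged contigs stay in the table, but the cell-level summary
# only counts contigs marked 'F'.
usable = contigs[contigs["ambiguous"] == "F"]
cell_summary = usable.groupby("cell_id").size().rename("n_usable_contigs")

# Stand-in for the GEX `.obs` table (e.g. `adata.obs`).
gex_obs = pd.DataFrame(
    index=pd.Index(["cellA", "cellB", "cellC", "cellD"], name="cell_id")
)
gex_obs["n_usable_contigs"] = cell_summary.reindex(gex_obs.index, fill_value=0)

# Arbitrary rule for this sketch: a cell is 'bad' if it has contigs
# but every one of them is flagged as ambiguous.
has_contigs = gex_obs.index.isin(contigs["cell_id"])
bad = (gex_obs["n_usable_contigs"] == 0) & has_contigs

# The user, not the function, removes the cells from the GEX side.
gex_clean = gex_obs[~bad]
print(gex_clean)
```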

`ddl.update_metadata` now comes with a 'by_celltype' option

This brings a new feature: B cell, alpha-beta T cell and gamma-delta T cell specific versions of the V, D, J, C and productive columns!
- This is achieved through a new .retrieve_celltype subfunction in the Query class, which breaks up the retrieval into the 3 major groups if by_celltype = True.
- There is no longer any need to guess which call belongs to which cell type, and this allows for easy slicing! It does cause a bit of .obs bloating.
- To offset the bloat, constant_status_VDJ, constant_status_VJ, productive_status_VDJ and productive_status_VJ have been removed as part of a slight rework of the Dandelion metadata slot to account for the new B/abT/gdT columns.
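To give a feel for the per-cell-type columns, here is a small pandas sketch that derives `v_call_<celltype>_<chain>` columns from a contig table using `locus`. The locus grouping (IGH/IGK/IGL as B, TRA/TRB as abT, TRG/TRD as gdT, with IGH/TRB/TRD as the VDJ chains) follows standard nomenclature, but the code, the helper names and the '|' separator are assumptions for illustration and not what .retrieve_celltype does internally.

```python
import pandas as pd

# Locus -> cell type and locus -> chain type lookups.
CELLTYPE = {
    "IGH": "B", "IGK": "B", "IGL": "B",
    "TRA": "abT", "TRB": "abT",
    "TRG": "gdT", "TRD": "gdT",
}
CHAIN = {
    "IGH": "VDJ", "TRB": "VDJ", "TRD": "VDJ",
    "IGK": "VJ", "IGL": "VJ", "TRA": "VJ", "TRG": "VJ",
}

# Hypothetical contig table; values are made up.
contigs = pd.DataFrame(
    {
        "cell_id": ["cell1", "cell1", "cell2", "cell2"],
        "locus": ["IGH", "IGK", "TRB", "TRA"],
        "v_call": ["IGHV1-2", "IGKV1-5", "TRBV7-2", "TRAV12-1"],
    }
)

# Build the target column name for each contig, e.g. 'v_call_B_VDJ'.
contigs["column"] = (
    "v_call_" + contigs["locus"].map(CELLTYPE) + "_" + contigs["locus"].map(CHAIN)
)

# One row per cell, one column per cell type / chain combination.
by_celltype = (
    contigs.groupby(["cell_id", "column"])["v_call"]
    .agg(lambda calls: "|".join(calls))
    .unstack("column")
)
print(by_celltype)
```

Selecting, say, only the cells with an alpha-beta T cell VDJ call then becomes `by_celltype[by_celltype["v_call_abT_VDJ"].notna()]`.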

`tl.productive_ratio`

Calculates a cell-level representation of productive vs non-productive contigs.
- Plotting is achieved through pl.productive_ratio
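The notes do not spell out how the ratio is defined, so the following is only one plausible reading, sketched with pandas: flag each cell by whether it carries a productive contig, then report the share of such cells per group. The table, the `group` column and the percentage scale are assumptions; this is not the tl.productive_ratio implementation.

```python
import pandas as pd

# Hypothetical contig table; `group` stands in for any cell-level grouping.
contigs = pd.DataFrame(
    {
        "cell_id": ["c1", "c1", "c2", "c3", "c4"],
        "group": ["naive", "naive", "naive", "memory", "memory"],
        "productive": ["T", "F", "F", "T", "T"],
    }
)

# Cell level: does the cell carry at least one productive contig?
cells = (
    contigs.assign(is_productive=contigs["productive"] == "T")
    .groupby(["group", "cell_id"])["is_productive"]
    .any()
)

# Group level: percentage of cells with a productive contig.
productive_pct = cells.groupby(level="group").mean() * 100
print(productive_pct)
```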

`tl.vj_usage_pca`

Computes PCA on a cell-level representation of V/J gene usage across designated groupings
- uses scanpy.pp.pca internally
- Plotting can be achieved through scanpy.pl.pca
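The general shape of the computation can be sketched as follows: build a group-by-gene usage matrix, normalise it, and run PCA on it. To keep the example free of a scanpy dependency, the PCA here is done with a NumPy SVD rather than scanpy.pp.pca, and the input table, the grouping column and the frequency normalisation are all assumptions rather than the function's actual behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical cell-level table with one V and one J call per cell.
cells = pd.DataFrame(
    {
        "group": ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3"],
        "v_call": ["TRBV7-2", "TRBV7-2", "TRBV5-1", "TRBV5-1",
                   "TRBV20-1", "TRBV7-2", "TRBV20-1", "TRBV20-1"],
        "j_call": ["TRBJ2-1", "TRBJ1-1", "TRBJ2-1", "TRBJ2-7",
                   "TRBJ2-7", "TRBJ1-1", "TRBJ2-1", "TRBJ2-7"],
    }
)

# Count gene usage per group; V and J genes become columns of one matrix.
v_usage = pd.crosstab(cells["group"], cells["v_call"])
j_usage = pd.crosstab(cells["group"], cells["j_call"])
usage = pd.concat([v_usage, j_usage], axis=1)

# Convert counts to within-group frequencies (assumed normalisation).
freq = usage.div(usage.sum(axis=1), axis=0)

# PCA via SVD of the column-centred matrix.
centred = freq.to_numpy() - freq.to_numpy().mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# Scores of each group on the first two principal components.
pcs = pd.DataFrame(u * s, index=freq.index).iloc[:, :2]
pcs.columns = ["PC1", "PC2"]

# Fraction of variance carried by each component.
explained = s**2 / np.sum(s**2)
print(pcs)
```

Each row of `pcs` is one group, so the result can be plotted as a scatter of groups rather than of individual cells.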

bug fixes

fix cell ordering issue scverse/scirpy#347
small refactor of ddl.pp.filter_contigs
- moved some of the repetitive loops into callable functions
- deprecated the filter_vj_chains argument and replaced it with filter_extra_vdj_chains and filter_extra_vj_chains to hopefully enable more interpretable behaviour. fixes #158
- the UMI adjustment step was buggy, but I have now made the behaviour consistent with how it functions in ddl.pp.check_contigs
rearrangement_status_VDJ and rearrangement_status_VJ (renamed from rearrangement_VDJ_status and rearrangement_VJ_status) now give a single value for whether a chimeric rearrangement occurred, e.g. TRDV pairing with TRAJ and TRAC as in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4267242/
fixed issues with progress bars getting out of hand
fixed issue with ddl.tl.find_clones crashing if more than one type of locus is found in the data.
- now a B, abT and gdT prefix will be appended to BCR/TR-ab/TR-gd clones.
check_contigs, find_clones and define_clones were removing non-productive contigs even though there's no need to. This may cause issues with filter_contigs... but that's a problem for next time.
fix issue with min_size in network not behaving as intended. switch to using connected components to find which nodes to trim
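The connected-components approach to min_size mentioned above can be sketched in plain Python: group nodes into components, then drop every node whose component is too small. The adjacency-dict representation, the function names and the choice to keep components with at least `min_size` nodes are assumptions for this example; on a networkx graph, `networkx.connected_components` would provide the same grouping.

```python
from collections import deque


def connected_components(adjacency):
    """Yield sets of nodes, one per connected component of an undirected graph."""
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # breadth-first search from an unvisited node collects its component
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        yield component


def trim_small_components(adjacency, min_size):
    """Return a copy of the graph without components smaller than min_size."""
    keep = set()
    for component in connected_components(adjacency):
        if len(component) >= min_size:
            keep |= component
    return {
        node: [n for n in neighbours if n in keep]
        for node, neighbours in adjacency.items()
        if node in keep
    }


# Toy clone network: a 3-node clone, a 2-node clone and a singleton.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "d": ["e"], "e": ["d"],
    "f": [],
}
trimmed = trim_small_components(graph, min_size=3)
print(sorted(trimmed))
```

Working on whole components means a node is kept or dropped together with the rest of its clone, rather than being judged on its own degree.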

other changes

new column chain_status, to summarise the reworked locus_status column.
- Should contain values like ambiguous, Orphan VDJ, Single pair etc, similar to chain_pairing in scirpy.
Also fixed the ordering of metadata to make it more presentable, instead of just randomly slotting into the data frame.
ddl.concat now allows for custom suffix/prefix - only operates on sequence_id
remove .edges from Dandelion class because this doesn't get used anywhere and it's also stored in the networkx graphs
minimum spanning tree construction is now performed using networkx directly so that I don't have to keep converting the adjacency matrices back and forth between pandas and networkx
clean up documentation slightly

Full Changelog: v0.2.2...v0.2.4
