Changes to data parsing in `finemapper.py`; if they are unnecessary, I'll submit only the BGEN bug fix #188

1511878618 · 2024-03-18T10:12:43Z

Pull Request Title: Enhancements and New Features for Genetic Data Processing

Only finemapper.py is modified.

Summary of Changes

This PR introduces several updates and new features to improve the handling and processing of genetic data. Key changes include:

Support for computing LD from BGEN and saving it as .npz, which loads faster than .bcor:
- Implemented computing LD matrices directly from BGEN files and storing them in .npz format, in line with polyFun conventions. Use --geno genofile --ldstore2 $(which ldstore) --cache-dir ./ --cache-format npz to save in npz format by default.
NPZ File Reading Capability:
- Added capability to read npz files using --ld your_npz_prefix.
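As a rough illustration of caching an LD matrix as .npz and reading it back, the sketch below uses NumPy's `savez_compressed` and `load`. The helper names and the array keys `tril` and `snps` are assumptions made for this example; the on-disk layout that polyFun and this PR actually use may differ.

```python
import numpy as np


def save_ld_npz(path, ld, snp_ids):
    """Store the lower triangle of a symmetric LD matrix plus its SNP IDs."""
    ld = np.asarray(ld, dtype=np.float64)
    rows, cols = np.tril_indices(ld.shape[0])
    # Hypothetical layout: one flat vector for the triangle, one array of IDs.
    np.savez_compressed(
        path, tril=ld[rows, cols], snps=np.asarray(snp_ids, dtype=str)
    )


def load_ld_npz(path):
    """Rebuild the full symmetric matrix from the cached triangle."""
    with np.load(path) as data:
        tril = data["tril"]
        snps = data["snps"].tolist()
    n = len(snps)
    ld = np.zeros((n, n), dtype=tril.dtype)
    rows, cols = np.tril_indices(n)
    ld[rows, cols] = tril
    ld[cols, rows] = tril  # mirror into the upper triangle
    return ld, snps
```

Storing only one triangle roughly halves the file size, which is one reason a compressed .npz cache can load quickly.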
PGEN File Support:
- Introduced support for using pgen files as input via --geno pgen_file_prefix.
- Integrated finemap_tools for invoking Plink2 (version must be later than PLINK v2.00a6LM 64-bit Intel, dated 2 Mar 2024) to compute LD, using the command template: plink2 --r2-unphased square.
- This may be slower than ldstore2.
- Note: The --geno option matches files by prefix, and bed files take priority over pgen files to avoid conflicts when both file types are present.
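The prefix matching and the bed-over-pgen priority described above could look roughly like this. Both function names are hypothetical, and the `--pfile` and `--out` arguments are assumptions added to make the command template complete; the PR delegates the real call to finemap_tools.

```python
from pathlib import Path


def resolve_geno_format(prefix):
    """Return 'bed' or 'pgen' for a --geno prefix, preferring bed when both exist."""
    for ext in ("bed", "pgen"):  # order encodes the priority described above
        if Path(f"{prefix}.{ext}").exists():
            return ext
    raise FileNotFoundError(f"no .bed or .pgen file found for prefix {prefix!r}")


def plink2_ld_command(prefix, out_prefix):
    """Build (but do not run) a plink2 call for a square unphased r2 matrix."""
    return [
        "plink2",
        "--pfile", str(prefix),
        "--r2-unphased", "square",
        "--out", str(out_prefix),
    ]
```

Returning the argument list instead of running it keeps the sketch free of side effects; a caller would pass it to `subprocess.run`.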
Improvements in LD Matrix Handling:
- Removed the assertion that LD matrices must not contain any NaN values when they are read. Instead, I modified the sync_ld_sumstats function to exclude SNPs with NA values.
- This covers cases where a SNP's LD calculation produces NaN because the genotype data were not filtered or because of extreme scenarios. The output reports how many SNPs were dropped; depending on that number, users should either apply stricter QC beforehand or accept the automatic exclusion of SNPs with NA in any LD entry.
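One way to exclude SNPs with NA entries from an LD matrix is sketched below. The greedy rule (repeatedly drop the SNP with the most NaN entries) and the function name are assumptions chosen for illustration; the rule that sync_ld_sumstats applies in this PR may be different.

```python
import numpy as np


def drop_nan_snps(ld, snp_ids):
    """Remove SNPs until the remaining LD sub-matrix has no NaN entries."""
    ld = np.asarray(ld, dtype=float)
    keep = np.arange(ld.shape[0])
    while keep.size > 0:
        sub = ld[np.ix_(keep, keep)]
        nan_counts = np.isnan(sub).sum(axis=1)
        if nan_counts.max() == 0:
            break
        # Drop the worst offender first, so that one fully-NaN SNP does not
        # cause every SNP that shares a row or column with it to be removed.
        keep = np.delete(keep, np.argmax(nan_counts))
    kept_ids = [snp_ids[i] for i in keep]
    return ld[np.ix_(keep, keep)], kept_ids
```

The same `kept_ids` list can then be used to subset the summary statistics so that both inputs stay aligned.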
Enhancements in Summary Statistics (sumstats) Loading:
- For sumstats files that are bgz compressed and have an associated .tbi file, implemented reading via the tabix command-line tool. This approach is particularly efficient for genome-wide sumstats, allowing direct retrieval of data by chromosome, significantly reducing loading times.
- If finemap_tools is unavailable, the code falls back to the original logic of reading the entire file.
- Sumstats files organized according to polyFun requirements (columns: SNP, A1, A2, BP, CHR, with SNP in the first column) can be indexed using tabix -s 2 -b 3 -e 3 -c S sumstats_with_bgz_compressed.bgz. Here -s 2 marks column 2 as the chromosome, -b 3 -e 3 mark column 3 as the position, and -c S skips the header line, which starts with "S".
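A minimal sketch of the per-chromosome retrieval idea follows. The function names are hypothetical, and checking for the tabix binary and the .tbi file is an assumption about how a fallback could be decided; the PR itself keys its fallback on whether finemap_tools is available.

```python
import os
import shutil


def can_use_tabix(sumstats_path):
    """True when tabix is on PATH and a .tbi index sits next to the file."""
    return (
        shutil.which("tabix") is not None
        and os.path.exists(sumstats_path + ".tbi")
    )


def tabix_region_command(sumstats_path, chrom, start=None, end=None):
    """Build the argv for fetching one chromosome, or one region of it."""
    if start is None or end is None:
        region = str(chrom)
    else:
        region = f"{chrom}:{start}-{end}"
    # -h asks tabix to print the header lines as well.
    return ["tabix", "-h", sumstats_path, region]
```

A caller would run the command with `subprocess.run(..., capture_output=True, text=True)` and parse the tab-separated output, or read the whole file when `can_use_tabix` returns False.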
Integration of finemap_tools Package:
- Included finemap_tools for filtering biallelic and ambiguous alleles during sumstats reading.
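The allele filter can be sketched in plain Python as shown below. This does not call finemap_tools, whose API is not shown in this PR. It assumes that "ambiguous" means strand-ambiguous pairs (A/T and C/G) and that a row counts as a biallelic SNP when A1 and A2 are two different single bases; both assumptions and all names are for illustration only.

```python
_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_biallelic_snp(a1, a2):
    """Both alleles are single, distinct bases (assumed definition)."""
    a1, a2 = a1.upper(), a2.upper()
    return a1 in _COMPLEMENT and a2 in _COMPLEMENT and a1 != a2


def is_strand_ambiguous(a1, a2):
    """A/T and C/G pairs look the same on either strand."""
    return _COMPLEMENT.get(a1.upper()) == a2.upper()


def filter_sumstats_rows(rows):
    """Keep rows that are biallelic SNPs and not strand-ambiguous."""
    return [
        row for row in rows
        if is_biallelic_snp(row["A1"], row["A2"])
        and not is_strand_ambiguous(row["A1"], row["A2"])
    ]
```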
Code Formatting Updates:
- Applied code formatting improvements using Black.

bgen bug fix

This is what commit 03c283d2190e2f3100462bb8932ed4f7441b54aa does. The commits after it contain further changes that may not be necessary.

Future Developments

Further development and updates will continue in my own repository and will not be submitted as pull requests to this project.

1511878618 · 2024-03-18T10:14:04Z

finemapper.py

            snp_alleles = rsid.alleles
            snp_chrom = rsid.chrom
            snp_pos = rsid.pos
+            rsid = rsid.rsid  # NOTE: this is the change


This is the only code change.
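Reading the diff, the loop variable `rsid` appears to hold a variant record rather than a string, so the added line rebinds it to the record's string ID once the other fields have been read. The sketch below reproduces that pattern with a hypothetical `Variant` dataclass standing in for whatever the BGEN reader yields; the class and the `collect_snp_info` function are assumptions, not code from finemapper.py.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical stand-in for the record a BGEN reader yields per variant."""
    rsid: str
    chrom: str
    pos: int
    alleles: tuple


def collect_snp_info(variants):
    rows = []
    for rsid in variants:  # despite its name, `rsid` is a Variant here
        snp_alleles = rsid.alleles
        snp_chrom = rsid.chrom
        snp_pos = rsid.pos
        rsid = rsid.rsid  # rebind to the plain string ID, as in the fix
        rows.append((rsid, snp_chrom, snp_pos, snp_alleles))
    return rows
```

Without the rebinding line, the first element of each row would be the whole record instead of the ID string.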

big_xutingfeng added 12 commits March 14, 2024 18:29

fix bugs for finemapper.py with bgen input

03c283d

update LD result with npz to accleate load speed

74dee8a

fix bugs for local npz file founded and --cache-dir is passed

dfb3f5f

fix bugs

1efd465

fix a god damn bugs by misstake touch the keyboard

214c482

No need for assert LD_array is all not NA, but drop them all to filter

7afc957

fix bugs for filtering NA LD

b773de5

add pgen file as --geno and cal ld by finemap_tools

06b9bce

black code

4f3b63f

restrict LD by pgen in snp set in sumstats

1c92276

finnal update with little format changes by xtf

16f51e2

remove comments unnecessary

b997165
