
After adding new cell types, the size of the new model file is smaller than that of the corresponding ENCODE model file #11

Open
zhangyongzlm opened this issue Sep 11, 2020 · 1 comment

@zhangyongzlm

I have run into a problem. I want to add my own NPC cell types (e.g., C15, C17, C17C ... X2117) to the existing models, but the resulting model file is smaller than the corresponding ENCODE model file.
[screenshots: file sizes of the new model files vs. the corresponding ENCODE model files]
I also tried loading the newly generated model file and found that the NPC cell types are indeed added to the model.
[screenshot: loaded model listing the NPC cell types]

Here is my code for training the model.

import os

# Theano reads THEANO_FLAGS at import time, so it must be set before Avocado
# (and therefore Theano) is imported.
os.environ["THEANO_FLAGS"] = "device=cuda0"

import argparse
import itertools
import math

import numpy
import pandas as pd

from avocado import Avocado

numpy.random.seed(0)

parser = argparse.ArgumentParser(description="Train a new model")
parser.add_argument(
    "chrom", type=str, help="Specify the chromosome that training is performed in"
)
parser.add_argument(
    "--chromSize",
    action="store",
    dest="chromSize",
    type=str,
    default="./hg38.chrom.sizes",
    help="The file storing the chromosome size information",
)
parser.add_argument(
    "--batchsize",
    action="store",
    dest="batchsize",
    type=int,
    default=40000,
    help="Batch size for neural network predictions.",
)
args = parser.parse_args()

chrom_size = pd.read_table(args.chromSize, sep="\t", names=["chr", "size"])
chrom_size.set_index(["chr"], inplace=True)

celltypes = [
    "C15",
    "C17",
    "C17C",
    "C666-1",
    "NP460",
    "NP460_EBV",
    "NP69",
    "NP69_EBV",
    "NPC23",
    "NPC32",
    "NPC43",
    "NPC43noEBV",
    "NPC53",
    "NPC76",
    "X2117",
]
assays = [
    "ChIP-seq_H3K27ac_signal_p-value",
    "ChIP-seq_H3K4me1_signal_p-value",
    "ChIP-seq_H3K4me3_signal_p-value",
]

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    mark = assay.split("_")[1]  # e.g. "H3K27ac"
    filename = "./signals/{}/{}/{}.{}.pval.signal.bw.{}.npz".format(
        celltype, mark, celltype, mark, args.chrom
    )
    print(filename)
    data[(celltype, assay)] = numpy.load(filename)[args.chrom]

model = Avocado.load("./avocado/.encode2018core-model/avocado-" + args.chrom)
size = chrom_size.loc[args.chrom]["size"]
model.fit_celltypes(data, epoch_size=math.ceil(size / args.batchsize), n_epochs=200)

model.save("./model/NPC_" + args.chrom)
@jmschrei (Owner)

That's weird, but I'm not sure it necessarily means there's a problem. You may have a higher compression level set for HDF5 files than I did. Can you still make predictions, and does everything look fine?
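The compression hypothesis is easy to check in principle. As a minimal sketch (not Avocado-specific; plain `zlib` standing in for HDF5's gzip filter, and the payload is a made-up stand-in for model weights), the same bytes shrink to different on-disk sizes at different compression levels, so two files holding identical model content can legitimately differ in size:

```python
import random
import zlib

random.seed(0)
# Repetitive low-entropy payload standing in for identical model weights
# written out twice with different compression settings.
payload = bytes(random.randrange(4) for _ in range(200_000))

sizes = {level: len(zlib.compress(payload, level)) for level in (1, 6, 9)}
print(sizes)

# Same content, different on-disk size: a higher compression level alone
# can make the newer file smaller than the older one.
assert sizes[9] <= sizes[6] <= sizes[1]
```

With h5py, the analogous knob is `create_dataset(..., compression="gzip", compression_opts=level)`, where `compression_opts` ranges from 0 to 9; inspecting both model files with `h5py` (or `h5dump`) would show whether their filter settings differ.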
