全基因组 CpG (or nonCpG) 位点 bed 文件的获取方式

CpG 位点是基因组上的一种特殊序列，它在 DNA 的结构和功能中起着重要作用。其组成和甲基化影响着基因组上各区域的调控，是表观遗传学领域中重要的研究目标。

本文中的脚本主要用于提取基因组中 CpG 位点的位置，以方便一些着重于 CpG 位点的下游分析进行。此外在部分场景下，我们可能希望排除 CpG 位点进行分析（因为其独特的性质例如高突变率等），因此该脚本也可产生 AT&CG(nonCpG) 位点信息文件。

需要准备的文件和条件有：

基因组序列文件
Biopython 库（pip install biopython）

可能需要准备的前置条件：

已被放置于 $PATH 环境变量中的 bedtools

脚本内容：

from Bio import SeqIO
import os
import argparse
import gzip
import subprocess

def get_args():
    parser = argparse.ArgumentParser(description="Use to extract AT/CG(nonCpG) & CpG sites.")
    
    input_group = parser.add_argument_group("Input")
    input_group.add_argument("-g", "--genome", required=True, help="The path of genome sequence file.")
    input_group.add_argument("-c", "--chr", required=True, help="The target chromosome to extract.")

    option_group = parser.add_argument_group("Optional parameters")
    option_group.add_argument("--onlycpg", action="store_true", help="Only extract CpG sites.")
    option_group.add_argument("--merge", action="store_true", help="Use bedtools to merge bed files.")
    option_group.add_argument("--gzip", action="store_true", help="Use gzip to compress output.")
    option_group.add_argument("--nosoftmask", action="store_true", help="Remove soft-masked where repeats are in lower-case text.")

    output_group = parser.add_argument_group("Output")
    output_group.add_argument("-n", "--name", help="The name of output file.")
    output_group.add_argument("-p", "--path", help="The path of output file.", default='./')
    
    args = parser.parse_args()
    if not args.name:
        args.name = os.path.splitext(os.path.basename(args.genome))[0]
    
    return args

def process_command(command):
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        print(f"Error when executing command: {command}\n{result.stderr.decode()}")

def main():
    args = get_args()

    with gzip.open(args.genome, "rt") if args.genome.endswith(".gz") else open(args.genome, "r") as genome_file:
        chr_num = "chr" + args.chr.replace("chr", "")
        cpg_bed_name = os.path.join(args.path, f"{args.name}_{chr_num}_CpGsites.bed.tmp")
        atcg_bed_name = os.path.join(args.path, f"{args.name}_{chr_num}_ATnonCpGsites.bed.tmp")
        
        with open(cpg_bed_name, "w") as cpg_bed, open(atcg_bed_name, "w") as atcg_bed:
            for record in SeqIO.parse(genome_file, "fasta"):
                if record.id != args.chr:
                    continue
                if not args.nosoftmask:
                    seq = str(record.seq).upper()
                else:
                    seq = str(record.seq)

                for i, nuc in enumerate(seq):
                    if nuc == "N":
                        continue
                    if nuc in ["A", "T"] and not args.onlycpg:
                        atcg_bed.write(f"{chr_num}\t{i}\t{i+1}\n")
                    elif nuc in ["C", "G"]:
                        if (nuc == "C" and i + 1 < len(seq) and seq[i + 1] == "G") or (nuc == "G" and i > 0 and seq[i - 1] == "C"):
                            cpg_bed.write(f"{chr_num}\t{i}\t{i+1}\n")
                        elif not args.onlycpg:
                            atcg_bed.write(f"{chr_num}\t{i}\t{i+1}\n")
        
        for bed_file in [cpg_bed_name, atcg_bed_name] if not args.onlycpg else [cpg_bed_name]:
            outfile = bed_file.replace(".tmp", "")
            if args.merge:
                command = f"bedtools merge -i {bed_file} > {outfile} && rm {bed_file}"
            else:
                command = f"mv {bed_file} {outfile}"
            process_command(command)
            if args.gzip:
                process_command(f"gzip -f {outfile}")
        
        if args.onlycpg:
            os.remove(atcg_bed_name)

if __name__ == "__main__":
    main()

将上述内容保存在一个名为 get_sites.py 的文件内，或者点击此处下载。

运行示例（以 hg38 人类基因组为例，提取其在 21 号染色体上的 CpG 位点位置信息）：

1 2	$ python get_sites.py -g hg38.fa -c chr21 --merge --gzip -n GRCh38 # python get_sites.py -g [sequence file] -c [chr] --merge --gzip -n [outname]

最后将会得到名为 GRCh38_chr21_ATnonCpGsites.bed.gz 和 GRCh38_chr21_CpGsites.bed.gz 的文件，以下是各参数说明：

-g 指定参考基因组序列文件，.gz 压缩格式可以被自动识别。
-c 指定要提取位点信息的染色体，需和序列文件中的序列 id 一致。
--merge 在提取完位点信息后，调用 bedtools 对输出文件进行合并，将每行 1bp 的 bed 文件合并为区间。该选项要求 bedtools 可以直接调用，否则会出现错误。bedtools 的安装可以直接通过 conda install -c bioconda bedtools 实现。
--gzip 在提取完位点信息后（如果指定了 --merge 则在合并后），对输出文件进行压缩。
-n 指定输出文件的前缀。如果命令中未给定，则默认为序列文件的前缀。
-p 指定输出文件的路径。如果命令中未给定，则默认为运行脚本时所处的路径。

此外，如果在运行脚本时指定了 --onlycpg，则仅产生 CpG 位点位置文件。如果在运行脚本时指定了 --nosoftmask，则跳过软掩蔽区域。

因为该脚本针对特定染色体进行，因此可以批量运行脚本以同时处理多条染色体，示例：

$ for i in {1..23};
do
python get_sites.py -g hg38.fa -c chr"$i" --merge --gzip -n GRCh38 &
done

请保证计算机或服务器有足够的 CPU 和内存进行批量运行，此处更建议使用 parallel 来进行并行数量控制。如果在服务器上有作业调度系统，也可以合理利用，例如 slurm 的作业数组等，以下是一个 slurm 作业数组运行的示例：

#!/bin/bash

#SBATCH --job-name=getsites
#SBATCH --nodelist=yournode
#SBATCH --ntasks=1
#SBATCH --array=1-23%10
#SBATCH --mail-user=youremail
#SBATCH --mail-type=END

python get_sites.py -g hg38.fa -c chr"$SLURM_ARRAY_TASK_ID" --merge --gzip -n GRCh38

产生结果示例：

1	$ zcat GRCh38_chr21_CpGsites.bed.gz \| head -n 5

chr21   5010008 5010010
chr21   5010053 5010055
chr21   5010215 5010217
chr21   5010331 5010333
chr21   5010335 5010337

后记

虽说这半年已经写了非常多的 script，但是翻来翻去感觉都属于针对性过强的类型，除了我大概率也没有人能用得上。因此最后只能选定本文中这个短小精悍~~（没啥难度）~~的脚本作为分享~~（水文）~~，或许能帮到有需要的人。

此外也还有一个和本文类似的脚本，主要用来提取全基因组上所有四倍简并位点的位置信息。之后如果有需要的话会补充在这篇文章里面。