提取 bwa 比对中唯一比对 reads（uniquely mapped reads）的方法

本文由 Juse 基于以下资料进行撰写：

How to extract uniquely mapped reads from bam/sam produced by bwa-mem? from SEQanswers

Obtaining uniquely mapped reads from BWA mem alignment (filtering by q score does not seem to do the trick in my case) from Biostars

Obtaining uniquely mapped reads from BWA mem alignment from StackExchange

在此感谢 community 中各位大佬的无私分享，关于 SAM format 的详细说明可见：

https://samtools.github.io/hts-specs/SAMv1.pdf

正文

在某些文章中，有时我们会看到作者提及到在后续分析中，ta 只考虑了 uniquely mapping reads（后文统称唯一比对）。从直觉上出发，使用唯一比对的思路不难理解 —— 这些比对更加精确，并使后续的分析具有更好的解释性。

目前来说，得到唯一比对的最主流方法是直接通过 MAPQ 进行过滤，但这并不能得到严格意义上的唯一比对，因为部分 reads 在经过该过滤后仍然具有多重比对。不过这里需要明确一点：通过 MAPQ 过滤后仍存在的多重比对数量其实已经很少，如 Biostars 中的帖主所示：

For the one sample I am using to test out commands, I filtered my data so I only have properly paired reads and mapped reads (in the case of my unpaired reads) and also filtered by q score (I get nearly the same results at q of 10, 20, or 30, so it's not an issue of changing the q score level), merged all my bam files, and obtained 193,998 reads total for that sample. When I open the bam file in Geneious and count the number of reads for each locus they add up to 194, 050. I realize the difference between these two numbers is an incredibly small number of reads.

因此，仅使用 MAPQ 进行过滤看上去也是可行的（如果你能接受极小一部分 “可能具有多个比对” 的 reads）。并且严格意义上说，唯一比对这一概念本身就具有局限性 —— 比对时所使用的参数如果过于宽松，那么具有多重比对的 reads 数量也会大大增加。

不过，如果实在想要避免多重比对带来的影响，可以根据 bwa 比对中的一些 Alignment Tags 来进行过滤，在 bwa 中，拥有多重比对的 reads 会出现以下一些 Tag：

XA，该 flag 中描述了对应 read 的 Alternative hits。
SA，该 flag 用于标识嵌合的比对结果（Supplementary Alignment）。

这两个 flag 的格式非常相似，其中 [XA] 是 bwa 等一部分比对软件使用的 flag。更详细信息可见下文说明。

因此，如果要过滤出唯一比对，只需要对这两个 flag 进行筛选：

1	$ sambamba view -t 12 -h -f bam -F "mapping_quality >= 1 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)" [input bam] -o [output bam]

sambamba -F 中的各个 filter tag 说明：

mapping_quality >= 1：要求比对质量分数至少为 1。
not (unmapped or secondary_alignment)：不满足 unmapped 或 secondary_alignment 条件。
- unmapped：未比对上的 reads。
- secondary_alignment：同一 read 除了主比对之外的次比对记录。
not ([XA] != null or [SA] != null)：不存在 [XA] tag 或 [SA] tag。
- [XA]：当 read 完整地比对到了多个位置时，比对软件会选择最佳的一个作为 primary (主比对)，其他的标记为 secondary (次比对)。而 [XA] 和 secondary_alignment 的区别在于前者在主比对记录中进行次比对结果的标注，后者则将次比对结果进行单独记录。相比之下，[XA] 能够节省存储空间。
- [SA]：指 read 的不同部分比对到了不同的位置，可能由结构变异、基因融合、错误组装等因素导致。

以上过滤使用的 filter tag 可根据自身需求进行灵活调整，上述命令将过滤未比对的 reads 并保留有且仅有唯一比对位置的 reads。