通俗易懂的贝叶斯深度神经网络介绍及代码实现

最近在课题组里的一些任务涉及到了把模型的架构改成贝叶斯层，让模型输出的结果能够体现不确定性。恶补相关知识的同时正好也能记录下笔记发博客，一举两得。

这篇文章只做概念介绍，不做深入剖析，追求原理的话敬请走下文中相关链接前往原论文。

贝叶斯神经网络是什么

或许你在很多其他的地方听过贝叶斯这个词汇。以下是一些生物学中的例子：

系统发育学中的贝叶斯建树（Mrbayes、Phylobayes）。
群体遗传学（种群遗传学）中分析种群结构、估计种群历史等（Structure、ABC）。
在某些基因组比对的工具里也有用到贝叶斯方法（bwa）。

贝叶斯方法的大致原理是结合先验知识和实验数据来估计参数或测试假设，这个过程涉及到随机采样，因此其他方法不同，应用贝叶斯法得到的结果往往不会是完全一致的，也就是说，可能每次跑出来的结果之间会相似，但不会完全相同。

这里就提及到了贝叶斯方法的一个宝贵性质 —— 不确定性。

如果有模型搭建经验的话，你应该知道，对于传统的神经网络架构而言，在你训练好模型后，如果给它相同的输入，那么一般而言它都会产生一致的输出（因为模型内部的权重是确定的）。但是有些时候，我们可能想知道：

我们模型输出的结果到底多可靠？它的不确定性有多大？波动范围大概是多少？

这时候，贝叶斯神经网络就发挥了其优势。

贝叶斯神经网络如何引入不确定性

我深知过于详尽的数学公式和逻辑论证会让人头昏眼花（数学佬除外），所以我并不打算以过于专业的角度进行解释。如果你对本文涉及到的贝叶斯神经网络个中原理感兴趣，可以去观摩相关的文章：

Weight Uncertainty in Neural Networks

简单来说，贝叶斯神经网络中，每个神经元（或者说节点）之间连接的权重并不是一个固定的值，而是一个分布。假设在一个传统的神经网络架构中，一个权重的值为 0.2，那么这个权重在贝叶斯神经网络中的对应值可能就变为了 0.1-0.3。

只要能够理解上述这点，那么其输出为什么具有不确定性也就不难理解了。在贝叶斯神经网络的前向传播中，权重将会从一个分布中进行采样，因此不同的运行里我们得到的结果会不同，以上面的权重为例，或许这一次模型采样的权重为 0.15，而下一次就变成了 0.25。

这里同样引申出另一个重要的点：贝叶斯神经网络如何得到结果的不确定性？不难看出，它引入不确定性是通过权重的随机采样实现的，但是它每次运行输出的结果依然是个确定的值而不是一个分布（因为权重在采样后都是确定的某个值）。不过正如上面所说，因为每次运行得到的结果都会不同，因此我们可以通过多次预测得到一组预测结果，这组预测结果的平均值和标准差等信息即我们所需要的。

贝叶斯神经网络的优势和不足

贝叶斯神经网络（BNNs）与传统神经网络相比，优势如下：

不确定性估计：贝叶斯神经网络能够提供预测的不确定性估计。这是因为 BNNs 使用概率分布而非单一值来表示权重，从而能够在给出预测时同时给出关于这些预测的不确定性信息。

这一优势同时导致了其他方面的优势：

减少过拟合。这其实不难理解，正如被广泛使用的 dropout 一样，两者有一个共通之处 —— 随机性，dropout 体现在丢弃神经元的随机性上，而贝叶斯神经网络体现在权重取值的随机性上。
自适应（鲁棒）性。通过对权重的不确定性进行建模，可以更好地适应数据分布的变化，从而在面对数据或环境变化时更加 robust。

举一些实际例子：

与传统神经网络相比，贝叶斯神经网络在处理对抗性攻击时表现出更高的鲁棒性 (Uchendu, Campoy, Menart & Hildenbrandt, 2021)。
在经济领域，贝叶斯神经网络在预测任务中的表现优于传统神经网络。这在预测乳金融市场动态等方面得到了应用和验证 (Magris, Shabani & Iosifidis, 2022)。

当然，不确定性是一把双刃剑，并且它不一定会让模型表现得更加出色（免费午餐警告⚠），一些明显的缺点如下：

更高的计算成本和更长的训练时间。这一点在后文的代码实现中能够得到更清晰的解释。
内存需求。需要存储权重的分布而不是单个权重值。
可解释性问题。尽管贝叶斯神经网络提供预测的不确定性估计，但它们本身的结构和决策过程仍然是黑盒的。

关于贝叶斯神经网络的介绍，我强烈推荐以下博客文章：

A Gentle Introduction to Bayesian Deep Learning

这里面稍微涉及到了一些基础的数学知识，理解上并不困难，也是从更易于理解的角度出发进行介绍。

贝叶斯神经网络的代码实现

在很久之前贝叶斯神经网络需要自己通过 coding 实现，但现在已经有造好的轮子，所以我们只需要搬过来用即可。

blitz 安装

这里会用到 Python 的 blitz module，其具体源码和 document 可见：

https://github.com/piEsposito/blitz-bayesian-deep-learning

pip 安装命令：

1	pip install blitz-bayesian-pytorch

或者你可以使用 conda 完成:

1	conda install -c conda-forge blitz-bayesian-pytorch

亦或者你可以通过 clone 其 GitHub 仓库完成：

conda create -n blitz python=3.9
conda activate blitz
git clone https://github.com/piEsposito/blitz-bayesian-deep-learning.git
cd blitz-bayesian-deep-learning
pip install .

运行范例

这里使用经典的 LeNet 和 MNIST 数据集进行贝叶斯神经网络和传统神经网络的比较，以下代码将根据 blitz github repository 中的示例代码进行补充和修改，原代码可见：

https://github.com/piEsposito/blitz-bayesian-deep-learning/blob/master/blitz/examples/bayesian_LeNet_mnist.py

首先 import 需要的库：

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.datasets as dsets
import torchvision.transforms as transforms

from blitz.modules import BayesianLinear, BayesianConv2d
from blitz.losses import kl_divergence_from_nn
from blitz.utils import variational_estimator

下载并加载数据：

train_dataset = dsets.MNIST(root="./data",
                             train=True,
                             transform=transforms.ToTensor(),
                             download=True
                            )
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=64,
                                           shuffle=True)

test_dataset = dsets.MNIST(root="./data",
                             train=False,
                             transform=transforms.ToTensor(),
                             download=True
                            )
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                           batch_size=64,
                                           shuffle=True)

构建贝叶斯神经网络和传统神经网络：

@variational_estimator
class BayesianCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = BayesianConv2d(1, 6, (5,5))
        self.conv2 = BayesianConv2d(6, 16, (5,5))
        self.fc1   = BayesianLinear(256, 120)
        self.fc2   = BayesianLinear(120, 84)
        self.fc3   = BayesianLinear(84, 10)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out


class lenet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, (5,5))
        self.conv2 = nn.Conv2d(6, 16, (5,5))
        self.fc1   = nn.Linear(256, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

BayesianCNN 类前的 @variational_estimator 是一种装饰器，它用于拓展包含贝叶斯神经网络的类的功能（例如后面会提到的 .sample_elbo），同时与原先的 nn.Module 适配。
nn.Conv2d 层对应 BayesianConv2d，nn.Linear 对应 BayesianLinear。

贝叶斯神经网络训练及评估：

# Bayesian train & test

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier = BayesianCNN().to(device)
optimizer = optim.Adam(classifier.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
best_accuracy = 0
waiting = 0
iteration = 0
stop = False

for epoch in range(100):

    if stop:
        break
    for i, (datapoints, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = classifier.sample_elbo(
            inputs=datapoints.to(device),
            labels=labels.to(device),
            criterion=criterion,
            sample_nbr=3,
            complexity_cost_weight=1 / 50000,
        )
        # print(loss)
        loss.backward()
        optimizer.step()

        iteration += 1
        if iteration % 250 == 0:
            print(loss)
            correct = 0
            total = 0
            with torch.no_grad():
                for data in test_loader:
                    images, labels = data
                    outputs = classifier(images.to(device))
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels.to(device)).sum().item()

                accuracy = 100 * correct / total
                print(
                    "Iteration: {} | Accuracy of the network on the 10000 test images: {} %".format(
                        str(iteration), str(100 * correct / total)
                    )
                )

                if accuracy > best_accuracy + 0.005:
                    best_accuracy = accuracy
                    waiting = 0
                else:
                    waiting += 1

                if waiting >= 4:
                    print(f"The best accuracy is {best_accuracy} for bayesian nn.")
                    stop = True
                    break

这里把一百个 epoch 全跑完太消时，所以我使用了简陋的早停策略提前终止训练，需要关注的参数主要为：

.sample_elbo：这是 blitz 中特有的计算 loss 的功能，这里的 sample_nbr 可以当作计算多少次来进行一次梯度更新（因为每次计算都会有不一样的结果），而 complexity_cost_weight 相当于对模型的复杂性进行惩罚，该值越大，惩罚的力度越高（根据权重的先验分布和后验分布） 。它的源码如下：

def sample_elbo(self,
                inputs,
                labels,
                criterion,
                sample_nbr,
                complexity_cost_weight=1):

    """ Samples the ELBO Loss for a batch of data, consisting of inputs and corresponding-by-index labels
            The ELBO Loss consists of the sum of the KL Divergence of the model
             (explained above, interpreted as a "complexity part" of the loss)
             with the actual criterion - (loss function) of optimization of our model
             (the performance part of the loss).
            As we are using variational inference, it takes several (quantified by the parameter sample_nbr) Monte-Carlo
             samples of the weights in order to gather a better approximation for the loss.
        Parameters:
            inputs: torch.tensor -> the input data to the model
            labels: torch.tensor -> label data for the performance-part of the loss calculation
                    The shape of the labels must match the label-parameter shape of the criterion (one hot encoded or as index, if needed)
            criterion: torch.nn.Module, custom criterion (loss) function, torch.nn.functional function -> criterion to gather
                        the performance cost for the model
            sample_nbr: int -> The number of times of the weight-sampling and predictions done in our Monte-Carlo approach to
                        gather the loss to be .backwarded in the optimization of the model.

    """

    loss = 0
    for _ in range(sample_nbr):
        outputs = self(inputs)
        loss += criterion(outputs, labels)
        loss += self.nn_kl_divergence() * complexity_cost_weight
    return loss / sample_nbr

相关概念介绍（非必要内容）

KL 散度是一种衡量两个概率分布差异的方法，关于该概念的理解，可参见：

Kullback-Leibler(KL)散度介绍 from 知乎

在贝叶斯神经网络中，使用 KL 散度作为损失函数的一部分是为了实现变分推断。在这个背景下，网络的权重被视为随机变量，遵循某种概率分布。变分推断的目标是寻找一个简单的分布（变分后验分布），以近似复杂的真实后验分布。

贝叶斯神经网络的损失函数通常包含两部分：

性能损失（Performance Loss）：这部分通常是传统神经网络中使用的损失函数（如上述代码中的交叉熵），用于衡量模型预测与实际情况之间的差距。
复杂度损失（Complexity Loss）：这部分是 KL 散度，用于衡量变分后验分布与先验分布之间的差异。KL 散度越大，表示变分后验分布与先验分布差异越大，模型复杂度越高。

那么为什么变分后验分布与先验分布差异越大，模型复杂度越高？

这与贝叶斯推断的本质有关，在贝叶斯方法中，先验分布代表模型权重的初始假设，这时我们通常选择较为简单或宽泛的分布（例如没有信息的均匀分布或分布较广的正态分布等）。在进行训练后，训练数据会对参数估计产生影响，从而导致权重的后验分布产生变化。

这个情况下，后验分布与先验分布相差显著时，模型可能捕捉到了数据中的复杂特征和模式，从而提高了其拟合能力。然而，这也可能意味着模型捕捉到了数据中的噪声，导致过拟合。因此，上述代码中引入了新的超参数 complexity_cost_weight 对这一点进行了限制，以起到一定的正则化作用，保证模型的泛化能力。同时也不难看出，错误的先验决定也会影响到模型的训练和泛化。

关于后验分布更新的示例，可以看以下文章：

贝叶斯教你如何正确抛硬币 from 知乎

以抛硬币为例，正面向上的概率即 P(正面向上) 的先验分布可能是 Beta(1,1)，即一个均匀分布（没有任何信息量），之后我们根据观测数据来更新这个分布得到后验概率分布：

第一次抛出正面，后验分布变为 Beta(2,1)。
第二次抛出正面，后验分布变为 Beta(3,1)，此时的概率分布已变为向右倾斜（即正面向上的概率趋于更高）。
假设之后又抛了数次，总共出现正面 552 次，反面 539 次。
此时后验分布为 Beta(553, 540)，这个分布是轻微向右倾斜，但峰值很接近 0.5 且很窄。

关于贝塔分布的介绍，可见：深入理解Beta分布 from 知乎

你可以使用以下代码进行复现：

import matplotlib.pyplot as plt
from scipy.stats import beta
import numpy as np

a = [1, 2, 3, 553]
b = [1, 1, 1, 540]

fig, axes = plt.subplots(1, len(a), figsize=(len(a)*3, 3))

for i, (suba, subb) in enumerate(zip(a, b)):

    x = np.linspace(0, 1, 100)
    y = beta.pdf(x, suba, subb)

    axes[i].plot(x, y, 'r-', lw=2, alpha=0.6, label='beta pdf')
    axes[i].set_xlabel('x')
    axes[i].set_xlim(0, 1)
    axes[i].set_title(f'Beta({suba}, {subb})')

需要注意的是，complexity_cost_weight 的值将对结果产生很大的影响，在 blitz 的其他示例代码中可以看到这个值一般设为 1/batchsize，但是在 MNIST 数据集中，它使用了 1/50000 作为该超参数的值。经过实测，如果使用 1/batchsize，模型的准确率将在 10% 上下浮动，说明惩罚过严重，导致权重的后验分布无法灵活适应数据。

针对该超参数的选择，blitz 的开发者没有给出 guideline~~（很多相关 issue 发布多年都没有回复…..）~~。但从该库源码里 KL 散度的计算中可以看出，模型越复杂（贝叶斯层越多或神经元越多），KL 散度越高，最后的计算复杂度惩罚项越大。因此对于复杂的模型而言，调低该值或许是一个不错的选择。

训练和预测结果：

tensor(4.0596, device='cuda:0', grad_fn=<DivBackward0>)
Iteration: 250 | Accuracy of the network on the 10000 test images: 94.26 %
......
The best accuracy is 98.75 for bayesian nn.

再换成传统的 Lenet 试试：

# Lenet train & test

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier = lenet().to(device)
optimizer = optim.Adam(classifier.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
best_accuracy = 0
waiting = 0
iteration = 0
stop = False

for epoch in range(100):

    if stop:
        break
    for i, (datapoints, labels) in enumerate(train_loader):
        optimizer.zero_grad()

        datapoints = datapoints.cuda()
        labels = labels.cuda()

        pred = classifier(datapoints)
        loss = criterion(pred, labels)
        # print(loss)
        loss.backward()
        optimizer.step()

        iteration += 1
        if iteration % 250 == 0:
            print(loss)
            correct = 0
            total = 0
            with torch.no_grad():
                for data in test_loader:
                    images, labels = data
                    outputs = classifier(images.to(device))
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels.to(device)).sum().item()

                accuracy = 100 * correct / total
                print(
                    "Iteration: {} | Accuracy of the network on the 10000 test images: {} %".format(
                        str(iteration), str(100 * correct / total)
                    )
                )

                if accuracy > best_accuracy + 0.005:
                    best_accuracy = accuracy
                    waiting = 0
                else:
                    waiting += 1

                if waiting >= 4:
                    print(f"The best accuracy is {best_accuracy} for lenet nn.")
                    stop = True
                    break

tensor(0.2020, device='cuda:0', grad_fn=<NllLossBackward0>)
Iteration: 250 | Accuracy of the network on the 10000 test images: 91.49 %
......
The best accuracy is 99.01 for lenet nn.

从结果上看，贝叶斯神经网络并没有取得比传统神经网络更好的表现，但是两者仅相差几个千分点，也可能是训练的随机性造成的。不过也能看出，并不是越复杂的网络架构就一定会发挥的更好。

以上代码并没法看出预测的不确定性，接下来将以 blitz 的另一个示例展示其结果的波动：

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

from blitz.modules import BayesianLinear
from blitz.utils import variational_estimator

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(np.expand_dims(y, -1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.1,
                                                    random_state=42)


X_train, y_train = torch.tensor(X_train).float(), torch.tensor(y_train).float()
X_test, y_test = torch.tensor(X_test).float(), torch.tensor(y_test).float()

@variational_estimator
class BayesianRegressor(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        #self.linear = nn.Linear(input_dim, output_dim)
        self.blinear1 = BayesianLinear(input_dim, input_dim*2)
        self.blinear2 = BayesianLinear(input_dim*2, output_dim)
        # self.blinear1 = nn.Linear(input_dim, input_dim*2)
        # self.blinear2 = nn.Linear(input_dim*2, output_dim)

        
    def forward(self, x):
        x_ = self.blinear1(x)
        x_ = F.relu(x_)
        return self.blinear2(x_)


regressor = BayesianRegressor(8, 1)
optimizer = optim.Adam(regressor.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

ds_train = torch.utils.data.TensorDataset(X_train, y_train)
dataloader_train = torch.utils.data.DataLoader(ds_train, batch_size=32, shuffle=True)

ds_test = torch.utils.data.TensorDataset(X_test, y_test)
dataloader_test = torch.utils.data.DataLoader(ds_test, batch_size=32, shuffle=True)

for epoch in range(10):
    for i, (datapoints, labels) in enumerate(dataloader_train):
        optimizer.zero_grad()
        
        loss = regressor.sample_elbo(inputs=datapoints,
                           labels=labels,
                           criterion=criterion,
                           sample_nbr=3,
                           complexity_cost_weight=0)
        # loss = criterion(regressor(datapoints), labels)

        print(loss.item(), end='\r')
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1} finished!')

1 2	for _ in range(3): print(regressor(X_test).detach().numpy()[0:3])

[[-1.203408  ]
 [-0.32957393]
 [ 1.9057552 ]]
[[-1.2015951 ]
 [-0.32578745]
 [ 1.9102848 ]]
[[-1.2165544 ]
 [-0.33425388]
 [ 1.9049363 ]]

可以看到，三次运行输出的结果都是不一致的，通过以下命令我们可以获得预测的均值和标准差：

preds = [regressor(X_test) for _ in range(100)]
preds = torch.stack(preds)
means = preds.mean(axis=0).detach().numpy().squeeze()
stds = preds.std(axis=0).detach().numpy().squeeze()

这里需要注意，对于表格数据先需进行标准化，具体原因见注意事项。

注意事项

以上就是使用 blitz 实现贝叶斯神经网络的整体流程，同样地，其他的网络架构可以通过将 nn 层替换为对应的贝叶斯层实现：

BayesianLinear
BayesianConv1d
BayesianConv2d
BayesianConv3d
BayesianLSTM
BayesianGRU
BayesianEmbedding

使用贝叶斯神经网络时，有几点需要注意：

新增的超参数中，complexity_cost_weight 非常重要，该值的选择可能会极大影响模型的表现。当模型越复杂时，该值应当越低。
要获得结果的不确定性需多次运行模型，这对于模型过于复杂的任务来说可能极为耗时。因此需综合考虑自身需求和时间成本。
对于计算机视觉领域中的图片分类问题等，不需要对输入数据进行特别处理。但对于结构化数据（例如上述的 fetch_california_housing 数据集），需要注意特征的尺度是否一致，如果不一致则需要对其进行标准化处理。由于贝叶斯层在初始化时，对于每个权重的处理是一致的，因此特征的尺度不同将极大影响后续权重的更新，例如尺度更小的特征可能得到更大的权重，同时相应的复杂度惩罚也就越严重。对于图片分类，每个通道的尺度相同，因此不存在这方面的问题。

一些有趣的问题

问 ChatGPT 相同的问题时，它会给出不同的回答。这是为什么？是因为它的模型架构中引入了贝叶斯方法吗？
除了通过对权重进行采样引入正则化以外，贝叶斯神经网络还在其他哪些地方体现出了可以提升模型表现的性质？
你认为分类问题更需要贝叶斯神经网络还是回归问题更需要？为什么？

参考资料

Robustness of Bayesian Neural Networks to White-Box Adversarial Attacks
Bayesian Bilinear Neural Network for Predicting the Mid-price Dynamics in Limit-Order Book Markets