Generalized Linear Model

Generalized Linear Models (GLMs)

GLMs are a family of machine learning algorithms; both least-squares linear regression (LMS) and logistic regression (LR) are special cases of a GLM.

Properties of GLMs

  1. The exponential family

    The distribution of every GLM can be written in the following exponential-family form: \(p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))\)

    where \(\eta\) is called the natural (or canonical) parameter, \(T(y)\) is the sufficient statistic, and \(a(\eta)\) is the log-partition function (a worked Gaussian example follows this list).

  2. Assumptions for constructing a GLM

    • \(y|x;\theta \sim \mathrm{ExponentialFamily}(\eta)\)

    • Given \(x\), our goal is to predict the expected value of \(T(y)\); that is, we want the hypothesis \(h_{\theta}(x)\) to satisfy \(h_\theta(x) = E[T(y)|x]\)

    • The natural parameter \(\eta\) depends linearly on \(x\): \(\eta = \theta^Tx\)
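
As a quick worked example of the exponential-family form above (a standard one, with the variance fixed to 1), the Gaussian used later for linear regression can be rewritten as \[ p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)\exp\left(\mu y - \frac{\mu^2}{2}\right) \] so that \(b(y) = \frac{1}{\sqrt{2\pi}}\exp(-y^2/2)\), \(T(y) = y\), \(\eta = \mu\), and \(a(\eta) = \eta^2/2\).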

Examples of GLMs

Below we work through some common GLM examples: linear regression, logistic regression, and the Softmax regression multi-class classifier.

  1. Linear regression

    Linear regression is a common method for predicting continuous values. Suppose there are \(m\) samples, each with \(n\) features, i.e. \(X\in R^{m \times n}\). The regression problem can then be written as \[ y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \] where \(\theta\) is the weight vector of the linear model to be fitted and \(\epsilon^{(i)}\) is the fitting error.

    Assume the fitting errors are independent and identically distributed (IID) according to a normal distribution, i.e. \(\epsilon^{(i)} \sim N(\mu,\delta^2)\). Further assume \(\mu = 0\) (any constant offset can be absorbed into an intercept term), so that \[ p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\delta}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\delta^2}\right) \]

    \[ p(y^{(i)} | x^{(i)},\theta) = \frac{1}{\sqrt{2\pi}\delta}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\delta^2}\right) \]

    The likelihood function of this distribution is then: \[ L(\theta) = \prod_{i=1}^m p(y^{(i)} | x^{(i)},\theta) \]

    \[ \log{L(\theta)} = \sum_{i=1}^m \log{p(y^{(i)} | x^{(i)},\theta)} = -m\log(\sqrt{2\pi}\delta) - \sum_{i=1}^m \frac{(y^{(i)} - \theta^Tx^{(i)})^2}{2\delta^2} \]

    To obtain a good fit we perform maximum likelihood estimation: \[ \max_{\theta}{\log{L(\theta)}} \] Combining this with the expression above, and noting that the value of \(\delta\) does not affect the solution, this is equivalent to \[ \min_{\theta}{\sum_{i=1}^m (y^{(i)} - \theta^Tx^{(i)})^2} \] so the fitting procedure reduces to a least-squares problem, as sketched in the code below.
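
    A minimal NumPy sketch of this least-squares fit (synthetic data; the variable names are illustrative, not from the text above), using the closed-form normal-equation solution \(\theta = (X^TX)^{-1}X^Ty\):

    import numpy as np

    # Synthetic data: m = 100 samples, n = 3 features
    m, n = 100, 3
    X = np.random.randn(m, n)
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * np.random.randn(m)   # y^{(i)} = theta^T x^{(i)} + epsilon^{(i)}

    # Least-squares fit via the normal equation (solve X^T X theta = X^T y)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta_hat)   # should be close to theta_true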

  2. Logistic regression

    Logistic regression handles binary classification. Suppose we are given \(m\) samples, each with \(n\) features, and every sample belongs to one of two classes, say \(\{0,1\}\). Assume the label follows a Bernoulli distribution, i.e. \[ p(y=1|\phi) = \phi, \quad p(y=0|\phi) = 1- \phi \] so that \[ p(y|\phi) = \phi^y(1-\phi)^{1-y} \]

    \[ p(y|\phi) = \exp{(y\log{\phi} + (1-y)\log{(1-\phi)})} \]

    \[ p(y|\phi) = \exp{(y\log{\frac{\phi}{1-\phi}} + \log{(1-\phi)})} \]

    Constructing the GLM gives: \[ T(y) = y, \quad b(y) = 1, \quad \eta = \log{\frac{\phi}{1-\phi}}, \quad \phi = \frac{1}{1+e^{-\eta}}, \quad a(\eta) = \log{(1+e^{\eta})} \] With \(\eta^{(i)} = \theta^T x^{(i)}\), this yields \[ \phi^{(i)} = \frac{1}{1+e^{-\theta^T x^{(i)}}} \] To solve for the parameter \(\theta\), we follow the same procedure as for linear regression (a NumPy sketch follows this list):

    • Likelihood: \(L(\theta) = \prod_{i=1}^{m}((\phi^{(i)})^{y^{(i)}} (1-\phi^{(i)})^{(1-y^{(i)})})\)

    • Log-likelihood: \(l(\theta) = \log{L(\theta)} = \sum_{i=1}^{m}(y^{(i)}\log{\phi^{(i)}}+(1-y^{(i)})\log{(1-\phi^{(i)})})\)

    • Maximize the log-likelihood, e.g. by gradient ascent. The gradient used at each iteration is: \[ \frac{\partial{l(\theta)}}{\partial{\theta_j}} = \left(y\frac{1}{\phi} - (1-y)\frac{1}{1-\phi}\right) \phi(1-\phi)x_j \]

      \[ \frac{\partial{(l(\theta))}}{\partial{\theta_j}} = (y - \phi)x_{j} \]
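
    A minimal NumPy sketch of this gradient-ascent update (synthetic data; the variable names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Synthetic binary-classification data: m = 200 samples, n = 2 features
    m, n = 200, 2
    X = np.random.randn(m, n)
    y = (X[:, 0] + X[:, 1] > 0).astype(float)

    theta = np.zeros(n)
    alpha = 0.1                                  # learning rate
    for _ in range(1000):
        phi = sigmoid(X @ theta)                 # phi^{(i)} = 1 / (1 + exp(-theta^T x^{(i)}))
        theta += alpha * X.T @ (y - phi) / m     # ascent step using the gradient (y - phi) x_j
    print(theta)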

  3. Softmax regression

    Softmax regression addresses multi-class classification. Let \[ y\in \{1,2,3,\ldots,k\} \] denote the class of a sample.

    Let \(\phi_i\) denote the probability that a sample belongs to class \(i\), i.e. \(p(y=i;\phi) = \phi_i\).

    Since \(\sum_{i=1}^{k}\phi_i = 1\), we can set \(\phi_{k} = 1-\sum_{i=1}^{k-1}\phi_i\).

    To express the multinomial distribution as a member of the exponential family, we introduce \(T(y)\), where \[ T(1) = \left(\begin{array}{c} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{array} \right), T(2) = \left(\begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array} \right), \ldots, T(k-1) = \left(\begin{array}{c} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{array} \right), T(k) = \left(\begin{array}{c} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{array} \right) \] Note that \(T(y) \neq y\) and \(T(y) \in R^{k-1}\); moreover \((T(y))_i = 1\{y=i\}\), where the indicator \(1\{\cdot\}\) equals 1 when its argument is true and 0 otherwise.

    Then \(E[(T(y))_i] = p(y=i) = \phi_i\), and \[ p(y;\phi) = \phi_1^{1\{y=1\}}\phi_2^{1\{y=2\}}\cdots\phi_k^{1\{y=k\}} \]

    \[ p(y;\phi) = \exp({1\{y=1\}}\log{\phi_1}+\ldots+{1\{y=k\}}\log{\phi_k}) \]

    \[ p(y;\phi) = \exp({1\{y=1\}}\log{\phi_1}+\ldots+(1-\sum_{i=1}^{k-1}{1\{y=i\}})\log{\phi_k}) \]

    \[ p(y;\phi) = \exp(\sum_{i=1}^{k-1}{1\{y=i\}}\log{\phi_i}+(1-\sum_{i=1}^{k-1}{1\{y=i\}})\log{\phi_k}) \]

    \[ p(y;\phi) = \exp(\sum_{i=1}^{k-1}{1\{y=i\}}\log{(\phi_i/\phi_k)}+\log{\phi_k}) \]

    \[ p(y;\phi) = \exp(\sum_{i=1}^{k-1}{(T(y))_i}\log{(\phi_i/\phi_k)}+\log{\phi_k}) \]

    \[ p(y;\eta) = b(y)\exp(\eta^TT(y) - a(\eta)) \]

    This gives \(b(y)=1\), \[ \eta = \left(\begin{array}{c} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \log(\phi_3/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{array} \right) \] and \(a(\eta) = -\log(\phi_k)\).

    Link function (\(\eta\) as a function of the distribution parameter): \(\eta_i = \log(\phi_i/\phi_k)\)

    Response function (the distribution parameter as a function of \(\eta\), also known as the softmax function): \(\phi_i = e^{\eta_i}/\sum_{j=1}^k{e^{\eta_j}}\)

    With \(\eta_i=\theta_i^Tx\),

    we obtain \(\phi_i = \exp(\theta_i^Tx)/\sum_{j=1}^k{\exp(\theta_j^Tx)}\).

    The hypothesis function is then: \[ h_{\theta}(x) = E[T(y)|x;\theta] = \left(\begin{array}{c} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_{k-1} \end{array} \right) = \left(\begin{array}{c} \frac{\exp(\theta_1^Tx)}{\sum_{j=1}^k{\exp(\theta_j^Tx)}} \\ \frac{\exp(\theta_2^Tx)}{\sum_{j=1}^k{\exp(\theta_j^Tx)}} \\ \frac{\exp(\theta_3^Tx)}{\sum_{j=1}^k{\exp(\theta_j^Tx)}} \\ \vdots \\ \frac{\exp(\theta_{k-1}^Tx)}{\sum_{j=1}^k{\exp(\theta_j^Tx)}} \end{array} \right) \] and the log-likelihood is: \[ l(\theta) = \sum_{i=1}^{m}\log{p(y^{(i)}|x^{(i)};\theta)} = \sum_{i=1}^{m}\log\prod_{l=1}^k\left(\frac{\exp(\theta_l^Tx^{(i)})}{\sum_{j=1}^k{\exp(\theta_j^Tx^{(i)})}}\right)^{1\{y^{(i)}=l\}} \] Maximizing this log-likelihood yields the optimal parameters and hence the fitted model; a small softmax sketch follows below.
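
    A small NumPy sketch of the softmax response function used in \(h_\theta(x)\) (computed over all \(k\) classes; the max-subtraction is a standard numerical-stability trick that does not change the result; names are illustrative):

    import numpy as np

    def softmax(scores):
        # scores: vector of theta_l^T x, one entry per class l
        scores = scores - np.max(scores)   # numerical stability; leaves the ratios unchanged
        e = np.exp(scores)
        return e / np.sum(e)

    # Illustrative example with k = 3 classes and n = 4 features
    Theta = np.random.randn(3, 4)          # one parameter vector theta_l per class
    x = np.random.randn(4)
    phi = softmax(Theta @ x)               # phi_l = exp(theta_l^T x) / sum_j exp(theta_j^T x)
    print(phi, phi.sum())                  # probabilities summing to 1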

Step by step: softmax classification on the MNIST dataset

Step 1: Load the sample dataset

Load the MNIST sample data:

from tensorflow.examples.tutorials.mnist import input_data
# Download (if necessary) and load MNIST; one_hot=True encodes each label as a 10-dim one-hot vector
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Note: if you hit a "file is not gzip file" error when running the code above, the downloaded data is corrupted. Edit the source file [Where tensorflow located]/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py and change SOURCE_URL to:

SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'
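
As a quick sanity check once loading succeeds (the shapes below are what the standard TensorFlow MNIST split is expected to produce), the arrays can be inspected:

print(mnist.train.images.shape)   # expected: (55000, 784), flattened 28x28 images
print(mnist.train.labels.shape)   # expected: (55000, 10), one-hot labels
print(mnist.test.images.shape)    # expected: (10000, 784)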

Step 2: Construct the GLM (Softmax)

Construct the hypothesis function (the Softmax classification model): \(y = h_{\theta}(x) = \mathrm{softmax}(Wx + b)\)

import tensorflow as tf
x = tf.placeholder(tf.float32, [None, 784])   # input: batches of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))          # weight matrix, one column per class
b = tf.Variable(tf.zeros([10]))               # per-class bias
y = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted class probabilities

Step 3: Train the GLM (Softmax)

To find the optimal parameters of the hypothesis, we need an objective function for the optimization problem. Here we use the cross-entropy between the predicted distribution \(y = h_{\theta}(x)\) and the true one-hot label distribution \(y^{\prime}\):

\(H_{y^{\prime}}(y) = -\sum_i y_i^{\prime}\log(y_i)\)

y_ = tf.placeholder(tf.float32, [None, 10])   # true one-hot labels
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

Alternatively, you can compute it directly with the function TensorFlow provides: cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y)), where in that case y = W*x + b should be the raw logits, i.e. without applying softmax first.
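
A minimal sketch of that alternative (reusing the x, W, b, and y_ defined above; only the loss construction changes):

logits = tf.matmul(x, W) + b     # unnormalized scores; no softmax applied here
y = tf.nn.softmax(logits)        # probabilities, still available for prediction/evaluation
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))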

Define the optimization method (gradient descent):

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)   # learning rate 0.5

Run the optimization in TensorFlow (stochastic gradient descent):

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for _ in range(1000):
    # Sample a mini-batch of 100 examples and take one gradient-descent step
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

Step 4: Model evaluation

# Compare the predicted class (argmax of y) with the true class (argmax of the one-hot y_)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# Evaluate accuracy on the held-out test set
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))