1 Definition of Fisher Information
Fisher information measures the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$, where the probability of $X$ depends on $\theta$. Let $f(X;\theta)$ be the probability density function of $X$ parameterized by $\theta$. If $f$ exhibits sharp peaks and valleys as $\theta$ varies, it is easy to identify the correct value of $\theta$ from the data; in other words, the data $X$ provide a great deal of information about the parameter $\theta$. If instead $f$ is relatively flat as a function of $\theta$, then many more samples of $X$ are needed to estimate $\theta$.
Formally, the partial derivative with respect to $\theta$ of the natural logarithm of the likelihood function is called the score:

$$S=\frac{\partial}{\partial\theta}\log f(X;\theta).$$

Under certain regularity conditions, if $\theta$ is the true parameter (i.e., $X$ is actually distributed according to $f(X;\theta)$), then the expected value of the score evaluated at the true parameter value $\theta$ is zero:

$$\begin{aligned}\mathbb{E}\left[\left.\frac{\partial}{\partial\theta}\log f(X;\theta)\right|\theta\right]&=\int_{\mathbb{R}}\frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)}f(x;\theta)\,dx\\&=\frac{\partial}{\partial\theta}\int_{\mathbb{R}}f(x;\theta)\,dx\\&=\frac{\partial}{\partial\theta}1=0.\end{aligned}$$

The Fisher information is then defined as the variance of the score $S$:
$$\mathcal{I}(\theta)=\mathbb{E}\left[\left.\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right|\theta\right]=\int_{\mathbb{R}}\left(\frac{\partial}{\partial\theta}\log f(x;\theta)\right)^2 f(x;\theta)\,dx.$$

From this formula it is immediate that $\mathcal{I}(\theta)\ge 0$. A random variable carrying high Fisher information means that the absolute value of the score is typically large. The Fisher information is not a function of any particular observation, since the random variable $X$ has been averaged out. If $f(x;\theta)$ is twice differentiable with respect to $\theta$, the Fisher information can also be written as

$$\mathcal{I}(\theta)=-\mathbb{E}\left[\left.\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right|\theta\right],$$

because

$$\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)=\frac{\frac{\partial^2}{\partial\theta^2}f(X;\theta)}{f(X;\theta)}-\left(\frac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right)^2=\frac{\frac{\partial^2}{\partial\theta^2}f(X;\theta)}{f(X;\theta)}-\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2,$$

while under the same regularity conditions

$$\mathbb{E}\left[\left.\frac{\frac{\partial^2}{\partial\theta^2}f(X;\theta)}{f(X;\theta)}\right|\theta\right]=\frac{\partial^2}{\partial\theta^2}\int_{\mathbb{R}}f(x;\theta)\,dx=0.$$

Taking expectations in the first identity and substituting the second gives $\mathbb{E}\left[\left.\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right|\theta\right]=-\mathcal{I}(\theta)$, which yields the alternative form of the Fisher information and completes the proof.
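As a quick numerical sanity check of these identities, the following sketch (the unit-variance normal model is an assumption for illustration, not part of the text above) simulates the score of $N(\mu,1)$: its mean should be near $0$, and its second moment, the Fisher information, should be near $1$, matching the second-derivative form $-\mathbb{E}[-1]=1$.

```python
import numpy as np

# Sketch (assumed model): for X ~ N(mu, 1),
#   log f(x; mu) = -(x - mu)^2 / 2 - log(sqrt(2*pi)),
# so the score is d/dmu log f = x - mu and d^2/dmu^2 log f = -1.
# Hence E[score] = 0 and I(mu) = E[score^2] = 1 = -E[-1].
rng = np.random.default_rng(0)
mu = 2.0
x = rng.normal(mu, 1.0, size=1_000_000)

score = x - mu                       # score of each draw
mean_score = score.mean()            # should be close to 0
fisher_info = (score ** 2).mean()    # should be close to 1
print(mean_score, fisher_info)
```

With a million draws the Monte Carlo error is on the order of $10^{-3}$, so both estimates land very close to their theoretical values.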
2 Derivation of the Cramér–Rao Bound
The Cramér–Rao bound states that the reciprocal of the Fisher information is a lower bound on the variance of any unbiased estimator of $\theta$. Consider an unbiased estimator $\hat{\theta}(X)$ of $\theta$; unbiasedness can be written as

$$\mathbb{E}\left[\left.\hat{\theta}(X)-\theta\right|\theta\right]=\int\left(\hat{\theta}(x)-\theta\right)f(x;\theta)\,dx=0.$$

Since this expression equals zero for every value of $\theta$, its partial derivative with respect to $\theta$ must also be $0$. By the product rule, this partial derivative equals

$$0=\frac{\partial}{\partial\theta}\int\left(\hat{\theta}(x)-\theta\right)f(x;\theta)\,dx=\int\left(\hat{\theta}(x)-\theta\right)\frac{\partial f}{\partial\theta}\,dx-\int f\,dx.$$

For each $\theta$, the likelihood function is a probability density, so $\int f\,dx=1$; furthermore,

$$\frac{\partial f}{\partial\theta}=f\,\frac{\partial\log f}{\partial\theta}.$$

Combining these two facts gives

$$\int\left(\hat{\theta}-\theta\right)f\,\frac{\partial\log f}{\partial\theta}\,dx=1.$$

Next, factor the integrand as

$$\int\left(\left(\hat{\theta}-\theta\right)\sqrt{f}\right)\left(\sqrt{f}\,\frac{\partial\log f}{\partial\theta}\right)dx=1.$$

Squaring the integral and applying the Cauchy–Schwarz inequality yields

$$1=\left(\int\left[\left(\hat{\theta}-\theta\right)\sqrt{f}\right]\cdot\left[\sqrt{f}\,\frac{\partial\log f}{\partial\theta}\right]dx\right)^2\le\left[\int\left(\hat{\theta}-\theta\right)^2 f\,dx\right]\cdot\left[\int\left(\frac{\partial\log f}{\partial\theta}\right)^2 f\,dx\right].$$

The second bracketed factor is precisely the Fisher information, while the first bracketed factor is the expected mean squared error of the estimator $\hat{\theta}$, which equals its variance since $\hat{\theta}$ is unbiased. It follows that

$$\mathrm{Var}(\hat{\theta})\ge\frac{1}{\mathcal{I}(\theta)}.$$

In other words, the precision with which $\theta$ can be estimated is fundamentally limited by the Fisher information of the likelihood function.
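The bound can be illustrated numerically. In the following sketch (the normal model and the sample-mean estimator are assumptions for illustration, not part of the derivation above), the sample mean of $n$ i.i.d. draws from $N(\mu,\sigma^2)$ is unbiased, the Fisher information of the whole sample about $\mu$ is $n/\sigma^2$, and the estimator's variance $\sigma^2/n$ attains the bound exactly.

```python
import numpy as np

# Sketch (assumed model): the sample mean of n i.i.d. N(mu, sigma^2) draws
# is an unbiased estimator of mu with variance sigma^2 / n, which equals
# the Cramer-Rao bound 1 / (n * I(mu)) since I(mu) = 1 / sigma^2 per draw,
# i.e. the bound is tight for this estimator.
rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.5, 2.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
theta_hat = samples.mean(axis=1)   # one estimate per replication

crb = sigma ** 2 / n               # Cramer-Rao bound for the full sample
emp_var = theta_hat.var()          # empirical variance across replications
print(emp_var, crb)
```

The empirical variance across replications matches the bound $\sigma^2/n = 0.08$ up to Monte Carlo error, while a less efficient unbiased estimator (e.g., the sample median) would sit strictly above it.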
3 Matrix Form
Given an $N\times 1$ parameter vector $\theta=[\theta_1,\theta_2,\cdots,\theta_N]^{\top}$, the Fisher information takes the form of an $N\times N$ matrix. This matrix is called the Fisher information matrix, with entries

$$[\mathcal{I}(\theta)]_{i,j}=\mathbb{E}\left[\left.\left(\frac{\partial}{\partial\theta_i}\log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_j}\log f(X;\theta)\right)\right|\theta\right].$$

The Fisher information matrix is an $N\times N$ positive semidefinite matrix. Under certain regularity conditions, it can also be written as

$$[\mathcal{I}(\theta)]_{i,j}=-\mathbb{E}\left[\left.\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f(X;\theta)\right|\theta\right].$$