Bayesian estimation is the Bayesian school's main approach to estimating unknown parameters. Compared with the frequentist school, the central Bayesian viewpoint is that the unknown quantity is a random variable: before the sample is drawn, the unknown quantity already has a distribution of its own, the so-called prior distribution.
Bayesian estimation therefore introduces a prior distribution for the unknown quantity, combines this prior information with the population and sample information used by the frequentist school, obtains the posterior distribution of the unknown quantity, and then carries out statistical inference based on that posterior.

Whether the unknown quantity can be regarded as a random variable was debated between the classical (frequentist) school and the Bayesian school for a long time; eventually this viewpoint was gradually accepted by the classical school. Today the debate between the two schools has shifted to how to use various kinds of prior information to determine a reasonable prior distribution.

Bayesian estimation

For the unknown parameter $\theta$, assume that it has a (prior) distribution $\pi(\theta)$.

The distributions of the population and of the sample both depend on the parameter. The likelihood $L(\boldsymbol{X}\mid\theta)=\prod_i f(x_i\mid\theta)$ (the joint conditional pdf of $\boldsymbol{X}$ given $\Theta=\theta$) combines with the prior $\pi(\theta)$ to give the joint pdf of $\boldsymbol{X}$ and $\theta$:

$$g(\boldsymbol{X},\theta)=L(\boldsymbol{X} \mid \theta)\,\pi(\theta)=\prod_i f(x_{i} \mid \theta)\, \pi(\theta)$$

Then the marginal pdf of $\boldsymbol{X}$ is:

$$g_1(\boldsymbol{X})=\int_{\Theta} g(\boldsymbol{X}, \theta)\, \mathrm{d}\theta=\int_{\Theta} L(\boldsymbol{X} \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta$$

Combining the prior with the population and sample information, we can make inferences about the distribution of the unknown parameter; this is the posterior pdf, the conditional pdf of $\Theta$ given $\boldsymbol{X}$:

$$\pi(\theta \mid \boldsymbol{X})=\frac{g(\boldsymbol{X}, \theta)}{g_1(\boldsymbol{X})}=\frac{L(\boldsymbol{X} \mid \theta)\, \pi(\theta)}{\int_{\Theta} L(\boldsymbol{X} \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta}$$

Because the posterior density integrates to 1 and the denominator does not depend on $\theta$, the form of the posterior is determined by the numerator alone, so we write:

$$\pi(\theta \mid \boldsymbol{X}) \propto L(\boldsymbol{X} \mid \theta)\, \pi(\theta)$$
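As a minimal numeric sketch of this relation (the data, prior hyperparameters, and grid size below are illustrative assumptions, not part of the text above), the posterior can be approximated on a grid for Bernoulli data with a Beta(2, 2) prior: multiply the likelihood by the prior pointwise and normalize numerically, which plays the role of dividing by $g_1(\boldsymbol{X})$.

```python
import numpy as np

# Grid approximation of pi(theta | X) ∝ L(X | theta) * pi(theta)
# for Bernoulli data with a Beta(2, 2) prior (illustrative choices).
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)                        # simulated sample, "true" theta = 0.7

theta = np.linspace(0.001, 0.999, 999)                   # grid over the parameter space
dtheta = theta[1] - theta[0]
prior = theta ** (2 - 1) * (1 - theta) ** (2 - 1)        # Beta(2, 2) kernel
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

unnormalized = likelihood * prior                        # numerator L(X | theta) * pi(theta)
posterior = unnormalized / (unnormalized.sum() * dtheta) # normalize: divide by g_1(X)

print("posterior mean ≈", (theta * posterior).sum() * dtheta)
```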

decision theory

Statistical decision theory is concerned with the making of decisions in the presence of statistical knowledge.

  • We assume that the uncertain true state of the system can be represented by a numerical quantity, denoted by $\theta$, in the Parameter Space $\Theta$.
  • At the same time, we need to define an action $\alpha$ in the Action Space $\mathcal{A}$.

A key element of decision theory is specifying the loss of an action given a state of nature, so a loss function $\mathcal{L}(\theta,\alpha)$ is defined.

Note: Considering the uncertainty about the true state of the system (the Bayesian view) or the unknowability of the Population (the frequentist view), we tend to define risk functions as expectations of the loss function under different distributions.
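Just to fix names for these ingredients, here is a tiny sketch in code (purely illustrative; the function names are not from any library): a state of nature $\theta$, an action $\alpha$, and two common loss functions.

```python
# Illustrative sketch of the decision-theoretic ingredients: the true state theta,
# an action alpha, and a loss function L(theta, alpha) measuring the cost of the action.

def squared_error_loss(theta: float, alpha: float) -> float:
    """L(theta, alpha) = (theta - alpha)^2, the usual choice for point estimation."""
    return (theta - alpha) ** 2

def absolute_error_loss(theta: float, alpha: float) -> float:
    """L(theta, alpha) = |theta - alpha|, a more robust alternative."""
    return abs(theta - alpha)

print(squared_error_loss(0.7, 0.5), absolute_error_loss(0.7, 0.5))
```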

point estimation

Mainly based on Introduction to Mathematical Statistics by Robert V. Hogg (University of Iowa), Joseph W. McKean (Western Michigan University), and Allen T. Craig (University of Iowa).

From the view of statistical decision theory, performing point estimation amounts to selecting a decision function $\delta$, which for frequentists is a function of the sample.
From the Bayesian viewpoint, performing point estimation likewise amounts to selecting a decision function $\delta$; $\delta(\boldsymbol{X})$ is a predicted value of $\theta$ (an experimental value of the random variable $\Theta$). We then only need a loss function $\mathcal{L}(\theta,\delta(\boldsymbol{X}))$ to evaluate it.

Frequentist Risk

The loss depends on the sample, which varies according to the distribution of the Population, so to evaluate a decision function we compute the expectation of the loss function over the Population; this is the frequentist risk.

$$R(\theta, \delta)=\mathrm{E}_{\boldsymbol{X}}[\mathcal{L}(\theta, \delta(\boldsymbol{X}))]$$
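As a sanity check on this definition, the frequentist risk can be approximated by Monte Carlo for an assumed model (all choices below are illustrative): $X_i \sim N(\theta, 1)$, $\delta(\boldsymbol{X})$ the sample mean, and squared-error loss, for which the risk should come out close to $1/n$.

```python
import numpy as np

# Monte Carlo approximation of R(theta, delta) = E_X[ L(theta, delta(X)) ]
# for delta(X) = sample mean, squared-error loss, X_i ~ N(theta, 1) (illustrative model).
rng = np.random.default_rng(1)
theta, n, reps = 2.0, 25, 100_000

samples = rng.normal(theta, 1.0, size=(reps, n))   # repeated samples from the population
losses = (samples.mean(axis=1) - theta) ** 2       # L(theta, delta(X)) for each sample

print("estimated risk:", losses.mean())            # should be close to 1/n = 0.04
```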

Ideally, we would use the strategy of minimizing this expected loss, i.e. choosing

$$\delta^{*}=\argmin_{\delta}\, \mathrm{E}_{\boldsymbol{X}}[\mathcal{L}(\theta, \delta(\boldsymbol{X}))]=\argmin_{\delta} \int \mathcal{L}(\theta, \delta(\boldsymbol{x}))\, L(\boldsymbol{x} \mid \theta)\, \mathrm{d}\boldsymbol{x}$$

In most cases, we cannot find an "ideal" decision function satisfying the above equation for every value of the unknown parameter $\theta$, so we need decision principles to help us make decisions. The most common one is the minimax principle: we define the minimax decision function $\delta^{*}$ as below:

$$\sup_{\theta \in \Theta} R\left(\theta, \delta^{*}\right)=\inf_{\delta \in \mathcal{D}}\, \sup_{\theta \in \Theta} R\left(\theta, \delta\right)$$

In other words, a decision function $\delta^{*}$ is a minimax decision function if it minimizes the maximum risk over all possible values of $\theta$.
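A quick numeric illustration, under assumptions not stated above: for estimating a Bernoulli parameter $p$ from $n$ trials under squared-error loss, the classical constant-risk estimator $(S+\sqrt{n}/2)/(n+\sqrt{n})$ is minimax, and its worst-case risk is smaller than that of the sample mean.

```python
import numpy as np

# Compare worst-case (maximum) risk of two estimators of a Bernoulli parameter p
# under squared-error loss. delta1 = sample mean; delta2 = (S + sqrt(n)/2) / (n + sqrt(n)),
# the classical constant-risk (minimax) estimator. n is an illustrative choice.
n = 25
p = np.linspace(0.0, 1.0, 1001)

risk_mean = p * (1 - p) / n                                      # R(p, sample mean)
risk_minimax = np.full_like(p, n / (4 * (n + np.sqrt(n)) ** 2))  # constant risk of delta2

print("max risk, sample mean :", risk_mean.max())     # attained at p = 1/2
print("max risk, minimax rule:", risk_minimax.max())  # smaller worst-case risk
```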

Bayes Expected Loss

In the Bayesian view, $\theta$ has its own distribution, so we can compute the expected loss of a decision function $\delta$:

$$\rho(\delta)=\mathrm{E}^{\pi}[\mathcal{L}(\theta, \delta(\boldsymbol{X}))]$$

We call this the Bayes expected loss, and we call an action $\delta$ a Bayes action if it minimizes the Bayes expected loss.

For a no-data problem, $R(\theta, \delta)=\mathcal{L}(\theta, \delta)$, so the risk reduces to the loss itself.
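For a concrete no-data example (the prior and loss below are illustrative assumptions), $\rho(a)=\mathrm{E}^{\pi}[\mathcal{L}(\theta, a)]$ can be approximated by Monte Carlo over a grid of actions; under squared-error loss the Bayes action should come out close to the prior mean.

```python
import numpy as np

# Bayes expected loss for a no-data problem: rho(a) = E^pi[ L(theta, a) ],
# with theta ~ Beta(2, 5) (illustrative prior) and squared-error loss.
rng = np.random.default_rng(2)
theta = rng.beta(2, 5, size=200_000)           # draws from the prior

actions = np.linspace(0.0, 1.0, 501)
expected_loss = [((theta - a) ** 2).mean() for a in actions]

best = actions[int(np.argmin(expected_loss))]  # the Bayes action on this grid
print("Bayes action ≈", best, " prior mean =", 2 / 7)
```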

Bayes Risk

From the Bayesian viewpoint, we can take the expectation of the risk function over the prior distribution of $\theta$:

$$r(\pi, \delta)=\mathrm{E}^{\pi}[R(\theta, \delta)]$$

We call this the Bayes risk, and the decision function that minimizes the Bayes risk is called a Bayes rule.
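The sketch below estimates the Bayes risk by Monte Carlo for an assumed Beta–Bernoulli setup (prior, sample size, and loss are illustrative): draw $\theta$ from the prior, draw a sample given $\theta$, and average the loss. The posterior mean shows a smaller Bayes risk than the sample mean, which anticipates the squared-error result stated next.

```python
import numpy as np

# Monte Carlo estimate of the Bayes risk r(pi, delta) = E^pi[ R(theta, delta(X)) ]
# for a Beta(2, 2) prior, n Bernoulli observations, and squared-error loss (illustrative).
rng = np.random.default_rng(3)
a, b, n, reps = 2, 2, 20, 200_000

theta = rng.beta(a, b, size=reps)        # theta ~ pi(theta)
s = rng.binomial(n, theta)               # number of successes in X given theta

delta_mle = s / n                        # sample mean
delta_bayes = (a + s) / (a + b + n)      # posterior mean (the Bayes rule here)

print("Bayes risk, sample mean    :", ((delta_mle - theta) ** 2).mean())
print("Bayes risk, posterior mean :", ((delta_bayes - theta) ** 2).mean())
```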

If $\mathcal{L}=(\theta-\delta)^2$ (squared-error loss), then the Bayes rule is $\delta(\boldsymbol{X})=\mathrm{E}(\Theta\mid \boldsymbol{X})$, the posterior mean.
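To see why squared-error loss leads to the posterior mean, it is enough to minimize the posterior expected loss; setting its derivative with respect to $\delta$ to zero gives:

$$\frac{\partial}{\partial \delta}\int_{\Theta} (\theta-\delta)^{2}\, \pi(\theta \mid \boldsymbol{X})\,\mathrm{d}\theta = -2\int_{\Theta} (\theta-\delta)\, \pi(\theta \mid \boldsymbol{X})\,\mathrm{d}\theta = 0 \;\Longrightarrow\; \delta=\int_{\Theta} \theta\, \pi(\theta \mid \boldsymbol{X})\,\mathrm{d}\theta=\mathrm{E}(\Theta \mid \boldsymbol{X})$$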

Generalizing this to the estimation of a specified function of $\theta$, say $l(\theta)$, we obtain the Bayes estimator of $l(\theta)$:

$$\delta(\boldsymbol{X})=\argmin_{\delta}\, \mathrm{E}^{\pi}[R(l(\theta), \delta(\boldsymbol{X}))]$$

Under squared-error loss, this Bayes estimator is $\delta(\boldsymbol{X})=\mathrm{E}[\,l(\Theta)\mid \boldsymbol{X}\,]$.
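As a final sketch (the prior, data, and the choice $l(\theta)=\theta(1-\theta)$ are illustrative assumptions), the Bayes estimate of $l(\theta)$ under squared-error loss can be approximated by averaging $l(\theta)$ over posterior draws:

```python
import numpy as np

# Bayes estimator of l(theta) under squared-error loss: delta(X) = E[ l(Theta) | X ].
# Illustrative setup: Beta(2, 2) prior, 20 Bernoulli observations, l(theta) = theta * (1 - theta).
rng = np.random.default_rng(4)
a, b = 2, 2
x = rng.binomial(1, 0.3, size=20)

post_a = a + x.sum()                                  # conjugate Beta posterior parameters
post_b = b + len(x) - x.sum()
theta_post = rng.beta(post_a, post_b, size=200_000)   # posterior draws

print("Bayes estimate of l(theta):", (theta_post * (1 - theta_post)).mean())
```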