Estimating the Bernouilli Parameter, a non-Bayesian Max Entropy Method

Introduction and Result

A maximum entropy alternative to Bayesian methods for the estimation of independent Bernouilli sums.

Let X=\{x_1,x_2,\ldots, x_n\}, where x_i \in \{0,1\} be a vector representing an n sample of independent Bernouilli distributed random variables \sim \mathcal{B}(p). We are interested in the estimation of the probability p.

We propose that the probablity that provides the best statistical overview, p_m (by reflecting the maximum ignorance point) is

p_m= 1-I_{\frac{1}{2}}^{-1}(n-m, m+1), (1)

where m=\sum_i^n x_i and I_.(.,.) is the beta regularized function.

Comparison to Alternative Methods

EMPIRICAL: The sample frequency corresponding to the “empirical” distribution p_s=\mathbb{E}(\frac{1}{n} \sum_i^n x_i), which clearly does not provide information for small samples.

BAYESIAN: The standard Bayesian approach is to start with, for prior, the parametrized Beta Distribution p \sim Beta(\alpha,\beta), which is not trivial: one is contrained by the fact that matching the mean and variance of the Beta distribution constrains the shape of the prior. Then it becomes convenient that the Beta, being a conjugate prior, updates into the same distribution with new parameters. Allora, with n samples and m realizations:

p_b \sim Beta(\alpha+m, \beta+n-m) (2)

with mean p_b = \frac{\alpha +m}{\alpha +\beta +n}. We will see below how a low variance beta has too much impact on the result.


Let F_p(x) be the CDF of the binomial \mathcal{B} in(n,p). We are interested in \{ p: F_p(x)=q \} the maximum entropy probability. First let us figure out the target value q.

To get the maximum entropy probability, we need to maximize H_q=-\left(\;q \; log(q) +(1-q)\; log (1-q)\right). This is a very standard result: taking the first derivative w.r. to q, \log (q)+\log (1-q)=0, 0\leq q\leq 1 and since H_q is concave to q, we get q =\frac{1}{2}.

Now we must find p by inverting the CDF. Allora for the general case,

p= 1-I_{\frac{1}{2}}^{-1}(n-x,x+1).

And note that as in the graph below (thanks to comments below by überstatistician Andrew Gelman), we can have a “confidence band” (sort of) with

p_\alpha= 1-I_{\alpha}^{-1}(n-x,x+1) ;

in the graph below the band is for values of: \alpha= \frac{1}{2}, \frac{3}{4}.

Application: What can we say about a specific doctor or center’s error rate based on n observations?

Case (Real World): A thoraxic surgeon who does mostly cardiac and lung transplants (in addition to emergency bypass and aortic ruptures) operates in a business with around 5% perioperative mortality. So far in his new position in the U.S. he has done 60 surgeries with 0 mortality.

What can we reasonable say, statistically, about his error probability?

Note that there may be selection bias in his unit, which is no problem for our analysis: the probability we get is conditional on being selected to be operated on by that specific doctor in that specific unit.

Assuming independence, we are concerned with Y = 0, 1, \ldots,n a binomially distributed r.v. \sim \mathcal{B}(n,p) where n is the number of trials and p is the probability of failure per trial. Clearly, we have no idea what p and need to produce our best estimate conditional on, here, y=0.

Here applying (1) with m=0 and n=60, we have p=0.01148.

Why is this preferable to a Bayesian approach when, say, n is moderately large?

A Bayesian would start with a prior expectation of, say .05, and update based on information. But it is highly arbitrary. Since the mean is \frac{\alpha}{\alpha +\beta}, we can eliminate one parameter. Let us say we start with Beta(\alpha, 19 \alpha ) and have no idea of the variance. As we can see in the graph below there are a lot of shapes to the possible distribution: it becomes all in the parametrization.


Thanks to Saar Wilf for useful discussions.

Estimating the Bernouilli Parameter, a non-Bayesian Max Entropy Method

4 thoughts on “Estimating the Bernouilli Parameter, a non-Bayesian Max Entropy Method

  1. Andrew Gelman says:

    It’s not clear you can come up with much of a “best estimate” from data with y=0. Maybe all that can be given in general (without using some prior information on p) are reasonable bounds. I like the Agresti-Coull procedure which gives a reasonable 95% interval. I’ve used this in a consulting project (in that case, the data were y=75 out of n=75) and put it in Regression and Other Stories.

    Liked by 1 person

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s