Wikipedia:Reference desk/Archives/Mathematics/2011 January 18

Mathematics desk
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


January 18


Calculating the probability that two binomial distributions are in order


Given samples from two binomial distributions (that is, two series of Bernoulli trials), I want to estimate the probability that one latent probability exceeds the other (equivalently, p(X>Y)). The two samples have different sizes, sample sizes may be small, and probabilities are extremely small (much less than 1%). For example, one sample might have 10 trials and no successes, and the other 100 trials and one success.

I have tried approximating the binomials with normal distributions, and using the fact that the difference between two normal distributions is a normal distribution where the mean is the difference of the means, and the variance is the sum of the variances.[1] Because of the small probabilities (and the significant likelihood that one or both samples have zero successes), I used the Agresti-Coull formula (add two successes and two failures). Unfortunately, not only does this produce implausibly high estimates of p, it also often predicts with high probability that distributions are in reverse order of their unadjusted means, which seems counter-intuitive.

As I understand it, the principle behind Agresti-Coull is that it's pulling all the samples towards p=0.5. Although each of my samples is small, I have enough samples to be able to estimate a good overall mean for their latent probabilities. Is there a normal approximation technique that can use that?

Does anyone have any suggestions? Thanks, Bovlb (talk) 07:04, 18 January 2011 (UTC)[reply]
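For reference, here is a minimal sketch in Python of the approach described in the question (normal approximation with the Agresti-Coull adjustment); the function names are illustrative and SciPy is assumed:

import math
from scipy.stats import norm

def agresti_coull(successes, trials):
    # Agresti-Coull adjustment: add two successes and two failures.
    n = trials + 4
    p = (successes + 2) / n
    return p, math.sqrt(p * (1 - p) / n)

def prob_first_exceeds_second(x1, n1, x2, n2):
    # The difference of two (approximately) normal variables is normal,
    # with mean mu1 - mu2 and variance sigma1^2 + sigma2^2.
    m1, s1 = agresti_coull(x1, n1)
    m2, s2 = agresti_coull(x2, n2)
    return norm.sf(0, loc=m1 - m2, scale=math.hypot(s1, s2))

# The example from the question: 10 trials/0 successes vs 100 trials/1 success.
print(prob_first_exceeds_second(0, 10, 1, 100))  # ~0.885

With the example data this gives about 0.885, the implausibly high estimate discussed below.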

That X has a binomial distribution with parameters n and p means that the probability that X = x, knowing that N = n and p < P < p + dp, is

\Pr(X=x \mid N=n,\ P=p) = \binom{n}{x} p^x (1-p)^{n-x}.

Using Bayes' formula for conditional probability you get

f(p \mid N=n,\ X=x)\,dp = \frac{\Pr(X=x \mid N=n,\ P=p)\,dp}{\int_0^1 \Pr(X=x \mid N=n,\ P=q)\,dq}

= \frac{\binom{n}{x} p^x (1-p)^{n-x}\,dp}{\int_0^1 \binom{n}{x} q^x (1-q)^{n-x}\,dq}

= (n+1)\binom{n}{x} p^x (1-p)^{n-x}\,dp.

This means that P has a beta distribution with mean value

\mu = \frac{x+1}{n+2}

and standard deviation

\sigma = \sqrt{\frac{\mu(1-\mu)}{n+3}}.

Now you can approximate the P's with normally distributed variables.

Bo Jacoby (talk) 08:12, 18 January 2011 (UTC).[reply]
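As an illustrative sketch (not Bo's code), the recipe above in Python, applied to the example data from the question:

import math
from scipy.stats import norm

def beta_posterior_moments(x, n):
    # Uniform prior + x successes in n trials => P ~ Beta(x+1, n-x+1),
    # with mean (x+1)/(n+2) and variance mu(1-mu)/(n+3).
    mu = (x + 1) / (n + 2)
    return mu, math.sqrt(mu * (1 - mu) / (n + 3))

mu_p, sd_p = beta_posterior_moments(0, 10)    # ~0.0833 +/- 0.0767
mu_q, sd_q = beta_posterior_moments(1, 100)   # ~0.0196 +/- 0.0137

# Normal approximation to Pr(P > Q):
z = (mu_p - mu_q) / math.hypot(sd_p, sd_q)
print(norm.cdf(z))  # ~0.793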

Let X, Y be independent binomial variables with counts n_X, n_Y and unknown probabilities p_X, p_Y. Then \Pr(p_X > p_Y) is not equivalent to \Pr(X > Y) (or \Pr(X/n_X > Y/n_Y), or anything of the sort), so you should clarify what you're after.
Now, the posterior probability that p_X > p_Y after doing some experiments crucially depends on the prior distributions of p_X and p_Y. Bo has kept implicit his assumption that the prior is uniform, which is contrary to your stated knowledge about the subject, that you expect the probabilities to be very small.
If you think p_X is around p, a good way to model your knowledge is a beta distribution with parameters \alpha and \beta = \alpha(1-p)/p (tweaking \alpha corresponds to varying levels of confidence). After observing a successes and b failures, the parameters are updated to \alpha + a and \beta + b. You can do the same for p_Y, and then you have the posterior distributions for p_X and p_Y. You can use whatever technique to find the probability that one is greater than the other. I'm not sure that the normal distribution is a good approximation for the beta.
Note that if the number of experiments is too small in relation to the probabilities involved and the confidence of your prior, nothing you will do will give a meaningful result. -- Meni Rosenfeld (talk) 10:44, 18 January 2011 (UTC)[reply]
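A sketch of this update rule, with a Monte Carlo estimate of the posterior probability that p_X > p_Y (the choice alpha = 1 below is only an illustration):

import numpy as np

def informative_beta_prior(p, alpha):
    # Beta prior with mean p; larger alpha means higher confidence.
    return alpha, alpha * (1 - p) / p

rng = np.random.default_rng(0)
a0, b0 = informative_beta_prior(p=0.001, alpha=1.0)

# Update with the observed successes and failures for each variable.
aX, bX = a0 + 0, b0 + 10   # 0 successes in 10 trials
aY, bY = a0 + 1, b0 + 99   # 1 success in 100 trials

# Sample the two posteriors and compare.
pX = rng.beta(aX, bX, size=1_000_000)
pY = rng.beta(aY, bY, size=1_000_000)
print((pX > pY).mean())  # ~0.27 for these numbers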

The OP wants "to estimate the probability that one latent probability exceeds the other". That is \Pr(P>Q), not \Pr(X>Y). The only source of information regarding the probability P is the observation (N, X). Before the observation is made you know nothing about P, and the prior distribution must be f(p) = 1 for 0 \le p \le 1, which is the special case n = x = 0 of the above general formula for f(p \mid n, x). Similarly, the prior distribution for X is \Pr(X=x \mid N=n) = 1/(n+1), because there are n+1 a priori equally probable outcomes, x = 0, 1, \ldots, n. The stated knowledge about the subject that you expect the probabilities to be very small is posterior knowledge, not prior knowledge. Even if the normal distribution may not be a good approximation for P and Q, it is a fair approximation for P-Q. If you want the exact value for \Pr(P>Q), you may evaluate the double integral of the product of the two beta distributions.

\Pr(P>Q) = \int_0^1 \int_q^1 f_P(p)\, f_Q(q)\,dp\,dq, where f_P(p) = 11(1-p)^{10} and f_Q(q) = 10100\, q(1-q)^{99}. The normal approximation to P-Q gives a z-score for zero of about -0.783, so \Pr(P>Q) \approx 0.783.

This is a perfectly meaningful result. Bo Jacoby (talk) 13:01, 18 January 2011 (UTC).[reply]

The OP said "... equivalently, p(X>Y)", so I noted that they are not equivalent.
It is wrong to assume that we have no prior knowledge about p before conducting experiments. If, for example, the OP is talking about the probability that a certain drug will cure a person of a disease, he may be able to run a computer simulation of the effects and achieve an estimate for its effectiveness before the drug has ever been administered a single time. I have already discussed this with you in the past; I don't know why you insist on carrying on with this mistake.
From the OP's description it seemed that he does have such prior knowledge, but he can of course provide a clarification about his situation. -- Meni Rosenfeld (talk) 13:42, 18 January 2011 (UTC)[reply]

The OP did not define X and Y, so we cannot conclude that he did not mean what he wrote: "the probability that one latent probability exceeds the other". If the OP has prior information he should tell us, so when he did not tell us we are entitled to assume that he has no prior information. When he asks, we are entitled to assume that he does not know the answer. Consider the question: "What is the probability that a tossed coin shows head?". You answer 50%, and then you are told: "No, it is 100%, because I just tossed it, and it showed head, as a matter of fact!" Isn't that cheating? This OP tells us that "For example, one sample might have 10 trials and no successes, and the other 100 trials and one success". We are entitled to assume that this is the kind of information available: N trials and X successes. Bo Jacoby (talk) 17:48, 18 January 2011 (UTC).[reply]

You're right, the OP didn't define X and Y. I did define what I meant with the notations to make it clear what it is that I say is not equivalent. If the point is moot with respect to what the OP had in mind that's great.
The OP didn't seem like he was aware of the subtleties of Bayesian statistics and how to model and incorporate his prior knowledge. He might have committed the common fallacy of assuming that any prior knowledge he has must be tossed aside once he starts doing "experiments". He asked us for our help, which means we need to think one step further and realize he might have useful information he doesn't even know is useful.
If you treat this as a textbook problem then it might be reasonable to assume no information when none is specified. But I'm treating this as an attempt to help the OP with a real-world problem he has.
No, I am not wrong on this matter. I don't know by which metric you measure a "better" computation, but my objections lead to a more accurate computation which will give more helpful results for what the OP is trying to do. -- Meni Rosenfeld (talk) 19:30, 18 January 2011 (UTC)[reply]
Oh, and unrelated to this: I've just noticed I missed an important detail, that the OP has several small samples for each variable. The correct thing to do is to merge the samples - treat it as if you have one big sample with the total counts and the total successes. There is absolutely no useful information about the probability in the distribution between the samples. So if the total count is enough to reliably estimate the probability then the result is not very sensitive to the prior. -- Meni Rosenfeld (talk) 19:46, 18 January 2011 (UTC)[reply]

Thanks for your assistance. It's wonderful to fall asleep on a hard problem and wake up to find people trying to help you. Perhaps the best way for me to clarify is to explain a little more of the background to my problem.

I have a computer system that contains a feature that suggests "drilldown" actions to the user. There is an algorithm that selects which actions to offer as a function of the user context. Each of my samples represents a pair of (user context, suggested action). This Cartesian product is large, which explains the small sample sizes. Over 99% of the time, the user is happy with the main content or selects a static action, and therefore rejects all suggested actions. I can observe that the sample p's tend to cluster around that (low) probability as N rises. Aggregate samples for each user context, and for each suggested action, show similar effects.

The data I have gathered is not fairly distributed across the Cartesian product because it is determined by the algorithm's past selections, but it is fairly broad because the algorithm has changed over time, there is some deliberate randomness in the system, and there are external reasons why some actions may be unavailable at some times.

My goal is to evaluate the success of the algorithm (past, present, and hypothetical future) in placing the suggested actions in the correct order, and hence showing the best ones to the user. I plan to do this by considering all ordered pairs of actions in the (extended) selection list for a specific user context, and calculating the probability that each pair is in the correct order. Because of the paucity of the data, each individual comparison is unlikely to tell me much, but my hope is that they will tell me more in the aggregate. I can then see how the effectiveness of the algorithm changes over time, and do hypothesis testing on proposed algorithms without having to perform user testing on each one.

I am aware that there are certain confounders in the data gathering, mainly related to the fact that I show several suggested actions at once. User selection choices will be affected by: order of presentation; collocation of specific actions; the unlikelihood that a user will break linear navigation and select two actions from one suggestion set; and the number of suggested actions. Nevertheless, I am attempting to model the events as independent Binomial distributions.

One question for Bo: I'm probably missing something obvious, but could you please explain how you calculated the values for \mu and \sigma in your worked example?

Thanks, Bovlb (talk) 21:34, 18 January 2011 (UTC)[reply]
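For what it's worth, one way the pairwise-ordering evaluation described above might look in Python; the data layout and the common prior here are invented for illustration:

import numpy as np

def prob_correct_order(xa, na, xb, nb, alpha, beta, rng, draws=100_000):
    # Monte Carlo Pr(p_a > p_b) where the algorithm ranked action a above b,
    # under a shared Beta(alpha, beta) prior for the latent probabilities.
    pa = rng.beta(alpha + xa, beta + na - xa, size=draws)
    pb = rng.beta(alpha + xb, beta + nb - xb, size=draws)
    return (pa > pb).mean()

rng = np.random.default_rng(0)
# Hypothetical (successes, trials) per suggested action, in ranked order.
ranked = [(0, 10), (1, 100), (0, 50)]
scores = [prob_correct_order(*ranked[i], *ranked[j], 1.0, 999.0, rng)
          for i in range(len(ranked)) for j in range(i + 1, len(ranked))]
print(np.mean(scores))  # aggregate ordering score for this user context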

Sorry, I made errors which I have corrected below.
Bo Jacoby (talk) 22:45, 18 January 2011 (UTC)[reply]

Thanks. That was the calculation I tried to replicate. So, even if the two distributions are not closely approximated by normal distributions, their difference is, and I can generate that normal difference distribution by treating them like normal distributions. If zero has a Z-score of -0.783 in that distribution, then the upper tail (coincidentally) has probability 0.783, which is the probability that my suggestions are in the correct order. My Agresti-Coull approximation gave me:

10 trials and no successes means \tilde p_1 = (0+2)/(10+4) \approx 0.14286, \sigma_1 = \sqrt{\tilde p_1(1-\tilde p_1)/14} \approx 0.09352,
100 trials and one success means \tilde p_2 = (1+2)/(100+4) \approx 0.02885, \sigma_2 = \sqrt{\tilde p_2(1-\tilde p_2)/104} \approx 0.01641.
Distribution of difference is approximately normal with mean 0.14286 - 0.02885 = 0.11401 and standard deviation \sqrt{0.09352^2 + 0.01641^2} \approx 0.09495.
Z-score of zero in difference distribution is (0 - 0.11401)/0.09495 \approx -1.2007.
The probability is therefore \Phi(1.2007) \approx 0.88507.

But I see now that it's all about the priors. If I observe that my experiments taken in the aggregate have an overall mean with high confidence, and that individual experiments cluster around that mean as N increases, then I ought to be able to use that distribution as a prior when modelling the result of an individual experiment. (Strictly, my prior should be based on the overall data minus the data for the experiment in question, which remains posterior.)

Plugging a prior p of, say, 0.001 into Meni's formula (taking \alpha = 1, so \beta = \alpha(1-p)/p = 999) gives me:

\alpha_X = 1 + 0 = 1
\beta_X = 999 + 10 = 1009
\mu_X = \frac{\alpha_X}{\alpha_X+\beta_X} = \frac{1}{1010} \approx 0.000990
\sigma_X = \sqrt{\frac{\alpha_X \beta_X}{(\alpha_X+\beta_X)^2 (\alpha_X+\beta_X+1)}} \approx 0.000989

\alpha_Y = 1 + 1 = 2
\beta_Y = 999 + 99 = 1098
\mu_Y = \frac{2}{1100} \approx 0.001818
\sigma_Y \approx 0.001284

And combining them as if normal distributions, per Bo, gives me:

\mu_{X-Y} = \mu_X - \mu_Y \approx -0.000828
\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2} \approx 0.001621
z\text{-score of zero} = \frac{0 - \mu_{X-Y}}{\sigma_{X-Y}} \approx 0.51039
The resulting probability that the two distributions are in the correct order is therefore about 0.30489

Does that seem right?

Thanks again, Bovlb (talk) 01:01, 19 January 2011 (UTC)[reply]
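A quick cross-check of these numbers (sketch, using SciPy), which also shows how far the normal approximation is from an exact calculation with such skewed posteriors:

import math
from scipy.stats import beta, norm

aX, bX = 1, 1009   # Beta(1, 999) prior updated with 0/10
aY, bY = 2, 1098   # Beta(1, 999) prior updated with 1/100

z = (beta.mean(aX, bX) - beta.mean(aY, bY)) / math.hypot(beta.std(aX, bX),
                                                         beta.std(aY, bY))
print(norm.cdf(z))  # ~0.305, the normal approximation

# Exact: X ~ Beta(1, bX) has Pr(X > q) = (1-q)^bX, so
# Pr(X > Y) = E[(1-Y)^bX] = B(aY, bY+bX) / B(aY, bY).
print(math.exp(math.lgamma(bY + bX) + math.lgamma(aY + bY)
               - math.lgamma(bY) - math.lgamma(aY + bY + bX)))  # ~0.272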

Here is a graph of my new prior distribution with p=0.1%, shown from zero to 0.2%, with the mean marked in red, plus the distributions of the two experiments. The "zero out of ten" experiment is almost indistinguishable from the prior. Bovlb (talk) 02:03, 19 January 2011 (UTC)[reply]

Note that the formula for mean and standard deviation of the sum of independent random variables

\mu_{X+Y} = \mu_X + \mu_Y, \qquad \sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}

does not rely on the distributions being normal.

The prior information is conveniently translated into values of trials n and successes x. Instead of the observed n = 10 and x = 0 you use n = 1009 and x = 0 in the formula for the mean and standard deviation of P. But why assume that you have this prior observation n = 999 and x = 0 when actually you don't? The prior value p = 0.001 is merely guesswork contaminating your observations.

"The resulting probability that the two distributions are in the correct order is therefore about 0.30489" does not seem quite right. Even if the z-score 0.51039 is correct, the probability 0.30489 relies on the distribution being normal to five significant digits, which is not the case. The correct formula is the double integration above. Bo Jacoby (talk) 05:02, 19 January 2011 (UTC)[reply]

[to OP] Note that you shouldn't double-count evidence. If your "prior" really is the aggregation of your data, then you can't use it for that same data. You can use it only for new data.
I think you were speaking of using datapoints 2-1000 as a prior for datapoint 1, datapoints 1,3-1000 for 2, 1-2,4-1000 for 3 and so on. That's wrong since you're counting each datapoint 999 times. -- Meni Rosenfeld (talk) 09:45, 19 January 2011 (UTC)[reply]

Minor detail: Bovlb wrote

\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_X^2}

meaning

\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}

Bo Jacoby (talk) 23:01, 19 January 2011 (UTC).[reply]

Thanks Bo, I corrected the subscript. I wasn't sure what the SD sum rule assumed about the distribution. I take the point about the probability estimate not being accurate to 5 digits.
Also, I take the point about the prior and the double counting. I guess the fix for that is to split off a random portion of the experimental data, develop a prior from that, and then use it for the remaining data.
Thanks again for the assistance. Bovlb (talk) 17:35, 20 January 2011 (UTC)[reply]
Doesn't that just postpone the problem? When you take a portion of your experimental result and want to develop a "prior" from it, you need to have a prior for that calculation. And so forth ad infinitum. In the end you'll have to make a mathematically arbitrary choice of prior based on knowledge about the situation you're ultimately modelling. And once you do that, the sane approach is to use that prior in a single-step estimation involving all of your experimental results. –Henning Makholm (talk) 17:43, 20 January 2011 (UTC)[reply]
What Henning said. A prior is just what the name implies - it's what you have before you start doing experiments. You can spend some time thinking about what you know a priori and modeling it as a distribution, but you're definitely not supposed to go out of your way to find the "prior distribution" from your experimental data.
If you really don't know anything about the parameters then use a noninformative prior, like the uniform distribution that Bo advocates.
It is possible, and mathematically equivalent, to start with some prior, update it with a part of your data, and use the posterior of the first batch as a prior for the second batch. But at best this leads to no difference, and at worst can cause confusion and mistakes.
Now that you've described what you're doing, it seems to me too complicated for our simple beta-distribution model. -- Meni Rosenfeld (talk) 09:55, 23 January 2011 (UTC)[reply]
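The equivalence mentioned above is easy to see with conjugate beta-binomial updates; a small sketch (the prior values are illustrative):

# Beta(a, b) prior + s successes and f failures => Beta(a+s, b+f) posterior,
# so chaining batch updates gives the same result as one combined update.
a, b = 1.0, 999.0
batches = [(0, 10), (1, 40), (0, 50)]   # (successes, failures) per batch

a1, b1 = a, b
for s, f in batches:                    # sequential updating
    a1, b1 = a1 + s, b1 + f

a2 = a + sum(s for s, _ in batches)     # single combined update
b2 = b + sum(f for _, f in batches)
assert (a1, b1) == (a2, b2)             # identical posteriors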

The probability that one latent probability (P) exceeds the other (Q), when one sample has 10 trials and no successes and the other 100 trials and one success, is 2525/3108 ≈ 0.81242.

I made http://www.wolframalpha.com compute this result from

(100+1)(Binomial[100,1])(Integrate[q (1-q)^99 (1-q)^11, {q, 0, 1}])

and the intermediate result (1-q)^11 from

(10+1)(Binomial[10,0])(Integrate[(1-p)^10, {p, q, 1}])

Bo Jacoby (talk) 22:43, 22 January 2011 (UTC).[reply]
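The inner integral gives Pr(P > q) = (1-q)^11, and the remaining integral is a beta function, \int_0^1 q(1-q)^{110}\,dq = B(2, 111) = 1/(111 \cdot 112), so the exact rational value can also be checked without WolframAlpha (sketch in Python):

from fractions import Fraction
from math import comb

# Pr(P > Q) = (100+1) * C(100,1) * Integral_0^1 q (1-q)^(99+11) dq
#           = 10100 * B(2, 111) = 10100 / (111 * 112)
result = Fraction((100 + 1) * comb(100, 1), 111 * 112)
print(result, float(result))  # 2525/3108 ~ 0.81242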

If you start with a uniform prior then your result is correct. In Mathematica this is
Probability[P > Q, {P \[Distributed] BetaDistribution[1, 11], Q \[Distributed] BetaDistribution[2, 100]}]
P and Q both have a prior expectation of 0.5. P, which has 0/10 successes, has a posterior expectation of 1/12 ≈ 0.083 - there is not enough evidence to shift significantly from the prior. For Q, which has 1/100 successes, this is 1/51 ≈ 0.020 - the empirical mean is higher, but it has more evidential weight to push down from the prior. Thus it is natural that P is most likely to be greater than Q. -- Meni Rosenfeld (talk) 09:24, 23 January 2011 (UTC)[reply]

Thank you for your confirmation and explanation of this (for me) counter-intuitive result!

Yes, I do start with N = 0, X = 0, meaning that the prior probability distribution is uniform between 0 and 1:

f(p) = 1 \quad\text{for } 0 \le p \le 1.

The information N = 10, X = 0 gives

P = \frac{0+1}{10+2} \pm \frac{1}{12}\sqrt{\frac{1 \cdot 11}{13}} \approx 0.0833 \pm 0.0767

The information N = 100, X = 1 gives

Q = \frac{1+1}{100+2} \pm \frac{1}{102}\sqrt{\frac{2 \cdot 100}{103}} \approx 0.0196 \pm 0.0137

(and not what I erroneously wrote earlier. I hope it is correct now.)

\mu_{P-Q} = 0.0833 - 0.0196 = 0.0637
\sigma_{P-Q} = \sqrt{0.0767^2 + 0.0137^2} \approx 0.0779
z = \mu_{P-Q}/\sigma_{P-Q} \approx 0.8183
\Pr(P>Q) \approx \Phi(0.8183)

Bo Jacoby (talk) 21:25, 23 January 2011 (UTC). The normal distribution approximation gives Pr(P>Q)=(1+erf(0.8183/sqrt(2)))/2=0.7934 to be compared to the more precise value 0.81242 above. Bo Jacoby (talk) 16:27, 24 January 2011 (UTC).[reply]
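A simulation cross-check of the two figures (illustrative sketch):

import numpy as np

rng = np.random.default_rng(1)
P = rng.beta(1, 11, size=2_000_000)    # posterior for 0/10, uniform prior
Q = rng.beta(2, 100, size=2_000_000)   # posterior for 1/100, uniform prior
print((P > Q).mean())  # ~0.8124 exact, vs ~0.7934 from the normal approximation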

"sgn" in Latex


What package do I need to use to get a "\sgn" command (for the sign function) in LaTeX?

I've googled, but as far as I can make out, it doesn't exist! — Preceding unsigned comment added by Damian Eldridge (talkcontribs) 07:07, 18 January 2011 (UTC)[reply]

You can add it yourself:
\newcommand{\sgn}{\mathrm{sgn}}
-- Meni Rosenfeld (talk) 09:49, 18 January 2011 (UTC)[reply]
You shouldn't use plain \mathrm for named operators; it does not get the spacing right. The proper definition is
\newcommand\sgn{\mathop{\mathrm{sgn}}\nolimits}
or, better yet, if you use the amsmath package:
\DeclareMathOperator\sgn{sgn}
Emil J. 12:45, 18 January 2011 (UTC)[reply]
I always use operatorname for this sort of thing:
 \newcommand{\sgn}{\operatorname{sgn}}

I think that's equivalent to \DeclareMathOperator, though. Tinfoilcat (talk) 16:03, 18 January 2011 (UTC)[reply]
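A minimal test document for comparing the spacing of the definitions above (assumes the amsmath package is available):

\documentclass{article}
\usepackage{amsmath}
\DeclareMathOperator\sgn{sgn}      % operator spacing handled automatically
\newcommand\sgnrm{\mathrm{sgn}}    % plain \mathrm, for comparison
\begin{document}
% the first form gets correct thin spaces around "sgn":
$a \sgn b$ versus $a \sgnrm b$
\end{document}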

Topology - Sequential spaces and sub sequential spaces


Hi! I seem to have difficulties in understanding the difference between sequential spaces and subsequential spaces. Can someone give maybe an example of a topological space which is subsequential but not sequential? Thanx! Topologia clalit (talk) 13:47, 18 January 2011 (UTC)[reply]

As far as I understand it, the whole point is that subsequential spaces are a different kind of space: they are not necessarily topological spaces. A subsequential space whose convergence relation induces an actual topology is in fact sequential.—Emil J. 14:23, 18 January 2011 (UTC)[reply]
But that just moves the question to: What is an example of a convergence relation that does not induce a topology? –Henning Makholm (talk) 14:37, 18 January 2011 (UTC)[reply]
There's an example in Johnstone's cited paper: the set {x,y,z} with every sequence converging to both x and y, and a sequence converging to z iff it takes the value x finitely many times. The induced topology (in which a set is closed if it's closed under taking limits of sequences) is the trivial topology, but this is not the convergence relation of the trivial topology. Algebraist 15:12, 18 January 2011 (UTC)[reply]

Thanks Algebraist, but I'm not sure I understand. Is this a finite space of 3 points? I mean, isn't an infinite sequence supposed to have an infinite number of distinct elements? Also, how can a sequence converge to two points x and y? Doesn't convergence mean that there is one limit? Topologia clalit (talk) 16:31, 18 January 2011 (UTC)[reply]

A sequence in a space is just a function from the natural numbers into the space. It can certainly take a given value more than once. There's nothing in the axioms of a subsequential space that require a sequence to have at most one limit. Algebraist 16:50, 18 January 2011 (UTC)[reply]
Indeed, under the trivial topology, everything is a limit of anything. Also, the page TC linked to even says that ambiguous limits are necessary for a counterexample: "A subsequential space is said to be sequentially Hausdorff if each sequence converges to at most one limit. ... every sequentially Hausdorff subsequential space is necessarily a sequential space." –Henning Makholm (talk) 17:13, 18 January 2011 (UTC)[reply]

I see your point now; I should have thought of it myself, I guess. Thanks guys! Topologia clalit (talk) 13:04, 19 January 2011 (UTC)[reply]

Math game


Is there a website where they use a game to increase your simple mathematics and arithmetic skills like fractions, decimals, percentage and addition, subtraction, multiplication and division? —Preceding unsigned comment added by 174.89.43.48 (talk) 19:30, 18 January 2011 (UTC)[reply]

Google is your friend. Search on the terms you used as the header of this section.—msh210 22:15, 18 January 2011 (UTC)[reply]

Mathematics and language


This question is related to this one (but not the same, I want a mathematical perspective). As a mathematician, what math knowledge would you use to analyze something like language? Which software would you use?--Quest09 (talk) 20:24, 18 January 2011 (UTC)[reply]

That's a really broad question. "Language" encompasses a huge range of phenomena and areas of study, and "something like language" is even broader. All sorts of areas of math might be useful. A few that immediately spring to mind include Markov chains, probability and statistics (including, for example, Bayes' theorem), and graph theory. One might also use something like Fourier analysis to analyze spoken language. The software really depends on what you're doing—it could be anything. Other than obviously useless software such as games, I would think any kind of software could be potentially useful to a mathematician studying language. If you want a more specific answer, you'll have to provide a more specific question, I think. —Bkell (talk) 20:45, 18 January 2011 (UTC)[reply]
Yes, but I suppose you would exclude fields like geometry and even analysis upfront, wouldn't you? — Preceding unsigned comment added by Quest09 (talkcontribs) 11:27, 19 January 2011 (UTC)[reply]
Fourier series are based on trigonometry, not far from geometry. And fractal geometry could also have some application to language analysis. 80.58.205.34 (talk) 11:39, 19 January 2011 (UTC)[reply]
Off the top of my head, I can't think of a way in which geometry would be directly useful, but I am definitely not confident enough to say it should be excluded; I'm probably just not imaginative enough. I certainly would not exclude analysis. I already mentioned an application of Fourier analysis above. Also, if you want to model something like the spread of slang words, or really any other facet of the evolution or spread of language, then you will probably be using some kind of mathematical modeling, which is a broad field that can use all kinds of ideas from analysis and differential equations. —Bkell (talk) 06:41, 20 January 2011 (UTC)[reply]
Natural language processing might be a good place to start.--Salix (talk): 22:24, 18 January 2011 (UTC)[reply]
You might also get some useful ideas from our articles on computational linguistics and quantitative linguistics, and perhaps even mathematical sociology. —Bkell (talk) 06:44, 20 January 2011 (UTC)[reply]
More advanced math has shown relevance as well. René Thom, the Fields Medalist who later developed catastrophe theory, ventured into semiotics, and here's a great overview pdf of Thom's theory of meaning. A good book that follows up on this is Petitot's Morphogenesis of Meaning. And straight from the horse's mouth (after I asked him) was Amartya Sen advocating that topology and graph theory be researched more for use in the social sciences, as a mathematical approach separate from the standard functional calculus used now. SamuelRiv (talk) 03:51, 23 January 2011 (UTC)[reply]

Combination of error distributions


If you take two measurements of, say, a length and get, say, 2cm and 3cm and you know that each measurement has a normally distributed error with mean zero and standard deviation 0.5cm, what's the distribution of the length? Clearly it has mean 2.5cm, but what else can we say about it? --Tango (talk) 22:09, 18 January 2011 (UTC)[reply]

The posterior distribution depends on the prior. In fact it is proportional to the prior times the likelihood function.
The likelihood function, if the measurements are independent, is simply the product of the likelihoods for the two measurements, L(\ell) \propto e^{-(\ell-2)^2/(2 \cdot 0.5^2)} \cdot e^{-(\ell-3)^2/(2 \cdot 0.5^2)} \propto e^{-(\ell-2.5)^2/(2 \cdot (0.5/\sqrt{2})^2)}.
The naive choice for prior would be the improper prior f(\ell) \propto 1, but this of course does not faithfully reflect your prior knowledge (if the result was "between 4 and 5 cm", would that have the same believability as "between 10^{100} and 10^{100}+1 cm"?). I'd probably use a translated Student's t-distribution for the logarithm of the length.
Note also that the mean of the posterior is not necessarily 2.5cm. -- Meni Rosenfeld (talk) 09:19, 19 January 2011 (UTC)[reply]
Unless I'm mistaken, the likelihood function that Meni quotes as a product is (up to an overall constant factor) just a normal distribution with mean 2.5 cm and standard deviation \sqrt{2} \cdot 0.5 cm. The actual difference between the two samples cancels out. –Henning Makholm (talk) 18:08, 19 January 2011 (UTC)[reply]
I think it should be 0.5/\sqrt{2}. I've added this simplification to the equation. In general, when multiplying pdfs of normal variables, the result will be proportional to a normal with inverse variance equal to the sum of the inverse variances, and mean equal to the average of the means weighted by inverse variance. This will be your posterior distribution if the prior is nearly constant in the relevant region. -- Meni Rosenfeld (talk) 19:13, 19 January 2011 (UTC)[reply]
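That rule in a short sketch, with the numbers from the question (illustrative, plain Python):

import math

def combine_normal_measurements(means, sds):
    # Product of normal likelihoods: inverse variances add, and the mean is
    # the inverse-variance-weighted average of the individual means.
    w = [1 / s**2 for s in sds]
    mean = sum(wi * m for wi, m in zip(w, means)) / sum(w)
    return mean, 1 / math.sqrt(sum(w))

# Two measurements, 2 cm and 3 cm, each with sd 0.5 cm:
print(combine_normal_measurements([2.0, 3.0], [0.5, 0.5]))
# -> (2.5, 0.3535...), i.e. mean 2.5 cm, sd 0.5/sqrt(2) cm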
You're right and I should do these things on paper instead of in my head. Of course the posterior distribution ought to get narrower when we have two samples instead of one. –Henning Makholm (talk) 21:08, 19 January 2011 (UTC)[reply]
Paper is nice, but personally I prefer to do these things on silicon :) -- Meni Rosenfeld (talk) 09:43, 20 January 2011 (UTC)[reply]