Talk:Statistical significance/Archive 3


Merge in "Statistical threshold"

The proposal has been made by editor Animalparty to merge in the content from Statistical threshold. Since the concepts are very different, and a "statistical threshold" is an arbitrarily determined limit, I do not think that such a merge would be provident. --Bejnar (talk) 15:55, 12 August 2014‎ (UTC)

On 17 August 2014, after discussion at Afd, Statistical threshold was redirected to Statistical hypothesis testing. Any further discussion should be on that talk page. --Bejnar (talk) 13:37, 17 August 2014 (UTC)

Definition

The definition states that statistical significance is a low probability. But it is not a p-value; it is a label that is placed on the result of a hypothesis test after comparing the p-value to the significance level. This needs clarifying. Tayste (edits) 01:19, 2 February 2015 (UTC)

The current definition defines p-value results below a threshold as significant. This is supported by the sources. For more details, I encourage you to take a gander at the discussion threads above. danielkueh (talk) 01:23, 2 February 2015 (UTC)
Tayste above is correct in his or her edit to the article that tried to differentiate this article's subject from p-value. The lead of this article, especially the very first sentence, is enshrining p-value as the only paradigm of statistical significance and doing so covertly. ElKevbo (talk) 03:19, 2 February 2015 (UTC)
I don't know what you mean by "enshrining p-value as the only..." If you have reliable sources to buttress your claim, I would be happy to see it. danielkueh (talk) 03:22, 2 February 2015 (UTC)

Comment: Recently, there were three edits, made by different editors, to the lead sentence of this page. The first edit introduced a redundancy, the second equated statistical significance with p-values, and the last introduced a bizarre definition of statistical significance. I reverted all three of them. Anyone who wishes to contest the current lead should at least take a look at the long discussion threads in Archive 2 and also at the sources themselves, which are of high quality. In fact, many of them are peer-reviewed. And if there are editors who still want to change the lead, they should at least explain 1) why the current lead is not supported by the reliable sources that have been cited AND 2) provide mainstream sources of comparable quality to support whatever changes they have in mind. danielkueh (talk) 07:51, 2 February 2015 (UTC)

ChrisLloyd58 has recently reverted my reversion of his edit and has again replaced the previously well-sourced definition with "Statistical significance is the property of a statistic deviating from a reference level (often zero) by more than one would expect by pure chance." I have seen many closely related definitions of statistical significance, but I have never seen ChrisLloyd's definition in any reputable textbook or in any scholarly journal on statistics. Aside from the erroneous appeal to chance (see Archive 2 of talk page), it is not clear what this "zero reference level" is. Furthermore, there are no sources to support his idiosyncratic definition. It is certainly not an improvement and I believe the previous definition, however imperfect, should be restored as it is well-sourced (see WP:V). danielkueh (talk) 01:19, 3 February 2015 (UTC)
Response: The previous page said that statistical significance is “the low probability of obtaining at least as extreme results given that the null hypothesis is true.” This is a P-value. Consult any statistics text you like and you will see that I am right. So the previous text said that “statistical significance is the P-value.”
The second paragraph said “These tests are used to determine whether the outcome of a study would lead to a rejection of the null hypothesis based on a pre-specified low probability threshold called P-value.” P-values are not themselves thresholds. They are calculated from the data. The fixed threshold is called a significance level. Again, consult any first year text.
I have left the references the same as they will confirm the basic definitions of P-values and significance levels.
Even the present text is far from perfect because it is rather repetitive.
Actually, I would not be averse to removing the page entirely. The key concepts are P-values, significance levels, hypotheses and types of errors. The fact that we say a result has attained statistical significance when we reject the null is no reason to have a page on the abstract noun "statistical significance." But if we are going to have one, then it should not mislead people.
ChrisLloyd58 (talk) 01:28, 3 February 2015 (UTC)
ChrisLloyd58, I appreciate you joining the discussion, but please indent your comments with colons as it helps editors to differentiate the various comments on this talk page.
The previous lead definition does not equate statistical significance with p-values. It merely makes the point that the p-value has to be below an arbitrary threshold for it to be considered significant. Hence the keyword "low" at the beginning of the sentence. Furthermore, the previous definition is supported by reliable sources. A list of explicit quotes can be found in Archive 2 of this talk page. I strongly urge you to take a look.
With respect to your new definition, it is not an improvement for several reasons:
  • By appealing to the notion of "more than pure chance," it commits the inverse probability fallacy. This has been discussed to death and a list of reputable sources can be found in Archive 2 of this talk page. As one other editor previously commented, a lead definition should not start with a fallacy.
  • You have not provided a single reliable source that explicitly defines statistical significance as "the property of a statistic deviating from a reference level (often zero)." Until you do, there is no reason to believe that this idiosyncratic definition is representative of mainstream sources. If anything, it qualifies as a fringe definition (see wp:fringe). Same goes for the second sentence, "More formally, it describes rejection of the null hypothesis that the underlying parameter equals the reference level."
  • You have not explained how the previous definition is NOT supported by the preponderance of high quality sources cited. Again, please see Archive 2 for details. Until you do, there is no reason to believe that the previous definition was misleading. Quite the opposite.
I take your point about the troublesome title of this page. But that is a separate discussion, and one that I am actually open to having.
danielkueh (talk) 02:00, 3 February 2015 (UTC)
Aside from all the other edits, I restored a modified form of the previous lead definition. The reasons are clear. The previous lead definition is well sourced and supported by high quality peer-reviewed sources (see Archive 2 of talk page) whereas the new lead definition isn't (and I suspect never will be). Furthermore, the new lead definition contained an inverse probability fallacy, which the previous lead corrected (see Archive 2 of talk page for details). However, given the concerns expressed by the other editors on the potential ambiguity of whether statistical significance is equivalent to p-values in general, I modified the beginning of the previous lead sentence to remove that ambiguity. Hopefully, it is clearer now. danielkueh (talk) 17:44, 3 February 2015 (UTC)
The lede needs to open with context-setting, such as "In statistical hypothesis testing," since this is a technical topic.
The term "very low" is incorrect - the probability need only be lower than the significance level, which is not necessarily "very low".
Also, I think that "given" is incorrect - we are assuming the null hypothesis to be true when calculating the probability; we are not given that it is true. Tayste (edits) 22:05, 3 February 2015 (UTC)
Hi Tayste, the term "given" is quite standard. In fact, it is what the | symbol stands for in p(A|B), or probability of A given B, with B being that the null is true. p-values are after all conditional probabilities. I wrote "very low" as it is cumbersome to state that it needs to be lower than a threshold value. To me, less than 5% sounds "very low" compared to, say, 20%, which is "quite low." But I take your point that it is a subjective statement, which is not necessarily helpful. I wonder if it is even necessary to define the p-value in the lead definition. Would it not be better to just say something as follows:
"Statistical significance (or statistically significant result) is attained when a p-value is less than a pre-determined threshold."
Thoughts? danielkueh (talk) 22:21, 3 February 2015 (UTC)
In some fields (e.g. physics, genetics) much lower significance levels are used, e.g. 0.0000001, so p=0.0000002 would be non-significant yet "very low" in everyday language. Tayste (edits) 23:05, 3 February 2015 (UTC)
So you are in agreement, then, about not defining the p-value and just stating that it needs to be lower than a pre-determined threshold? After all, we already have an article that defines a p-value. We could just wikilink to that article as follows:
"In statistics, statistical significance (or statistically significant result) is attained when a p-value is less than a predetermined threshold.[1][2][3][4][5][6][7] It is an integral concept of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected.[8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[10][11] But if the p-value, which is the probability of obtaining at least as extreme result (large difference between two or more sample means) given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[8] An investigator may then report that the extreme result attains statistical significance."
Better? danielkueh (talk) 23:08, 3 February 2015 (UTC)
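As an illustrative aside, a minimal sketch (in Python, with hypothetical data and threshold) of the decision rule described in the proposed lead: compute a p-value and compare it to a predetermined threshold.

```python
# Minimal sketch (hypothetical data and threshold) of the decision rule in the
# proposed lead: compute a p-value, then compare it to a predetermined threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)   # hypothetical sample

alpha = 0.05                                        # predetermined threshold
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)  # null: population mean is 0

# The result is labelled "statistically significant" only if the p-value falls below alpha.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at {alpha}: {p_value < alpha}")
```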

Multiple comparisons problem

Do we need a section to briefly mention how statistical significance is affected by the multiple comparisons problem? Tayste (edits) 19:05, 9 February 2015 (UTC)

What do you have in mind? danielkueh (talk) 19:36, 9 February 2015 (UTC)
Well, the interpretation of "statistically significant" as meaning "a result unlikely to be due to chance" becomes flawed when one has conducted many tests. I thought this ought to be mentioned briefly, with a link to the actual article for more info. Maybe even just putting it into a "See Also" section. This leads me on to my next suggestion. Tayste (edits) 21:26, 9 February 2015 (UTC)
I know what you mean. I don't disagree. I just want to know if you have a specific text/paragraph/section in mind. danielkueh (talk) 21:41, 9 February 2015 (UTC)
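As an illustrative aside, a minimal sketch (illustrative numbers only) of the problem raised above: with m independent tests each run at level alpha, the chance of at least one false positive grows rapidly; a Bonferroni correction is one common remedy.

```python
# Sketch of the multiple comparisons problem: with m independent tests, each at
# level alpha, the chance of at least one false positive grows quickly.
# The Bonferroni-adjusted per-test threshold is one common remedy.
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m    # probability of >= 1 false positive under m true nulls
    bonferroni = alpha / m         # per-test threshold keeping the family-wise rate <= alpha
    print(f"m = {m:3d}   P(at least one false positive) = {fwer:.3f}   Bonferroni threshold = {bonferroni:.5f}")
```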

wikilink for significance level

significance level redirects here, so this page should define it in the lead. Furthermore, doing so is more important than the definition of p-value, which is provided by linking to that page. The fact that a subsection of Type_I_and_type_II_errors defines significance level is irrelevant. This page should give the definitive definition, hence my removal of the inappropriate link. Tayste (edits) 21:49, 9 February 2015 (UTC)

Again, I still fail to understand your rationale. We're writing a lead to explain statistical significance (see WP:lead), not because of some wiki-link issue. If you like, we can have significance level point to the subsection of Type_I_and_type_II_errors instead. danielkueh (talk) 21:52, 9 February 2015 (UTC)

Move page to significance level?

I think Significance level should be an article, with statistical significance redirecting to it, rather than the other way around. Tayste (edits) 21:27, 9 February 2015 (UTC)

Opposed. Statistical significance is a broader concept than significance level. danielkueh (talk) 21:39, 9 February 2015 (UTC)
Can you please give me an example where the term statistical significance is used without reference to a significance level? Tayste (edits) 21:42, 9 February 2015 (UTC)
What is your point? Can you give me an example where significance level is used without reference to statistical significance? danielkueh (talk) 21:45, 9 February 2015 (UTC)
Power calculations, for example. One chooses a significance level without necessarily conducting a hypothesis test (e.g. the study might not go ahead).
My point is that statistical significance is a consequence of choosing a significance level, not the other way around. And I think significance level is a more appropriate title for the article, as that's the more important concept to be explained. Statistical significance comes out naturally while explaining significance level, but the latter seems to be awkwardly mentioned in this article. Tayste (edits) 21:56, 9 February 2015 (UTC)
People, scientists and the like, want to know if a result is statistically significant. No one cares about the significance level or critical levels, because most of the time, those levels are fixed. Thus, your point about statistical significance being a consequence of choosing a significance level is a non sequitur. Statistical significance is also a consequence of calculating means and standard errors, and in non-parametric cases, of ranks. Does that mean we should rename this article in terms of mean differences? danielkueh (talk) 22:07, 9 February 2015 (UTC)
To make the point that there is greater interest in statistical significance than significance level, I did a quick search on Google Scholar and Google Books for the terms "statistical significance" and "significance level." The results are as follows:
  • "statistical significance": 2,160,000 results (0.09 sec) on scholar.google.com
  • "significance level": About 709,000 results (0.10 sec) on scholar.google.com
  • "statistical significance": About 794,000 results (0.32 seconds) on books.google.com
  • "significance level": About 381,000 results (0.40 seconds) on books.google.com
Based on these quick searches, I think it's clear which concept is of greater interest to most readers and writers alike. danielkueh (talk) 22:22, 9 February 2015 (UTC)
A p-value measure of "statistical significance" can be calculated from the data without having selected (arbitrarily, usually) a threshold "significance level". This p-value can be objectively reported, as it is, without making any black-white judgement about acceptance or rejection of the null hypothesis. In this respect, then, "statistical significance" can be more interesting than the "significance level" itself. Isambard Kingdom (talk) 08:38, 10 February 2015 (UTC)

Relevance

A paragraph on the distinction between statistical significance and significance (in the sense of relevance) is badly needed. These two concepts are all too often confused, leading to all kinds of nonsensical conclusions. The little section on effect size hints in that direction but is not explicit enough. In that context there could also be a link to the Wikipedia entry on publication bias. Strasburger (talk) 09:40, 27 February 2015 (UTC)

Good idea! Perhaps you could start a draft on this talk page. We can then tweak and discuss? We could even start with the following sentence from the lead and expand/modify it further.
Statistical significance is not the same as research, theoretical, or practical significance.
danielkueh (talk) 16:50, 27 February 2015 (UTC)

Significance and Relevance

Statistical significance, i.e. the p value, should not be confused with significance in the common meaning, i.e., with relevance (also termed practical significance). A finding can be (statistically) significant, or even (statistically) highly significant, and still be irrelevant for all practical purposes. The reason is that statistical significance depends, by design, on the number n of observations (of trials, of subjects) and increases with (the square root of) n. As a consequence, any effect, including many nonsensical examples, becomes significant, provided the data set is sufficiently large. A remedy could be to limit sample size sensibly.[1] Anyway, all that significance tells us is simply that the effect in question is, to a certain probability, not due to chance alone.
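As an illustrative aside, a minimal sketch (hypothetical numbers) of the n-dependence described above: for a fixed, small standardized effect, the one-sample t statistic grows with the square root of n, so the p-value can be pushed below any threshold by a large enough sample.

```python
# Illustrative sketch: a fixed, small standardized effect becomes "significant"
# once n is large enough, because the one-sample t statistic grows with sqrt(n).
import numpy as np
from scipy import stats

effect = 0.05                        # hypothetical standardized mean difference
for n in (20, 200, 2000, 20000):
    t = effect * np.sqrt(n)          # one-sample t statistic for that effect
    p = 2 * stats.t.sf(t, df=n - 1)  # two-sided p-value
    print(f"n = {n:6d}   t = {t:5.2f}   p = {p:.4f}")
# The effect is unchanged, yet p crosses 0.05 somewhere between n = 200 and n = 2000.
```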

(This paragraph could go at the end of the section "Role in statistical hypothesis testing":) Note that this probability (the false-alarm rate) is by no means the 5% that one would innocently assume. Basically, its value is unknown, since it requires information on the probability that a real effect was there in the first place, which we do not have. There are rule-of-thumb calculations, though; according to one widely used calculation, that probability is 29% for p=5%, i.e. much larger than expected.[2][3]
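If the rule of thumb referred to here is the minimum Bayes factor bound −e·p·ln(p) (an assumption on my part, though it does reproduce the figures quoted by Nuzzo), the numbers come out as follows; the 50:50 prior odds of a real effect are also an assumption of that rule of thumb, not something established by the data.

```python
# Hedged sketch: reproducing the rule-of-thumb false-alarm figures, assuming the
# calculation is the minimum Bayes factor bound -e * p * ln(p) combined with
# 50:50 prior odds that a real effect exists (both are assumptions, not givens).
import math

def false_alarm_bound(p, prior_prob_effect=0.5):
    """Lower bound on P(null is true | data) for a given p-value."""
    min_bayes_factor = -math.e * p * math.log(p)   # strongest admissible evidence against the null
    prior_odds_null = (1 - prior_prob_effect) / prior_prob_effect
    posterior_odds_null = prior_odds_null * min_bayes_factor
    return posterior_odds_null / (1 + posterior_odds_null)

for p in (0.05, 0.01):
    print(f"p = {p}: false-alarm probability of at least {false_alarm_bound(p):.0%}")
# Prints roughly 29% for p = 0.05 and 11% for p = 0.01.
```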

The relevance (or importance) of an effect, on the other hand, can be assessed by statistical measures of effect size, like Cohen’s d. Measures of effect size, as the name implies, assess the extent of an effect and are independent of sample size n. They estimate population properties, not sample properties; larger samples just make effect-size measures more robust. Unlike with significance, the larger the effect size is, the more relevant a finding generally is.

Current scientific practice puts far too much emphasis on p values, as has been repeatedly criticized. The hidden assumption is that reaching significance is a "success" and that the more significant a finding is, the more relevant it is. One consequence of that fallacy is that many results, or even a majority, cannot be replicated, i.e., are not "real".[4] The over-emphasis on p values further leads to an unwarranted public awareness of irrelevant findings and a lack of awareness of important effects. A related effect, detrimental to the credibility of science, is termed publication bias, which again seriously distorts our view of the world. Strasburger (talk) 18:51, 28 February 2015 (UTC)

@Strasburger, while I certainly agree that p-values aren't a panacea, I don't quite follow you when you say "As a consequence, any effect, including many nonsensical examples, become significant, provided the data set is sufficiently large." Perhaps you simply mean that the statistics become convergent, so that, for example, 1/20 of all experiments with random data really do satisfy the 5% significance level in the limit of large numbers of data? One thing, 1/20 of all experiments with random data should also satisfy the 5% significance level in the limit of a large number of experiments (regardless of the number of data in each experiment). At least, that is my interpretation, but I'm willing to be corrected on this. Isambard Kingdom (talk) 19:55, 28 February 2015 (UTC)
@Kingdom, with regard to your first statement, I see your point, but that appears to be a common misunderstanding (one which I also fell for): the denominator of the t value (the standard error), because variance is invariant with n, decreases with the square root of n. So, the larger n is, the smaller any difference (or correlation, etc.) needs to be to become significant. Statistics books are full of humorous examples of that (my favorite is a study on a few thousand children, where short-sightedness correlated highly significantly with intelligence; the correlation was only r=0.1, i.e. explained variance was 1%). A nice website with nonsensical correlations is at http://www.tylervigen.com/ Formally, the problem is treated, e.g., in the Friston paper which I cited, and in statistics books (I use Bortz: Statistics). As to the meaning of the 5% when the null hypothesis is not true (which is always the case), that seems to be much more complicated than one thinks. It is explained in layman's terms in the Nature paper by Nuzzo (2014) (which is fun reading), and formally e.g. in Goodman (2001) (which I took from the Nuzzo paper). Strasburger (talk) 11:21, 1 March 2015 (UTC)
Ps. Given your questions I should perhaps spend a few more sentences. (?) Strasburger (talk) 12:31, 1 March 2015 (UTC)
I have now expanded the remark on false alarm rate a little, to make it more accessible. Strasburger (talk) 16:25, 2 March 2015 (UTC)
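As an illustrative aside on the correlation anecdote above, a minimal sketch (hypothetical numbers): a correlation of r = 0.1, i.e. only 1% explained variance, is already highly significant in a sample of a few thousand.

```python
# Illustrative sketch: significance of a small correlation in a large sample.
# For a Pearson correlation, t = r*sqrt(n-2)/sqrt(1-r^2).
import math
from scipy import stats

r, n = 0.1, 3000                        # hypothetical values
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p = 2 * stats.t.sf(t, df=n - 2)         # two-sided p-value
print(f"r = {r} (explained variance {r**2:.0%}), n = {n}: t = {t:.2f}, p = {p:.1e}")
# p is far below 0.001 even though the effect explains just 1% of the variance.
```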
@Strasburger, I need to investigate this, and do some learning, but perhaps you can give me a jump start. I'm guessing that this effect size has something to do with how close the p-value is to the chosen level of significance, and this closeness, measured as a distance, converges to a specific value for large numbers of data. At the same time, in practical terms, having a p-value that is just barely less than the chosen significance level is not really much different from a p-value that is just barely greater. And, yet, the first case would be deemed "significant" while the second would not. Is this getting at the crux of the issue you are highlighting in the article? Isambard Kingdom (talk) 17:04, 1 March 2015 (UTC)
No, not at all. It is much simpler: Look at the equation defining the t value in the simplest case, the One-sample t-test. In the denominator, you have standard deviation divided by the square root of the sample size n. The larger n gets, the smaller the denominator gets (the denominator is called standard error), and the larger t gets. The larger t is, the more significant it is, i.e. the smaller its p value.
Now compare that with the equation for the simplest case of an effect size, Cohen's d: the main difference from t is that the square root of n is missing. So Cohen's d is independent of n.
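A minimal sketch of the comparison just described (one-sample case, hypothetical numbers): the t statistic carries a factor of the square root of n that Cohen's d does not.

```python
# Sketch of the one-sample formulas discussed above:
# t = (mean difference) / (sd / sqrt(n)) carries a factor sqrt(n),
# whereas Cohen's d = (mean difference) / sd does not depend on n.
import math

mean_diff, sd = 0.2, 1.0                  # hypothetical sample mean difference and SD
for n in (25, 100, 400):
    t = mean_diff / (sd / math.sqrt(n))   # one-sample t statistic
    d = mean_diff / sd                    # Cohen's d
    print(f"n = {n:3d}   t = {t:.2f}   d = {d:.2f}")
# t doubles every time n quadruples; d stays at 0.20 throughout.
```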
@Strasburger, additional point/question. As the section "Role in statistical hypothesis testing" is currently written, it is discussing an experimenter collecting a few data, then checking for significance. If the results seem to be close to significant, but not quite crossing the chosen level of significance, more data are collected until, by virtue of statistical jitter, the threshold is finally crossed. At this point the experimenter stops his/her work and declares "success". I would suggest that this is simply bad experimental practice, and not really a subject that can be easily addressed with statistical analysis. Data need to be collected blindly with respect to their statistical properties. Of course, we all know that this is not always the way things are done, hence problems. ...
Let me answer point by point for clarity. Yes, it is bad experimental practice. But it is also not of much practical consequence. The main fallacy is that significance is a "success". Significance doesn't tell much; it just says that the effect (however small it is) is likely not due to chance. There is now a journal that takes the drastic step of banning the report of p values for that reason altogether (Basic and Applied Social Psychology).
...But I don't see how (maybe I need to learn) introducing additional statistical measures (effect size, for example) circumvents what is certainly bad practice. ...
Effect size stays unchanged with additional data. So if the effect is weak in the first place, it will stay weak with more data.
...Alternatively, the experimenter can choose not to specify a "significance level" and simply report the p-value as it is measured along with the number of data. No subjective and arbitrary assessment of "significance" need be made. This is generally what I do in these sorts of settings. Let the data speak for themselves. Isambard Kingdom (talk) 17:21, 1 March 2015 (UTC)
Yes, that is what I do, too. What we also do is point out that a p value that is slightly higher than 5% or 1% is close to that significance level. For p<10% one can suggest a trend. There are actually two schools in this respect (I forget what they are called): one where a significance level needs to be defined beforehand, and the other where the p values are simply reported (together with n!!). And good practice now more and more requires also reporting effect sizes. Strasburger (talk) 10:48, 2 March 2015 (UTC)
@Strasburger, probably best if you don't divide my comments up into pieces. Even I find it confusing. As for effect size, I've just learned what it is. Basically it is what a scientist (say) is trying to measure, for example a correlation coefficient. A statistical p-value helps to assess the significance of that correlation. So, now that I know this, yes, of course, effect size should be reported, also p, and n too. I don't think, however, one should ever collect more data just to try to get a smaller p-value. In that respect, I find the present discussion in the article to be troubling. Isambard Kingdom (talk) 11:59, 2 March 2015 (UTC)
Sorry for dividing; I'll keep that in mind. (Please revert it if you wish.) Strasburger (talk) 20:53, 2 March 2015 (UTC)
Quick comment: I would like to remind folks that Wikipedia has a template page for citations (See WP:CT). Editors who wish to insert new references should use these templates instead of writing them out. Also, avoid journal style writing such as "Friston (2012) (and others)..." unless these authors are central figures (e.g., Fisher) in the article. Thanks. danielkueh (talk) 20:43, 1 March 2015 (UTC)
Thanks; I wasn't aware of that. Could you change one of my citations accordingly as an example? (I'll do the others then). Karl Friston is actually a central figure in modern statistics and functional MRI. Strasburger (talk) 10:48, 2 March 2015 (UTC)
I already did change the citations. I do not doubt Friston is notable in neuroimaging. But this page is about statistical significance. So unless Friston originated a new major concept/test on statistical significance that is widely cited/used in many secondary sources on statistics, then no, he is not what we mean by "central figure." Think Fisher, Newman, Keuls, or Student. Household names that can be found in any generic statistics textbook. danielkueh (talk) 14:36, 2 March 2015 (UTC)
Friston: Agreed. - I changed the citation format to the template in the above piece of text now. Strasburger (talk) 15:54, 2 March 2015 (UTC)

General comment: I agree that it would be good to have a section that distinguishes between statistical and theoretical/practical significance. However, I find the above proposed text to be problematic for two reasons:

  • It's too long. Ideally, it should be one paragraph. Two paragraphs at the most so as not to give this new section undue weight (WP:UNDUE).
  • The proposed text does not adequately focus on the differences between statistical significance and practical/theoretical significance. Its main focus appears to be on replication and effect size. There's nothing wrong with that, but I wonder if those issues could be described in other sections.

danielkueh (talk) 16:57, 2 March 2015 (UTC)

My intention was to focus on the difference between statistical significance and relevance (as an aside, the terms practical/theoretical significance were used in none of the statistics books I know, so I'll stick with "relevance"). If you feel it's too long, the second paragraph "Note that ..." could go somewhere else. It should be said somewhere, though, because this is a rather widespread myth about the meaning of the alpha error. The structure would then be: Paragraph_1: Significance; Paragraph_2: Effect Size/Relevance; Paragraph_3: Why the latter is more important. Strasburger (talk) 20:53, 2 March 2015 (UTC)
Ok, we need to clarify one thing first. When you say *relevance*, do you mean *practical/theoretical significance* or are you talking about *effect size*? danielkueh (talk) 21:01, 2 March 2015 (UTC)
By "relevance" I mean "practical significance" (i.e. importance in every day language). Effect sizes would be possible measures to assess relevance/practical significance. Strasburger (talk) 22:41, 2 March 2015 (UTC)
Ok. It may be that effect size could be a measure of practical significance but they are still two separate concepts. I think it is important not to conflate the two. So far, much of what I see in the third paragraph of the above proposed text relates to effect size, rather than the significance. That needs to change. Unless of course, we decide to work on the effect size section instead. danielkueh (talk) 23:03, 2 March 2015 (UTC)
Ok. I have changed all three paragraphs slightly. The point I wish to convey is pointing out a common fallacy, namely that significance is a measure of relevance. Paragraph_1 explains why that is not the case, Paragraph_2 explains that it works with effect size, and Paragraph_3 points out the far-reaching consequences of that fallacy. Strasburger (talk) 00:18, 3 March 2015 (UTC)
In the proposed text, only the first and third sentences of the first paragraph directly address the issue of confusing statistical significance with relevance or practical significance. And even then, it falls short, as it does not address theoretical or research significance. The difference between statistical significance and research/practical/theoretical significance is an issue of definition, and the rest of the proposed text does not speak to that issue.
The proposed text puts a heavy emphasis on *effect size* as an indicator of practical significance. But that assumes that all researchers are interested in large effect sizes, which is not the case. Practical significance is somewhat idiosyncratic. In some cases, researchers do want large effects, e.g., in treating a disease. But in other cases, e.g., testing for side effects of a drug, a large effect size may not be desirable.
So in the end, we could expand the first and third sentences to address the difference between statistical significance and other types of significance. The third paragraph on effect size is better placed in the section on Effect size, and the fourth paragraph could be used to start a new section on the limitations of reporting statistical significance. As for the second paragraph, I don't know what to do with it. I think it should just be scrapped. danielkueh (talk) 00:53, 3 March 2015 (UTC)

@Danielkueh: I agree that the structure of the proposed section needs to be adjusted. But perhaps less drastically. What happened (I think) is that my proposed title raised unintended expectations. Indeed, the differences between the three “real-world” significances, and how these are opposed to statistical significance, are conceptual questions. But the respective concepts of the former three are (imo) too vague to warrant a separate paragraph in a short Wikipedia article. Note also that statistical significance, unlike research/theoretical/practical significance, is a mathematical concept. So the task is not to compare statistical significance to the other three, but to ask whether statistical significance can be a valid measure for significance in the real world. It cannot, although surprisingly often it is assumed that it can. So, a more precise (but also more lengthy) title for the intended subsection could be “Statistical Significance is not a Valid Measure of Relevance”, or, shorter, “Significance is not Relevance”. The section was intended to plug in right before the section “Effect Size”.

As such, it was intended to start a section with a headline “Limitations and Fallacies (of statistical significance)” (without explicitly saying so). Probably it would be better to say so. It would also motivate talking about effect size, and part of it could be moved there. The current section on effect size is already part of an implicitly critical section; otherwise there would be no reason to include it in an entry on statistical significance. Finally, a first subsection in the proposed “Limitations and Fallacies” could be a few lines entitled “Meaning of the Alpha Error” which would contain the little paragraph currently starting “Note that this probability”. Most people think that alpha=5% means that the probability of getting a significant result is 5% – which it is not. The paragraph could go like this:

“Meaning of the Alpha Error: Often it is assumed that an alpha error of, say, 5% means that, in reality, there is a 5% chance of the respective results being false alarms. That is not the case, however, since that interpretation relies on the assumption that the real effect is zero (that the null hypothesis is true). Strictly speaking, this is never the case. In principle, that probability (the false-alarm rate) is unknown, since it requires information on the probability that a real effect was there in the first place – which we do not have. There are rule-of-thumb calculations, however. According to one widely used calculation, that probability is 29% (for p=5%), i.e. much larger than would be expected.”[2][3] Strasburger (talk) 11:21, 6 March 2015 (UTC)

Again, correct me if I'm wrong, but alpha=5% means that there is a 5% chance that the null hypothesis could give an effect like that either hypothesized or observed (or larger). Whether or not the effect is real ("there in the first place") is not relevant to this statement. At least that is my understanding. Isambard Kingdom (talk) 20:43, 6 March 2015 (UTC)
Yes, this is a common fallacy (which I also fell for, and which is why I think it is important in a section on fallacies): Your statement "alpha=5% means that there is a 5% chance that the null hypothesis could give an effect like that either hypothesized or observed" is only true if the null hypothesis is fulfilled. Please refer to the main text where Ref. 22 is cited.
Of course, it is a conditional statement -- if that is what "fulfilled" means. Yes, if the null hypothesis were true (as an hypothesis), then for alpha=5% there would be a 5% chance that it would give an effect like the more interesting hypothesis under scrutiny. Where is the fallacy? The point of doing a null hypothesis test is to *consider* something less interesting than the hypothesis being proposed. We stand up the two hypotheses and compare them. Isambard Kingdom (talk) 19:40, 8 March 2015 (UTC)
You are right, I did not read your statement carefully enough. As a conditional statement it is correct. However, it is often assumed that when the null hypothesis was rejected on the 5% level then there is (unconditionally) a probability of 5% of that result being a false alarm. I should drop "in reality". Please have a look at the Nuzzo article; it is relatively short and easy reading. Strasburger (talk) 20:10, 8 March 2015 (UTC)
If we agree on something, it is that people misinterpret significance. Isambard Kingdom (talk) 20:16, 8 March 2015 (UTC)
Strasburger. Whether or not practical/theoretical/research significance is "vague" is irrelevant. We don't include texts in Wikipedia because they are not vague. We include them if they are well-sourced and consistent with mainstream discourse. So far, your first proposed text above is completely at variance with its title linking statistical significance with relevance. Now you are proposing another text/section on the alpha error. Aside from being practically unintelligible to a non-specialist, this new text/section reads like original research (WP:OR) and in a non-NPOV tone (WP:NPOV). More appropriate for a personal blog, but not for Wikipedia. danielkueh (talk) 23:27, 6 March 2015 (UTC)
Danielkueh: I agree that whether something is vague is largely irrelevant for inclusion. I just do not want to be the person to write such an entry (and leave that to somebody else). So I drop the proposal "Significance and Relevance".
The section I intended to propose in the first place was on an important limitation of statistical significance, namely that it is often seen as a measure of relevance. Sorry if my title was misleading. This limitation is standardly pointed out in every serious statistics book, just not in the introductory chapters. One reference is Bortz: Statistics (a standard German book, for which I have the page number). The German Wikipedia points it out. The Nuzzo reference is a feature article in Nature, addressed to anybody interested in significance. It is not at all original research.
I do think that a section on limitations and fallacies is important here. Wikipedia is a main source for information. Sorry if my text is currently difficult to comprehend and if the tone is non-NPOV. I'd be grateful for any suggestions on how to improve that. Perhaps it helps if I spell out the intended section, under the new headline. So I will do that as the next step. Strasburger (talk) 19:27, 8 March 2015 (UTC)
Strasburger, I have looked at the Nature reference and it does not make the point about the alpha level. More specifically, the paper describes the philosophical issue of creating a "hybrid system" comprising Fisher's p-values and Neyman's alpha levels. I agree that several sources do describe this problem and I believe this would be suitable for the History section. But there is no mainstream consensus yet to dismantle this hybrid system, which, for better or worse, is standard practice. danielkueh (talk) 20:11, 8 March 2015 (UTC)
Danielkueh, the paper does have sections on the current practical use of a "hybrid system". These are not the sections I am referring to, however. The two paragraphs I refer to are on page 151, entitled What does it all mean. Strictly speaking they are not on the alpha level, but they are on how that level is often interpreted after the experiment has been done. There is mainstream consensus that that post-factum interpretation is plain wrong. The sections are repeated here for convenience:
"What does it all mean? One result is an abundance of confusion about what the P value means4. Consider Motyl's study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumor — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is."
"These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see 'Probable cause'). According to one widely used calculation5, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl's finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another 'very significant' result6, 7. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails." (Nuzzo, p. 151)

Strasburger (talk) 10:49, 9 March 2015 (UTC)

Alright, back up a bit. Look at what your proposed text says and then compare it to Nuzzo's paper. In fact, just look at the first sentence of your text, "Meaning of the Alpha Error: Often it is assumed that an alpha error...." That is not what Nuzzo's paper says. If you think that is what it says, then you are conflating p-values and alpha levels. Also, you have to read Nuzzo's paper very carefully. The paper says "According to one widely used calculation, a P value of 0.01 corresponds to a false-alarm probability...." I don't want to get too deep into the Goodman (2001) paper that Nuzzo cites, but suffice it to say, the chance numbers of 29% or 11% being bandied about are not universal. So it is inappropriate to include those numbers in the proposed text. If anything, it will just confuse the readers. Finally, if you read the summary of Nuzzo, you'll see that mainstream practices of using the hybrid system have not changed and there is no general or final consensus as to what the alternatives to the hybrid system should be, despite multiple solutions being proposed. Again, I am not opposed to describing misinterpretations of p-values and statistical significance in general. But the proposed text on alpha errors, as it is currently written, is just not informative. In fact, it is misleading. danielkueh (talk) 11:57, 9 March 2015 (UTC)
As said above a few days ago, I plan to re-write the section. Please give me a little time (I'm in bed with fever). Commenting on a revised text might be better. Strasburger (talk) 17:07, 9 March 2015 (UTC)
Well, don't rewrite the paragraph. Instead just list 1-2 main points that you want to make so that we can discuss them more easily. Please keep it simple. In the meantime, rest well. Hope you feel better soon. Cheers. danielkueh (talk) 17:49, 9 March 2015 (UTC)

OK, fine (not recovered but out of bed). Here is a try:

Subsection 1: Neither the alpha error, nor the set alpha level, nor the p value for a significant result is the probability of that result being a false alarm. That probability is unknown (without knowledge of the size of the effect) but has been estimated as mostly being much higher. Point out that, in the population, an effect is rarely, if ever, zero.
Subsection 2: "Significance is not Relevance". (In the meaning that statistical significance cannot assess relevance). The reason being that significance is conflated with n. Any ever so small effect will be significant, given sufficient sample size n.
Subsection Effect Size (additional to what is said now): Point out that effect size measures are invariant with n (example of t-value vs. Cohen's d: there is no sqrt(n) in the denominator). Effect size can assess the relevance of a given effect (high influence on blood pressure is more relevant than low influence on blood pressure) but cannot compare relevance between different effects.

Strasburger (talk) 11:30, 14 March 2015 (UTC)

Strasburger, I'm no statistician. Still, let me summarize my understanding: Statistical hypothesis testing is a frequentist interpretation. In a frequentist interpretation, one considers the conditional probability of data *given* model parameters -- the model parameters are, essentially, the hypothesis being tested. If we accept this succinct summary, then a frequentist might make a Type I error ("false alarm") because of a statistical fluke – data drawn from, say, a null process are statistical, and so they will sometimes seem to show a “significant” effect. But the frequentist never actually “knows” his parameters -- not before the experiment, and not after the experiment either. His parameters are only hypotheses he seeks to test. All he actually knows is the data he has, and so he can’t know whether or not a Type I error has occurred. A Type I error might be inferred, subsequently, from more data collected in a different experiment, but that occurs in the future, after the experimenter has analyzed the data he has. So, assuming I’ve understood this, the “effect size” only enters into the discussion as the quantity the experimenter wishes to compare against the null. What is the probability that the null would give hypothetical data having an effect larger than that observed in the real data? I'm happy to be corrected on this (and so learn something), but I think "frequentist" hypothesis testing might be simpler than you are imagining. Though, again, that doesn't mean "significance" is not confusing to many researchers. Isambard Kingdom (talk) 17:22, 14 March 2015 (UTC)
I just want to add to Isambard Kingdom's comments. These two proposed points are nothing more than a rehash of the proposed paragraphs above. They do not address or take into consideration the comments and concerns made by the other editors. For example, subsection 1 still conflates or confuses p-values with alpha values, which is not supported by the sources. Subsection 2 still confuses *relevance* with *effect size*. I am tired of addressing these issues. Please see my comments above for details. danielkueh (talk) 02:30, 15 March 2015 (UTC)

Effect size

In the paragraph on effect size, I wonder whether the clause "(in cases where the effect being tested for is defined in terms of an effect size)" could be deleted: I am not aware of cases where an effect cannot be quantified. Strasburger (talk) 13:02, 1 March 2015 (UTC)

I would advocate complete deletion of that paragraph. It seems to depict an experimenter performing a non-objective analysis: Collect some data, measure its "significance", and, then, if not significant, collect some more data, measure significance again, etc. until something significant is found. This is not the way to do things, and it should not be discussed in this article as if there is a way around such snooping. Still, I know that it happens. I just don't want to see it advocated, even if unintentionally so. If I have misinterpreted the text, then I apologize, but that is my interpretation of what we have at the moment in the text. Isambard Kingdom (talk) 14:10, 7 June 2015 (UTC)
I would add, however, that it is always good practice to report: the number of data, the "effect size" (be it Pearson r, or whatever is being assessed), and the p-value. Indeed, I can't understand why these quantities would not be reported. Isambard Kingdom (talk) 14:17, 7 June 2015 (UTC)
While I would certainly agree that the paragraph on effect size could (and should) be improved, I believe you did misinterpret what it tries to say. You are absolutely right about the bad practice you describe above, but reporting effect size is actually counter to that practice. The paragraph is simply advice to report some measure of effect size in the results section. It need not be whatever is assessed, btw, just some valid measure of the size of an effect. Cohen's d seems the most common from what I see. What a valid measure of effect size is should be part of the corresponding Wikipedia entry; the p value, in any case, is not one (although it is often misinterpreted that way). That advice is now quite common and is meant to discourage relying on significance and the p value.
That said, the fact that you misread that paragraph implies that something is not clear there. I believe it's the parenthesis that is misleading, so I will go ahead and delete it. It is also misleading to say "the effect size", because that can be misread as referring to the measure for which p was determined. So I will change that to "an effect size". Strasburger (talk) 15:49, 7 June 2015 (UTC)
Small "effects" can be significant, of course. Perhaps this is what you are getting at? Generally speaking, small effects normally require lots of data to be resolved. Do people really not report effect size? In my experience that is sometimes all they report! I think we agree, though, good practice is to report data number N, effect size r, and p-value. Of course, p-value is estimated conditional on both N and r. So, given a null statistical hypothesis and N actual data having effect size r, there is a probability p that a sample of N synthetic data from the null hypothesis would have an effect larger than r. Whether or not this has "research significance" (as phrased in the article) depends on the situation, I think. Isambard Kingdom (talk) 16:39, 7 June 2015 (UTC)
  1. Friston, Karl (2012). "Ten ironic rules for non-statistical reviewers". NeuroImage. 61: 1300–1310.
  2. Goodman, S. N. (2001). "Of P-Values and Bayes: A Modest Proposal". Epidemiology. 12: 295–297.
  3. Nuzzo, Regina (2014). "Scientific method: Statistical errors". Nature. 506: 150–152.
  4. Ioannidis, John P. A. (2005). "Why most published research findings are false". PLoS Medicine. 2: e124.