
Wiki Education Foundation-supported course assignment

  This article is or was the subject of a Wiki Education Foundation-supported course assignment. Further details are available on the course page. Student editor(s): Cokusiak. Peer reviewers: Cokusiak.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 03:49, 18 January 2022 (UTC)

Wrong definition of p value

After being away from these discussions for quite some time, I noticed that large-scale modifications have been made to the lead paragraph. As a contributing party to the previous versions, I want to maintain a high level of accuracy for statistical concepts that are widely misunderstood and abused. The first thing I'd like to correct is the fact that the current lead paragraph defines the p value incorrectly.

The p-value is the probability of observing an effect given that the null hypothesis is true whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.

In reality, the p value is the probability of observing an effect, or a more extreme effect, given that the null hypothesis is true (i.e. the shaded areas near the tails in a classic presentation). It isn't simply the probability of a specific effect given the null, because that probability is always infinitesimal (e.g. just a sliver in a normal distribution); a code sketch below these quoted passages makes the distinction concrete. In fact, this is explained explicitly in one of the Nature journal references (http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2698.html) for the lead sentence. Pay special attention to Figure 1c. The figure text reads:

The statistical significance of the observation x is the probability of sampling a value from the distribution that is at least as far from the reference, given by the shaded areas under the distribution curve.

This is further reinforced by another already existing reference, namely that to the Nature Reviews Genetics paper (http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html). Although not open access, the relevant section states:

The P value, which was introduced earlier by Fisher in the context of significance testing, is defined as the probability of obtaining — among the values of T generated when H0 is true — a value that is at least as extreme as that of the actual sample (denoted as t).

The PNAS reference about revised standards for statistical evidence also agrees (http://www.pnas.org/content/110/48/19313.full):

The P value from a classical test is the maximum probability of observing a test statistic as extreme, or more extreme, than the value that was actually observed, given that the null hypothesis is true.
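To make the distinction concrete, here is a minimal sketch in Python (scipy assumed; the observed value 2.1 is an illustration, not taken from the cited papers) contrasting the infinitesimal point probability with the tail area that the quoted definitions describe:

```python
from scipy.stats import norm

x = 2.1  # hypothetical observed test statistic

# For a continuous null distribution, the probability of observing
# exactly x is zero -- the "sliver" under the curve has no area.
point_probability = 0.0  # P(T == x) for continuous T

# The p value is instead the probability of a value at least as far
# from the reference as x: the shaded tail areas (two-tailed here).
p_two_tailed = 2 * norm.sf(abs(x))  # sf(x) = 1 - cdf(x), the upper tail

print(f"P(T == {x}) = {point_probability}")
print(f"P(|T| >= {x}) = {p_two_tailed:.4f}")  # ~0.0357
```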

Furthermore, I propose changing the word "effect" to "results", since "effect" is a bit equivocal: it can mean both "something that is of interest to measure" (i.e. the "effect" in "effect size") and the measurement itself (i.e. the "results").

Based on these considerations, I propose changing the lead paragraph from:

In statistics, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level.[1][2][3][4][5][6][7] The p-value is the probability of observing an effect given that the null hypothesis is true whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.[8] As a matter of good scientific practice, a significance level is chosen before data collection and is usually set to 0.05 (5%).[9] Other significance levels (e.g., 0.01) may be used, depending on the field of study.[10]

to this:

In statistics, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level.[1][2][3][4][5][6][7] The p-value is the probability of obtaining at least as extreme results given that the null hypothesis is true whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.[3][4][5][8] As a matter of good scientific practice, a significance level is chosen before data collection and is usually set to 0.05 (5%).[9] Other significance levels (e.g., 0.01) may be used, depending on the field of study.[10]

I am starting this discussion, rather than boldly editing, to firmly anchor this change among participants. That way we can make durable, long-lasting changes. EmilKarlsson (talk) 22:20, 2 June 2015 (UTC)

I have no objections to the proposed change, i.e. changing effect --> at least as extreme results. It's a bit wordy but oh well, not a big deal. danielkueh (talk) 22:28, 2 June 2015 (UTC)
I do have some questions. Suppose we analyzed the results of a simple between-groups experiment and observed a p-value of 0.5 (or 50%), which is clearly not significant. Would the results need to be extreme to obtain such a large p-value? Is the p-value, in this case, "just a sliver?" danielkueh (talk) 14:52, 3 June 2015 (UTC)
Let the population means of the two groups be m(1) and m(2) (these are unknown parameters) and the observed sample means be x(1) and x(2). The observed difference is then |x(1)-x(2)| (i.e. the distance between the sample means). Assuming that the null hypothesis m(1) = m(2) is true, a p value of 0.5 means that there is a 50% probability of obtaining a difference |x(1)-x(2)| that is equal to or larger than (not merely "equal to") what you actually observed. The phrase "more extreme results" here means "a larger deviation from the null hypothesis than observed", i.e. a larger value of |x(1)-x(2)|. The p value (correctly defined) is never a sliver of a distribution (though the wrong definition currently used implies that it is), since it always covers "results or more extreme results / at least as extreme results" (i.e. from the observed result all the way out to the tail of the distribution). Consider figure 1C in http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2698.html#f1. The p value is then the entire black + grey area under the curve (for a two-tailed test), not just the infinitesimal sliver constituting the area indicated by the dotted line under "x" (i.e. the observed result). A small simulation sketch just below illustrates the 50% case. Hopefully this goes some way towards answering your question, but please ask follow-up questions if anything I wrote sounds weird, strange or otherwise unclear. EmilKarlsson (talk) 20:56, 3 June 2015 (UTC)
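A minimal simulation sketch (Python with numpy assumed; the sample size and the observed difference are made-up numbers, chosen so the answer comes out near 0.5) of what a p value of 0.5 means in the two-group case:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 100_000

# Simulate many two-group experiments in which the null hypothesis
# m(1) = m(2) is true, and record the observed difference |x(1)-x(2)|.
diffs = np.abs(rng.normal(0, 1, (trials, n)).mean(axis=1)
               - rng.normal(0, 1, (trials, n)).mean(axis=1))

observed = 0.213  # hypothetical observed difference |x(1)-x(2)|

# The p value is the fraction of null experiments that give a
# difference at least as large as the one actually observed.
p_sim = np.mean(diffs >= observed)

print(f"Simulated p value: {p_sim:.2f}")  # ~0.50
```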
I disagree that the second sentence on the definition of the p-value implies that the p-value is just a "sliver." I agree that the first sentence does imply that, because the p-value has to be a sliver for it to be significant (assuming a very low threshold). Anyway, I actually picked up the word "sliver" from what you wrote earlier: "because that probability is always infinitesimal (e.g. just a sliver in a normal distribution)." Hence my earlier questions. Thank you for explaining the phrase "results or more extreme results / at least as extreme results". I still find that statement a bit wordy for my taste, but that is a very small matter. danielkueh (talk) 22:08, 3 June 2015 (UTC)
Now I understand our crux! When I say that it is wrong to define the p value as just "probability of results given null" (instead of "probability of results or more extreme results given null") because it would falsely entail that such a probability (p value) would always be just a tiny sliver, I am speaking about the area under the graph of a distribution (compare the infinitesimal area precisely under x, constituting the flawed definition, with the black + grey area under the graph corresponding to the correct definition in the above-mentioned figure 1C). This is because a given result value (assuming that the variable is continuous and can take any value within a reasonable range) is just one possibility among a very large set of realistic possibilities. If our observed difference |x(1)-x(2)| happened to be 2.1, it could have been 1.8, 1.85, 1.8445456, 2.14 and so on. Thus, getting our precise result of 2.1 (or any specific value) given null would almost always be quite unlikely. So, given this flawed definition of a p value, all p values would be exceedingly small, which is my reductio argument against what I see as the wrong definition of the p value. Instead, the p value should be thought of as "the probability of getting a result of 2.1 or more extreme (i.e. further away from null), given null". Does this clarify what I wrote before, or does it introduce more questions than answers? EmilKarlsson (talk) 18:32, 4 June 2015 (UTC)
@EmilKarlsson, I understand what you're saying about the area under the curve and how it captures a range of lower probabilities or p-values. However, I still don't see any fundamental contradiction between the definition that you proposed and the current definition in the second lead sentence. But suppose, for the sake of this discussion, that there is a contradiction or a distinction between the two definitions. If we need to be exact, then all we have to add is the word "observed" or "calculated" before p-value, as in "observed p-value (p_obs)" or "calculated p-value (p_calc)," which is different from the critical p-value (p_crit). It seems to me that your proposed definition speaks more to p_crit than to p_obs. At the end of the day, to determine if an experimental result is significant, all we need to know is whether p_obs < p_crit, not whether our observed p-value captures a range of lower values. danielkueh (talk) 20:36, 4 June 2015 (UTC)
@EmilKarlsson, FYI, I am not opposed to the proposed changes to the second sentence. I am just having fun discussing this topic here. So feel free to move on.
I still do not think we are completely on the same wavelength. What you have written makes sense if the x-axis of the histogram were the p-value, but it is rather the observed difference. It boils down to this: the probability of getting a specific result (e.g. a difference of 5.42) given null is very, very tiny (and thus the flawed definition of the p value implies that all p-values are essentially 0), because you could have gotten any observed difference, since it is often a continuous variable. 5.42 is just a tiny, tiny subset of all the possibilities. This is why the correct p value definition has to include the bit about "more extreme results" (i.e. an observed difference of 5.42 or more extreme away from null) to even make sense or be useful. EmilKarlsson (talk) 11:24, 6 June 2015 (UTC)
@EmilKarlsson, Again, I don't dispute the proposed definition that includes the qualifier "or more extreme results." It is correct and fairly standard. If you feel that it should replace the present definition because it introduces more possibilities (smaller p-values) that correspond with more extreme results, then by all means do so. But to assert that the present definition is wrong (the heading of this discussion) because it contradicts the proposed definition is a little over the top. All the present definition is saying is: if we observed an effect (mean difference), we get this specific p-value. And if this p-value is less than alpha, it is significant. That is all. It really does not matter if there are "additional sets of possibilities" that correspond with "more extreme results." I suppose in the days before SPSS, SAS, or R, when people had to rely on t- or F-distribution tables to identify regions of p-values (one-tail or two-tail) that are smaller than alpha, it made sense to remind them of "more extreme results" because that would help them understand these lookup tables. And the best they could hope to do when reporting p-values was to specify a range such as 0.01 < p < 0.05 or simply p < 0.05. But in this day and age, when p-values can be calculated to the nth decimal, the only question we need to answer is whether this p-value (singular, not plural) is smaller or greater than alpha (the two workflows are sketched below). It's redundant, and often pointless, to ask if our observed p-value would include p-values that are smaller and correspond with larger mean differences. Of course they would; why wouldn't they? But guess what? Unless we actually do another set of experiments that produces greater effects, the only p-value we have to report is the one that was actually calculated. Again, none of these issues of pragmatics contradicts the proposed statement with the qualifier "or more extreme results." But like I said, if you want to change "an effect" to "at least as extreme results" because it would add "other subsets of possibilities," then by all means do so. Pragmatics aside, I concede it is conceptually correct. So knock yourself out. :) danielkueh (talk) 14:36, 6 June 2015 (UTC)
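A minimal sketch (Python with scipy assumed; alpha, the degrees of freedom and the observed t statistic are made-up example values) of the two workflows just described, showing that the table-lookup route and the exact p-value route yield the same decision:

```python
from scipy.stats import t

alpha, df = 0.05, 18  # hypothetical significance level and degrees of freedom
t_obs = 2.30          # hypothetical observed t statistic

# Lookup-table route: compare the statistic against the critical value
# for a two-tailed test at level alpha.
t_crit = t.ppf(1 - alpha / 2, df)  # ~2.101 for df = 18

# Modern route: compute the exact two-tailed p value and compare it
# directly with alpha.
p_obs = 2 * t.sf(abs(t_obs), df)   # ~0.034

print(f"|t_obs| > t_crit: {abs(t_obs) > t_crit}")  # True
print(f"p_obs < alpha:    {p_obs < alpha}")        # True, same decision
```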
I agree with these changes. In the second sentence I would further change
"whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true"
to
"whereas the significance or alpha (α) level is the probability value at which the null hypothesis is rejected given that it is true". Strasburger (talk) 23:01, 3 June 2015 (UTC)
The current description of alpha does have problems, but I think the status of alpha as the type-I error rate does not quite shine through in your suggestion. Perhaps we can rework the entire first paragraph in addition to the p value sentence above? What about something like this:
In statistics, statistical significance (or a statistically significant result) is attained when the p-value (p) is smaller than the significance level or alpha (α). The p value is defined as the probability of getting at least as extreme results given that the null hypothesis is true and this value is determined by the observed data and sample size.[1][2][3][4][5][6][7] In contrast, alpha is defined as the probability of rejecting the null hypothesis given that it is true and is set by the researcher as the type-I error rate of the statistical test being used (and thus limits how often researchers incorrectly reject true null hypotheses).[3][4][5][8] Other significance levels (e.g., 0.01) may be used, depending on the field of study.[10]
Is there anything that is unclear, weird or equivocal in this suggestion? A quick simulation of the type-I error rate interpretation is sketched below. EmilKarlsson (talk) 18:32, 4 June 2015 (UTC)
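A quick simulation sketch (Python with numpy and scipy assumed; the sample size and trial count are arbitrary) of the claim that alpha limits how often researchers incorrectly reject true null hypotheses: when the null is true, roughly a fraction alpha of all experiments come out significant by chance alone.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha, n, trials = 0.05, 30, 10_000

false_rejections = 0
for _ in range(trials):
    # Both groups are drawn from the same population, so H0 is true
    # and every rejection is a type-I error.
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    if ttest_ind(a, b).pvalue < alpha:
        false_rejections += 1

print(f"Observed type-I error rate: {false_rejections / trials:.3f}")  # ~0.05
```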
Very clear and concise. Strasburger (talk) 21:17, 4 June 2015 (UTC)
@EmilKarlsson, For starters, delete needless words such as "defined as" and just say what it is. For example, change "The p value is defined as the probability" to "The p value is the probability." Change "smaller than" back to "less than" as that is how the symbol "<" is often read. I recommend either splitting the third sentence on the alpha level into multiple sentences, or omitting some information. Too much info. It's the lead, not the main text. The last sentence does not follow from the previous sentence. Finally, I recommend making these changes *global,* i.e., be sure that the main text says the same thing as the lead. After all, the lead is supposed to be a summary (details in WP:lead). In fact, I recommend starting with the main text and working your way back to the lead. That way, no topic or issue is given undue weight. danielkueh (talk) 21:31, 4 June 2015 (UTC)
Good point. In fact, I think the entire article deserves to be rewritten because of the seriousness and importance of this topic. What kinds of sections do you think would be worthwhile to include? History; role in statistical significance testing (perhaps including a table of important concepts related to statistical significance tests, like alpha, beta, sample size, etc.); what can be inferred from a statistically significant result; strengths and drawbacks; misunderstandings; alternatives to statistical significance? Is there any other key issue that stands out and deserves a place in this article? EmilKarlsson (talk) 11:24, 6 June 2015 (UTC)
@EmilKarlsson, it is entirely up to you. Right now, you seem to be focused on p-values, alphas, etc. So if you want, you can start with the "Role in statistical hypothesis testing" section. I agree this article could always be improved. If you think the present article is bad, you should have seen previous versions (e.g., [[1]]). By the way, I appreciate you taking the lead on this. Have fun editing. I am not as active as I used to be on Wikipedia, but if there's anything I can do to help, feel free to post your requests here or on my talk page. I am sure there are other editors who would be interested as well. danielkueh (talk) 14:36, 6 June 2015 (UTC)
@EmilKarlsson, FYI, if you intend on taking this article to FA status (WP:FA), you should check out Wikipedia's The Core Contest (WP:TCC). I believe this year's contest has just ended, but given the amount of time and effort it takes to build an article to FA status, you could try for next year's. It's just a bonus. danielkueh (talk) 15:03, 6 June 2015 (UTC)

Maybe it is just me, but it sometimes seems to me that there is too much focus on alpha and on making a black-and-white assessment as to whether or not something is "significant". I can imagine that an experimenter might choose, beforehand, an alpha threshold (and that is the word I would use), but if a p-value is just slightly larger than the chosen alpha, the results still might be worthy of reporting. In my own work I have the flexibility to report p-values as they are; I put them in papers, and I let the reader judge them for what they are. So, in light of this, I would advocate some accommodation or discussion in this article of the often arbitrary thinking about what alpha should be and, indeed, whether or not there should even be an alpha. Isambard Kingdom (talk) 14:01, 7 June 2015 (UTC) Oh, and I now see that I've already been commenting on this talk page! I had actually forgotten. Forgive me for possible redundancy. Isambard Kingdom (talk) 14:29, 7 June 2015 (UTC)

@Isambard Kingdom, interesting. I'm assuming your work is in physics or similar? I know in the life sciences and in the social sciences, the alpha level is strictly enforced. In fact, many folks get suspicious if the p-value is too close to the alpha level (e.g., 0.047). danielkueh (talk) 15:53, 7 June 2015 (UTC)
Hmm. I am in the life sciences/social sciences and I do it exactly like Isambard. Strasburger (talk) 16:29, 7 June 2015 (UTC)
@Strasburger, by "do it exactly like Isambard," you mean report p-values that are not significant? Sure, you can do that. Doesn't make them any more statistically significant. danielkueh (talk) 16:39, 7 June 2015 (UTC)
It's good practice to just mention the achieved p-value and discuss its 'significance'. A significance level is needed only if a test is performed by establishing a critical region. Nijdam (talk) 18:37, 10 June 2015 (UTC)
If the p value is slightly above the 5% mark, one way of saying it is that "the result just missed significance (p=xxx%)." The result might be worth reporting, in particular if n is small. Reading Friston's paper cited in the main article is instructive for taking the focus a little away from the alpha level. Strasburger (talk) 19:38, 10 June 2015 (UTC)
@Nijdam, I guess it really depends on the research question. If the non-significant result is interesting (e.g., drug A does not work), then yes, we should discuss the practical or theoretical significance of the statistically non-significant result. danielkueh (talk) 20:01, 10 June 2015 (UTC)

References mentioned in discussion

  1. ^ a b c Redmond, Carol; Colton, Theodore (2001). "Clinical significance versus statistical significance". Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics (3rd ed.). West Sussex, United Kingdom: John Wiley & Sons Ltd. pp. 35–36. ISBN 0-471-82211-6.
  2. ^ a b c Cumming, Geoff (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, USA: Routledge. pp. 27–28.
  3. ^ a b c d e Krzywinski, Martin; Altman, Naomi (30 October 2013). "Points of significance: Significance, P values and t-tests". Nature Methods. 10 (11). Nature Publishing Group: 1041–1042. doi:10.1038/nmeth.2698. Retrieved 3 July 2014.
  4. ^ a b c d e Sham, Pak C.; Purcell, Shaun M (17 April 2014). "Statistical power and significance testing in large-scale genetic studies". Nature Reviews Genetics. 15 (5). Nature Publishing Group: 335–346. doi:10.1038/nrg3706. Retrieved 3 July 2014.
  5. ^ a b c d e Johnson, Valen E. (October 9, 2013). "Revised standards for statistical evidence". Proceedings of the National Academy of Sciences. 110 (48). National Academy of Sciences: 19313–19317. doi:10.1073/pnas.1313476110. Retrieved 3 July 2014.
  6. ^ a b c Altman, Douglas G. (1999). Practical Statistics for Medical Research. New York, USA: Chapman & Hall/CRC. p. 167. ISBN 978-0412276309.
  7. ^ a b c Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 300–344. ISBN 0-538-73352-7.
  8. ^ a b c Schlotzhauer, Sandra (2007). Elementary Statistics Using JMP (SAS Press) (PAP/CDR ed.). Cary, NC: SAS Institute. pp. 166–169. ISBN 1-599-94375-1.
  9. ^ a b Craparo, Robert M. (2007). "Significance level". In Salkind, Neil J. (ed.). Encyclopedia of Measurement and Statistics. Vol. 3. Thousand Oaks, CA: SAGE Publications. pp. 889–891. ISBN 1-412-91611-9.
  10. ^ a b c Sproull, Natalie L. (2002). "Hypothesis testing". Handbook of Research Methods: A Guide for Practitioners and Students in the Social Sciences (2nd ed.). Lanham, MD: Scarecrow Press, Inc. pp. 49–64. ISBN 0-810-84486-9.