File:P-hacking by early stopping.svg

Original file (SVG file, nominally 1,152 × 576 pixels, file size: 135 KB)

Summary

Description
English: The figure shows the change in p-values computed from a t-test as the sample size increases, and how early stopping can allow for p-hacking.

Data is drawn from two identical normal distributions, <math>\mathcal{N}(0, 10^2)</math>. For each sample size <math>n</math>, ranging from 5 to <math>10^4</math>, a t-test is performed on the first <math>n</math> samples from each distribution, and the resulting p-value is plotted. The red dashed line indicates the commonly used significance level of 0.05.
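
For reference, the statistic behind each plotted point can be sketched as follows. This assumes the pooled-variance (Student) form, which is what `scipy.stats.ttest_ind` computes by default, with equal group sizes <math>n</math>. The symbols <math>\bar{x}_i</math> (sample mean), <math>s_i^2</math> (unbiased sample variance) and <math>s_p</math> (pooled standard deviation) are notation introduced here, not taken from the original description.

<math>t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{2/n}}, \qquad s_p^2 = \frac{s_1^2 + s_2^2}{2}</math>

The two-sided p-value is then the probability that a Student's t variable with <math>2n - 2</math> degrees of freedom exceeds <math>|t|</math> in absolute value.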

If the data collection or analysis were to stop at a point where the p-value happened to fall below the significance level, a spurious statistically significant difference could be reported.
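
The effect described above can be quantified with a small simulation. The following is a minimal sketch, not part of the original figure code: the function name `early_stopping_false_positive_rate` and its parameters are hypothetical, chosen for illustration. It repeats the experiment many times with no true difference between the groups and compares how often a "significant" result appears at any intermediate sample size against how often it appears at the final, pre-planned sample size.

```python
import numpy as np
from scipy import stats


def early_stopping_false_positive_rate(
    n_experiments=100, start=5, max_n=100, alpha=0.05, seed=0
):
    """Hypothetical helper: estimate false positive rates under the null.

    Returns (peeking_rate, fixed_rate), where peeking_rate is the fraction
    of experiments in which the p-value fell below alpha at some sample size
    between start and max_n, and fixed_rate is the fraction in which it was
    below alpha at the final sample size max_n only.
    """
    rng = np.random.default_rng(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution, so any
        # "significant" difference is a false positive.
        a = rng.normal(loc=0, scale=10, size=max_n)
        b = rng.normal(loc=0, scale=10, size=max_n)
        p_values = []
        for n in range(start, max_n + 1):
            _, p = stats.ttest_ind(a[:n], b[:n])
            p_values.append(p)
        if min(p_values) < alpha:
            peeking_hits += 1  # stopping at the best moment would "succeed"
        if p_values[-1] < alpha:
            fixed_hits += 1  # significance at the pre-planned sample size
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking_rate, fixed_rate = early_stopping_false_positive_rate()
print(f"false positive rate with early stopping: {peeking_rate:.2f}")
print(f"false positive rate at fixed sample size: {fixed_rate:.2f}")
```

By construction the first rate can never be lower than the second, because an experiment that is significant at the final sample size also counts as one in which the p-value dipped below the threshold at some point.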

Illustration based on

Wagenmakers, Eric-Jan. "A practical solution to the pervasive problems of p values." Psychonomic Bulletin & Review 14.5 (2007): 779-804.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)


# Function to perform t-test and return p-value
def perform_t_test(sample1, sample2):
    _, p_value = stats.ttest_ind(sample1, sample2)
    return p_value


# Initialize parameters
max_samples = 10**4
start_samples = 5
p_values = []
sample_sizes = range(start_samples, max_samples + 1)

# Generate data and perform t-tests
population1 = stats.norm(loc=0, scale=10)
population2 = stats.norm(loc=0, scale=10)

samples1 = population1.rvs(max_samples)
samples2 = population2.rvs(max_samples)

for n in sample_sizes:
    p_value = perform_t_test(samples1[:n], samples2[:n])
    p_values.append(p_value)

# Create the plot
plt.figure(figsize=(12, 6))
plt.semilogx(sample_sizes, p_values, 'b-')
plt.axhline(y=0.05, color='r', linestyle='--', label='p = 0.05')
plt.xlabel('Sample Size (log scale)')
plt.ylabel('p-value')
plt.title('Variability of p-value as Sample Size Increases')
plt.grid(True, which="both", ls="-", alpha=0.2)
plt.legend()
plt.ylim(0, 1)
plt.tight_layout()
plt.savefig('p-hacking.svg')
plt.show()
```
Date
Source Own work
Author Cosmia Nebula

Licensing

I, the copyright holder of this work, hereby publish it under the following license:
This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
You are free:
  • to share – to copy, distribute and transmit the work
  • to remix – to adapt the work
Under the following conditions:
  • attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.


Items portrayed in this file

depicts

15 July 2024

File history


Date/Time | Dimensions | User | Comment
current, 01:21, 26 July 2024 | 1,152 × 576 (135 KB) | Cosmia Nebula | Uploaded while editing "Data dredging" on en.wikipedia.org