“How many participants do I need?”
It’s probably the single most common question students and early-career researchers ask me when they’re designing a study. And it’s a fair question: you can’t conduct research without knowing how much data to collect. But the honest answer is rarely the one people want to hear.
The short version: more is better, but it depends on the design.
The longer version is what this post is about, because the rules of thumb you may have picked up in your statistics classes (“n equals 30,” “a few hundred should be enough”) are often misleading, and sometimes actively harmful when applied to real research questions.
The Myth of “n = 30”
Many introductory statistics textbooks still teach the idea that 30 participants is some kind of magic threshold: the point at which the Central Limit Theorem kicks in and your analyses become trustworthy. There’s some mathematical truth behind this for certain narrow situations (the sampling distribution of a mean from a population that isn’t too skewed is usually close to normal by then), but it has been wildly overgeneralized.
That said, I don’t think the “n = 30” rule is entirely useless. It can serve as a baseline floor, a minimum below which you really shouldn’t go for most quantitative work. If we didn’t have a number like 30 floating around, researchers might be tempted to run studies with 12 participants and report results as if they meant something. The number 30 establishes that there’s some minimum, and that’s better than no minimum at all.
But it’s a floor, not a target. For most real research questions, 30 participants is the absolute bottom of what could possibly be defensible, and often well below what you actually need.
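How far the n = 30 threshold can be trusted is easy to check by simulation. The sketch below is an illustration, not a formal result: it draws many samples of 30 from a normal population and from a skewed (lognormal) one, builds the usual 95% t-interval for the mean each time, and counts how often the interval contains the true mean. The lognormal population, the number of repetitions, and the seed are all assumptions chosen for the example.

```python
import random
import statistics
from math import exp, sqrt

N = 30                 # the "magic" sample size under test
T_CRIT_DF29 = 2.045    # two-sided 95% critical value of Student's t, 29 df


def coverage(draw, true_mean, reps=4000, seed=1):
    """Share of 95% t-intervals (n = 30) that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(N)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / sqrt(N)
        if abs(mean - true_mean) <= T_CRIT_DF29 * se:
            hits += 1
    return hits / reps


# Normal population with mean 0.
normal_cov = coverage(lambda rng: rng.gauss(0, 1), 0.0)
# Lognormal population (mu=0, sigma=1); its true mean is exp(0.5).
skewed_cov = coverage(lambda rng: rng.lognormvariate(0, 1), exp(0.5))

print(f"normal population: {normal_cov:.3f}")
print(f"skewed population: {skewed_cov:.3f}")
```

If the rule held universally, both coverage rates would sit near 95%. With a strongly skewed population, the second one typically falls noticeably short, which is the overgeneralization in action.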
What Sample Size Actually Depends On
Whenever someone asks me for a sample size, I answer with a series of questions back at them. What kind of study is this? What’s the goal? Is it survey data, an experiment, qualitative interviews? What’s the planned level of analysis? Are you running simple correlations? Comparing groups? Looking at within-subjects effects? Building regression models? Doing machine learning?
Each of these answers shifts the appropriate sample size, sometimes by orders of magnitude.
A within-subjects experiment comparing two conditions can sometimes be done with 30 to 40 participants if the effect is large and the measurement is precise. A between-groups study comparing the same two conditions might need 60 to 80 per group. A regression analysis with five predictors typically needs at least 100 to 150 participants for stable estimates. A study trying to detect an interaction between two variables can easily need three or four times more participants than a study looking at a main effect. A factor analysis or structural equation model (SEM) usually needs hundreds. A machine learning model often needs thousands.
The same research question, framed differently or analyzed differently, can demand wildly different sample sizes. There is no single answer.
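To make the design dependence concrete, here is a rough sketch comparing a between-groups and a within-subjects version of the same two-condition study. It uses the normal approximation to the t-test, so the numbers run slightly below what exact tools report. The effect size d = 0.5 and the correlation r = 0.5 between repeated measures are assumptions picked for illustration, and the function names are my own.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_sum(alpha=0.05, power=0.80):
    """z for the two-sided significance level plus z for the target power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_between_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-group comparison."""
    return ceil(2 * (z_sum(alpha, power) / d) ** 2)


def n_within(d, r, alpha=0.05, power=0.80):
    """Approximate participants for a paired (within-subjects) comparison.

    r is the assumed correlation between the two repeated measurements.
    """
    d_z = d / sqrt(2 * (1 - r))  # effect size of the difference scores
    return ceil((z_sum(alpha, power) / d_z) ** 2)


per_group = n_between_per_group(0.5)   # 63 per group, 126 in total
paired = n_within(0.5, r=0.5)          # 32 in total
print(per_group, paired)
```

Under these assumptions the within-subjects design needs about a quarter of the participants, and the gap widens as the correlation between the repeated measures increases.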
Effect Size: The Hidden Variable
Beyond design, the other major factor is effect size, or how big the phenomenon you’re studying actually is.
The basic principle is simple: smaller effects need bigger samples to detect them reliably. If every participant in your sample shows a dramatic difference between conditions, you don’t need many people. The signal is so strong it cuts through the noise easily. If the difference between conditions is small and only emerges as a statistical tendency across many participants, you need a much larger sample to see it.
This is why studies of obvious effects (people prefer chocolate cake to cardboard) can be small, while studies of subtle effects (a new therapy is slightly better than the existing one) need to be enormous. The smaller the truth you’re trying to find, the more data you need to find it.
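The relationship can be put in numbers. The sketch below uses the normal approximation for a two-group comparison at 80% power and a two-sided 5% significance level; the three values of Cohen’s d are the conventional “small,” “medium,” and “large” benchmarks, used here only as examples.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a standardized difference d."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return ceil(2 * (z / d) ** 2)


for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group")
# d = 0.8: about 25 per group
# d = 0.5: about 63 per group
# d = 0.2: about 393 per group
```

Because the required sample grows with 1/d², halving the expected effect roughly quadruples the number of participants you need.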
The implication for researchers is that you need some idea of how big your expected effect will be before you can plan your sample size. This is where prior literature, pilot studies, and theoretical reasoning all become useful. You’re not guessing in the dark; you’re estimating based on what’s known about similar effects, and adjusting accordingly.
Power Analysis: Doing It Properly
The formal way to bring all of this together is power analysis. Power analysis takes your expected effect size, your chosen statistical test, your significance level (typically α = 0.05), and your desired probability of detecting an effect that is really there (typically 80%), and tells you how large a sample you need.
I do power analyses for my own studies. Not because the field universally demands it (it should, but it doesn’t), but because I want to be more precise about what I’m doing. If I run a study and find a non-significant result, I want to know whether the effect genuinely wasn’t there or whether my sample was simply too small to detect it. Power analysis gives me that confidence in advance.
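The same logic can be run in the other direction: given a planned sample, how likely is the study to detect the effect at all? A minimal sketch, again using the normal approximation for a two-group comparison (the function name and the example numbers are assumptions):

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for effect size d."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)   # expected value of the test statistic
    # Probability of landing beyond either critical value.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


well_powered = approx_power(0.5, 64)   # roughly 0.81
underpowered = approx_power(0.5, 20)   # roughly 0.35
print(f"{well_powered:.2f} {underpowered:.2f}")
```

With 20 per group and a medium-sized effect, the study would miss a real effect about two times out of three, so a non-significant result from it says very little.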
For R users, the pwr package is a good starting point. It handles t-tests, ANOVAs, correlations, and basic regression. For more complex designs, including mixed-effects models and simulation-based approaches, the simr package is excellent. G*Power is a free standalone tool that many researchers use as well.
These tools are easy to use once you understand the concepts. The harder part is committing to actually doing the analysis before you collect data, rather than after.
When Small Samples Are Justified
Not every study needs hundreds or thousands of participants. There are genuine cases where smaller samples are not just acceptable but appropriate.
Expert interviews are the clearest example. If you want to understand how senior radiologists make diagnostic decisions, your population of true experts is small to begin with. You can’t recruit 500 senior radiologists; in many regions there aren’t 500. A study with 8 or 12 expert interviews can be entirely defensible if the experts are genuinely representative of the relevant population.
Qualitative research has its own logic. My rule of thumb for qualitative interviews is to keep interviewing until no new information emerges, that is, until your data is saturated. Sometimes that happens at 8 interviews, sometimes at 25, occasionally more. The right sample size is the one at which additional interviews stop adding new insight. This is fundamentally different from quantitative reasoning about statistical power, but it’s just as principled.
Single-case experimental designs, intensive longitudinal studies, and certain neuroimaging paradigms all have their own rationales for smaller samples. The key is that “small” should be intentional and justified, not accidental.
Bigger Isn’t Always Better
On the other end of the spectrum, the rise of online recruitment platforms like Prolific and MTurk has made it easy to collect samples of thousands of participants quickly and cheaply. This sounds great: more is better, right?
Not necessarily. A large sample size means nothing if your sample isn’t well-matched to your research question. If you want to understand the experiences of Georgian university students but you recruit 5,000 American workers on MTurk, you have a large dataset of irrelevant participants. The sample size is impressive on paper and useless in practice.
The research question should drive the choice of population, and the choice of population should drive the sampling strategy. The size comes after that, not before. Running a study with 5,000 participants from the wrong population is worse than running a study with 200 participants from the right one.
What Large-Scale Assessment Data Offers
One of the privileges of working with international large-scale assessment data, such as PISA, PIRLS, and TIMSS, is access to genuinely massive samples that are also high-quality. These programs typically sample several thousand students per country, adding up to hundreds of thousands across participating countries, using carefully designed probability sampling with appropriate weights to produce nationally representative estimates.
This is rare in psychology. Most of our datasets are convenience samples: students recruited from undergraduate psychology pools, or online participants who happen to click on a link. These can be useful, but they don’t support strong generalizations about populations.
When you have a true probability sample with appropriate weights, you can make claims about the population as a whole. You can estimate parameters with known precision. You can make policy-relevant inferences with confidence. Any such dataset should be treated as a treasure, because for most psychological questions, we’ll never have anything like it.
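What the weights do can be shown with a toy example. The two strata, their sizes, and their mean scores below are entirely made up; the point is only that when one group is deliberately oversampled, each respondent has to be weighted by how many people in the population they stand for. Real assessment data also come with replicate weights for variance estimation, which this sketch ignores.

```python
# Hypothetical two-stratum population; every number here is illustrative.
strata = [
    # (name, population size, sample size, mean score in the sample)
    ("urban", 9000, 300, 500.0),
    ("rural", 1000, 300, 440.0),   # deliberately oversampled
]


def unweighted_mean(strata):
    """Treats every sampled student as equally representative."""
    total = sum(n * mean for _, _, n, mean in strata)
    return total / sum(n for _, _, n, _ in strata)


def weighted_mean(strata):
    """Weights each sampled student by pop_size / n, the inverse sampling fraction."""
    num = sum(n * (pop_size / n) * mean for _, pop_size, n, mean in strata)
    den = sum(n * (pop_size / n) for _, pop_size, n, _ in strata)
    return num / den


print(unweighted_mean(strata))          # 470.0, pulled toward the oversampled stratum
print(round(weighted_mean(strata), 6))  # 494.0, matches the population mean
```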
The Literary Digest Lesson
The classic cautionary tale about sample size and sampling quality comes from the 1936 US presidential election. The Literary Digest, a prominent American magazine, ran what was then the largest political poll in history, over two million responses. They confidently predicted Alf Landon would defeat Franklin Roosevelt.
They were spectacularly wrong. Roosevelt won in a landslide. What went wrong wasn’t the size of the sample. It was the sampling method. The Digest had drawn its sample from telephone directories and automobile registration lists, which in 1936 meant it had effectively sampled wealthier Americans, who were more likely to support Landon. The two million respondents were not representative of the actual electorate.
The lesson has endured. A sample of two million from the wrong population is worse than a sample of two thousand from the right one. Size matters, but it matters far less than representativeness.
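The arithmetic behind this is simple: a larger sample shrinks random error, but it does nothing to sampling bias. The sketch below computes the expected error (root mean squared error) of a poll estimate under two scenarios. The vote shares are hypothetical round numbers chosen for illustration, not the actual 1936 figures.

```python
from math import sqrt


def rmse(true_share, frame_share, n):
    """Expected error of a poll of size n drawn from a given sampling frame.

    true_share:  support in the whole electorate
    frame_share: support among the people the poll can actually reach
    """
    bias = frame_share - true_share
    variance = frame_share * (1 - frame_share) / n
    return sqrt(bias ** 2 + variance)


TRUE_SHARE = 0.60  # hypothetical support in the full electorate

huge_biased = rmse(TRUE_SHARE, frame_share=0.45, n=2_000_000)
small_random = rmse(TRUE_SHARE, frame_share=0.60, n=2_000)

print(f"two million, biased frame: {huge_biased:.3f}")   # 0.150
print(f"two thousand, random:      {small_random:.3f}")  # 0.011
```

Under these assumptions the enormous poll is off by about fifteen percentage points no matter how many more responses it collects, while the small random sample is off by about one.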
The Principle
If I had to summarize everything in this post in a single sentence: bigger is better, but only if quality is preserved.
Try to get the largest sample your resources allow. But don’t let the pursuit of large numbers distract you from sampling well, measuring well, and thinking carefully about what your sample actually represents. A modest sample carefully chosen will tell you more about the world than a huge sample carelessly assembled.
And before you collect data, do a power analysis. Estimate your effect size based on prior work. Justify your sample size in advance. This is the difference between research as planning and research as hope.
—
Giorgi Tchumburidze
June, 2026