Reliability and Validity in Psychology

Imagine I bring a bathroom scale into the room and tell you it measures height. You’d laugh. The instrument is wrong for the construct. It doesn’t matter how precise the scale is, how consistent its readings are, or how trustworthy the brand is: if I’m using it to measure height, I’m wrong.

Now imagine I bring the same scale to measure weight, which is what scales are actually for. But this time, every time you step on it, the number changes. 70 kilograms. 73. 68. 71. The scale measures the right thing, but it can’t measure it consistently.

The first scenario is a problem of validity. The second is a problem of reliability. And in psychology, where almost everything we measure is invisible – intelligence, anxiety, motivation, personality – these two concepts are absolutely fundamental. Get them wrong, and your entire research project, your entire clinical assessment, your entire scale becomes meaningless.

The Hidden Difficulty of Psychological Measurement

In physics, measurement is relatively straightforward. Length, mass, and temperature have direct physical referents. You can hold them up against a standard. You can verify them.

In psychology, almost nothing works that way. We measure latent variables, constructs that exist only as theoretical entities, inferred from observable behavior. Anxiety isn’t a thing you can hold. Self-efficacy isn’t a substance you can weigh. We can only get at them indirectly, through questionnaires, behavioral observations, response patterns. That’s what makes psychometrics both fascinating and difficult.

And it’s exactly why reliability and validity matter so much. When you can’t directly verify what you’re measuring, you have to rely on careful reasoning and statistical evidence to argue that your instrument is doing what you claim it does. Skip that work, and you’re building on sand.

Validity: Are We Measuring What We Think We Are?

Validity is the question of fit. Does this instrument actually measure what we say it measures?

The bathroom scale isn’t valid for height. A test of statistical knowledge isn’t valid as a measure of “knowledge of psychology.” A questionnaire about social media use isn’t valid for measuring depression. The instrument might be excellent, but if it’s the wrong instrument for the construct, validity is gone.

Here’s a concrete example I sometimes use with students: imagine I take an exam designed to test statistics knowledge in psychology programs and rename it “Knowledge of Psychology.” I administer it to students. Their scores will be highly consistent: students who know statistics well will score high every time, and students who don’t will score low. The measurement is reliable. But it’s not measuring what I claim it measures. Statistics knowledge is a real thing, and the test measures it correctly. But if I’m calling it a measure of general psychology knowledge, I’m wrong. The label doesn’t match the content.

Validity also isn’t one single thing. There are multiple types: content validity (does the test cover the relevant material?), construct validity (does it relate to the theoretical concept it claims to measure?), criterion validity (does it predict outcomes it should predict?), convergent validity (does it correlate with other measures of the same construct?), discriminant validity (does it not correlate with measures of unrelated constructs?). Each of these is a different question, and a fully validated instrument needs evidence for several of them. Validity is built up, piece by piece, over many studies.
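To make convergent and discriminant evidence concrete, here is a minimal sketch with simulated data. Everything in it is an assumption for illustration: the three scales are hypothetical, the latent traits are invented standard-normal variables, and the noise level of 0.5 is arbitrary. The point is only the pattern: two measures of the same construct should correlate strongly, and a measure of an unrelated construct should not.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

rng = random.Random(42)   # fixed seed so the sketch is repeatable
N_PEOPLE = 500
NOISE_SD = 0.5            # assumed measurement error, chosen arbitrarily

# Two independent, invented latent traits.
anxiety = [rng.gauss(0, 1) for _ in range(N_PEOPLE)]
numeracy = [rng.gauss(0, 1) for _ in range(N_PEOPLE)]

# Three hypothetical instruments: each is a latent trait plus random error.
new_anxiety_scale = [t + rng.gauss(0, NOISE_SD) for t in anxiety]
established_anxiety_scale = [t + rng.gauss(0, NOISE_SD) for t in anxiety]
numeracy_test = [t + rng.gauss(0, NOISE_SD) for t in numeracy]

convergent_r = pearson(new_anxiety_scale, established_anxiety_scale)
discriminant_r = pearson(new_anxiety_scale, numeracy_test)

print(f"convergent r (same construct):       {convergent_r:.2f}")
print(f"discriminant r (unrelated construct): {discriminant_r:.2f}")
```

With real instruments the picture is rarely this clean, which is one reason validity evidence has to accumulate over many studies.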

Reliability: Are We Measuring Consistently?

Reliability is the question of stability. If we measure the same thing twice under the same conditions, do we get the same answer?

The faulty scale from my opening example is unreliable. So is a personality test that gives you wildly different scores each time you take it. So is an interview-based assessment where two trained clinicians watching the same patient reach completely different conclusions.

Like validity, reliability has multiple flavors. Internal consistency (often measured with Cronbach’s alpha) tells us whether the items within a scale are measuring the same underlying construct. This is usually the first thing I compute when validating a new scale, including in my current work adapting the PSS-10 to Georgian. Test-retest reliability tells us whether scores are stable over time. Inter-rater reliability matters when human judgment is involved, for example, when multiple coders analyze the same qualitative text and we need to know whether they agree.
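For readers who have not computed it by hand, Cronbach’s alpha can be sketched in a few lines. The formula is k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The response matrix below is invented for illustration (six respondents, four items scored 0 to 4); it is not PSS-10 data, and the function name is my own.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, one row of k item scores per respondent."""
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Invented responses: 6 respondents x 4 items, each item scored 0-4.
responses = [
    [3, 4, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
    [2, 2, 2, 3],
    [0, 1, 1, 0],
    [3, 3, 4, 3],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

In practice you would run this on the full item-level dataset, after reverse-scoring any negatively worded items, and most statistical packages provide it directly.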

The right type of reliability depends on what you’re measuring and how. But the underlying principle is always the same: if your measurement isn’t stable, you can’t trust any single result.
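For the inter-rater case, one common statistic is Cohen’s kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below uses ten invented text segments coded by two hypothetical coders; the labels and the function name are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(labels_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented codes for ten text segments.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the coders agree on 8 of 10 segments, but because each uses both labels half the time, chance alone would produce 50% agreement, so kappa works out to 0.6 rather than 0.8.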

The Bullseye Analogy

The clearest way to see how reliability and validity interact is the classic target analogy. Imagine four targets, each with multiple shots:

Target 1: High validity, high reliability. All shots cluster tightly in the bullseye. The measurement is on target and consistent. This is what every researcher hopes to achieve.

Target 2: Low validity, high reliability. All shots cluster tightly, but in the wrong place, far from the bullseye. The measurement is consistent but consistently wrong. This is the case in my “statistics test labeled as psychology test” example. The instrument works precisely; it just doesn’t work for what we claim.

Target 3: High validity, low reliability. Shots are scattered widely. Their average lands on the bullseye, but no single shot is close. The measurement is right on average but unstable in any individual case. This is theoretically possible but rare in practice, and arguably the more dangerous combination, because aggregate results look fine even when individual measurements are unusable.

Target 4: Low validity, low reliability. Shots scattered everywhere, far from the bullseye. The instrument is neither accurate nor consistent. Throw it out.
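The four targets can be sketched as a small simulation. This is an illustration under stated assumptions, not a psychometric model: “validity” is stood in for by how far the centre of the shot cluster sits from the bullseye, “reliability” by how tightly the shots cluster around their own centre, and the bias and spread values are arbitrary.

```python
import random
from math import hypot

def shoot(bias, spread, n_shots=200, seed=0):
    """Simulate shots at a target whose bullseye is at (0, 0).

    bias   -- systematic horizontal offset of the aim (low validity if large)
    spread -- random scatter of each shot (low reliability if large)
    """
    rng = random.Random(seed)
    return [(bias + rng.gauss(0, spread), rng.gauss(0, spread))
            for _ in range(n_shots)]

def summarize(shots):
    """Return (distance of cluster centre from bullseye, mean distance from own centre)."""
    n = len(shots)
    cx = sum(x for x, _ in shots) / n
    cy = sum(y for _, y in shots) / n
    centre_offset = hypot(cx, cy)
    scatter = sum(hypot(x - cx, y - cy) for x, y in shots) / n
    return centre_offset, scatter

targets = {
    "target_1": shoot(bias=0.0, spread=0.2),  # valid and reliable
    "target_2": shoot(bias=3.0, spread=0.2),  # reliable, not valid
    "target_3": shoot(bias=0.0, spread=2.0),  # valid on average, not reliable
    "target_4": shoot(bias=3.0, spread=2.0),  # neither
}

summary = {name: summarize(shots) for name, shots in targets.items()}
for name, (centre_offset, scatter) in summary.items():
    print(f"{name}: centre offset {centre_offset:.2f}, scatter {scatter:.2f}")
```

Targets 1 and 3 both come out with a centre offset near zero, which is exactly why Target 3 is deceptive: the average looks fine while the individual shots are far from the mark.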

The crucial insight here is the asymmetry: a measurement can be reliable without being valid, but it cannot be valid without being reliable. If your instrument can’t even measure consistently, it certainly can’t measure the right thing consistently. Reliability is a necessary condition for validity. Not sufficient, but necessary.
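One way to see why reliability caps validity is the classical test theory picture, in which an observed score is a true score plus independent random error. Under that assumption, a measure’s correlation with anything else cannot exceed the square root of its reliability. The simulation below is a sketch of that idea only: the true scores are invented, the noise levels are arbitrary, and the “criterion” is taken to be the construct itself, which is the most favourable case possible.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

rng = random.Random(7)
N_PEOPLE = 5000
true_score = [rng.gauss(0, 1) for _ in range(N_PEOPLE)]  # invented construct, variance 1

results = []
for noise_sd in (0.5, 1.0, 2.0):
    observed = [t + rng.gauss(0, noise_sd) for t in true_score]
    # Theoretical reliability: true-score variance / observed-score variance.
    reliability = 1 / (1 + noise_sd ** 2)
    # Best-case validity: correlation of the noisy measure with the construct itself.
    validity = pearson(observed, true_score)
    results.append((noise_sd, reliability, validity))
    print(f"noise {noise_sd}: reliability {reliability:.2f}, "
          f"validity {validity:.2f}, ceiling {reliability ** 0.5:.2f}")
```

As the noise grows, reliability falls and the best achievable validity falls with it, even though the instrument is aimed at exactly the right construct.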

A Cautionary Tale: The 16 Personalities Problem

The MBTI, one of the most popular personality assessments in the world, and its many online clones marketed as “16 Personalities” illustrate exactly why reliability and validity matter.

The framework is rooted in Jungian theory and the work of Katharine Cook Briggs and Isabel Briggs Myers. It sorts people into 16 types based on four dichotomies: Introvert/Extravert, Sensing/Intuition, Thinking/Feeling, Judging/Perceiving. INTJ. ENFP. ISFJ. The codes are everywhere, in dating profiles, on LinkedIn bios, in corporate training materials.

The problem is that the test fails on both reliability and validity, and in deeply connected ways.

The validity problem is that the dichotomies don’t match how the underlying traits actually work. Introversion and extraversion are not two separate categories; they are opposite ends of a continuous spectrum, with most people falling somewhere in the middle. Forcing this continuous variation into a binary distinction throws away information. It’s a measurement-level error: the underlying scores are continuous (typically treated as interval-level), but they’re being reported on a nominal scale. You can’t legitimately convert “slightly more introverted than extraverted” into a hard “I” type without losing the meaning of the original measurement.

That problem creates the reliability problem. Because so many people sit near the middle of each dichotomy, small day-to-day variation in their responses pushes them across the threshold. Someone who scored as INTJ on Monday might come back as INTP on Friday, not because their personality has fundamentally changed, but because they were one or two responses away from the cutoff. Studies have repeatedly shown that the MBTI has poor test-retest reliability, with a substantial percentage of people getting different type assignments when they retake the test.

Compare this to a properly designed Big Five inventory like the IPIP-NEO, which reports continuous scores on each of five dimensions. Your score on Conscientiousness might be at the 67th percentile today and the 64th next week. That is a small variation, and you’re still in the same general range. The information isn’t artificially binarized into “high or low,” and small fluctuations don’t produce dramatically different categorizations.
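The contrast between continuous and dichotomized reporting can be sketched with a simulation. The numbers here are assumptions, not estimates from real MBTI or Big Five data: each invented person has four independent, normally distributed traits, each test sitting adds random day-to-day noise with a standard deviation of 0.3, and a “type” is formed by cutting every dimension at its midpoint.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

rng = random.Random(2026)
N_PEOPLE = 2000
N_DIMENSIONS = 4   # four dichotomies, as in a four-letter type code
NOISE_SD = 0.3     # assumed day-to-day fluctuation, chosen arbitrarily

def administer(true_scores):
    """One test sitting: each true score plus random day-to-day noise."""
    return [t + rng.gauss(0, NOISE_SD) for t in true_scores]

def to_type(scores):
    """Dichotomize every dimension at the midpoint (0), as a type code does."""
    return tuple(s > 0 for s in scores)

people = [[rng.gauss(0, 1) for _ in range(N_DIMENSIONS)] for _ in range(N_PEOPLE)]
monday = [administer(p) for p in people]
friday = [administer(p) for p in people]

# Continuous reporting: test-retest correlation on the first dimension.
retest_r = pearson([m[0] for m in monday], [f[0] for f in friday])

# Binary reporting: how often one letter flips, and how often the whole type changes.
letter_flip_rate = sum((m[0] > 0) != (f[0] > 0)
                       for m, f in zip(monday, friday)) / N_PEOPLE
type_change_rate = sum(to_type(m) != to_type(f)
                       for m, f in zip(monday, friday)) / N_PEOPLE

print(f"continuous test-retest r:       {retest_r:.2f}")
print(f"one letter flips:               {letter_flip_rate:.0%}")
print(f"four-letter type changes:       {type_change_rate:.0%}")
```

The same underlying scores that look stable when reported continuously produce frequent type changes once they are cut at the midpoint, because the flips come almost entirely from people who sit close to a cutoff. The exact rates depend on the assumed noise level and should not be read as empirical figures.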

This is why the MBTI works as entertainment but fails as science. The framework violates basic measurement principles, and the violations cascade into both validity and reliability problems.

The Bottom Line

If you take only one thing from this post, take this: be careful with reliability and validity, because without them, all your effort is wasted.

You can run the most elegant study, recruit the best sample, perform the most sophisticated statistical analysis, but if your measurement instrument is unreliable or invalid, none of it means anything. The conclusions are not about what you think they’re about. They’re not even about what you’re measuring. They’re noise dressed up as signal.

Reliability and validity are not boxes to check at the end of a research project. They are the foundation. Build them in from the start. Document them. Question them. Re-examine them when you adapt instruments to new populations or languages. They are the difference between psychology as science and psychology as performance.

—
Giorgi Tchumburidze
May 2026
