Item Writing - The Underrated Craft of Test Development

Every test, every scale, every measurement instrument in psychology and education starts with the same humble unit: the item. A single question, a single statement, a single decision point. Get the items right, and your measurement has a chance. Get them wrong, and no amount of sophisticated statistics will save you.

Yet item writing is rarely treated as a serious craft. In most psychology programs and teacher training courses, it’s covered briefly, maybe in a single lecture, and then assumed to be common sense. The result is that the world is full of badly written items: in classroom exams, in driving tests, in personality questionnaires, in standardized assessments. Items that look fine on the surface and quietly fail to measure what they claim to measure.

This post is about doing it better. Most of what I cover applies whether you’re a high-school teacher writing a midterm, a university lecturer designing a final exam, or a researcher developing a new scale. The principles are the same; only the scale changes.

What Bad Items Look Like

Let me start with examples, because they make the problem obvious.

I sometimes use the Georgian driving test as a reference point because it’s full of badly written items that anyone can recognize. One common pattern: a multiple-choice item with four options, where the fourth option is “all of the above.” If a test-taker knows that two of the first three options are correct, they don’t actually need to know about the third one; they can confidently choose “all of the above.” The item has just told them the answer through its structure. They didn’t demonstrate knowledge; they demonstrated test-taking skill.

Another common pattern: distractors that are obviously irrelevant. Imagine an item asking “Which of the following is a famous historic figure?” with three real historical figures and one option that’s a building, or a country, or some other object. The distractor is so categorically wrong that anyone can rule it out instantly, even without knowing which of the other three is correct. The item has effectively become a three-option question, not a four-option one.

These aren’t subtle problems. They’re visible from a single read-through. And yet they’re everywhere, because nobody trained the test writer to look for them.

Start With What You’re Measuring

The single most important thing item writers skip is also the most basic: deciding what they’re actually trying to measure.

Tests measure latent variables: abilities, knowledge, traits, attitudes that can’t be observed directly. Before writing a single item, you need to define this latent variable clearly. What is the construct? What does it consist of? What sub-domains or components does it include?

This is where a test blueprint comes in. A blueprint is a structured plan that lays out the content domains the test will cover, the relative weight of each domain, and the cognitive processes you want to assess. For an exam on educational psychology, your blueprint might allocate 20% to learning theories, 15% to motivation, 25% to assessment, and so on. Within each domain, you decide how many items will test pure knowledge, how many will test understanding, and how many will test application.
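To make the weighting step concrete, here is a minimal sketch of turning blueprint percentages into whole item counts. The first three weights come from the example above; the "other domains" entry is a hypothetical placeholder for the domains left unspecified, and the function name `allocate_items` and the largest-remainder rounding are my own assumptions, not a standard procedure.

```python
# Blueprint weights in whole percent. The first three come from the example
# in the text; "other domains" is a hypothetical placeholder for the rest.
BLUEPRINT = {
    "learning theories": 20,
    "motivation": 15,
    "assessment": 25,
    "other domains (placeholder)": 40,
}


def allocate_items(percent_weights, total_items):
    """Convert percentage weights into item counts that sum to total_items.

    Uses largest-remainder rounding: give every domain the whole part of its
    quota, then hand leftover items to the domains with the largest remainders.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if sum(percent_weights.values()) != 100:
        raise ValueError("blueprint weights must sum to 100 percent")

    counts, remainders = {}, {}
    for domain, pct in percent_weights.items():
        counts[domain], remainders[domain] = divmod(pct * total_items, 100)

    leftover = total_items - sum(counts.values())
    neediest = sorted(remainders, key=remainders.get, reverse=True)
    for domain in neediest[:leftover]:
        counts[domain] += 1
    return counts


if __name__ == "__main__":
    for domain, n in allocate_items(BLUEPRINT, total_items=40).items():
        print(f"{domain}: {n} items")
```

For a 40-item exam this gives 8, 6, 10, and 16 items respectively. The same idea can be repeated inside each domain to split items across knowledge, understanding, and application.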

Only after this blueprint is in place do you start writing items. And when you write an item, you know exactly what it’s supposed to do: which content area it represents, what cognitive process it’s targeting, what it contributes to the overall measurement.

Most item creators skip this step entirely. They sit down with a topic in mind, start writing questions, and figure out the structure afterward, if at all. The lack of a blueprint is visible in the final test: uneven coverage, redundant items, missing topics, an unclear sense of what’s actually being measured.
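One way to catch these symptoms early is to tag every draft item with its content area and intended cognitive process, then compare the draft against the plan. The sketch below is only an illustration: the `Item` class, the planned counts, and the example stems are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    stem: str
    domain: str   # content area from the blueprint
    process: str  # intended cognitive process


# Hypothetical plan: how many items each (domain, process) cell should get.
PLANNED = {
    ("motivation", "remember"): 1,
    ("motivation", "apply"): 2,
    ("assessment", "understand"): 2,
}

# Hypothetical draft items, tagged as they are written.
DRAFT = [
    Item("Define intrinsic motivation.", "motivation", "remember"),
    Item("Name two types of extrinsic reward.", "motivation", "remember"),
    Item("Explain why a test needs a blueprint.", "assessment", "understand"),
]


def coverage_gaps(planned, items):
    """Return {cell: written - planned} for every cell that is off plan.

    Positive values mean redundant items; negative values mean missing ones.
    """
    written = Counter((item.domain, item.process) for item in items)
    cells = set(planned) | set(written)
    return {
        cell: written[cell] - planned.get(cell, 0)
        for cell in cells
        if written[cell] != planned.get(cell, 0)
    }


if __name__ == "__main__":
    for (domain, process), diff in sorted(coverage_gaps(PLANNED, DRAFT).items()):
        print(f"{domain} / {process}: {diff:+d}")
```

Here the draft has one recall item too many on motivation, no application items at all, and one assessment item still to write.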

Bloom’s Taxonomy in Practice

When I talk about cognitive processes, I’m largely thinking in terms of Bloom’s taxonomy: the classic framework that distinguishes between remembering, understanding, applying, analyzing, evaluating, and creating. Other taxonomies exist, but Bloom’s is the most widely taught and, in my view, the most practical.

I use Bloom’s taxonomy actively when writing items. Not because it’s a textbook requirement, but because it pushes you toward writing items that test more than just recall. If everything in your test sits at the “remembering” level – list the five stages, define this term, name the author of this theory – then you’re measuring memory, not understanding or competence.

The reason this matters is practical. If students can’t transfer what they learned into real-life situations, what they learned is mostly useless. A test that only measures recall doesn’t tell you whether your students can actually do anything with the knowledge. It just tells you whether they can repeat it back.

Items that target higher Bloom levels are harder to write, but they reveal more. An “apply” item asks students to use a concept in a new situation. An “analyze” item asks them to break apart a problem and identify its components. These items take more thought to write, but the information you get from them is dramatically more useful.

Can Multiple-Choice Measure Higher-Order Thinking?

One of the recurring debates in assessment is whether multiple-choice items can really test higher-order cognitive processes, or whether they’re inherently limited to recall.

My view: MC items can absolutely assess higher-order thinking, but it’s genuinely hard to write items that do so. Most multiple-choice items, in practice, sit at the lower levels of Bloom’s taxonomy: remembering, understanding, sometimes applying. Writing an MC item that requires real analysis or evaluation is challenging because you have to construct a scenario complex enough to require those processes, while still having a single defensible answer.

The first three levels of Bloom’s – remembering, understanding, applying – are where multiple choice shines. You can write excellent MC items here. The next two levels – analyzing, evaluating – are achievable with effort and skill. The top level, creating, is where MC really breaks down. You can’t ask a student to create something genuinely new and grade it with four pre-written options. For that, you need open-ended formats.

This isn’t an argument against multiple choice. It’s an argument for matching item format to what you’re trying to measure. Use MC where it fits. Use other formats where MC can’t reach.

Designing Good Distractors

If the stem of an item is the question, the distractors are the wrong answer choices. And good distractors are surprisingly hard to write.

A good distractor does two things at once. First, it doesn’t give itself away: it doesn’t stick out as obviously wrong, isn’t categorically different from the other options, isn’t trivially eliminable. Second, it’s a plausible mistake that someone with partial knowledge might genuinely make. The distractor checks the student’s knowledge precisely. If they truly understand the concept, they can reject the distractor. If their understanding is shallow or confused, the distractor will tempt them.

Distractor design also depends on who the test is for. If you’re writing for medium-level students, your distractors need to differentiate them from the genuinely weak students. If you’re writing for strong students, your distractors need to differentiate the truly excellent from the merely competent. The same item with the same correct answer can have very different distractors depending on the target population.

Pilot Testing: Where Psychometrics Comes In

Even the most carefully written item can fail in unexpected ways. Maybe the item turns out to be much easier than you thought. Maybe one of your “good” distractors is never chosen, because it’s actually obviously wrong to your students. Maybe an item that seemed great is too hard because of vocabulary you didn’t think about.

The only way to find these problems is to pilot the items before using them in a real assessment. Pilot testing, administering the items to a representative sample and analyzing the results, is the bridge between item writing as a craft and item writing as a science.

If you don’t pilot test, you’re hoping that your intuitions about each item are correct. Some of them probably are. Some of them definitely aren’t. And if too many of your items don’t work the way you expected, your overall test ends up measuring something other than what you intended.

What to Look At in Pilot Data

When you analyze piloted items, classical test theory gives you a rich set of statistics to examine. Item difficulty (the p-value, which here means proportion correct, not statistical significance) tells you what proportion of students answered correctly. Item discrimination (often computed as the item-total correlation, RIT, or its corrected version, the item-rest correlation, RIR) tells you whether the item distinguishes between stronger and weaker students. Distractor analysis tells you how each wrong option is performing: whether each one is being chosen by some students, and whether strong students are avoiding it while weaker students fall for it.
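For readers who want to see what these three statistics involve, here is a minimal sketch in plain Python. The answer key and responses are invented pilot data, the function names are my own, and the upper/lower split at the median is a simplification (other cut points are also used in practice). It illustrates the calculations and is not how any particular tool implements them.

```python
from collections import Counter
from math import sqrt

KEY = ["B", "D", "A"]  # hypothetical answer key for three items
# Hypothetical pilot responses: one row per student, one column per item.
RESPONSES = [
    ["B", "D", "A"],
    ["B", "D", "C"],
    ["B", "A", "A"],
    ["B", "D", "A"],
    ["C", "A", "A"],
    ["A", "C", "B"],
]

# Score every response 1 (correct) or 0 (incorrect) against the key.
SCORED = [[int(choice == k) for choice, k in zip(row, KEY)] for row in RESPONSES]
TOTALS = [sum(row) for row in SCORED]


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_difficulty(item):
    """Proportion of students who answered the item correctly (p)."""
    return sum(row[item] for row in SCORED) / len(SCORED)


def item_rest_correlation(item):
    """Correlate the item score with the total score of all OTHER items."""
    item_scores = [row[item] for row in SCORED]
    rest_scores = [total - s for total, s in zip(TOTALS, item_scores)]
    return pearson(item_scores, rest_scores)


def distractor_table(item):
    """Count chosen options separately for the upper and lower scoring halves."""
    ranked = sorted(range(len(RESPONSES)), key=lambda i: TOTALS[i], reverse=True)
    half = len(ranked) // 2
    return {
        "upper": Counter(RESPONSES[i][item] for i in ranked[:half]),
        "lower": Counter(RESPONSES[i][item] for i in ranked[-half:]),
    }


if __name__ == "__main__":
    for item in range(len(KEY)):
        print(item, item_difficulty(item), item_rest_correlation(item))
        print("   ", distractor_table(item))
```

In this toy data the second item has p = 0.5 and a positive item-rest correlation, and its distractors A and C are chosen only by the lower-scoring half, which is the pattern you hope to see.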

Every item should be examined on all of these dimensions. An item that everyone gets right (very high p) doesn’t discriminate. An item that everyone gets wrong (very low p) is either too hard or simply broken. An item with a negative discrimination, where strong students get it wrong and weak students get it right, is almost certainly broken in some way.
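These checks are easy to turn into a screening rule. The sketch below assumes you already have a difficulty and a discrimination value for each item; the cutoffs of 0.95 and 0.05 are illustrative assumptions rather than established thresholds, and the item names and statistics are made up.

```python
def flag_item(p, discrimination, easy_cutoff=0.95, hard_cutoff=0.05):
    """Return a list of reasons an item deserves a second look.

    The cutoffs are illustrative assumptions; choose values that suit the
    purpose and population of your own test.
    """
    flags = []
    if p >= easy_cutoff:
        flags.append("almost everyone correct: cannot discriminate")
    elif p <= hard_cutoff:
        flags.append("almost everyone wrong: too hard or broken")
    if discrimination < 0:
        flags.append("negative discrimination: probably broken")
    return flags


# Hypothetical pilot statistics: item name -> (difficulty p, discrimination).
PILOT_STATS = {
    "item_01": (0.98, 0.05),
    "item_02": (0.55, 0.41),
    "item_03": (0.03, 0.02),
    "item_04": (0.47, -0.22),
}

# Keep only the items that raised at least one flag.
REVIEW_LIST = {}
for name, (p, disc) in PILOT_STATS.items():
    reasons = flag_item(p, disc)
    if reasons:
        REVIEW_LIST[name] = reasons

if __name__ == "__main__":
    for name, reasons in REVIEW_LIST.items():
        print(name, "->", "; ".join(reasons))
```

A flag is a prompt to reread the item, not a verdict. The point is to decide which items get revised before the real administration.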

This kind of analysis is exactly what PsychoMetrika is designed for. Upload your data, and the app gives you all of these statistics, plus reliability analyses, factor analyses, and IRT models if you want to go deeper. The technical work is no longer the barrier; what matters is whether the item writer engages with the results and revises accordingly.

The One Thing

If you’re a teacher writing tests for your students, and you don’t have time to study psychometrics in depth, here’s the one principle that matters most:

Before writing each item, ask yourself what cognitive process you want the student to use to answer it.

Are they retrieving a fact from memory? Connecting two ideas? Applying a principle to a new situation? Analyzing a problem? Make this explicit. Once you’ve decided what cognitive process the item should engage, write the item to require exactly that process, no more, no less.

This single habit will improve your items more than any other technique. It forces you to think about measurement before you think about content. It pushes you away from the lazy default of pure recall. And it makes the rest of the process – choosing item format, writing distractors, evaluating pilot data – much clearer, because you know what each item is supposed to be doing.

Scale Development Works the Same Way

Everything I’ve said applies just as much to research scale development as it does to classroom testing. If you’re building a new scale to measure anxiety, motivation, attitudes toward technology, whatever it is, the principle is identical.

You start with the blueprint. What is this construct? What are its building elements, its sub-domains, its facets? Then you write items to measure each of those elements. Each item targets a specific facet, contributes to the overall measurement, and connects back to the theoretical definition.

This is how validated scales are built. It’s also why scales that skip this step often fail to validate: they were written by content, not by construct. The items might all be about anxiety in some loose sense, but they don’t systematically cover the construct, and the resulting scale measures something muddier than its name suggests.

The lesson is consistent at every level of measurement: write items deliberately, write them against a clear blueprint, and verify them with data. The craft is the same whether you’re writing four MC items for tomorrow’s quiz or sixty Likert items for next year’s published scale.

—
Giorgi Tchumburidze
June, 2026
