Research methodology
Our approach to qualitative analysis transforms open-ended interview responses into structured, reliable findings. Every code assignment is evidence-based, every theme is validated, and every conclusion is defensible.
Interview transcripts are rich with insight, but without systematic analysis, findings become anecdotal. Two analysts reading the same transcripts can reach different conclusions. Themes can be too broad (losing nuance) or too narrow (missing patterns). And there is no way to demonstrate to stakeholders that the results are reliable.
Systematic coding solves these problems. It provides a structured, repeatable process for transforming open-ended responses into quantified findings, with built-in quality checks that ensure accuracy and consistency.
Not all thematic analysis is created equal. Braun and Clarke (2021, 2022) identify three distinct variants, each with different strengths and trade-offs. We use Codebook Thematic Analysis because it combines the rigor clients expect with the flexibility that real interview data demands.
In applied market research, we rarely know every theme before reading the data. Reflexive TA gives us no way to prove our coding is reliable. Coding Reliability TA locks us into a fixed framework that cannot adapt. Codebook TA gives us the best of both: a structured codebook that evolves iteratively as we discover what the data contains, with reliability measurement built in.
Different questions produce different kinds of data. Each type requires a distinct coding approach, matched to the structure of the response.
- Open responses mapped to an ordinal scale with predefined buckets.
- Single-dimension classification into a small set of distinct categories.
- Complex, open-ended responses broken into individual statements and grouped into validated themes.
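As a rough illustration of how these three response types differ structurally, the sketch below shows one plausible output shape per type. The class names, fields, and example values are assumptions for illustration, not a production schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScaledResponse:
    """Open response mapped onto a predefined ordinal bucket."""
    verbatim: str
    bucket: str    # e.g. "Very satisfied" on a 5-point scale

@dataclass
class CategorizedResponse:
    """Single-dimension classification into one of a small set of categories."""
    verbatim: str
    category: str  # e.g. "Price", "Quality", "Convenience"

@dataclass
class ThematicResponse:
    """Complex response broken into statements, each grouped into a validated theme."""
    verbatim: str
    themes: List[str] = field(default_factory=list)
```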
Our thematic coding follows a four-step process grounded in established qualitative research methods. Each step has specific rules and quality checks.
Each response is broken into discrete meaning units: the smallest segment of text that contains a single idea or claim (Graneheim & Lundman, 2004). A participant who says three different things gets three separate meaning units, each coded independently.
Each meaning unit receives a descriptive code: a short label (2-5 words) capturing what the statement is about (Saldaña, 2016). Codes use the participant's own language where it is distinctive ("in vivo coding") and standardized labels where consistency matters.
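A minimal sketch of segmentation and descriptive coding for a single response, under the assumption that each meaning unit is stored with its own code. The participant text, labels, and field names here are invented for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MeaningUnit:
    participant_id: str
    text: str       # smallest segment of text carrying a single idea or claim
    code: str       # short descriptive label (2-5 words)
    in_vivo: bool   # True when the label reuses the participant's own distinctive wording

# A participant who says three different things yields three meaning units,
# each coded independently.
response = ("I go there because it's close to my office, the staff are friendly, "
            "and the app remembers my usual order.")
units: List[MeaningUnit] = [
    MeaningUnit("P07", "it's close to my office", code="close to office", in_vivo=True),
    MeaningUnit("P07", "the staff are friendly", code="friendly staff", in_vivo=True),
    MeaningUnit("P07", "the app remembers my usual order", code="app personalization", in_vivo=False),
]
```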
Related codes are grouped into broader themes, each organized around a single concept, following a two-pass approach (Deterding & Waters, 2021).
Candidate themes are tested against specific rules to ensure they are coherent, distinct, and analytically useful. Themes may be split, merged, or reorganized based on these checks.
Themes are not arbitrary groupings. Each must pass specific validation criteria before it enters the final analysis.
A theme must be mentioned by at least 5% of participants to stand on its own. Themes below this threshold are merged with related themes or moved to "Other." This prevents findings from being driven by isolated comments.
If the quotes within a theme cluster into two or more distinct ideas, the theme is too broad. A theme about "convenience" that contains both "close to my office" and "fast service" captures two different concepts and should be split for actionable analysis.
When 70% or more of participants who mention Theme A also mention Theme B, the themes likely represent the same underlying concept. They are merged into a single theme to avoid double-counting and simplify the analysis.
Fewer than 4 themes for an open-ended question usually means important distinctions are being lost. More than 12 usually means themes are not abstracted enough. Sub-themes preserve nuance within the 6-10 target range.
If the "Other" category exceeds 15% of responses, a meaningful pattern is being missed. The uncategorized responses are reviewed to identify hidden themes that should be added to the codebook.
Broad themes work for executive summaries. Sub-themes provide the detail needed for actionable recommendations. A theme like "Value for money" (21%) might contain sub-themes for "Low absolute prices" (14%), "Deals and promotions" (8%), and "Portion value" (5%).
Every coded dataset passes through a two-round, five-agent process that produces measurable evidence of coding accuracy. Five independent AI agents code, cross-check, and validate every finding. No single agent's judgment is trusted in isolation.
A single AI coder, no matter how accurate, provides no way to measure reliability. The same data could be coded differently by a different system, and there would be no way to know which is correct. Multiple independent agents solve this by replicating the gold-standard practice of inter-rater reliability from human qualitative research (Cohen, 1960; Krippendorff, 2004), but without the time, cost, and fatigue limitations of human coders.
Three agents would suffice for basic reliability measurement. We use five because the second round of validation catches residual errors in the first round's resolution, improving accuracy from approximately 85-88% to 90-93%. For consulting engagements where findings drive significant business decisions, that incremental precision matters.
Independence between agents is not automatic. Two identical AI systems given identical inputs will produce identical outputs, proving nothing. We design genuine independence into each agent using four levers: model architecture, temperature, persona framing, and codebook emphasis.
| Agent | Model | Temperature | Persona | Codebook emphasis | Role |
|---|---|---|---|---|---|
| 1 | Opus | 0 | Thorough, inclusive | Inclusion criteria first | Primary Coder A |
| 2 | Sonnet | 0.2 | Conservative, precise | Exclusion criteria first | Primary Coder B |
| 3 | Sonnet | 0 | Neutral arbiter | Balanced | Round 1 resolver |
| 4 | Sonnet | 0 | Balanced, fresh perspective | Balanced | Independent re-coder |
| 5 | Opus | 0 | Senior quality reviewer | Balanced | Final validator |
Agents 1 and 5 use Opus (deeper reasoning). Agents 2, 3, and 4 use Sonnet (different architecture). Different model weights produce genuinely different coding judgments on ambiguous cases.
Agent 1 leans toward including borderline cases. Agent 2 leans toward excluding them. This creates productive tension: where both agree despite opposite biases, confidence is very high. Where they disagree, it surfaces genuine ambiguity.
Agent 2 operates at temperature 0.2, introducing slight randomness on borderline decisions. This mirrors the natural variation between human coders without degrading accuracy on clear-cut cases.
Agent 1 sees inclusion criteria first for each code. Agent 2 sees exclusion criteria first. This creates different cognitive anchoring without changing the actual rules.
Agent 4 processes transcripts in reverse order. Earlier transcripts subtly influence how coders interpret later ones. Reversing the order for one coder prevents both rounds from sharing the same order bias.
Agent 4 has no knowledge of round 1 results. It codes the full dataset from scratch, providing a completely uncontaminated second opinion.
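The table and the four levers above can be captured in a small configuration object. The sketch below mirrors the table's values; the class, field names, and any orchestration around it are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    role: str
    model: str              # "opus" or "sonnet": different weights yield different judgments
    temperature: float      # 0 for determinism; 0.2 adds slight variation on borderline calls
    persona: str            # framing that leans inclusive, conservative, or neutral
    codebook_emphasis: str  # whether inclusion or exclusion criteria are presented first
    reverse_order: bool     # only Agent 4 reads transcripts last-to-first
    blind_to_round1: bool   # only Agent 4 must stay unaware of round 1 results

AGENTS = [
    AgentConfig("Primary Coder A",      "opus",   0.0, "thorough, inclusive",         "inclusion first", False, False),
    AgentConfig("Primary Coder B",      "sonnet", 0.2, "conservative, precise",       "exclusion first", False, False),
    AgentConfig("Round 1 resolver",     "sonnet", 0.0, "neutral arbiter",             "balanced",        False, False),
    AgentConfig("Independent re-coder", "sonnet", 0.0, "balanced, fresh perspective", "balanced",        True,  True),
    AgentConfig("Final validator",      "opus",   0.0, "senior quality reviewer",     "balanced",        False, False),
]
```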
1. Round 1, independent coding: Agents 1 and 2 code the full dataset in parallel. Neither sees the other's work. Both produce codes with written reasoning for every assignment.
2. Round 1, reliability measurement: agreement between Agent 1 and Agent 2 is measured, corrected for chance. Every disagreement is identified along with both agents' reasoning.
3. Round 1, resolution: Agent 3 reviews both agents' reasoning against the codebook definition, picks the correct code, and flags ambiguous definitions for human review.
4. Round 2, independent re-coding: Agent 4 has no knowledge of round 1, a different model persona, and reverse transcript order. It codes the full dataset from scratch, providing a completely fresh perspective.
5. Round 2, comparison: segments where the two rounds agree are auto-finalized (double-confirmed). Disagreements proceed to Agent 5.
6. Final validation: Agent 5, the most capable model, reviews the hardest cases with both rounds' reasoning, the codebook, and the original text. Its decision is final.
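Put together, the decision flow for a single segment reduces to a few comparisons. The sketch below assumes each coder returns a code label and that the arbiter and validator are callables; all names are illustrative.

```python
from typing import Callable

Resolver = Callable[[str, str], str]

def finalize_segment(coder_a: str, coder_b: str, arbiter: Resolver,
                     recoder: str, validator: Resolver) -> str:
    """Resolve one segment through the two-round, five-agent flow."""
    # Round 1: two independent codes; the arbiter resolves any disagreement.
    round1 = coder_a if coder_a == coder_b else arbiter(coder_a, coder_b)

    # Round 2: an independent re-code; agreement double-confirms the segment.
    if recoder == round1:
        return round1  # auto-finalized

    # Residual disagreement: the final validator's decision stands.
    return validator(round1, recoder)
```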
The codebook is the single most important factor in coding quality. Research shows that codebooks with full definitions, inclusion/exclusion criteria, and example quotes improve coding accuracy by 15-25 percentage points compared to code labels alone (Pangakis, Wolken, & Fasching, 2023). Each theme entry includes five components: a code label, a full definition, inclusion criteria, exclusion criteria, and example quotes.
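A minimal sketch of what one codebook entry might look like with those five components; the theme, criteria, and quote below are invented for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CodebookEntry:
    label: str                 # short code label
    definition: str            # full definition of what the theme covers
    include_when: List[str]    # inclusion criteria
    exclude_when: List[str]    # exclusion criteria
    example_quotes: List[str]  # verbatim quotes that anchor the definition

speed_of_service = CodebookEntry(
    label="Speed of service",
    definition="Statements about how quickly the participant is served or can complete a visit.",
    include_when=["Mentions of waiting time, queue length, or fast checkout"],
    exclude_when=["Proximity or travel time (belongs under a location theme instead)"],
    example_quotes=["I'm in and out in five minutes."],
)
```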
Inter-rater reliability measures whether independent coders assign the same codes to the same data. We use Cohen's Kappa (κ), which corrects for chance agreement (Cohen, 1960). Kappa is calculated per code, because some codes are inherently harder to apply consistently than others.
Interpretation scale: Landis & Koch (1977). Minimum acceptable threshold follows Krippendorff's (2004) recommendation of α ≥ 0.667 for applied research.
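For reference, a minimal sketch of the per-code kappa calculation between the two primary coders, treating each code as a present/absent decision on every segment; the data shape and helper names are assumptions.

```python
from typing import Dict, List

def cohens_kappa(coder_a: List[bool], coder_b: List[bool]) -> float:
    """Cohen's kappa for one code, treated as a binary present/absent decision per segment."""
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n   # observed agreement
    p_a, p_b = sum(coder_a) / n, sum(coder_b) / n             # each coder's "present" rate
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)                   # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # degenerate case: both coders unanimous, so chance agreement is total
    return (p_o - p_e) / (1 - p_e)

def kappa_per_code(assignments: Dict[str, Dict[str, List[bool]]]) -> Dict[str, float]:
    """Kappa is reported per code, since some codes are harder to apply consistently."""
    return {code: cohens_kappa(pair["agent1"], pair["agent2"])
            for code, pair in assignments.items()}
```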
Every code assignment includes written reasoning from each agent that evaluated it. This creates a chain of evidence from the participant's words to the final theme, making every finding traceable and defensible.
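One way to preserve that chain of evidence is to keep every agent's judgment alongside the final code, as in the sketch below; the record structure and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AgentJudgment:
    agent: str      # e.g. "Primary Coder A"
    code: str       # code this agent assigned
    reasoning: str  # written justification against the codebook definition

@dataclass
class CodedSegment:
    participant_id: str
    verbatim: str                   # the participant's own words
    final_code: str                 # outcome after both rounds of review
    judgments: List[AgentJudgment]  # one entry per agent that evaluated the segment
```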
Our approach is grounded in established qualitative research methods, each backed by decades of peer-reviewed evidence.