Research methodology
Our approach to qualitative analysis transforms open-ended interview responses into structured, reliable findings. Every code assignment is evidence-based, every theme is validated, and every conclusion is defensible.
Interview transcripts are rich with insight, but without systematic analysis, findings become anecdotal. Two analysts reading the same transcripts can reach different conclusions. Themes can be too broad (losing nuance) or too narrow (missing patterns). And there is no way to demonstrate to stakeholders that the results are reliable.
Systematic coding solves these problems. It provides a structured, repeatable process for transforming open-ended responses into quantified findings, with built-in quality checks that ensure accuracy and consistency.
Not all thematic analysis is created equal. Braun and Clarke (2021, 2022) identify three distinct variants, each with different strengths and trade-offs. We use Codebook Thematic Analysis because it combines the rigor clients expect with the flexibility that real interview data demands.
In applied market research, we rarely know every theme before reading the data. Reflexive TA gives us no way to prove our coding is reliable. Coding Reliability TA locks us into a fixed framework that cannot adapt. Codebook TA gives us the best of both: a structured codebook that evolves iteratively as we discover what the data contains, with reliability measurement built in.
Different questions produce different kinds of data. Each type requires a distinct coding approach, matched to the structure of the response.
Rank-order: Open responses mapped to an ordinal scale with predefined buckets.
Categorical: Single-dimension classification into a small set of distinct categories.
Thematic: Complex, open-ended responses broken into individual statements and grouped into validated themes.
Binary: Yes/no determination from open-ended responses. The most common variable type in market research coding.
The qualitative coding and survey methodology literature (Saldana, 2016; Krippendorff, 2004; Hsieh & Shannon, 2005; Miles, Huberman & Saldana, 2014) recognizes nine distinct variable types that researchers extract from open-ended interview responses. Our four native coding types cover all nine, either directly or through configuration.
| Variable type | What it is | Example | How we handle it |
|---|---|---|---|
| Thematic | Multiple themes per response | "Why did you switch?" coded as Poor usability + Cost concerns | Native type |
| Categorical | One label from 3+ unordered options | Satisfaction coded as Satisfied / Mixed / Dissatisfied | Native type |
| Rank-order | One bucket from an ordered scale | Company size coded as 101-500 employees | Native type |
| Binary | Yes/no, present/absent | "Did they evaluate competitors?" coded as Yes | Native type |
| Sentiment / Valence | Positive / neutral / negative attitude | Tone toward vendor support coded as Negative | Categorical with Positive/Mixed/Negative options |
| Frequency / Intensity | How often or how strongly | Dashboard usage coded as Daily / Weekly / Monthly | Rank-order with frequency buckets |
| Temporal | When something happened | Started evaluating coded as Q3 2025 | Categorical or Rank-order |
| Numeric / Continuous | Extract an exact number | Annual budget coded as $250,000 | Rare. Use rank-order buckets instead. Verbal precision rarely warrants exact extraction. |
| Multi-code Ordered | Multiple ranked codes | Top priorities coded as Cost (1st), Speed (2nd) | Rare. Use separate rank-order variables per item, or thematic without ranking. |
Our thematic coding follows a four-step process grounded in established qualitative research methods. Each step has specific rules and quality checks.
Each response is broken into discrete meaning units: the smallest segment of text that contains a single idea or claim (Graneheim & Lundman, 2004). A participant who says three different things gets three separate meaning units, each coded independently.
Each meaning unit receives a descriptive code: a short label (2-5 words) capturing what the statement is about (Saldana, 2016). Codes use the participant's own language where it is distinctive ("in vivo coding") and standardized labels where consistency matters.
Related codes are grouped into broader themes, each organized around a single concept. We use a two-pass approach (Deterding & Waters, 2021):
Candidate themes are tested against specific rules to ensure they are coherent, distinct, and analytically useful. Themes may be split, merged, or reorganized based on these checks.
Themes are not arbitrary groupings. Each must pass specific validation criteria before it enters the final analysis.
A theme must be mentioned by at least 5% of participants to stand on its own. Themes below this threshold are merged with related themes or moved to "Other." This prevents findings from being driven by isolated comments.
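The threshold rule above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own code: the function name `split_by_prevalence` and the input shape (participant ID mapped to the set of themes that participant mentioned) are assumptions.

```python
from collections import Counter

MIN_THEME_FREQUENCY = 0.05  # 5% of participants, per the rule above


def split_by_prevalence(themes_by_participant, threshold=MIN_THEME_FREQUENCY):
    """Return (standalone, below_threshold), each mapping theme -> share of participants.

    `themes_by_participant` is a hypothetical input: {participant_id: set of theme names}.
    """
    n = len(themes_by_participant)
    if n == 0:
        return {}, {}
    counts = Counter()
    for themes in themes_by_participant.values():
        counts.update(set(themes))  # each participant counts once per theme
    standalone, below = {}, {}
    for theme, count in counts.items():
        share = count / n
        if share >= threshold:
            standalone[theme] = share
        else:
            below[theme] = share  # candidate for merging or moving to "Other"
    return standalone, below
```

Themes returned in the second dictionary are the ones a reviewer would merge into a related theme or move to "Other."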
If the quotes within a theme cluster into two or more distinct ideas, the theme is too broad. A theme about "convenience" that contains both "close to my office" and "fast service" captures two different concepts and should be split for actionable analysis.
When 70% or more of participants who mention Theme A also mention Theme B, the themes likely represent the same underlying concept. They are merged into a single theme to avoid double-counting and simplify the analysis.
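As a rough sketch of the overlap check, the function below computes, for every ordered pair of themes, the share of participants mentioning the first theme who also mention the second. The name `find_merge_candidates`, the constant `OVERLAP_THRESHOLD`, and the input shape are hypothetical and not taken from the pipeline.

```python
from itertools import permutations

OVERLAP_THRESHOLD = 0.70  # hypothetical constant name for the 70% rule above


def find_merge_candidates(themes_by_participant, threshold=OVERLAP_THRESHOLD):
    """Return (theme_a, theme_b, share) tuples where at least `threshold`
    of the participants who mention theme_a also mention theme_b."""
    mentioners = {}
    for participant_id, themes in themes_by_participant.items():
        for theme in themes:
            mentioners.setdefault(theme, set()).add(participant_id)
    candidates = []
    # The rule is directional, so check both (A, B) and (B, A).
    for a, b in permutations(sorted(mentioners), 2):
        share = len(mentioners[a] & mentioners[b]) / len(mentioners[a])
        if share >= threshold:
            candidates.append((a, b, share))
    return candidates
```

Because the rule is directional, a small theme can be flagged for merging into a large one even when the reverse overlap is low.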
Fewer than 4 themes for an open-ended question usually means important distinctions are being lost. More than 12 usually means themes are not abstracted enough. Sub-themes preserve nuance within the 6-10 target range.
If the "Other" category exceeds 15% of responses, a meaningful pattern is being missed. The uncategorized responses are reviewed to identify hidden themes that should be added to the codebook.
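The theme-count and "Other" rules lend themselves to a simple automated check. The sketch below assumes theme shares are already computed; the function name, the warning labels, and the input shape are illustrative assumptions.

```python
def check_codebook_shape(theme_shares, other_label="Other",
                         min_themes=4, max_themes=12, max_other=0.15):
    """Return a list of warning labels for one open-ended question.

    `theme_shares` is a hypothetical input: {theme name: share of responses, 0-1}.
    """
    warnings = []
    n_themes = sum(1 for name in theme_shares if name != other_label)
    if n_themes < min_themes:
        warnings.append("too_few_themes")    # distinctions are probably being lost
    elif n_themes > max_themes:
        warnings.append("too_many_themes")   # themes are probably not abstract enough
    if theme_shares.get(other_label, 0.0) > max_other:
        warnings.append("other_too_large")   # review uncategorized responses
    return warnings
```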
Broad themes work for executive summaries. Sub-themes provide the detail needed for actionable recommendations. A theme like "Value for money" (21%) might contain sub-themes for "Low absolute prices" (14%), "Deals and promotions" (8%), and "Portion value" (5%).
Every study passes through a two-phase, seven-agent process. In Phase 1, four discovery agents read all transcripts and build the codebook inductively. In Phase 2, three application agents apply that codebook with built-in reliability measurement. No single agent's judgment is trusted in isolation, and a human review gate separates the two phases.
A single AI coder, no matter how accurate, provides no way to measure reliability. The same data could be coded differently by a different system, and there would be no way to know which is correct. Multiple independent agents solve this by replicating the gold-standard practice of inter-rater reliability from human qualitative research (Cohen, 1960; Krippendorff, 2004), but without the time, cost, and fatigue limitations of human coders.
Five agents use Claude Sonnet for high-volume extraction and coding. The two quality-gate agents, the codebook validator (D4) and the disagreement resolver (Agent 3), use Claude Opus. On expert reasoning benchmarks, Opus scores 17 points higher than Sonnet (GPQA Diamond: 91.3% vs. 74.1%). These two agents make the highest-leverage decisions in the pipeline: a flawed codebook corrupts every downstream code, and a bad resolution corrupts the final dataset. Opus costs 1.67x more per token, adding roughly $1-2 per 100 interviews.
The pipeline supports two modes. If you already have a codebook (from a previous study or written by hand), you can run application only. For new studies, the full pipeline discovers the codebook first, then applies it.
The two-phase approach builds accuracy at every step. Most interview responses are clear-cut: the participant's words either match a codebook definition or they do not. The application coders agree on roughly 80% of segments, and those agreements are almost always correct.
This estimate is based on three factors:
This means that for every 100 code assignments the pipeline makes, approximately 93 will match what an expert human coder would assign. The remaining cases are borderline situations where the codebook definition is genuinely ambiguous, the participant's language is unclear, or a code partially applies. These edge cases are flagged in the output with reasoning chains, so they can be reviewed if needed.
Independence between agents is not automatic. Two identical AI systems given identical inputs will produce identical outputs, proving nothing. We design genuine independence into each agent using different temperatures, persona framing, and task-specific roles.
| Agent | Model | Temp | Persona | Role |
|---|---|---|---|---|
| D1 | Sonnet | 0.2 | Thorough, nuanced | Extractor A |
| D2 | Sonnet | 0 | Precise, conservative | Extractor B |
| D3 | Sonnet | 0 | Senior methodologist | Theme synthesizer |
| D4 | Opus | 0 | Strict quality auditor | Codebook validator |
| Agent | Model | Temp | Persona | Emphasis | Role |
|---|---|---|---|---|---|
| 1 | Sonnet | 0 | Inclusive | Inclusion first | Primary Coder A |
| 2 | Sonnet | 0.3 | Conservative | Exclusion first | Primary Coder B |
| 3 | Opus | 0 | Neutral arbiter | Balanced | Resolver |
In discovery, Extractor A is thorough and nuanced while Extractor B is precise and conservative. In application, Coder A leans toward including borderline cases while Coder B leans toward excluding them. Opposite biases surface genuine ambiguity.
Extractors use 0.2 vs. 0. Application coders use 0 vs. 0.3. Slight randomness on borderline decisions mirrors the natural variation between human coders without degrading accuracy on clear-cut cases.
Application Coder A sees inclusion criteria first for each code. Coder B sees exclusion criteria first. This creates different cognitive anchoring without changing the actual rules.
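One way to express the two agent tables as data is shown below. The values are copied from the tables, but the `AgentConfig` class and its field names are assumptions made for illustration, and the model field holds the family label from the table rather than an API model identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    agent_id: str
    model: str          # family label from the tables, not an API model ID
    temperature: float
    persona: str
    role: str
    emphasis: str = "Balanced"  # only the application agents list an emphasis


AGENTS = [
    AgentConfig("D1", "Sonnet", 0.2, "Thorough, nuanced", "Extractor A"),
    AgentConfig("D2", "Sonnet", 0.0, "Precise, conservative", "Extractor B"),
    AgentConfig("D3", "Sonnet", 0.0, "Senior methodologist", "Theme synthesizer"),
    AgentConfig("D4", "Opus", 0.0, "Strict quality auditor", "Codebook validator"),
    AgentConfig("1", "Sonnet", 0.0, "Inclusive", "Primary Coder A", "Inclusion first"),
    AgentConfig("2", "Sonnet", 0.3, "Conservative", "Primary Coder B", "Exclusion first"),
    AgentConfig("3", "Opus", 0.0, "Neutral arbiter", "Resolver", "Balanced"),
]

# Lookup by ID, e.g. BY_ID["D4"].model
BY_ID = {agent.agent_id: agent for agent in AGENTS}
```

Keeping the settings in one structure makes it easy to confirm that each paired agent differs from its partner in temperature, persona, or emphasis.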
Two independent agents read the same transcripts and extract meaning units separately. Each one catches codes the other overlooks, producing a more complete foundation for the codebook.
Discovery agents have no connection to application agents. The codebook is the only artifact that passes between phases, reviewed by a human researcher before application begins.
Each extractor reads all transcripts and produces codes with reasoning. Neither sees the other's work.
Merges both extractors' codes into a unified codebook with definitions, inclusion/exclusion criteria, and example quotes for each theme.
Checks every theme for coherence, distinctness, and completeness. Flags issues for revision.
A researcher reviews the discovered codebook, makes any adjustments, and approves it for application.
Neither sees the other's work. Both produce codes with written reasoning for every assignment.
Measure agreement between Agent 1 and Agent 2, corrected for chance. Segments where both agree are auto-finalized.
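The auto-finalize step can be pictured as a simple partition of segments. The sketch below assumes that "agree" means both coders assigned exactly the same set of codes to a segment; that definition, the function name, and the input shape are assumptions rather than details from the pipeline.

```python
def split_by_agreement(codes_a, codes_b):
    """Partition segments into auto-finalized agreements and disputes.

    `codes_a` and `codes_b` are hypothetical inputs: {segment_id: list of code names}.
    A segment missing from one coder is treated as having no codes from that coder.
    """
    finalized = {}
    disputed = []
    for segment_id in sorted(codes_a.keys() | codes_b.keys()):
        a = set(codes_a.get(segment_id, ()))
        b = set(codes_b.get(segment_id, ()))
        if a == b:
            finalized[segment_id] = sorted(a)   # both coders agree exactly
        else:
            disputed.append(segment_id)         # send to the resolver
    return finalized, disputed
```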
Reviews both coders' reasoning against the codebook definition. Picks the correct code. Flags ambiguous definitions for human review.
The codebook is the single most important factor in coding quality. Research shows that codebooks with full definitions, inclusion/exclusion criteria, and example quotes improve coding accuracy by 15-25 percentage points compared to code labels alone (Pangakis, Wolken, & Fasching, 2023). Each theme entry includes five components:
Inter-rater reliability measures whether independent coders assign the same codes to the same data. We use Cohen's Kappa (κ), which corrects for chance agreement (Cohen, 1960). Kappa is calculated per code, because some codes are inherently harder to apply consistently than others.
Scale: Landis & Koch (1977). Threshold based on Krippendorff (2004) recommendation of α ≥ 0.667 for applied research.
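For reference, Cohen's Kappa for a single code is (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The sketch below computes it per code from two coders' assignments. The function names and input shapes are illustrative, and returning 1.0 for the degenerate case where both coders give the same constant label is an assumption (kappa is formally undefined there).

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length lists of present/absent labels (bool or 0/1)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    p_a = sum(labels_a) / n  # share coded "present" by coder A
    p_b = sum(labels_b) / n  # share coded "present" by coder B
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if p_e == 1.0:
        return 1.0  # assumption: both coders constant and identical
    return (p_o - p_e) / (1 - p_e)


def kappa_per_code(codes_a, codes_b, code_names):
    """Kappa for each code over segments both coders saw.

    `codes_a` / `codes_b` are hypothetical inputs: {segment_id: collection of code names}.
    """
    segments = sorted(codes_a.keys() & codes_b.keys())
    result = {}
    for code in code_names:
        labels_a = [code in codes_a[s] for s in segments]
        labels_b = [code in codes_b[s] for s in segments]
        result[code] = cohens_kappa(labels_a, labels_b)
    return result
```

Codes whose kappa falls below the configured threshold would then be flagged for review.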
Every code assignment includes written reasoning from each agent that evaluated it. This creates a chain of evidence from the participant's words to the final theme, making every finding traceable and defensible.
Transparency matters. Below are the exact system prompts and codebook formatting each agent receives. Nothing is hidden or summarized. These are the literal instructions that shape each agent's coding behavior.
Each agent receives a persona prompt as the first line of every API call. This shapes how the agent approaches its task.
You are a thorough, nuanced qualitative researcher. Read each response carefully and extract every distinct meaning unit. Capture both explicit statements and clearly implied meaning. It is better to extract too many meaning units than to miss one.
You are a precise, conservative qualitative researcher. Extract only meaning units that are clearly and explicitly stated. Do not infer or interpret beyond what the participant said. Each meaning unit should represent one distinct, verifiable claim.
You are a senior qualitative methodologist. Your task is to group first-cycle codes into coherent themes via second-cycle coding. Each theme must have a clear definition, inclusion criteria, exclusion criteria, and example quotes. Themes should be distinct, analytically useful, and grounded in the data.
You are a strict quality auditor reviewing a codebook. Test every theme for coherence (does it represent one concept?), distinctness (does it overlap with other themes?), and completeness (are the inclusion/exclusion criteria clear enough for a coder to apply consistently?). Flag any issues.
You are a thorough, inclusive qualitative coder. Capture both explicit statements and clearly implied meaning. When evidence partially matches a code definition, lean toward including the code. It is better to over-include than to miss a relevant code.
You are a conservative, precise qualitative coder. Only assign a code when the participant's words clearly and explicitly match the codebook definition. Do not infer or interpret beyond what was said. When in doubt, do not assign the code.
You are a neutral arbiter resolving a coding disagreement. Review the participant's words, the codebook definition, and both coders' reasoning. Decide strictly based on whether the evidence meets the codebook definition. Do not favor either coder.
When presenting codebook definitions, Coder A and Coder B see the same information in a different order. This creates different cognitive anchoring, similar to how a human reading "include when..." first will approach a decision differently than one reading "exclude when..." first.
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...
INCLUDE when: Any mention of difficulty navigating, excessive clicks, confusing workflows...
EXCLUDE when: General complaints about the product that are not specifically about ease of use...

--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...
EXCLUDE when: General complaints about the product that are not specifically about ease of use...
INCLUDE when: Any mention of difficulty navigating, excessive clicks, confusing workflows...

--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...
Include when: Any mention of difficulty navigating, excessive clicks, confusing workflows...
Exclude when: General complaints about the product that are not specifically about ease of use...
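The three layouts above differ only in the order and capitalization of the criteria lines, so one small formatter could produce all of them. The sketch below is a guess at how that might look: `format_code` and the emphasis labels are hypothetical, while the dictionary keys follow the codebook file format.

```python
def format_code(entry, emphasis="balanced"):
    """Render one codebook entry in the order a given agent should see it."""
    inc = entry["inclusion_criteria"]
    exc = entry["exclusion_criteria"]
    if emphasis == "inclusion_first":      # Coder A
        criteria = [f"INCLUDE when: {inc}", f"EXCLUDE when: {exc}"]
    elif emphasis == "exclusion_first":    # Coder B
        criteria = [f"EXCLUDE when: {exc}", f"INCLUDE when: {inc}"]
    elif emphasis == "balanced":           # Resolver
        criteria = [f"Include when: {inc}", f"Exclude when: {exc}"]
    else:
        raise ValueError(f"unknown emphasis: {emphasis!r}")
    lines = [
        f"--- CODE: {entry['code_name']} ---",
        f"Definition: {entry['definition']}",
        *criteria,
    ]
    return "\n".join(lines)
```

The rules themselves never change between agents; only the presentation order does.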
This is the complete prompt that each application coding agent (1, 2) receives for every thematic question. The persona and codebook emphasis sections change per agent. Everything else is identical.
[Agent persona prompt inserted here]

You are coding an interview response for the question: "[question text from codebook]"

Multiple codes CAN apply to a single response. Assign all that apply.

CODEBOOK:
[All codes formatted per agent's emphasis order, each with: definition, inclusion criteria, exclusion criteria, up to 3 examples with reasoning, up to 2 negative examples with reasoning]

PARTICIPANT RESPONSE: "[participant's actual response text]"

INSTRUCTIONS:
1. For each code in the codebook, explain whether the participant's response matches the definition.
2. Be specific: quote the exact words from the response that match (or don't match) each code.
3. WORD BOUNDARY: Only match words that appear as complete, standalone words. Never match a word found inside a longer word (e.g., "equity" contains the letters q-u-i-t but the participant did NOT say "quit"). Verify any keyword is bounded by spaces, punctuation, or the start/end of the text.
4. NEGATION: Pay attention to negation words (not, never, no, didn't, wasn't, wouldn't). "Not satisfied" means dissatisfied. "No problems" means things went well. Identify the complete negated phrase before coding.
5. SARCASM: Watch for sarcastic or ironic statements where context suggests the speaker means the opposite of their literal words. Cues include exaggeration, contradiction with surrounding statements, or phrases like "yeah right." Code the intended meaning, not the literal words.
6. HEDGING: Distinguish between definitive statements and hedged or qualified ones. Words like "kind of," "sort of," "I guess," "maybe," and "not really" weaken or change the meaning. Code the actual strength of the statement.
7. ABSENCE: If the participant does not mention a topic, do NOT treat that as evidence for or against any code. Only code what is actually stated or clearly implied. Silence on a topic is not data.
8. CONTEXT: Read the entire response before coding any part of it. A phrase can change meaning based on what comes before or after it. "The price was high but worth every penny" is not a price complaint.
9. Then list all codes that apply.

Respond in this exact JSON format:
{
  "reasoning": {
    "code_name_1": "Explanation of why this code does or does not apply",
    "code_name_2": "Explanation of why this code does or does not apply"
  },
  "codes_assigned": ["code_name_1", "code_name_2"],
  "confidence": "high|medium|low"
}
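Because the prompt demands an exact JSON shape, the caller has to validate each reply before trusting it. The following is a minimal validation sketch: the function name and error messages are hypothetical, and the pipeline may perform different or additional checks.

```python
import json

VALID_CONFIDENCE = {"high", "medium", "low"}


def parse_thematic_response(raw_text, codebook_codes):
    """Parse and validate one thematic coding reply; raise ValueError if malformed.

    `codebook_codes` is the collection of code names defined for the question.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if not isinstance(data.get("reasoning"), dict):
        raise ValueError("'reasoning' must map code names to explanations")
    codes = data.get("codes_assigned")
    if not isinstance(codes, list):
        raise ValueError("'codes_assigned' must be a list")
    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in VALID_CONFIDENCE:
        raise ValueError("'confidence' must be high, medium, or low")
    unknown = [code for code in codes if code not in codebook_codes]
    if unknown:
        raise ValueError(f"codes not in codebook: {unknown}")
    return data
```

A reply that fails validation can be retried rather than written into the dataset.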
When Coder A and Coder B disagree, the resolver agent receives both coders' reasoning side by side.
[Resolver agent persona prompt inserted here]

Two independent coders have coded the same interview response and disagree. Your task is to determine the correct coding based strictly on the codebook definition.

QUESTION: "[question text from codebook]"

PARTICIPANT RESPONSE: "[participant's actual response text]"

CODEBOOK:
[All codes with balanced emphasis: definition, inclusion criteria, exclusion criteria, examples, negative examples]

CODER A's ASSESSMENT:
Codes assigned: ["code_1", "code_2"]
Reasoning: [Coder A's full reasoning for each code]

CODER B's ASSESSMENT:
Codes assigned: ["code_1"]
Reasoning: [Coder B's full reasoning for each code]

INSTRUCTIONS:
1. Review the participant's exact words.
2. Review the codebook definition, inclusion criteria, and exclusion criteria.
3. Evaluate each coder's reasoning.
4. WORD BOUNDARY: Only match words that appear as complete, standalone words. Never match a word found inside a longer word (e.g., "equity" contains the letters q-u-i-t but the participant did NOT say "quit"). Verify any keyword is bounded by spaces, punctuation, or the start/end of the text.
5. NEGATION: Pay attention to negation words (not, never, no, didn't, wasn't, wouldn't). "Not satisfied" means dissatisfied. "No problems" means things went well. Identify the complete negated phrase before coding.
6. SARCASM: Watch for sarcastic or ironic statements where context suggests the speaker means the opposite of their literal words. Cues include exaggeration, contradiction with surrounding statements, or phrases like "yeah right." Code the intended meaning, not the literal words.
7. HEDGING: Distinguish between definitive statements and hedged or qualified ones. Words like "kind of," "sort of," "I guess," "maybe," and "not really" weaken or change the meaning. Code the actual strength of the statement.
8. ABSENCE: If the participant does not mention a topic, do NOT treat that as evidence for or against any code. Only code what is actually stated or clearly implied. Silence on a topic is not data.
9. CONTEXT: Read the entire response before coding any part of it. A phrase can change meaning based on what comes before or after it. "The price was high but worth every penny" is not a price complaint.
10. Determine the correct codes based on the codebook definition.
11. If the codebook definition is ambiguous (both coders' interpretations are reasonable), flag it.

Respond in this exact JSON format:
{
  "reasoning": "Your step-by-step analysis",
  "codes_assigned": ["code_name_1"],
  "favored_coder": "A|B|neither",
  "definition_ambiguous": true|false,
  "ambiguity_note": "If ambiguous, describe what about the definition is unclear"
}
Categorical, rank-order, and binary questions use simpler prompts since they involve assigning a single value rather than multiple thematic codes.
[Agent persona]

You are coding an interview response for the question: "[question text]"

Assign exactly ONE category from the list below.

CATEGORIES:
- Satisfied: Uses clearly positive language...
- Mixed: Acknowledges both positives and negatives...
- Dissatisfied: Uses clearly negative language...

PARTICIPANT RESPONSE: "[response text]"

INSTRUCTIONS:
1. Explain which category best fits the participant's response and why.
2. Quote specific words that support your choice.
3. WORD BOUNDARY: Only match complete, standalone words. Never match a word inside a longer word.
4. NEGATION: "Not satisfied" means dissatisfied. Identify the complete negated phrase before coding.
5. SARCASM: If context suggests the opposite of literal words (exaggeration, contradiction), code intended meaning.
6. HEDGING: "Kind of," "I guess," "maybe" weaken meaning. Code the actual strength, not a simplified version.
7. ABSENCE: Silence on a topic is not data. Only code what is actually stated.
8. CONTEXT: Read the full response first. A phrase can change meaning based on surrounding sentences.

Respond in JSON:
{
  "reasoning": "Explanation of your coding decision",
  "category_assigned": "category_name",
  "confidence": "high|medium|low"
}
[Agent persona]

You are coding an interview response for the question: "[question text]"

Map the participant's response to exactly ONE bucket from the list below.

BUCKETS:
- 1-100 (1 to 100)
- 101-500 (101 to 500)
- 501-2000 (501 to 2000)
- 2001-10000 (2001 to 10000)
- 10000+ (10001 to unlimited)

PARTICIPANT RESPONSE: "[response text]"

INSTRUCTIONS:
1. Extract the relevant value from the response.
2. Determine which bucket it falls into.
3. If ambiguous, assign the closest bucket and note it.
4. WORD BOUNDARY: Only match complete, standalone words. Never match a word inside a longer word.
5. NEGATION: "Not quite 500" is different from "500."
6. CONTEXT: Read the full response first. Surrounding sentences may clarify or change the number.

Respond in JSON:
{
  "extracted_value": "The value from the response",
  "reasoning": "How you determined the bucket",
  "bucket_assigned": "bucket_label",
  "confidence": "high|medium|low"
}
[Agent persona]

You are coding an interview response for the question: "[question text]"

This is a binary (yes/no) determination. Assign exactly one label: "[positive_label]" or "[negative_label]".

CRITERIA:
Definition of "[positive_label]": [definition]
Code as "[positive_label]" when: [inclusion_criteria]
Code as "[negative_label]" when: [exclusion_criteria]

PARTICIPANT RESPONSE: "[response text]"

INSTRUCTIONS:
1. Review the participant's exact words.
2. Determine whether the response meets the definition.
3. Quote the specific evidence that supports your determination.
4. WORD BOUNDARY: Only match complete, standalone words. Never match a word inside a longer word.
5. NEGATION: "Not satisfied" means dissatisfied. Identify the complete negated phrase before coding.
6. SARCASM: If context suggests the opposite of literal words (exaggeration, contradiction), code intended meaning.
7. HEDGING: "Kind of," "I guess," "maybe" weaken meaning. Code the actual strength, not a simplified version.
8. ABSENCE: Silence on a topic is not data. Only code what is actually stated.
9. CONTEXT: Read the full response first. A phrase can change meaning based on surrounding sentences.

Respond in JSON:
{
  "reasoning": "Explanation with quoted evidence",
  "binary_assigned": "[positive]" or "[negative]",
  "codes_assigned": ["[positive]"] or [],
  "confidence": "high|medium|low"
}
The pipeline is a set of Python scripts that orchestrate all seven agents automatically. Here is exactly how to set it up and run it.
pip install anthropic

For the full pipeline, you only need one file: your transcripts. The discovery phase builds the codebook automatically. If you are running application only (with an existing codebook), you need both files.
Your interview transcripts. One entry per participant, with their responses and question text for each question.
{
  "participants": [
    {
      "participant_id": 1,
      "metadata": {
        "name": "Participant 1",
        "date": "2026-01-15"
      },
      "transcript": [
        {
          "question_id": "Q1",
          "question_text": "Why did you switch from your previous provider?",
          "response": "Honestly the biggest thing was onboarding new hires. It was such a clunky process, like you'd have to re-enter their info in three different places."
        },
        {
          "question_id": "Q2",
          "question_text": "How satisfied are you with your current solution?",
          "response": "It does the job but there are definitely things that frustrate me."
        }
      ]
    }
  ]
}

Only needed if you already have a codebook and want to skip discovery. One entry per question, with codes, definitions, inclusion/exclusion criteria, and example quotes.
{
  "study_name": "BambooHR Win-Loss Study",
  "version": "1.0",
  "questions": [
    {
      "question_id": "Q1",
      "question_text": "Why did you switch from your previous provider?",
      "coding_type": "thematic",
      "multi_code": true,
      "codes": [
        {
          "code_name": "Poor usability",
          "definition": "Participant describes the product as difficult to use...",
          "inclusion_criteria": "Any mention of difficulty navigating...",
          "exclusion_criteria": "General complaints not about ease of use...",
          "examples": [
            {
              "text": "It took 12 clicks to approve a PTO request",
              "reasoning": "Describes excessive steps"
            }
          ],
          "negative_examples": [
            {
              "text": "It just didn't have what we needed",
              "reasoning": "Missing features, not usability"
            }
          ]
        }
      ]
    }
  ]
}
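When running application only, it may be worth checking that every question in the transcripts has an entry in the codebook before spending API calls. The helper below is a hypothetical pre-flight check written against the two file formats shown above; it is not part of the pipeline as documented.

```python
def uncovered_questions(transcripts, codebook):
    """Return sorted question IDs that appear in the transcripts but not in the codebook.

    Both arguments are the parsed JSON objects from the two input files.
    """
    known = {question["question_id"] for question in codebook["questions"]}
    missing = set()
    for participant in transcripts["participants"]:
        for item in participant["transcript"]:
            if item["question_id"] not in known:
                missing.add(item["question_id"])
    return sorted(missing)
```

With the two sample files above, this would report Q2, since the sample codebook defines codes only for Q1.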
Set the ANTHROPIC_API_KEY environment variable before running the script. Do not paste your key directly into the config file.
# On Mac/Linux:
export ANTHROPIC_API_KEY=sk-ant-your-key-here

# On Windows (PowerShell):
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"

# On Windows (Command Prompt):
set ANTHROPIC_API_KEY=sk-ant-your-key-here
Navigate to the pipeline folder and run the script. The pipeline will process all participants and questions automatically, printing progress as it goes.
# Navigate to the pipeline folder
cd website/how-to-code-transcripts/pipeline

# Run the full pipeline (discovery + application)
python run_full.py

# Or run phases separately:
# Discovery only (build codebook):
python run_discovery.py

# Application only (apply existing codebook):
python run_coding.py
For the full pipeline, discovery takes 15-25 minutes and costs ~$8-15 per study. Application takes 30-45 minutes and costs ~$30 per 100 interviews. You will see real-time progress updates in your terminal as each phase completes.
The pipeline creates an output/ folder with these files:
| File | What it contains | When to read it |
|---|---|---|
| discovered_codebook.json | The codebook built by the discovery phase, with themes, definitions, criteria, and examples | Review and approve this before application runs (human review gate) |
| final_codes.json | The final validated code assignments for every segment, with reasoning | This is your primary deliverable |
| reliability_summary.json | Per-code kappa values and overall kappa | Check that overall kappa is above 0.65 |
| flagged_items.json | Codes with low reliability and codebook definitions flagged as ambiguous | If any codes are flagged, revise the codebook definitions and re-run |
| reliability.txt | Human-readable reliability report: Agent 1 vs Agent 2 | To understand where and why agents disagreed |
| agent_1_codes.json | Agent 1's raw coding with full reasoning per segment | For audit trail or to understand specific coding decisions |
| agent_2_codes.json | Agent 2's raw coding with full reasoning per segment | For audit trail or to compare with Agent 1 |
All settings live in config.py. The defaults work well for most projects. Settings you might adjust:
| Setting | Default | What it controls |
|---|---|---|
| KAPPA_THRESHOLD | 0.65 | Minimum Cohen's Kappa to consider a code reliable. Below this, the code is flagged for review. |
| MIN_THEME_FREQUENCY | 0.05 | Discovery: a theme must appear in at least 5% of responses to be a standalone theme. Below this, the synthesizer merges it into a broader theme. |
| MAX_OTHER_FREQUENCY | 0.15 | Discovery: if more than 15% of responses land in "Other," the validator flags the codebook for additional themes. |
| BATCH_DELAY_SECONDS | 0.5 | Pause between API calls to respect rate limits. Increase if you hit rate-limit errors. |
| MAX_RETRIES | 3 | How many times to retry if an API call returns malformed JSON. |
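To show how the last two settings might interact, here is a small retry wrapper. It is a sketch under stated assumptions: `call_with_retries` is a hypothetical helper, `call` stands in for whatever function returns the model's raw text, and MAX_RETRIES is read as the number of additional attempts after the first.

```python
import json
import time

# Defaults copied from the settings table above.
BATCH_DELAY_SECONDS = 0.5
MAX_RETRIES = 3


def call_with_retries(call, max_retries=MAX_RETRIES, delay=BATCH_DELAY_SECONDS):
    """Invoke `call()` until it returns parseable JSON or retries run out.

    `call` is a hypothetical zero-argument function returning the raw reply text.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        if attempt:
            time.sleep(delay)  # pause before each retry
        try:
            return json.loads(call())
        except json.JSONDecodeError as exc:
            last_error = exc   # malformed JSON: try again
    raise RuntimeError(f"malformed JSON after {max_retries} retries") from last_error
```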
Our approach is grounded in established qualitative research methods, each backed by decades of peer-reviewed evidence.