Research methodology

How we code interview transcripts

Our approach to qualitative analysis transforms open-ended interview responses into structured, reliable findings. Every code assignment is evidence-based, every theme is validated, and every conclusion is defensible.

Why systematic coding matters

Interview transcripts are rich with insight, but without systematic analysis, findings become anecdotal. Two analysts reading the same transcripts can reach different conclusions. Themes can be too broad (losing nuance) or too narrow (missing patterns). And there is no way to demonstrate to stakeholders that the results are reliable.

Systematic coding solves these problems. It provides a structured, repeatable process for transforming open-ended responses into quantified findings, with built-in quality checks that ensure accuracy and consistency.

κ ≥ 0.65: inter-rater reliability target for all coded data
5: independent AI agents with two rounds of reliability checks
2 rounds: code, resolve, re-code, validate

6 steps across 2 rounds

Round 1: Independent coding + resolution

1. Agent 1 codes all segments (Opus, inclusive persona, inclusion-criteria-first)
2. Agent 2 codes all segments independently (Sonnet, conservative persona, exclusion-criteria-first)
3. Measure reliability and resolve disagreements (Cohen's Kappa calculated per code; Agent 3, a neutral arbiter, reviews both agents' reasoning)

Round 2: Independent validation + final resolution

4. Agent 4 codes all segments from scratch (Sonnet, fresh perspective, no knowledge of round 1, reverse transcript order)
5. Measure reliability against round 1 resolved codes (segments where both rounds agree are auto-finalized)
6. Agent 5 resolves remaining disagreements (Opus; a senior reviewer makes the final call on the hardest cases)

Our approach: Codebook Thematic Analysis

Not all thematic analysis is created equal. Braun and Clarke (2022) identify three distinct variants, each with different strengths and trade-offs. We use Codebook Thematic Analysis because it combines the rigor clients expect with the flexibility that real interview data demands.

Reflexive TA

Strengths: highly flexible; captures deep interpretation.
Limitations: no codebook; no reliability measurement; results vary by analyst.

Codebook TA (our approach)

Structured codebook with definitions. Supports inter-rater reliability. Iterative, evolving codebook. Works with AI-assisted coding. Scales to large datasets.

Coding Reliability TA

Strengths: highest reliability; fixed, testable codebook.
Limitations: rigid, cannot evolve; misses unexpected themes; requires all codes upfront.

Why Codebook TA?

In applied market research, we rarely know every theme before reading the data. Reflexive TA gives us no way to prove our coding is reliable. Coding Reliability TA locks us into a fixed framework that cannot adapt. Codebook TA gives us the best of both: a structured codebook that evolves iteratively as we discover what the data contains, with reliability measurement built in.

Four types of interview questions

Different questions produce different kinds of data. Each type requires a distinct coding approach, matched to the structure of the response.

Rank-order variables

Open responses mapped to an ordinal scale with predefined buckets.

Example question
"How large is your company?"
Example response
"We're about 800 people globally"
Coded as
500-1,000 employees
Method: Directed content analysis (Hsieh & Shannon, 2005). Predefined categories with clear boundaries.

Categorical variables

Single-dimension classification into a small set of distinct categories.

Example question
"How do you feel about your current tool?"
Example response
"It does the job but there are definitely things that frustrate me"
Coded as
Mixed
Method: Directed content analysis with anchor descriptions defining each category level.

Thematic coding

Complex, open-ended responses broken into individual statements and grouped into validated themes.

Example question
"Why did you switch providers?"
Example response
"The onboarding was clunky, we had to re-enter data in three places, and honestly we were paying too much for what we got"
Coded as
Poor usability; Missing features; Cost concerns
Method: Full Codebook Thematic Analysis with meaning unit segmentation, two-pass coding, and theme validation.

Binary variables

Yes/no determination from open-ended responses. The most common variable type in market research coding.

Example question
"Did you evaluate other vendors before choosing?"
Example response
"We looked at a few other options but they were all too expensive"
Coded as
Yes
Method: Presence/absence coding (Krippendorff, 2004). Binary determination against a codebook definition.

How our four types cover the full literature taxonomy

The qualitative coding and survey methodology literature (Saldana, 2016; Krippendorff, 2004; Hsieh & Shannon, 2005; Miles, Huberman & Saldana, 2014) recognizes nine distinct variable types that researchers extract from open-ended interview responses. Our four native coding types cover all nine, either directly or through configuration.

Variable type | What it is | Example | How we handle it
Thematic | Multiple themes per response | "Why did you switch?" coded as Poor usability + Cost concerns | Native type
Categorical | One label from 3+ unordered options | Satisfaction coded as Satisfied / Mixed / Dissatisfied | Native type
Rank-order | One bucket from an ordered scale | Company size coded as 101-500 employees | Native type
Binary | Yes/no, present/absent | "Did they evaluate competitors?" coded as Yes | Native type
Sentiment / Valence | Positive / neutral / negative attitude | Tone toward vendor support coded as Negative | Categorical with Positive/Mixed/Negative options
Frequency / Intensity | How often or how strongly | Dashboard usage coded as Daily / Weekly / Monthly | Rank-order with frequency buckets
Temporal | When something happened | Started evaluating coded as Q3 2025 | Categorical or Rank-order
Numeric / Continuous | Extract an exact number | Annual budget coded as $250,000 | Rare: use rank-order buckets instead; spoken responses are rarely precise enough to warrant exact extraction
Multi-code Ordered | Multiple ranked codes | Top priorities coded as Cost (1st), Speed (2nd) | Rare: use separate rank-order variables per item, or thematic without ranking

The coding process

Our thematic coding follows a four-step process grounded in established qualitative research methods. Each step has specific rules and quality checks.

Step 1: Segment into meaning units

Each response is broken into discrete meaning units: the smallest segment of text that contains a single idea or claim (Graneheim & Lundman, 2004). A participant who says three different things gets three separate meaning units, each coded independently.

Before segmentation
"The onboarding was clunky, we had to re-enter data in three places, and we were paying too much"
After segmentation
MU-1 "The onboarding was clunky"
MU-2 "we had to re-enter data in three places"
MU-3 "we were paying too much"
Step 2: First-cycle coding

Each meaning unit receives a descriptive code: a short label (2-5 words) capturing what the statement is about (Saldana, 2016). Codes use the participant's own language where it is distinctive ("in vivo coding") and standardized labels where consistency matters.

"The onboarding was clunky" Poor usability
"re-enter data in three places" Duplicate data entry
"paying too much" Cost concerns
Step 3: Theme construction (two-pass approach)

Related codes are grouped into broader themes, each organized around a single concept. We use a two-pass approach (Deterding & Waters, 2021):

Pass 1: Discovery
Read 10-20% of transcripts and identify recurring patterns. Build an initial codebook with definitions, examples, and exclusion criteria for each theme. This pass is analyst-led.
Pass 2: Application
Apply the finalized codebook to all transcripts. Every response is coded against the same definitions. AI-assisted coding handles volume; human review ensures quality.
Theme: Product usability issues
Codes: Poor usability; Confusing navigation; Too many clicks; Duplicate data entry
Step 4: Theme refinement

Candidate themes are tested against specific rules to ensure they are coherent, distinct, and analytically useful. Themes may be split, merged, or reorganized based on these checks.

Theme validation rules

Themes are not arbitrary groupings. Each must pass specific validation criteria before it enters the final analysis.

5% minimum frequency

A theme must be mentioned by at least 5% of participants to stand on its own. Themes below this threshold are merged with related themes or moved to "Other." This prevents findings from being driven by isolated comments.

Split when two concepts emerge

If the quotes within a theme cluster into two or more distinct ideas, the theme is too broad. A theme about "convenience" that contains both "close to my office" and "fast service" captures two different concepts and should be split for actionable analysis.

Merge at 70%+ participant overlap

When 70% or more of participants who mention Theme A also mention Theme B, the themes likely represent the same underlying concept. They are merged into a single theme to avoid double-counting and simplify the analysis.

Target 6-10 themes per question

Fewer than 4 themes for an open-ended question usually means important distinctions are being lost. More than 12 usually means themes are not abstracted enough. Sub-themes preserve nuance within the 6-10 target range.

"Other" capped at 15%

If the "Other" category exceeds 15% of responses, a meaningful pattern is being missed. The uncategorized responses are reviewed to identify hidden themes that should be added to the codebook.

Sub-themes preserve detail

Broad themes work for executive summaries. Sub-themes provide the detail needed for actionable recommendations. A theme like "Value for money" (21%) might contain sub-themes for "Low absolute prices" (14%), "Deals and promotions" (8%), and "Portion value" (5%).
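
The validation rules above reduce to simple checks. Below is a minimal sketch, assuming a hypothetical data shape (theme name mapped to the set of participant IDs who mention it); the pipeline's actual structures may differ.

Python (illustrative)
from itertools import combinations

MIN_THEME_FREQUENCY = 0.05  # 5% minimum frequency rule
MAX_OTHER_FREQUENCY = 0.15  # "Other" capped at 15%
MERGE_OVERLAP = 0.70        # merge at 70%+ participant overlap

def validate_themes(theme_mentions: dict[str, set[int]], n_participants: int) -> list[str]:
    # theme_mentions maps theme name -> participant IDs mentioning it
    # (a hypothetical shape chosen for this sketch).
    flags = []
    for theme, ids in theme_mentions.items():
        share = len(ids) / n_participants
        if theme == "Other" and share > MAX_OTHER_FREQUENCY:
            flags.append(f"'Other' at {share:.0%}: review for hidden themes")
        elif theme != "Other" and share < MIN_THEME_FREQUENCY:
            flags.append(f"'{theme}' at {share:.0%}: merge or move to 'Other'")
    themes = [t for t in theme_mentions if t != "Other"]
    for a, b in combinations(themes, 2):
        for x, y in ((a, b), (b, a)):
            ids_x, ids_y = theme_mentions[x], theme_mentions[y]
            if ids_x and len(ids_x & ids_y) / len(ids_x) >= MERGE_OVERLAP:
                flags.append(f"70%+ of participants mentioning '{x}' also mention '{y}': consider merging")
    return flags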

The 5-agent reliability system

Every coded dataset passes through a two-round, five-agent process that produces measurable evidence of coding accuracy. Five independent AI agents code, cross-check, and validate every finding. No single agent's judgment is trusted in isolation.

Why multiple independent agents?

A single AI coder, no matter how accurate, provides no way to measure reliability. The same data could be coded differently by a different system, and there would be no way to know which is correct. Multiple independent agents solve this by replicating the gold-standard practice of inter-rater reliability from human qualitative research (Cohen, 1960; Krippendorff, 2004), but without the time, cost, and fatigue limitations of human coders.

Three agents would suffice for basic reliability measurement. We use five because the second round of validation catches residual errors in the first round's resolution, improving accuracy from approximately 85-88% to 90-93%. For consulting engagements where findings drive significant business decisions, that incremental precision matters.

Agent configuration

Independence between agents is not automatic. Two identical AI systems given identical inputs will produce identical outputs, proving nothing. We design genuine independence into each agent using six levers: model architecture, temperature, persona framing, codebook emphasis, transcript order, and context isolation.

Agent | Model | Temperature | Persona | Codebook emphasis | Role
1 | Opus | 0 | Thorough, inclusive | Inclusion criteria first | Primary Coder A
2 | Sonnet | 0.2 | Conservative, precise | Exclusion criteria first | Primary Coder B
3 | Sonnet | 0 | Neutral arbiter | Balanced | Round 1 resolver
4 | Sonnet | 0 | Balanced, fresh perspective | Balanced | Independent re-coder
5 | Opus | 0 | Senior quality reviewer | Balanced | Final validator
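
One plausible way config.py might encode this table is sketched below. The dataclass shape, field names, and model ID placeholders are assumptions for illustration, not the script's actual internals; substitute current Opus/Sonnet model names from Anthropic's documentation.

Python (illustrative)
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    # One row of the agent table above. Model IDs are placeholders.
    name: str
    model: str
    temperature: float
    persona: str   # key into the persona prompts shown later
    emphasis: str  # "inclusion_first" | "exclusion_first" | "balanced"
    role: str

AGENTS = [
    AgentConfig("agent_1", "<opus-model-id>",   0.0, "inclusive",    "inclusion_first", "Primary Coder A"),
    AgentConfig("agent_2", "<sonnet-model-id>", 0.2, "conservative", "exclusion_first", "Primary Coder B"),
    AgentConfig("agent_3", "<sonnet-model-id>", 0.0, "arbiter",      "balanced",        "Round 1 resolver"),
    AgentConfig("agent_4", "<sonnet-model-id>", 0.0, "fresh",        "balanced",        "Independent re-coder"),
    AgentConfig("agent_5", "<opus-model-id>",   0.0, "senior",       "balanced",        "Final validator"),
]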

How we ensure genuine independence

Different model architectures

Agents 1 and 5 use Opus (deeper reasoning); Agents 2, 3, and 4 use Sonnet. Different model weights produce genuinely different coding judgments on ambiguous cases.

Different personas

Agent 1 leans toward including borderline cases. Agent 2 leans toward excluding them. This creates productive tension: where both agree despite opposite biases, confidence is very high. Where they disagree, it surfaces genuine ambiguity.

Temperature variation

Agent 2 operates at temperature 0.2, introducing slight randomness on borderline decisions. This mirrors the natural variation between human coders without degrading accuracy on clear-cut cases.

Codebook emphasis

Agent 1 sees inclusion criteria first for each code. Agent 2 sees exclusion criteria first. This creates different cognitive anchoring without changing the actual rules.

Transcript order

Agent 4 processes transcripts in reverse order. Earlier transcripts subtly influence how coders interpret later ones. Reversing the order eliminates this bias.

Context isolation

Agent 4 has no knowledge of round 1 results. It codes the full dataset from scratch, providing a completely uncontaminated second opinion.

The two-round process

Round 1: Independent coding + resolution

Agents 1 and 2 code all segments independently.
Neither sees the other's work. Both produce codes with written reasoning for every assignment.

Calculate Cohen's Kappa per code.
Measure agreement between Agent 1 and Agent 2, corrected for chance. Identify every disagreement with both agents' reasoning.

Agent 3 resolves all disagreements.
Reviews both agents' reasoning against the codebook definition. Picks the correct code. Flags ambiguous definitions for human review.

Round 2: Independent validation + final resolution

Agent 4 codes all segments independently.
No knowledge of round 1. Different model and persona. Reverse transcript order. A completely fresh perspective.

Compare Agent 3's resolved codes vs. Agent 4's codes.
Segments where they agree are auto-finalized (double-confirmed). Disagreements proceed to Agent 5.

Agent 5 makes the final call.
The most capable model reviews the hardest cases: both rounds' reasoning, the codebook, and the original text. Its decision is final.
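
A minimal sketch of the round 2 auto-finalization step, assuming codes are stored as sets keyed by segment ID (a hypothetical shape; the function name is illustrative):

Python (illustrative)
def split_by_agreement(
    resolved: dict[str, set[str]], recoded: dict[str, set[str]]
) -> tuple[dict, dict]:
    # Segments where round 1's resolved codes match Agent 4's independent
    # recoding are double-confirmed; everything else goes to Agent 5.
    finalized, escalated = {}, {}
    for seg_id, codes in resolved.items():
        if codes == recoded.get(seg_id, set()):
            finalized[seg_id] = codes
        else:
            escalated[seg_id] = {"round_1": codes, "agent_4": recoded.get(seg_id, set())}
    return finalized, escalated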

The codebook: precision in every definition

The codebook is the single most important factor in coding quality. Research shows that codebooks with full definitions, inclusion/exclusion criteria, and example quotes improve coding accuracy by 15-25 percentage points compared to code labels alone (Pangakis, Wolken, & Fasching, 2023). Each theme entry includes five components:

Example codebook entry
Theme name
Poor post-sale support responsiveness
Definition
Participant describes slow response times, unanswered tickets, difficulty reaching a real person, or long resolution times after becoming a customer.
Include
Any mention of delayed support responses, unresolved issues, phone trees, or being "passed around" between departments.
Exclude
Pre-sale experience, onboarding difficulties, product bugs. These are separate themes.
Example quote
"We submitted a ticket about a payroll error and didn't hear back for two weeks."

Measuring inter-rater reliability

Inter-rater reliability measures whether independent coders assign the same codes to the same data. We use Cohen's Kappa (κ), which corrects for chance agreement (Cohen, 1960). Kappa is calculated per code, because some codes are inherently harder to apply consistently than others.

< 0.20: Poor
0.21-0.40: Fair
0.41-0.60: Moderate
0.61-0.80: Substantial
0.81-1.00: Almost perfect

Our minimum threshold: κ ≥ 0.65

Scale: Landis & Koch (1977). Threshold based on Krippendorff's (2004) recommendation of α ≥ 0.667 for applied research.
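
Cohen's Kappa is κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance given each coder's base rates. A minimal sketch of the per-code calculation, treating each code as a binary assigned/not-assigned decision per segment:

Python (illustrative)
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    # Cohen's Kappa for one code across segments (Cohen, 1960):
    # observed agreement corrected for chance agreement.
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical data: did each agent assign "Poor usability" to segments 1..8?
agent_1 = [True, True, False, True, False, False, True, False]
agent_2 = [True, False, False, True, False, False, True, False]
print(f"kappa = {cohens_kappa(agent_1, agent_2):.2f}")  # kappa = 0.75 (substantial)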

Full audit trail

Every code assignment includes written reasoning from each agent that evaluated it. This creates a chain of evidence from the participant's words to the final theme, making every finding traceable and defensible.

Participant said
"We submitted a ticket about a payroll error and didn't hear back for two weeks."
Agent 1 (inclusive)
Participant describes a specific support ticket unanswered for an extended period. Matches "slow response times" and "unanswered tickets." → Poor post-sale support
Agent 2 (conservative)
Explicit mention of an unresolved ticket with a specific timeframe (two weeks). Clear match to codebook definition. → Poor post-sale support
Status
Both agents agree. Auto-confirmed.

Exact agent prompts

Transparency matters. Below are the exact system prompts and codebook formatting each agent receives. Nothing is hidden or summarized. These are the literal instructions that shape each agent's coding behavior.

Agent persona prompts

Each agent receives a persona prompt as the first line of every API call. This shapes how the agent interprets ambiguous evidence and makes borderline decisions.

Agent 1: Inclusive Coder (Opus, temp 0)
You are a thorough, inclusive qualitative coder. Capture both explicit statements and clearly implied meaning. When evidence partially matches a code definition, lean toward including the code. It is better to over-include than to miss a relevant code.
Agent 2: Conservative Coder (Sonnet, temp 0.2)
You are a conservative, precise qualitative coder. Only assign a code when the participant's words clearly and explicitly match the codebook definition. Do not infer or interpret beyond what was said. When in doubt, do not assign the code.
Agent 3: Neutral Arbiter (Sonnet, temp 0)
You are a neutral arbiter resolving a coding disagreement. Review the participant's words, the codebook definition, and both coders' reasoning. Decide strictly based on whether the evidence meets the codebook definition. Do not favor either coder.
Agent 4: Fresh Perspective (Sonnet, temp 0)
You are a balanced qualitative coder. Apply codes when the evidence supports them according to the codebook definition. Do not assign codes when the evidence is ambiguous or insufficient.
Agent 5: Senior Reviewer (Opus, temp 0)
You are a senior quality reviewer making the final determination. Your role is to determine which coding is more defensible given the codebook definition. Consider: which code assignment is more directly supported by the participant's actual words? Which is more consistent with the codebook's inclusion and exclusion criteria?

How codebook emphasis changes the prompt

When presenting codebook definitions, Agent 1 and Agent 2 see the same information in a different order. This creates different cognitive anchoring, similar to how a human reading "include when..." first will approach a decision differently than one reading "exclude when..." first.

Inclusion-first (Agent 1)
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...

INCLUDE when: Any mention of difficulty navigating,
excessive clicks, confusing workflows...

EXCLUDE when: General complaints about the product
that are not specifically about ease of use...
Exclusion-first (Agent 2)
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...

EXCLUDE when: General complaints about the product
that are not specifically about ease of use...

INCLUDE when: Any mention of difficulty navigating,
excessive clicks, confusing workflows...
Balanced (Agents 3, 4, 5)
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...

Include when: Any mention of difficulty navigating,
excessive clicks, confusing workflows...

Exclude when: General complaints about the product
that are not specifically about ease of use...
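
A minimal sketch of how this reordering could be implemented. The field names (code_name, definition, inclusion_criteria, exclusion_criteria) follow the codebook.json template shown later; the helper itself is illustrative, not the pipeline's actual code.

Python (illustrative)
def format_code_entry(code: dict, emphasis: str) -> str:
    # Render one codebook entry with criteria ordered per agent,
    # matching the three layouts shown above.
    inc, exc = code["inclusion_criteria"], code["exclusion_criteria"]
    if emphasis == "inclusion_first":    # Agent 1
        criteria = [f"INCLUDE when: {inc}", f"EXCLUDE when: {exc}"]
    elif emphasis == "exclusion_first":  # Agent 2
        criteria = [f"EXCLUDE when: {exc}", f"INCLUDE when: {inc}"]
    else:                                # Agents 3, 4, 5 (balanced)
        criteria = [f"Include when: {inc}", f"Exclude when: {exc}"]
    header = f"--- CODE: {code['code_name']} ---\nDefinition: {code['definition']}"
    return "\n\n".join([header, *criteria])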

Full prompt template: thematic coding

This is the complete prompt that each coding agent (1, 2, 4) receives for every thematic question. The persona and codebook emphasis sections change per agent. Everything else is identical.

Complete thematic coding prompt
[Agent persona prompt inserted here]

You are coding an interview response for the question: "[question text from codebook]"

Multiple codes CAN apply to a single response. Assign all that apply.

CODEBOOK:
[All codes formatted per agent's emphasis order, each with:
  definition, inclusion criteria, exclusion criteria,
  up to 3 examples with reasoning,
  up to 2 negative examples with reasoning]

PARTICIPANT RESPONSE:
"[participant's actual response text]"

INSTRUCTIONS:
1. For each code in the codebook, explain whether the participant's
   response matches the definition.
2. Be specific: quote the exact words from the response that match
   (or don't match) each code.
3. Then list all codes that apply.

Respond in this exact JSON format:
{
  "reasoning": {
    "code_name_1": "Explanation of why this code does or does not apply",
    "code_name_2": "Explanation of why this code does or does not apply"
  },
  "codes_assigned": ["code_name_1", "code_name_2"],
  "confidence": "high|medium|low"
}
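
For reference, a minimal sketch of dispatching one assembled prompt through the Anthropic Python SDK. The messages.create call and its parameters are the SDK's real interface, but the wrapper function is an assumption about how run_coding.py might be organized, and the model ID must be filled in with a current Opus or Sonnet model name.

Python (illustrative)
import json

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def code_segment(model: str, temperature: float, prompt: str) -> dict:
    # Send one fully assembled coding prompt (persona + codebook +
    # response + instructions) and parse the JSON reply.
    message = client.messages.create(
        model=model,              # e.g. a current Opus or Sonnet model ID
        temperature=temperature,  # 0 for most agents, 0.2 for Agent 2
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(message.content[0].text)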

Full prompt template: disagreement resolution

When Agents 1 and 2 disagree (or when round 1 resolved codes disagree with Agent 4), the resolver agent (3 or 5) receives both coders' reasoning side by side.

Complete resolution prompt
[Resolver agent persona prompt inserted here]

Two independent coders have coded the same interview response and disagree.
Your task is to determine the correct coding based strictly on the codebook
definition.

QUESTION: "[question text from codebook]"

PARTICIPANT RESPONSE:
"[participant's actual response text]"

CODEBOOK:
[All codes with balanced emphasis: definition,
  inclusion criteria, exclusion criteria,
  examples, negative examples]

CODER A's ASSESSMENT:
Codes assigned: ["code_1", "code_2"]
Reasoning: [Coder A's full reasoning for each code]

CODER B's ASSESSMENT:
Codes assigned: ["code_1"]
Reasoning: [Coder B's full reasoning for each code]

INSTRUCTIONS:
1. Review the participant's exact words.
2. Review the codebook definition, inclusion criteria, and exclusion criteria.
3. Evaluate each coder's reasoning.
4. Determine the correct codes based on the codebook definition.
5. If the codebook definition is ambiguous (both coders' interpretations
   are reasonable), flag it.

Respond in this exact JSON format:
{
  "reasoning": "Your step-by-step analysis",
  "codes_assigned": ["code_name_1"],
  "favored_coder": "A|B|neither",
  "definition_ambiguous": true|false,
  "ambiguity_note": "If ambiguous, describe what about the definition
                      is unclear"
}

Other prompt templates

Categorical, rank-order, and binary questions use simpler prompts since they involve assigning a single value rather than multiple thematic codes.

Categorical coding prompt
[Agent persona]

You are coding an interview response for the question:
"[question text]"

Assign exactly ONE category from the list below.

CATEGORIES:
- Satisfied: Uses clearly positive language...
- Mixed: Acknowledges both positives and negatives...
- Dissatisfied: Uses clearly negative language...

PARTICIPANT RESPONSE:
"[response text]"

INSTRUCTIONS:
1. Explain which category best fits the participant's
   response and why.
2. Quote specific words that support your choice.

Respond in JSON:
{
  "reasoning": "Explanation of your coding decision",
  "category_assigned": "category_name",
  "confidence": "high|medium|low"
}
Rank-order coding prompt
[Agent persona]

You are coding an interview response for the question:
"[question text]"

Map the participant's response to exactly ONE bucket
from the list below.

BUCKETS:
- 1-100 (1 to 100)
- 101-500 (101 to 500)
- 501-2000 (501 to 2000)
- 2001-10000 (2001 to 10000)
- 10000+ (10001 to unlimited)

PARTICIPANT RESPONSE:
"[response text]"

INSTRUCTIONS:
1. Extract the relevant value from the response.
2. Determine which bucket it falls into.
3. If ambiguous, assign the closest bucket and note it.

Respond in JSON:
{
  "extracted_value": "The value from the response",
  "reasoning": "How you determined the bucket",
  "bucket_assigned": "bucket_label",
  "confidence": "high|medium|low"
}
Binary coding prompt
[Agent persona]

You are coding an interview response for the question:
"[question text]"

This is a binary (yes/no) determination. Assign exactly
one label: "[positive_label]" or "[negative_label]".

CRITERIA:
Definition of "[positive_label]": [definition]
Code as "[positive_label]" when: [inclusion_criteria]
Code as "[negative_label]" when: [exclusion_criteria]

PARTICIPANT RESPONSE:
"[response text]"

INSTRUCTIONS:
1. Review the participant's exact words.
2. Determine whether the response meets the definition.
3. Quote the specific evidence that supports your
   determination.

Respond in JSON:
{
  "reasoning": "Explanation with quoted evidence",
  "binary_assigned": "[positive]" or "[negative]",
  "codes_assigned": ["[positive]"] or [],
  "confidence": "high|medium|low"
}

Running the pipeline

The coding pipeline is a Python script that orchestrates all five agents automatically. Here is exactly how to set it up and run it.

Prerequisites

1. Python 3.10+ installed on your machine
2. Anthropic Python SDK. Install with: pip install anthropic
3. Anthropic API key with access to Claude Opus and Claude Sonnet models

Step 1: Prepare your input files

The pipeline expects two JSON files in the same folder as the scripts. Use the templates as a starting point.

codebook.json

Your study's coding definitions. One entry per question, with codes, definitions, inclusion/exclusion criteria, and example quotes.

{
  "study_name": "BambooHR Win-Loss Study",
  "version": "1.0",
  "questions": [
    {
      "question_id": "Q1",
      "question_text": "Why did you switch from your
                         previous provider?",
      "coding_type": "thematic",
      "multi_code": true,
      "codes": [
        {
          "code_name": "Poor usability",
          "definition": "Participant describes the
                         product as difficult to use...",
          "inclusion_criteria": "Any mention of
                         difficulty navigating...",
          "exclusion_criteria": "General complaints
                         not about ease of use...",
          "examples": [
            {
              "text": "It took 12 clicks to approve
                       a PTO request",
              "reasoning": "Describes excessive steps"
            }
          ],
          "negative_examples": [
            {
              "text": "It just didn't have what we
                       needed",
              "reasoning": "Missing features, not
                            usability"
            }
          ]
        }
      ]
    }
  ]
}
transcripts.json

Your interview transcripts. One entry per participant, with their responses matched to question IDs from the codebook.

{
  "participants": [
    {
      "participant_id": 1,
      "metadata": {
        "name": "Participant 1",
        "date": "2026-01-15"
      },
      "transcript": [
        {
          "question_id": "Q1",
          "response": "Honestly the biggest thing
                       was onboarding new hires.
                       It was such a clunky process,
                       like you'd have to re-enter
                       their info in three different
                       places."
        },
        {
          "question_id": "Q2",
          "response": "It does the job but there
                       are definitely things that
                       frustrate me."
        }
      ]
    }
  ]
}
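
Before running the pipeline, you can sanity-check the two files against each other. A minimal sketch, assuming the JSON is valid (the templates above wrap long strings for display only) and using the filenames and keys from those templates; the validation run_coding.py actually performs may differ.

Python (illustrative)
import json
from pathlib import Path

def load_inputs(folder: str = ".") -> tuple[dict, dict]:
    # Load both input files and confirm every transcript response
    # points at a question defined in the codebook.
    base = Path(folder)
    codebook = json.loads((base / "codebook.json").read_text())
    transcripts = json.loads((base / "transcripts.json").read_text())

    known = {q["question_id"] for q in codebook["questions"]}
    for p in transcripts["participants"]:
        for turn in p["transcript"]:
            if turn["question_id"] not in known:
                raise ValueError(
                    f"Participant {p['participant_id']}: question "
                    f"{turn['question_id']} is not in codebook.json"
                )
    return codebook, transcripts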

Step 2: Set your API key

Set the ANTHROPIC_API_KEY environment variable before running the script. Do not paste your key directly into the config file.

Terminal
# On Mac/Linux:
export ANTHROPIC_API_KEY=sk-ant-your-key-here

# On Windows (PowerShell):
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"

# On Windows (Command Prompt):
set ANTHROPIC_API_KEY=sk-ant-your-key-here

Step 3: Run the pipeline

Navigate to the pipeline folder and run the script. The pipeline will process all participants and questions automatically, printing progress as it goes.

Terminal
# Navigate to the pipeline folder
cd website/how-to-code-transcripts/pipeline

# Run the full 5-agent pipeline
python run_coding.py

What happens when you run it

The script runs all five agents sequentially. For a typical project (100 interviews, 12 questions each), expect 75-120 minutes of processing time and approximately $110-130 in API costs. You will see real-time progress updates in your terminal as each segment is coded.

Step 4: Review the outputs

The pipeline creates an output/ folder with these files:

File | What it contains | When to read it
final_codes.json | The final validated code assignments for every segment, with reasoning | This is your primary deliverable
reliability_summary.json | Per-code kappa values and overall kappa for both rounds | Check that overall kappa is above 0.65
flagged_items.json | Codes with low reliability and codebook definitions flagged as ambiguous | If any codes are flagged, revise the codebook definitions and re-run
round_1_reliability.txt | Human-readable reliability report: Agent 1 vs Agent 2 | To understand where and why agents disagreed in round 1
round_2_reliability.txt | Human-readable reliability report: resolved codes vs Agent 4 | To understand residual disagreements after resolution
agent_1_codes.json | Agent 1's raw coding with full reasoning per segment | For audit trail or to understand specific coding decisions
agent_2_codes.json | Agent 2's raw coding with full reasoning per segment | For audit trail or to compare with Agent 1
agent_4_codes.json | Agent 4's independent re-coding with reasoning | For audit trail or round 2 analysis
round_1_resolved.json | Agent 3's resolved codes after round 1 disagreement resolution | To trace the resolution chain

Customizing the configuration

All settings live in config.py. The defaults work well for most projects. Settings you might adjust:

Setting | Default | What it controls
KAPPA_THRESHOLD | 0.65 | Minimum Cohen's Kappa to consider a code reliable. Below this, the code is flagged for review.
MIN_THEME_FREQUENCY | 0.05 | A theme must appear in at least 5% of responses to stand alone. Below this, consider merging into a broader theme.
MAX_OTHER_FREQUENCY | 0.15 | If more than 15% of responses land in "Other," your codebook likely has gaps. Add new themes.
BATCH_DELAY_SECONDS | 0.5 | Pause between API calls to respect rate limits. Increase if you hit rate-limit errors.
MAX_RETRIES | 3 | How many times to retry if an API call returns malformed JSON.
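
As an illustration of how the last two settings interact, here is a hedged sketch of a retry loop; send_prompt is a hypothetical callable returning the model's raw text (for example, a wrapper over the SDK call shown earlier), not a function the pipeline is known to define.

Python (illustrative)
import json
import time

BATCH_DELAY_SECONDS = 0.5  # config.py defaults
MAX_RETRIES = 3

def call_with_retries(send_prompt, prompt: str) -> dict:
    # Re-ask the model when the reply is not valid JSON, pausing
    # between attempts; raise after MAX_RETRIES failures.
    last_error = None
    for _ in range(MAX_RETRIES):
        raw = send_prompt(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            time.sleep(BATCH_DELAY_SECONDS)
    raise RuntimeError(f"Malformed JSON after {MAX_RETRIES} attempts") from last_error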

Methodological foundations

Our approach is grounded in established qualitative research methods, each backed by decades of peer-reviewed evidence.

Braun, V. & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101. Foundational framework for thematic analysis; one of the most cited methods papers in social science.

Braun, V. & Clarke, V. (2022). Thematic Analysis: A Practical Guide. SAGE Publications. Distinguishes three TA variants (Reflexive, Codebook, Coding Reliability) and provides updated guidance for each.

Graneheim, U.H. & Lundman, B. (2004). Qualitative content analysis in nursing research: concepts, procedures and measures to achieve trustworthiness. Nurse Education Today, 24(2), 105-112. Standard framework for segmenting transcripts into meaning units and establishing coding trustworthiness.

Saldana, J. (2016). The Coding Manual for Qualitative Researchers (3rd ed.). SAGE Publications. Defines first-cycle and second-cycle coding methods, including descriptive, in vivo, and process coding approaches.

Deterding, N.M. & Waters, M.C. (2021). Flexible coding of in-depth interviews: a twenty-first-century approach. Sociological Methods & Research, 50(2), 708-739. Recommends the two-pass coding approach: an initial indexing pass followed by a focused coding pass.

Hsieh, H.-F. & Shannon, S.E. (2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15(9), 1277-1288. Defines directed content analysis for coding open-ended responses into predefined categories (rank-order and categorical variables).

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46. Introduces Cohen's Kappa, the standard inter-rater reliability metric correcting for chance agreement.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). SAGE Publications. Defines Krippendorff's Alpha and establishes reliability thresholds (α ≥ 0.667 for tentative, α ≥ 0.80 for firm conclusions).

Landis, J.R. & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. Establishes the standard interpretation scale for kappa values (slight, fair, moderate, substantial, almost perfect).

Gao, J. et al. (2024). CollabCoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. Proceedings of the ACM on Human-Computer Interaction (CSCW). Demonstrates that human-AI collaborative coding reduces coding time by ~50% while maintaining intercoder reliability.

Pangakis, N., Wolken, S. & Fasching, N. (2023). Automated annotation with generative AI suggests promising avenues for qualitative research. arXiv preprint. Shows that structured codebook prompts with definitions and examples improve AI coding accuracy by 15-25 percentage points.

Miles, M.B., Huberman, A.M. & Saldana, J. (2014). Qualitative Data Analysis: A Methods Sourcebook (3rd ed.). SAGE Publications. Comprehensive reference for qualitative coding methods and quality standards, including the 0.80 threshold on 95% of codes.