Research methodology

How we code interview transcripts

Our approach to qualitative analysis transforms open-ended interview responses into structured, reliable findings. Every code assignment is evidence-based, every theme is validated, and every conclusion is defensible.

Why systematic coding matters

Interview transcripts are rich with insight, but without systematic analysis, findings become anecdotal. Two analysts reading the same transcripts can reach different conclusions. Themes can be too broad (losing nuance) or too narrow (missing patterns). And there is no way to demonstrate to stakeholders that the results are reliable.

Systematic coding solves these problems. It provides a structured, repeatable process for transforming open-ended responses into quantified findings, with built-in quality checks that ensure accuracy and consistency.

κ ≥ 0.65
Inter-rater reliability target for all coded data
9
AI agents: 6 across discovery phases, 3 in application
6+3
Six discovery phases build one canonical codebook, then 3 agents apply it

10 steps across 3 phases

Phase 1: Discovery (6 agents, 6 phases, codebook + segmentation dimensions)
1
Phase 1: Classify every question as thematic, categorical, binary, or rank-order. Determines the coding type for each question so the right extraction logic is applied downstream.
2
Phase 2: Two extractors independently break responses into meaning units (parallel). Thorough and precise Sonnet agents extract codes from all transcripts independently, in parallel across questions.
3
Phase 3: Clusterer compresses meaning units into semantic groups per question (parallel). A preliminary grouping agent reduces hundreds of raw codes per question into ~20-40 rough clusters, making global synthesis computationally feasible.
4
Phase 4: Global synthesizer names themes once from all questions combined. A single call sees every cluster from every thematic question. Theme names are coined here and used everywhere. This is the methodological heart of the design.
5
Phase 5: Per-question validator applies the global codebook to each question (parallel). Confirms which global themes are relevant per question, adds question-specific coding notes, and flags patterns the global codebook missed.
6
Phase 6: Dimension Architect classifies themes into segmentation dimensions. Reads the completed codebook and study_config.json to produce the dimensions section: which variables are defining (clustering inputs), outcome (validation only), or profiling (post-clustering description). A human review gate covers both themes and dimensions before application.
Phase 2: Application (3 agents apply the codebook)
7
Agent 1 + Agent 2 code all segments independently. Inclusive and conservative personas with opposing codebook emphasis code every response.
8
Measure reliability (Cohen's Kappa per code). Segments where both coders agree are auto-finalized. Disagreements proceed to resolution.
9
Agent 3 resolves disagreements. A neutral arbiter reviews both coders' reasoning against the codebook. Its decision is final.
Phase 3: Segmentation Preparation (segment-prep.py validates dimensions)
10
segment-prep.py validates dimensions and outputs cluster-ready data. Reads final_codes.json and the dimensions section of codebook.json. Runs variance checks on every defining dimension (20–80% rule for binary; meaningful spread for ordinal). Enforces the N/10 ceiling. Outputs segmentation-ready.csv, segmentation-profile.csv, and a segmentation-validation-report.txt showing which dimensions pass or require reclassification.

Our approach: Codebook Thematic Analysis

Not all thematic analysis is created equal. Braun and Clarke (2021, 2022) identify three distinct variants, each with different strengths and trade-offs. We use Codebook Thematic Analysis because it combines the rigor clients expect with the flexibility that real interview data demands.

Reflexive TA

Highly flexible
Captures deep interpretation
No codebook
No reliability measurement
Results vary by analyst

Codebook TA

Our approach
Structured codebook with definitions
Supports inter-rater reliability
Iterative, evolving codebook
Works with AI-assisted coding
Scales to large datasets

Coding reliability TA

Highest reliability
Fixed, testable codebook
Rigid, cannot evolve
Misses unexpected themes
Requires all codes upfront

Why Codebook TA?

In applied market research, we rarely know every theme before reading the data. Reflexive TA gives us no way to prove our coding is reliable. Coding Reliability TA locks us into a fixed framework that cannot adapt. Codebook TA gives us the best of both: a structured codebook that evolves iteratively as we discover what the data contains, with reliability measurement built in.

Four types of interview questions

Different questions produce different kinds of data. Each type requires a distinct coding approach, matched to the structure of the response.

Rank-order variables

Open responses mapped to an ordinal scale with predefined buckets.

Example question
"How large is your company?"
Example response
"We're about 800 people globally"
Coded as
500-1,000 employees
Method: Directed content analysis (Hsieh & Shannon, 2005). Predefined categories with clear boundaries.
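As a sketch, directed content analysis for a rank-order variable reduces to mapping an extracted value onto predefined buckets. The boundaries below are illustrative, not a study's actual scale; the function name is hypothetical:

```python
def size_bucket(employees: int) -> str:
    """Map an extracted head-count onto a predefined ordinal bucket.

    Bucket boundaries are illustrative; each study defines its own scale."""
    if employees <= 100:
        return "1-100 employees"
    if employees <= 500:
        return "101-500 employees"
    if employees <= 1000:
        return "500-1,000 employees"
    return "1,000+ employees"

# "We're about 800 people globally" -> extracted count 800
print(size_bucket(800))  # -> 500-1,000 employees
```

The clear boundaries are what make this coding type highly reliable: once the number is extracted, the bucket assignment is deterministic.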

Categorical variables

Single-dimension classification into a small set of distinct categories.

Example question
"How do you feel about your current tool?"
Example response
"It does the job but there are definitely things that frustrate me"
Coded as
Mixed
Method: Directed content analysis with anchor descriptions defining each category level.

Thematic coding

Complex, open-ended responses broken into individual statements and grouped into validated themes.

Example question
"Why did you switch providers?"
Example response
"The onboarding was clunky, we had to re-enter data in three places, and honestly we were paying too much for what we got"
Coded as
Poor usability, Missing features, Cost concerns
Method: Full Codebook Thematic Analysis with meaning unit segmentation, two-pass coding, and theme validation.

Binary variables

Yes/no determination from open-ended responses. The most common variable type in market research coding.

Example question
"Did you evaluate other vendors before choosing?"
Example response
"We looked at a few other options but they were all too expensive"
Coded as
Yes
Method: Presence/absence coding (Krippendorff, 2004). Binary determination against a codebook definition.

How our four types cover the full literature taxonomy

The qualitative coding and survey methodology literature (Saldana, 2016; Krippendorff, 2004; Hsieh & Shannon, 2005; Miles, Huberman & Saldana, 2014) recognizes nine distinct variable types that researchers extract from open-ended interview responses. Our four native coding types cover all nine, either directly or through configuration.

Variable type | What it is | Example | How we handle it
Thematic | Multiple themes per response | "Why did you switch?" coded as Poor usability + Cost concerns | Native type
Categorical | One label from 3+ unordered options | Satisfaction coded as Satisfied / Mixed / Dissatisfied | Native type
Rank-order | One bucket from an ordered scale | Company size coded as 101-500 employees | Native type
Binary | Yes/no, present/absent | "Did they evaluate competitors?" coded as Yes | Native type
Sentiment / Valence | Positive / neutral / negative attitude | Tone toward vendor support coded as Negative | Categorical with Positive/Mixed/Negative options
Frequency / Intensity | How often or how strongly | Dashboard usage coded as Daily / Weekly / Monthly | Rank-order with frequency buckets
Temporal | When something happened | Started evaluating coded as Q3 2025 | Categorical or Rank-order
Numeric / Continuous | Extract an exact number | Annual budget coded as $250,000 | Rare. Use rank-order buckets instead; verbal precision rarely warrants exact extraction.
Multi-code Ordered | Multiple ranked codes | Top priorities coded as Cost (1st), Speed (2nd) | Rare. Use separate rank-order variables per item, or thematic without ranking.

The coding process

Our thematic coding follows a four-step process grounded in established qualitative research methods. Each step has specific rules and quality checks.

1

Segment into meaning units

Each response is broken into discrete meaning units: the smallest segment of text that contains a single idea or claim (Graneheim & Lundman, 2004). A participant who says three different things gets three separate meaning units, each coded independently.

Before segmentation
"The onboarding was clunky, we had to re-enter data in three places, and we were paying too much"
After segmentation
MU-1 "The onboarding was clunky"
MU-2 "we had to re-enter data in three places"
MU-3 "we were paying too much"
2

First-cycle coding

Each meaning unit receives a descriptive code: a short label (2-5 words) capturing what the statement is about (Saldana, 2016). Codes use the participant's own language where it is distinctive ("in vivo coding") and standardized labels where consistency matters.

"The onboarding was clunky" Poor usability
"re-enter data in three places" Duplicate data entry
"paying too much" Cost concerns
3

Theme construction (two-pass approach)

Related codes are grouped into themes across all questions at once, not question by question. This is the critical design choice that ensures theme names are consistent across the full study. We use a two-pass approach (Deterding & Waters, 2021):

Phase 1: Discovery (6 agents, 6 phases)
Two extractors independently break responses into meaning units per question (in parallel). A clusterer compresses those units into rough semantic groups. A global synthesizer sees all clusters from all thematic questions in one call and names themes once for the entire study. A per-question validator applies the global codebook to each question and flags gaps. A dimension architect classifies the finished themes into segmentation dimensions. A human researcher reviews the final codebook before application begins.
Phase 2: Application (3 agents)
Three independent AI agents apply the validated codebook to all transcripts. Two coders with opposing biases code independently. A neutral arbiter (Claude Opus) resolves disagreements. Cohen's Kappa measures reliability per code.
Theme: Product usability issues
Poor usability, Confusing navigation, Too many clicks, Duplicate data entry
4

Theme refinement

Candidate themes are tested against specific rules to ensure they are coherent, distinct, and analytically useful. Themes may be split, merged, or reorganized based on these checks.

Theme validation rules

Themes are not arbitrary groupings. Each must pass specific validation criteria before it enters the final analysis.

5% minimum frequency

A theme must be mentioned by at least 5% of participants to stand on its own. Themes below this threshold are merged with related themes or moved to "Other." This prevents findings from being driven by isolated comments.

Split when two concepts emerge

If the quotes within a theme cluster into two or more distinct ideas, the theme is too broad. A theme about "convenience" that contains both "close to my office" and "fast service" captures two different concepts and should be split for actionable analysis.

Merge at 70%+ participant overlap

When 70% or more of participants who mention Theme A also mention Theme B, the themes likely represent the same underlying concept. They are merged into a single theme to avoid double-counting and simplify the analysis.

Target 6-10 themes per question

Fewer than 4 themes for an open-ended question usually means important distinctions are being lost. More than 12 usually means themes are not abstracted enough. Sub-themes preserve nuance within the 6-10 target range.

15%

"Other" capped at 15%

If the "Other" category exceeds 15% of responses, a meaningful pattern is being missed. The uncategorized responses are reviewed to identify hidden themes that should be added to the codebook.

Sub-themes preserve detail

Broad themes work for executive summaries. Sub-themes provide the detail needed for actionable recommendations. A theme like "Value for money" (21%) might contain sub-themes for "Low absolute prices" (14%), "Deals and promotions" (8%), and "Portion value" (5%).
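A minimal sketch of the three frequency-based rules above (5% minimum, 70% merge overlap, 15% "Other" cap), assuming themes are tracked as sets of participant IDs; the function name and flag strings are illustrative:

```python
def validate_themes(mentions: dict[str, set[str]], n_participants: int,
                    other_count: int) -> list[str]:
    """Flag themes that violate the frequency-based validation rules.

    mentions maps theme name -> set of participant IDs who mention it."""
    flags = []
    # Rule 1: a theme needs at least 5% of participants to stand alone.
    for theme, who in mentions.items():
        if len(who) / n_participants < 0.05:
            flags.append(f'LOW-FREQUENCY: "{theme}" below the 5% minimum')
    # Rule 2: if >= 70% of Theme A's participants also mention Theme B,
    # the pair is a merge candidate.
    for a in mentions:
        for b in mentions:
            if a == b or not mentions[a]:
                continue
            overlap = len(mentions[a] & mentions[b]) / len(mentions[a])
            if overlap >= 0.70:
                flags.append(f'MERGE-CANDIDATE: "{a}" overlaps "{b}" ({overlap:.0%})')
    # Rule 3: an oversized "Other" bucket hides unnamed themes.
    if other_count / n_participants > 0.15:
        flags.append('REVIEW: "Other" exceeds the 15% cap')
    return flags
```

The split rule (two concepts inside one theme) is the one check that resists automation; it requires reading the quotes, which is why a human review gate sits between discovery and application.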

The two-phase, 9-agent system

Every study passes through a two-phase, nine-agent process. In Phase 1, six discovery agents work across six sequential phases to build a canonical codebook and segmentation dimension structure from all transcripts. In Phase 2, three application agents apply that codebook with built-in reliability measurement. No single agent's judgment is trusted in isolation, and a human review gate separates the two phases.

The core validity problem: theme names coined per question

Earlier AI coding systems synthesized themes independently for each question. This creates a subtle but serious validity problem: the same underlying concept can get different names depending on which question surfaces it first. An agent analyzing Q5 ("Why did you switch?") might name a theme "Ease of use problems." The same agent, working independently on Q10 ("What frustrated you most?"), might name the same concept "Interface complexity" or "Faster processing" because it had no memory of what it had coined before.

When that happens, findings become incomparable across questions. A researcher cannot sum "Ease of use problems" from Q5 with "Interface complexity" from Q10 without knowing they represent the same thing. The analysis appears fragmented, and cross-question patterns are invisible.

How the global-synthesis design solves this

Braun and Clarke (2021) state that themes should be developed from the full dataset, not constructed within individual data items. Our Phase 4 global synthesizer does exactly this: it receives compressed clusters from every thematic question simultaneously and names themes once. Theme names coined in Phase 4 are used in Phase 5 validation for every question. "Ease of use problems" is the name for that concept everywhere in the study, regardless of which question triggered it.

This is not just operational tidiness. Cross-question theme consistency is a validity requirement. Without it, percentage breakdowns across questions cannot be meaningfully compared, and study-wide frequency claims are indefensible.

Why multiple independent agents?

A single AI coder, no matter how accurate, provides no way to measure reliability. The same data could be coded differently by a different system, and there would be no way to know which is correct. Multiple independent agents solve this by replicating the gold-standard practice of inter-rater reliability from human qualitative research (Cohen, 1960; Krippendorff, 2004), but without the time, cost, and fatigue limitations of human coders.

Five agents use Claude Sonnet for high-volume extraction, per-question validation, and application coding. Four agents use Claude Opus: D3 (Preliminary Clusterer), D4 (Global Theme Architect), D6 (Dimension Architect), and Agent 3 (Resolver). Opus scores 17 points higher than Sonnet on expert reasoning benchmarks (GPQA Diamond: 91.3% vs. 74.1%). Each of these agents makes decisions that either cannot be corrected downstream or that propagate into every subsequent step.

Why D3 and D4 use Opus: the irreversible information loss problem

D4 can only see what D3 passes it. D3 receives ~150 raw descriptive codes per question and compresses them into ~30 clusters. D4 receives those clusters — not the original codes. If D3 collapses two distinct concepts into one cluster, D4 has no way to recover the distinction. The information loss is permanent.

This asymmetry drives the choice of Opus for D3. The failure modes are not symmetric:

Under-clustering (recoverable)
D3 keeps two concepts as separate clusters when they could have been merged. D4 sees both clusters, recognizes they represent the same underlying idea, and combines them into one theme. No information is lost. D4 is well-suited to this merging task because it sees cross-question context that D3 does not.
Over-merging (permanent)
D3 collapses two distinct concepts into one cluster. D4 receives one cluster entry. D4 can only catch this if the representative quote reveals both concepts — if the quote is ambiguous, D4 names one theme and the distinction is gone. Example: "absolute price too high" and "poor ROI for what you paid" are different buying signals. A Sonnet D3 might merge them into "cost concerns." An Opus D3 keeps them separate and lets D4 decide whether to merge.

D3 Opus persona: conservative by design

The D3 Opus persona is explicitly instructed: "When in doubt about whether two codes belong together, keep them separate. Under-clustering is recoverable downstream; over-merging is not." This is the opposite of the original Sonnet D3 persona, which was told to "err toward over-clustering." That instruction was wrong — it protected against under-clustering (the recoverable failure) while accepting over-merging (the permanent failure).

D4 uses Opus for a separate reason: it is the single most consequential call in the entire pipeline. Every theme name coined in Phase 4 is used in Phase 5 validation, in the final codebook, and in every application code assignment. Sonnet produces acceptable codebooks. Opus produces sharper conceptual distinctions, more precise definitions, and is less likely to conflate concepts that are related but analytically distinct.

Combined, the upgrade adds approximately $2.56 per discovery run. For a $50K–$150K consulting engagement, this is noise. The protection against permanent information loss at D3 and poor theme naming at D4 is worth it unconditionally.

Two ways to run the pipeline

The pipeline supports two modes. If you already have a codebook (from a previous study or written by hand), you can run application only. For new studies, the full pipeline discovers the codebook first, then applies it.

Application only (3 agents)
When to use You already have a validated codebook from a prior study or written by a researcher.
Agents 2 independent coders + 1 resolver
Process Agent 1 and Agent 2 code independently. Calculate kappa. Agent 3 resolves disagreements.
Reliability checks Cohen's Kappa per code (Agent 1 vs. Agent 2)
Estimated accuracy ~93%
Est. cost (100 interviews) ~$29
Processing time ~30-45 minutes
Full pipeline (9 agents) Our approach
When to use New study. No existing codebook. The pipeline discovers themes from the data.
Agents 6 discovery agents across 6 phases (2 extractors + 1 clusterer + 1 global synthesizer + 1 per-question validator + 1 dimension architect) then 3 application agents (2 coders + 1 resolver)
Process Discovery builds one canonical codebook from all questions. Human reviews it. Application codes all transcripts against it.
Reliability checks Global codebook validated per question in Phase 5 discovery. Cohen's Kappa per code in application.
Estimated accuracy ~93% coding accuracy
Est. cost (100 interviews) ~$37-44 ($8-15 discovery + ~$29 application)
Processing time ~75-105 minutes (45-60 min discovery + 30-45 min application)

Where the accuracy gains come from

The two-phase approach builds accuracy at every step. Most interview responses are clear-cut: the participant's words either match a codebook definition or they do not. The application coders agree on roughly 80% of segments, and those agreements are almost always correct.

1
Dual extraction catches what a single extractor would miss. Two independent discovery agents read every transcript. Each one surfaces meaning units the other overlooked, producing a more complete set of initial codes.
2
Global theme naming eliminates cross-question inconsistency. A single Phase 4 global synthesizer sees all clusters from all thematic questions simultaneously and names themes once. This prevents the same concept from being called "Ease of use" in Q5 and "Interface friction" in Q10 because two independent agents never saw each other's work.
2a
Per-question validation catches what the global codebook misses. Phase 5 applies the global codebook to each question's actual responses. Question-specific patterns that the global pass did not cover are flagged as additions, and gaps in coverage are surfaced for human review before application begins.
3
Two independent coders with opposite biases in the application phase. One coder leans toward inclusion, the other toward exclusion. Where both agree despite opposite tendencies, confidence is very high. Where they disagree, it surfaces genuine ambiguity for the resolver.
4
Agreement filtering. On the roughly 80% of segments where both coders agree, accuracy exceeds 95%. The resolver focuses its attention on the remaining 20%, where its reasoning-based adjudication adds the most value.
5
Six prompt-level coding safeguards. Every agent prompt includes explicit constraints that target known LLM coding errors identified in the research literature: (1) word-boundary enforcement prevents matching words inside longer words, (2) negation awareness prevents "not satisfied" from being coded as satisfaction, (3) sarcasm detection flags ironic statements for intended-meaning coding, (4) hedging sensitivity distinguishes "I guess it's okay" from "it's great," (5) absence-is-not-denial prevents treating silence on a topic as evidence, and (6) full-response context prevents decontextualized phrase-matching.
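Safeguards (1) and (2) are the most mechanical of the six and can be illustrated as deterministic checks. In the pipeline they are enforced inside the agent prompts; the helper names and the negator list below are illustrative:

```python
import re

def word_boundary_match(code_term: str, text: str) -> bool:
    # Safeguard 1: \b prevents "cost" from matching inside "Pentecost".
    return re.search(rf"\b{re.escape(code_term)}\b", text, re.IGNORECASE) is not None

NEGATORS = {"not", "never", "no", "hardly", "isn't", "wasn't"}

def is_negated(term: str, text: str, window: int = 3) -> bool:
    # Safeguard 2: a negator within a few tokens before the term flips
    # its meaning ("not satisfied" must not be coded as satisfaction).
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        if term in tok and NEGATORS & set(tokens[max(0, i - window):i]):
            return True
    return False
```

The remaining safeguards (sarcasm, hedging, absence-is-not-denial, full-response context) require interpretation of intent, which is why they live in the prompts rather than in code.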

Our expected accuracy: ~93%

This estimate is based on three factors:

Baseline AI coding accuracy
Research shows that LLMs with structured codebook prompts (definitions, inclusion/exclusion criteria, examples) achieve 78-85% agreement with expert human coders on thematic coding tasks (Pangakis, Wolken & Fasching, 2023; Gao et al., 2024). This is our starting point for each individual agent.
Multi-agent agreement filtering
When two independent agents with different biases both assign the same code, accuracy on those agreed segments exceeds 95%. Agreement between agents with designed-in independence is a strong signal. Roughly 80% of segments fall into this high-confidence category.
Cross-question codebook + Opus resolver adjudication
The discovery phase produces a canonical codebook validated against every question's actual responses before coding begins. The remaining 20% of segments (disagreements between coders) go through an Opus-powered resolver that reviews both agents' reasoning against the codebook definition. Opus scores 17 points higher than Sonnet on expert reasoning benchmarks, improving resolver accuracy from ~75% to ~85% on disputed segments. This pushes overall accuracy to approximately 93%.

How to interpret "~93% accuracy"

This means that for every 100 code assignments the pipeline makes, approximately 93 will match what an expert human coder would assign. The remaining cases are borderline situations where the codebook definition is genuinely ambiguous, the participant's language is unclear, or a code partially applies. These edge cases are flagged in the output with reasoning chains, so they can be reviewed if needed.

Agent configuration

Independence between agents is not automatic. Two identical AI systems given identical inputs will produce identical outputs, proving nothing. We design genuine independence into each agent using different temperatures, persona framing, and task-specific roles.

Discovery agents (Phase 1) — 6 phases

Agent | Phase | Model | Temp | Persona | Role
D1 | 1 + 2 | Sonnet | 0.2 | Thorough, nuanced | Classify questions; Extractor A
D2 | 2 | Sonnet | 0 | Precise, conservative | Extractor B
D3 | 3 | Opus | 0 | Conservative grouper | Clusterer — errs toward finer-grained clusters to prevent irreversible over-merging
D4 | 4 | Opus | 0 | Senior methodologist | Global theme synthesizer (all questions, one call)
D5 | 5 | Sonnet | 0 | Qualitative methodologist | Per-question validator (applies global codebook, parallel)
D6 | 6 | Opus | 0 | Senior methodologist | Dimension Architect (classifies themes into segmentation dimensions)

Application agents (Phase 2)

Agent | Model | Temp | Persona | Emphasis | Role
1 | Sonnet | 0 | Inclusive | Inclusion first | Primary Coder A
2 | Sonnet | 0.3 | Conservative | Exclusion first | Primary Coder B
3 | Opus | 0 | Neutral arbiter | Balanced | Resolver
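The two tables above can be read as a single configuration sketch. The dictionary layout and the `A1`/`A2`/`A3` labels for the application agents are illustrative, and the persona strings are abbreviations of the full prompts:

```python
# Discovery (Phase 1). D1 both classifies questions and acts as Extractor A.
AGENTS = {
    "D1": {"model": "sonnet", "temp": 0.2, "persona": "thorough, nuanced"},
    "D2": {"model": "sonnet", "temp": 0.0, "persona": "precise, conservative"},
    "D3": {"model": "opus",   "temp": 0.0, "persona": "conservative grouper"},
    "D4": {"model": "opus",   "temp": 0.0, "persona": "senior methodologist"},
    "D5": {"model": "sonnet", "temp": 0.0, "persona": "qualitative methodologist"},
    "D6": {"model": "opus",   "temp": 0.0, "persona": "senior methodologist"},
    # Application (Phase 2).
    "A1": {"model": "sonnet", "temp": 0.0, "persona": "inclusive"},
    "A2": {"model": "sonnet", "temp": 0.3, "persona": "conservative"},
    "A3": {"model": "opus",   "temp": 0.0, "persona": "neutral arbiter"},
}

# Paired agents never share both temperature and persona, so agreement
# between them is informative rather than guaranteed.
assert AGENTS["A1"]["temp"] != AGENTS["A2"]["temp"]
```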

How we ensure genuine independence

Different personas

In discovery, Extractor A is thorough and nuanced while Extractor B is precise and conservative. In application, Coder A leans toward including borderline cases while Coder B leans toward excluding them. Opposite biases surface genuine ambiguity.

Temperature variation

Extractors use 0.2 vs. 0. Application coders use 0 vs. 0.3. Slight randomness on borderline decisions mirrors the natural variation between human coders without degrading accuracy on clear-cut cases.

Codebook emphasis

Application Coder A sees inclusion criteria first for each code. Coder B sees exclusion criteria first. This creates different cognitive anchoring without changing the actual rules.

Dual extraction

Two independent agents read the same transcripts and extract meaning units separately. Each one catches codes the other overlooks, producing a more complete foundation for the codebook.

Phase separation

Discovery agents have no connection to application agents. The codebook is the only artifact that passes between phases, reviewed by a human researcher before application begins.

The two-phase process

Phase 1 Discovery: build one canonical codebook
D1
Phase 1: Classify all questions

Identifies each question as thematic, binary, categorical, or rank-order so downstream agents apply the right logic.

D1 D2
Phase 2: Dual extraction per thematic question (parallel)

Two extractors independently break each thematic question's responses into meaning units with descriptive codes. Neither sees the other's work. Run in parallel across questions.

D3
Phase 3: Per-question clustering (parallel)

Compresses hundreds of raw codes per question into ~20-40 rough semantic clusters. Strips responses to code labels only to keep input tokens manageable. Representative quotes are recovered by participant ID afterward.

D4
Phase 4: Global theme synthesis (single call)

Receives all clusters from all thematic questions in one call. Names themes once for the entire study. This is where the canonical codebook is born — no question-level naming, no drift across questions.

D5
Phase 5: Per-question validation (parallel)

Applies the global codebook to each question's actual responses. Identifies which global themes appear, adds question-specific coding notes, and flags patterns the global codebook missed.

D6
Phase 6: Dimension Architect builds the segmentation variable structure

Reads the completed codebook and study_config.json to produce a dimensions section appended to codebook.json. Each dimension is classified as defining (goes into cluster analysis), outcome (used for cluster validation only), or profiling (describes segments post-clustering). The maximum number of defining dimensions is sample_size / 10 — the N/10 ceiling for stable clusters. The agent produces the minimum number of dimensions needed, not the maximum allowed.

Human review gate

A researcher reviews two things: (1) the codebook themes, per-question coverage, and flagged gaps; (2) the dimensions section — checking that defining/outcome/profiling classifications and composite groupings are correct. Makes adjustments and approves before application begins.

Phase 2 Application: code the transcripts
1 2
Agent 1 and Agent 2 code all segments independently

Neither sees the other's work. Both produce codes with written reasoning for every assignment.

κ
Calculate Cohen's Kappa per code

Measure agreement between Agent 1 and Agent 2, corrected for chance. Segments where both agree are auto-finalized.

3
Agent 3 resolves all disagreements

Reviews both coders' reasoning against the codebook definition. Picks the correct code. Flags ambiguous definitions for human review.

Phase 3 Segmentation Preparation: validate dimensions before cluster analysis
prep
Read coded data and dimension structure

Loads coded/final_codes.json (Phase 2 output) and the dimensions section of codebook.json (discovery Phase 6 output). These two files together tell the script which fields exist and how to classify each one.

σ
Variance checks on every defining dimension

For each defining dimension: binary dimensions must have 20–80% positive prevalence across participants; ordinal dimensions must have meaningful spread (not near-constant). Any dimension that fails is flagged for reclassification to profiling before cluster analysis. Also enforces the N/10 ceiling — if the number of defining dimensions exceeds sample_size / 10, a warning is raised and the lowest-variance dimensions should be dropped.

csv
Output three files

segmentation-ready.csv — one row per participant, one column per defining dimension, all values numeric. This is the direct input to cluster analysis. segmentation-profile.csv — outcome and profiling variables for post-clustering segment description. segmentation-validation-report.txt — per-dimension variance stats, ceiling check, and a summary line: "Ready for cluster analysis" or "Warnings require review before proceeding."
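The two validation rules above can be sketched as follows, assuming binary defining dimensions arrive as 0/1 lists per participant; the function names are illustrative, not segment-prep.py's actual API:

```python
def check_binary(values: list[int]) -> bool:
    """20-80% rule: a binary defining dimension must have positive
    prevalence between 20% and 80% to carry clustering signal."""
    prevalence = sum(values) / len(values)
    return 0.20 <= prevalence <= 0.80

def check_ceiling(n_defining: int, sample_size: int) -> bool:
    """N/10 ceiling: at most sample_size / 10 defining dimensions."""
    return n_defining <= sample_size // 10

# A 12%-prevalence dimension fails and should be reclassified to profiling.
print(check_binary([1] * 6 + [0] * 44))  # -> False
# Six defining dimensions with N=50 exceed the ceiling of five.
print(check_ceiling(6, 50))              # -> False
```

A dimension that nearly everyone (or almost no one) exhibits cannot separate participants into clusters, which is why failing dimensions are demoted to profiling rather than dropped.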

The codebook: precision in every definition

The codebook is the single most important factor in coding quality. Research shows that codebooks with full definitions, inclusion/exclusion criteria, and example quotes improve coding accuracy by 15-25 percentage points compared to code labels alone (Pangakis, Wolken, & Fasching, 2023). Each theme entry includes five components:

Example codebook entry
Theme name
Poor post-sale support responsiveness
Definition
Participant describes slow response times, unanswered tickets, difficulty reaching a real person, or long resolution times after becoming a customer.
Include
Any mention of delayed support responses, unresolved issues, phone trees, or being "passed around" between departments.
Exclude
Pre-sale experience, onboarding difficulties, product bugs. These are separate themes.
Example quote
"We submitted a ticket about a payroll error and didn't hear back for two weeks."

Measuring inter-rater reliability

Inter-rater reliability measures whether independent coders assign the same codes to the same data. We use Cohen's Kappa (κ), which corrects for chance agreement (Cohen, 1960). Kappa is calculated per code, because some codes are inherently harder to apply consistently than others.

< 0.20 Poor
0.21-0.40 Fair
0.41-0.60 Moderate
0.61-0.80 Substantial
0.81-1.00 Almost perfect
Our minimum threshold: κ ≥ 0.65

Scale: Landis & Koch (1977). Threshold based on Krippendorff (2004) recommendation of α ≥ 0.667 for applied research.
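Per-code kappa can be computed directly from the two coders' binary assignments. A minimal sketch with toy data (a production pipeline would typically use a stats library such as scikit-learn's `cohen_kappa_score`):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for one binary code across two coders' assignments."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # each coder's "yes" rate
    pe = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    return (po - pe) / (1 - pe)

coder_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_b = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # disagrees on one segment
print(round(cohens_kappa(coder_a, coder_b), 2))  # -> 0.8, above the 0.65 threshold
```

Note that nine out of ten raw agreements (90%) shrink to κ = 0.8 once chance agreement is removed; this correction is why kappa, not percent agreement, is the reported metric.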

Full audit trail

Every code assignment includes written reasoning from each agent that evaluated it. This creates a chain of evidence from the participant's words to the final theme, making every finding traceable and defensible.

Participant said
"We submitted a ticket about a payroll error and didn't hear back for two weeks."
Agent 1 (inclusive)
Participant describes a specific support ticket unanswered for an extended period. Matches "slow response times" and "unanswered tickets." → Poor post-sale support
Agent 2 (conservative)
Explicit mention of an unresolved ticket with a specific timeframe (two weeks). Clear match to codebook definition. → Poor post-sale support
Status
Both agents agree. Auto-confirmed.

Exact agent prompts

Transparency matters. Below are the exact system prompts and codebook formatting each agent receives. Nothing is hidden or summarized. These are the literal instructions that shape each agent's coding behavior.

Agent persona prompts

Each agent receives a persona prompt as the first line of every API call. This shapes how the agent approaches its task.

Discovery agents

D1 Extractor A: Thorough Sonnet, temp 0.2
You are a thorough, nuanced qualitative researcher. Read each response carefully and extract every distinct meaning unit. Capture both explicit statements and clearly implied meaning. It is better to extract too many meaning units than to miss one.
D2 Extractor B: Precise Sonnet, temp 0
You are a precise, conservative qualitative researcher. Extract only meaning units that are clearly and explicitly stated. Do not infer or interpret beyond what the participant said. Each meaning unit should represent one distinct, verifiable claim.
D3 Preliminary Clusterer (Phase 3) Opus, temp 0
You are a qualitative researcher performing preliminary grouping of descriptive codes. Your job is to compress meaning units into rough semantic clusters that will be passed to a global synthesis agent. When in doubt about whether two codes belong together, keep them separate. Under-clustering is recoverable downstream; over-merging is not — once two distinct concepts are collapsed into one cluster, the distinction is lost permanently. Err strongly on the side of finer-grained clusters.
D4 Global Theme Architect (Phase 4) Opus, temp 0
You are a senior qualitative methodologist building a study-wide thematic codebook. You have access to preliminary clusters from ALL thematic questions in the study. Your job is to synthesize a canonical set of themes that applies consistently across all questions. Theme names you write here will be used everywhere in this study — name them for the concept, not the question. Write definitions precise enough that two independent coders would agree on the same responses.
D5 Per-Question Validator (Phase 5) Sonnet, temp 0
You are a qualitative methodologist applying a global codebook to a specific interview question. Your job is to identify which global themes are relevant to this question's responses, add question-specific coding notes where needed, and flag any patterns the global codebook does not cover. Be precise: only mark a theme as applicable if it genuinely appears in the sample responses.
D6 Dimension Architect (Phase 6) Opus, temp 0
You are a senior market research methodologist designing the segmentation variable structure for a B2B interview study. You will receive a completed codebook and study configuration. Your job is to produce a dimensions section that classifies themes into segmentation variables. Classify each dimension as: defining (goes into cluster analysis — these determine which cluster each participant falls into), outcome (used after clustering to validate that segments predict something useful — never enters clustering), or profiling (used after clustering to describe and communicate what each segment looks like). The maximum number of defining dimensions is set by the N/10 rule. Produce the minimum number of dimensions needed to capture the full range of buyer variation — not the maximum allowed. Module B questions (current-state data) describe what happened after the tool choice, not what drove it. Never classify Module B themes as defining — they are profiling by default.

Application agents

1 Coder A: Inclusive Sonnet, temp 0
You are a thorough, inclusive qualitative coder. Capture both explicit statements and clearly implied meaning. When evidence partially matches a code definition, lean toward including the code. It is better to over-include than to miss a relevant code.
2 Coder B: Conservative Sonnet, temp 0.3
You are a conservative, precise qualitative coder. Only assign a code when the participant's words clearly and explicitly match the codebook definition. Do not infer or interpret beyond what was said. When in doubt, do not assign the code.
3 Resolver: Neutral Arbiter Opus, temp 0
You are a neutral arbiter resolving a coding disagreement. Review the participant's words, the codebook definition, and both coders' reasoning. Decide strictly based on whether the evidence meets the codebook definition. Do not favor either coder.

How codebook emphasis changes the prompt

When presenting codebook definitions, Coder A and Coder B see the same information in a different order. This creates different cognitive anchoring, similar to how a human reading "include when..." first will approach a decision differently than one reading "exclude when..." first.

1 Inclusion-first (Coder A)
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...

INCLUDE when: Any mention of difficulty navigating,
excessive clicks, confusing workflows...

EXCLUDE when: General complaints about the product
that are not specifically about ease of use...
2 Exclusion-first (Coder B)
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...

EXCLUDE when: General complaints about the product
that are not specifically about ease of use...

INCLUDE when: Any mention of difficulty navigating,
excessive clicks, confusing workflows...
3 Balanced (Resolver)
--- CODE: Poor usability ---
Definition: Participant describes the product as difficult to use...

Include when: Any mention of difficulty navigating,
excessive clicks, confusing workflows...

Exclude when: General complaints about the product
that are not specifically about ease of use...
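The reordering can be produced by a small formatter. A sketch under assumed field names (`code_name`, `inclusion_criteria`, `exclusion_criteria` mirror the codebook.json example later in this document; the function itself is ours):

```python
def format_code(code, emphasis):
    """Render one codebook entry with include-first, exclude-first,
    or balanced emphasis, per the three orderings shown above."""
    include = f"INCLUDE when: {code['inclusion_criteria']}"
    exclude = f"EXCLUDE when: {code['exclusion_criteria']}"
    if emphasis == "inclusion_first":      # Coder A anchors on inclusion
        criteria = [include, exclude]
    elif emphasis == "exclusion_first":    # Coder B anchors on exclusion
        criteria = [exclude, include]
    else:                                  # resolver sees balanced casing
        criteria = [include.replace("INCLUDE", "Include"),
                    exclude.replace("EXCLUDE", "Exclude")]
    return "\n\n".join(
        [f"--- CODE: {code['code_name']} ---",
         f"Definition: {code['definition']}"] + criteria)
```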

Full prompt template: thematic coding

This is the complete prompt that each application coding agent (Coder A and Coder B) receives for every thematic question. The persona and codebook emphasis sections change per agent; everything else is identical.

Complete thematic coding prompt
[Agent persona prompt inserted here]

You are coding an interview response for the question: "[question text from codebook]"

Multiple codes CAN apply to a single response. Assign all that apply.

CODEBOOK:
[All codes formatted per agent's emphasis order, each with:
  definition, inclusion criteria, exclusion criteria,
  up to 3 examples with reasoning,
  up to 2 negative examples with reasoning]

PARTICIPANT RESPONSE:
"[participant's actual response text]"

INSTRUCTIONS:
1. For each code in the codebook, explain whether the participant's
   response matches the definition.
2. Be specific: quote the exact words from the response that match
   (or don't match) each code.
3. WORD BOUNDARY: Only match words that appear as complete, standalone
   words. Never match a word found inside a longer word (e.g., "equity"
   contains the letters q-u-i-t but the participant did NOT say "quit").
   Verify any keyword is bounded by spaces, punctuation, or the
   start/end of the text.
4. NEGATION: Pay attention to negation words (not, never, no, didn't,
   wasn't, wouldn't). "Not satisfied" means dissatisfied. "No problems"
   means things went well. Identify the complete negated phrase before coding.
5. SARCASM: Watch for sarcastic or ironic statements where context suggests
   the speaker means the opposite of their literal words. Cues include
   exaggeration, contradiction with surrounding statements, or phrases
   like "yeah right." Code the intended meaning, not the literal words.
6. HEDGING: Distinguish between definitive statements and hedged or
   qualified ones. Words like "kind of," "sort of," "I guess," "maybe,"
   and "not really" weaken or change the meaning. Code the actual
   strength of the statement.
7. ABSENCE: If the participant does not mention a topic, do NOT treat
   that as evidence for or against any code. Only code what is actually
   stated or clearly implied. Silence on a topic is not data.
8. CONTEXT: Read the entire response before coding any part of it. A
   phrase can change meaning based on what comes before or after it.
   "The price was high but worth every penny" is not a price complaint.
9. Then list all codes that apply.

Respond in this exact JSON format:
{
  "reasoning": {
    "code_name_1": "Explanation of why this code does or does not apply",
    "code_name_2": "Explanation of why this code does or does not apply"
  },
  "codes_assigned": ["code_name_1", "code_name_2"],
  "confidence": "high|medium|low"
}
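After both coders respond, their codes_assigned lists are compared: identical sets auto-confirm, anything else goes to the resolver. A sketch of that routing logic (function and key names are ours):

```python
import json

def route(coder_a_json, coder_b_json):
    """Compare two coders' JSON outputs for one response."""
    a = set(json.loads(coder_a_json)["codes_assigned"])
    b = set(json.loads(coder_b_json)["codes_assigned"])
    if a == b:
        return {"status": "auto-confirmed", "codes": sorted(a)}
    return {
        "status": "resolve",
        "agreed": sorted(a & b),    # codes both coders assigned
        "disputed": sorted(a ^ b),  # codes only one coder assigned
    }
```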

Full prompt template: disagreement resolution

When Coder A and Coder B disagree, the resolver agent receives both coders' reasoning side by side.

Complete resolution prompt
[Resolver agent persona prompt inserted here]

Two independent coders have coded the same interview response and disagree.
Your task is to determine the correct coding based strictly on the codebook
definition.

QUESTION: "[question text from codebook]"

PARTICIPANT RESPONSE:
"[participant's actual response text]"

CODEBOOK:
[All codes with balanced emphasis: definition,
  inclusion criteria, exclusion criteria,
  examples, negative examples]

CODER A's ASSESSMENT:
Codes assigned: ["code_1", "code_2"]
Reasoning: [Coder A's full reasoning for each code]

CODER B's ASSESSMENT:
Codes assigned: ["code_1"]
Reasoning: [Coder B's full reasoning for each code]

INSTRUCTIONS:
1. Review the participant's exact words.
2. Review the codebook definition, inclusion criteria, and exclusion criteria.
3. Evaluate each coder's reasoning.
4. WORD BOUNDARY: Only match words that appear as complete, standalone
   words. Never match a word found inside a longer word (e.g., "equity"
   contains the letters q-u-i-t but the participant did NOT say "quit").
   Verify any keyword is bounded by spaces, punctuation, or the
   start/end of the text.
5. NEGATION: Pay attention to negation words (not, never, no, didn't,
   wasn't, wouldn't). "Not satisfied" means dissatisfied. "No problems"
   means things went well. Identify the complete negated phrase before coding.
6. SARCASM: Watch for sarcastic or ironic statements where context suggests
   the speaker means the opposite of their literal words. Cues include
   exaggeration, contradiction with surrounding statements, or phrases
   like "yeah right." Code the intended meaning, not the literal words.
7. HEDGING: Distinguish between definitive statements and hedged or
   qualified ones. Words like "kind of," "sort of," "I guess," "maybe,"
   and "not really" weaken or change the meaning. Code the actual
   strength of the statement.
8. ABSENCE: If the participant does not mention a topic, do NOT treat
   that as evidence for or against any code. Only code what is actually
   stated or clearly implied. Silence on a topic is not data.
9. CONTEXT: Read the entire response before coding any part of it. A
   phrase can change meaning based on what comes before or after it.
   "The price was high but worth every penny" is not a price complaint.
10. Determine the correct codes based on the codebook definition.
11. If the codebook definition is ambiguous (both coders' interpretations
   are reasonable), flag it.

Respond in this exact JSON format:
{
  "reasoning": "Your step-by-step analysis",
  "codes_assigned": ["code_name_1"],
  "favored_coder": "A|B|neither",
  "definition_ambiguous": true|false,
  "ambiguity_note": "If ambiguous, describe what about the definition
                      is unclear"
}

Other prompt templates

Categorical, rank-order, and binary questions use simpler prompts since they involve assigning a single value rather than multiple thematic codes.

Categorical coding prompt
[Agent persona]

You are coding an interview response for the question:
"[question text]"

Assign exactly ONE category from the list below.

CATEGORIES:
- Satisfied: Uses clearly positive language...
- Mixed: Acknowledges both positives and negatives...
- Dissatisfied: Uses clearly negative language...

PARTICIPANT RESPONSE:
"[response text]"

INSTRUCTIONS:
1. Explain which category best fits the participant's
   response and why.
2. Quote specific words that support your choice.
3. WORD BOUNDARY: Only match complete, standalone words.
   Never match a word inside a longer word.
4. NEGATION: "Not satisfied" means dissatisfied. Identify
   the complete negated phrase before coding.
5. SARCASM: If context suggests the opposite of literal
   words (exaggeration, contradiction), code intended meaning.
6. HEDGING: "Kind of," "I guess," "maybe" weaken meaning.
   Code the actual strength, not a simplified version.
7. ABSENCE: Silence on a topic is not data. Only code what
   is actually stated.
8. CONTEXT: Read the full response first. A phrase can change
   meaning based on surrounding sentences.

Respond in JSON:
{
  "reasoning": "Explanation of your coding decision",
  "category_assigned": "category_name",
  "confidence": "high|medium|low"
}
Rank-order coding prompt
[Agent persona]

You are coding an interview response for the question:
"[question text]"

Map the participant's response to exactly ONE bucket
from the list below.

BUCKETS:
- 1-100 (1 to 100)
- 101-500 (101 to 500)
- 501-2000 (501 to 2000)
- 2001-10000 (2001 to 10000)
- 10000+ (10001 to unlimited)

PARTICIPANT RESPONSE:
"[response text]"

INSTRUCTIONS:
1. Extract the relevant value from the response.
2. Determine which bucket it falls into.
3. If ambiguous, assign the closest bucket and note it.
4. WORD BOUNDARY: Only match complete, standalone words.
   Never match a word inside a longer word.
5. NEGATION: "Not quite 500" is different from "500."
6. CONTEXT: Read the full response first. Surrounding
   sentences may clarify or change the number.

Respond in JSON:
{
  "extracted_value": "The value from the response",
  "reasoning": "How you determined the bucket",
  "bucket_assigned": "bucket_label",
  "confidence": "high|medium|low"
}
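The bucket boundaries above reduce to a simple lookup. A sketch using the example buckets (the function name is ours):

```python
BUCKETS = [
    (1, 100, "1-100"),
    (101, 500, "101-500"),
    (501, 2000, "501-2000"),
    (2001, 10000, "2001-10000"),
    (10001, float("inf"), "10000+"),
]

def bucket_for(value):
    """Map an extracted numeric value to its rank-order bucket label."""
    for low, high, label in BUCKETS:
        if low <= value <= high:
            return label
    return None  # e.g. zero or negative: no bucket applies, flag for review
```

Note the boundary case: 10000 falls in "2001-10000"; the "10000+" bucket begins at 10001, as the list above specifies.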
Binary coding prompt
[Agent persona]

You are coding an interview response for the question:
"[question text]"

This is a binary (yes/no) determination. Assign exactly
one label: "[positive_label]" or "[negative_label]".

CRITERIA:
Definition of "[positive_label]": [definition]
Code as "[positive_label]" when: [inclusion_criteria]
Code as "[negative_label]" when: [exclusion_criteria]

PARTICIPANT RESPONSE:
"[response text]"

INSTRUCTIONS:
1. Review the participant's exact words.
2. Determine whether the response meets the definition.
3. Quote the specific evidence that supports your
   determination.
4. WORD BOUNDARY: Only match complete, standalone words.
   Never match a word inside a longer word.
5. NEGATION: "Not satisfied" means dissatisfied. Identify
   the complete negated phrase before coding.
6. SARCASM: If context suggests the opposite of literal
   words (exaggeration, contradiction), code intended meaning.
7. HEDGING: "Kind of," "I guess," "maybe" weaken meaning.
   Code the actual strength, not a simplified version.
8. ABSENCE: Silence on a topic is not data. Only code what
   is actually stated.
9. CONTEXT: Read the full response first. A phrase can change
   meaning based on surrounding sentences.

Respond in JSON:
{
  "reasoning": "Explanation with quoted evidence",
  "binary_assigned": "[positive]" or "[negative]",
  "codes_assigned": ["[positive]"] or [],
  "confidence": "high|medium|low"
}

Running the pipeline

The pipeline is a set of Python scripts that orchestrate all nine agents (six discovery, three application) automatically. Here is exactly how to set it up and run it.

Prerequisites

1
Python 3.10+ installed on your machine
2
Anthropic Python SDK. Install with: pip install anthropic
3
Anthropic API key with access to Claude Sonnet

Step 1: Create a project folder and prepare your input files

Each interview study gets its own folder under research/Interview Projects/{Study Name}/. Put your transcripts.json there. For the Dimension Architect phase, also create a study_config.json in the same folder. The pipeline reads from and writes to that folder.

{ } transcripts.json

Your interview transcripts. One entry per participant, with their responses and question text for each question.

{
  "participants": [
    {
      "participant_id": 1,
      "metadata": {
        "name": "Participant 1",
        "date": "2026-01-15"
      },
      "transcript": [
        {
          "question_id": "Q1",
          "question_text": "Why did you switch from
                            your previous provider?",
          "response": "Honestly the biggest thing
                       was onboarding new hires.
                       It was such a clunky process,
                       like you'd have to re-enter
                       their info in three different
                       places."
        },
        {
          "question_id": "Q2",
          "question_text": "How satisfied are you with
                            your current solution?",
          "response": "It does the job but there
                       are definitely things that
                       frustrate me."
        }
      ]
    }
  ]
}
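Before running the pipeline, it is worth confirming that every participant answered every question. A quick sanity-check sketch (the pipeline performs its own validation; the function name is ours):

```python
import json

def missing_questions(data):
    """For each participant, list question_ids answered by others but not them."""
    per_participant = {
        p["participant_id"]: {t["question_id"] for t in p["transcript"]}
        for p in data["participants"]
    }
    all_qids = set().union(*per_participant.values())
    return {pid: sorted(all_qids - qids)  # questions this participant is missing
            for pid, qids in per_participant.items() if qids != all_qids}

# Usage: missing_questions(json.load(open("transcripts.json")))
```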
{ } study_config.json

Tells the Dimension Architect how to classify questions. Specifies the outcome variable (captured from screener), per-question temporal layer annotations, and sample size (which sets the N/10 ceiling on defining dimensions). If absent, Phase 6 runs with no context and infers everything from question text.

{
  "study_name": "HR Leaders",
  "sample_size": 80,
  "outcome_variable": {
    "question_id": "Q4",
    "field_name": "current_tool",
    "description": "Primary recruiting tool — verified from screener"
  },
  "question_context": [
    {
      "question_id": "Q1",
      "temporal_layer": "current",
      "purpose_hint": "defining",
      "note": "firmographic — seniority and function"
    },
    {
      "question_id": "Q7",
      "temporal_layer": "module_a",
      "purpose_hint": "defining",
      "note": "evaluation trigger — upstream of tool choice"
    },
    {
      "question_id": "Q16",
      "temporal_layer": "module_b",
      "purpose_hint": "profiling",
      "note": "current tool strengths — downstream of choice"
    }
  ]
}
{ } codebook.json (application-only mode)

Only needed if you already have a codebook and want to skip discovery. One entry per question, with codes, definitions, inclusion/exclusion criteria, and example quotes.

{
  "study_name": "BambooHR Win-Loss Study",
  "version": "1.0",
  "questions": [
    {
      "question_id": "Q1",
      "question_text": "Why did you switch from your
                         previous provider?",
      "coding_type": "thematic",
      "multi_code": true,
      "codes": [
        {
          "code_name": "Poor usability",
          "definition": "Participant describes the
                         product as difficult to use...",
          "inclusion_criteria": "Any mention of
                         difficulty navigating...",
          "exclusion_criteria": "General complaints
                         not about ease of use...",
          "examples": [
            {
              "text": "It took 12 clicks to approve
                       a PTO request",
              "reasoning": "Describes excessive steps"
            }
          ],
          "negative_examples": [
            {
              "text": "It just didn't have what we
                       needed",
              "reasoning": "Missing features, not
                            usability"
            }
          ]
        }
      ]
    }
  ]
}

Step 2: Point the pipeline at your project folder

Open config.py and set PROJECT_NAME and PROJECT_DIR to your study. All paths (transcript input, codebook output, coded output) derive from PROJECT_DIR automatically. Then set your API key.

Terminal
# On Mac/Linux:
export ANTHROPIC_API_KEY=sk-ant-your-key-here

# On Windows (PowerShell):
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"

# On Windows (Command Prompt):
set ANTHROPIC_API_KEY=sk-ant-your-key-here

Step 3: Run the pipeline

Navigate to the pipeline folder and run the script. The pipeline will process all participants and questions automatically, printing progress as it goes.

Terminal
# Navigate to the pipeline folder
cd research/"2 How to code transcripts"/pipeline

# Run the full pipeline (discovery + application)
# -u flag: unbuffered output — without it, all progress buffers until the end
python -u run_full.py

# Or run phases separately:
# Discovery only (build codebook, ~45-60 min):
python -u run_discovery.py

# Application only (apply existing codebook, ~30-45 min):
python -u run_coding.py

What happens when you run it

Discovery takes 45-60 minutes for a 75-participant study (Phases 2 and 3 run questions in parallel; Phase 4 is one large streaming call). Application takes 30-45 minutes. Discovery costs ~$8-15 per study; application costs ~$30 per 100 interviews. Use python -u (unbuffered) — without it, all terminal output buffers until the script ends.

After application completes: Run python segment-prep.py (from the segmentation workflow folder) to validate dimensions and produce cluster-ready output. This takes under a minute and produces a validation report telling you whether all defining dimensions pass variance checks and are ready for cluster analysis. See research/3 Segmentation workflow/python-script-spec.md for the full spec.

Crash recovery: The pipeline saves checkpoints automatically. If it crashes, just re-run — it resumes from where it left off. Delete codebook_partial_clusters.json, codebook_partial_global.json, and codebook_partial.json from the project folder only if you want a clean restart.
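The resume behavior follows the standard checkpoint pattern: persist results after every unit of work, and on restart skip anything already saved. A sketch of the idea (not the pipeline's actual code):

```python
import json
import os

def run_with_checkpoint(items, process, path="checkpoint.json"):
    """Process items one at a time, persisting results after each.

    On restart, items already present in the checkpoint are skipped."""
    done = {}
    if os.path.exists(path):  # resume from a prior (possibly crashed) run
        with open(path) as f:
            done = json.load(f)
    for item in items:
        if item in done:
            continue  # skip work that already completed
        done[item] = process(item)
        with open(path, "w") as f:  # persist after every item
            json.dump(done, f)
    return done
```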

Step 4: Review the outputs

Outputs land in the project folder (research/Interview Projects/{Study Name}/). Discovery writes to the project root; application writes to a coded/ subfolder; segmentation preparation writes to a segmentation/ subfolder.

File What it contains When to read it
codebook.json The canonical codebook built by discovery: global themes, per-question applicability, question-specific additions, and coverage gap notes. Phase 6 appends a dimensions section classifying themes as defining, outcome, or profiling. Review themes and dimensions before application runs (human review gate)
coded/final_codes.json The final validated code assignments for every segment, with reasoning This is your primary coding deliverable — input to segment-prep.py
coded/reliability_summary.json Per-code kappa values and overall kappa Check that overall kappa is above 0.65
coded/flagged_items.json Codes with low reliability and codebook definitions flagged as ambiguous If any codes are flagged, revise the codebook definitions and re-run
coded/reliability.txt Human-readable reliability report: Agent 1 vs Agent 2 To understand where and why agents disagreed
coded/agent_1_codes.json Agent 1's raw coding with full reasoning per segment For audit trail or to understand specific coding decisions
coded/agent_2_codes.json Agent 2's raw coding with full reasoning per segment For audit trail or to compare with Agent 1
segmentation/segmentation-ready.csv One row per participant, one numeric column per defining dimension. Direct input to cluster analysis. After segment-prep.py passes validation — feed to clustering script
segmentation/segmentation-profile.csv Outcome variable and profiling variables per participant. Used after clustering to describe each segment. After cluster assignments are made — join to describe segments
segmentation/segmentation-validation-report.txt Per-dimension variance stats, N/10 ceiling check, missing data summary. Ends with "Ready for cluster analysis" or "Warnings require review." Read this first after segment-prep.py runs — resolve any warnings before clustering
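Checking reliability_summary.json against the threshold takes a few lines. A sketch (the per_code_kappa field name is illustrative, inferred from the file description above):

```python
import json

KAPPA_THRESHOLD = 0.65

def weak_codes(summary):
    """Return codes whose kappa falls below the reliability threshold."""
    return sorted(code for code, kappa in summary["per_code_kappa"].items()
                  if kappa < KAPPA_THRESHOLD)

# Usage:
# summary = json.load(open("coded/reliability_summary.json"))
# weak_codes(summary)  # revise these codebook definitions and re-run
```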

Customizing the configuration

All settings live in config.py. The defaults work well for most projects. Settings you might adjust:

Setting Default What it controls
KAPPA_THRESHOLD 0.65 Minimum Cohen's Kappa to consider a code reliable. Below this, the code is flagged for review.
MIN_THEME_FREQUENCY 0.05 Discovery: a theme must appear in at least 5% of responses to be a standalone theme. Below this, the synthesizer merges it into a broader theme.
MAX_OTHER_FREQUENCY 0.15 Discovery: if more than 15% of responses land in "Other," the validator flags the codebook for additional themes.
DISCOVERY_CONCURRENCY 2 Questions processed in parallel during Phases 2, 3, and 5. Keep at 2 for most studies; 3 is the safe maximum before rate-limit errors appear at high token volumes. Set to 1 for sequential debugging.
BATCH_DELAY_SECONDS 0.5 Pause between API calls to respect rate limits. Increase if you hit rate-limit errors.
MAX_RETRIES 5 How many times to retry if an API call returns malformed JSON. Higher values are needed for large extraction batches that occasionally truncate.
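Collected as a config.py fragment, the defaults from the table look like this (a sketch of the shape, not the file verbatim):

```python
# config.py: tunable pipeline settings (defaults shown)
KAPPA_THRESHOLD = 0.65       # codes below this kappa are flagged for review
MIN_THEME_FREQUENCY = 0.05   # themes rarer than 5% of responses get merged
MAX_OTHER_FREQUENCY = 0.15   # >15% "Other" flags the codebook as incomplete
DISCOVERY_CONCURRENCY = 2    # parallel questions in Phases 2, 3, and 5
BATCH_DELAY_SECONDS = 0.5    # pause between API calls (raise on rate limits)
MAX_RETRIES = 5              # retries for malformed or truncated JSON responses
```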

Methodological foundations

Our approach is grounded in established qualitative research methods, each backed by decades of peer-reviewed evidence.

Braun, V. & Clarke, V.
2006
Using thematic analysis in psychology
Qualitative Research in Psychology, 3(2), 77-101
Foundational framework for thematic analysis. One of the most cited methods papers in social science.
Braun, V. & Clarke, V.
2021
One size fits all? What counts as quality practice in (reflexive) thematic analysis?
Qualitative Research in Psychology, 18(3), 328-352
States that themes should be developed from the full dataset, not constructed within individual data items — the methodological basis for cross-question theme synthesis.
Braun, V. & Clarke, V.
2022
Thematic Analysis: A Practical Guide
SAGE Publications
Distinguishes three TA variants (Reflexive, Codebook, Coding Reliability) and provides updated guidance for each.
Graneheim, U.H. & Lundman, B.
2004
Qualitative content analysis in nursing research: concepts, procedures and measures to achieve trustworthiness
Nurse Education Today, 24(2), 105-112
Standard framework for segmenting transcripts into meaning units and establishing coding trustworthiness.
Saldaña, J.
2016
The Coding Manual for Qualitative Researchers
SAGE Publications, 3rd edition
Defines first-cycle and second-cycle coding methods, including descriptive, in vivo, and process coding approaches.
Deterding, N.M. & Waters, M.C.
2021
Flexible coding of in-depth interviews: a twenty-first-century approach
Sociological Methods & Research, 50(2), 708-739
Recommends the two-pass coding approach: initial indexing pass followed by focused coding pass.
Hsieh, H-F. & Shannon, S.E.
2005
Three approaches to qualitative content analysis
Qualitative Health Research, 15(9), 1277-1288
Defines directed content analysis for coding open-ended responses into predefined categories (rank-order and categorical variables).
Cohen, J.
1960
A coefficient of agreement for nominal scales
Educational and Psychological Measurement, 20(1), 37-46
Introduces Cohen's Kappa, the standard inter-rater reliability metric correcting for chance agreement.
Krippendorff, K.
2004
Content Analysis: An Introduction to Its Methodology
SAGE Publications, 2nd edition
Defines Krippendorff's Alpha and establishes reliability thresholds (α ≥ 0.667 for tentative, α ≥ 0.80 for firm conclusions).
Landis, J.R. & Koch, G.G.
1977
The measurement of observer agreement for categorical data
Biometrics, 33(1), 159-174
Establishes the standard interpretation scale for kappa values (slight, fair, moderate, substantial, almost perfect).
Gao, J. et al.
2024
CollabCoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models
Proceedings of the ACM on Human-Computer Interaction (CSCW)
Demonstrates that human-AI collaborative coding reduces coding time by ~50% while maintaining intercoder reliability.
Pangakis, N., Wolken, S. & Fasching, N.
2023
Automated annotation with generative AI suggests promising avenues for qualitative research
arXiv preprint
Shows that structured codebook prompts with definitions and examples improve AI coding accuracy by 15-25 percentage points.
Miles, M.B., Huberman, A.M. & Saldaña, J.
2014
Qualitative Data Analysis: A Methods Sourcebook
SAGE Publications, 3rd edition
Comprehensive reference for qualitative coding methods and quality standards, including the 0.80 threshold on 95% of codes.