# Tutorial: Conversation reviews (advanced)

> PolyAcademy Level 2 – Use advanced diagnostics to trace behavior to its source and identify system-level improvements.

**Level 2 – Lesson 8 of 8** – Master advanced diagnostics to understand exactly why your agent behaves the way it does.

At this stage, use [Conversation Review](/analytics/conversations/review) to answer questions like:

> Why did Variant A behave differently from Variant B?
>
> Was this failure caused by ASR, retrieval, rules, response control, or phrasing?
>
> Why did the agent *not* call a function it was allowed to call?

If you can't point to a specific system layer and say *"this is where the decision was made"*, the agent isn't under control yet.

## Beyond the transcript

At Level 1, the transcript was enough.

At Level 2, the transcript is only the **symptom**. Real work happens in [diagnosis](/analytics/conversations/diagnosis) layers, function traces, variant attribution, and latency signals.

Review with toggles on.

## Tracing a problem to its source

```mermaid theme={"theme":{"light":"github-light","dark":"github-dark"}}
flowchart TD
    A[Unexpected agent behavior] --> B{What does the transcript show?}
    B -->|Wrong word transcribed| C[ASR layer – check Transcript Corrections / Keyphrase Boosting]
    B -->|Wrong topic retrieved| D[KB layer – check topic name, sample questions]
    B -->|Right topic, wrong response| E[Behavior / Response Controls layer]
    B -->|Action didn't fire| F[Function trace – check call order and parameters]
    B -->|Variant A ≠ Variant B| G[Variant layer – check which variant handled each turn]
    B -->|Response too long for voice| H[Latency / interruption layer – tune pacing]
```

## Advanced use of the Conversations table

Before opening individual conversations, shape the table itself.

Add these columns:

* **Variant**
* **Environment**
* **Tool call**
* **Handoff reason**
* **Duration**

Use this to:

* Compare variants side by side
* Spot regressions after promotion
* Identify behavior that only occurs in Live

> Example:
> Calls with Variant = B have longer durations and more handoffs.
>
> This is a signal before you even open a transcript.

Figure: Columns Visibility panel showing toggleable pre-built metric columns for the conversations table
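The variant comparison described above can also be reproduced offline. The following is a minimal sketch, assuming you have copied or exported table rows into plain Python dictionaries. The field names (`variant`, `environment`, `duration_s`, `handoff_reason`) and the sample values are hypothetical and do not reflect an official export format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows from the Conversations table; field names are assumptions.
rows = [
    {"variant": "A", "environment": "live", "duration_s": 95, "handoff_reason": None},
    {"variant": "A", "environment": "live", "duration_s": 110, "handoff_reason": None},
    {"variant": "B", "environment": "live", "duration_s": 180, "handoff_reason": "speak_to_agent"},
    {"variant": "B", "environment": "live", "duration_s": 150, "handoff_reason": None},
    {"variant": "B", "environment": "live", "duration_s": 210, "handoff_reason": "speak_to_agent"},
]


def summarize_by_variant(rows):
    """Return call count, average duration, and handoff rate per variant."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["variant"]].append(row)

    summary = {}
    for variant, items in grouped.items():
        # A row counts as a handoff when it has any handoff reason at all.
        handoffs = sum(1 for r in items if r["handoff_reason"])
        summary[variant] = {
            "calls": len(items),
            "avg_duration_s": mean(r["duration_s"] for r in items),
            "handoff_rate": handoffs / len(items),
        }
    return summary


summary = summarize_by_variant(rows)
for variant in sorted(summary):
    print(variant, summary[variant])
```

A gap between variants in either number is only a prompt to open the relevant conversations. It does not say which layer is responsible.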

## Comparative review patterns

At Level 2, you should rarely inspect a single conversation in isolation.

Common patterns:

* Same intent, different variants
* Same KB topic, different phrasing
* Same user request across Chat and Call
* Same flow before and after a KB change

Conversation Review supports this by exposing **environment, variant, and function data together**.

## Diagnosis layers (deep use)

Toggle diagnosis layers selectively. Each answers a different class of question.

### Topic citations (advanced)

At this level, topic citations are not just about *correct vs incorrect*.

Use them to detect:

* Topic competition
* Overly generic topic names
* Sample question leakage across intents

> Example:
> Three topics are cited repeatedly for "late checkout":
>
> * late\_checkout
> * checkout\_policy
> * general\_stay\_questions
>
> This indicates retrieval ambiguity. The fix is structural, not textual.

### Tool calls (advanced)

Tool call traces show **what the agent committed to doing**, not just what it said.

Inspect:

* Call order
* Conditional execution
* Parameters passed
* Calls that *should* have happened but didn't

> Example:
> The agent asks for SMS consent but never calls `start_sms_flow`.
>
> This usually indicates:
>
> * A missing action branch in the KB
> * A response control interrupting output
> * A rules conflict preventing execution

### Flows and steps

Flows expose **decision paths**.

Use them when:

* Multiple conditions exist
* Behavior depends on prior turns
* The agent appears to "jump" topics

> Example:
> A billing question enters a reservation flow.
>
> This is often caused by:
>
> * Early entity capture
> * Over-eager routing rules
> * Poorly scoped flow entry conditions

### Variants

Variants let you attribute behavior to configuration, not chance.

Use this layer to:

* Confirm A/B test intent
* Validate rollout sequencing
* Identify variant-specific failures

> Example:
> Variant A answers directly.
> Variant B always clarifies first.
>
> Conversation Review lets you confirm this per turn, not anecdotally.

### Entities

Entities are where ASR, NLU, and logic meet.

Inspect entities to:

* Confirm values were actually captured
* Detect silent failures (nulls)
* Spot hallucinated structure

> Example:
> User says "tomorrow morning"
>
> Entity captured: date = today
>
> This is not a KB issue – it's extraction or phrasing.

### Turn latency and interruptions

These layers reveal **experience quality**, not correctness.

Use them to:

* Identify responses that are too long for voice
* Detect places users consistently interrupt
* Tune pacing and verbosity

> Example:
> High interruption rate during policy explanations usually means the response is technically correct but poorly shaped for audio.

## Audio analysis (calls)

At Level 2, audio review is not optional.

Use split audio to:

* Isolate ASR failures
* Hear barge-in timing
* Compare spoken length vs transcript length

This often explains why "perfectly fine" text responses fail in voice.

## Annotations as a system, not notes

At this stage, annotations should be **patterned**, not occasional.

Use them to:

* Track recurring KB gaps
* Justify ASR tuning
* Support decisions to split or retire topics

> Example:
> Five "Missing topic" annotations around refunds in one day is enough evidence to create a dedicated refund topic.

Annotations turn subjective impressions into actionable signals.

## What good looks like

A strong review session ends with **specific changes**, not general feelings:

> Split topic X into two intents.
> Remove sample question Y.
> Add response control to suppress filler.

You can say what changed, where, and why that layer is responsible.
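One way to treat annotations as patterned evidence is to tally them before deciding on a change. The sketch below assumes annotations have been collected as simple records. The field names (`date`, `label`, `subject`), the sample labels, and the threshold of five (borrowed from the refund example in the annotations section) are illustrative assumptions, not part of the product.

```python
from collections import Counter

# Hypothetical annotation records; field names and labels are assumptions.
annotations = [
    {"date": "2024-05-01", "label": "Missing topic", "subject": "refunds"},
    {"date": "2024-05-01", "label": "Missing topic", "subject": "refunds"},
    {"date": "2024-05-01", "label": "Missing topic", "subject": "refunds"},
    {"date": "2024-05-01", "label": "Missing topic", "subject": "refunds"},
    {"date": "2024-05-01", "label": "Missing topic", "subject": "refunds"},
    {"date": "2024-05-01", "label": "Wrong transcription", "subject": "booking reference"},
    {"date": "2024-05-02", "label": "Missing topic", "subject": "parking"},
]


def recurring_patterns(annotations, threshold=5):
    """Return (date, label, subject) groups seen at least `threshold` times."""
    counts = Counter((a["date"], a["label"], a["subject"]) for a in annotations)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


for (date, label, subject), n in recurring_patterns(annotations):
    print(f"{date}: {n} x '{label}' on {subject}")
```

Groups that clear the threshold are candidates for a specific change, such as a new topic. Groups below it stay as observations until more evidence accumulates.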
## Readiness standard

Before treating an agent as stable:

* You can trace any response back to configuration
* You can distinguish ASR, KB, rules, and variant causes
* You can predict how a change affects behavior
* You can verify impact in Conversation Review

## Try it yourself

Looking at your Conversations table, you notice that Variant A has a 40% handoff rate and Variant B has a 15% handoff rate – for the same types of customer queries.

Describe your investigation:

1. What is your first hypothesis?
2. Which diagnosis layers would you check first?
3. What specific data would confirm or rule out each hypothesis?

**Hint:** Think systematically: what could cause two variants to behave differently for the same query? Consider: variant-specific fields, KB topic overrides, response controls, and function logic.

**Sample answer:**

1. **First hypothesis:** Variant A has a handoff action wired to trigger more broadly – perhaps its SMS flow fails more often, or its fallback routing is more aggressive.
2. **Layers to check first:**
   * **Function traces** – compare whether `transfer_call` is being called after different triggers in A vs B
   * **Variant fields** – check if A has different escalation language or action overrides
   * **Topic citations** – confirm the same KB topics are being retrieved for both variants
3. **Confirming data:**
   * If function traces show `transfer_call` firing after different events → KB action branch issue
   * If topic citations differ between A and B → variant-specific KB override or sample question difference
   * If function traces are identical → check variant fields for different routing thresholds or transfer conditions

## Metrics and dashboards

Beyond individual conversation review, you can use metrics and [dashboards](/analytics/dashboards/introduction) to identify patterns across many conversations.

### Filtering conversations

The **Conversations** page supports filtering by both built-in and custom metrics. Built-in metrics include environment, call duration, variant, and handoff reason. Custom metrics are values you log from your functions – for example, `cancel_initiated`, `id_v_successful`, or the brand the user asked about.

**Useful filter combinations:**

* **All handoffs** – filter by handoff reason "has any value" to see every transferred call
* **Specific handoff reason** – filter by a reason like "speak\_to\_agent" to find deflection opportunities
* **Custom metric** – filter by `cancel_initiated` to review all cancellation flows

Figure: Filter builder panel with multiple active conditions

Figure: Active filter chips above the conversations table summarising applied conditions
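The filter combinations above can be thought of as predicates that are combined with AND. The sketch below mimics that idea on plain dictionaries. It is not the filter builder's actual implementation, and the helper names (`has_any_value`, `equals`, `apply_filters`), the row fields, and the `payment_failed` reason are hypothetical.

```python
def has_any_value(field):
    """Match rows where `field` is present and non-empty."""
    return lambda row: row.get(field) not in (None, "")


def equals(field, value):
    """Match rows where `field` equals `value` exactly."""
    return lambda row: row.get(field) == value


def apply_filters(rows, *conditions):
    """Keep rows that satisfy every condition (logical AND)."""
    return [row for row in rows if all(cond(row) for cond in conditions)]


# Hypothetical rows; `cancel_initiated` mirrors the custom metric named in the text.
rows = [
    {"id": "c1", "handoff_reason": "speak_to_agent", "cancel_initiated": True},
    {"id": "c2", "handoff_reason": None, "cancel_initiated": True},
    {"id": "c3", "handoff_reason": "payment_failed", "cancel_initiated": False},
    {"id": "c4", "handoff_reason": None, "cancel_initiated": False},
]

all_handoffs = apply_filters(rows, has_any_value("handoff_reason"))
deflection_candidates = apply_filters(rows, equals("handoff_reason", "speak_to_agent"))
cancelled_and_transferred = apply_filters(
    rows, equals("cancel_initiated", True), has_any_value("handoff_reason")
)

print([r["id"] for r in all_handoffs])
print([r["id"] for r in deflection_candidates])
print([r["id"] for r in cancelled_and_transferred])
```

Stacking conditions narrows the set quickly, which is the point: a short list of conversations that share a handoff reason and a custom metric is far easier to review than the full table.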

### QA metrics

The QA metric identifies which knowledge topic the agent used to answer each query:

* **Raven (voice)** – the LLM determines the QA metric directly by matching its response to the most relevant topic. This is accurate because the LLM has full context.
* **GPT-based agents (chat)** – the system encodes the user utterance, finds the closest topics by embedding similarity, generates a response, then matches the response back to topics. This can be less accurate when responses blend multiple topics.

A conversation can match more than one topic across turns. When that happens, the **QA** column in the conversations table shows every matched topic for that call, joined by commas (for example, `billing, handoff`), so you can see the full set of topics at a glance without opening each conversation. The same comma-joined format is used for any other custom metric that is logged multiple times on a single conversation.

### Using dashboards for improvement

A well-built dashboard tracks your key metrics (containment, transfer rate, call duration, authentication success) over time. Focus on:

1. **Containment trends** – are your improvements actually moving the number?
2. **Top queries** – what are users asking about most? Are there unhandled intents?
3. **Handoff reasons** – which reasons have the highest volume? Can you add flows or topics to reduce transfers?

For example, if "make an order" is a top query with a high transfer rate, building an order troubleshooting flow could directly improve containment.
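If you copy values out of the comma-joined **QA** column, a rough "top topics" count takes only a few lines. This is a minimal sketch that assumes each cell is a string such as `billing, handoff` (or empty when nothing matched). The sample values and the `split_metric` helper are hypothetical, and the split would mis-handle topic names that themselves contain commas.

```python
from collections import Counter


def split_metric(value):
    """Split a comma-joined metric cell into a list of trimmed values."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


# Hypothetical QA column values, one entry per conversation.
qa_column = ["billing, handoff", "billing", "", "late_checkout, billing", None]

topic_counts = Counter(
    topic for cell in qa_column for topic in split_metric(cell)
)

for topic, count in topic_counts.most_common():
    print(topic, count)
```

Counting per conversation in this way approximates the "top queries" view of a dashboard and can highlight topics worth opening in Conversation Review first.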