Most AI receptionist demos sound great. The agent is articulate, the booking lands, the dispatcher gets the ping. Then you go live, a real homeowner with a real situation calls in, and the wheels come off in a way the demo never showed. The fix is a protocol of five test calls you run before flipping the switch. Three minutes a call. It catches roughly 90% of the weird edge cases.

Run all five. Do not skip the boring ones. The one you skip is the one that breaks live.

Before you start

Three things to set up first.

  • Use a real phone, not a softphone test panel. The audio path is part of what you are testing. Dial the agent's number from your cell.
  • Have the dispatcher's screen open. You want to see what shows up in the CRM in real time. Half of the failures are about what the agent captures, not how it sounds.
  • Have a stopwatch. You are timing three things: pickup, first useful response, and how long until the booking shows in the CRM.

Test 1. Emergency, in season

The most expensive failure mode. If the agent does not recognize the kind of emergency that matches the season (no cooling in summer, no heat in winter), you lose the call to a competitor and you may also lose a customer who needed help.

Call script: "Hi, my AC just stopped blowing cold. It is 87 inside. I have a baby at home. Can someone come tonight?"

What to verify:

  • The agent flags this as an emergency, not a routine booking.
  • The agent asks a follow up about vulnerable people in the home (or recognizes that "baby" already answers it).
  • The agent live transfers to the on call tech, or pages the dispatcher inside 60 seconds.
  • The CRM shows the call as "Emergency" with the address, callback number, and the homeowner's exact words captured in the notes.
  • SMS confirmation goes to the homeowner's phone within 60 seconds.

Common failure: agent books the call as a routine appointment 3 days out. Root cause: emergency triggers were not configured for "baby" or for indoor temperature thresholds. Fix in the prompt before going live.
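
Whether the trigger lives in the prompt or in code, it helps to write the rule down explicitly. Here is a minimal sketch of the rule Test 1 exercises. The word list, the phrases, and the 85 degree threshold are all assumptions for illustration, not values from any particular platform.

```python
import re

# Hypothetical trigger configuration. Tune the terms and threshold to your own rules.
VULNERABLE_TERMS = {"baby", "infant", "newborn", "elderly"}
COOLING_FAILURE_PHRASES = ("stopped blowing cold", "not cooling", "no cold air")
INDOOR_TEMP_THRESHOLD_F = 85  # assumed cutoff, not a figure from this article


def indoor_temp_f(transcript: str):
    """Return the first number followed by 'inside' or 'indoors', else None."""
    match = re.search(
        r"\b(\d{2,3})\s*(?:degrees\s*)?(?:inside|indoors)\b", transcript.lower()
    )
    return int(match.group(1)) if match else None


def is_emergency(transcript: str) -> bool:
    """Emergency = cooling failure plus a vulnerable occupant or a hot house."""
    text = transcript.lower()
    words = set(re.findall(r"[a-z']+", text))
    cooling_failed = any(phrase in text for phrase in COOLING_FAILURE_PHRASES)
    vulnerable = bool(words & VULNERABLE_TERMS)
    temp = indoor_temp_f(text)
    too_hot = temp is not None and temp >= INDOOR_TEMP_THRESHOLD_F
    return cooling_failed and (vulnerable or too_hot)
```

Feed it the Test 1 script and it should return True. Feed it the tune up script from Test 2 and it should return False. If your real configuration cannot be reduced to something this explicit, that is usually where the failure hides.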

Test 2. After hours, low urgency

The second most common failure mode. The agent over-escalates a routine after hours call into an emergency, wakes up the on call tech for a tune up question, and the tech loses trust in the system within a week.

Call script: "Hi, I wanted to set up an appointment for a tune up. My AC is running fine, just the seasonal check. What is your soonest availability?" (Make this call after 7pm or on a Sunday.)

What to verify:

  • The agent does NOT page the on call tech.
  • The agent acknowledges the call is after hours, offers a real next-business-day slot, and confirms.
  • The CRM shows the call as a routine booking, not flagged urgent.
  • If you have an after hours surcharge for emergencies, the agent does NOT mention it for routine work.

Common failure: the agent pages the on call tech for every after hours call. Root cause: triage rules were never wired. The agent treats the calendar's "closed hours" as a blanket "everything is urgent". Fix the prompt to distinguish "after hours emergency" from "after hours routine".
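
The distinction is small enough to sketch in a few lines. This assumes business hours of Monday to Saturday, 8am to 7pm (the opening time and the action names are made up for illustration). The point is that the clock never decides urgency on its own.

```python
from datetime import datetime, time

# Assumed business hours: Monday to Saturday, 8am to 7pm. Adjust to your schedule.
OPEN = time(8, 0)
CLOSE = time(19, 0)


def is_after_hours(when: datetime) -> bool:
    """True on Sundays and outside the open window on other days."""
    if when.weekday() == 6:  # Monday is 0, Sunday is 6
        return True
    return not (OPEN <= when.time() < CLOSE)


def route_call(is_emergency: bool, when: datetime) -> str:
    """Urgency decides paging. The clock only decides which slot to offer."""
    if is_emergency:
        return "page_on_call_tech"
    if is_after_hours(when):
        return "book_next_business_day"
    return "book_routine"
```

A routine call on a Sunday afternoon routes to "book_next_business_day", and only a call already flagged as an emergency reaches "page_on_call_tech".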

Test 3. Out of area

The boring test that catches the most expensive ongoing leak. Out of area calls booked into the dispatch board are wasted technician hours, every time. A single missed gating rule can cost a truck half a day a week.

Call script: "Hi, my AC is broken, can you send a tech out? My address is [a real address that is clearly outside the service area, like a town 90 miles from the contractor's nearest ZIP]."

What to verify:

  • The agent recognizes the ZIP or city is out of area.
  • The agent does NOT book the call.
  • The agent says something like "We do not service that area, but I can recommend a local HVAC contractor for you" or "We service [list of areas], unfortunately we cannot help, sorry for the trouble".
  • The CRM either does not record the call as a job, or records it tagged as "Out of area, declined" so the dispatcher can see what happened.

Common failure: the agent books the appointment because the ZIP gating logic was never set up properly. Root cause: the contractor never gave the agent the actual ZIP list, or the agent was configured with a "best effort" fallback. Fix by giving the agent the explicit list.
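
An explicit list means an allow list with no fallback. A minimal sketch, using placeholder ZIPs and made-up action names:

```python
from typing import Optional

# Placeholder ZIPs for illustration. Replace with the contractor's real list.
SERVICE_ZIPS = frozenset({"30301", "30302", "30303"})


def gate_booking(zip_code: Optional[str]) -> str:
    """Book only when the ZIP is on the list. No best effort fallback."""
    if not zip_code:
        return "ask_for_zip"
    normalized = zip_code.strip()[:5]  # tolerate ZIP+4 such as 30301-1234
    if len(normalized) != 5 or not normalized.isdigit():
        return "ask_for_zip"
    if normalized in SERVICE_ZIPS:
        return "book"
    return "decline_out_of_area"
```

Anything the agent cannot match to the list ends in a question or a decline, never a booking. That is the behavior Test 3 checks.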

Test 4. Existing customer

The trust test. If a repeat customer calls and the agent treats them as brand new, the customer feels invisible and the contractor loses goodwill. If the agent recognizes them and references their last visit, the customer is impressed and the call is faster.

Call script: "Hi, I had your team out last [month], the tune up. Can you come back? Something is making a clicking sound on the outdoor unit."

(Use a phone number that matches a real existing customer in the CRM. Pick one with a recent service history.)

What to verify:

  • The agent recognizes the caller from the phone number lookup.
  • The agent skips the address verification step (or confirms it briefly: "Are we still coming to the address on Maple Drive?").
  • The agent references the last visit if relevant ("I see we were out for a tune up on April 12, anything related?").
  • The CRM attaches the new booking to the existing customer record, not a duplicate.

Common failure: agent treats every caller as new. Root cause: customer lookup by phone number is not wired in the integration. Fix by enabling phone-to-customer matching in the CRM connector before going live. If the CRM does not support fast phone lookup, the agent prompt should at least ask "Have we been to your home before?" early in the call.
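
Phone matching often fails for a dull reason: the caller ID arrives as "+1 (404) 555-0142" and the CRM stores "404-555-0142". A minimal sketch of normalizing US numbers before the lookup. The CUSTOMERS dictionary is a hypothetical stand-in for whatever your CRM connector returns.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Reduce a US phone number to its 10 digits, or None if it does not look like one."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits if len(digits) == 10 else None


# Hypothetical stand-in for a CRM lookup, keyed by normalized phone number.
CUSTOMERS = {
    "4045550142": {"name": "Sample Customer", "street": "Maple Drive", "last_visit": "tune up"},
}


def find_customer(caller_id: str) -> Optional[dict]:
    """Return the matching customer record, or None so the agent asks instead."""
    key = normalize_phone(caller_id)
    return CUSTOMERS.get(key) if key else None
```

When find_customer returns None, the agent should fall back to asking "Have we been to your home before?" and not assume the caller is new.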

Test 5. Sales spam

The "do not waste my dispatcher's time" test. Roughly 5 to 15% of inbound calls to a typical HVAC contractor are vendor solicitations, SEO pitches, and cold outreach. A good agent recognizes these and handles them without booking a job or escalating to the dispatcher.

Call script: "Hi, I am calling from [made up agency name], we help home services businesses with their digital marketing. Can I speak with the owner?"

What to verify:

  • The agent recognizes this is not a service call.
  • The agent does NOT book a job or take an address.
  • The agent gives a polite decline, captures the caller's company (if offered), and either ends the call or says "I will pass along the message".
  • The CRM either does not log the call as a job, or logs it under a "Solicitation" or "Non-customer" tag so the dispatcher can see what came in.

Common failure: the agent treats the sales call as a service call, takes a name and a callback number, and creates a phantom job. Root cause: the agent prompt has no concept of non-service calls. Fix with a simple branch in the prompt: "If the caller is selling something, decline politely and do not create a job".
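
On the CRM side, the same branch looks roughly like this. The intent labels and field names are hypothetical. Use whatever your agent platform and CRM actually emit.

```python
def handle_call(intent: str, caller_company: str = "") -> dict:
    """Decide what gets written to the CRM for a classified call."""
    if intent == "solicitation":
        # Log it so the dispatcher can see what came in, but never as a job.
        return {
            "create_job": False,
            "escalate": False,
            "crm_tag": "Solicitation",
            "note": f"Sales call from {caller_company or 'unknown company'}",
        }
    return {"create_job": True, "escalate": False, "crm_tag": "Service", "note": ""}
```

After the Test 5 call, the record on the dispatcher's screen should match the first branch: a tag and a note, with no job attached.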

Grading the protocol

Five tests, ten to fifteen minutes total. Score honestly.

  • 5 of 5 pass: ship it. Go live.
  • 4 of 5 pass: fix the one failure, retest only that scenario, then ship.
  • 3 of 5 or worse: do not go live. Send the recordings back to whoever configured the agent. Ask for the prompt and the integration to be revised. Retest all five before the next attempt.
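
If you track results in a spreadsheet or a script, the grading rule above reduces to a few lines. A minimal sketch; the test names and verdict strings are illustrative.

```python
def grade(results: dict) -> str:
    """results maps each test name to True (pass) or False (fail)."""
    if len(results) != 5:
        raise ValueError("Run all five tests before grading.")
    failed = [name for name, passed in results.items() if not passed]
    if not failed:
        return "Ship it."
    if len(failed) == 1:
        return f"Fix and retest: {failed[0]}. Then ship."
    return "Do not go live. Revise the prompt and integration, then retest all five."
```

For example, a run where only the out of area test fails returns "Fix and retest: out_of_area. Then ship."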

The honest read: a well configured residential HVAC agent passes all five on the first run. If yours does not, the demo was cherry-picked. Fix it before homeowners catch the failure for you.

After go live, repeat monthly

The five test protocol is not one-and-done. Run it monthly for the first three months, then quarterly. Two reasons.

First, your business changes. Service area expands, you add a heat pump install line, you start charging weekend surcharges. The agent needs to keep up. Monthly tests catch the drift early.

Second, the underlying voice models change. Voice AI providers update their models on roughly 6 to 8 week cycles. Most updates are improvements. Some are regressions on edge cases. The five test protocol is the cheapest way to catch a regression before a homeowner does.

Next steps

If you want to use the same emergency triage rules our agents ship with, the 7 trigger emergency intake script is the framework Test 1 grades against. The same logic underpins every Voltic Voice deployment.

To see the QA protocol run live on a real residential HVAC agent, with all five test calls played back from real recordings, book a 15 minute demo. We will run your specific business rules through the framework before you commit to anything.