ears without eyes

Apr 20, 2026
#ai
#agents
#multimodal
#video

an agent watching video should default to listening, and only open its eyes when audio admits it missed something.

watching video is expensive. listening to it is almost free.

a ten minute video at thirty frames per second is eighteen thousand frames. if you run a vision model on every one of them, you pay for eighteen thousand vision calls. at sixty frames per second you pay for thirty six thousand. at one twenty you pay for seventy two thousand. at two forty you pay for a hundred and forty four thousand vision calls on a single ten minute clip. the multiplier is the framerate, and the framerate is whatever the camera that shot the clip was set to, which you do not get to negotiate.
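
the arithmetic is nothing more than framerate times duration, but it is worth spelling out once to see how fast it runs away from you:

```python
# frames a vision model would have to look at if it watches every frame
duration_s = 10 * 60  # a ten minute clip
for fps in (30, 60, 120, 240):
    print(f"{fps:>3} fps -> {fps * duration_s:,} vision calls")
# 30 fps -> 18,000 ... 240 fps -> 144,000
```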

audio at the same length is a single transcription pass. whisper runs locally in seconds on a small model. no per-token cost, no rate limits, no api quota burning down in real time. the asymmetry between the two channels is at least three orders of magnitude, sometimes four, depending on which vision model you route to.

most multimodal agents do not acknowledge this asymmetry. they either open their eyes on every frame and get expensive fast, or they keep their eyes closed and miss anything that was written on screen but not said out loud. both are wrong. and the second failure mode is the one that quietly poisons accuracy, because the agent sounds fluent while being silently incomplete.

why ocr is not the answer

the obvious patch is ocr. pipe every frame through a text recognition model, catch the letters, call it a day. this is the pattern most video agents have landed on, and i have written one of them myself.

it fails in two directions.

first, ocr is a separate model. you now have two reasoning systems looking at the same pixels, one reading strings, one reading meaning, and they do not agree. diagrams are invisible to ocr. charts are invisible. ui elements that rely on shape rather than text are invisible. faces, gestures, framing, camera movement. all invisible. ocr gives you the letters on the screen. the letters are not always the content.

second, running ocr on every frame still costs. it is cheaper than a full vision llm but not free, and the rate limit pressure does not go away. you are paying to extract a small slice of the frame when the frame might have been telling you something different entirely.

the right model for reading a frame is the same model doing the rest of the reasoning. one model, one view, end to end. the question is not whether to use vision. the question is when.

the cheap channel knows when it failed

here is the part that took me a while to see. the audio model already tells you when it was guessing.

run whisper-cli with the --output-json-full flag and every token comes back with a probability score. the confident words sit at 0.9 and above. the guessed ones sit at 0.3, 0.4, sometimes lower. whisper is not hiding its uncertainty, it is printing it. the signal has always been there, it just was not being read.
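
as a sketch, this is roughly what reading that output looks like. the field names here, a transcription array of segments, each with a tokens array carrying text, offsets in milliseconds, and a p score, are what whisper.cpp's --output-json-full gives me, but treat the exact json shape as an assumption and check it against your build.

```python
import json

# sketch: pull per-token probabilities out of whisper-cli's full json output.
# field names ("transcription", "tokens", "offsets", "p") are an assumption
# about whisper.cpp's --output-json-full layout; verify against your build.
def load_tokens(json_path):
    with open(json_path) as f:
        data = json.load(f)
    tokens = []
    for segment in data.get("transcription", []):
        for tok in segment.get("tokens", []):
            tokens.append({
                "text": tok["text"],
                "p": tok["p"],                           # per-token probability, 0..1
                "start": tok["offsets"]["from"] / 1000,  # ms -> seconds
                "end": tok["offsets"]["to"] / 1000,
            })
    return tokens

for tok in load_tokens("reel.json"):
    if tok["p"] < 0.5:
        print(f"{tok['start']:6.2f}s  p={tok['p']:.2f}  {tok['text']!r}")
```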

once you read the probabilities, you can build the rest of the system around them. tokens below a threshold form uncertainty zones. each zone carries a timestamp, the midpoint of when whisper got confused. that timestamp is a question addressed to the expensive channel. not "show me the video", but "show me this specific half second, because i was lost."
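
a minimal sketch of that grouping, assuming the token list from the parsing step above. the 0.5 threshold and the one second merge gap are arbitrary knobs for illustration, not the values media-mcp uses.

```python
# sketch: merge consecutive low-confidence tokens into zones, each with a
# midpoint timestamp to hand to the vision channel. threshold and merge gap
# are illustrative defaults, not tuned values.
def uncertainty_zones(tokens, threshold=0.5, merge_gap=1.0):
    zones = []
    for tok in tokens:
        if tok["p"] >= threshold:
            continue
        if zones and tok["start"] - zones[-1]["end"] <= merge_gap:
            zones[-1]["end"] = tok["end"]              # extend the open zone
            zones[-1]["words"].append(tok["text"])
        else:
            zones.append({"start": tok["start"], "end": tok["end"],
                          "words": [tok["text"]]})
    for z in zones:
        z["midpoint"] = (z["start"] + z["end"]) / 2    # the half second worth a glimpse
    return zones
```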

there is a softer signal next to the hard one. speakers often point at things. "check our site", "in the bio", "this command right here", "visit the link". these phrases almost never carry the actual information the speaker is referencing. the information is on screen. a small regex over the transcription surfaces these phrases with their own timestamps. they are a weaker signal than low-confidence tokens, but they stack.
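
something like the sketch below, with the caveat that the phrase list is illustrative and not the one shipped in media-mcp.

```python
import re

# sketch: phrases where the speaker is pointing at the screen instead of
# saying the thing. the pattern list is illustrative, not media-mcp's.
DEMONSTRATIVE = re.compile(
    r"check (out )?(our|the|my) (site|bio|link)|in the bio|link below"
    r"|this \w+ right here|visit the link",
    re.IGNORECASE,
)

def demonstrative_hits(tokens):
    # rebuild the transcript while remembering which character came from which
    # token, so a regex match can be mapped back to a timestamp
    text, starts = "", []
    for tok in tokens:
        starts.extend([tok["start"]] * len(tok["text"]))
        text += tok["text"]
    return [{"timestamp": starts[m.start()], "phrase": m.group(0)}
            for m in DEMONSTRATIVE.finditer(text)]
```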

together, confidence zones and demonstrative phrases give the model a shortlist of moments where looking would actually add something. everywhere else, the transcription is enough.

what glimpsing looks like

when the agent does decide to look, it does not sweep. it does not run at one frame per second and analyze a strip. it takes one frame. maybe two, spread across a zone. the timestamps come from the audio signal, not from a video scan.
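
the extraction itself is a one-line ffmpeg call per timestamp. this is a generic sketch of that step, reusing the helpers from the sketches above, not the code inside media-mcp:

```python
import subprocess

# sketch: grab a single frame at a timestamp. putting -ss before -i makes
# ffmpeg seek first, so this stays fast even deep into a long video.
def grab_frame(video_path, timestamp, out_path):
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{timestamp:.2f}", "-i", video_path,
         "-frames:v", "1", "-q:v", "2", out_path],
        check=True, capture_output=True,
    )
    return out_path

# one glimpse per uncertainty zone, nothing else gets looked at
for i, zone in enumerate(uncertainty_zones(load_tokens("reel.json"))):
    grab_frame("reel.mp4", zone["midpoint"], f"glimpse_{i}.jpg")
```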

in a recent test i pointed an agent at a 47 second instagram reel. the creator was naming three tools, fast, with strong accents and short consonants. whisper transcribed one of them as Emil Koval. the probability on the second syllable came back at 0.33, which is the model saying i have essentially no idea. two other moments in the same clip triggered similar low-confidence flags.

the agent looked at exactly those three moments. not at any of the other 44 seconds. the frames it got back showed the real names, typed cleanly on screen by the creator precisely because they knew nobody would hear them. the correct information was recovered. three vision calls. on a ten minute version of the same behavior, this is three calls instead of eighteen thousand. the accuracy is not compromised by the efficiency, the accuracy comes from the efficiency, because the three frames the model did look at were the three where looking mattered.

the same pattern extends beyond speech. any cheap channel that can report its own uncertainty can drive a more expensive channel on demand. metadata can defer to vision. a language model can defer to a retrieval call. retrieval can defer to a web fetch. the architecture is the same at every layer. the default is to trust the cheap channel and ask the expensive one only when the cheap one admits it cannot answer.
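
stripped of the video specifics, the loop is small. the names below are hypothetical, just the shape of the interface:

```python
from typing import Callable, NamedTuple

# sketch of the generic loop, all names hypothetical: a cheap pass that flags
# its own gaps, and an expensive pass spent only on those gaps.
class CheapPass(NamedTuple):
    content: dict[float, str]   # position -> what the cheap channel produced
    flagged: list[float]        # positions where it admits it was guessing

def resolve(cheap: CheapPass, expensive: Callable[[float], str]) -> dict[float, str]:
    merged = dict(cheap.content)
    for spot in cheap.flagged:
        merged[spot] = expensive(spot)   # only the flagged spots cost anything
    return merged
```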

what it feels like in practice

the first thing you notice is that rate limits stop being a concern for normal use. you can point a small agent at dozens of videos a day without thinking about api quota, because most of the work is done in transcription locally and the vision budget only gets spent at the moments that genuinely need it.

the second thing is that the agent stops being confidently wrong. this is the less obvious upgrade. before, when a transcription mangled a proper noun, the agent would pass the mangled name forward and the user would not know it was wrong until they tried to act on it. now, when the confidence score flags the token, either the agent looks at a frame and resolves the ambiguity, or it surfaces the uncertainty back to the user directly. no silent confabulation. the failure mode becomes visible, which means it can be fixed.

the third thing, and this is the one i did not expect, is that writing code for this pattern is simpler than writing code for the alternatives. you do not need a separate ocr pipeline, you do not need a vision scheduler, you do not need a confidence estimator. you get all three from one line of command flags on the audio model. the work is in wiring the signal through cleanly, and then trusting the model to use it.

the project that made this concrete

i built this into media-mcp. every transcription tool now runs whisper-cli -ojf, parses per-token probabilities, reports contiguous low-confidence spans, and scans the transcript for demonstrative phrases. a new companion tool, get_video_frames_at, takes an array of timestamps and returns one frame per timestamp. videos cache locally by a hash of the url so follow-up calls on the same clip do not re-download.
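
for the caching piece, the idea is just to key the local copy by a hash of the url. the sketch below is the shape of that, with a yt-dlp call standing in for whatever downloader the real tool uses, and it reuses grab_frame from the earlier sketch. it is not the actual media-mcp implementation.

```python
import hashlib
import subprocess
from pathlib import Path

# sketch of url-hash caching, not the media-mcp code itself. the yt-dlp call
# is a stand-in for the real download step.
CACHE_DIR = Path("~/.cache/media-mcp").expanduser()

def cached_video_path(url: str) -> Path:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = CACHE_DIR / f"{key}.mp4"
    if not path.exists():
        subprocess.run(["yt-dlp", "-o", str(path), url], check=True)
    return path

def get_video_frames_at(url: str, timestamps: list[float]) -> list[str]:
    # one frame per requested timestamp
    video = cached_video_path(url)
    return [grab_frame(str(video), t, str(CACHE_DIR / f"{video.stem}_{i}.jpg"))
            for i, t in enumerate(timestamps)]
```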

the code is under a hundred lines for the core pattern. the rest of the commit is plumbing. what matters is the interface. any agent using media-mcp can now point at a video, read the transcript, see where the transcript flags itself, and ask for frames only at those moments. one model, two channels, one deciding when to use the other.

repo. github.com/woosal1337/media-mcp

closing

the general shape of the pattern is simple. default to the cheap channel. make the cheap channel surface its own uncertainty. let the agent spend the expensive channel on the specific moments the cheap channel flagged. trust the model to do this without a hand-written rule for every case.

an agent that knows when to open its eyes does more work, cheaper, more accurately, than one that keeps them open all the time.