AI Research · Capstone Project · 2026
Hero is an AI system that understands human intent through gesture, voice, and behavioral signals — not just what you type.
What is Hero?
Most human-computer interaction is still built around keyboards and clicks. But humans communicate with far more than keystrokes. We gesture. We speak. We hesitate. We move in ways that carry intent before a single character is typed.
Hero is a research project built to change that. It processes multiple real-time input channels — gesture, voice, and behavioral patterns — and translates them into structured, actionable intent. No cloud required. No accounts. Everything runs in the browser, on your device.
Phases 1, 2, and 4 are live, with Phase 3 in development. Phase 1 captures hand gestures via MediaPipe Hands. Phase 2 recognizes spoken commands through the Web Speech API. Phase 3 will read behavioral patterns from keystroke timing, mouse velocity, and scroll rhythm. Phase 4 fuses the gesture and voice channels into a unified intent engine with confidence weighting and agreement detection.
Input Architecture
Hero reads input across three parallel channels — gesture, voice, and behavioral patterns. Each captures a different dimension of how you interact. Fused together, they form a picture of intent that no single source could provide alone.
Computers have always been blind to how humans actually behave.
Hero is a research project built to change that.
Three channels. One model.
Gesture
21 hand landmarks tracked
via MediaPipe + camera
Voice
Commands recognized live
via Web Speech API
Behavior
Keystrokes, mouse, scroll
as a continuous intent stream
Phase 1 · Live Now
Hero tracks 21 hand landmarks in real time using MediaPipe Hands — running entirely in your browser, zero server required. Hold any gesture for one second to activate it. Designed for hands-free control and non-verbal communication.
Recognized signals
Phase 2 · Live Now
Voice intent is the second channel Hero understands. Using the Web Speech API, Hero matches what you say to registered actions in real time, with no Hero servers involved. No wake word. No accounts. Just your voice, interpreted by your browser's built-in speech engine.
⚠ Chrome and Edge only — Web Speech API is not available in Firefox or Safari.
How it works
Hero reads from three input channels simultaneously. The camera captures 21 hand landmarks per frame through MediaPipe Hands. The microphone converts speech to text via the Web Speech API. JavaScript event listeners track keystroke timing, scroll velocity, and mouse movement as a continuous behavioral stream. Each channel runs independently, in real time, without leaving your device.
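The capture step can be pictured as a shared, timestamped event stream that all three channels feed. This is a minimal sketch: `createStream` and the channel names are illustrative assumptions, not Hero's actual API. In the browser, `addEventListener` callbacks would do the recording.

```javascript
// Sketch: one timestamped stream shared by all input channels.
// createStream and the channel names are assumptions for illustration.
function createStream() {
  const events = [];
  return {
    // record one raw signal from any channel, stamped with arrival time
    record(channel, detail = {}) {
      events.push({ channel, t: Date.now(), ...detail });
    },
    events,
  };
}

// In the browser, listeners would feed the stream, e.g.:
//   window.addEventListener('keydown', e => stream.record('key', { code: e.code }));
//   window.addEventListener('scroll',  () => stream.record('scroll'));
const stream = createStream();
stream.record('key', { code: 'KeyA' });
stream.record('scroll');
```

Keeping every channel behind one append-only stream is what lets later stages treat gesture, voice, and behavior uniformly.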
Raw inputs are normalized and matched against Hero's intent layer. Gesture landmarks are compared against known pose configurations. Voice transcripts are matched against registered intent patterns using regular expression matching. Behavioral signals are analyzed for rhythm, velocity, and pause patterns to infer engagement state. Each match produces an intent label and a confidence score in milliseconds.
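The transcript-matching step might look something like this minimal sketch. The intent labels, patterns, and fixed confidence value are invented for illustration; Hero's registered command set is not shown here.

```javascript
// Sketch: regex-based intent matching for voice transcripts.
// Labels, patterns, and the 0.9 confidence are illustrative assumptions.
const intents = [
  { label: 'scroll-down', pattern: /\b(scroll|go)\s+down\b/i },
  { label: 'scroll-up',   pattern: /\b(scroll|go)\s+up\b/i },
  { label: 'open-menu',   pattern: /\bopen\s+(the\s+)?menu\b/i },
];

function matchTranscript(transcript) {
  // first matching pattern wins; order therefore encodes priority
  for (const { label, pattern } of intents) {
    if (pattern.test(transcript)) {
      return { label, confidence: 0.9 };
    }
  }
  return { label: 'none', confidence: 0 };
}
```

For example, `matchTranscript('please scroll down')` yields the `scroll-down` label, while an unrecognized phrase falls through to `none` with zero confidence.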
The classified intent is dispatched to the appropriate handler — scrolling, navigation, toggling UI state, or triggering custom actions. The entire pipeline runs in under 20 milliseconds. The Phase 4 fusion engine combines gesture and voice signals using confidence weighting and agreement detection, producing a single reliable intent output from both channels simultaneously.
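Dispatch can be pictured as a lookup from intent label to handler, gated by a confidence threshold. A minimal sketch, assuming invented handler names and a 0.6 threshold that is not Hero's tuned value:

```javascript
// Sketch: dispatching a classified intent to its handler.
// Handler names and the 0.6 threshold are assumptions for illustration.
const actions = []; // stands in for real side effects in this sketch

const handlers = {
  'scroll-down': () => actions.push('scroll-down'), // window.scrollBy(0, ...) in the browser
  'toggle-menu': () => actions.push('toggle-menu'),
};

function dispatch(intent, threshold = 0.6) {
  if (!intent || intent.confidence < threshold) return false; // too uncertain: do nothing
  const handler = handlers[intent.label];
  if (!handler) return false; // unknown intent label
  handler(intent);
  return true;
}
```

Gating on confidence before the lookup means a shaky classification silently does nothing, which is usually the right failure mode for ambient input.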
Capabilities
MediaPipe Hands tracks 21 landmarks per frame through your device's camera. Gestures are classified in milliseconds — no wearables, no special hardware, no internet. Just your hand and the model.
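As a rough illustration of landmark-based classification (not Hero's actual classifier), a thumbs-up check over MediaPipe's 21-point hand layout could look like this. MediaPipe Hands uses index 4 for the thumb tip and 8/12/16/20 for the other fingertips; in normalized image coordinates, y grows downward, so "above" means a smaller y.

```javascript
// Heuristic sketch of a thumbs-up check; thresholds and logic are
// illustrative, not Hero's classifier. Indices: 4 = thumb tip,
// 3/2 = thumb joints, 8/12/16/20 = fingertips, 6/10/14/18 = PIP joints.
function isThumbsUp(lm) {
  // thumb extended upward: tip above its two lower joints
  const thumbExtendedUp = lm[4].y < lm[3].y && lm[3].y < lm[2].y;
  // remaining fingers folded: each tip below its PIP joint
  const fingersFolded = [[8, 6], [12, 10], [16, 14], [20, 18]]
    .every(([tip, pip]) => lm[tip].y > lm[pip].y);
  return thumbExtendedUp && fingersFolded;
}
```

A real classifier would also check x/z geometry and handedness, but the shape of the logic (compare joints, emit a boolean per pose) stays the same.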
The Web Speech API converts spoken commands into structured intent. Hero matches what you say against registered patterns and triggers actions instantly — all on-device, in supported browsers.
Keystrokes, mouse paths, scroll velocity, and idle time form a continuous stream. Hero reads that stream to build context about focus, hesitation, and intent — without any camera or microphone.
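A minimal sketch of reading state from keystroke rhythm alone, assuming invented thresholds; Hero's real heuristics are not shown here.

```javascript
// Sketch: infer an engagement state from inter-key intervals (ms).
// The 1500 ms pause, 30% pause ratio, and 250 ms mean are all
// illustrative assumptions, not Hero's tuned values.
function engagementState(keyTimes) {
  if (keyTimes.length < 2) return 'idle';
  const gaps = keyTimes.slice(1).map((t, i) => t - keyTimes[i]);
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const longPauses = gaps.filter(g => g > 1500).length;
  if (longPauses / gaps.length > 0.3) return 'hesitating';
  return mean < 250 ? 'focused-typing' : 'casual';
}
```

Fed timestamps like `[0, 100, 200, 300]`, the sketch reads steady fast typing as focus, while long gaps between keystrokes read as hesitation.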
Interaction without hands, without a keyboard, or through voice alone has been an unsolved problem for too long. Hero is designed from the ground up to give everyone a natural, direct way to control their device — regardless of how they're able to interact with it.
No data is sent to any server. Every signal Hero processes — camera frames, voice audio, keystrokes — stays on your device. Local inference means no accounts, no telemetry, and no latency from the network.
Signal Pipeline
Raw inputs — gesture coordinates, spoken text, keystroke intervals — are normalized and passed through Hero's classification model. The output is an intent label and confidence score, produced in under 20 milliseconds, entirely on your device.
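For the gesture channel, the normalization step might work like this sketch: shift landmarks to be wrist-relative, then scale so the farthest point sits at distance 1. That makes template matching translation- and scale-invariant; the exact scheme here is an assumption.

```javascript
// Sketch: translation- and scale-invariant landmark normalization.
// Index 0 is the wrist in MediaPipe Hands; the max-distance scaling
// rule is an illustrative assumption.
function normalizeLandmarks(lm) {
  const wrist = lm[0];
  // translate so the wrist sits at the origin
  const shifted = lm.map(p => ({ x: p.x - wrist.x, y: p.y - wrist.y }));
  // scale so the farthest landmark lies at distance 1
  const scale = Math.max(...shifted.map(p => Math.hypot(p.x, p.y))) || 1;
  return shifted.map(p => ({ x: p.x / scale, y: p.y / scale }));
}
```

After this step, the same pose produces nearly identical coordinates whether the hand is near the camera or far from it.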
Real-Time Design
Real-time means the response arrives before the action completes — not after. Hero targets classification within a single interaction frame. That constraint shapes every architectural decision in the project.
Facial Expression Reading
Using MediaPipe FaceMesh, Hero maps 468 facial landmarks in real time — tracking micro-expressions, brow movement, and eye state to read emotional context alongside your other inputs. Nothing leaves your device.
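As a hedged sketch of what reading eye state from FaceMesh output could involve: a vertical-to-horizontal lid distance ratio. The landmark indices used here (159/145 for the left upper and lower lid, 33/133 for the eye corners) are commonly cited for FaceMesh, but both the indices and the blink threshold should be treated as assumptions.

```javascript
// Sketch: eye openness as lid distance over eye width.
// Indices 159/145 (upper/lower lid) and 33/133 (corners) and the
// 0.2 blink threshold are assumptions for illustration.
function eyeOpenness(lm) {
  const vertical = Math.hypot(lm[159].x - lm[145].x, lm[159].y - lm[145].y);
  const horizontal = Math.hypot(lm[33].x - lm[133].x, lm[33].y - lm[133].y);
  return horizontal === 0 ? 0 : vertical / horizontal;
}

const isBlinking = ratio => ratio < 0.2; // illustrative threshold
```

Dividing by eye width keeps the measure stable as the face moves toward or away from the camera, for the same reason hand landmarks are scale-normalized.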
Gesture, voice, and behavioral rhythm have always carried meaning. Hero is a research project exploring how to give computers the ability to understand that meaning — in real time, on-device, and without compromise.
Privacy & Security
Hero was built with one hard constraint: your camera feed, microphone audio, and behavioral patterns never leave your browser. Not compressed, not anonymized, not sampled — just never sent.
Every model, classifier, and intent decision runs locally in your browser via WebAssembly. Hero makes no outbound requests during recognition. Open your network tab — beyond the initial model download, you'll see nothing.
There is no server receiving your data. No login, no analytics pipeline, no telemetry. Hero is a static site — it runs like a calculator, entirely on your machine.
Your video feed is processed frame-by-frame by MediaPipe running in WebAssembly. No frame is encoded, stored, or accessible outside your current browser tab. The moment you close it, it's gone.
Voice recognition uses the Web Speech API, which runs inside your browser. No audio clip, transcript, or partial phrase is transmitted to Hero. There are no Hero servers to receive it.
Keystroke timing, mouse velocity, and scroll patterns are computed in memory and discarded immediately after classification. Nothing is logged. No behavioral profile is built or persisted between sessions.
Every classification rule, sensor handler, and data path is readable in your browser's DevTools right now. No obfuscation. No hidden endpoints. What you see is exactly what runs.
Project Phases
Hero is a long-term research project developed as a software capstone. Each phase adds a new input modality. The goal is a unified, multi-modal intent model that understands humans the way humans understand each other.
21-point hand skeleton tracking via MediaPipe Hands running entirely in the browser. Recognizes 12 distinct gestures including thumbs up, peace, point, OK, and ILY — with sub-20ms classification latency. No wearables. No installation. Just a camera.
Real-time voice command recognition using the Web Speech API. Hero matches spoken phrases against registered intent patterns using regular expression matching. Supports 10 built-in commands, auto-restart on silence, and a simulated waveform fallback when mic access is denied.
JavaScript event listeners track keystroke timing, mouse velocity, and scroll rhythm to form a continuous behavioral stream. Patterns like hesitation, rapid scanning, and focused typing carry intent that neither gesture nor voice alone can provide. No camera. No microphone required.
A unified intent engine that fuses gesture + voice signals weighted by confidence. When both channels agree, confidence is boosted. When they conflict, the dominant signal wins with a penalty. The fusion layer produces a single reliable intent output from the full picture of human input.
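The fusion rule described above can be sketched in a few lines. The 0.15 agreement boost and the 0.7 conflict penalty are invented values for illustration, not Hero's tuned parameters.

```javascript
// Sketch: confidence-weighted fusion of gesture and voice intents.
// The 0.15 boost and 0.7 penalty are illustrative assumptions.
function fuse(gesture, voice) {
  if (!gesture) return voice; // only one channel fired
  if (!voice) return gesture;
  if (gesture.label === voice.label) {
    // agreement: boost the stronger channel's confidence, capped at 1
    const confidence = Math.min(
      1, Math.max(gesture.confidence, voice.confidence) + 0.15);
    return { label: gesture.label, confidence, source: 'agreement' };
  }
  // conflict: the dominant channel wins, but pays a penalty
  const dominant = gesture.confidence >= voice.confidence ? gesture : voice;
  return {
    label: dominant.label,
    confidence: dominant.confidence * 0.7,
    source: 'conflict',
  };
}
```

So a gesture and a spoken command that agree come out more confident than either alone, while a disagreement yields the stronger signal at reduced confidence.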
Hero · Capstone Project · 2026
In Development
Phases 1, 2, and 4 are live. Try the fusion engine — gesture and voice working together in real time.
Hero Desktop · Coming Soon
Hero is currently a browser-based research demo. A native desktop app — with persistent gesture profiles, offline model caching, and system-level integration — is actively in development. Join the waitlist to be first in line.
Waitlist members get early access when the app ships.