AI Research · Capstone Project · 2026

The computer
that reads you.

Hero is an AI system that understands human intent through gesture, voice, and behavioral signals — not just what you type.

Try fusion demo Learn more ↓

What is Hero?

An AI system built to understand humans — not just their text.

Most human-computer interaction is still built around keyboards and clicks. But humans communicate with far more than keystrokes. We gesture. We speak. We hesitate. We move in ways that carry intent before a single character is typed.

Hero is a research project built to change that. It processes multiple real-time input channels — gesture, voice, and behavioral patterns — and translates them into structured, actionable intent. No cloud required. No accounts. Everything runs in the browser, on your device.

All four phases are live. Phase 1 captures hand gestures via MediaPipe Hands. Phase 2 recognizes spoken commands through the Web Speech API. Phase 3 reads behavioral patterns from keystroke timing, mouse velocity, and scroll rhythm. Phase 4 fuses all three channels into a unified intent engine with confidence weighting and agreement detection.

Computer Vision Natural Language Processing Human–Computer Interaction Accessibility On-Device AI Machine Learning

Input Architecture

Three signals.
One system.
Real-time intent.

Hero reads input across three parallel channels — gesture, voice, and behavioral patterns. Each captures a different dimension of how you interact. Fused together, they form a picture of intent that no single source could provide alone.

✓ thumbs_up conf: 0.97

Computers have always been blind to how humans actually behave.
Hero is a research project built to change that.

Three channels. One model.

Every signal,
understood.

Phase 1 · Live Now

Show Hero
your hand.

Hero tracks 21 hand landmarks in real time using MediaPipe Hands — running entirely in your browser, zero server required. Hold any gesture for one second to activate it. Designed for hands-free control and silent, nonverbal communication.
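The one-second hold works like a dwell timer: an intent fires only after the classifier has reported the same label continuously for the hold duration. A minimal sketch, assuming a per-frame label from the gesture classifier (class and method names here are illustrative, not Hero's actual API):

```javascript
// Dwell-activation sketch: an intent fires only after the same gesture
// label has been reported continuously for `holdMs` milliseconds.
// Names are illustrative; this is not Hero's actual API.
class DwellActivator {
  constructor(holdMs = 1000) {
    this.holdMs = holdMs;
    this.current = null;   // label currently being held
    this.since = 0;        // timestamp when the current hold started
    this.fired = false;    // each hold activates at most once
  }

  // Called once per frame with the classifier's label and a timestamp.
  // Returns the label when the hold completes, otherwise null.
  update(label, now) {
    if (label !== this.current) {
      this.current = label;
      this.since = now;
      this.fired = false;
      return null;
    }
    if (!this.fired && label !== null && now - this.since >= this.holdMs) {
      this.fired = true;
      return label; // activate
    }
    return null;
  }
}
```

Resetting the timer whenever the label changes means a flickering classification never activates anything; only a stable, deliberate gesture does.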

Recognized signals

✋ Open Hand ✊ Fist 👍 Thumbs Up 👎 Thumbs Down ☝️ Point ✌️ Peace 👌 OK 🤟 ILY 🤘 Rock On 🤙 Call Me 3️⃣ Three 4️⃣ Four
Fingers: Thumb · Index · Middle · Ring · Pinky
✋ Show Hands

Camera access is required for gesture tracking

Open Hand Tracking →

Signal history

    Phase 2 · Live Now

    Speak to Hero.
    It responds.

    Voice intent is the second channel Hero understands. Using the Web Speech API — running entirely in your browser, no server required — Hero matches what you say to registered actions in real time. No wake word. No cloud. Just your voice, interpreted locally.

    ⚠ Chrome and Edge only — Web Speech API is not available in Firefox or Safari.


    How it works

    From signal to intent
    in three steps.

    01

    Signal capture

    Hero reads from three input channels simultaneously. The camera captures 21 hand landmarks per frame through MediaPipe Hands. The microphone converts speech to text via the Web Speech API. JavaScript event listeners track keystroke timing, scroll velocity, and mouse movement as a continuous behavioral stream. Each channel runs independently, in real time, without leaving your device.

    MediaPipe Hands Web Speech API getUserMedia DOM Events API
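The behavioral channel described above reduces ordinary DOM events to a numeric stream. A minimal sketch, assuming the stream tracks inter-keystroke intervals and scroll velocity (names are illustrative, not Hero's actual API):

```javascript
// Behavioral-channel sketch: plain event timestamps reduced to a
// numeric stream. Names are illustrative; this is not Hero's actual API.
class BehaviorStream {
  constructor() {
    this.lastKeyAt = null;
    this.keyIntervals = [];   // ms between consecutive keystrokes
    this.lastScroll = null;   // { y, t } of the previous scroll sample
    this.scrollVelocity = 0;  // px per ms, most recent sample
  }

  onKeydown(now) {
    if (this.lastKeyAt !== null) this.keyIntervals.push(now - this.lastKeyAt);
    this.lastKeyAt = now;
  }

  onScroll(y, now) {
    if (this.lastScroll !== null) {
      const dt = now - this.lastScroll.t;
      if (dt > 0) this.scrollVelocity = (y - this.lastScroll.y) / dt;
    }
    this.lastScroll = { y, t: now };
  }
}

// In a browser this would be wired up roughly like:
//   const stream = new BehaviorStream();
//   window.addEventListener('keydown', e => stream.onKeydown(e.timeStamp));
//   window.addEventListener('scroll', () =>
//     stream.onScroll(window.scrollY, performance.now()));
```

Everything stays in memory; the raw events are discarded as soon as the derived numbers are computed.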
    02

    Classification

    Raw inputs are normalized and matched against Hero's intent layer. Gesture landmarks are compared against known pose configurations. Voice transcripts are matched against registered intent patterns using regular expression matching. Behavioral signals are analyzed for rhythm, velocity, and pause patterns to infer engagement state. Each match produces an intent label and a confidence score in milliseconds.

    Landmark Normalization RegExp Pattern Matching Confidence Scoring Pose Classification
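For the voice channel, the regular-expression matching described above can be sketched as a small intent table. The patterns and the confidence heuristic here are illustrative assumptions, not Hero's real intent registry:

```javascript
// Regex intent-matching sketch. Patterns and scoring are illustrative
// assumptions; Hero's real intent table may differ.
const intents = [
  { name: 'scroll_down', pattern: /\b(scroll|go)\s+down\b/i },
  { name: 'scroll_up',   pattern: /\b(scroll|go)\s+up\b/i },
  { name: 'open_camera', pattern: /\b(open|start)\s+(the\s+)?camera\b/i },
];

function classifyTranscript(transcript) {
  for (const intent of intents) {
    const m = transcript.match(intent.pattern);
    if (m) {
      // Crude confidence: the fraction of the transcript the match covers,
      // so short, exact commands score higher than matches buried in noise.
      const confidence = Math.min(1, m[0].length / transcript.trim().length);
      return { intent: intent.name, confidence };
    }
  }
  return { intent: null, confidence: 0 };
}
```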
    03

    Intent output

    The classified intent is dispatched to the appropriate handler — scrolling, navigation, toggling UI state, or triggering custom actions. The entire pipeline runs in under 20 milliseconds. The Phase 4 fusion engine combines gesture and voice signals using confidence weighting and agreement detection, producing a single reliable intent output from both channels simultaneously.

    Intent Dispatch < 20ms Latency Multi-modal Fusion · Live
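Dispatch itself can be as simple as a map from intent labels to handlers, with a confidence floor so low-quality classifications are dropped rather than misfired. A sketch, with illustrative handler names:

```javascript
// Intent-dispatch sketch: classified intents routed to handlers via a
// plain map. Handler names are illustrative, not Hero's action registry.
const handlers = {
  scroll_down: () => 'scrolled down',
  scroll_up:   () => 'scrolled up',
  toggle_menu: () => 'menu toggled',
};

function dispatch(result, minConfidence = 0.6) {
  // Drop low-confidence or unrecognized classifications instead of misfiring.
  if (!result.intent || result.confidence < minConfidence) return null;
  const handler = handlers[result.intent];
  return handler ? handler() : null;
}
```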
    Gesture (camera) · Voice (mic) · Behavior (events) → Classification → Intent + confidence

    Capabilities

    Real-Time Gesture Recognition

    MediaPipe Hands tracks 21 landmarks per frame through your device's camera. Gestures are classified in milliseconds — no wearables, no special hardware, no internet. Just your hand and the model.

    Voice Intent

    The Web Speech API converts spoken commands into structured intent. Hero matches what you say against registered patterns and triggers actions instantly — all on-device, in supported browsers.

    Behavioral Pattern Reading

    Keystrokes, mouse paths, scroll velocity, and idle time form a continuous stream. Hero reads that stream to build context about focus, hesitation, and intent — without any camera or microphone.

    Built for Accessibility

    Hands-free, keyboard-free, and voice-only interaction has been an unsolved problem for too long. Hero is designed from the ground up to give everyone a natural, direct way to control their device — regardless of how they're able to interact with it.

    Fully On-Device

    No data is sent to any server. Every signal Hero processes — camera frames, voice audio, keystrokes — stays on your device. Local inference means no accounts, no telemetry, and no latency from the network.

    Input signals → Intent class · conf 0.97

    Signal Pipeline

    Input arrives.
    Intent emerges.
    Action follows.

    Raw inputs — gesture coordinates, spoken text, keystroke intervals — are normalized and passed through Hero's classification model. The output is an intent label and confidence score, produced in under 20 milliseconds, entirely on your device.

    Real-Time Design

    Fast enough
    to feel like
    an extension of you.

    Real-time means the response arrives before the action completes — not after. Hero targets classification within a single interaction frame. That constraint shapes every architectural decision in the project.

    Capture · 3 ms
    Preprocess · 2 ms
    Inference · 8 ms
    Post-process · 2 ms
    Total · < 16 ms

    Face Expression Reading

    Your face is
    already speaking.
    Hero listens.

    Using MediaPipe FaceMesh, Hero maps 468 facial landmarks in real time — tracking micro-expressions, brow movement, and eye state to read emotional context alongside your other inputs. Nothing leaves your device.

    ✓ focused brow_raise: 0.83 landmarks: 468

    Human communication
    has never been
    just text.

    Gesture, voice, and behavioral rhythm have always carried meaning. Hero is a research project exploring how to give computers the ability to understand that meaning — in real time, on-device, and without compromise.

    Privacy & Security

    Everything runs on
    your device. Always.

    Hero was built with one hard constraint: your camera feed, microphone audio, and behavioral patterns never leave your browser. Not compressed, not anonymized, not sampled — just never sent.

    🔒

    Zero network calls for inference

    Every model, classifier, and intent decision runs locally in your browser via WebAssembly. Hero makes no outbound requests during recognition. Open your network tab — you'll see nothing.

    📵

    No backend. No accounts.

    There is no server receiving your data. No login, no analytics pipeline, no telemetry. Hero is a static site — it runs like a calculator, entirely on your machine.

    🎥

    Camera frames stay local

    Your video feed is processed frame-by-frame by MediaPipe running in WebAssembly. No frame is encoded, stored, or accessible outside your current browser tab. The moment you close it, it's gone.

    🎙️

    No voice data reaches Hero

    Voice recognition uses the browser's built-in Web Speech API. Depending on the browser, recognition may run on-device or through the platform's own speech service — but no audio clip, transcript, or partial phrase is ever transmitted to Hero. There are no Hero servers to receive it.

    🖱️

    Behavioral data is ephemeral

    Keystroke timing, mouse velocity, and scroll patterns are computed in memory and discarded immediately after classification. Nothing is logged. No behavioral profile is built or persisted between sessions.

    🔍

    Fully inspectable

    Every classification rule, sensor handler, and data path is readable in your browser's DevTools right now. No obfuscation. No hidden endpoints. What you see is exactly what runs.

    No server calls No data stored No accounts No tracking 100% on-device AI Open source

    Project Phases

    Built in the open,
    one phase at a time.

    Hero is a long-term research project developed as a software capstone. Each phase adds a new input modality. The goal is a unified, multi-modal intent model that understands humans the way humans understand each other.

    Phase 1 Complete

    Gesture Recognition

    21-point hand skeleton tracking via MediaPipe Hands running entirely in the browser. Recognizes 12 distinct gestures including thumbs up, peace, point, OK, and ILY — with sub-20ms classification latency. No wearables. No installation. Just a camera.
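The pose classification itself can be sketched on top of MediaPipe Hands' 21-landmark layout (index 0 is the wrist; fingertips sit at indices 4, 8, 12, 16, 20, with the joint below each tip at 3, 6, 10, 14, 18). The extension rule below, a finger counts as "extended" when its tip is farther from the wrist than the joint beneath it, is a deliberate simplification of a real classifier:

```javascript
// Pose-classification sketch over MediaPipe Hands' 21-landmark layout:
// 0 = wrist, fingertips at 4/8/12/16/20, the joint below each tip at
// 3/6/10/14/18. The extension test is a simplification, not Hero's model.
const TIPS = [4, 8, 12, 16, 20];   // thumb, index, middle, ring, pinky
const BELOW = [3, 6, 10, 14, 18];

const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);

// A finger is "extended" when its tip is farther from the wrist than
// the joint beneath it.
function extendedFingers(landmarks) {
  const wrist = landmarks[0];
  return TIPS.map((tip, i) =>
    dist(landmarks[tip], wrist) > dist(landmarks[BELOW[i]], wrist));
}

function classifyPose(landmarks) {
  const [thumb, index, middle, ring, pinky] = extendedFingers(landmarks);
  if (thumb && index && middle && ring && pinky) return 'open_hand';
  if (!thumb && !index && !middle && !ring && !pinky) return 'fist';
  if (!thumb && index && middle && !ring && !pinky) return 'peace';
  if (!thumb && index && !middle && !ring && !pinky) return 'point';
  return 'unknown';
}
```

Because landmarks arrive as normalized coordinates per frame, the same comparison works at any hand size or camera distance.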

    MediaPipe Hands Web Camera API Canvas 2D
    Open Demo →
    Phase 2 Live now

    Voice Intent

    Real-time voice command recognition using the Web Speech API. Hero matches spoken phrases against registered intent patterns using regular expression matching. Supports 10 built-in commands, auto-restart on silence, and a simulated waveform fallback when mic access is denied.

    Web Speech API Web Audio API RegExp Matching
    Open Demo →
    Phase 3 Live now

    Behavioral Signals

    JavaScript event listeners track keystroke timing, mouse velocity, and scroll rhythm to form a continuous behavioral stream. Patterns like hesitation, rapid scanning, and focused typing carry intent that neither gesture nor voice alone can provide. No camera. No microphone required.
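Mapping those timing features onto the behavioral states mentioned above might look like the sketch below. The thresholds and state names are illustrative assumptions, not Hero's actual model:

```javascript
// Behavioral-state sketch: raw timing features mapped to one of five
// states. Thresholds and names are illustrative assumptions.
function inferState({ meanKeyIntervalMs, scrollPxPerSec, idleMs }) {
  if (idleMs > 5000) return 'idle';
  if (scrollPxPerSec > 2000) return 'rapid_scanning';
  if (meanKeyIntervalMs !== null && meanKeyIntervalMs < 250) return 'focused_typing';
  if (meanKeyIntervalMs !== null && meanKeyIntervalMs > 1500) return 'hesitating';
  return 'browsing';
}
```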

    DOM Events API Keystroke Timing Scroll / Mouse Analysis
    Open Demo →
    Phase 4 Live Now

    Multi-Modal Fusion

    A unified intent engine that fuses gesture + voice signals weighted by confidence. When both channels agree, confidence is boosted. When they conflict, the dominant signal wins with a penalty. The fusion layer produces a single reliable intent output from the full picture of human input.
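The agreement-boost and conflict-penalty behavior described above can be sketched in a few lines. The specific factors (a 1.2× boost, a 0.7× penalty) are illustrative assumptions, not Hero's tuned values:

```javascript
// Fusion sketch: agreement boosts confidence, conflict lets the stronger
// channel win at a penalty. The 1.2 / 0.7 factors are illustrative.
function fuse(gesture, voice) {
  if (!gesture) return voice;
  if (!voice) return gesture;
  if (gesture.intent === voice.intent) {
    // Channels agree: boost the stronger confidence, capped at 1.0.
    const confidence = Math.min(1,
      Math.max(gesture.confidence, voice.confidence) * 1.2);
    return { intent: gesture.intent, confidence };
  }
  // Channels conflict: the dominant signal wins, with a penalty.
  const winner = gesture.confidence >= voice.confidence ? gesture : voice;
  return { intent: winner.intent, confidence: winner.confidence * 0.7 };
}
```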

    Fusion Engine Confidence Weighting Agreement Detection
    Open Fusion Demo →

    Technical Overview

    Project type: AI Research · Capstone · Computer Vision + NLP
    Phase 1 — Gesture: MediaPipe Hands · 21-point skeleton · in-browser · zero infrastructure
    Phase 2 — Voice: Web Speech API · live intent matching · Chrome + Edge · no server
    Phase 3 — Behavior: Keystroke timing · mouse velocity · scroll rhythm · 5 behavioral states · live
    Phase 4 — Fusion: Gesture + voice fusion engine · confidence weighting · agreement detection · live
    Target latency: < 20 ms end-to-end per channel
    Inference: 100% on-device · no cloud · no data collection
    Use cases: Accessibility · Hands-free HCI · AI / CV research

    Hero · Capstone Project · 2026

    In Development

    All four phases are live. Try the fusion engine — gesture and voice working together in real time.

    Try fusion demo View on GitHub

    Hero Desktop · Coming Soon

    A native app is
    on the way.

    Hero is currently a browser-based research demo. A native desktop app — with persistent gesture profiles, offline model caching, and system-level integration — is actively in development. Join the waitlist to be first in line.

    macOS: macOS 13 Ventura or later
    Windows: Windows 10 or later · x64
    Join the waitlist →

    Waitlist members get early access when the app ships.