AI Research · Capstone Project · 2026

The computer
that reads you.

Hero is an AI system that understands human intent through gesture, voice, and behavioral signals — not just what you type.

Try fusion demo Learn more ↓

What is Hero?

An AI system built to understand humans — not just their text.

Most human-computer interaction is still built around keyboards and clicks. But humans communicate with far more than keystrokes. We gesture. We speak. We hesitate. We move in ways that carry intent before a single character is typed.

Hero is a research project built to change that. It processes multiple real-time input channels — gesture, voice, and behavioral patterns — and translates them into structured, actionable intent. No cloud required. No accounts. Everything runs in the browser, on your device.

All four phases are live. Phase 1 captures hand gestures via MediaPipe Hands. Phase 2 recognizes spoken commands through the Web Speech API. Phase 3 reads behavioral patterns from keystroke timing, mouse velocity, and scroll rhythm. Phase 4 fuses all three channels into a unified intent engine with confidence weighting and agreement detection.

Computer Vision Natural Language Processing Human–Computer Interaction Accessibility On-Device AI Machine Learning

Input Architecture

Three signals.
One system.
Real-time intent.

Hero reads input across three parallel channels — gesture, voice, and behavioral patterns. Each captures a different dimension of how you interact. Fused together, they form a picture of intent that no single source could provide alone.

✓ thumbs_up conf: 0.97

Computers have always been blind to how humans actually behave.
Hero is a research project built to change that.

Three channels. One model.

Every signal,
understood.

Phase 1 · Live Now

Show Hero
your hand.

Hero tracks 21 hand landmarks in real time using MediaPipe Hands — running entirely in your browser, zero server required. Hold any gesture for one second to activate it. Designed for hands-free control and silent, nonverbal communication.
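The one-second hold works like a dwell timer: an intent fires only after the classifier has reported the same label continuously for the hold duration. A minimal sketch, assuming a per-frame label from the gesture classifier (class and method names here are illustrative, not Hero's actual API):

```javascript
// Dwell-activation sketch: an intent fires only after the same gesture
// label has been reported continuously for `holdMs` milliseconds.
// Names are illustrative; this is not Hero's actual API.
class DwellActivator {
  constructor(holdMs = 1000) {
    this.holdMs = holdMs;
    this.current = null;   // label currently being held
    this.since = 0;        // timestamp when the current hold started
    this.fired = false;    // each hold activates at most once
  }

  // Called once per frame with the classifier's label and a timestamp.
  // Returns the label when the hold completes, otherwise null.
  update(label, now) {
    if (label !== this.current) {
      this.current = label;
      this.since = now;
      this.fired = false;
      return null;
    }
    if (!this.fired && label !== null && now - this.since >= this.holdMs) {
      this.fired = true;
      return label; // activate
    }
    return null;
  }
}
```

Resetting the timer whenever the label changes means a flickering classification never activates anything; only a stable, deliberate gesture does.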

Recognized signals

✋ Open Hand ✊ Fist 👍 Thumbs Up 👎 Thumbs Down ☝️ Point ✌️ Peace 👌 OK 🤟 ILY 🤘 Rock On 🤙 Call Me 3️⃣ Three 4️⃣ Four
Fingers: Thumb · Index · Middle · Ring · Pinky
✋ Show Hands

Camera access is required for gesture tracking

Open Hand Tracking →

Signal history

    Phase 2 · Live Now

    Speak to Hero.
    It responds.

    Voice intent is the second channel Hero understands. Using the Web Speech API — running entirely in your browser, no server required — Hero matches what you say to registered actions in real time. No wake word. No cloud. Just your voice, interpreted locally.

    ⚠ Chrome and Edge only — Web Speech API is not available in Firefox or Safari.


    How it works

    From signal to intent
    in three steps.

    01

    Signal capture

    Hero reads from three input channels simultaneously. The camera captures 21 hand landmarks per frame through MediaPipe Hands. The microphone converts speech to text via the Web Speech API. JavaScript event listeners track keystroke timing, scroll velocity, and mouse movement as a continuous behavioral stream. Each channel runs independently, in real time, without leaving your device.

    MediaPipe Hands Web Speech API getUserMedia DOM Events API
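The behavioral channel described above reduces ordinary DOM events to a numeric stream. A minimal sketch, assuming the stream tracks inter-keystroke intervals and scroll velocity (names are illustrative, not Hero's actual API):

```javascript
// Behavioral-channel sketch: plain event timestamps reduced to a
// numeric stream. Names are illustrative; this is not Hero's actual API.
class BehaviorStream {
  constructor() {
    this.lastKeyAt = null;
    this.keyIntervals = [];   // ms between consecutive keystrokes
    this.lastScroll = null;   // { y, t } of the previous scroll sample
    this.scrollVelocity = 0;  // px per ms, most recent sample
  }

  onKeydown(now) {
    if (this.lastKeyAt !== null) this.keyIntervals.push(now - this.lastKeyAt);
    this.lastKeyAt = now;
  }

  onScroll(y, now) {
    if (this.lastScroll !== null) {
      const dt = now - this.lastScroll.t;
      if (dt > 0) this.scrollVelocity = (y - this.lastScroll.y) / dt;
    }
    this.lastScroll = { y, t: now };
  }
}

// In a browser this would be wired up roughly like:
//   const stream = new BehaviorStream();
//   window.addEventListener('keydown', e => stream.onKeydown(e.timeStamp));
//   window.addEventListener('scroll', () =>
//     stream.onScroll(window.scrollY, performance.now()));
```

Everything stays in memory; the raw events are discarded as soon as the derived numbers are computed.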
    02

    Classification

    Raw inputs are normalized and matched against Hero's intent layer. Gesture landmarks are compared against known pose configurations. Voice transcripts are matched against registered intent patterns using regular expression matching. Behavioral signals are analyzed for rhythm, velocity, and pause patterns to infer engagement state. Each match produces an intent label and a confidence score in milliseconds.

    Landmark Normalization RegExp Pattern Matching Confidence Scoring Pose Classification
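For the voice channel, the regular-expression matching described above can be sketched as a small intent table. The patterns and the confidence heuristic here are illustrative assumptions, not Hero's real intent registry:

```javascript
// Regex intent-matching sketch. Patterns and scoring are illustrative
// assumptions; Hero's real intent table may differ.
const intents = [
  { name: 'scroll_down', pattern: /\b(scroll|go)\s+down\b/i },
  { name: 'scroll_up',   pattern: /\b(scroll|go)\s+up\b/i },
  { name: 'open_camera', pattern: /\b(open|start)\s+(the\s+)?camera\b/i },
];

function classifyTranscript(transcript) {
  for (const intent of intents) {
    const m = transcript.match(intent.pattern);
    if (m) {
      // Crude confidence: the fraction of the transcript the match covers,
      // so short, exact commands score higher than matches buried in noise.
      const confidence = Math.min(1, m[0].length / transcript.trim().length);
      return { intent: intent.name, confidence };
    }
  }
  return { intent: null, confidence: 0 };
}
```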
    03

    Intent output

    The classified intent is dispatched to the appropriate handler — scrolling, navigation, toggling UI state, or triggering custom actions. The entire pipeline runs in under 20 milliseconds. The Phase 4 fusion engine combines gesture and voice signals using confidence weighting and agreement detection, producing a single reliable intent output from both channels simultaneously.

    Intent Dispatch < 20ms Latency Multi-modal Fusion · Live
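Dispatch itself can be as simple as a map from intent labels to handlers, with a confidence floor so low-quality classifications are dropped rather than misfired. A sketch, with illustrative handler names:

```javascript
// Intent-dispatch sketch: classified intents routed to handlers via a
// plain map. Handler names are illustrative, not Hero's action registry.
const handlers = {
  scroll_down: () => 'scrolled down',
  scroll_up:   () => 'scrolled up',
  toggle_menu: () => 'menu toggled',
};

function dispatch(result, minConfidence = 0.6) {
  // Drop low-confidence or unrecognized classifications instead of misfiring.
  if (!result.intent || result.confidence < minConfidence) return null;
  const handler = handlers[result.intent];
  return handler ? handler() : null;
}
```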
    Gesture (camera) · Voice (mic) · Behavior (events) → Classification → Intent + confidence

    Capabilities

    Real-Time Gesture Recognition

    MediaPipe Hands tracks 21 landmarks per frame through your device's camera. Gestures are classified in milliseconds — no wearables, no special hardware, no internet. Just your hand and the model.

    Voice Intent

    The Web Speech API converts spoken commands into structured intent. Hero matches what you say against registered patterns and triggers actions instantly — all on-device, in supported browsers.

    Behavioral Pattern Reading

    Keystrokes, mouse paths, scroll velocity, and idle time form a continuous stream. Hero reads that stream to build context about focus, hesitation, and intent — without any camera or microphone.

    Built for Accessibility

    Hands-free, keyboard-free, and voice-only interaction has been an unsolved problem for too long. Hero is designed from the ground up to give everyone a natural, direct way to control their device — regardless of how they're able to interact with it.

    Fully On-Device

    No data is sent to any server. Every signal Hero processes — camera frames, voice audio, keystrokes — stays on your device. Local inference means no accounts, no telemetry, and no latency from the network.

    Input signals → Intent class · conf 0.97

    Signal Pipeline

    Input arrives.
    Intent emerges.
    Action follows.

    Raw inputs — gesture coordinates, spoken text, keystroke intervals — are normalized and passed through Hero's classification model. The output is an intent label and confidence score, produced in under 20 milliseconds, entirely on your device.

    Real-Time Design

    Fast enough
    to feel like
    an extension of you.

    Real-time means the response arrives before the action completes — not after. Hero targets classification within a single interaction frame. That constraint shapes every architectural decision in the project.

    Capture · 3 ms
    Preprocess · 2 ms
    Inference · 8 ms
    Post-process · 2 ms
    Total · < 16 ms

    Face Expression Reading

    Your face is
    already speaking.
    Hero listens.

    Using MediaPipe FaceMesh, Hero maps 468 facial landmarks in real time — tracking micro-expressions, brow movement, and eye state to read emotional context alongside your other inputs. Nothing leaves your device.

    ✓ focused brow_raise: 0.83 landmarks: 468

    Human communication
    has never been
    just text.

    Gesture, voice, and behavioral rhythm have always carried meaning. Hero is a research project exploring how to give computers the ability to understand that meaning — in real time, on-device, and without compromise.

    Privacy & Security

    Everything runs on
    your device. Always.

    Hero was built with one hard constraint: your camera feed, microphone audio, and behavioral patterns never leave your browser. Not compressed, not anonymized, not sampled — just never sent.

    🔒

    Zero network calls for inference

    Every model, classifier, and intent decision runs locally in your browser via WebAssembly. Hero makes no outbound requests during recognition. Open your network tab — you'll see nothing.

    📵

    No backend. No accounts.

    There is no server receiving your data. No login, no analytics pipeline, no telemetry. Hero is a static site — it runs like a calculator, entirely on your machine.

    🎥

    Camera frames stay local

    Your video feed is processed frame-by-frame by MediaPipe running in WebAssembly. No frame is encoded, stored, or accessible outside your current browser tab. The moment you close it, it's gone.

    🎙️

    No voice data reaches Hero

    Voice recognition uses the browser's built-in Web Speech API. Depending on the browser, recognition may run on-device or through the platform's own speech service — but no audio clip, transcript, or partial phrase is ever transmitted to Hero. There are no Hero servers to receive it.

    🖱️

    Behavioral data is ephemeral

    Keystroke timing, mouse velocity, and scroll patterns are computed in memory and discarded immediately after classification. Nothing is logged. No behavioral profile is built or persisted between sessions.

    🔍

    Fully inspectable

    Every classification rule, sensor handler, and data path is readable in your browser's DevTools right now. No obfuscation. No hidden endpoints. What you see is exactly what runs.

    No server calls No data stored No accounts No tracking 100% on-device AI Open source

    Project Phases

    Built in the open,
    one phase at a time.

    Hero is a long-term research project developed as a software capstone. Each phase adds a new input modality. The goal is a unified, multi-modal intent model that understands humans the way humans understand each other.

    Phase 1 Complete

    Gesture Recognition

    21-point hand skeleton tracking via MediaPipe Hands running entirely in the browser. Recognizes 12 distinct gestures including thumbs up, peace, point, OK, and ILY — with sub-20ms classification latency. No wearables. No installation. Just a camera.
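The pose classification itself can be sketched on top of MediaPipe Hands' 21-landmark layout (index 0 is the wrist; fingertips sit at indices 4, 8, 12, 16, 20, with the joint below each tip at 3, 6, 10, 14, 18). The extension rule below, a finger counts as "extended" when its tip is farther from the wrist than the joint beneath it, is a deliberate simplification of a real classifier:

```javascript
// Pose-classification sketch over MediaPipe Hands' 21-landmark layout:
// 0 = wrist, fingertips at 4/8/12/16/20, the joint below each tip at
// 3/6/10/14/18. The extension test is a simplification, not Hero's model.
const TIPS = [4, 8, 12, 16, 20];   // thumb, index, middle, ring, pinky
const BELOW = [3, 6, 10, 14, 18];

const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);

// A finger is "extended" when its tip is farther from the wrist than
// the joint beneath it.
function extendedFingers(landmarks) {
  const wrist = landmarks[0];
  return TIPS.map((tip, i) =>
    dist(landmarks[tip], wrist) > dist(landmarks[BELOW[i]], wrist));
}

function classifyPose(landmarks) {
  const [thumb, index, middle, ring, pinky] = extendedFingers(landmarks);
  if (thumb && index && middle && ring && pinky) return 'open_hand';
  if (!thumb && !index && !middle && !ring && !pinky) return 'fist';
  if (!thumb && index && middle && !ring && !pinky) return 'peace';
  if (!thumb && index && !middle && !ring && !pinky) return 'point';
  return 'unknown';
}
```

Because landmarks arrive as normalized coordinates per frame, the same comparison works at any hand size or camera distance.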

    MediaPipe Hands Web Camera API Canvas 2D
    Open Demo →
    Phase 2 Live now

    Voice Intent

    Real-time voice command recognition using the Web Speech API. Hero matches spoken phrases against registered intent patterns using regular expression matching. Supports 10 built-in commands, auto-restart on silence, and a simulated waveform fallback when mic access is denied.

    Web Speech API Web Audio API RegExp Matching
    Open Demo →
    Phase 3 Live now

    Behavioral Signals

    JavaScript event listeners track keystroke timing, mouse velocity, and scroll rhythm to form a continuous behavioral stream. Patterns like hesitation, rapid scanning, and focused typing carry intent that neither gesture nor voice alone can provide. No camera. No microphone required.
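Mapping those timing features onto the behavioral states mentioned above might look like the sketch below. The thresholds and state names are illustrative assumptions, not Hero's actual model:

```javascript
// Behavioral-state sketch: raw timing features mapped to one of five
// states. Thresholds and names are illustrative assumptions.
function inferState({ meanKeyIntervalMs, scrollPxPerSec, idleMs }) {
  if (idleMs > 5000) return 'idle';
  if (scrollPxPerSec > 2000) return 'rapid_scanning';
  if (meanKeyIntervalMs !== null && meanKeyIntervalMs < 250) return 'focused_typing';
  if (meanKeyIntervalMs !== null && meanKeyIntervalMs > 1500) return 'hesitating';
  return 'browsing';
}
```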

    DOM Events API Keystroke Timing Scroll / Mouse Analysis
    Open Demo →
    Phase 4 Live Now

    Multi-Modal Fusion

    A unified intent engine that fuses gesture + voice signals weighted by confidence. When both channels agree, confidence is boosted. When they conflict, the dominant signal wins with a penalty. The fusion layer produces a single reliable intent output from the full picture of human input.
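The agreement-boost and conflict-penalty behavior described above can be sketched in a few lines. The specific factors (a 1.2× boost, a 0.7× penalty) are illustrative assumptions, not Hero's tuned values:

```javascript
// Fusion sketch: agreement boosts confidence, conflict lets the stronger
// channel win at a penalty. The 1.2 / 0.7 factors are illustrative.
function fuse(gesture, voice) {
  if (!gesture) return voice;
  if (!voice) return gesture;
  if (gesture.intent === voice.intent) {
    // Channels agree: boost the stronger confidence, capped at 1.0.
    const confidence = Math.min(1,
      Math.max(gesture.confidence, voice.confidence) * 1.2);
    return { intent: gesture.intent, confidence };
  }
  // Channels conflict: the dominant signal wins, with a penalty.
  const winner = gesture.confidence >= voice.confidence ? gesture : voice;
  return { intent: winner.intent, confidence: winner.confidence * 0.7 };
}
```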

    Fusion Engine Confidence Weighting Agreement Detection
    Open Fusion Demo →

    Technical Overview

    Project type: AI Research · Capstone · Computer Vision + NLP
    Phase 1 — Gesture: MediaPipe Hands · 21-point skeleton · in-browser · zero infrastructure
    Phase 2 — Voice: Web Speech API · live intent matching · Chrome + Edge · no server
    Phase 3 — Behavior: Keystroke timing · mouse velocity · scroll rhythm · 5 behavioral states · live
    Phase 4 — Fusion: Gesture + voice fusion engine · confidence weighting · agreement detection · live
    Target latency: < 20 ms end-to-end per channel
    Inference: 100% on-device · no cloud · no data collection
    Use cases: Accessibility · Hands-free HCI · AI / CV research

    Hero · Capstone Project · 2026

    In Development

    All four phases are live. Try the fusion engine — gesture and voice working together in real time.

    Try fusion demo View on GitHub

    Hero Desktop · Coming Soon

    A native app is
    on the way.

    Hero is currently a browser-based research demo. A native desktop app — with persistent gesture profiles, offline model caching, and system-level integration — is actively in development. Join the waitlist to be first in line.

    macOS: macOS 13 Ventura or later
    Windows: Windows 10 or later · x64
    Join the waitlist →

    Waitlist members get early access when the app ships.