
NAOMI PROJECT
by Tobin M. Albanese
PORTFOLIO PROJECT, March 5, 2024
Overview. NAOMI ingests live or recorded video and analyzes rapid facial micro-movements to surface likely states such as stress, confidence, or deception. Outputs include per-frame landmarks, temporal deltas, and calibrated confidence scores designed for human-in-the-loop review.

Why this matters. Human evaluators miss sub-second changes under time pressure. NAOMI provides a second set of eyes: high-frequency, explainable cues that assist (not replace) judgment in interviews, investigations, user research, and training simulations.

Objectives.
- Real-time (<40ms latency) overlays for analysts.
- Explainable outputs: action units (AUs), heatmaps, and per-window deltas.
- Robustness to pose, occlusion, and lighting shifts.
- Auditable traces + privacy-first design.

Architecture. Modular pipeline with two paths:
- Edge (browser/WebRTC or desktop): detector + lightweight temporal model.
- Server (GPU): high-accuracy models + batch analytics + storage.
Components: capture → preprocess → landmark/feature extraction → temporal model → intent head → calibration → reporting.
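
For illustration, a minimal sketch of how these stages compose, assuming each stage is a plain callable; the stage names in the commented example are placeholders, not the project's actual module names.

# Minimal sketch of stage composition (illustrative names, not the real modules).
from dataclasses import dataclass, field
from typing import Any, Callable, List

Stage = Callable[[Any], Any]

@dataclass
class Pipeline:
    stages: List[Stage] = field(default_factory=list)

    def __call__(self, frame: Any) -> Any:
        out = frame
        for stage in self.stages:
            out = stage(out)  # each stage consumes the previous stage's output
        return out

# Hypothetical stage order mirroring the component list above:
# pipeline = Pipeline([capture, preprocess, extract_landmarks,
#                      temporal_model, intent_head, calibrate, report])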

Signal pipeline. Face detection → landmark tracking → micro-window differencing (tens of ms) → temporal model (BiLSTM/Temporal-CNN) → intent classifier. The pipeline is resilient to jitter via smoothing and occlusion handling.
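
A minimal sketch of the micro-window differencing step under assumed array shapes; the function name and the 50ms default window are illustrative, not the production values.

# Micro-window differencing over a tracked landmark series:
# landmarks of shape (T, N, 2) -> per-frame RMS displacement across one micro-window.
import numpy as np

def micro_window_deltas(landmarks: np.ndarray, fps: float, window_ms: float = 50.0) -> np.ndarray:
    """landmarks: (T, N, 2) tracked points; returns an RMS delta per frame."""
    step = max(1, int(round(window_ms * fps / 1000.0)))      # frames spanned by one micro-window
    diffs = landmarks[step:] - landmarks[:-step]              # Δx/Δy per landmark across the window
    rms = np.sqrt((diffs ** 2).sum(axis=-1).mean(axis=-1))    # one scalar per frame
    return rms

# e.g. at 30fps a 50ms window spans ~2 frames; the RMS series feeds the temporal model.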

Features.
- Landmarks: 68- or 106-point sets normalized to a canonical frame.
- Micro-deltas: ∆x/∆y per landmark in 30–80ms windows.
- AUs: estimated via regression over localized regions.
- Stabilization: optical-flow aided tracking + EMA smoothing.
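
A minimal sketch of the EMA smoothing used for stabilization; the alpha value and array shapes are assumptions.

# Exponential moving average over landmark tracks to damp tracker jitter.
import numpy as np

def ema_smooth(landmarks: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """landmarks: (T, N, 2); EMA along the time axis."""
    smoothed = np.empty_like(landmarks, dtype=float)
    smoothed[0] = landmarks[0]
    for t in range(1, len(landmarks)):
        # blend the new observation with the running estimate
        smoothed[t] = alpha * landmarks[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed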

Modeling & training. Transfer learning from expression corpora with domain adaptation. Emphasis on label hygiene, inter-rater reliability, and post-hoc calibration (Platt/temperature scaling) to keep scores honest.

Data & labeling. Mix of public corpora + synthetic augmentations (pose, illumination, occlusion). Dual-rater labels with adjudication. All datasets tracked via DVC with provenance.

Evaluation.
- Latency: 25–38ms per frame on RTX-class GPU; 12–18fps on modern laptops (edge path).
- Robustness: ±15° yaw/pitch without significant drift.
- Calibration: ECE ≤ 0.06 after temperature scaling.
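
A minimal sketch of the expected calibration error (ECE) computation behind the ≤ 0.06 figure; 15 equal-width confidence bins is an assumption.

# ECE: population-weighted gap between confidence and accuracy per bin.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """confidences: (N,) predicted max-class probabilities; correct: (N,) 0/1 hits."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())  # |confidence - accuracy|
            ece += mask.mean() * gap                                    # weight by bin population
    return float(ece)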

Interface. Analysts see overlays (action units, heatmaps), clip scrubbing, and frame-level charts. Batch mode supports bulk uploads; live mode supports WebRTC streams with adjustable sampling rate.

API (excerpt).
{ "frame": 1532, "landmarks": [[x,y], ...], "au": { "AU01": 0.22, "AU04": 0.61, ... }, "microDeltaRMS": 0.047, "intent": { "stress": 0.73, "confidence": 0.18, "deception": 0.09 }, "calibrated": true, "timestamp": "2024-03-05T21:11:10Z" }

Deployment.
- Edge: WebAssembly + WebGL; no video leaves device.
- Server: gRPC/REST; autoscaled GPU workers; object storage for traces.
- CI/CD: GitHub Actions → container registry → IaC templates.

Security & privacy. Role-based access, encryption at rest/in transit, and retention controls. Designed to be auditable with exportable inference traces. Optional on-device only mode (no upload).

Limitations. Sensitive to extreme occlusion; cultural and personal variation requires careful thresholding. The system is an indicator, never the sole determiner of intent.

Ethics. Human-in-the-loop only. Clear consent, documented purpose limits, and transparency about error rates. Red-team reviews for misuse scenarios.

Roadmap.
- Multimodal fusion (voice prosody, keystroke dynamics).
- Improved self-supervised pretraining for low-light video.
- Edge-optimized transformer with distillation.
- Richer analyst notes → weak labels for continual learning.

Stack. PyTorch, ONNX, OpenCV, WebRTC, FastAPI/gRPC, React, D3, Docker, Terraform.