NAOMI PROJECT

by Tobin M. Albanese

PORTFOLIO — PROJECT Tue Mar 05 2024

Overview. NAOMI ingests live or recorded video and analyzes rapid facial micro-movements to surface likely states such as stress, confidence, or deception. Outputs include per-frame landmarks, temporal deltas, and calibrated confidence scores designed for human-in-the-loop review.

Why this matters. Human evaluators miss sub-second changes under time pressure. NAOMI provides a second set of eyes: high-frequency, explainable cues that assist (not replace) judgment in interviews, investigations, user research, and training simulations.

Objectives.

  • Real-time overlays for analysts (< 40 ms latency).
  • Explainable outputs: action units (AUs), heatmaps, and per-window deltas.
  • Robustness to pose, occlusion, and lighting shifts.
  • Auditable traces + privacy-first design.

Architecture. Modular pipeline with two paths:

  • Edge (browser/WebRTC or desktop): detector + lightweight temporal model.
  • Server (GPU): high-accuracy models + batch analytics + storage.

Components: capture → preprocess → landmark/feature → temporal model → intent head → calibration → reporting.
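
The sketch below illustrates how these components could be chained; the Stage type and run_pipeline helper are illustrative names, not the project's actual API.

# Minimal sketch of the stage composition described above (illustrative names,
# not the project's actual API): each component is a callable that transforms
# a shared per-frame context dict, and the pipeline is their ordered composition.
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(stages: List[Stage], frame_ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Run one frame's context through capture -> ... -> reporting."""
    for stage in stages:
        frame_ctx = stage(frame_ctx)
    return frame_ctx

# Hypothetical stage: normalize the raw frame before landmarking.
def preprocess(ctx: Dict[str, Any]) -> Dict[str, Any]:
    ctx["frame"] = ctx["frame"] / 255.0  # scale pixel values to [0, 1]
    return ctx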

Signal pipeline. Face detection → landmark tracking → micro-window differencing (tens of ms) → temporal model (BiLSTM/Temporal-CNN) → intent classifier. The pipeline is resilient to jitter via smoothing and occlusion handling.
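
The snippet below sketches the temporal-model stage in PyTorch (which is in the stack); the layer sizes, class name, and three-way intent head are assumptions rather than the production architecture.

# Illustrative temporal model: a BiLSTM over per-window landmark deltas feeding
# a small intent head. Dimensions and names are assumptions, not the shipped model.
import torch
import torch.nn as nn

class TemporalIntentModel(nn.Module):
    def __init__(self, n_landmarks: int = 106, hidden: int = 128, n_classes: int = 3):
        super().__init__()
        # Each timestep is the flattened (dx, dy) deltas for every landmark.
        self.lstm = nn.LSTM(
            input_size=n_landmarks * 2,
            hidden_size=hidden,
            num_layers=1,
            batch_first=True,
            bidirectional=True,
        )
        self.intent_head = nn.Linear(2 * hidden, n_classes)  # stress / confidence / deception

    def forward(self, deltas: torch.Tensor) -> torch.Tensor:
        # deltas: (batch, time, n_landmarks * 2)
        seq, _ = self.lstm(deltas)
        return self.intent_head(seq[:, -1, :])  # logits from the last timestep

logits = TemporalIntentModel()(torch.randn(4, 16, 212))  # 4 clips, 16 micro-windows each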

Features.

  • Landmarks: 68- or 106-point sets, normalized to a canonical frame.
  • Micro-deltas: ∆x/∆y per landmark over 30–80 ms windows (see the sketch after this list).
  • AUs: estimated via regression over localized facial regions.
  • Stabilization: optical-flow-aided tracking + EMA smoothing.
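
The sketch below shows how the micro-delta and EMA-smoothing features above could be computed with NumPy; the smoothing factor and window size are placeholder values, not tuned project settings.

# Per-landmark micro-deltas with EMA smoothing (NumPy sketch; the 0.6 alpha and
# 2-frame window are illustrative placeholders).
import numpy as np

def ema_smooth(landmarks: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """landmarks: (frames, points, 2) -> exponentially smoothed copy."""
    smoothed = landmarks.copy()
    for t in range(1, len(landmarks)):
        smoothed[t] = alpha * landmarks[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

def micro_deltas(landmarks: np.ndarray, window: int = 2) -> np.ndarray:
    """Delta-x/Delta-y per landmark across a short window of frames."""
    smoothed = ema_smooth(landmarks)
    return smoothed[window:] - smoothed[:-window]

deltas = micro_deltas(np.random.rand(90, 106, 2))   # ~3 s of 30 fps video
rms = np.sqrt((deltas ** 2).mean())                 # cf. microDeltaRMS in the API excerpt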

Modeling & training. Transfer learning from expression corpora with domain adaptation. Emphasis on label hygiene, inter-rater reliability, and post-hoc calibration (Platt/temperature scaling) to keep scores honest.
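
As a concrete illustration of the temperature-scaling step, the sketch below fits a single scalar temperature on held-out logits; the optimizer settings are assumptions.

# Post-hoc temperature scaling: fit one scalar T on a held-out set so that
# softmax(logits / T) is better calibrated. Settings here are illustrative.
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (n, classes), labels: (n,) -> fitted temperature T."""
    log_t = torch.zeros(1, requires_grad=True)          # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())

# Usage: probs = torch.softmax(val_logits / fit_temperature(val_logits, val_labels), dim=-1)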

Data & labeling. Mix of public corpora + synthetic augmentations (pose, illumination, occlusion). Dual-rater labels with adjudication. All datasets tracked via DVC with provenance.
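
One way the dual-rater agreement could be screened before adjudication is Cohen's kappa, sketched below; scikit-learn is assumed here and is not part of the listed stack.

# Illustrative check of dual-rater agreement via Cohen's kappa before adjudication.
from sklearn.metrics import cohen_kappa_score

rater_a = ["stress", "neutral", "stress", "deception", "neutral"]
rater_b = ["stress", "neutral", "neutral", "deception", "neutral"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")   # clips below an agreement threshold go to adjudication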

Evaluation.

  • Latency: 25–38 ms per frame on an RTX-class GPU; 12–18 fps on modern laptops (edge path).
  • Robustness: handles ±15° of yaw/pitch without significant drift.
  • Calibration: ECE ≤ 0.06 after temperature scaling (see the sketch after this list).
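
The ECE figure above could be computed as sketched below; the ten-bin layout is a common convention and not necessarily the exact evaluation protocol used here.

# Expected Calibration Error over equal-width confidence bins (NumPy sketch).
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """probs: (n, classes) calibrated probabilities, labels: (n,) true class ids."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece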

Interface. Analysts see overlays (action units, heatmaps), clip scrubbing, and frame-level charts. Batch mode supports bulk uploads; live mode supports WebRTC streams with adjustable sampling rate.

API (excerpt).

{
  "frame": 1532,
  "landmarks": [[x,y], ...],
  "au": { "AU01": 0.22, "AU04": 0.61, ... },
  "microDeltaRMS": 0.047,
  "intent": { "stress": 0.73, "confidence": 0.18, "deception": 0.09 },
  "calibrated": true,
  "timestamp": "2024-03-05T21:11:10Z"
}
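
The sketch below shows one way a client could parse a per-frame record like the excerpt; the FrameResult wrapper is illustrative, while the field names follow the excerpt.

# Parsing one per-frame record like the excerpt above into a typed structure.
import json
from dataclasses import dataclass

@dataclass
class FrameResult:
    frame: int
    intent: dict           # e.g. {"stress": 0.73, "confidence": 0.18, "deception": 0.09}
    micro_delta_rms: float
    calibrated: bool

def parse_frame(raw: str) -> FrameResult:
    record = json.loads(raw)
    return FrameResult(
        frame=record["frame"],
        intent=record["intent"],
        micro_delta_rms=record["microDeltaRMS"],
        calibrated=record["calibrated"],
    )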

Deployment.

  • Edge: WebAssembly + WebGL; no video leaves device.
  • Server: gRPC/REST; autoscaled GPU workers; object storage for traces (a minimal REST sketch follows this list).
  • CI/CD: GitHub Actions → container registry → IaC templates.
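
The sketch below outlines the server REST path with FastAPI (from the stack); the /analyze route and request model are illustrative, not the service's real contract.

# Minimal sketch of the server REST path using FastAPI; route and schema are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class FrameRequest(BaseModel):
    frame: int
    landmarks: list  # [[x, y], ...] as in the API excerpt

@app.post("/analyze")
def analyze(req: FrameRequest) -> dict:
    # A real handler would run the temporal model + calibration on a GPU worker.
    return {"frame": req.frame, "intent": {"stress": 0.0, "confidence": 0.0, "deception": 0.0}}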

Security & privacy. Role-based access, encryption at rest/in transit, and retention controls. Designed to be auditable with exportable inference traces. Optional on-device only mode (no upload).

Limitations. Sensitive to extreme occlusion; cultural and personal variation requires careful thresholding. The system is an indicator — never a sole determiner of intent.

Ethics. Human-in-the-loop only. Clear consent, documented purpose limits, and transparency about error rates. Red-team reviews for misuse scenarios.

Roadmap.

  • Multimodal fusion (voice prosody, keystroke dynamics).
  • Improved self-supervised pretraining for low-light video.
  • Edge-optimized transformer with distillation.
  • Richer analyst notes → weak labels for continual learning.

Stack. PyTorch, ONNX, OpenCV, WebRTC, FastAPI/gRPC, React, D3, Docker, Terraform.

