BEHAVIORAL AI FOR THREAT DETECTION

by Tobin M. Albanese

Research Project, Volume 1 (November 1, 2023)

Measure what helps a human decide, not what flatters a metric.


Abstract

This work investigates whether brief facial events—micro-expressions and short Action Unit (AU) sequences—can serve as probabilistic signals that augment existing threat models. The ambition is deliberately narrow: to provide calibrated, low-confidence prompts that help human operators decide when to look closer, not to build machines that claim certainty or infer intent. We treat model output as a nudge to interrogate context, never as a conclusion. The practical question is whether weak, time-bounded facial dynamics, when measured and calibrated correctly, can improve screening precision without inflating risk, bias, or overreach.

Success here looks like fewer unnecessary escalations at the same safety level (or better safety at the same workload), plus transparent documentation of where the approach works, where it fails, and how to retire it responsibly if the costs outweigh the benefits.

Problem & Scope

Protective and screening workflows operate under severe constraints: limited time, incomplete information, asymmetric costs for false positives and false negatives, and intense public scrutiny. In that setting, a tiny signal that occasionally highlights the right frame or moment—if it is honest about uncertainty—can be useful. Our scope therefore limits the system to research, evaluation, and corroborative prompting. The tool may flag segments for human review, but it cannot diagnose, attribute identity or intent, or trigger punitive action on its own.

We draw bright boundaries: prohibited uses include identity inference, face recognition, attribution of motive, or any downstream decision that meaningfully affects a person without independent corroboration. All contemplated uses must be documented with task definitions, accountable owners, and cost tables that make trade-offs explicit and reviewable.

Data & Labeling

Data collection is consent-based and contextual. Participants receive a plain-language overview of what is captured, why it is captured, how long it is kept, and how to opt out or request deletion. We avoid covert capture, “gotcha” designs, and any scenario that would surprise a reasonable participant. Every clip is paired with metadata describing lighting, camera, scenario, and timing so we can analyze confounders.

Labels acknowledge uncertainty. Multiple trained raters annotate AU events and micro-expressive segments with time bounds and confidence scores. Disagreements are adjudicated using a written rubric; we report inter-rater reliability (e.g., Krippendorff’s α) alongside the dataset so consumers can see label noise rather than having it hidden. Ambiguous segments remain ambiguous: we mark them as such instead of forcing consensus that the data do not support.
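
For concreteness, a minimal sketch of that reliability report, computing nominal Krippendorff’s α directly from a raters × segments matrix with missing ratings; the function name and toy data are illustrative assumptions, not the project’s actual tooling.

```python
import numpy as np

def krippendorff_alpha_nominal(ratings):
    """ratings: raters x units array; np.nan marks a missing rating."""
    ratings = np.asarray(ratings, dtype=float)
    values = np.unique(ratings[~np.isnan(ratings)])
    index = {v: i for i, v in enumerate(values)}
    k = len(values)

    # Coincidence matrix: ordered value pairs within each unit, weighted by 1/(m-1).
    o = np.zeros((k, k))
    for unit in ratings.T:
        unit = unit[~np.isnan(unit)]
        m = len(unit)
        if m < 2:
            continue  # a unit with a single rating contributes no pairs
        for i in range(m):
            for j in range(m):
                if i != j:
                    o[index[unit[i]], index[unit[j]]] += 1.0 / (m - 1)

    n_c = o.sum(axis=1)                         # per-category marginals
    n = n_c.sum()                               # total pairable values
    d_o = n - np.trace(o)                       # observed disagreement (off-diagonal mass)
    d_e = (n * n - (n_c ** 2).sum()) / (n - 1)  # expected disagreement
    return 1.0 - d_o / d_e if d_e > 0 else 1.0

# Toy example: two raters, four segments, one missing rating.
print(krippendorff_alpha_nominal([[1, 1, 0, np.nan],
                                  [1, 0, 0, 1]]))
```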

Privacy controls include pre-defined retention windows, role-based access, encrypted storage, audit logs for every access, and documented deletion routes. Identity inference is excluded by design; we do not store or link identifying attributes beyond what is required for consent management.
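Below is a sketch of how those controls can be made machine-readable; the retention windows, role names, and artifact labels are hypothetical placeholders, with the real schema living in the governance artifacts.

```python
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass(frozen=True)
class RetentionPolicy:
    # Pre-defined retention windows (placeholder values).
    raw_video: timedelta = timedelta(days=30)
    derived_features: timedelta = timedelta(days=180)
    consent_records: timedelta = timedelta(days=365 * 3)

@dataclass(frozen=True)
class AccessPolicy:
    # Role-based access: role -> artifacts that role may read (placeholder roles).
    roles: dict = field(default_factory=lambda: {
        "annotator": {"clips", "labels"},
        "researcher": {"features", "labels", "metadata"},
        "auditor": {"audit_log", "consent_records"},
    })

def authorize(role: str, artifact: str, policy: AccessPolicy, audit_log: list) -> bool:
    """Every access attempt is logged, whether or not it is allowed."""
    allowed = artifact in policy.roles.get(role, set())
    audit_log.append({"role": role, "artifact": artifact, "allowed": allowed})
    return allowed
```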

Signals & Modeling

Features emphasize dynamics over static appearance. We analyze short temporal windows (≈300–800 ms), onset/offset velocity, co-occurrence patterns among AUs, and simple temporal motifs (e.g., AU-12→AU-15 within a small lag). Landmarks are stabilized to reduce camera jitter; low-confidence frames are down-weighted or dropped.
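
A minimal sketch of one such motif feature, assuming per-frame AU intensities and tracker confidences at 30 fps; the lag limit and thresholds are placeholders, not tuned values.

```python
import numpy as np

FPS = 30            # assumed frame rate
MAX_LAG_MS = 500    # assumed "small lag" between the two onsets
CONF_FLOOR = 0.6    # frames below this tracker confidence are dropped

def onsets(intensity, confidence, threshold=0.5):
    """Frame indices where a confidence-gated AU intensity crosses threshold upward."""
    x = np.where(np.asarray(confidence) >= CONF_FLOOR, np.asarray(intensity), 0.0)
    above = x >= threshold
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1

def au12_then_au15(au12, au15, conf12, conf15):
    """1.0 if any AU-12 onset is followed by an AU-15 onset within MAX_LAG_MS."""
    max_lag = int(MAX_LAG_MS * FPS / 1000)
    o12, o15 = onsets(au12, conf12), onsets(au15, conf15)
    return float(any(0 < (t15 - t12) <= max_lag for t12 in o12 for t15 in o15))
```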

Models produce probabilities, not verdicts. After training, we calibrate scores (isotonic or Platt) so that a predicted 0.30 behaves like “~30% of similar cases were positive” in evaluation. Outputs are grouped into bands intended for operational interpretation (e.g., “no action,” “log and move on,” “consider secondary review if another signal agrees”). We purposely avoid a single magic threshold and instead map bands to actions that reflect context and cost.
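
A minimal sketch of the calibration-and-banding step using scikit-learn’s isotonic regression; the band edges shown are placeholders, since the real edges come from the documented cost tables.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores_val, labels_val):
    """Fit on a held-out split so a predicted 0.30 means ~30% positives there."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_scores_val, labels_val)
    return iso

BANDS = [  # (upper probability edge, operator-facing action); edges are assumed
    (0.10, "no action"),
    (0.30, "log and move on"),
    (1.00, "consider secondary review if another signal agrees"),
]

def to_band(p):
    """Map a calibrated probability to its action band."""
    for edge, action in BANDS:
        if p <= edge:
            return action

# Usage (toy arrays): iso = fit_calibrator(raw_val, y_val)
#                     print(to_band(iso.predict(np.array([0.72]))[0]))
```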

Evaluation Approach

We report performance with the prevalence of the event (base rate) front and center. AUROC and AUPRC are included, but we also publish calibration error (ECE), coverage (what fraction of cases the model is willing to score with confidence), and decision-curve analyses tied to the documented cost tables. Thresholds are selected to minimize expected cost, not to maximize a headline metric.
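
A sketch of two of those report ingredients, expected calibration error over equal-width bins and a cost-minimizing threshold; the bin count and example costs are assumptions.

```python
import numpy as np

def expected_calibration_error(p, y, n_bins=10):
    """Occupancy-weighted average |observed positive rate - mean confidence| per bin."""
    p, y = np.asarray(p, dtype=float), np.asarray(y)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return ece

def min_cost_threshold(p, y, cost_fp=1.0, cost_fn=20.0):
    """Pick the cutoff that minimizes expected cost on evaluation data (assumed costs)."""
    p, y = np.asarray(p, dtype=float), np.asarray(y)
    thresholds = np.unique(p)
    costs = [cost_fp * ((p >= t) & (y == 0)).sum()
             + cost_fn * ((p < t) & (y == 1)).sum() for t in thresholds]
    return thresholds[int(np.argmin(costs))]
```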

To test real-world durability, we evaluate across cameras, focal lengths, lighting regimes, and environments, and we use time-based splits so we can see performance drift as conditions change. We also run “selective prediction” baselines (abstain when unsure) to compare “say less, say it better” policies against always-on scoring.
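
A sketch of the selective-prediction baseline, abstaining when the calibrated probability sits in an uncertain middle band and reporting coverage alongside precision on the cases that are scored; the abstention edges are illustrative.

```python
import numpy as np

def selective_report(p, y, low=0.2, high=0.8):
    """Score only confident cases; return (coverage, precision on the scored cases)."""
    p, y = np.asarray(p, dtype=float), np.asarray(y)
    confident = (p <= low) | (p >= high)     # abstain on the uncertain middle
    coverage = confident.mean()
    if not confident.any():
        return coverage, float("nan")
    pred_pos = p[confident] >= high
    if not pred_pos.any():
        return coverage, float("nan")
    precision = (y[confident][pred_pos] == 1).mean()
    return coverage, precision
```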

Bias & Robustness

Average metrics can hide harm. We therefore slice results by lighting, camera sensor type, skin-tone ranges, head pose, and occlusions (e.g., glasses, masks). We publish the deltas—not just the averages—so gaps are visible. Where gaps appear, we experiment with data balancing, confidence-weighted training, and abstention rules that refuse to score in known failure zones.
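
A sketch of the slice report, computing one metric per slice and publishing its delta from the overall value; the recall metric and the notion of slice labels here are illustrative choices.

```python
import numpy as np

def recall(y_true, y_pred):
    """Fraction of true positives that are flagged; nan if the slice has no positives."""
    pos = y_true == 1
    return (y_pred[pos] == 1).mean() if pos.any() else float("nan")

def slice_deltas(y_true, y_pred, slice_labels):
    """Return {slice: (metric, delta from overall)} for e.g. lighting or pose bins."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    slice_labels = np.asarray(slice_labels)
    overall = recall(y_true, y_pred)
    report = {}
    for s in np.unique(slice_labels):
        mask = slice_labels == s
        m = recall(y_true[mask], y_pred[mask])
        report[s] = (m, m - overall)
    return report
```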

Robustness is probed with a red-team protocol: exaggerated expressions, partial occlusion, head motion, blur, low-bitrate compression, deliberate mimicry, and stress-induced artifacts. Failures are cataloged with reproduction steps and recommended mitigations (often “do not use the model in this condition”). We view a documented “no-go” list as a sign of maturity, not weakness.
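
A sketch of how that catalog can feed a “no-go” gate that refuses to score in documented failure zones; the condition names and limits are placeholders for the real catalog entries.

```python
# Each rule inspects a dict of capture conditions; keys and limits are assumed.
NO_GO = [
    lambda c: c.get("occlusion_fraction", 0.0) > 0.4,   # heavy partial occlusion
    lambda c: c.get("bitrate_kbps", 1e9) < 300,         # low-bitrate compression
    lambda c: abs(c.get("head_yaw_deg", 0.0)) > 45,     # extreme head pose
    lambda c: c.get("motion_blur_score", 0.0) > 0.7,    # severe blur
]

def should_score(conditions: dict) -> bool:
    """Return False (do not use the model) if any documented no-go condition is met."""
    return not any(rule(conditions) for rule in NO_GO)
```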

Governance & Policy

Permitted use is limited to research and evaluation with explicit corroboration requirements. Any pilot must pass a lightweight privacy and data-protection review, document the second-signal sources it will rely on, define operator training, and designate an accountable owner. Raw video is never shared externally; derived features are minimized and access-logged. We maintain a change log for models, features, and policies so that decisions can be reconstructed.

De-scoping is built in. If evaluation shows poor calibration, unacceptable subgroup gaps, or net-negative operational value, the model is frozen or retired. Governance artifacts (risk register, DPIA notes, and model cards) are versioned alongside code so policy doesn’t drift away from implementation.

Operator Guidance

Outputs are prompts, not conclusions. A two-signal rule forbids acting on behavioral output alone; operators must cite an independent corroborating signal (contextual observation, independent sensor, or documented rule) before escalation. The UI shows the score, band, and a plain-language reminder of “what this means / does not mean.”
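
A sketch of the two-signal rule as an escalation gate; the band string mirrors the bands above, and everything else is an assumption about how a pilot might encode it.

```python
def may_escalate(band: str, corroborating_signals: list) -> bool:
    """Escalation requires the top band AND at least one independent signal."""
    if band != "consider secondary review if another signal agrees":
        return False
    return len(corroborating_signals) >= 1

# Usage: may_escalate("log and move on", []) -> False
#        may_escalate("consider secondary review if another signal agrees",
#                     ["independent sensor alert"]) -> True
```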

Runbooks include short checklists: what to record, who to notify, when to de-escalate, and when to stop. Every reviewed case logs inputs, banding, operator notes, and final disposition for audit and after-action review. Training materials emphasize failure modes and “don’t-use” scenarios as much as success cases to counter automation bias.

Deliverables

The project ships more than a model. Deliverables include: (1) an evaluation protocol and rater handbook with labeling rubrics and adjudication steps; (2) calibration cards per model and context with recommended action bands; (3) an operator playbook covering checklists, escalation paths, and explicit “don’t-use” cases; (4) a fairness & robustness report with subgroup deltas and red-team results; and (5) deprecation criteria that define when to retrain, freeze, or retire a model due to drift, gap growth, or negative cost-benefit.

All artifacts are versioned, with a simple README that explains how to reproduce results from raw data to final figures. If someone smarter and busier than us can’t repeat the evaluation in an afternoon, we consider that a bug to fix.

Limitations

Behavioral signals are noisy, culturally and contextually shaped, and easy to misinterpret—especially under stress. This work does not claim lie detection, intent detection, or identity inference, and it should never be used to make unilateral, punitive decisions. The contribution is smaller and more practical: show when weak facial dynamics can help as corroboration, when they should be ignored, and how to document both with honesty.

If the evidence says the approach isn’t worth the operational complexity, we will say so and publish the negative result. Knowing when to stop is part of responsible research.

