OPEN-SOURCE INTELLIGENCE METHODOLOGIES
by Tobin M. Albanese
PORTFOLIO — RESEARCH Wed Nov 15 2023
What this covers. A practical OSINT pipeline from collection to publication: intake, normalization & de-duplication, enrichment, verification, fusion, and reporting, all designed for reproducibility and minimal accidental capture of PII.
Principles
- Legality & consent first: respect terms of service, data-use policies, and local law; collect the minimum necessary.
- Reproducible by others: every figure or claim can be regenerated from preserved inputs, config, and code.
- Provenance preserved: every artifact carries origin, timestamp, and transformation history.
- Evidence > opinion: confidence is scored; uncertainty is explicit.
Pipeline Overview
Collect → Normalize → De-dup → Enrich → Verify → Label → Report
- Collect: public web pages, RSS/Atom feeds, official reports, satellite or weather layers, and reputable open datasets.
- Normalize: store raw & normalized copies (UTF-8 text, canonical URLs, stable filenames).
- De-dup: detect near-duplicates (shingling + simhash / perceptual hash for images) to reduce noise.
- Enrich: extract entities, locations, languages; compute media hashes; pull basic EXIF if present.
- Verify: cross-source corroboration, geo/chrono-location, metadata checks, archive lookups.
- Label & Report: assign confidence, note contradictions, publish with a methods appendix.
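As a structural sketch only, the stages compose as plain functions. The stubs below are illustrative placeholders, not a reference implementation; a real project swaps in its own collectors, models, and checks.
# Skeletal driver mirroring the stages above; every stage function is a placeholder stub.
def collect(seeds):       return [{"seed": s} for s in seeds]   # pages, feeds, datasets
def normalize(records):   return records   # UTF-8 text, canonical URLs, stable filenames
def deduplicate(records): return records   # simhash / perceptual-hash clustering
def enrich(records):      return records   # entities, geocodes, media hashes, language ID
def verify(records):      return records   # corroboration, geo/chrono checks, archive lookups
def label(records):       return records   # confidence scores, contradictions noted

def run_pipeline(seeds):
    records = collect(seeds)
    for stage in (normalize, deduplicate, enrich, verify, label):
        records = stage(records)
    return records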
Intake & Normalization
- Watchlists: seed with official sources and reputable monitoring feeds; prefer feeds over ad-hoc scraping.
- Archival snapshots: when citing pages, capture an archive URI alongside the live URL.
- Canonicalization: strip tracking params, resolve redirects, and store a stable source_id (see the sketch below).
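A minimal canonicalization sketch in Python; the tracking-parameter list, the redirect handling via requests, and the hash-based source_id format are illustrative assumptions rather than fixed choices.
# Strip tracking params, resolve redirects, and derive a stable source_id.
# TRACKING_PARAMS and the hashed source_id scheme are assumptions for illustration.
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

import requests

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def canonicalize(url: str, resolve_redirects: bool = True) -> tuple[str, str]:
    if resolve_redirects:
        try:
            url = requests.head(url, allow_redirects=True, timeout=10).url
        except requests.RequestException:
            pass  # keep the original URL if resolution fails
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    canonical = urlunsplit((scheme.lower(), netloc.lower(), path or "/",
                            urlencode(kept), ""))
    source_id = "src_" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return canonical, source_id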
De-duplication
- Text: tokenize → shingles → simhash/TLSH to cluster near-duplicates; keep the earliest or most complete.
- Images: compute perceptual hash (pHash/aHash/dHash) to group re-uploads & crops.
- URL-level: canonical URL + content hash to avoid double counting mirrors.
Goal: reduce volume without losing unique claims or first-source material.
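For the text case, a compact 64-bit simhash sketch over word shingles; the shingle size and Hamming-distance threshold are assumptions to tune per corpus.
# 64-bit simhash over word shingles; near-duplicates cluster at small Hamming distance.
import hashlib

def shingles(text: str, k: int = 4):
    tokens = text.lower().split()
    for i in range(max(len(tokens) - k + 1, 1)):
        yield " ".join(tokens[i:i + k])

def simhash(text: str) -> int:
    counts = [0] * 64
    for sh in shingles(text):
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    # A threshold of 3 bits is a common starting point, not a fixed rule.
    return bin(simhash(a) ^ simhash(b)).count("1") <= threshold

Image near-duplicates follow the same pattern, with a perceptual hash (for example, pHash from the imagehash package) standing in for simhash.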
Enrichment
- Entities: persons, orgs, locations with confidence scores and source spans.
- Geocoding: resolve place names; store lat/lon with precision and method tags (exact, inferred, admin-centroid).
- Media metadata: safe EXIF parsing when available; store hashes and dimensions for dedup/verification.
- Language: language ID & translation notes; keep the original text alongside any translation.
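A small sketch of the media-metadata step using Pillow; which EXIF fields to retain is a project decision, and the minimal record shown here is an assumption.
# Compute content hash, dimensions, and basic EXIF (when present) for one image.
import hashlib
from PIL import Image, ExifTags

def enrich_image(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    record = {"sha256": hashlib.sha256(data).hexdigest()}
    with Image.open(path) as img:
        record["width"], record["height"] = img.size
        exif = img.getexif()  # empty mapping when no EXIF is present
        record["exif"] = {ExifTags.TAGS.get(tag, str(tag)): str(value)
                          for tag, value in exif.items()}
    return record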
Verification & Chrono/Geolocation
- Triangulate: corroborate claims across independent sources; prefer primary over aggregated posts.
- Geo: match skylines, landmarks, signage, terrain, road geometry; confirm with maps/satellite.
- Chrono: shadows, weather, tide, traffic, vegetation; look for seasonal cues and construction timelines.
- Metadata sanity: EXIF can mislead—treat as clues, not truth; check for editing traces.
Record the verification method (e.g., “landmark match + satellite layer”) and any counter-evidence considered.
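One way to keep the method and counter-evidence machine-readable is a small record per claim; the field names, the claim_id value, and the 0–5 scale below are assumptions aligned with the reporting rubric, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class Verification:
    claim_id: str                      # hypothetical identifier for the claim under review
    method: str                        # e.g. "landmark match + satellite layer"
    corroborating_sources: list[str] = field(default_factory=list)
    counter_evidence: list[str] = field(default_factory=list)
    confidence: int = 0                # 0-5, per the rubric used in reporting

# Illustrative values only.
v = Verification(claim_id="claim_0042", method="landmark match + satellite layer",
                 corroborating_sources=["src_2023-11-15_00123"],
                 counter_evidence=["shadow direction ambiguous near local noon"],
                 confidence=4)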
Provenance & Reproducibility
Every artifact gets a manifest entry—hashes, timestamps, and transforms. Example:
{
  "id": "src_2023-11-15_00123",
  "uri_live": "https://example.gov/report.pdf",
  "uri_archive": "https://web.archive.org/web/20231115/https://example.gov/report.pdf",
  "sha256": "…",
  "collected_at": "2023-11-15T13:03:22Z",
  "transforms": ["pdf→text v1.2", "langid en", "ner v0.9"],
  "notes": "Official statement; broken link replaced with archived copy"
}
Keep raw inputs immutable; version configs; pin library versions; export a methods appendix with the report.
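A minimal sketch of producing entries like the one above and appending them to a manifest; the JSONL file name and helper names are assumptions.
# Hash the preserved raw input and append a manifest entry; raw files themselves never change.
import hashlib
import json
from datetime import datetime, timezone

def manifest_entry(source_id: str, uri_live: str, uri_archive: str, raw_path: str,
                   transforms: list[str], notes: str = "") -> dict:
    with open(raw_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "id": source_id,
        "uri_live": uri_live,
        "uri_archive": uri_archive,
        "sha256": digest,
        "collected_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "transforms": transforms,
        "notes": notes,
    }

def append_entry(entry: dict, manifest_path: str = "manifest.jsonl") -> None:
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")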
Fusion & Reporting
- Entity resolution: merge references that are the same real-world thing; keep a cross-reference table.
- Timelines & maps: present where/when alongside who/what; show gaps and contradictions.
- Confidence rubric: e.g., 0–5 with criteria; justify the score in one sentence per key claim.
- Risk review: scrub inadvertent PII; consider source safety before publishing sensitive details.
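For the confidence rubric above, one illustrative 0–5 scale sketched as code so scores and one-sentence justifications stay attached to claims; the wording of each level is an assumption to adapt per investigation.
# Illustrative rubric; the criteria are assumptions, not a standard scale.
CONFIDENCE_RUBRIC = {
    0: "no support beyond a single unverified post",
    1: "single source, not independently corroborated",
    2: "multiple sources, independence unclear",
    3: "independent corroboration with minor gaps",
    4: "multiple independent sources plus geo/chrono verification",
    5: "primary-source evidence, independently verified, no credible contradiction",
}

def scored_claim(claim: str, score: int, justification: str) -> str:
    # Keep the one-sentence justification next to the score for every key claim.
    return f"{claim} [confidence {score}/5: {CONFIDENCE_RUBRIC[score]}] {justification}"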
Operating Safely
- Respect platform terms and legal constraints; prefer official export tools and archives to brittle scraping.
- Minimize retention of identifiers that aren’t essential to the analytic question.
- Document ethics choices where they affect what you collected or chose not to publish.