OPEN-SOURCE INTELLIGENCE METHODOLOGIES
by Tobin M. Albanese
PORTFOLIO — RESEARCH Wed Nov 15 2023
What this covers. A practical OSINT pipeline from collection to publication: intake, normalization & de-duplication, enrichment, verification, fusion, and reporting, all designed for reproducibility and minimal accidental capture of PII.
Principles
- Legality & consent first: respect terms of service, data-use policies, and local law; collect the minimum necessary.
- Reproducible by others: every figure or claim can be regenerated from preserved inputs, config, and code.
- Provenance preserved: every artifact carries origin, timestamp, and transformation history.
- Evidence > opinion: confidence is scored; uncertainty is explicit.
Pipeline Overview
Collect → Normalize → De-dup → Enrich → Verify → Label → Report
- Collect: public web pages, RSS/Atom feeds, official reports, satellite or weather layers, and reputable open datasets.
- Normalize: store raw & normalized copies (UTF-8 text, canonical URLs, stable filenames).
- De-dup: detect near-duplicates (shingling + simhash / perceptual hash for images) to reduce noise.
- Enrich: extract entities, locations, languages; compute media hashes; pull basic EXIF if present.
- Verify: cross-source corroboration, geo/chrono-location, metadata checks, archive lookups.
- Label & Report: assign confidence, note contradictions, publish with a methods appendix.
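As a structural sketch only, the stages compose as plain functions. The stubs below are illustrative placeholders, not a reference implementation; a real project swaps in its own collectors, models, and checks.
# Skeletal driver mirroring the stages above; every stage function is a placeholder stub.
def collect(seeds):       return [{"seed": s} for s in seeds]   # pages, feeds, datasets
def normalize(records):   return records   # UTF-8 text, canonical URLs, stable filenames
def deduplicate(records): return records   # simhash / perceptual-hash clustering
def enrich(records):      return records   # entities, geocodes, media hashes, language ID
def verify(records):      return records   # corroboration, geo/chrono checks, archive lookups
def label(records):       return records   # confidence scores, contradictions noted

def run_pipeline(seeds):
    records = collect(seeds)
    for stage in (normalize, deduplicate, enrich, verify, label):
        records = stage(records)
    return records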
Intake & Normalization
- Watchlists: seed with official sources and reputable monitoring feeds; prefer feeds over ad-hoc scraping.
- Archival snapshots: when citing pages, capture an archive URI alongside the live URL.
- Canonicalization: strip tracking params, resolve redirects, and store a stable source_id (see the sketch below).
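A minimal canonicalization sketch in Python; the tracking-parameter list, the redirect handling via requests, and the hash-based source_id format are illustrative assumptions rather than fixed choices.
# Strip tracking params, resolve redirects, and derive a stable source_id.
# TRACKING_PARAMS and the hashed source_id scheme are assumptions for illustration.
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

import requests

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def canonicalize(url: str, resolve_redirects: bool = True) -> tuple[str, str]:
    if resolve_redirects:
        try:
            url = requests.head(url, allow_redirects=True, timeout=10).url
        except requests.RequestException:
            pass  # keep the original URL if resolution fails
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    canonical = urlunsplit((scheme.lower(), netloc.lower(), path or "/",
                            urlencode(kept), ""))
    source_id = "src_" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return canonical, source_id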
De-duplication
- Text: tokenize → shingles → simhash/TLSH to cluster near-duplicates; keep the earliest or most complete.
- Images: compute perceptual hash (pHash/aHash/dHash) to group re-uploads & crops.
- URL-level: canonical URL + content hash to avoid double counting mirrors.
Goal: reduce volume without losing unique claims or first-source material.
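For the text case, a compact 64-bit simhash sketch over word shingles; the shingle size and Hamming-distance threshold are assumptions to tune per corpus.
# 64-bit simhash over word shingles; near-duplicates cluster at small Hamming distance.
import hashlib

def shingles(text: str, k: int = 4):
    tokens = text.lower().split()
    for i in range(max(len(tokens) - k + 1, 1)):
        yield " ".join(tokens[i:i + k])

def simhash(text: str) -> int:
    counts = [0] * 64
    for sh in shingles(text):
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    # A threshold of 3 bits is a common starting point, not a fixed rule.
    return bin(simhash(a) ^ simhash(b)).count("1") <= threshold

Image near-duplicates follow the same pattern, with a perceptual hash (for example, pHash from the imagehash package) standing in for simhash.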
Enrichment
- Entities: persons, orgs, locations with confidence scores and source spans.
- Geocoding: resolve place names; store lat/lon with precision and method tags (exact, inferred, admin-centroid).
- Media metadata: safe EXIF parsing when available; store hashes and dimensions for dedup/verification.
- Language: language ID & translation notes; keep the original text alongside any translation.
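A small sketch of the media-metadata step using Pillow; which EXIF fields to retain is a project decision, and the minimal record shown here is an assumption.
# Compute content hash, dimensions, and basic EXIF (when present) for one image.
import hashlib
from PIL import Image, ExifTags

def enrich_image(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    record = {"sha256": hashlib.sha256(data).hexdigest()}
    with Image.open(path) as img:
        record["width"], record["height"] = img.size
        exif = img.getexif()  # empty mapping when no EXIF is present
        record["exif"] = {ExifTags.TAGS.get(tag, str(tag)): str(value)
                          for tag, value in exif.items()}
    return record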
Verification & Chrono/Geolocation
- Triangulate: corroborate claims across independent sources; prefer primary over aggregated posts.
- Geo: match skylines, landmarks, signage, terrain, road geometry; confirm with maps/satellite.
- Chrono: shadows, weather, tide, traffic, vegetation; look for seasonal cues and construction timelines.
- Metadata sanity: EXIF can mislead—treat as clues, not truth; check for editing traces.
Record the verification method (e.g., “landmark match + satellite layer”) and any counter-evidence considered.
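One way to keep the method and counter-evidence machine-readable is a small record per claim; the field names, the claim_id value, and the 0–5 scale below are assumptions aligned with the reporting rubric, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class Verification:
    claim_id: str                      # hypothetical identifier for the claim under review
    method: str                        # e.g. "landmark match + satellite layer"
    corroborating_sources: list[str] = field(default_factory=list)
    counter_evidence: list[str] = field(default_factory=list)
    confidence: int = 0                # 0-5, per the rubric used in reporting

# Illustrative values only.
v = Verification(claim_id="claim_0042", method="landmark match + satellite layer",
                 corroborating_sources=["src_2023-11-15_00123"],
                 counter_evidence=["shadow direction ambiguous near local noon"],
                 confidence=4)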
Provenance & Reproducibility
Every artifact gets a manifest entry—hashes, timestamps, and transforms. Example:
{
  "id": "src_2023-11-15_00123",
  "uri_live": "https://example.gov/report.pdf",
  "uri_archive": "https://web.archive.org/web/20231115/https://example.gov/report.pdf",
  "sha256": "…",
  "collected_at": "2023-11-15T13:03:22Z",
  "transforms": ["pdf→text v1.2", "langid en", "ner v0.9"],
  "notes": "Official statement; broken link replaced with archived copy"
}
Keep raw inputs immutable; version configs; pin library versions; export a methods appendix with the report.
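A minimal sketch of producing entries like the one above and appending them to a manifest; the JSONL file name and helper names are assumptions.
# Hash the preserved raw input and append a manifest entry; raw files themselves never change.
import hashlib
import json
from datetime import datetime, timezone

def manifest_entry(source_id: str, uri_live: str, uri_archive: str, raw_path: str,
                   transforms: list[str], notes: str = "") -> dict:
    with open(raw_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "id": source_id,
        "uri_live": uri_live,
        "uri_archive": uri_archive,
        "sha256": digest,
        "collected_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "transforms": transforms,
        "notes": notes,
    }

def append_entry(entry: dict, manifest_path: str = "manifest.jsonl") -> None:
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")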
Fusion & Reporting
- Entity resolution: merge references that are the same real-world thing; keep a cross-reference table.
- Timelines & maps: present where/when alongside who/what; show gaps and contradictions.
- Confidence rubric: e.g., 0–5 with criteria; justify the score in one sentence per key claim.
- Risk review: scrub inadvertent PII; consider source safety before publishing sensitive details.
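For the confidence rubric above, one illustrative 0–5 scale sketched as code so scores and one-sentence justifications stay attached to claims; the wording of each level is an assumption to adapt per investigation.
# Illustrative rubric; the criteria are assumptions, not a standard scale.
CONFIDENCE_RUBRIC = {
    0: "no support beyond a single unverified post",
    1: "single source, not independently corroborated",
    2: "multiple sources, independence unclear",
    3: "independent corroboration with minor gaps",
    4: "multiple independent sources plus geo/chrono verification",
    5: "primary-source evidence, independently verified, no credible contradiction",
}

def scored_claim(claim: str, score: int, justification: str) -> str:
    # Keep the one-sentence justification next to the score for every key claim.
    return f"{claim} [confidence {score}/5: {CONFIDENCE_RUBRIC[score]}] {justification}"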
Operating Safely
- Respect platform terms and legal constraints; prefer official export tools and archives to brittle scraping.
- Minimize retention of identifiers that aren’t essential to the analytic question.
- Document ethics choices where they affect what you collected or chose not to publish.