STELLARIS — OSINT NLP ENGINE

by Tobin M. Albanese

PORTFOLIO — SPOTLIGHT Tue Oct 01 2024

Mission. STELLARIS (Structured Textual Extraction & Linking for Live Analysis of Real-time Intelligence Sources) exists to transform the unstructured noise of the open web into structured, defensible, and actionable intelligence. The platform continuously ingests massive volumes of text streams — ranging from news articles and government filings to online forums, RSS feeds, and PDF reports — and converts this raw, unstructured material into a living web of linked data. Analysts are no longer forced to manually sift through documents, guess at connections, or rely on brittle keyword searches; instead, they can follow a clear thread from a single individual to their associated addresses, companies, financial transactions, and cross-border shipments. Every connection remains anchored in its original source, so context and evidentiary lineage are never lost. The mission is simple but ambitious: empower investigators, researchers, and intelligence professionals to understand complex realities faster, more reliably, and with complete transparency.

What it does. At its core, STELLARIS is a pipeline for turning words into structured networks of knowledge. The system applies advanced natural language processing (NLP) techniques — including named entity recognition, cross-document entity resolution, relation extraction, event detection, and temporal normalization — to every incoming document. This means the platform doesn’t just identify “who” is mentioned in a text, but also “how” those people or organizations are connected, “what” events they participated in, and “when” those events occurred. The extracted information is assembled into an interactive knowledge graph where analysts can explore relationships, filter by attributes, overlay geospatial or temporal views, and pivot across different types of entities seamlessly. Instead of static search results, users receive a living map of connections that evolves as new information flows in, making the invisible visible in real time.
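
As a rough illustration of what "structured networks of knowledge" could look like in code, the sketch below models one extracted relation as a typed record (who, how, what, when, and where it came from) and indexes a batch of them into an adjacency map. The `Relation` fields, the `build_graph` helper, and the sample data are hypothetical and are not taken from the STELLARIS codebase.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """One extracted assertion. Field names are illustrative assumptions."""
    subject: str                  # "who"
    predicate: str                # "how" the two entities are connected
    obj: str                      # the other entity or event
    occurred_on: Optional[date]   # "when", if a date could be normalized
    source_uri: str               # where the assertion was found


def build_graph(relations):
    """Index relations by both endpoints so either entity can be a pivot."""
    graph = defaultdict(list)
    for rel in relations:
        graph[rel.subject].append(rel)
        graph[rel.obj].append(rel)
    return dict(graph)


relations = [
    Relation("J. Doe", "director_of", "Acme Holdings",
             date(2022, 3, 1), "https://example.org/filing-1"),
    Relation("Acme Holdings", "shipped_via", "Borealis Shipping",
             None, "https://example.org/news-7"),
]
graph = build_graph(relations)
```

Indexing each relation under both endpoints is one simple way to let an analyst pivot from either side of a connection; a real graph database would handle this natively.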

Analyst workflow. STELLARIS is designed around the way human analysts actually work. A typical workflow might begin with a single seed — a company, a username, a shipping record, or even a fragment of leaked data. From this starting point, the analyst can explore first- and second-degree relationships, visualizing how seemingly unrelated entities begin to cluster into meaningful patterns. They can pin subgraphs of interest, annotate edges with hypotheses or questions, and save customized views that preserve filters, time windows, and notes for future sessions or team sharing. Every node and edge is annotated with citations, model confidence, and version history, so nothing is ever taken on faith. This design makes it possible for teams to review, challenge, and reproduce each other’s findings, turning the platform into not just a discovery tool but also a collaborative research environment where insights are defensible and transparent.
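
Expanding a seed into its first- and second-degree relationships amounts to a bounded breadth-first search. The sketch below assumes the graph is available as a plain adjacency dictionary; the `neighborhood` function and the sample entities are hypothetical, and a production system would more likely push this traversal down into the graph database.

```python
from collections import deque


def neighborhood(graph, seed, max_hops=2):
    """Return {entity: hop_distance} for every entity within max_hops of seed."""
    distances = {seed: 0}
    frontier = deque([seed])
    while frontier:
        node = frontier.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                frontier.append(neighbor)
    return distances


# Illustrative data only.
graph = {
    "Acme Holdings": ["J. Doe", "PO Box 12"],
    "J. Doe": ["Acme Holdings", "Borealis Shipping"],
    "PO Box 12": ["Acme Holdings"],
    "Borealis Shipping": ["J. Doe", "Vessel 7"],
    "Vessel 7": ["Borealis Shipping"],
}
nearby = neighborhood(graph, "Acme Holdings")
```

With the seed "Acme Holdings", the result contains the seed, its two direct neighbors, and "Borealis Shipping" at two hops; "Vessel 7" is three hops away and is left out.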

Stack. The technology stack behind STELLARIS combines distributed data engineering with transformer-based machine learning. On the ingestion side, distributed workers handle incoming streams with backpressure controls, so that no single source overwhelms the system. Message queues balance load, while FastAPI services orchestrate requests across the pipeline. Transformer-based NLP models (built on Hugging Face and spaCy) handle entity recognition, relation extraction, and event detection, producing structured records from messy text. Elasticsearch powers fast keyword and semantic search, while a graph database (Neo4j or JanusGraph) stores and queries the resulting networks. On the frontend, a React/Vite application renders large graphs with GPU-accelerated layouts, type-ahead entity search, and keyboard-driven pivoting, giving analysts a responsive, interactive workspace even with millions of nodes and edges. The result is a stack built from modern but proven components, capable of handling real-world data at real-world scale.
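
One common way to get backpressure is a bounded queue between fetchers and NLP workers: when the queue is full, producers are suspended until workers catch up. The sketch below shows the idea with Python's standard `asyncio.Queue`; it is an in-process stand-in, not the distributed message queue the platform uses, and `fetch_source`, `nlp_worker`, and `run_pipeline` are hypothetical names.

```python
import asyncio


async def fetch_source(name, docs, queue):
    """Producer: `put` suspends whenever the bounded queue is full."""
    for doc in docs:
        await queue.put((name, doc))


async def nlp_worker(queue, results):
    """Consumer: pulls documents and records them as processed."""
    while True:
        item = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for model inference
            results.append(item)
        finally:
            queue.task_done()


async def run_pipeline(sources, queue_size=4, n_workers=2):
    queue = asyncio.Queue(maxsize=queue_size)  # the bound is the backpressure
    results = []
    workers = [asyncio.create_task(nlp_worker(queue, results))
               for _ in range(n_workers)]
    await asyncio.gather(*(fetch_source(name, docs, queue)
                           for name, docs in sources.items()))
    await queue.join()  # wait until every queued document is processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    sources = {"rss": ["a", "b", "c"], "filings": [f"f{i}" for i in range(10)]}
    print(len(asyncio.run(run_pipeline(sources))))
```

The queue size caps how much unprocessed work can pile up in memory, so a noisy source slows itself down instead of starving the others.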

Data quality & provenance. In intelligence analysis, trust is everything. That is why STELLARIS treats provenance as a first-class concern. Every edge in the graph is linked back to the exact source from which it was derived — including the document URI, the paragraph offset, the model version that produced it, and the scoring features used in extraction. Ingestion processes are idempotent, relying on hashing to detect duplicates, while assertions can be re-scored as models improve over time. Analysts can invoke an “explain-this-edge” action to see the raw snippet, the extraction process, and even model confidence. Rollbacks are supported at every level, making it possible to test new models, audit old ones, or red-team sensitive cases without corrupting the graph. This rigorous approach ensures that every claim in the system can be verified, challenged, or disproven — the opposite of a black box.
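
A minimal sketch of how hash-based idempotent ingestion and per-edge provenance might fit together, assuming an in-memory store. The `Provenance` fields mirror the ones listed above (the scoring features are omitted for brevity), while `EdgeStore`, `toy_extractor`, and the choice of SHA-256 over the document text are illustrative assumptions rather than the platform's actual design.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    document_uri: str
    paragraph_offset: int
    model_version: str
    confidence: float


class EdgeStore:
    """Hypothetical in-memory store: edges keyed by triple, each with sources."""

    def __init__(self, model_version):
        self.model_version = model_version
        self._seen_hashes = set()
        self.edges = {}  # (subject, predicate, obj) -> list of Provenance

    def ingest(self, document_uri, text, extractor):
        """Return True if the document was new, False if it was a duplicate."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self._seen_hashes:
            return False  # re-ingesting identical content changes nothing
        self._seen_hashes.add(digest)
        for subject, predicate, obj, offset, confidence in extractor(text):
            prov = Provenance(document_uri, offset, self.model_version, confidence)
            self.edges.setdefault((subject, predicate, obj), []).append(prov)
        return True

    def explain(self, subject, predicate, obj):
        """Sketch of an "explain-this-edge" lookup: every supporting source."""
        return list(self.edges.get((subject, predicate, obj), []))


def toy_extractor(text):
    # Stand-in for the real NLP models; yields one hard-coded assertion.
    if "Acme" in text:
        yield ("J. Doe", "director_of", "Acme Holdings", 0, 0.91)


store = EdgeStore(model_version="ner-v1")
store.ingest("https://example.org/a", "J. Doe runs Acme Holdings.", toy_extractor)
store.ingest("https://example.org/a", "J. Doe runs Acme Holdings.", toy_extractor)
```

Because the second call sees the same content hash, it is a no-op, and the edge still carries exactly one provenance record. Keeping the model version on each record is what would make later re-scoring or rollback by version possible.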

Scale & reliability. Real-world OSINT environments are messy and bursty — some days the system must absorb thousands of routine filings, while other days it is hit with floods of breaking news or viral posts. STELLARIS is engineered to handle both extremes gracefully. Message queues smooth out ingestion spikes, while batch and streaming modes run in parallel to balance throughput with latency. Retry policies and dead-letter queues ensure that problematic documents don’t clog the pipeline, while monitoring dashboards track queue depths, error rates, and model latencies in real time. Nightly compaction tasks merge duplicate entities and refresh indexes to keep queries fast. Schema migration scripts evolve the graph database without downtime, so analysts never lose access even during upgrades. The overall design principle is simple: reliability at scale, because analysts can’t afford gaps or outages in the middle of an investigation.
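
The retry and dead-letter behavior can be sketched as a loop that gives each document a fixed number of attempts with exponential backoff, then sets it aside instead of blocking the rest of the batch. The function name, the attempt count, and the backoff schedule below are assumptions for illustration; a real deployment would typically rely on the message broker's own retry and dead-letter features.

```python
import time


def process_with_retries(documents, handler, max_attempts=3, base_delay=0.1):
    """Return (processed_results, dead_letters) for a batch of documents."""
    processed, dead_letters = [], []
    for doc in documents:
        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(doc))
                break
            except Exception as exc:  # broad on purpose: any failure is retried
                last_error = exc
                if attempt < max_attempts:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
        else:
            # All attempts failed: park the document for later inspection.
            dead_letters.append((doc, repr(last_error)))
    return processed, dead_letters
```

A document that keeps failing ends up in `dead_letters` together with its last error, so the rest of the batch still flows through and the failure can be investigated separately.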

Security & governance. Because STELLARIS often deals with sensitive or personally identifiable information, governance is embedded into the platform itself. Fine-grained role-based access control (RBAC) lets administrators decide who can view, edit, or export specific segments of the graph. Sensitive attributes can be masked or hidden entirely depending on clearance level, and export bundles are signed and checksummed so that tampering can be detected. Secrets and credentials are rotated automatically, and all data is encrypted both at rest and in transit. Audit logs capture every action, making it possible to review not just what the data says, but also who accessed it, when, and how. This makes STELLARIS suitable not only for open-source research, but also for regulated environments where compliance and accountability are non-negotiable.
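
To make attribute masking and export integrity concrete, the sketch below masks sensitive fields by role and attaches an HMAC-SHA256 signature to an export bundle. The role names, the `SENSITIVE_FIELDS` set, and the use of HMAC are all assumptions chosen for illustration; the original design only states that attributes can be masked and that bundles are signed.

```python
import hashlib
import hmac
import json

# Hypothetical list of attributes that require elevated clearance.
SENSITIVE_FIELDS = {"home_address", "passport_number"}


def mask_for_role(node, role):
    """Return a copy of the node with sensitive fields masked for non-admins."""
    if role == "admin":
        return dict(node)
    return {key: ("***" if key in SENSITIVE_FIELDS else value)
            for key, value in node.items()}


def sign_export(nodes, key):
    """Serialize nodes deterministically and attach an HMAC-SHA256 signature."""
    payload = json.dumps(nodes, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_export(bundle, key):
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(key, bundle["payload"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, bundle["signature"])


node = {"name": "J. Doe", "home_address": "12 Example St"}
bundle = sign_export([mask_for_role(node, "analyst")], key=b"demo-key")
```

A signature of this kind does not stop someone from editing a bundle, but any change to the payload makes `verify_export` return False, which is what allows tampering to be detected downstream.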

Impact. The practical outcome of all this engineering is measurable acceleration in the way analysts work. Tasks that once took hours — verifying an alias across multiple reports, surfacing intermediaries in a financial network, or mapping supply chain hops across borders — now take minutes. Instead of emailing screenshots or exporting static reports, teams can share reproducible graph views that carry all the filters, time windows, and citations baked in. This reduces duplication of effort, makes peer review far easier, and ensures that insights scale across an organization rather than living in individual silos. For organizations facing information overload, STELLARIS doesn’t just speed up analysis; it changes the very culture of how intelligence is produced, reviewed, and disseminated.

Roadmap. STELLARIS is already powerful, but its future is even more ambitious. Upcoming milestones include cross-lingual models that can normalize entities across languages and scripts, enabling global investigations without linguistic blind spots. Stance and claim clustering will allow analysts to group related narratives, distinguish between factual reporting and opinion, and identify coordinated campaigns. Natural-language graph queries will let users type questions like “Show all shell companies linked to X in 2022” and receive structured subgraphs as answers. Event-sequence anomaly detection will flag unusual chains of activity — like logistics routes that don’t match normal patterns. Finally, collaborative playbooks will allow teams to codify repeatable workflows as templates, so that common investigative patterns can be reused, audited, and improved over time. Together, these roadmap items point to a system that doesn’t just document the world, but actively helps analysts stay ahead of it.

