
Following Up On Entity Extraction In Umbrix

So I wanted to post a follow-up to my previous introduction to the Umbrix platform, where I was using DSPy for entity extraction in cyber threat intelligence. If you missed that post, you can find it here. This recap comes a week after I launched the platform.

Two days into getting Umbrix running, I started hitting problems. My agents had been extracting a significant volume of entities and relationships to populate the graph as nodes and edges, but I quickly realized the system was burning through resources - both rate limits and agent costs.

The original pipeline was elegant but expensive - every piece of content went through DSPy-powered extraction using the _extract_entities method I showed in my previous post. While this gave us high-quality extraction with contextual understanding, processing thousands of feeds daily meant burning through tokens at an unsustainable rate.

At that point, I pivoted to optimize for the highest throughput at the lowest cost. The question became: what’s the most efficient LLM configuration for this specific entity extraction task?

Since I was hosting on GCP, I had access to various model options. While the performance differences for entity extraction were minimal between models, the cost differences were substantial. But honestly, it didn’t matter much because using LLMs for high-volume entity extraction proved expensive regardless of configuration.

This realization led me to step back and review the literature on state-of-the-art entity extraction techniques. My research pointed toward exploring spaCy and BERT models for entity extraction, which could potentially maintain quality while drastically reducing costs.

The Evolution: From Pure LLM to Hybrid Pipeline

We started with experiments in gradient balancing and multi-task learning, testing different approaches to optimize our extraction pipeline.

The new pipeline architecture leverages a tiered approach:

Tier 1: High-Volume Pattern Matching

First, we run lightweight regex-based extraction for well-defined patterns. This catches the obvious stuff - IPs, domains, hashes, CVEs - without any model inference:

# From our optimized pipeline: indicator patterns compiled once at import time.
# These are simplified versions of the production regexes.
import re

INDICATOR_PATTERNS = {
    'hashes': re.compile(r'\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b'),  # SHA-256, SHA-1, MD5
    'ips': re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
    'domains': re.compile(r'\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*\.[a-z]{2,}\b', re.IGNORECASE),
    'cves': re.compile(r'\bCVE-\d{4}-\d{4,7}\b', re.IGNORECASE),
}

def tier1_extraction(text):
    """Ultra-fast pattern matching for common indicators - no model inference."""
    return {name: pattern.findall(text) for name, pattern in INDICATOR_PATTERNS.items()}
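A quick sanity check on a made-up sentence (the indicators below are illustrative, using documentation-reserved values, not real observations):

sample = "APT28 used CVE-2021-44228 to pull a payload from 203.0.113.7 via payload.example.com"
print(tier1_extraction(sample))
# {'hashes': [], 'ips': ['203.0.113.7'], 'domains': ['payload.example.com'], 'cves': ['CVE-2021-44228']}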

Tier 2: spaCy NER for Named Entities

For threat actors, malware families, and organizations, we use a custom-trained spaCy model. We fine-tuned it on cybersecurity text using data from our initial DSPy extractions.

# Custom spaCy pipeline for cyber entities
import spacy

nlp = spacy.load("en_cybersec_ner_v2")
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Add patterns for common threat actor naming conventions.
# Note: multi-word names like "Cozy Bear" span two tokens, so they need
# one token spec per word rather than a single LOWER match.
patterns = [
    {"label": "THREAT_ACTOR", "pattern": [{"TEXT": {"REGEX": r"^APT\d+$"}}]},
    {"label": "THREAT_ACTOR", "pattern": [{"LOWER": {"IN": ["lazarus", "apt28"]}}]},
    {"label": "THREAT_ACTOR", "pattern": [{"LOWER": "cozy"}, {"LOWER": "bear"}]},
    {"label": "MALWARE", "pattern": [{"LOWER": {"IN": ["emotet", "trickbot", "ryuk"]}}]},
]
ruler.add_patterns(patterns)
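As for the fine-tuning data itself, we converted our earlier DSPy extractions into spaCy's binary training format. Here's a minimal sketch of that conversion - the (text, spans) record shape is an assumption for illustration, not our exact schema:

# Minimal sketch: turning prior DSPy extractions into spaCy training data.
# The dspy_records shape below is illustrative.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

dspy_records = [
    ("Emotet was delivered by the Lazarus group.",
     [(0, 6, "MALWARE"), (28, 35, "THREAT_ACTOR")]),
]

for text, spans in dspy_records:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
    doc.ents = [e for e in ents if e is not None]  # drop spans that don't align to token boundaries
    db.add(doc)

db.to_disk("./cyber_ner_train.spacy")  # consumed by `spacy train`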

Tier 3: BERT-based Classification and Disambiguation

Here’s where it gets interesting. We use a fine-tuned BERT model for two critical tasks:

  1. Entity Type Classification: When spaCy isn’t confident, BERT classifies ambiguous entities
  2. Relationship Extraction: BERT identifies relationships between entities in context

The BERT implementation leverages our multi-task training framework from the experiments:

# Simplified version of our BERT integration
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class CyberEntityBERT:
    def __init__(self):
        # A sequence-classification head is needed to get .logits out of the model
        self.model = AutoModelForSequenceClassification.from_pretrained("umbrix/cyber-entity-bert")
        self.tokenizer = AutoTokenizer.from_pretrained("umbrix/cyber-entity-bert")
        self.model.eval()

    def classify_entity_context(self, text, entity):
        """Classify entity type based on surrounding context"""
        # Encode text with entity markers so the model knows which span to classify
        marked_text = text.replace(entity, f"[ENTITY]{entity}[/ENTITY]")
        inputs = self.tokenizer(marked_text, return_tensors="pt", truncation=True)

        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

        return self._decode_predictions(predictions)

    def _decode_predictions(self, predictions):
        """Map the highest-probability class index back to its label name"""
        idx = predictions.argmax(dim=-1).item()
        return self.model.config.id2label[idx], predictions[0, idx].item()
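Usage looks like this - note that "umbrix/cyber-entity-bert" is our internal checkpoint, so treat this as a shape-of-the-API example rather than something runnable against a public model:

clf = CyberEntityBERT()
label, confidence = clf.classify_entity_context(
    "The loader retrieved a second stage attributed to Lazarus.", "Lazarus"
)
print(label, confidence)  # e.g. ("THREAT_ACTOR", 0.97) - values illustrative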

Tier 4: LLM for Complex Reasoning (Selective)

Only the most ambiguous or complex cases make it to LLM processing now - the routing gate for this is sketched after the list. We're talking about:

  • Novel threat actor attribution requiring reasoning
  • Complex attack chain analysis
  • Zero-day vulnerability impact assessment
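The gate itself is just a confidence check over the earlier tiers. A minimal sketch, where the threshold value and the ExtractionResult shape are assumptions for illustration:

from dataclasses import dataclass

LLM_ESCALATION_THRESHOLD = 0.6  # illustrative value; tuned empirically in practice

@dataclass
class ExtractionResult:
    entity: str
    label: str
    confidence: float  # best confidence across Tiers 1-3

def should_escalate_to_llm(result: ExtractionResult) -> bool:
    """Send an entity to the LLM tier only when the cheap tiers are unsure."""
    return result.confidence < LLM_ESCALATION_THRESHOLD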

This selective approach reduced our LLM costs by approximately 94% while maintaining extraction quality above 92% compared to pure DSPy. Unfortunately, even the reduced costs are unsustainable on a student budget, so I have disabled this tier for now.

The Graph Management Challenge

Another issue that emerged was that my Graph Librarian agent was being too aggressive in creating entity connections. The graph was becoming overly connected - too many relationships were being inferred, which reduced the signal-to-noise ratio.

The solution came from implementing a confidence-weighted relationship scoring system:

class EnhancedGraphLibrarian:
    def __init__(self):
        self.relationship_threshold = 0.7   # minimum score before an edge is created
        self.co_occurrence_window = 150     # tokens; used by the proximity scorer

    def score_relationship(self, entity1, entity2, context):
        """Score potential relationship based on multiple factors"""
        scores = {
            'proximity': self._calculate_proximity_score(entity1, entity2, context),
            'semantic': self._calculate_semantic_similarity(entity1, entity2),
            'temporal': self._calculate_temporal_correlation(entity1, entity2),
            'type_compatibility': self._check_relationship_validity(entity1.type, entity2.type)
        }

        # Type compatibility acts as a hard gate: implausible pairings
        # are rejected outright rather than merely down-weighted
        if scores['type_compatibility'] < 0.3:
            return 0.0

        weighted_score = (
            scores['proximity'] * 0.3 +
            scores['semantic'] * 0.4 +
            scores['temporal'] * 0.3
        )

        return weighted_score

    def should_connect(self, entity1, entity2, context):
        """Only materialize an edge when the score clears the threshold."""
        return self.score_relationship(entity1, entity2, context) >= self.relationship_threshold
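In the ingestion loop, this turns edge creation from "always" into "only when justified". A hypothetical sketch of the call site - the graph.add_edge name is illustrative, not the real Umbrix API:

from itertools import combinations

def link_entities(librarian, graph, entities, context):
    """Materialize only the candidate edges that clear the confidence threshold."""
    for e1, e2 in combinations(entities, 2):  # candidate pairs from one document
        score = librarian.score_relationship(e1, e2, context)
        if score >= librarian.relationship_threshold:
            graph.add_edge(e1, e2, weight=score)  # graph API name is illustrative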

From this perspective, I figured it was a good time to use LLMs to analyze my system architecture and research state-of-the-art approaches from adjacent fields - particularly medical research entity extraction, biomedical text analysis, and ML pipelines. The goal was to build something that is:

  • More efficient (lower cost, higher throughput)
  • More accurate (better precision in entity extraction and relationship inference)
  • Better at graph management (quality over quantity in connections)

Current State and Next Steps

This brings us to where we are now. We're implementing a hybrid approach that combines the aforementioned techniques with a new correlation analysis module that tracks entity relationships across time. For now, we will keep working and plan to release more details on the platform soon.
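To give a flavor of what that correlation module tracks, here is a minimal sketch of time-bucketed co-occurrence counting - the daily granularity and data shapes are assumptions for illustration, not the production design:

from collections import defaultdict
from datetime import datetime

class TemporalCorrelationTracker:
    """Count how often two entities are sighted together per day."""

    def __init__(self):
        self.co_occurrences = defaultdict(lambda: defaultdict(int))  # day -> pair -> count

    def record_sighting(self, entity_a: str, entity_b: str, seen_at: datetime):
        day = seen_at.date().isoformat()
        pair = tuple(sorted((entity_a, entity_b)))  # direction-agnostic key
        self.co_occurrences[day][pair] += 1

    def correlation_history(self, entity_a: str, entity_b: str):
        """Daily co-occurrence counts, useful for spotting campaign spikes."""
        pair = tuple(sorted((entity_a, entity_b)))
        return {day: pairs[pair] for day, pairs in self.co_occurrences.items() if pair in pairs}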