Crawler Dispatcher

also known as URL Domain Dispatcher, Crawler Factory

Route each incoming URL to a domain-specific crawler through a central dispatcher mapping URL patterns to registered crawler classes.

Context

An LLM application ingests text from many web sources — LinkedIn posts, Medium articles, GitHub repos, Substack posts, custom company sites. Each source has its own structure, login flow, rate limits, and quirks. The ingestion code accumulates per-source branches.

Problem

Branching on the URL host with if-else scales badly. Adding a new source requires editing the ingestion module, dispatch logic is tangled with per-source logic, and contributors collide in the same module file, which slows the addition of sources. Tests for one source pull in the dependencies of every source. Without a registry-based dispatcher, ingestion becomes a fragile monolith in which every new source means touching shared code.

Forces

New sources are added frequently; cost of adding must be low.
Per-source logic differs enough that one crawler cannot serve all.
Tests for a source should not pull in unrelated crawlers.
URL-to-crawler mapping is the only routing decision; it should be one place.

Example

A personal-knowledge LLM ingests content from LinkedIn, Substack, GitHub, and the author's personal site. The Dispatcher has four registrations. Adding a fifth source (Bluesky) is a new BlueskyCrawler class and one register call. The ingestion module is unchanged.
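The example can be sketched roughly as follows. This is a minimal illustration, not the catalog's implementation: the stub crawlers, the host-suffix matching, the `ingest` function, and the hosts `example.com` (standing in for the personal site) and `bsky.app` are all assumptions made for the sketch.

```python
from urllib.parse import urlparse


class StubCrawler:
    """Hypothetical base: real crawlers would fetch and parse the page."""
    source = "unknown"

    def fetch(self, url: str) -> dict:
        return {"source": self.source, "url": url}


class LinkedInCrawler(StubCrawler): source = "linkedin"
class SubstackCrawler(StubCrawler): source = "substack"
class GitHubCrawler(StubCrawler): source = "github"
class PersonalSiteCrawler(StubCrawler): source = "personal"

# Registry: host suffix -> crawler class (matching rule is an assumption).
REGISTRY: dict[str, type[StubCrawler]] = {}


def register(host_suffix: str, crawler_cls: type[StubCrawler]) -> None:
    REGISTRY[host_suffix] = crawler_cls


def get_crawler(url: str) -> StubCrawler:
    host = urlparse(url).hostname or ""
    for suffix, crawler_cls in REGISTRY.items():
        if host == suffix or host.endswith("." + suffix):
            return crawler_cls()
    raise LookupError(f"no crawler registered for {url!r}")


def ingest(urls: list[str]) -> list[dict]:
    """Ingestion code: it only talks to the dispatcher, never to a source."""
    return [get_crawler(u).fetch(u) for u in urls]


# The four original registrations.
register("linkedin.com", LinkedInCrawler)
register("substack.com", SubstackCrawler)
register("github.com", GitHubCrawler)
register("example.com", PersonalSiteCrawler)


# Adding the fifth source: one new class and one register call.
class BlueskyCrawler(StubCrawler): source = "bluesky"

register("bsky.app", BlueskyCrawler)

docs = ingest(["https://github.com/octocat/hello", "https://bsky.app/profile/someone"])
```

`ingest` is defined before Bluesky exists and is not edited afterwards; only the registry grows.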

Diagram

flowchart LR
  URL[Incoming URL] --> Disp[Dispatcher]
  Disp --> R[Registry: pattern → class]
  R --> Pick[Pick matching crawler]
  Pick --> C1[LinkedInCrawler]
  Pick --> C2[MediumCrawler]
  Pick --> C3[GitHubCrawler]
  C1 --> Doc[Document]
  C2 --> Doc
  C3 --> Doc

Solution

Therefore:

Define a Crawler interface (e.g. `fetch(url) -> document`). Implement one crawler class per source (LinkedInCrawler, MediumCrawler, GitHubCrawler, ...). A Dispatcher object holds a registry of (URL pattern → crawler class). `dispatcher.get_crawler(url)` returns the right instance; adding a source is `dispatcher.register(pattern, CrawlerClass)`. The dispatcher is small and stable; the crawler classes evolve independently. Tests for one crawler don't import the others.
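A minimal sketch of this shape, assuming regular expressions as the URL patterns, a plain `dict` as the document type, and "first registered match wins" as the tie-break. None of these choices are fixed by the pattern, and the crawler bodies are placeholders rather than real fetch logic.

```python
import re
from typing import Protocol


class Crawler(Protocol):
    """Interface every source-specific crawler implements."""

    def fetch(self, url: str) -> dict: ...


class GitHubCrawler:
    def fetch(self, url: str) -> dict:
        # Placeholder: a real crawler would call the network here.
        return {"source": "github", "url": url}


class MediumCrawler:
    def fetch(self, url: str) -> dict:
        return {"source": "medium", "url": url}


class Dispatcher:
    """Holds the (URL pattern -> crawler class) registry."""

    def __init__(self) -> None:
        self._registry: list[tuple[re.Pattern[str], type[Crawler]]] = []

    def register(self, pattern: str, crawler_cls: type[Crawler]) -> None:
        self._registry.append((re.compile(pattern), crawler_cls))

    def get_crawler(self, url: str) -> Crawler:
        # Assumption: patterns are tried in registration order.
        for pattern, crawler_cls in self._registry:
            if pattern.search(url):
                return crawler_cls()
        raise LookupError(f"no crawler registered for {url!r}")


dispatcher = Dispatcher()
dispatcher.register(r"^https?://(www\.)?github\.com/", GitHubCrawler)
dispatcher.register(r"^https?://([\w-]+\.)?medium\.com/", MediumCrawler)

url = "https://github.com/octocat/hello"
doc = dispatcher.get_crawler(url).fetch(url)
```

Because the dispatcher only stores classes it is handed, a test for one crawler can build a fresh `Dispatcher`, register that single class, and never import the rest.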

What this pattern forbids. URL-to-crawler dispatch must not be inlined as if-else branching in the ingestion code; the mapping lives in a central registry the dispatcher consults.

And the patterns that stand alongside it, or against it —

complements · Agent Adapter ★★ · An interface layer connecting an agent's tool-calling protocol to heterogeneous external tools, normalizing their schemas into one the agent expects.
complements · Augmented LLM ★★ · Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.
complements · Tool Use ★★ · Let the LLM produce typed calls against an external toolkit instead of producing free-form text the surrounding system has to parse.
composes-with · FTI LLM Pipeline Split ★★ · Decompose an LLM/RAG system into three independently deployable pipelines (feature, training, inference) communicating only via a feature store and a model registry.
complements · Browser Agent ★ · Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.
complements · Rate Limiting ★★ · Cap the number of requests, tokens, or tool calls per user (or session) within a time window.

Used in recipes

Production LLM Platform · core · URL pattern → crawler class registry for heterogeneous source ingestion.

Provenance

Source: patterns/crawler-dispatcher.md on GitHub · commit 135ae3c
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.