
Global Scraper

293-source signal pipeline. Tier propagation, anomaly detection, circuit breakers. No LLM in the render path.

I wrote about “260 Sources” earlier, when personal palantir was in its first form: lots of sources, no structure. This iteration is up to 293 sources with 132K+ scored signals, plus a few structural pieces in place to clean up the data flow.

Snapshot dashboard → (captured 2026-05-05, not realtime)

A few ideas:

No LLM in the render path. Signals are persisted raw into SQLite (scraper.db). Scoring is rule-based (importance × urgency × novelty). The dashboard renders with plain code. LLMs do upstream schema inference and downstream analysis, but they don’t appear in the user-facing render path. Reproducible. Stable. Doesn’t hallucinate.
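As a minimal sketch of that pipeline step (the actual scraper.db schema, field names, and weight ranges are assumptions; the post only names the three factors):

```python
import sqlite3

def score(importance: float, urgency: float, novelty: float) -> float:
    """Rule-based score: plain multiplication, no model in the loop."""
    return importance * urgency * novelty

# Hypothetical table shape -- the real scraper.db schema is not shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signals (id INTEGER PRIMARY KEY, title TEXT, "
    "importance REAL, urgency REAL, novelty REAL, score REAL)"
)
conn.execute(
    "INSERT INTO signals (title, importance, urgency, novelty, score) "
    "VALUES (?, ?, ?, ?, ?)",
    ("example signal", 0.8, 0.5, 0.9, score(0.8, 0.5, 0.9)),
)
row = conn.execute("SELECT score FROM signals").fetchone()
```

The point of the multiplication (rather than a sum) is that any factor near zero kills the score: an important but stale, non-novel item doesn't surface.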

Tier propagation tracking. Sources split by tier — Tier-1 = regulatory / institutional (SEC filings, IMF WEO, Senate lobbying); Tier-2 = industry (Next Platform RSS, TrendForce); Tier-3 = community (HN, Reddit, Twitch trending). The same entity shows up in Tier-1, then in Tier-3 hours later. That gap is the alpha window. The tier-lag card on the dashboard tracks it.
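The tier-lag computation reduces to a timestamp diff per entity. A sketch under assumed data shapes (the real dashboard code and storage layout aren't shown):

```python
from __future__ import annotations
from datetime import datetime, timedelta

# Hypothetical shape: for one entity, the first time it was seen per tier.
def tier_lag(first_seen: dict[int, datetime]) -> timedelta | None:
    """Hours between an entity surfacing in Tier-1 and echoing in Tier-3.

    Returns None if the entity hasn't appeared in both tiers yet.
    """
    if 1 in first_seen and 3 in first_seen:
        return first_seen[3] - first_seen[1]
    return None

seen = {
    1: datetime(2026, 5, 5, 9, 0),    # e.g. an SEC filing lands
    3: datetime(2026, 5, 5, 14, 30),  # same entity trends on HN
}
lag = tier_lag(seen)  # 5.5 hours: the alpha window for this entity
```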

Three anomaly modes stacked. Anomaly = z-score deviation from baseline (financial spikes); breaker = a source failing consecutively, auto-quarantined; absence = a source that historically produced signals and abruptly went silent — the last one is often the earliest leading indicator of an upstream change. Three modes catching “something happened” on different time scales.
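Two of the three modes fit in a few lines each. A sketch with assumed thresholds (the post doesn't specify the z-score cutoff or the absence floor, so `3.0` and `0.25` here are illustrative defaults, not the real values):

```python
import statistics

def zscore_anomaly(history: list, latest: float, threshold: float = 3.0) -> bool:
    """Mode 1: flag a value that deviates hard from its own baseline."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return False  # flat history: no meaningful baseline to deviate from
    return abs(latest - mean) / stdev > threshold

def absence_anomaly(signals_last_24h: int, daily_avg: float,
                    floor: float = 0.25) -> bool:
    """Mode 3: a historically productive source going abruptly quiet."""
    return daily_avg > 0 and signals_last_24h < daily_avg * floor
```

Mode 2 (the breaker) is stateful rather than statistical, which is why it lives in its own component. The three fire on different time scales: a spike is instant, a breaker trips over consecutive runs, an absence only shows up after a full quiet window.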

Circuit breakers reset manually. Sources that fail several runs in a row get paused. No auto-recover: a fault gets eyeballed once before the source goes back on. Failures don't eat pipeline time.
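A minimal sketch of that breaker, assuming a failure threshold of 5 (the real threshold and state fields aren't given in the post):

```python
class Breaker:
    """Per-source circuit breaker: trips on consecutive failures, no auto-recover."""

    def __init__(self, max_failures: int = 5):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False  # open = source quarantined, polling paused

    def record(self, ok: bool) -> None:
        if self.open:
            return  # quarantined: stays off until a human resets it
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.max_failures:
            self.open = True

    def reset(self) -> None:
        """Called manually after the fault has been eyeballed once."""
        self.failures = 0
        self.open = False
```

Note that a success while open does nothing; only `reset()` closes the breaker, which is exactly the no-auto-recover policy described above.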

Nothing especially clever about the design, but in the right slot it works. SaaS “deep research” tools sell you 5 sources plus an LLM summary; it looks deep, but it's mostly reading abstracts. My 293 are raw signals with transparent scoring and a tiered, queryable structure. Different problem.

Snapshot dashboard →
