
260 Sources In, Here’s What I Learned

Scaling a personal intelligence system from 100 to 260 sources, and why it already beats most “deep research” tools.

Six days ago I started building a system to monitor the internet for me. Not just Reddit and Twitter — government filings, patent offices, dark pool data, academic preprints, software registries, climate feeds, flight tracking. The kind of sources that individually seem unremarkable, but together paint a picture that no single platform can give you.

It started with 30 scrapers, passed 100 within a week, and now sits at 260 active sources, processing around 40,000 signals per day into a single daily briefing. The briefing tells me what is unusual, what multiple sources are converging on, and — most importantly — what the niche sources are seeing that mainstream platforms have not picked up yet.

The Registration Problem

Scaling to 260 sources means registering for a lot of APIs. Many of them have CAPTCHA walls, email verification, OAuth flows, and other friction designed to keep bots out. Doing this manually for hundreds of services is not realistic.

So I built a tool that handles API registration automatically: creating accounts, receiving verification emails through my own mail server, solving CAPTCHAs, and extracting API keys. For services with particularly aggressive bot detection, I use an anti-detection browser with C++-level fingerprint spoofing to get past Cloudflare and similar systems.

In the end, about half the sources I wanted did not even need authentication — open government datasets, academic APIs, and community-maintained feeds. The best data is often the most accessible. The other half required creative workarounds, and 13 of the original scrapers were replaced entirely with zero-auth alternatives that provided the same or better data.

Bottom-Up vs. Top-Down

Most AI deep research tools work top-down — you give them a question, and the model decides where to look, what to follow up on, and when to stop. Every step involves the model making subjective judgment calls about which direction to pursue. The output depends heavily on the search path, which means the same question asked twice can produce meaningfully different results.

This system works bottom-up. 260 sources feed data in continuously, math filters surface what is anomalous and where multiple sources converge, and conclusions emerge from the data itself. There is no model deciding where to look — the data arrives whether anyone asked for it or not. The output is stable because the underlying signals do not change based on how you frame the question.
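The post describes the math layer only in outline, but the two filters it names — anomaly detection and multi-source convergence — can be sketched simply. This is my own minimal interpretation (a per-source z-score and a distinct-source count), not the author's actual implementation; the real system presumably does considerably more.

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_anomalies(history: dict[str, list[float]], today: dict[str, float],
                     threshold: float = 3.0) -> list[str]:
    """Flag sources whose latest reading sits more than `threshold`
    standard deviations away from their own recent history."""
    flagged = []
    for source, values in history.items():
        if len(values) < 2 or source not in today:
            continue
        mu, sigma = mean(values), stdev(values)
        if sigma > 0 and abs(today[source] - mu) / sigma > threshold:
            flagged.append(source)
    return flagged

def convergence(signals: list[tuple[str, str]],
                min_sources: int = 3) -> dict[str, int]:
    """Count distinct sources per topic; keep topics that at least
    `min_sources` independent sources converge on."""
    topics: dict[str, set[str]] = defaultdict(set)
    for source, topic in signals:
        topics[topic].add(source)
    return {t: len(s) for t, s in topics.items() if len(s) >= min_sources}
```

The key property the post claims — stable output — falls out of this design: both functions are deterministic over the incoming data, with no model choosing a search path.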

This is not Bloomberg. It is not Palantir’s enterprise platform. It does not have institutional-grade data pipes or real-time streaming from exchanges. But for personal decision-making or a small business trying to stay ahead of market shifts, it is more than enough. The stability alone — getting consistent conclusions rather than model-dependent narratives — makes it a genuinely useful tool.

What Is Next

The system now logs its own predictions. The next phase is verification: grading whether the asymmetry windows it identifies actually play out as the signals suggest. This takes time; you need weeks or months of data to know whether your predictions were right or just noise.

But honestly, prediction accuracy is not the main point. As long as you are not using this for financial trading — where milliseconds and precision matter — directional awareness is the core value. Knowing that something is happening before everyone else knows is valuable whether you are running a company, evaluating a market, or just trying to understand the world better.

The gap between this and an institutional system is real. But the gap between this and reading the news is enormous.

Here is the full list of all 260 sources.
