# I Built a Real-Time HackerNews Trend Radar With AI (And It Runs Itself)
Every day, HackerNews quietly decides what the dev world will care about next: which topics are actually taking off right now, across threads and deep comment chains. So instead of manually refreshing HN, I built a real-time "trend radar" on top of it that:

- Continuously ingests fresh HN stories and comments
- Uses an LLM to extract structured topics (companies, tools, models, tech terms)
- Streams everything into Postgres for instant querying, like: "What's trending on HN right now?" or "Which threads are driving the most hype for Claude / LangChain / Rust today?"

All of this runs as a declarative CocoIndex flow with incremental syncs, LLM-powered extraction, and simple query handlers. In this post, you'll see how it works end-to-end and how you can fork it to track any community (Reddit, X, Discord, internal Slack, etc.).

## Why HackerNews?

HackerNews is one of the strongest early signals for:

- New tools and frameworks devs actually try
- Which AI models/products are gaining mindshare
- Real sentiment and feedback in the comments
- Emerging startups and obscure libraries that might be big in 6-12 months

But raw HN has three problems:

1. Threads are noisy; comments are nested and messy
2. There's no notion of "topics" beyond free text
3. There's no built-in way to ask: "What's trending across the whole firehose?"

The HackerNews Trending Topics example in CocoIndex is essentially: "turn HN into a structured, continuously updating topics index that AI agents and dashboards can query in milliseconds."

## The Architecture

At a high level, the pipeline looks like this:

```
HackerNews API
      ↓
HackerNewsConnector (Custom Source)
  ├─ list()             → thread IDs + updated_at
  ├─ get_value()        → full threads + comments
  └─ provides_ordinal() → enables incremental sync
      ↓
CocoIndex Flow
  ├─ LLM topic extraction on threads + comments
  ├─ message_index collector (content)
  └─ topic_index collector (topics)
      ↓
Postgres
  ├─ hn_messages
  └─ hn_topics
      ↓
Query Handlers
  ├─ search_by_topic("Claude")
  ├─ get_trending_topics(limit=20)
  └─ get_threads_for_topic("Rust")
```

## Step 1: A Custom Source With Incremental Sync

Key idea: separate discovery from fetching.

- `list()` hits the HN Algolia search API to get lightweight metadata: thread IDs + `updated_at` timestamps.
- `get_value()` only runs for threads whose `updated_at` changed, fetching full content + comments from the items API.
- Ordinals (timestamps) let CocoIndex skip everything that hasn't changed, cutting API calls by >90% on subsequent syncs.

This is what enables "live mode" with a 30-second polling interval without melting APIs or your wallet.

First, you define the data model for threads and comments:

```python
import dataclasses
from datetime import datetime
from typing import NamedTuple

class _HackerNewsThreadKey(NamedTuple):
    thread_id: str

@dataclasses.dataclass
class _HackerNewsComment:
    id: str
    author: str | None
    text: str | None
    created_at: datetime | None

@dataclasses.dataclass
class _HackerNewsThread:
    author: str | None
    text: str
    url: str | None
    created_at: datetime | None
    comments: list[_HackerNewsComment]
```

Then you declare a `SourceSpec` that configures how to query HN:

```python
class HackerNewsSource(SourceSpec):
    """Source spec for HackerNews API."""

    tag: str | None = None   # e.g. "story"
    max_results: int = 100   # hits per poll
```

The custom source connector wires this spec into actual HTTP calls:

- `list()` calls `https://hn.algolia.com/api/v1/search_by_date` with `hitsPerPage=max_results` and yields `PartialSourceRow` objects keyed by thread ID, with ordinals based on `updated_at`.
- `get_value()` calls `https://hn.algolia.com/api/v1/items/{thread_id}` and parses the full thread + nested comments into `_HackerNewsThread` and `_HackerNewsComment`.
- `provides_ordinal()` returns `True` so CocoIndex can do incremental sync.
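To make the discovery/fetch split concrete, here's a stripped-down, plain-Python version of those two calls. This is a standalone illustration rather than the actual connector code: it uses `requests` directly, and the response field names (`objectID`, `updated_at`) follow the description above.

```python
# Standalone illustration of the discovery/fetch split -- not the actual
# connector. Response field names follow the article's description of the
# Algolia API; error handling and pagination are omitted.
import requests

ALGOLIA = "https://hn.algolia.com/api/v1"

def list_threads(tag: str = "story", max_results: int = 100) -> dict[str, str]:
    """Discovery: cheap metadata only -- thread IDs and updated_at ordinals."""
    resp = requests.get(
        f"{ALGOLIA}/search_by_date",
        params={"tags": tag, "hitsPerPage": max_results},
        timeout=10,
    )
    resp.raise_for_status()
    return {hit["objectID"]: hit["updated_at"] for hit in resp.json()["hits"]}

def get_thread(thread_id: str) -> dict:
    """Fetch: the full thread with nested comments -- the expensive call."""
    resp = requests.get(f"{ALGOLIA}/items/{thread_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()

# Incremental sync in miniature: only re-fetch threads whose ordinal moved.
last_seen: dict[str, str] = {}
for thread_id, updated_at in list_threads().items():
    if last_seen.get(thread_id) != updated_at:
        thread = get_thread(thread_id)  # skipped for unchanged threads
        last_seen[thread_id] = updated_at
```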
CocoIndex handles the hard part: tracking ordinals and only re-pulling changed rows on each sync.

## Step 2: LLM Topic Extraction

Once the source is in the flow, the fun part starts: semantic enrichment. You define a minimal `Topic` type that the LLM will fill:

```python
@dataclasses.dataclass
class Topic:
    """
    A single topic extracted from text:
    - products, tools, frameworks
    - people, companies
    - domains (e.g. "vector search", "fintech")
    """

    topic: str
```

Inside the flow, every thread gets its topics extracted with a single declarative transform:

```python
with data_scope["threads"].row() as thread:
    thread["topics"] = thread["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[Topic],
        )
    )
```

Same for comments:

```python
with thread["comments"].row() as comment:
    comment["topics"] = comment["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[Topic],
        )
    )
```

Under the hood, CocoIndex:

- Calls the LLM with a structured prompt and enforces `output_type=list[Topic]`
- Normalizes messy free text into consistent topic strings
- Makes this just another column in your flow instead of a separate glue script

This is what turns HN from "some text" into something an AI agent or SQL query can reason about.

## Step 3: Collect and Export to Postgres

All structured data is collected into two logical indexes:

- `message_index`: threads + comments with their raw text and metadata
- `topic_index`: individual topics linked back to messages

Collectors are declared once and then exported to Postgres:

```python
message_index = data_scope.add_collector()
topic_index = data_scope.add_collector()

message_index.export(
    "hn_messages",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
)
topic_index.export(
    "hn_topics",
    cocoindex.targets.Postgres(),
    primary_key_fields=["topic", "message_id"],
)
```

Now you have two tables you can poke with SQL or via CocoIndex query handlers:

- `hn_messages`: full-text search, content analytics, author stats
- `hn_topics`: topic-level analytics, trend tracking, per-topic thread ranking

## Step 4: Query Handlers - From "Cool Pipeline" to Real Product

Here's where it stops being just a nice ETL project and becomes something you can actually ship.

### search_by_topic(topic): "Show Me All Claude Mentions"

This query handler lets you search HN content by topic across threads and comments:

```python
@hackernews_trending_topics_flow.query_handler()
def search_by_topic(topic: str) -> cocoindex.QueryOutput:
    topic_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_topics"
    )
    message_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_messages"
    )
    with connection_pool().connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT m.id, m.thread_id, m.author, m.content_type,
                       m.text, m.created_at, t.topic
                FROM {topic_table} t
                JOIN {message_table} m ON t.message_id = m.id
                WHERE LOWER(t.topic) LIKE LOWER(%s)
                ORDER BY m.created_at DESC
                """,
                (f"%{topic}%",),
            )
            results = [
                {
                    "id": row[0],
                    "url": f"https://news.ycombinator.com/item?id={row[1]}",
                    "author": row[2],
                    "type": row[3],
                    "text": row[4],
                    "created_at": row[5].isoformat(),
                    "topic": row[6],
                }
                for row in cur.fetchall()
            ]
    return cocoindex.QueryOutput(results=results)
```

You can literally run:

```bash
cocoindex query main.py search_by_topic --topic "Claude"
```

...and get a clean JSON response with URLs, authors, timestamps, and which piece of content the topic appeared in.

### get_threads_for_topic(topic): Rank Threads by Topic Score

Not all mentions are equal:

- If "Rust" is in the thread title, that's a primary discussion
- If it's buried in a comment, that's more of a side mention

`get_threads_for_topic` uses a weighted scoring model to prioritize threads where the topic is central; a minimal version is sketched below.
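The full handler ships with the example; here's a minimal sketch of the scoring idea, reusing the tables and helpers from `search_by_topic` above. The 3x-vs-1x weights and the `'thread'`/`'comment'` values for `content_type` are illustrative assumptions, not the example's actual numbers:

```python
# Minimal sketch of weighted topic scoring -- illustrative weights, not the
# example's actual implementation. Assumes content_type is "thread" for
# thread-level text and "comment" for comments, as in search_by_topic above.
@hackernews_trending_topics_flow.query_handler()
def get_threads_for_topic(topic: str) -> cocoindex.QueryOutput:
    topic_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_topics"
    )
    message_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_messages"
    )
    with connection_pool().connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT
                    m.thread_id,
                    -- a mention in the thread itself outweighs one in a comment
                    SUM(CASE WHEN m.content_type = 'thread'
                             THEN 3.0 ELSE 1.0 END) AS score,
                    MAX(m.created_at) AS latest_mention
                FROM {topic_table} t
                JOIN {message_table} m ON t.message_id = m.id
                WHERE LOWER(t.topic) LIKE LOWER(%s)
                GROUP BY m.thread_id
                ORDER BY score DESC, latest_mention DESC
                LIMIT 10
                """,
                (f"%{topic}%",),
            )
            results = [
                {
                    "url": f"https://news.ycombinator.com/item?id={row[0]}",
                    "score": float(row[1]),
                    "latest_mention": row[2].isoformat(),
                }
                for row in cur.fetchall()
            ]
    return cocoindex.QueryOutput(results=results)
```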
### get_trending_topics(limit=20): The Actual Trend Radar

Finally, the endpoint that powers dashboards and agents. It surfaces a list like:

`["Claude 3.7 Sonnet", "OpenAI o4-mini", "LangChain", "Modal", ...]`

with scores and latest mention times, and each topic includes the top threads where it's being discussed right now. You can wire this into:

- A live dashboard showing "top 20 topics in the last N hours"
- A Slack bot posting a daily "what's trending on HN" summary
- An internal research agent that watches for signals relevant to your stack

## Running It in Real Time

Once the flow is defined, keeping it live is a one-liner:

```bash
# On-demand refresh
cocoindex update main

# Live mode: keeps polling HN and updating indexes
cocoindex update -L main
```

CocoIndex handles:

- Polling HN every 30 seconds (configurable)
- Incrementally syncing only changed threads
- Re-running LLM extraction only where needed
- Exporting into Postgres and making query handlers available

For debugging, CocoInsight lets you explore the flow, see lineage, and play with queries from a UI:

```bash
cocoindex server -ci main
# Then open: https://cocoindex.io/cocoinsight
```

## Beyond HackerNews

Once you have this pattern, you're not limited to HackerNews. Some obvious extensions:

- **Cross-community trend tracking**: Add Reddit subs, X lists, Discord channels, internal Slack, etc. as additional sources, and normalize topics across them to see which ideas propagate where and when.
- **Sentiment-aware trend analysis**: Plug in an LLM-based sentiment extraction step alongside topics, and track not just what is trending, but whether devs love or hate it.
- **Influencer and key-contributor maps**: Use the author field to see who starts important discussions and whose comments move the conversation.
- **Continuous knowledge graphs**: Treat topics as nodes, threads as edges, and build a graph of tools, companies, and people linked by real discussions.
- **Real-time AI research agents**: Point an agent at the Postgres-backed index and let it answer questions like "What are the top new vector DBs people are experimenting with this week?" or "Which AI eval frameworks are getting traction?"

If you live in data, infra, or AI-land, this is basically a self-updating signal layer over HN that your tools and agents can query.

You can find the fully working example (including the flow definition, custom source, query handlers, and Postgres export) in the official HackerNews Trending Topics example on the CocoIndex docs and GitHub. If you end up:

- Pointing this at a different community
- Layering in embeddings, RAG, or sentiment
- Wiring it into a real product or agent

...definitely share it back. The coolest part of this pattern is how little code you need to go from "raw community noise" to a live, queryable trend radar.
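As one last illustration of how little code: the sentiment-aware extension from the list above is roughly one more dataclass and one more transform, reusing the `ExtractByLlm` pattern from Step 2. This is a hypothetical sketch, not part of the shipped example; the `TopicSentiment` type and `topic_sentiments` field are made up here:

```python
# Hypothetical sketch of the sentiment-aware extension -- not part of the
# shipped example. Slots into the flow next to the topic extraction in
# Step 2; TopicSentiment and "topic_sentiments" are illustrative names.
@dataclasses.dataclass
class TopicSentiment:
    topic: str
    sentiment: str  # e.g. "positive", "negative", "mixed"

with data_scope["threads"].row() as thread:
    thread["topic_sentiments"] = thread["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[TopicSentiment],
        )
    )
```

Collect those rows into a third table next to `hn_topics`, and the radar tells you not just what's trending, but how people feel about it.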