How I am designing Rolsfera: real architecture of a news aggregator with scraping, RSS, AI and automation

Real architecture of Rolsfera, a news aggregator with scraping, RSS and AI. Technical decisions, stack, mistakes and next steps.


I have been building Rolsfera for months and I still have not written a single line about its architecture. I suppose it is the same thing that happens to most developers with their personal projects: you are so deep into solving problems that documenting feels like a luxury. But I think it is time.

Rolsfera is a news aggregator. Not one of those that simply displays RSS feeds in a nice interface. It is a system that collects information from multiple sources (RSS, scraping, APIs), processes it, filters it, summarizes it with AI and distributes it to channels like Telegram or X. All of that with a reasonable degree of automation, but with human review at the points where it matters.

This article is not a step-by-step tutorial. It is a map of the technical decisions I have made, why I made them and where I got it wrong. If you are thinking about building something similar, I hope to save you a few iterations.


The problem I am trying to solve

Information is fragmented. It is not a new problem, but in 2026 it has become more absurd than ever. If you want to follow a technical topic you need to check blogs, newsletters, RSS, Telegram, X, Reddit, Hacker News and probably three or four more sources depending on your niche.

Each source has its own format, its own frequency, its own noise level. And the result is that you spend more time searching and filtering than reading and processing.

Rolsfera was born out of a real frustration: I want a flow where information comes to me, filtered and summarized, without having to open fifteen tabs every morning.

But I do not want a fully automatic system that publishes without supervision. I have seen too many content bots that end up spewing garbage or duplicating irrelevant news. The idea is to automate the mechanical part (collecting, cleaning, summarizing) and keep editorial judgment in human hands.


Architecture diagram

Before going into detail, this is the general system flow:

┌─────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES                            │
│                                                                 │
│   ┌─────────┐   ┌──────────┐   ┌─────────┐   ┌─────────────┐  │
│   │  RSS     │   │ Scraping │   │  APIs   │   │  Telegram   │  │
│   │  Feeds   │   │ (Python) │   │ (REST)  │   │  Channels   │  │
│   └────┬─────┘   └────┬─────┘   └────┬────┘   └──────┬──────┘  │
│        │              │              │               │          │
└────────┼──────────────┼──────────────┼───────────────┼──────────┘
         │              │              │               │
         ▼              ▼              ▼               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   INGESTION AND NORMALIZATION                    │
│                                                                 │
│   n8n (orchestration) + Python (processing)                     │
│   - Feed parsing                                                │
│   - Content extraction                                          │
│   - Deduplication by URL and content hash                       │
│   - Format normalization                                        │
│                                                                 │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      AI PROCESSING                              │
│                                                                 │
│   - Category classification                                     │
│   - Summary generation                                          │
│   - Entity and tag extraction                                   │
│   - Relevance scoring                                           │
│                                                                 │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                        STORAGE                                  │
│                                                                 │
│   PostgreSQL (articles, metadata, publication status)            │
│                                                                 │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                   EDITORIAL VALIDATION                          │
│                                                                 │
│   Internal panel: approve / discard / edit before publishing    │
│                                                                 │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                       DISTRIBUTION                              │
│                                                                 │
│   ┌───────────┐   ┌─────────┐   ┌─────────┐   ┌────────────┐  │
│   │ Telegram  │   │    X    │   │   Web   │   │  Newsletter │  │
│   └───────────┘   └─────────┘   └─────────┘   └────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

It is not a textbook diagram, but it reflects reality. There are four clear layers: ingestion, processing, validation and distribution, with PostgreSQL storage sitting between processing and validation. Between the layers, n8n acts as the glue.


The tech stack

These are the main pieces and why I chose them:

Python is the main processing language. I use it for the scrapers, RSS parsers, deduplication logic, communication with LLM APIs and summary generation. I do not need to justify this choice much: for extraction and data processing tasks, Python remains the most practical option. The libraries are there, the community is there and the development time is reasonable.

n8n is the orchestrator. It coordinates the flows: when scrapers run, when new articles get processed, when they are sent for review, when they get published. I could have used Airflow or a custom queue system, but n8n gives me something that is valuable for a personal project: a visual interface where I can see what happened in each execution without opening logs.

# Simplified example: RSS parser built on feedparser
import hashlib
from datetime import datetime, timezone

import feedparser


def parse_feed(feed_url: str) -> list[dict]:
    """Fetch a feed and return its entries as plain dicts."""
    feed = feedparser.parse(feed_url)
    articles = []

    for entry in feed.entries:
        link = entry.get("link", "")
        # Simplification: feeds rarely carry the full text, so this hashes the
        # link. The normalized article format hashes the cleaned content instead.
        content_hash = hashlib.sha256(link.encode("utf-8")).hexdigest()

        articles.append({
            "title": entry.get("title", ""),
            "url": link,
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "content_hash": content_hash,
            "source": feed_url,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        })

    return articles

PostgreSQL as the database. It stores articles, metadata, publication status and processing logs. I use JSONB fields to store variable metadata (AI-extracted tags, scores, detected entities). PostgreSQL handles this without breaking a sweat and saves me from having to set up a separate system for semi-structured data.

BeautifulSoup and Playwright for scraping. BeautifulSoup for sites with static HTML and Playwright for those that render with JavaScript. Most news sources still serve static content, so BeautifulSoup covers 80% of cases. Playwright comes in when there is no alternative.

LLM APIs for intelligent processing. I use models via API to classify articles by category, generate short summaries and extract relevant entities. I do not train my own models; that would be overengineering for this use case.


Technical decisions that matter

Why RSS + scraping (and not just one of the two)

RSS is stable, ethical and easy to parse. But not all sources have RSS, and the ones that do often leave out the full content: many only expose a title and an excerpt.

Scraping fills in where RSS falls short. There are outlets that do not offer a feed, there are specific sections that are not in the general RSS and there is structured data (author, category, exact date) that the feed omits.

The combination of RSS as the primary source and scraping as a fallback gives me broad coverage without depending on a single extraction method.
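
The "RSS first, scraping as a fallback" rule can be sketched roughly as follows. This is an illustration, not the actual Rolsfera code: `resolve_content`, the injected `scrape` callable and the `MIN_CONTENT_CHARS` threshold are all hypothetical names and values.

```python
from typing import Callable

# Hypothetical threshold: below this length the feed is assumed to carry only an excerpt.
MIN_CONTENT_CHARS = 500


def resolve_content(
    entry: dict,
    scrape: Callable[[str], str],
    min_chars: int = MIN_CONTENT_CHARS,
) -> tuple[str, str]:
    """Return (content, source_type), preferring the feed and falling back to scraping."""
    feed_text = entry.get("summary", "")
    if len(feed_text) >= min_chars:
        return feed_text, "rss"
    try:
        scraped = scrape(entry["url"])
    except Exception:
        # A broken scraper should not lose the article: keep the excerpt.
        return feed_text, "rss"
    # Only switch to the scraped text when it is actually longer.
    if len(scraped) > len(feed_text):
        return scraped, "scraper"
    return feed_text, "rss"
```

Passing the scraper in as a callable keeps the decision logic independent of whether the page is fetched with BeautifulSoup or Playwright.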

Why n8n and not Airflow

Airflow is an excellent tool for production data pipelines. But for a personal project with flows that I modify every week, n8n has practical advantages:

  • The visual interface reduces iteration time. I can move nodes, test partial executions and see intermediate data without touching code.
  • Self-hosting is trivial. One Docker container and done.
  • For orchestration flows (trigger, HTTP, process, save), n8n is faster to set up than writing DAGs in Python.

Where n8n falls short is heavy processing. That is why the complex logic (parsing, deduplication, AI) runs in Python scripts that n8n calls as HTTP services.
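
As a rough idea of what "Python scripts that n8n calls as HTTP services" can look like, here is a minimal standard-library sketch. The port, the payload shape (`{"articles": [...]}`) and the `handle_payload` placeholder are assumptions made for the example; the real services may well use a web framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_payload(payload: dict) -> dict:
    """Placeholder for the real processing step (parsing, deduplication, AI)."""
    articles = payload.get("articles", [])
    return {"received": len(articles)}


def respond(raw: bytes) -> tuple[int, dict]:
    """Turn a raw request body into (HTTP status, JSON-serializable body)."""
    try:
        payload = json.loads(raw or b"{}")
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400, {"error": "invalid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "expected a JSON object"}
    return 200, handle_payload(payload)


class ProcessHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, body = respond(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Block forever, answering POST requests from the orchestrator."""
    HTTPServer(("0.0.0.0", port), ProcessHandler).serve_forever()
```

An n8n HTTP Request node would then POST a JSON body to this endpoint and continue the workflow with the response.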

Why not just APIs

It would be ideal if all outlets exposed clean APIs with their content. But the reality is that most do not, and those that do usually impose rate limits, fees or access restrictions.

RSS and scraping give me independence. I do not depend on a third party maintaining an API or on their terms of service changing overnight. Obviously, scraping has its own risks (structure changes, blocks), but those are risks I can manage technically.


The data flow in detail

1. Ingestion

Each source has a dedicated extractor. RSS feeds are processed with feedparser. Scrapers use BeautifulSoup or Playwright depending on the site. Each extractor returns a normalized format:

# Normalized article format
{
    "title": str,
    "url": str,
    "content": str,          # clean text, no HTML
    "published_at": str,     # ISO 8601
    "source_name": str,
    "source_type": str,      # "rss" | "scraper" | "api"
    "content_hash": str,     # SHA-256 of the content
    "raw_metadata": dict,    # original, unprocessed data
}

Deduplication happens at two levels: first by exact URL (the most obvious) and then by content hash (to detect the same article published on two different sites or with slightly different URLs).
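
A minimal sketch of those two levels, using in-memory sets instead of the database. It goes slightly beyond "exact URL" by normalizing the URL first, which is my assumption about how slightly different URLs could be caught; the function names and the `utm_` prefix list are hypothetical.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes are tracking noise.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def is_duplicate(article: dict, seen_urls: set[str], seen_hashes: set[str]) -> bool:
    """Level 1: normalized URL. Level 2: content hash. Records the article if new."""
    url = normalize_url(article["url"])
    digest = content_hash(article["content"])
    if url in seen_urls or digest in seen_hashes:
        return True
    seen_urls.add(url)
    seen_hashes.add(digest)
    return False
```

Hashing a whitespace-collapsed, lowercased copy makes the second level tolerant of formatting differences, but any real change in wording still produces a different hash.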

2. AI processing

Once the article is normalized and deduplicated, it goes through an AI pipeline:

# LLM processing pipeline
# call_llm and parse_json_response are project helpers that are not shown here.
def process_article(article: dict) -> dict:
    # Only the first 3000 characters are sent, to keep token usage bounded.
    prompt = f"""Analyze the following news article:

Title: {article['title']}
Content: {article['content'][:3000]}

Respond in JSON with:
- category: main category (tech, politics, economy, science, etc.)
- summary: a 2-3 sentence summary
- entities: list of entities mentioned (people, companies, technologies)
- relevance_score: score from 1 to 10 for relevance to a technical audience
"""

    response = call_llm(prompt)
    # Copy so the original article dict is left untouched.
    enriched = article.copy()
    enriched["ai_metadata"] = parse_json_response(response)
    return enriched
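
The `parse_json_response` helper is not shown in the article. One hypothetical way to write it defensively, assuming the model may wrap the JSON in extra text and may return the score as a string, is sketched here; the required keys mirror the fields requested in the prompt.

```python
import json
import re
from typing import Optional

# The four fields the prompt asks the model to return.
REQUIRED_KEYS = {"category", "summary", "entities", "relevance_score"}


def parse_json_response(response: str) -> Optional[dict]:
    """Extract and validate the JSON object in an LLM reply, or return None."""
    # Models often surround the JSON with prose, so grab the outermost {...}.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    try:
        score = int(data["relevance_score"])
    except (TypeError, ValueError):
        return None
    # Clamp to the 1-10 range requested in the prompt.
    data["relevance_score"] = min(10, max(1, score))
    return data
```

Returning `None` instead of raising lets the caller decide whether to retry the call or store the article without AI metadata.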

This step is not free. Each LLM API call has a token cost, and when you process hundreds of articles per day, the expense adds up. That is why I apply a first filter before the AI: if the article already exists in the database or its source has a historically low relevance rate, I do not process it.
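
That pre-filter could look something like the sketch below. I am assuming "relevance rate" means the share of a source's reviewed articles that ended up approved; the function name, the stats structure and both thresholds are hypothetical.

```python
def should_process(
    article: dict,
    known_urls: set[str],
    source_stats: dict[str, tuple[int, int]],
    min_rate: float = 0.2,
    min_samples: int = 20,
) -> bool:
    """Decide whether an article is worth an LLM call.

    source_stats maps source_name -> (approved, reviewed) counts.
    """
    if article["url"] in known_urls:
        return False  # already stored: never pay for it twice
    approved, reviewed = source_stats.get(article["source_name"], (0, 0))
    if reviewed < min_samples:
        return True  # not enough history yet: give the source a chance
    return approved / reviewed >= min_rate
```

The `min_samples` guard keeps a new source from being silenced after one or two rejected articles.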

3. Storage

Everything goes to PostgreSQL. The main articles table has fixed columns for essential data and a JSONB field for AI metadata:

CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    content TEXT,
    published_at TIMESTAMP,
    source_name TEXT NOT NULL,
    source_type TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    ai_metadata JSONB,
    status TEXT DEFAULT 'pending',  -- pending, approved, rejected, published
    created_at TIMESTAMP DEFAULT NOW(),
    published_to JSONB DEFAULT '[]'
);

4. Editorial validation

This is where the system stops being automatic. Processed articles arrive at an internal panel with pending status, where I can review them, edit them if needed and approve or discard them.

It is not a sophisticated panel. It is a simple interface that shows the title, the AI-generated summary, the relevance score and the original content. From there I decide what gets published and what does not.

This step seems inefficient, but it is what makes the difference between a spam bot and a channel with editorial judgment.
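
The status values come from the `status` column in the table above. Which transitions are legal is not spelled out in the article, so the map below is an assumption; it sketches how the panel could refuse to publish anything that has not been approved first.

```python
# Assumed transition map; only the four status names come from the schema.
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected"},
    "approved": {"published", "rejected"},
    "rejected": set(),
    "published": set(),
}


def transition(article: dict, new_status: str) -> dict:
    """Return a copy of the article with the new status, or raise if not allowed."""
    current = article["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {new_status!r}")
    updated = article.copy()
    updated["status"] = new_status
    return updated
```

With this in place, a `pending` article can never jump straight to `published`, which is exactly the guarantee the human review step is meant to provide.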

5. Distribution

Approved articles are published to the configured channels. Currently Telegram and X, with the web as a third channel in development. Each channel has its own formatter: Telegram uses Markdown with emojis, X requires short versions with a link, the web shows the full article.
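
A sketch of two per-channel formatters. The limits are assumptions to verify against each platform: I am assuming a 280-character post limit on X with every link counted as 23 characters, and I am ignoring X's weighted counting for emoji and some scripts. Telegram escaping rules depend on the parse mode and are left out on purpose.

```python
X_MAX_CHARS = 280  # assumed post limit
X_URL_CHARS = 23   # assumed fixed cost of a link, whatever its real length


def format_for_x(article: dict) -> str:
    """Short version: title (truncated if needed) followed by the link."""
    budget = X_MAX_CHARS - X_URL_CHARS - 1  # one space before the link
    title = article["title"]
    if len(title) > budget:
        title = title[: budget - 1].rstrip() + "…"
    return f"{title} {article['url']}"


def format_for_telegram(article: dict) -> str:
    """Markdown message: bold title, AI summary, then a link."""
    metadata = article.get("ai_metadata") or {}
    summary = metadata.get("summary", "")
    return f"*{article['title']}*\n\n{summary}\n\n[Read more]({article['url']})"
```

Keeping one small function per channel means a change in one platform's rules does not touch the others.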

n8n manages this distribution. When an article moves to approved status, a workflow triggers, formats the content for each channel and publishes it. If publication fails (rate limit, API error), the article stays in the queue for a retry.
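
In Rolsfera the retry lives in n8n, so the following is only an illustration of the idea in Python: retry a failed publication with exponential backoff and give up after a few attempts. The function name, the attempt count and the delays are hypothetical.

```python
import time
from typing import Callable


def publish_with_retry(
    publish: Callable[[], None],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call publish() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, 4x... between attempts. Returns True on success.
    """
    for attempt in range(max_attempts):
        try:
            publish()
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the last attempt
            sleep(base_delay * 2 ** attempt)
    return False
```

A `False` result is the signal to leave the article in `approved` status so that a later run picks it up again.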


Mistakes and current limitations

Not everything is pretty. These are the real problems I live with:

Deduplication is not perfect. Two outlets can publish the same story with texts different enough that the hash does not match. I am experimenting with embeddings to detect semantic similarity, but it adds complexity and cost.
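
The embedding idea boils down to comparing vectors with cosine similarity. The sketch below assumes the embeddings already exist (how they are produced is out of scope) and uses a hypothetical threshold of 0.9 that would need tuning on real articles.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_semantic_duplicate(
    candidate: list[float],
    known: dict[str, list[float]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the URL of the most similar known article above the threshold."""
    best_url, best_score = None, threshold
    for url, vector in known.items():
        score = cosine_similarity(candidate, vector)
        if score >= best_score:
            best_url, best_score = url, score
    return best_url
```

A linear scan like this is fine for a few thousand recent articles; beyond that, a vector index would be the next thing to look at.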

Scrapers break. An outlet changes its HTML and the extractor stops working. I have alerts set up to detect when a scraper returns fewer results than usual, but the response is still manual: open the site, see what changed and update the selectors.
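
The "fewer results than usual" alert can be as simple as comparing the current run against a recent average. This is a sketch with made-up thresholds (`ratio`, `min_history`), not the actual alerting code.

```python
from statistics import mean


def looks_broken(
    recent_counts: list[int],
    current: int,
    ratio: float = 0.5,
    min_history: int = 5,
) -> bool:
    """Flag a scraper run that returned far fewer articles than its recent average."""
    if len(recent_counts) < min_history:
        return False  # not enough runs to know what "usual" is
    baseline = mean(recent_counts)
    return current < baseline * ratio
```

It will not say what changed in the HTML, but it turns a silent failure into a notification.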

AI costs scale poorly. Processing 200 articles daily with an LLM is not expensive. But if I want to scale to 2000 or handle longer articles, the cost multiplies fast. I am evaluating smaller, fine-tuned models for classification tasks, which do not need the most powerful LLM on the market.
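
To make the scaling concrete, the daily cost can be estimated from volume, tokens per article and price per million tokens. Every number passed to this function is a placeholder: the article gives no real token counts or prices, so plug in your provider's actual rates.

```python
def daily_llm_cost(
    articles_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated daily spend, in the same currency as the prices."""
    per_article = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return articles_per_day * per_article


# Placeholder inputs, not real prices: 1000 input and 200 output tokens per article.
small = daily_llm_cost(200, 1000, 200, 1.0, 2.0)
large = daily_llm_cost(2000, 1000, 200, 1.0, 2.0)
```

The estimate grows linearly with both article count and article length, so ten times the articles means ten times the bill unless a cheaper model takes over part of the work.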

Editorial validation is a bottleneck. If I do not review the articles, they do not get published. This limits publishing frequency. I am working on a trust-per-source system: if a source has a consistent history of approved articles, its next articles could be published automatically with less oversight.

The internal panel is rough. It works, but it is not comfortable. It has no advanced filters, does not allow grouping articles by topic and the interface is basically a table with buttons. It is the kind of thing that always gets pushed back in a personal project.


Next steps

Rolsfera is not finished and it will not be anytime soon. These are the open fronts:

Semantic duplicate detection. Moving from hash-based comparison to embedding-based comparison. The idea is that if two articles talk about the same thing with different words, the system detects them as duplicates and presents only the best one.

Source scoring system. Assigning a trust score to each source based on its approval and rejection history. This would allow automating publication from high-trust sources.
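
One possible shape for that score, as a sketch only: a smoothed approval ratio, so that a source with two approvals out of two does not look perfect, plus a minimum number of reviews before anything is published automatically. The smoothing priors and both thresholds are assumptions.

```python
def trust_score(
    approved: int,
    rejected: int,
    prior_approved: int = 1,
    prior_rejected: int = 1,
) -> float:
    """Approval ratio with a small prior so new sources start at 0.5."""
    total = approved + rejected + prior_approved + prior_rejected
    return (approved + prior_approved) / total


def can_auto_publish(
    approved: int,
    rejected: int,
    min_reviews: int = 30,
    min_score: float = 0.9,
) -> bool:
    """Only sources with enough history and a high score skip manual review."""
    enough_history = approved + rejected >= min_reviews
    return enough_history and trust_score(approved, rejected) >= min_score
```

For example, a source with 48 approvals and 2 rejections would qualify under these thresholds, while one with 5 approvals and no rejections would not have enough history yet.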

Public web interface. Taking Rolsfera out of my internal panel and turning it into a product others can use. This means authentication, source customization and a decent presentation layer.

Diversify distribution channels. Newsletter, own RSS feed (ironic, I know) and perhaps integration with feed readers.

Optimize AI costs. Evaluating smaller models for classification and reserving large LLMs only for summaries and tasks that truly need them.


Final thoughts

Rolsfera is not a revolutionary project. It is a news aggregator with scraping and AI, a concept that has existed for years. What makes it interesting to me is that it is a real lab where I test ideas around architecture, automation and data processing that I later apply in other contexts.

If I have learned anything building it, it is that the hard part is not setting up the pipeline. It is keeping it running when sources change, costs add up and real life leaves you little time to review articles. The architecture has to absorb that reality, not ignore it.

And that, I believe, is what separates a side project that survives from one that dies on the third commit.

OshyTech

Backend and data engineering focused on scalable systems, automation, and AI.


Copyright 2026 OshyTech. All Rights Reserved