Umberto D'Ovidio

Building Your Own LLM-Powered RSS Digest

In an era of information overload, staying up-to-date with your favorite blogs and news sources can be overwhelming. On top of that, the internet is full of clickbait that distracts us from the information that really matters. RSS feeds offer a solution, but who has time to read dozens of articles daily? This tutorial shows you how to build an intelligent RSS digest that automatically summarizes content using AI.

We’ll walk through three key steps: parsing RSS feeds, extracting full article content, and generating AI-powered summaries using OpenAI’s API.

Prerequisites

Before we start, you’ll need:

  • Python 3.7+
  • An OpenAI API key
  • Basic familiarity with Python

Install the required packages:

pip install feedparser requests beautifulsoup4 openai

Step 1: Parsing RSS Feeds with Python

RSS (Really Simple Syndication) feeds are XML files that websites use to publish their latest content. Python’s feedparser library makes it easy to work with these feeds.

import feedparser
import requests
from datetime import datetime, timezone

def parse_rss_feed(feed_url):
    """Parse a single RSS feed and extract articles"""

    # Fetch the RSS feed
    response = requests.get(feed_url, timeout=30)
    response.raise_for_status()

    # Parse the XML content
    feed = feedparser.parse(response.content)

    articles = []
    for entry in feed.entries:
        # Extract publication date
        pub_date = None
        if hasattr(entry, 'published_parsed') and entry.published_parsed:
            pub_date = datetime(*entry.published_parsed[:6], tzinfo=timezone.utc)

        # Get article content (summary from RSS)
        content = getattr(entry, 'summary', '')

        article = {
            'title': entry.get('title', 'No Title'),
            'url': entry.get('link', ''),
            'content': content,
            'published': pub_date,
            'author': entry.get('author', 'Unknown')
        }
        articles.append(article)

    return articles

# Example usage
feed_url = "https://feeds.bbci.co.uk/news/rss.xml"
articles = parse_rss_feed(feed_url)
print(f"Found {len(articles)} articles")

The feedparser library handles the complexity of XML parsing and provides a clean interface to access article metadata like titles, URLs, publication dates, and summary content.
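Feed fields vary quite a bit between publishers, so when a feed misbehaves it helps to inspect a parsed entry directly. Here is a minimal sketch; the printed fields are examples of what you might see, not guarantees:

import feedparser

# Inspect the first entry of a feed to see which fields it actually provides.
# feedparser can fetch the URL itself; entries behave like dictionaries.
feed = feedparser.parse("https://feeds.bbci.co.uk/news/rss.xml")
if feed.entries:
    entry = feed.entries[0]
    print(sorted(entry.keys()))               # e.g. 'title', 'link', 'summary', 'published_parsed'
    print(entry.get("title", "No Title"))
    print(entry.get("published", "No date"))  # human-readable date string, if the feed includes one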

Step 2: Extracting Full Content with Beautiful Soup

RSS feeds typically only include article summaries. To get the full content for better AI analysis, we need to fetch and parse the actual web pages.

from bs4 import BeautifulSoup

def extract_article_content(url):
    """Fetch and extract main content from a web page"""

    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; RSS Reader/1.0)'
    }

    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        # Parse HTML with Beautiful Soup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove unwanted elements
        for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
            element.decompose()

        # Extract text content
        content = soup.get_text(separator=' ', strip=True)

        # Clean up whitespace
        content = ' '.join(content.split())

        # Limit content length for LLM processing
        max_chars = 8000  # Roughly 2000 tokens
        if len(content) > max_chars:
            content = content[:max_chars] + "..."

        return content if len(content) > 100 else None

    except Exception as e:
        print(f"Error extracting content from {url}: {e}")
        return None

# Enhance articles with full content
def enhance_articles(articles):
    """Add full content to articles"""

    for article in articles:
        if article['url']:
            full_content = extract_article_content(article['url'])
            if full_content and len(full_content) > len(article['content']):
                article['content'] = full_content
                print(f"Enhanced: {article['title']}")

    return articles

Beautiful Soup excels at parsing HTML and extracting clean text content. We remove navigation elements, scripts, and other non-content parts to focus on the article text.
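If the extracted text still contains boilerplate, a possible refinement is to look for semantic containers such as <article> or <main> before falling back to the whole page. This is a heuristic sketch, not something every site supports:

from bs4 import BeautifulSoup

def extract_main_text(soup):
    """Prefer semantic containers like <article> or <main> when the page provides them."""
    # Heuristic: many (but not all) sites wrap the story body in one of these tags.
    container = soup.find('article') or soup.find('main') or soup.body or soup
    for element in container(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()
    text = container.get_text(separator=' ', strip=True)
    return ' '.join(text.split())

You could call something like this from extract_article_content in place of the whole-page get_text call and keep the length limit unchanged.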

Step 3: Generating AI Summaries with OpenAI

Now comes the magic: using OpenAI’s API to generate intelligent summaries of our collected articles.

import openai
from datetime import datetime

class DigestGenerator:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def create_digest(self, articles):
        """Generate an AI-powered digest of articles"""

        # Prepare articles for the prompt
        article_summaries = []
        for article in articles[:10]:  # Limit to 10 articles
            summary = {
                'title': article['title'],
                'content': article['content'][:1500],  # Truncate for token limits
                'url': article['url'],
                'published': article['published'].strftime('%Y-%m-%d') if article['published'] else 'Unknown'
            }
            article_summaries.append(summary)

        # Create the prompt
        prompt = self._build_digest_prompt(article_summaries)

        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3,
                max_tokens=2000
            )

            return response.choices[0].message.content

        except Exception as e:
            print(f"Error generating digest: {e}")
            return self._fallback_digest(article_summaries)

    def _build_digest_prompt(self, articles):
        """Build the prompt for AI digest generation"""

        prompt = f"""Create a comprehensive daily digest from these {len(articles)} articles.

Instructions:
- Summarize the key themes and trends
- Group related topics together
- Highlight the most important developments
- Keep it engaging and informative
- Use markdown formatting

Articles:

"""

        for i, article in enumerate(articles, 1):
            prompt += f"""
## Article {i}: {article['title']}
**URL:** {article['url']}
**Published:** {article['published']}

{article['content']}

---
"""

        return prompt

    def _fallback_digest(self, articles):
        """Simple fallback if AI fails"""
        digest = f"# Daily Digest - {datetime.now().strftime('%Y-%m-%d')}\n\n"

        for article in articles:
            digest += f"## {article['title']}\n"
            digest += f"**Published:** {article['published']}\n"
            digest += f"**Link:** {article['url']}\n\n"
            digest += f"{article['content'][:200]}...\n\n---\n\n"

        return digest

The key to good AI summaries is crafting effective prompts. We give the AI clear instructions and provide structured article data for analysis.
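If you want tighter control over the output, one variation (a sketch, not part of the class above; the section names and model choice are just examples) is to move the instructions into a system message and ask for a fixed structure:

def create_digest_v2(client, prompt):
    """Variant: the system message carries the editorial instructions,
    the user message carries the article data built by _build_digest_prompt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any chat model you have access to
        messages=[
            {"role": "system", "content": (
                "You are an editor writing a daily digest. Group related stories, "
                "lead with the most important ones, and output markdown with "
                "'Top Stories' and 'Also Worth Reading' sections."
            )},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2000,
    )
    return response.choices[0].message.content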

Putting It All Together

Here’s the main function putting all three components together:

import os

def main():
    RSS_FEEDS = [
        "https://feeds.bbci.co.uk/news/rss.xml",
        "http://rss.cnn.com/rss/cnn_latest.rss/"
    ]

    print("Fetching articles from RSS feeds...")
    all_articles = []

    # Step 1: Parse RSS feeds
    for feed_url in RSS_FEEDS:
        try:
            articles = parse_rss_feed(feed_url)
            all_articles.extend(articles)
            print(f"Fetched {len(articles)} articles from {feed_url}")
        except Exception as e:
            print(f"Error processing {feed_url}: {e}")

    if not all_articles:
        print("No articles found!")
        return

    # Sort by publication date (newest first); undated articles sort last.
    # Use a timezone-aware minimum so it compares cleanly with the UTC dates we parsed.
    all_articles.sort(
        key=lambda x: x['published'] or datetime.min.replace(tzinfo=timezone.utc),
        reverse=True
    )

    # Step 2: Enhance with full content
    print("Extracting full article content...")
    enhanced_articles = enhance_articles(all_articles[:15])  # Process top 15

    # Step 3: Generate AI digest
    print("Generating AI digest...")

    openai_api_key = os.getenv("OPENAI_API_KEY")
    generator = DigestGenerator(openai_api_key)
    digest = generator.create_digest(enhanced_articles)

    # Save and display results
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    filename = f"digest_{timestamp}.md"

    with open(filename, 'w', encoding='utf-8') as f:
        f.write(digest)

    print(f"\n✅ Digest saved to {filename}")
    print("\n" + "="*60)
    print(digest)
    print("="*60)

if __name__ == "__main__":
    main()

The full program is available here.

Key Considerations

When building your own RSS digest system, keep these points in mind:

Rate Limiting: Be respectful to websites by adding delays between requests and using proper User-Agent headers (see the sketch at the end of this section).

Error Handling: RSS feeds can be unreliable. Always include try/except blocks and graceful fallbacks.

Content Filtering: Consider adding keyword filtering to focus on topics that interest you most.

Token Limits: OpenAI models have token limits. Truncate content appropriately and batch large requests.

Caching: Store processed articles in a database to avoid re-processing and respect API usage limits. SQLite is a good candidate; a minimal sketch combining rate limiting and caching follows.
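Here is one way rate limiting and caching could fit together. It is a minimal sketch; the table schema, file name, and one-second delay are arbitrary choices:

import sqlite3
import time
import requests

def fetch_with_cache(url, db_path="digest_cache.db", delay_seconds=1.0):
    """Fetch a page politely: reuse a cached copy from SQLite, otherwise wait and fetch."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")

    # Return the cached copy if we've already fetched this URL.
    row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    if row:
        conn.close()
        return row[0]

    time.sleep(delay_seconds)  # simple rate limiting between outbound requests
    response = requests.get(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (compatible; RSS Reader/1.0)'},
        timeout=30
    )
    response.raise_for_status()

    conn.execute("INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, response.text))
    conn.commit()
    conn.close()
    return response.text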

Conclusion

With a couple hundred lines of Python code we were able to build a digest system that scrapes information from different feeds and summarizes it using an LLM. I hope this inspires you to build your own digest. If the topic interests you, check out Colino, a configurable open source digest system.