Building Your Own LLM-Powered RSS Digest
In an era of information overload, staying up to date with your favorite blogs and news sources can be overwhelming. On top of that, the internet is full of clickbait that distracts us from the information that actually matters. RSS feeds offer a solution, but who has time to read dozens of articles daily? This tutorial shows you how to build an intelligent RSS digest that automatically summarizes content using AI.
We’ll walk through three key steps: parsing RSS feeds, extracting full article content, and generating AI-powered summaries using OpenAI’s API.
Prerequisites
Before we start, you’ll need:
- Python 3.7+
- An OpenAI API key
- Basic familiarity with Python
Install the required packages:
```bash
pip install feedparser requests beautifulsoup4 openai
```
Step 1: Parsing RSS Feeds with Python
RSS (Really Simple Syndication) feeds are XML files that websites use to publish their latest content. Python’s feedparser library makes it easy to work with these feeds.
```python
import feedparser
import requests
from datetime import datetime, timezone

def parse_rss_feed(feed_url):
    """Parse a single RSS feed and extract articles"""

    # Fetch the RSS feed
    response = requests.get(feed_url, timeout=30)
    response.raise_for_status()

    # Parse the XML content
    feed = feedparser.parse(response.content)

    articles = []
    for entry in feed.entries:
        # Extract publication date
        pub_date = None
        if hasattr(entry, 'published_parsed') and entry.published_parsed:
            pub_date = datetime(*entry.published_parsed[:6], tzinfo=timezone.utc)

        # Get article content (summary from RSS)
        content = getattr(entry, 'summary', '')

        article = {
            'title': entry.get('title', 'No Title'),
            'url': entry.get('link', ''),
            'content': content,
            'published': pub_date,
            'author': entry.get('author', 'Unknown')
        }
        articles.append(article)

    return articles

# Example usage
feed_url = "https://feeds.bbci.co.uk/news/rss.xml"
articles = parse_rss_feed(feed_url)
print(f"Found {len(articles)} articles")
```
The feedparser library handles the complexity of XML parsing and provides a clean interface to access article metadata like titles, URLs, publication dates, and summary content.
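Before writing any extraction logic, it can help to poke at a feed interactively and see which fields it actually exposes, since they vary from feed to feed. A quick sketch:

```python
import feedparser

# Inspect a feed to see which fields its entries actually provide.
feed = feedparser.parse("https://feeds.bbci.co.uk/news/rss.xml")

print(feed.feed.get("title", "Unknown feed"))   # the feed's own title
print(f"{len(feed.entries)} entries")           # how many items it currently lists
if feed.entries:
    print(sorted(feed.entries[0].keys()))       # fields available on the first entry
```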
Step 2: Extracting Full Content with Beautiful Soup
RSS feeds typically only include article summaries. To get the full content for better AI analysis, we need to fetch and parse the actual web pages.
```python
from bs4 import BeautifulSoup
import re

def extract_article_content(url):
    """Fetch and extract main content from a web page"""

    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; RSS Reader/1.0)'
    }

    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        # Parse HTML with Beautiful Soup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove unwanted elements
        for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
            element.decompose()

        # Extract text content
        content = soup.get_text(separator=' ', strip=True)

        # Clean up whitespace
        content = ' '.join(content.split())

        # Limit content length for LLM processing
        max_chars = 8000  # Roughly 2000 tokens
        if len(content) > max_chars:
            content = content[:max_chars] + "..."

        return content if len(content) > 100 else None

    except Exception as e:
        print(f"Error extracting content from {url}: {e}")
        return None

# Enhance articles with full content
def enhance_articles(articles):
    """Add full content to articles"""

    for article in articles:
        if article['url']:
            full_content = extract_article_content(article['url'])
            if full_content and len(full_content) > len(article['content']):
                article['content'] = full_content
                print(f"Enhanced: {article['title']}")

    return articles
```
Beautiful Soup excels at parsing HTML and extracting clean text content. We remove navigation elements, scripts, and other non-content parts to focus on the article text.
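On sites that use semantic markup, you can often get even cleaner text by preferring the `<article>` element (or another main content container) before falling back to the whole page. A minimal sketch, assuming the page actually uses such a tag:

```python
from bs4 import BeautifulSoup

def extract_main_text(html):
    """Prefer the <article> element when present; fall back to the whole page."""
    soup = BeautifulSoup(html, 'html.parser')

    # Drop the same non-content elements as before
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()

    # Many (but not all) news sites wrap the story body in <article>
    container = soup.find('article') or soup.body or soup
    return ' '.join(container.get_text(separator=' ', strip=True).split())
```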
Step 3: Generating AI Summaries with OpenAI
Now comes the magic: using OpenAI’s API to generate intelligent summaries of our collected articles.
```python
import openai

class DigestGenerator:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def create_digest(self, articles):
        """Generate an AI-powered digest of articles"""

        # Prepare articles for the prompt
        article_summaries = []
        for i, article in enumerate(articles[:10], 1):  # Limit to 10 articles
            summary = {
                'title': article['title'],
                'content': article['content'][:1500],  # Truncate for token limits
                'url': article['url'],
                'published': article['published'].strftime('%Y-%m-%d') if article['published'] else 'Unknown'
            }
            article_summaries.append(summary)

        # Create the prompt
        prompt = self._build_digest_prompt(article_summaries)

        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3,
                max_tokens=2000
            )

            return response.choices[0].message.content

        except Exception as e:
            print(f"Error generating digest: {e}")
            return self._fallback_digest(article_summaries)

    def _build_digest_prompt(self, articles):
        """Build the prompt for AI digest generation"""

        prompt = f"""Create a comprehensive daily digest from these {len(articles)} articles.

Instructions:
- Summarize the key themes and trends
- Group related topics together
- Highlight the most important developments
- Keep it engaging and informative
- Use markdown formatting

Articles:

"""

        for i, article in enumerate(articles, 1):
            prompt += f"""
## Article {i}: {article['title']}
**URL:** {article['url']}
**Published:** {article['published']}

{article['content']}

---
"""

        return prompt

    def _fallback_digest(self, articles):
        """Simple fallback if AI fails"""
        digest = f"# Daily Digest - {datetime.now().strftime('%Y-%m-%d')}\n\n"

        for article in articles:
            digest += f"## {article['title']}\n"
            digest += f"**Published:** {article['published']}\n"
            digest += f"**Link:** {article['url']}\n\n"
            digest += f"{article['content'][:200]}...\n\n---\n\n"

        return digest
```
The key to good AI summaries is crafting effective prompts. We give the AI clear instructions and provide structured article data for analysis.
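The same prompting pattern works for per-article summaries, and a system message is a convenient place to keep the standing instructions. A minimal sketch, assuming the same openai client as above (the function name and prompt wording are illustrative, not part of the DigestGenerator class):

```python
import os
import openai

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def summarize_article(title, content):
    """Summarize a single article in a few bullet points."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            # System message carries the standing instructions
            {"role": "system", "content": "You are an editor writing concise, neutral news summaries in markdown."},
            # User message carries only the article data
            {"role": "user", "content": f"Summarize this article in 3 bullet points.\n\nTitle: {title}\n\n{content[:1500]}"},
        ],
        temperature=0.3,
        max_tokens=300,
    )
    return response.choices[0].message.content
```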
Putting It All Together
Here’s the main function that ties all three components together:
```python
import os

def main():
    RSS_FEEDS = [
        "https://feeds.bbci.co.uk/news/rss.xml",
        "http://rss.cnn.com/rss/cnn_latest.rss"
    ]

    print("Fetching articles from RSS feeds...")
    all_articles = []

    # Step 1: Parse RSS feeds
    for feed_url in RSS_FEEDS:
        try:
            articles = parse_rss_feed(feed_url)
            all_articles.extend(articles)
            print(f"Fetched {len(articles)} articles from {feed_url}")
        except Exception as e:
            print(f"Error processing {feed_url}: {e}")

    if not all_articles:
        print("No articles found!")
        return

    # Sort by publication date (newest first); use a timezone-aware minimum so
    # articles without a date compare cleanly against the aware datetimes above
    fallback_date = datetime.min.replace(tzinfo=timezone.utc)
    all_articles.sort(key=lambda x: x['published'] or fallback_date, reverse=True)

    # Step 2: Enhance with full content
    print("Extracting full article content...")
    enhanced_articles = enhance_articles(all_articles[:15])  # Process top 15

    # Step 3: Generate AI digest
    print("Generating AI digest...")

    openai_api_key = os.getenv("OPENAI_API_KEY")
    if not openai_api_key:
        print("OPENAI_API_KEY is not set!")
        return

    generator = DigestGenerator(openai_api_key)
    digest = generator.create_digest(enhanced_articles)

    # Save and display results
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    filename = f"digest_{timestamp}.md"

    with open(filename, 'w', encoding='utf-8') as f:
        f.write(digest)

    print(f"\n✅ Digest saved to {filename}")
    print("\n" + "="*60)
    print(digest)
    print("="*60)

if __name__ == "__main__":
    main()
```
The full program is available here.
Key Considerations
When building your own RSS digest system, keep these points in mind:
Rate Limiting: Be respectful to websites by adding delays between requests and using proper User-Agent headers; the first sketch after this list shows one way to do this.
Error Handling: RSS feeds can be unreliable. Always include try/except blocks and graceful fallbacks.
Content Filtering: Consider adding keyword filtering to focus on topics that interest you most.
Token Limits: OpenAI models have token limits. Truncate content appropriately and batch large requests.
Caching: Store processed articles in a database to avoid re-processing and to respect API usage limits. SQLite is a good candidate; the second sketch below shows a minimal cache.
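For rate limiting, a small fixed delay between page fetches already goes a long way. A minimal sketch (the one-second pause and helper name are just an example):

```python
import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; RSS Reader/1.0)'}

def polite_get(url, delay_seconds=1.0):
    """Fetch a URL, then pause so consecutive requests are spaced out."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    time.sleep(delay_seconds)
    return response
```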
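For caching, a single table keyed by article URL is enough to skip pages you have already processed. A minimal sketch using the standard-library sqlite3 module (the schema and file name are illustrative):

```python
import sqlite3

def init_cache(db_path="digest_cache.db"):
    """Create (if needed) a small cache table keyed by article URL."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, content TEXT, fetched_at TEXT)"
    )
    conn.commit()
    return conn

def is_cached(conn, url):
    """Return True if content for this URL is already stored."""
    return conn.execute("SELECT 1 FROM articles WHERE url = ?", (url,)).fetchone() is not None

def cache_article(conn, url, content):
    """Insert or refresh the cached content for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, datetime('now'))",
        (url, content),
    )
    conn.commit()
```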
Conclusion
With a couple hundred lines of Python, we built a digest system that pulls articles from different feeds and summarizes them using an LLM. I hope this inspired you to build your own digest. If the topic interests you, check out Colino, a configurable open-source digest system.