Introduction
AI applications frequently need to extract clean text content from web pages. Whether you are building a RAG pipeline, training a model, or monitoring competitor content, you need a way to turn messy HTML into clean, structured text. Keiro's /web-crawler endpoint does exactly this — and it is included in your Keiro subscription at no extra cost.
Why You Need a Web Crawler for AI
Raw HTML is not useful for LLMs. A typical web page contains navigation menus, advertisements, footers, JavaScript, CSS, and only a small amount of actual content. Feeding raw HTML to an LLM wastes tokens and confuses the model.
Keiro's /web-crawler endpoint handles all the messy work:
- Renders JavaScript-heavy pages
- Strips navigation, ads, and boilerplate
- Extracts the main content in clean text
- Preserves the page title and metadata
Basic Usage
Python

```python
import requests

response = requests.post("https://kierolabs.space/api/web-crawler", json={
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
})

data = response.json()
print(f"Title: {data.get('title', 'N/A')}")
print(f"Content length: {len(data.get('content', ''))} characters")
print(f"Content preview: {data.get('content', '')[:500]}")
```
JavaScript

```javascript
const response = await fetch("https://kierolabs.space/api/web-crawler", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    apiKey: "your-keiro-api-key",
    url: "https://example.com/blog/ai-trends-2026"
  })
});

const data = await response.json();
console.log("Title:", data.title);
console.log("Content:", data.content);
```
cURL

```bash
curl -X POST https://kierolabs.space/api/web-crawler \
  -H "Content-Type: application/json" \
  -d '{
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
  }'
```
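The examples above read `title` and `content` straight off the response JSON. Those are the only fields shown in this guide, so a small normalizer that defaults missing keys to empty strings keeps downstream code from raising `KeyError` — a minimal sketch, assuming no fields beyond the ones demonstrated here:

```python
def parse_crawl_response(data: dict) -> dict:
    """Normalize a /web-crawler response into a predictable shape.

    Assumes only the title/content fields used in the examples above;
    anything missing defaults to an empty string.
    """
    content = data.get("content", "")
    return {
        "title": data.get("title", ""),
        "content": content,
        "word_count": len(content.split()),
    }
```

Call it as `parse_crawl_response(response.json())` so the rest of your pipeline can rely on all three keys being present.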
Use Case 1: Enriching Search Results
Search results often include only snippets. Use the web crawler to get the full content of the most relevant results:
```python
import requests

KEIRO_BASE = "https://kierolabs.space/api"
API_KEY = "your-keiro-api-key"

def search_and_enrich(query: str, top_n: int = 3) -> list[dict]:
    """Search and then extract full content from top results."""
    # Search
    search_resp = requests.post(f"{KEIRO_BASE}/search", json={
        "apiKey": API_KEY,
        "query": query
    })
    results = search_resp.json().get("results", [])

    # Enrich top results with full content
    enriched = []
    for result in results[:top_n]:
        try:
            crawl_resp = requests.post(f"{KEIRO_BASE}/web-crawler", json={
                "apiKey": API_KEY,
                "url": result["url"]
            })
            crawl_data = crawl_resp.json()
            result["full_content"] = crawl_data.get("content", "")
        except Exception:
            # Fall back to the search snippet if the crawl fails
            result["full_content"] = result.get("snippet", "")
        enriched.append(result)
    return enriched

# Usage
results = search_and_enrich("transformer architecture innovations 2026")
for r in results:
    print(f"{r['title']}: {len(r['full_content'])} chars")
```
Use Case 2: Content Monitoring
Monitor competitor pages or documentation for changes:
```python
import requests
import hashlib
import json
import os

def monitor_pages(urls: list[str], cache_file: str = "page_cache.json"):
    """Monitor a list of URLs for content changes."""
    API_KEY = "your-keiro-api-key"

    # Load previous content hashes
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    else:
        cache = {}

    changes = []
    for url in urls:
        resp = requests.post("https://kierolabs.space/api/web-crawler", json={
            "apiKey": API_KEY,
            "url": url
        })
        content = resp.json().get("content", "")
        content_hash = hashlib.md5(content.encode()).hexdigest()

        if url in cache and cache[url] != content_hash:
            changes.append({"url": url, "status": "changed"})
        elif url not in cache:
            changes.append({"url": url, "status": "new"})
        cache[url] = content_hash

    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return changes

# Monitor competitor pages
changes = monitor_pages([
    "https://docs.exa.ai/reference/search",
    "https://docs.tavily.com/introduction",
])
for change in changes:
    print(f"{change['url']}: {change['status']}")
```
Use Case 3: Building a Knowledge Base
Extract content from a list of authoritative sources to build a knowledge base for your AI:
```python
import requests

def build_knowledge_base(urls: list[str]) -> list[dict]:
    """Extract content from URLs to build a knowledge base."""
    API_KEY = "your-keiro-api-key"
    knowledge_base = []
    for url in urls:
        try:
            resp = requests.post("https://kierolabs.space/api/web-crawler", json={
                "apiKey": API_KEY,
                "url": url
            })
            data = resp.json()
            knowledge_base.append({
                "url": url,
                "title": data.get("title", ""),
                "content": data.get("content", ""),
                "word_count": len(data.get("content", "").split())
            })
            print(f"Extracted: {data.get('title', url)}")
        except Exception as e:
            print(f"Failed: {url} - {e}")
    return knowledge_base

# Build a knowledge base from documentation pages
kb = build_knowledge_base([
    "https://python.langchain.com/docs/introduction",
    "https://docs.llamaindex.ai/en/latest/",
    "https://docs.anthropic.com/claude/docs"
])

total_words = sum(doc["word_count"] for doc in kb)
print(f"\nKnowledge base: {len(kb)} documents, {total_words} total words")
```
Tips and Best Practices
- Respect robots.txt: Only crawl pages you are allowed to access. Keiro handles this automatically.
- Rate limit your crawling: Even though Keiro handles the actual fetching, be mindful of your request volume.
- Truncate content for LLMs: Most LLMs work best with 3,000-5,000 words of context. Truncate longer pages.
- Cache results: Keiro gives a 50% discount on cached requests, so repeated crawls of the same URL are cheaper.
- Handle errors: Some pages may block crawlers or be behind authentication. Always handle errors gracefully.
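The truncation tip above can be sketched as a small helper. The 4,000-word default is an assumption drawn from the 3,000-5,000 range mentioned; tune it to your model's context window:

```python
def truncate_for_llm(text: str, max_words: int = 4000) -> str:
    """Trim crawled content to a word budget before prompting an LLM."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])
```

A word budget is a rough proxy for tokens; if you need exact limits, count with your model's tokenizer instead.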
Keiro /web-crawler vs Alternatives
| Feature | Keiro /web-crawler | Firecrawl /scrape | DIY (BeautifulSoup) |
|---|---|---|---|
| JavaScript Rendering | Yes | Yes | No (need Selenium) |
| Clean Content Extraction | Yes | Yes | Manual |
| Search Integration | Same API | Separate API needed | Separate API needed |
| Additional Cost | Included in plan | $19+/mo extra | Free (but dev time) |
| Maintenance | Zero | Zero | Ongoing |
Conclusion
Keiro's /web-crawler endpoint is a simple, reliable way to extract clean content from any web page. Combined with Keiro's search endpoints, you have everything you need to discover and extract web content for AI applications — all from a single API and subscription.
Start extracting web content with Keiro at kierolabs.space. Plans start at $5.99/month.