Web Scraping for AI: Extract Clean Content with Keiro Web Crawler

Learn how to use Keiro's /web-crawler endpoint to extract clean, structured content from any web page for use in AI applications.

8 min read · Keiro Team

Introduction

AI applications frequently need to extract clean text content from web pages. Whether you are building a RAG pipeline, training a model, or monitoring competitor content, you need a way to turn messy HTML into clean, structured text. Keiro's /web-crawler endpoint does exactly this — and it is included in your Keiro subscription at no extra cost.

Why You Need a Web Crawler for AI

Raw HTML is not useful for LLMs. A typical web page contains navigation menus, advertisements, footers, JavaScript, CSS, and only a small amount of actual content. Feeding raw HTML to an LLM wastes tokens and confuses the model.

Keiro's /web-crawler endpoint handles all the messy work:

  • Renders JavaScript-heavy pages
  • Strips navigation, ads, and boilerplate
  • Extracts the main content as clean text
  • Preserves the page title and metadata

Basic Usage

Python

import requests

response = requests.post("https://kierolabs.space/api/web-crawler", json={
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
})

data = response.json()
print(f"Title: {data.get('title', 'N/A')}")
print(f"Content length: {len(data.get('content', ''))} characters")
print(f"Content preview: {data.get('content', '')[:500]}")

JavaScript

const response = await fetch("https://kierolabs.space/api/web-crawler", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    apiKey: "your-keiro-api-key",
    url: "https://example.com/blog/ai-trends-2026"
  })
});

const data = await response.json();
console.log("Title:", data.title);
console.log("Content:", data.content);

cURL

curl -X POST https://kierolabs.space/api/web-crawler \
  -H "Content-Type: application/json" \
  -d '{
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
  }'
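
All three examples above make a single POST with the API key and target URL in the request body. For anything beyond a quick test, it helps to wrap the call in a small helper that sets a timeout and surfaces HTTP errors. The sketch below is a suggestion, not part of an official client: the 30-second timeout is an arbitrary default, and it assumes the endpoint signals failure with a non-2xx status code.

import requests

KEIRO_CRAWLER_URL = "https://kierolabs.space/api/web-crawler"

def crawl(url: str, api_key: str, timeout: int = 30) -> dict:
    """Fetch clean content for a single URL via the /web-crawler endpoint."""
    resp = requests.post(
        KEIRO_CRAWLER_URL,
        json={"apiKey": api_key, "url": url},
        timeout=timeout,  # avoid hanging on slow or unresponsive pages
    )
    resp.raise_for_status()  # assumes non-2xx status on failure
    return resp.json()       # expected to contain "title" and "content"

# Usage
page = crawl("https://example.com/blog/ai-trends-2026", "your-keiro-api-key")
print(page.get("title", "N/A"))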

Use Case 1: Enriching Search Results

Search results often include only snippets. Use the web crawler to get the full content of the most relevant results:

import requests

KEIRO_BASE = "https://kierolabs.space/api"
API_KEY = "your-keiro-api-key"

def search_and_enrich(query: str, top_n: int = 3) -> list[dict]:
    """Search and then extract full content from top results."""
    # Search
    search_resp = requests.post(f"{KEIRO_BASE}/search", json={
        "apiKey": API_KEY,
        "query": query
    })
    results = search_resp.json().get("results", [])

    # Enrich top results with full content
    enriched = []
    for result in results[:top_n]:
        try:
            crawl_resp = requests.post(f"{KEIRO_BASE}/web-crawler", json={
                "apiKey": API_KEY,
                "url": result["url"]
            })
            crawl_data = crawl_resp.json()
            result["full_content"] = crawl_data.get("content", "")
        except Exception as e:
            result["full_content"] = result.get("snippet", "")
        enriched.append(result)

    return enriched

# Usage
results = search_and_enrich("transformer architecture innovations 2026")
for r in results:
    print(f"{r['title']}: {len(r['full_content'])} chars")

Use Case 2: Content Monitoring

Monitor competitor pages or documentation for changes:

import requests
import hashlib
import json
import os

def monitor_pages(urls: list[str], cache_file: str = "page_cache.json"):
    """Monitor a list of URLs for content changes."""
    API_KEY = "your-keiro-api-key"

    # Load previous hashes
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    else:
        cache = {}

    changes = []
    for url in urls:
        resp = requests.post("https://kierolabs.space/api/web-crawler", json={
            "apiKey": API_KEY,
            "url": url
        })
        content = resp.json().get("content", "")
        content_hash = hashlib.md5(content.encode()).hexdigest()

        if url in cache and cache[url] != content_hash:
            changes.append({"url": url, "status": "changed"})
        elif url not in cache:
            changes.append({"url": url, "status": "new"})

        cache[url] = content_hash

    with open(cache_file, "w") as f:
        json.dump(cache, f)

    return changes

# Monitor competitor pages
changes = monitor_pages([
    "https://docs.exa.ai/reference/search",
    "https://docs.tavily.com/introduction",
])
for change in changes:
    print(f"{change['url']}: {change['status']}")

Use Case 3: Building a Knowledge Base

Extract content from a list of authoritative sources to build a knowledge base for your AI:

import requests

def build_knowledge_base(urls: list[str]) -> list[dict]:
    """Extract content from URLs to build a knowledge base."""
    API_KEY = "your-keiro-api-key"
    knowledge_base = []

    for url in urls:
        try:
            resp = requests.post("https://kierolabs.space/api/web-crawler", json={
                "apiKey": API_KEY,
                "url": url
            })
            data = resp.json()
            knowledge_base.append({
                "url": url,
                "title": data.get("title", ""),
                "content": data.get("content", ""),
                "word_count": len(data.get("content", "").split())
            })
            print(f"Extracted: {data.get('title', url)}")
        except Exception as e:
            print(f"Failed: {url} - {e}")

    return knowledge_base

# Build a knowledge base from documentation pages
kb = build_knowledge_base([
    "https://python.langchain.com/docs/introduction",
    "https://docs.llamaindex.ai/en/latest/",
    "https://docs.anthropic.com/claude/docs"
])

total_words = sum(doc["word_count"] for doc in kb)
print(f"\nKnowledge base: {len(kb)} documents, {total_words} total words")

Tips and Best Practices

  • Respect robots.txt: Only crawl pages you are allowed to access. Keiro handles this automatically.
  • Rate limit your crawling: Even though Keiro handles the actual fetching, be mindful of your request volume.
  • Truncate content for LLMs: Most LLMs work best with 3,000-5,000 words of context. Truncate longer pages (see the sketch after this list).
  • Cache results: Keiro gives a 50% discount on cached requests, so repeated crawls of the same URL are cheaper.
  • Handle errors: Some pages may block crawlers or be behind authentication. Always handle errors gracefully.
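
A minimal truncation helper for the tip above. The 4,000-word default is an illustrative figure, not a Keiro requirement; tune it to your model's context window, and consider a token-based limit if you need precision.

def truncate_for_llm(content: str, max_words: int = 4000) -> str:
    """Trim crawled page content to a word budget before sending it to an LLM."""
    words = content.split()
    return content if len(words) <= max_words else " ".join(words[:max_words])

# Usage with a crawled page (see Basic Usage above)
# trimmed = truncate_for_llm(data.get("content", ""))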

Keiro /web-crawler vs Alternatives

Feature                  | Keiro /web-crawler | Firecrawl /scrape   | DIY (BeautifulSoup)
JavaScript Rendering     | Yes                | Yes                 | No (need Selenium)
Clean Content Extraction | Yes                | Yes                 | Manual
Search Integration       | Same API           | Separate API needed | Separate API needed
Additional Cost          | Included in plan   | $19+/mo extra       | Free (but dev time)
Maintenance              | Zero               | Zero                | Ongoing

Conclusion

Keiro's /web-crawler endpoint is a simple, reliable way to extract clean content from any web page. Combined with Keiro's search endpoints, you have everything you need to discover and extract web content for AI applications — all from a single API and subscription.

Start extracting web content with Keiro at kierolabs.space. Plans start at $5.99/month.
