Web Scraping for AI: Extract Clean Content with Keiro Web Crawler

Learn how to use Keiro's /web-crawler endpoint to extract clean, structured content from any web page for use in AI applications.

8 min read · Keiro Team

Introduction

AI applications frequently need to extract clean text content from web pages. Whether you are building a RAG pipeline, training a model, or monitoring competitor content, you need a way to turn messy HTML into clean, structured text. Keiro's /web-crawler endpoint does exactly this — and it is included in your Keiro subscription at no extra cost.

Why You Need a Web Crawler for AI

Raw HTML is not useful for LLMs. A typical web page contains navigation menus, advertisements, footers, JavaScript, CSS, and only a small amount of actual content. Feeding raw HTML to an LLM wastes tokens and confuses the model.

Keiro's /web-crawler endpoint handles all the messy work:

  • Renders JavaScript-heavy pages
  • Strips navigation, ads, and boilerplate
  • Extracts the main content as clean text
  • Preserves the page title and metadata

Basic Usage

Python

import requests

response = requests.post("https://kierolabs.space/api/web-crawler", json={
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
})

data = response.json()
print(f"Title: {data.get('title', 'N/A')}")
print(f"Content length: {len(data.get('content', ''))} characters")
print(f"Content preview: {data.get('content', '')[:500]}")

JavaScript

const response = await fetch("https://kierolabs.space/api/web-crawler", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    apiKey: "your-keiro-api-key",
    url: "https://example.com/blog/ai-trends-2026"
  })
});

const data = await response.json();
console.log("Title:", data.title);
console.log("Content:", data.content);

cURL

curl -X POST https://kierolabs.space/api/web-crawler \
  -H "Content-Type: application/json" \
  -d '{
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
  }'
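
All three examples above make a single POST with the API key and target URL in the request body. For anything beyond a quick test, it helps to wrap the call in a small helper that sets a timeout and surfaces HTTP errors. The sketch below is a suggestion, not part of an official client: the 30-second timeout is an arbitrary default, and it assumes the endpoint signals failure with a non-2xx status code.

import requests

KEIRO_CRAWLER_URL = "https://kierolabs.space/api/web-crawler"

def crawl(url: str, api_key: str, timeout: int = 30) -> dict:
    """Fetch clean content for a single URL via the /web-crawler endpoint."""
    resp = requests.post(
        KEIRO_CRAWLER_URL,
        json={"apiKey": api_key, "url": url},
        timeout=timeout,  # avoid hanging on slow or unresponsive pages
    )
    resp.raise_for_status()  # assumes non-2xx status on failure
    return resp.json()       # expected to contain "title" and "content"

# Usage
page = crawl("https://example.com/blog/ai-trends-2026", "your-keiro-api-key")
print(page.get("title", "N/A"))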

Use Case 1: Enriching Search Results

Search results often include only snippets. Use the web crawler to get the full content of the most relevant results:

import requests

KEIRO_BASE = "https://kierolabs.space/api"
API_KEY = "your-keiro-api-key"

def search_and_enrich(query: str, top_n: int = 3) -> list[dict]:
    """Search and then extract full content from top results."""
    # Search
    search_resp = requests.post(f"{KEIRO_BASE}/search", json={
        "apiKey": API_KEY,
        "query": query
    })
    results = search_resp.json().get("results", [])

    # Enrich top results with full content
    enriched = []
    for result in results[:top_n]:
        try:
            crawl_resp = requests.post(f"{KEIRO_BASE}/web-crawler", json={
                "apiKey": API_KEY,
                "url": result["url"]
            })
            crawl_data = crawl_resp.json()
            result["full_content"] = crawl_data.get("content", "")
        except Exception as e:
            result["full_content"] = result.get("snippet", "")
        enriched.append(result)

    return enriched

# Usage
results = search_and_enrich("transformer architecture innovations 2026")
for r in results:
    print(f"{r['title']}: {len(r['full_content'])} chars")

Use Case 2: Content Monitoring

Monitor competitor pages or documentation for changes:

import requests
import hashlib
import json
import os

def monitor_pages(urls: list[str], cache_file: str = "page_cache.json"):
    """Monitor a list of URLs for content changes."""
    API_KEY = "your-keiro-api-key"

    # Load previous hashes
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    else:
        cache = {}

    changes = []
    for url in urls:
        resp = requests.post("https://kierolabs.space/api/web-crawler", json={
            "apiKey": API_KEY,
            "url": url
        })
        content = resp.json().get("content", "")
        content_hash = hashlib.md5(content.encode()).hexdigest()

        if url in cache and cache[url] != content_hash:
            changes.append({"url": url, "status": "changed"})
        elif url not in cache:
            changes.append({"url": url, "status": "new"})

        cache[url] = content_hash

    with open(cache_file, "w") as f:
        json.dump(cache, f)

    return changes

# Monitor competitor pages
changes = monitor_pages([
    "https://docs.exa.ai/reference/search",
    "https://docs.tavily.com/introduction",
])
for change in changes:
    print(f"{change['url']}: {change['status']}")

Use Case 3: Building a Knowledge Base

Extract content from a list of authoritative sources to build a knowledge base for your AI:

import requests

def build_knowledge_base(urls: list[str]) -> list[dict]:
    """Extract content from URLs to build a knowledge base."""
    API_KEY = "your-keiro-api-key"
    knowledge_base = []

    for url in urls:
        try:
            resp = requests.post("https://kierolabs.space/api/web-crawler", json={
                "apiKey": API_KEY,
                "url": url
            })
            data = resp.json()
            knowledge_base.append({
                "url": url,
                "title": data.get("title", ""),
                "content": data.get("content", ""),
                "word_count": len(data.get("content", "").split())
            })
            print(f"Extracted: {data.get('title', url)}")
        except Exception as e:
            print(f"Failed: {url} - {e}")

    return knowledge_base

# Build a knowledge base from documentation pages
kb = build_knowledge_base([
    "https://python.langchain.com/docs/introduction",
    "https://docs.llamaindex.ai/en/latest/",
    "https://docs.anthropic.com/claude/docs"
])

total_words = sum(doc["word_count"] for doc in kb)
print(f"\nKnowledge base: {len(kb)} documents, {total_words} total words")

Tips and Best Practices

  • Respect robots.txt: Only crawl pages you are allowed to access. Keiro handles this automatically.
  • Rate limit your crawling: Even though Keiro handles the actual fetching, be mindful of your request volume.
  • Truncate content for LLMs: Most LLMs work best with 3,000-5,000 words of context. Truncate longer pages (see the sketch after this list).
  • Cache results: Keiro gives a 50% discount on cached requests, so repeated crawls of the same URL are cheaper.
  • Handle errors: Some pages may block crawlers or be behind authentication. Always handle errors gracefully.
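
A minimal truncation helper for the tip above. The 4,000-word default is an illustrative figure, not a Keiro requirement; tune it to your model's context window, and consider a token-based limit if you need precision.

def truncate_for_llm(content: str, max_words: int = 4000) -> str:
    """Trim crawled page content to a word budget before sending it to an LLM."""
    words = content.split()
    return content if len(words) <= max_words else " ".join(words[:max_words])

# Usage with a crawled page (see Basic Usage above)
# trimmed = truncate_for_llm(data.get("content", ""))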

Keiro /web-crawler vs Alternatives

Feature                  | Keiro /web-crawler | Firecrawl /scrape   | DIY (BeautifulSoup)
JavaScript Rendering     | Yes                | Yes                 | No (need Selenium)
Clean Content Extraction | Yes                | Yes                 | Manual
Search Integration       | Same API           | Separate API needed | Separate API needed
Additional Cost          | Included in plan   | $19+/mo extra       | Free (but dev time)
Maintenance              | Zero               | Zero                | Ongoing

Conclusion

Keiro's /web-crawler endpoint is a simple, reliable way to extract clean content from any web page. Combined with Keiro's search endpoints, you have everything you need to discover and extract web content for AI applications — all from a single API and subscription.

Start extracting web content with Keiro at kierolabs.space. Plans start at $5.99/month.
