smart-web-scraper
v1.0.0
Extract structured data from any web page. Supports CSS selectors, auto-detection of tables and lists, and JSON/CSV output formats. Use when asked to scrape a website, extract data from a page, pull product info, gather contact details, or collect listings from a URL.
Installation
Please help me install the skill `smart-web-scraper` from the official SkillHub store.
npx skills add mariusfit/smart-web-scraper
Smart Web Scraper
Extract structured data from web pages into clean JSON or CSV.
Quick Start
# Scrape a page, extract all text content
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com"
# Extract specific elements with CSS selector
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com/products" -s ".product-card"
# Auto-detect and extract tables
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing"
# Extract all links from a page
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com"
# Extract structured data (title, meta, headings, links)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py structure "https://example.com"
# Output as JSON
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".item" -f json
# Output as CSV
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s "table tr" -f csv
# Save to file
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".product" -f json -o products.json
# Multi-page scrape (follow pagination)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py crawl "https://example.com/page/1" --pages 5 -s ".article"
Commands
| Command | Args | Description |
|---|---|---|
| `extract` | `<url> [-s selector] [-f format] [-o file]` | Extract content, optionally filtered by CSS selector |
| `tables` | `<url> [-f format] [-o file]` | Auto-detect and extract all HTML tables |
| `links` | `<url> [--external] [--internal]` | Extract all links (href + text) |
| `structure` | `<url>` | Extract page structure: title, meta, headings, images, links |
| `crawl` | `<url> --pages N [-s selector] [-f format] [-o file]` | Follow pagination links, extract from multiple pages |
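How `crawl` locates the next page is not documented here. One common convention is a link marked `rel="next"`; the sketch below follows that convention using only the standard library so it runs without extra installs. `NextLinkFinder` and `find_next_url` are hypothetical names for illustration, not part of `scripts/scraper.py`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> or <link> tag marked rel="next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link") or self.next_href is not None:
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "next" in rel_values and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None if no next link exists."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # Resolve relative hrefs such as "/page/2" against the current page URL.
    return urljoin(base_url, finder.next_href)


sample_html = '<div class="article">Post</div><a rel="next" href="/page/2">Next</a>'
next_url = find_next_url(sample_html, "https://example.com/page/1")
print(next_url)
```

A crawl loop would call a function like this after each page and stop when it returns `None` or when the `--pages` limit is reached.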
Output Formats
| Format | Flag | Description |
|---|---|---|
| Text | `-f text` | Plain text (default) |
| JSON | `-f json` | Structured JSON array |
| CSV | `-f csv` | Comma-separated values |
| Markdown | `-f md` | Markdown-formatted |
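To make the four formats concrete, here is a minimal sketch of how a list of extracted records could be serialized for each flag. The `render` function is hypothetical and only mirrors the format names in the table; the record fields (`text`, `tag`, `class`) are taken from the example output in this document, and the real scraper's output may differ in detail.

```python
import csv
import io
import json


def render(records, fmt="text"):
    """Serialize a list of flat dicts (all sharing the same keys) in one format."""
    if not records:
        return ""
    fields = list(records[0].keys())
    if fmt == "text":
        return "\n".join(str(r.get("text", "")) for r in records)
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    if fmt == "md":
        header = "| " + " | ".join(fields) + " |"
        divider = "|" + "|".join("---" for _ in fields) + "|"
        rows = ["| " + " | ".join(str(r.get(f, "")) for f in fields) + " |" for r in records]
        return "\n".join([header, divider, *rows])
    raise ValueError(f"unknown format: {fmt}")


records = [
    {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
    {"text": "Widget Max - $49.99", "tag": "div", "class": "product"},
]
print(render(records, "csv"))
```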
Examples
Extract product listings
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://shop.example.com" -s ".product" -f json
Output:
[
  {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
  {"text": "Widget Max - $49.99", "tag": "div", "class": "product"}
]
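Each record carries the element's text as one string, so downstream code usually needs to split it into fields. The sketch below parses the sample output above into names and prices. It assumes the `Name - $price` pattern seen in the sample; `parse_product` and `PRICE_PATTERN` are illustrative names, and real pages will likely need a different pattern.

```python
import json
import re

# The sample output shown above, as it might be read from products.json.
raw_output = """
[
  {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
  {"text": "Widget Max - $49.99", "tag": "div", "class": "product"}
]
"""

# Assumed layout: "<name> - $<price>", with an optional two-digit cents part.
PRICE_PATTERN = re.compile(r"^(?P<name>.+?)\s*-\s*\$(?P<price>\d+(?:\.\d{2})?)$")


def parse_product(record):
    """Split a record's text into name and price; return None if it does not match."""
    match = PRICE_PATTERN.match(record["text"].strip())
    if match is None:
        return None
    return {"name": match.group("name"), "price": float(match.group("price"))}


products = [p for p in map(parse_product, json.loads(raw_output)) if p is not None]
print(products)
```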
Extract pricing table
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing" -f csv
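For readers curious what table extraction to CSV involves, the following is a rough sketch using only the standard library's `html.parser` rather than the `beautifulsoup4`/`lxml` stack the skill itself uses. `TableCollector` and `tables_to_csv` are hypothetical names, and nested tables, `rowspan`, and `colspan` are deliberately not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None outside a <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def tables_to_csv(html):
    """Return one CSV string per table found in the HTML."""
    collector = TableCollector()
    collector.feed(html)
    outputs = []
    for rows in collector.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        outputs.append(buffer.getvalue())
    return outputs


sample_html = """
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>$10</td></tr>
  <tr><td>Pro</td><td>$25</td></tr>
</table>
"""
csv_tables = tables_to_csv(sample_html)
print(csv_tables[0])
```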
Get all external links
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com" --external
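The distinction between internal and external links can be sketched as a hostname comparison after resolving each `href` against the page URL. This is an assumption about what `--external` and `--internal` mean, not the scraper's confirmed logic: here a subdomain counts as external, and non-HTTP links such as `mailto:` are skipped. `LinkCollector` and `classify_links` are hypothetical names, and the sketch uses only the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"href": self._href, "text": "".join(self._text).strip()})
            self._href = None


def classify_links(html, base_url):
    """Split links into (internal, external) by comparing hostnames with base_url."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for link in collector.links:
        absolute = urljoin(base_url, link["href"])
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, and similar schemes
        entry = {"href": absolute, "text": link["text"]}
        (internal if parsed.netloc.lower() == base_host else external).append(entry)
    return internal, external


sample_html = (
    '<a href="/about">About</a>'
    '<a href="https://other.example.org/docs">Docs</a>'
    '<a href="mailto:hi@example.com">Mail</a>'
)
internal, external = classify_links(sample_html, "https://example.com")
print(internal)
print(external)
```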
Rate Limiting
- Default: 1 request per second (respectful crawling)
- Override with `--delay 0.5` (seconds between requests)
- Respects `robots.txt` by default (override with `--ignore-robots`)
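The two behaviors above, a fixed delay between requests and a `robots.txt` check, can be sketched with the standard library. This is an illustration of the idea, not the scraper's actual code: `Throttle` is a hypothetical helper, the rules are parsed from inline lines instead of being fetched, and the delay is shortened so the example runs quickly.

```python
import time
from urllib.robotparser import RobotFileParser


class Throttle:
    """Sleeps as needed so consecutive wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Parse rules from inline lines; a real crawler would fetch the site's /robots.txt.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

throttle = Throttle(delay=0.05)  # the documented default would be delay=1.0
allowed = []
for url in ["https://example.com/", "https://example.com/private/data"]:
    if not robots.can_fetch("*", url):
        continue  # disallowed by robots.txt, so never requested
    throttle.wait()
    allowed.append(url)  # a real crawler would issue the HTTP request here
print(allowed)
```

Checking `robots.txt` before waiting means disallowed URLs do not consume a delay slot.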
Notes
- Requires `beautifulsoup4` and `lxml` (auto-installed by `uv run --with`)
- Uses a standard browser User-Agent to avoid blocks
- Handles redirects, encoding detection, and error pages gracefully
- Does not render JavaScript, so it is suited to static HTML pages only