scrapling-extract

v1.0.3

Web scraping and data extraction using the Python Scrapling library. Use it to scrape static HTML pages, JavaScript-rendered pages (Playwright), and anti-bot or Cloudflare-protected sites (stealth browser). Supports CSS selectors, XPath, adaptive DOM relocation so selectors survive site redesigns, sess...

Sourced from ClawHub, Authored by PiyushZinc

Installation

Install the skill `scrapling-extract` from the SkillHub official store:

npx skills add PiyushZinc/scrapling-extract

Scrapling

Extract structured website data with resilient selection patterns, adaptive relocation, and the right Scrapling fetcher mode for each target.

Workflow

  1. Identify the target type before writing code:
     • Use Fetcher for static pages and API-like HTML responses.
     • Use DynamicFetcher when JavaScript rendering is required.
     • Use StealthyFetcher when anti-bot protection or browser fingerprinting issues are likely.
  2. Choose the output contract first:
     • Return JSON for pipelines/automation.
     • Return Markdown/text for summarization or RAG ingestion.
     • Keep stable field names even if the selector strategy changes.
  3. Implement selectors in this order:
     • Start with CSS selectors and pseudo-elements (for example ::text, ::attr(href)).
     • Fall back to XPath for ambiguous DOM structure.
     • Enable adaptive relocation for brittle or changing pages.
  4. Add safety controls:
     • Respect target site terms and legal boundaries.
     • Add timeouts, retries, and explicit error handling.
     • Log status code, URL, and selector misses for debugging.
  5. Validate on at least 2 pages:
     • Test one happy path and one edge case page.
     • Confirm required fields are non-empty.
     • Keep extraction deterministic (no hidden random choices).
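
The safety and validation steps above can be sketched without committing to a particular fetcher. This is a minimal illustration using only the standard library, not code from the skill: `fetch_with_retries`, `build_record`, and the `fetch` callable are hypothetical names, and in practice `fetch` would wrap a Scrapling call such as `Fetcher.get`.

```python
import json
import logging
import time
from typing import Callable, Iterable

log = logging.getLogger("extract")


def fetch_with_retries(fetch: Callable[[str], object], url: str,
                       retries: int = 3, backoff: float = 0.0) -> object:
    """Call fetch(url), retrying on any exception for up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # assumption: narrow this to your fetcher's errors
            last_error = exc
            log.warning("attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
            if attempt < retries:
                time.sleep(backoff * attempt)  # linear backoff between attempts
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


def build_record(url: str, fields: dict, required: Iterable[str]) -> str:
    """Return a JSON record with sorted keys; raise if a required field is empty."""
    missing = [name for name in required if not fields.get(name)]
    if missing:
        log.error("selector miss on %s: %s", url, missing)
        raise ValueError(f"missing required fields: {missing}")
    # Sorted keys keep the output deterministic across runs.
    return json.dumps({"url": url, **fields}, sort_keys=True)
```

Keeping the field names in `fields` fixed means downstream consumers are unaffected when a CSS selector is later swapped for XPath or adaptive relocation.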

Quick Setup

  1. Install the base package:
     • pip install scrapling
  2. Install fetchers when browser-based fetching is needed:
     • pip install "scrapling[fetchers]"
     • scrapling install
     • python3 -m playwright install (required for DynamicFetcher and StealthyFetcher)
  3. Install optional extras as needed:
     • pip install "scrapling[shell]" for shell + extract commands
     • pip install "scrapling[ai]" for MCP capabilities
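
Before choosing a browser-based fetcher, it can help to confirm that the packages installed above are importable. The following is a small sketch, not part of the skill: `missing_modules` is a hypothetical helper, and the import names `scrapling` and `playwright` are assumed to match the installed packages.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the base package and the browser backend.
REQUIRED_FOR_BROWSER_FETCHERS = ("scrapling", "playwright")
```

Calling `missing_modules(REQUIRED_FOR_BROWSER_FETCHERS)` returns an empty list when both packages are available, which a script could use to fall back to the static fetcher or print an install hint.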

Execution Patterns

Pattern: One-off terminal extraction

Use the Scrapling CLI for the fastest no-code extraction (the extract command comes with the shell extra):

scrapling extract get "https://example.com" content.md --css-selector "main"
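
To run the same command from an automation script, one option is to build the argument list in Python and hand it to `subprocess`. This is a sketch under assumptions: `build_extract_command` and `run_extract` are hypothetical helpers, only the subcommand and the `--css-selector` flag shown above are used, and the `scrapling` executable is assumed to be on `PATH`.

```python
import subprocess
from typing import List


def build_extract_command(url: str, output_path: str, css_selector: str) -> List[str]:
    """Build the argv list for the one-off CLI extraction shown above."""
    return ["scrapling", "extract", "get", url, output_path,
            "--css-selector", css_selector]


def run_extract(url: str, output_path: str, css_selector: str,
                timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    command = build_extract_command(url, output_path, css_selector)
    return subprocess.run(command, check=True, timeout=timeout)
```

Passing the arguments as a list avoids shell quoting issues with selectors that contain spaces or brackets.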

Pattern: Python extraction script

Use the bundled helper:

# Static page (default)
python scripts/extract_with_scrapling.py --url "https://example.com" --css "h1::text"

# JavaScript-rendered page
python scripts/extract_with_scrapling.py --url "https://example.com" --fetcher dynamic --css "h1::text"

# Anti-bot protected page
python scripts/extract_with_scrapling.py --url "https://example.com" --fetcher stealthy --css "h1::text"
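
The bundled script's source is not reproduced here. As a rough idea of how a helper with these flags could be structured, the sketch below maps the `--fetcher` value to a Scrapling fetcher class. The function names and the `static` choice name are assumptions for illustration, not the actual contents of `extract_with_scrapling.py`.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Extract values with a CSS selector.")
    parser.add_argument("--url", required=True)
    parser.add_argument("--css", required=True)
    # "static" is an assumed name for the default mode.
    parser.add_argument("--fetcher", choices=("static", "dynamic", "stealthy"),
                        default="static")
    return parser.parse_args(argv)


def fetch_page(url, fetcher):
    """Fetch `url` with the fetcher class matching the requested mode."""
    # Imports are deferred so argument parsing works without the browser extras.
    if fetcher == "dynamic":
        from scrapling.fetchers import DynamicFetcher
        return DynamicFetcher.fetch(url)
    if fetcher == "stealthy":
        from scrapling.fetchers import StealthyFetcher
        return StealthyFetcher.fetch(url)
    from scrapling.fetchers import Fetcher
    return Fetcher.get(url)


def main(argv=None):
    """Fetch the page, apply the selector, and print one JSON record."""
    args = parse_args(argv)
    page = fetch_page(args.url, args.fetcher)
    values = [str(value) for value in page.css(args.css)]
    print(json.dumps({"url": args.url, "selector": args.css, "values": values}))
```

A real script would call `main()` under an `if __name__ == "__main__":` guard and add the timeouts, retries, and logging described in the workflow.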

Pattern: Session-based scraping

Use session classes when cookies/state must persist across requests.

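from scrapling.fetchers import FetcherSession

# Open the session as a context manager so it is set up and closed cleanly.
with FetcherSession() as session:
    # Log in once; the session keeps the cookies the server sets.
    login_page = session.post("https://example.com/login", data={"user": "...", "pass": "..."})
    # Later requests in the same session reuse those cookies.
    protected_page = session.get("https://example.com/dashboard")
    headline = protected_page.css_first("h1::text")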

Use StealthySession or DynamicSession in the same way for anti-bot or JS-rendered targets. These browser-based sessions load pages with fetch() rather than get()/post().

Pattern: DOM change resilience

Use auto_save=True on initial capture and retry with adaptive selection on later runs when selectors break.

from scrapling.fetchers import Fetcher

# Assumption: recent Scrapling releases enable adaptive mode on the fetcher class
# and take auto_save / adaptive per selector call; older releases used auto_match.
Fetcher.adaptive = True

# First run: save the matched elements' properties so they can be relocated later
page = Fetcher.get("https://example.com")
prices = page.css(".price", auto_save=True)

# Later runs: relocate the elements even if the DOM changed
page = Fetcher.get("https://example.com")
prices = page.css(".price", adaptive=True)
price = prices[0].text if prices else None

References

  • Use scrapling-reference.md for fetcher/API examples and selector patterns.
  • Use extract_with_scrapling.py for a reusable CLI script template.