Modern Web Scraping with AI: A 1-Hour Tutorial

Overview

In this tutorial, we’ll explore how AI-powered tools like GitHub Copilot have revolutionized web scraping. Instead of manually writing complex parsing logic, we’ll leverage AI to help us quickly build robust scrapers that can collect, parse, and save data from websites.

What you’ll learn:

  • Setting up GitHub Copilot with VS Code
  • Using AI prompts to generate web scraping code
  • Extracting data from a real website
  • Handling common challenges (rate limiting, blocking)
  • Ethics and best practices

Prerequisites:

  • Basic Python knowledge
  • VS Code installed
  • GitHub account

1. Setup: GitHub Copilot + VS Code (10 minutes)

Install VS Code

If you don’t have VS Code installed, download it from https://code.visualstudio.com/ and run the installer for your operating system.

Setup GitHub Copilot

  1. Subscribe to GitHub Copilot: sign up at https://github.com/features/copilot (free and paid plans are available)
  2. Install GitHub Copilot Extension:
    • Open VS Code
    • Go to Extensions (Ctrl+Shift+X; on macOS, use Cmd in place of Ctrl for every shortcut in this tutorial)
    • Search for “GitHub Copilot”
    • Install the official extension by GitHub
    • Install “GitHub Copilot Chat” as well
  3. Authenticate:
    • Press Ctrl+Shift+P and type “GitHub Copilot: Sign In”
    • Follow the authentication flow
    • In VS Code, open Copilot Chat with Ctrl+Shift+I

2. Project Setup (5 minutes)

Create Project Structure

Step 1: Create a new folder for your project

  • Create a new folder called tutorial on your desktop or preferred location
  • Open this folder in VS Code (File → Open Folder)

Step 2: Set up a Python virtual environment

A virtual environment creates an isolated Python workspace that prevents package conflicts between different projects and keeps your system Python installation clean.

  • Open the VS Code terminal (Terminal → New Terminal)
  • For Mac/Linux users: Run these commands one by one:
    python -m venv crawl
    source crawl/bin/activate
    
  • For Windows users: Run these commands one by one:
    python -m venv crawl
    crawl\Scripts\activate
    
  • You should see (crawl) appear at the beginning of your terminal prompt, indicating the virtual environment is active

AI Prompt for Dependencies

  • Open Copilot Chat and select the “Agent” mode
  • Pick the model you want to use (I use the latest Claude model)

Prompt:

I'm building a web scraper in Python virtual environment crawl.
What packages should I install for modern web scraping?
I want to scrape HTML, handle JavaScript-rendered pages, save to CSV,
and be respectful with rate limiting.
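
Copilot’s answer will vary with the model, but for this stack it usually suggests installing something along these lines (the exact package list may differ):

pip install requests beautifulsoup4 pandas selenium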

3. Target Website Selection (5 minutes)

For this tutorial, we’ll scrape Books to Scrape (https://books.toscrape.com/), a website specifically designed for scraping practice.

Why this website?

  • ✅ Legal and ethical to scrape
  • ✅ No authentication required
  • ✅ Structured data (titles, prices, ratings)
  • ✅ Multiple pages for pagination practice
  • ✅ No aggressive blocking

Alternative websites: the hands-on exercise in Section 8 lists other scraping-friendly options (Quotes to Scrape, Hacker News, Wikipedia Recent Changes).


4. Building the Scraper with AI (25 minutes)

Step 1: Basic Page Fetching

Prompt:

Write a Python script named book.py that fetches the HTML content from "https://books.toscrape.com/".
Include proper error handling and user agent headers to be respectful.

Expected code structure:

import requests
from bs4 import BeautifulSoup
import time
import csv
import pandas as pd

def fetch_page(url):
    """Download a page and return its HTML, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise an error for 4xx/5xx status codes
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
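
To confirm the function works before adding parsing, you can run a quick check like this at the bottom of book.py (a temporary test, not part of the final script):

if __name__ == "__main__":
    html = fetch_page("https://books.toscrape.com/")
    print("Fetched", len(html) if html else 0, "characters")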

Step 2: Parse Book Data

Prompt:

Parse the HTML from books.toscrape.com to extract book information.
I need: title, price, rating (stars), and availability.
Use BeautifulSoup and return a list of dictionaries.
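
Copilot’s exact output will differ, but a parsing function along these lines is typical. The CSS selectors below reflect the site’s markup at the time of writing; verify them in your browser’s developer tools:

from bs4 import BeautifulSoup

def parse_book_data(html):
    """Extract title, price, rating, and availability for every book on one page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # Each book sits inside an <article class="product_pod"> element
    for article in soup.select("article.product_pod"):
        rating_classes = article.select_one("p.star-rating")["class"]
        books.append({
            "title": article.h3.a["title"],
            "price": article.select_one("p.price_color").get_text(strip=True),
            # The rating is encoded as a class name, e.g. "star-rating Three"
            "rating": next((c for c in rating_classes if c != "star-rating"), None),
            "availability": article.select_one("p.instock.availability").get_text(strip=True),
        })
    return books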

Step 3: Handle Pagination

Prompt:

Modify the scraper to handle pagination on books.toscrape.com.
The site has "Next" buttons to navigate through pages.
Scrape all books from all pages and combine the results.
Use a timer to show how long the program takes to finish.
Save the scraped book data to a CSV file with proper column headers.

Complete Example Structure:

def parse_book_data(html):
    # AI will generate BeautifulSoup parsing logic
    pass

def scrape_all_books():
    # AI will generate pagination handling
    pass

def save_to_csv(books_data, filename='books.csv'):
    # AI will generate CSV saving logic
    pass

if __name__ == "__main__":
    books = scrape_all_books()
    save_to_csv(books)
    print(f"Scraped {len(books)} books successfully!")
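
One way the filled-in version might look, reusing fetch_page and parse_book_data from the earlier steps (the li.next selector matches the site’s “Next” button):

import time
from urllib.parse import urljoin
import pandas as pd
from bs4 import BeautifulSoup

def scrape_all_books(base_url="https://books.toscrape.com/"):
    """Follow the 'Next' link page by page and collect every parsed book."""
    all_books = []
    url = base_url
    start = time.time()
    while url:
        html = fetch_page(url)
        if html is None:
            break
        all_books.extend(parse_book_data(html))
        next_link = BeautifulSoup(html, "html.parser").select_one("li.next > a")
        # urljoin resolves the relative href (e.g. "page-2.html") against the current URL
        url = urljoin(url, next_link["href"]) if next_link else None
        time.sleep(1)  # pause between pages to stay polite
    print(f"Finished in {time.time() - start:.1f} seconds")
    return all_books

def save_to_csv(books_data, filename="books.csv"):
    """Write the list of dictionaries to a CSV file with column headers."""
    pd.DataFrame(books_data).to_csv(filename, index=False)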

5. Improving the Code with AI (10 minutes)

Data Types in CSV

Prompt:

When saving the CSV file, convert the price and rating to numeric values,
and use 1 for in stock and 0 for out of stock.
Make sure to fetch all pages and all books.
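
A cleaning step the AI might produce could look roughly like this (the rating-word mapping and the "In stock" check are assumptions based on how the site formats its data):

import re

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_book(book):
    """Convert the raw string fields into numeric types before saving."""
    return {
        "title": book["title"],
        # Keep only digits and the decimal point, e.g. "£51.77" -> 51.77
        "price": float(re.sub(r"[^\d.]", "", book["price"])),
        # Map the rating word to a number, e.g. "Three" -> 3
        "rating": RATING_WORDS.get(book["rating"], 0),
        # 1 if the book is in stock, 0 otherwise
        "available": 1 if "In stock" in book["availability"] else 0,
    }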

Performance Optimization

Prompt:

Optimize my web scraper for better performance.
Add concurrent requests using threading or async programming while maintaining rate limits.
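
A threading-based sketch, assuming the catalogue/page-N.html URL pattern and reusing fetch_page and parse_book_data (the 50-page count is specific to this site and would need adjusting elsewhere):

import time
from concurrent.futures import ThreadPoolExecutor

PAGE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

def fetch_and_parse(page_number):
    time.sleep(0.5)  # small per-request delay so workers don't hammer the server
    html = fetch_page(PAGE_URL.format(page_number))
    return parse_book_data(html) if html else []

def scrape_concurrently(num_pages=50, max_workers=4):
    """Fetch pages in a small worker pool while keeping the request rate modest."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(fetch_and_parse, range(1, num_pages + 1))
    return [book for page in pages for book in page]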

Documentation

Prompt:

Create a README.md that explains the project.

Handling Rate Limiting

Prompt:

Add rate limiting to my web scraper to be respectful to the server.
Add random delays between requests and implement exponential backoff for failed requests.
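
A typical shape for the result, with a random delay before every request and exponential backoff on failure (the delay ranges are arbitrary starting points):

import random
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def fetch_with_backoff(url, max_retries=3):
    """Fetch a URL politely, retrying with exponentially growing waits on failure."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(1, 3))  # random delay between requests
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return None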

Bypassing Basic Blocks

Prompt:

My scraper is getting blocked.
Help me add rotation of user agents, session handling, and proxy support
to make it more robust against basic anti-bot measures.
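
A sketch of the user-agent rotation and session-handling pieces (the agent strings are illustrative; proxy support is left to the AI’s answer):

import random
import requests

# Illustrative pool of desktop user agents to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

session = requests.Session()  # reuses connections and keeps cookies between requests

def fetch_rotated(url):
    response = session.get(url, headers={"User-Agent": random.choice(USER_AGENTS)}, timeout=10)
    response.raise_for_status()
    return response.text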

Adding Robust Error Handling

Prompt:

Improve error handling in my scraper.
Add logging, retry logic for network failures, and graceful handling of parsing errors.
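
A minimal sketch of the logging and graceful-parsing parts, assuming the parse_book_data function from earlier:

import logging

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("scraper")

def safe_parse(html, url):
    """Log and skip pages that fail to parse instead of crashing the whole run."""
    try:
        return parse_book_data(html)
    except Exception:
        logger.exception("Failed to parse %s", url)
        return []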

6. Ethics and Best Practices

The Golden Rules

  1. Always check robots.txt first: website.com/robots.txt (see the sketch after this list)
  2. Respect rate limits - don’t overwhelm servers
  3. Read Terms of Service - some sites prohibit scraping
  4. Don’t scrape personal/private data without permission
  5. Use official APIs when available
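
Python’s standard library can do the robots.txt check for you; a minimal sketch using books.toscrape.com as the example:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://books.toscrape.com/robots.txt")
rp.read()
# True means the rules allow a generic crawler ("*") to fetch this path
print(rp.can_fetch("*", "https://books.toscrape.com/catalogue/page-2.html"))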

Prompt for Ethics Check:

I want to scrape [website]. Help me check if this is ethical and legal.
What should I look for in their robots.txt and terms of service?

Good Practices:

  • ✅ Start with small-scale testing
  • ✅ Cache responses to avoid re-scraping
  • ✅ Identify yourself with proper User-Agent
  • ✅ Use official APIs when available
  • ❌ Don’t scrape copyrighted content for commercial use
  • ❌ Don’t ignore robots.txt directives
  • ❌ Don’t overwhelm servers with rapid requests

7. Troubleshooting Common Issues

Problem: “403 Forbidden” or “429 Too Many Requests”

Prompt:

My scraper is getting 403/429 errors.
Help me implement proper rate limiting, user agent rotation,
and session management to handle this respectfully.

Problem: JavaScript-Rendered Content

Prompt:

The website loads content with JavaScript.
Help me use Selenium WebDriver to scrape dynamic content that's not in the initial HTML.
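
Copilot’s answer will depend on your setup; a minimal sketch assuming Selenium 4 and Chrome installed locally (books.toscrape.com is used only as a placeholder URL):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://books.toscrape.com/")
    html = driver.page_source  # the HTML after JavaScript has executed
finally:
    driver.quit()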

Problem: CAPTCHAs or Complex Anti-Bot

Ethical Response:

  • Don’t try to bypass CAPTCHAs
  • Look for official APIs instead
  • Contact the website owner for permission

8. Hands-On Exercise (Remaining Time)

Your Turn!

Write a prompt that asks the AI to analyze the books CSV file in a Jupyter notebook.

What would you prompt if you also wanted to save the book cover images?

Pick one of these websites and use AI to help you build a scraper:

  1. Quotes to Scrape: http://quotes.toscrape.com/
    • Extract: quote text, author, tags
  2. Hacker News: https://news.ycombinator.com/
    • Extract: story titles, scores, comment counts
  3. Wikipedia Recent Changes: https://en.wikipedia.org/wiki/Special:RecentChanges
    • Extract: page titles, edit summaries, timestamps

Prompting Strategy:

  1. Start with: “Help me scrape [website] to extract [specific data]”
  2. Ask for improvements: “Make this more robust against errors”
  3. Add features: “Add data cleaning and validation”
  4. Optimize: “How can I make this faster while being respectful?”

Key Takeaways

  1. AI accelerates development - What used to take hours now takes minutes
  2. Start with clear prompts - Be specific about what you want to extract
  3. Iterate and improve - Use AI to continuously enhance your code
  4. Ethics first - Always scrape responsibly and legally
  5. APIs are usually better - When available, use official APIs instead

Resources for Continued Learning

  • Requests documentation: https://requests.readthedocs.io/
  • Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
  • Selenium documentation: https://www.selenium.dev/documentation/
  • pandas documentation: https://pandas.pydata.org/docs/

Next Steps

  • Explore advanced AI tools (Claude Code, Cursor, Windsurf, GPT-5, Gemini, etc.)
  • Learn about headless browsers and browser automation
  • Practice on scraping-friendly websites