Modern Web Scraping with AI: A 1-Hour Tutorial
Overview
In this tutorial, we’ll explore how AI-powered tools like GitHub Copilot have revolutionized web scraping. Instead of manually writing complex parsing logic, we’ll leverage AI to help us quickly build robust scrapers that can collect, parse, and save data from websites.
What you’ll learn:
- Setting up GitHub Copilot with VS Code
- Using AI prompts to generate web scraping code
- Extracting data from a real website
- Handling common challenges (rate limiting, blocking)
- Ethics and best practices
Prerequisites:
- Basic Python knowledge
- VS Code installed
- GitHub account
1. Setup: GitHub Copilot + VS Code (10 minutes)
Install VS Code
If you don’t have VS Code installed:
- Download from: https://code.visualstudio.com/
- Install the Python extension: https://marketplace.visualstudio.com/items?itemName=ms-python.python
Setup GitHub Copilot
- Subscribe to GitHub Copilot:
- Visit: https://github.com/features/copilot
- Get free access if you’re a student/teacher: https://education.github.com/pack
- Install GitHub Copilot Extension:
- Open VS Code
- Go to Extensions (Ctrl+Shift+X; on a Mac, replace Ctrl with Cmd in all shortcuts in this document)
- Search for “GitHub Copilot”
- Install the official extension by GitHub
- Install “GitHub Copilot Chat” as well
- Authenticate:
- Press Ctrl+Shift+P and type “GitHub Copilot: Sign In”
- Follow the authentication flow
- In VS Code, open Copilot Chat with Ctrl+Shift+I
2. Project Setup (5 minutes)
Create Project Structure
Step 1: Create a new folder for your project
- Create a new folder called tutorial on your desktop or preferred location
- Open this folder in VS Code (File → Open Folder)
Step 2: Set up Python virtual environment A virtual environment creates an isolated Python workspace that prevents package conflicts between different projects and keeps your system Python installation clean.
- Open the VS Code terminal (Terminal → New Terminal)
- For Mac/Linux users: Run these commands one by one:
python -m venv crawl
source crawl/bin/activate
- For Windows users: Run these commands one by one:
python -m venv crawl
crawl\Scripts\activate
- You should see (crawl) appear at the beginning of your terminal prompt, indicating the virtual environment is active
AI Prompt for Dependencies
Prompt:
- Select the “Agent” mode
- Pick the model you want to use. I use the latest Claude model.
I'm building a web scraper in Python virtual environment crawl.
What packages should I install for modern web scraping?
I want to scrape HTML, handle JavaScript-rendered pages, save to CSV,
and be respectful with rate limiting.
3. Target Website Selection (5 minutes)
For this tutorial, we’ll scrape Books to Scrape (https://books.toscrape.com/), a website specifically designed for scraping practice.
Why this website?
- ✅ Legal and ethical to scrape
- ✅ No authentication required
- ✅ Structured data (titles, prices, ratings)
- ✅ Multiple pages for pagination practice
- ✅ No aggressive blocking
Alternative websites:
- Quotes to Scrape: http://quotes.toscrape.com/
4. Building the Scraper with AI (25 minutes)
Step 1: Basic Page Fetching
Prompt:
Write a Python script book.py to fetch the HTML content from "https://books.toscrape.com/".
Include proper error handling and user agent headers to be respectful.
Expected code structure:
import requests
from bs4 import BeautifulSoup
import time
import csv
import pandas as pd

def fetch_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
Step 2: Parse Book Data
Prompt:
Parse the HTML from books.toscrape.com to extract book information.
I need: title, price, rating (stars), and availability.
Use BeautifulSoup and return a list of dictionaries.
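Copilot's exact output will vary, but a parsing function along these lines is what to expect. This is only a sketch; the selectors assume the article.product_pod markup that books.toscrape.com currently uses:

from bs4 import BeautifulSoup

def parse_book_data(html):
    """Extract title, price, rating, and availability from one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # Each book on the listing page sits inside an <article class="product_pod">
    for card in soup.select("article.product_pod"):
        title = card.h3.a["title"]
        price = card.select_one("p.price_color").get_text(strip=True)  # e.g. "£51.77"
        # The star rating is encoded as a CSS class, e.g. "star-rating Three"
        rating = card.select_one("p.star-rating")["class"][1]
        availability = card.select_one("p.instock.availability").get_text(strip=True)
        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "availability": availability,
        })
    return books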
Step 3: Handle Pagination
Prompt:
Modify the scraper to handle pagination on books.toscrape.com.
The site has "Next" buttons to navigate through pages.
Scrape all books from all pages and combine the results.
Use a timer to show how long the program takes to finish.
Save the scraped book data to a CSV file with proper column headers.
Complete Example Structure:
def parse_book_data(html):
    # AI will generate BeautifulSoup parsing logic
    pass

def scrape_all_books():
    # AI will generate pagination handling
    pass

def save_to_csv(books_data, filename='books.csv'):
    # AI will generate CSV saving logic
    pass

if __name__ == "__main__":
    books = scrape_all_books()
    save_to_csv(books)
    print(f"Scraped {len(books)} books successfully!")
5. Improving the Code with AI (10 minutes)
Data types in CSV
Prompt:
When saving the CSV file, make sure to change the price and rating to numeric values,
and use 1 for being available and 0 for out of stock.
Make sure to fetch all pages and all books.
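The conversion Copilot produces will differ in detail, but the idea is roughly the following sketch, assuming the raw strings look like "£51.77", "Three", and "In stock":

import re

# Maps the rating word from the CSS class to a number
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_book(book):
    """Convert price, rating, and availability to numeric values."""
    return {
        "title": book["title"],
        # Strip the currency symbol (and any stray encoding artifacts) before converting
        "price": float(re.sub(r"[^\d.]", "", book["price"])),
        "rating": RATING_WORDS.get(book["rating"], 0),
        "available": 1 if "In stock" in book["availability"] else 0,
    }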
Performance Optimization
Prompt:
Optimize my web scraper for better performance.
Add concurrent requests using threading or async programming while maintaining rate limits.
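A minimal sketch of one approach the AI might take: a thread pool with a small number of workers, with submissions staggered so requests don't burst. It reuses fetch_page and parse_book_data from earlier and assumes you can build the list of page URLs up front:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_pages_concurrently(urls, max_workers=4, delay=0.5):
    """Fetch and parse several pages in parallel while keeping concurrency modest."""
    all_books = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for url in urls:
            futures.append(pool.submit(fetch_page, url))
            time.sleep(delay)  # stagger submissions so requests don't hit the server in a burst
        for future in as_completed(futures):
            html = future.result()
            if html:
                all_books.extend(parse_book_data(html))
    return all_books

At the time of writing, the catalogue has 50 listing pages, so the URLs can be built as https://books.toscrape.com/catalogue/page-1.html through page-50.html.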
Documentation
Prompt:
Create a README.md that explains the project.
Handling Rate Limiting
Prompt:
Add rate limiting to my web scraper to be respectful to the server.
Add random delays between requests and implement exponential backoff for failed requests.
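For reference, the pattern Copilot usually lands on looks something like this sketch (not the exact code it will produce):

import random
import time
import requests

def fetch_with_backoff(url, max_retries=3):
    """Fetch a URL with a random polite delay and exponential backoff on failure."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    for attempt in range(max_retries):
        time.sleep(random.uniform(1.0, 3.0))  # random delay before each request
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait}s")
            time.sleep(wait)
    return None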
Bypassing Basic Blocks
Prompt:
My scraper is getting blocked.
Help me add rotation of user agents, session handling, and proxy support
to make it more robust against basic anti-bot measures.
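A sketch of the first two ideas, user-agent rotation and a persistent session. Proxy support would be added via the proxies argument of requests; the user-agent strings below are placeholders:

import random
import requests

# A small pool of user agents to rotate through (placeholder values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

session = requests.Session()  # reuses connections and keeps cookies between requests

def fetch_rotating(url):
    """Fetch a page with a randomly chosen user agent on a shared session."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text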
Adding Robust Error Handling
Prompt:
Improve error handling in my scraper.
Add logging, retry logic for network failures, and graceful handling of parsing errors.
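Retry logic can reuse the backoff pattern shown above; the logging and graceful-parsing side might look roughly like this sketch, where the card argument is one product_pod element from BeautifulSoup:

import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")

def parse_book_safely(card):
    """Parse one book card, logging and skipping anything that doesn't match."""
    try:
        return {
            "title": card.h3.a["title"],
            "price": card.select_one("p.price_color").get_text(strip=True),
        }
    except (AttributeError, KeyError, TypeError) as e:
        logger.warning("Skipping a malformed book entry: %s", e)
        return None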
6. Ethics and Legal Considerations (5 minutes)
The Golden Rules
- Always check robots.txt first: website.com/robots.txt (see the sketch after this list)
- Respect rate limits - don’t overwhelm servers
- Read Terms of Service - some sites prohibit scraping
- Don’t scrape personal/private data without permission
- Use official APIs when available
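The robots.txt check can be automated with Python's standard library; a minimal sketch using the tutorial site:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://books.toscrape.com/robots.txt")
robots.read()

# True if the given user agent is allowed to fetch this path
print(robots.can_fetch("*", "https://books.toscrape.com/catalogue/page-2.html"))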
Prompt for Ethics Check:
I want to scrape [website]. Help me check if this is ethical and legal.
What should I look for in their robots.txt and terms of service?
Good Practices:
- ✅ Start with small-scale testing
- ✅ Cache responses to avoid re-scraping
- ✅ Identify yourself with proper User-Agent
- ✅ Use official APIs when available
- ❌ Don’t scrape copyrighted content for commercial use
- ❌ Don’t ignore robots.txt directives
- ❌ Don’t overwhelm servers with rapid requests
7. Troubleshooting Common Issues
Problem: “403 Forbidden” or “429 Too Many Requests”
Prompt:
My scraper is getting 403/429 errors.
Help me implement proper rate limiting, user agent rotation,
and session management to handle this respectfully.
Problem: JavaScript-Rendered Content
Prompt:
The website loads content with JavaScript.
Help me use Selenium WebDriver to scrape dynamic content that's not in the initial HTML.
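A minimal sketch of the Selenium approach (assuming the selenium package is installed; recent versions fetch a matching browser driver automatically):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://books.toscrape.com/")
    html = driver.page_source  # HTML after JavaScript has run
    # ...hand the HTML to the same BeautifulSoup parsing code as before...
finally:
    driver.quit()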
Problem: CAPTCHAs or Complex Anti-Bot
Ethical Response:
- Don’t try to bypass CAPTCHAs
- Look for official APIs instead
- Contact the website owner for permission
8. Hands-On Exercise (Remaining Time)
Your Turn!
Write a prompt that analyzes the CSV file of books in a Jupyter notebook.
What if you want to save the pictures of the books as well?
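If you do want the cover images, the idea is to take each image's src, resolve it against the page URL, and write the bytes to disk. A rough sketch (the product_pod selector is an assumption about the listing markup):

import os
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def download_covers(html, page_url, folder="covers"):
    """Save every book cover image found on one listing page."""
    os.makedirs(folder, exist_ok=True)
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.select("article.product_pod img"):
        img_url = urljoin(page_url, img["src"])  # src is relative to the page
        filename = os.path.join(folder, os.path.basename(img_url))
        data = requests.get(img_url, timeout=10).content
        with open(filename, "wb") as f:
            f.write(data)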
Pick one of these websites and use AI to help you build a scraper:
- Quotes to Scrape: http://quotes.toscrape.com/
- Extract: quote text, author, tags
- Hacker News: https://news.ycombinator.com/
- Extract: story titles, scores, comment counts
- Wikipedia Recent Changes: https://en.wikipedia.org/wiki/Special:RecentChanges
- Extract: page titles, edit summaries, timestamps
Prompting Strategy:
- Start with: “Help me scrape [website] to extract [specific data]”
- Ask for improvements: “Make this more robust against errors”
- Add features: “Add data cleaning and validation”
- Optimize: “How can I make this faster while being respectful?”
Key Takeaways
- AI accelerates development - What used to take hours now takes minutes
- Start with clear prompts - Be specific about what you want to extract
- Iterate and improve - Use AI to continuously enhance your code
- Ethics first - Always scrape responsibly and legally
- APIs are usually better - When available, use official APIs instead
Resources for Continued Learning
- Hands-On Large Language Models: https://a.co/d/bK30tce
- Github Copilot Tutorial: https://www.youtube.com/watch?v=SJqGYwRq0uc
- Claude Code: https://www.claude.com/product/claude-code
Next Steps
- Explore advanced AI tools (Claude Code, Cursor, Windsurf, GPT-5, Gemini, etc.)
- Learn about headless browsers and browser automation
- Practice on scraping-friendly websites