Import the required libraries
In [ ]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
Define the URL pattern of the catalogue pages for convenient reuse
In [ ]:
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
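As a quick illustration of how this pattern is used, the sketch below fills the `{}` placeholder with a page number via `str.format`. The range 1 to 50 matches the scraping loop in this notebook; the `page_urls` name is introduced here only for illustration.

```python
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

# Fill the {} placeholder with each page number (1 through 50).
page_urls = [base_url.format(page) for page in range(1, 51)]

print(page_urls[0])    # https://books.toscrape.com/catalogue/page-1.html
print(page_urls[-1])   # https://books.toscrape.com/catalogue/page-50.html
print(len(page_urls))  # 50
```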
In [ ]:
all_books = []
for page in tqdm(range(1, 51), desc="Scraping"):
    url = base_url.format(page)
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # stop early if a page fails to load
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each book card is an element carrying all four grid classes
    books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")
    for book in books:
        # Book title
        title = book.h3.a["title"]
        # Image link: turn the relative src into an absolute URL
        img_url = book.find("div", class_="image_container").a.img["src"]
        img_url = "https://books.toscrape.com/" + img_url.replace("../", "")
        # Rating: the second CSS class holds the value
        rating_class = book.find("p", class_="star-rating")["class"]
        rating = rating_class[1]  # One, Two, Three...
        # Price
        price = book.find("p", class_="price_color").text.strip()
        all_books.append({
            "Title": title,
            "Image URL": img_url,
            "Rating": rating,
            "Price": price
        })
df = pd.DataFrame(all_books)
print(df.head())
Scraping: 100%|██████████| 50/50 [00:10<00:00, 4.86it/s]
                                   Title  \
0                   A Light in the Attic   
1                     Tipping the Velvet   
2                             Soumission   
3                          Sharp Objects   
4  Sapiens: A Brief History of Humankind   
                                           Image URL Rating    Price  
0  https://books.toscrape.com/media/cache/2c/da/2...  Three  £51.77  
1  https://books.toscrape.com/media/cache/26/0c/2...    One  £53.74  
2  https://books.toscrape.com/media/cache/3e/ef/3...    One  £50.10  
3  https://books.toscrape.com/media/cache/32/51/3...   Four  £47.82  
4  https://books.toscrape.com/media/cache/be/a5/b...   Five  £54.23  
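To make the extraction steps easier to follow without hitting the network, here is a minimal sketch that applies the same selectors to a single book card. The `SAMPLE_HTML` snippet is a hand-written stand-in (the real page markup may differ in detail), and `parse_book` is a hypothetical helper name, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one book card; assumed to resemble the real markup.
SAMPLE_HTML = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
  <article class="product_pod">
    <div class="image_container">
      <a href="example-book_1/index.html">
        <img src="../media/cache/aa/bb/example.jpg" alt="Example Book">
      </a>
    </div>
    <p class="star-rating Three"></p>
    <h3><a href="example-book_1/index.html" title="Example Book">Example ...</a></h3>
    <p class="price_color">£12.34</p>
  </article>
</li>
"""


def parse_book(book):
    """Extract title, image URL, rating and price from one book card."""
    title = book.h3.a["title"]
    img_src = book.find("div", class_="image_container").a.img["src"]
    img_url = "https://books.toscrape.com/" + img_src.replace("../", "")
    # "class" is multi-valued, e.g. ["star-rating", "Three"]
    rating = book.find("p", class_="star-rating")["class"][1]
    price = book.find("p", class_="price_color").text.strip()
    return {"Title": title, "Image URL": img_url, "Rating": rating, "Price": price}


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")
record = parse_book(cards[0])
print(record)
```

Pulling the per-book logic into a function like this is one possible design choice: it lets the selectors be checked against a small fixture before running the full 50-page loop.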
Scraping is done. The results are stored in a DataFrame, which looks like this:
In [ ]:
df
Out[ ]:
| | Title | Image URL | Rating | Price |
|---|---|---|---|---|
| 0 | A Light in the Attic | https://books.toscrape.com/media/cache/2c/da/2... | Three | £51.77 | 
| 1 | Tipping the Velvet | https://books.toscrape.com/media/cache/26/0c/2... | One | £53.74 | 
| 2 | Soumission | https://books.toscrape.com/media/cache/3e/ef/3... | One | £50.10 | 
| 3 | Sharp Objects | https://books.toscrape.com/media/cache/32/51/3... | Four | £47.82 | 
| 4 | Sapiens: A Brief History of Humankind | https://books.toscrape.com/media/cache/be/a5/b... | Five | £54.23 | 
| ... | ... | ... | ... | ... | 
| 995 | Alice in Wonderland (Alice's Adventures in Won... | https://books.toscrape.com/media/cache/96/ee/9... | One | £55.53 | 
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | https://books.toscrape.com/media/cache/09/7c/0... | Four | £57.06 | 
| 997 | A Spy's Devotion (The Regency Spies of London #1) | https://books.toscrape.com/media/cache/1b/5f/1... | Five | £16.97 | 
| 998 | 1st to Die (Women's Murder Club #1) | https://books.toscrape.com/media/cache/2b/41/2... | One | £53.98 | 
| 999 | 1,000 Places to See Before You Die | https://books.toscrape.com/media/cache/d7/0f/d... | Five | £26.08 | 
1000 rows × 4 columns