Declare and import the libraries we need

In [ ]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

Define the URL pattern of the catalogue pages so it is easy to reuse

In [ ]:
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
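To make the pattern concrete, here is a minimal sketch of how `base_url.format(page)` expands into page URLs. The helper name `page_urls` is hypothetical (it is not part of the notebook), and the range 1 to 50 is taken from the scraping loop.

```python
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

def page_urls(first, last):
    """Return the listing-page URLs for pages first..last (inclusive).

    Hypothetical helper, used here only to illustrate the pattern.
    """
    return [base_url.format(page) for page in range(first, last + 1)]

urls = page_urls(1, 50)
print(urls[0])    # first listing page
print(urls[-1])   # last listing page
print(len(urls))  # number of pages to visit
```

The `{}` placeholder is filled with the page number, so one string is enough to address every listing page.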
In [ ]:
all_books = []

for page in tqdm(range(1, 51), desc="Scraping"):
    url = base_url.format(page)
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # stop early if a page fails to load
    soup = BeautifulSoup(res.text, 'html.parser')

    # Each book on a listing page sits in its own grid cell
    books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")

    for book in books:
        # Book title
        title = book.h3.a["title"]

        # Image link (relative in the page, so make it absolute)
        img_url = book.find("div", class_="image_container").a.img["src"]
        img_url = "https://books.toscrape.com/" + img_url.replace("../", "")

        # Rating: the second CSS class is the word One, Two, Three...
        rating_class = book.find("p", class_="star-rating")["class"]
        rating = rating_class[1]

        # Price
        price = book.find("p", class_="price_color").text.strip()

        all_books.append({
            "Title": title,
            "Image URL": img_url,
            "Rating": rating,
            "Price": price
        })

df = pd.DataFrame(all_books)
print(df.head())  # preview the first five rows
Scraping: 100%|██████████| 50/50 [00:10<00:00,  4.86it/s]
                                   Title  \
0                   A Light in the Attic   
1                     Tipping the Velvet   
2                             Soumission   
3                          Sharp Objects   
4  Sapiens: A Brief History of Humankind   

                                           Image URL Rating    Price  
0  https://books.toscrape.com/media/cache/2c/da/2...  Three  £51.77  
1  https://books.toscrape.com/media/cache/26/0c/2...    One  £53.74  
2  https://books.toscrape.com/media/cache/3e/ef/3...    One  £50.10  
3  https://books.toscrape.com/media/cache/32/51/3...   Four  £47.82  
4  https://books.toscrape.com/media/cache/be/a5/b...   Five  £54.23  
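The loop above makes the image link absolute by stripping `"../"` and prepending the site root. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve the relative `src` against the page URL instead. The helper name `absolute_image_url` and the image path below are hypothetical placeholders, not values taken from the site.

```python
from urllib.parse import urljoin

def absolute_image_url(page_url, src):
    """Resolve a relative <img src> against the URL of the page it came from."""
    return urljoin(page_url, src)

page_url = "https://books.toscrape.com/catalogue/page-1.html"
src = "../media/cache/ab/cd/example.jpg"  # hypothetical relative path

resolved = absolute_image_url(page_url, src)
# The string-replace approach from the loop, for comparison
replaced = "https://books.toscrape.com/" + src.replace("../", "")

print(resolved)
print(resolved == replaced)
```

For this page layout both approaches should agree. `urljoin` has the advantage that it keeps working if the relative path ever changes depth.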

Scraping is done. I put the results into a DataFrame, so it looks like this:

In [ ]:
df
Out[ ]:
     Title                                              Image URL                                          Rating  Price
0    A Light in the Attic                               https://books.toscrape.com/media/cache/2c/da/2...  Three   £51.77
1    Tipping the Velvet                                 https://books.toscrape.com/media/cache/26/0c/2...  One     £53.74
2    Soumission                                         https://books.toscrape.com/media/cache/3e/ef/3...  One     £50.10
3    Sharp Objects                                      https://books.toscrape.com/media/cache/32/51/3...  Four    £47.82
4    Sapiens: A Brief History of Humankind              https://books.toscrape.com/media/cache/be/a5/b...  Five    £54.23
...  ...                                                ...                                                ...     ...
995  Alice in Wonderland (Alice's Adventures in Won...  https://books.toscrape.com/media/cache/96/ee/9...  One     £55.53
996  Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)   https://books.toscrape.com/media/cache/09/7c/0...  Four    £57.06
997  A Spy's Devotion (The Regency Spies of London #1)  https://books.toscrape.com/media/cache/1b/5f/1...  Five    £16.97
998  1st to Die (Women's Murder Club #1)                https://books.toscrape.com/media/cache/2b/41/2...  One     £53.98
999  1,000 Places to See Before You Die                 https://books.toscrape.com/media/cache/d7/0f/d...  Five    £26.08

1000 rows × 4 columns
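
The `Rating` and `Price` columns are still plain strings ("Three", "£51.77"). If numeric values are needed later, a minimal sketch of the conversion could look like the following. The names `RATING_WORDS`, `rating_to_int`, and `price_to_float` are hypothetical and not part of the notebook, and the sample rows are copied from the first two rows of the table above.

```python
# Map the rating words found in the star-rating CSS class to integers
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Convert a rating word such as 'Three' to the integer 3."""
    return RATING_WORDS[word]

def price_to_float(text):
    """Convert a price string such as '£51.77' to the float 51.77."""
    return float(text.strip().lstrip("£"))

# (Rating, Price) pairs from the first two rows of the scraped table
rows = [("Three", "£51.77"), ("One", "£53.74")]
cleaned = [(rating_to_int(r), price_to_float(p)) for r, p in rows]
print(cleaned)
```

On the DataFrame itself, the same idea could probably be applied with `df["Rating"].map(RATING_WORDS)` and `df["Price"].str.lstrip("£").astype(float)`, assuming every price starts with the pound sign as in the rows shown.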