Declare and load the required libraries
In [ ]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
Define the URL pattern of the catalogue pages for reuse
In [ ]:
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
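As a quick check of what this pattern expands to, the following sketch formats it for pages 1 through 50, the same range the scraping loop uses. The name `page_urls` is illustrative and does not appear in the notebook.

```python
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

# Fill the {} placeholder with each page number from 1 to 50
page_urls = [base_url.format(page) for page in range(1, 51)]

print(page_urls[0])    # first catalogue page
print(page_urls[-1])   # last catalogue page
print(len(page_urls))  # number of pages to request
```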
In [ ]:
all_books = []

for page in tqdm(range(1, 51), desc="Scraping data"):
    url = base_url.format(page)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    # Each book card on a listing page carries these grid classes
    books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")

    for book in books:
        # Book title
        title = book.h3.a["title"]

        # Image URL: the src is relative, so strip "../" and prepend the site root
        img_url = book.find("div", class_="image_container").a.img["src"]
        img_url = "https://books.toscrape.com/" + img_url.replace("../", "")

        # Rating: the second CSS class is the rating word (One, Two, Three...)
        rating_class = book.find("p", class_="star-rating")["class"]
        rating = rating_class[1]

        # Price
        price = book.find("p", class_="price_color").text.strip()

        all_books.append({
            "Title": title,
            "Image URL": img_url,
            "Rating": rating,
            "Price": price
        })

df = pd.DataFrame(all_books)
print(df.head())  # preview the first five rows
Scraping data: 100%|██████████| 50/50 [00:10<00:00, 4.86it/s]
                                   Title  \
0                   A Light in the Attic
1                     Tipping the Velvet
2                             Soumission
3                          Sharp Objects
4  Sapiens: A Brief History of Humankind

                                           Image URL Rating   Price
0  https://books.toscrape.com/media/cache/2c/da/2...  Three  £51.77
1  https://books.toscrape.com/media/cache/26/0c/2...    One  £53.74
2  https://books.toscrape.com/media/cache/3e/ef/3...    One  £50.10
3  https://books.toscrape.com/media/cache/32/51/3...   Four  £47.82
4  https://books.toscrape.com/media/cache/be/a5/b...   Five  £54.23
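The loop builds the absolute image URL by stripping "../" from the `src` attribute and prepending the site root. A possible alternative is to let the standard library resolve the relative path against the page URL. The sketch below assumes the `src` has the form "../media/cache/..." as seen from a catalogue page; the image path used here is a made-up placeholder, not a real file on the site.

```python
from urllib.parse import urljoin

# URL of the listing page the <img> tag was found on
page_url = "https://books.toscrape.com/catalogue/page-1.html"

# Hypothetical relative src value, for illustration only
relative_src = "../media/cache/aa/bb/example.jpg"

# urljoin resolves "../" relative to the directory of page_url
img_url = urljoin(page_url, relative_src)
print(img_url)
```

This avoids hard-coding the site root and keeps working if the number of "../" segments changes, at the cost of having to pass the page URL along with each `src`.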
Scraping is done. The results are stored in a DataFrame, which looks like this
In [ ]:
df
Out[ ]:
| | Title | Image URL | Rating | Price |
---|---|---|---|---|
0 | A Light in the Attic | https://books.toscrape.com/media/cache/2c/da/2... | Three | £51.77 |
1 | Tipping the Velvet | https://books.toscrape.com/media/cache/26/0c/2... | One | £53.74 |
2 | Soumission | https://books.toscrape.com/media/cache/3e/ef/3... | One | £50.10 |
3 | Sharp Objects | https://books.toscrape.com/media/cache/32/51/3... | Four | £47.82 |
4 | Sapiens: A Brief History of Humankind | https://books.toscrape.com/media/cache/be/a5/b... | Five | £54.23 |
... | ... | ... | ... | ... |
995 | Alice in Wonderland (Alice's Adventures in Won... | https://books.toscrape.com/media/cache/96/ee/9... | One | £55.53 |
996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | https://books.toscrape.com/media/cache/09/7c/0... | Four | £57.06 |
997 | A Spy's Devotion (The Regency Spies of London #1) | https://books.toscrape.com/media/cache/1b/5f/1... | Five | £16.97 |
998 | 1st to Die (Women's Murder Club #1) | https://books.toscrape.com/media/cache/2b/41/2... | One | £53.98 |
999 | 1,000 Places to See Before You Die | https://books.toscrape.com/media/cache/d7/0f/d... | Five | £26.08 |
1000 rows × 4 columns
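In this table, Rating is stored as a word and Price as a string with a currency symbol, so neither can be sorted or averaged numerically as is. The sketch below shows one way to convert them. The helper names `rating_to_int` and `price_to_float` and the `RATING_WORDS` mapping are assumptions made for illustration; they are not part of the notebook.

```python
import re

# Map the rating words used in the star-rating class to integers
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Return the numeric rating for a rating word, or None if unknown."""
    return RATING_WORDS.get(word)


def price_to_float(text):
    """Extract the first decimal number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(rating_to_int("Three"))
print(price_to_float("£51.77"))
```

With the DataFrame above, these helpers could be applied column-wise, for example `df["Rating"].map(rating_to_int)` and `df["Price"].map(price_to_float)`.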