[Python] Building a Web Crawler with Requests/BeautifulSoup


🔖 What is a web crawler?

Sometimes you want to do more than just view information in a browser: you want to store or manipulate it in a form that is convenient for you.

With Python, a few lines of simple code are enough to pull just the information you want from a website and work with it.

 

๐Ÿ”– ์›น์—์„œ ์ •๋ณด ๊ฐ€์ ธ์˜ค๊ธฐ

โžค Requests๋ฅผ ์ด์šฉํ•˜๊ธฐ

python์—๋Š” requests ๋ผ๋Š” ์œ ๋ช…ํ•œ http request๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์žˆ๋‹ค.

โžค ์„ค์น˜ํ•˜๊ธฐ

pip3 install requests

from rest_framework.views import APIView
from rest_framework.response import Response

from pocket.models import List
from pocket.serializers import ListSerializer

import requests

class ParseAPIView(APIView):
    def get(self, request):
        queryset = List.objects.all()
        serializer = ListSerializer(queryset.first())

        # Send an HTTP GET request
        req = requests.get('https://laagom.tistory.com/')

        # Get the HTML source of the response
        html = req.text

        # Get the HTTP response headers
        header = req.headers

        # Get the HTTP status code (200: OK)
        status = req.status_code

        # Whether the request succeeded (True/False)
        is_ok = req.ok

        return Response(serializer.data)

์œ„ ์ฝ”๋“œ์—์„œ ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•  ๊ฒƒ์€ HTML ์†Œ์Šค๋ฅผ ์ด์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ html = req.text๋ฅผ ์ด์šฉํ•œ๋‹ค.

์œ„์— ๋‚˜์˜จ html, header, status, is_ok๋ฅผ ์ถœ๋ ฅํ•ด ๋ณธ ๊ฒฐ๊ณผ ์ œ๋Œ€๋กœ ์š”์ฒญํ•ด์„œ ๊ฐ€์ ธ์˜จ ๋ฌธ์ž์—ด์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

header: {    'Date': 'Mon, 07 Nov 2022 07:46:32 GMT'
        ,     'Content-Type': 'text/html;charset=UTF-8'
        ,     'Transfer-Encoding': 'chunked'
        ,     'Vary': 'Accept-Encoding'
        ,     'T_USERID': '800cdc655e5524bc59b66df34b1f68145caade80'
        ,     'Set-Cookie': 'REACTION_GUEST=eb3148932a1f9674bf875d5d26457e0543a9abf5'
        ,     'X-Content-Type-Options': 'nosniff'
        ,     'X-XSS-Protection': '1; mode=block'
        ,     'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate'
        ,     'Pragma': 'no-cache'
        ,     'Expires': '0'
        ,     'Strict-Transport-Security': 'max-age=31536000 ; includeSubDomains'
        ,     'Content-Encoding': 'gzip'}
status: 200
is_ok: True
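
The view above assumes the request always succeeds. As a minimal sketch of a more defensive fetch, the snippet below wraps the same requests.get call in a helper. The name fetch_html and the 10-second timeout are assumptions made for this illustration and are not part of the original code.

```python
from typing import Optional

import requests


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the response body as text, or None if the request fails.

    fetch_html is a hypothetical helper used only for illustration.
    """
    try:
        # timeout keeps the call from waiting indefinitely on a slow server
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class of the errors requests raises
        print(f"request failed: {exc}")
        return None
    return response.text
```

Calling fetch_html('https://laagom.tistory.com/') would return the page source on success and None on a connection error, a timeout, or an error status, so the caller only has to check for None before parsing.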

 

➤ Using BeautifulSoup

Requests is a great library, but to work with HTML as a Python object structure we need another library: BeautifulSoup. BeautifulSoup handles parsing, that is, converting HTML code into an object structure that Python can work with, and with it we can extract exactly the information we need.

pip3 install bs4

BeautifulSoup์„ ์ง์ ‘ ์ณ์„œ ์„ค์น˜ํ•˜๋Š” ๊ฒƒ๋„ ๊ฐ€๋Šฅํ•˜์ง€๋งŒ, bs4๋ผ๋Š” wrapper๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ์„ค์น˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๋” ์‰ฝ๊ณ  ์•ˆ์ „ํ•˜๋‹ค.

โžค ์ด์šฉ๋ฐฉ๋ฒ•

์œ„์—์„œ ์ด์šฉํ•œ ์†Œ์Šค์ฝ”๋“œ๋ฅผ ์ข€ ๋” ๋‹ค๋“ฌ์–ด ๋ณด์ž.

from rest_framework.views import APIView
from rest_framework.response import Response

from pocket.models import List
from pocket.serializers import ListSerializer

import requests
from bs4 import BeautifulSoup

class ParseAPIView(APIView):
    def get(self, request):
        queryset = List.objects.all()
        serializer = ListSerializer(queryset.first())

        # Send an HTTP GET request
        req = requests.get('https://laagom.tistory.com/')

        # Get the HTML source of the response
        html = req.text

        ## Using BeautifulSoup
        # Parse the HTML source into a Python object
        soup = BeautifulSoup(html, 'html.parser')

        return Response(serializer.data)

We can now find the information we want in this soup object.
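
To get a feel for what the soup object offers without touching the network, here is a small sketch that parses an inline HTML string. The sample markup is made up for this illustration; only the BeautifulSoup calls (attribute access, find, find_all) are the point.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Attribute-style access returns the first tag with that name
page_title = soup.title.text

# find() returns the first match, find_all() returns every match
heading = soup.find("h1", id="main").text
paragraphs = [p.text for p in soup.find_all("p")]

print(page_title)
print(heading)
print(paragraphs)
```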

 

BeautifulSoup์—์„œ๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜๋Š”๋ฐ, ์—ฌ๊ธฐ์„œ๋Š” select๋ฅผ ์ด์šฉํ•œ๋‹ค. select๋Š” css selector๋ฅผ ์ด์šฉํ•ด ์กฐ๊ฑด๊ณผ ์ผ์น˜ํ•˜๋Š” ๋ชจ๋“  ๊ฐ์ฒด๋“ค์„ List๋กœ ๋ฐ˜ํ™˜ํ•ด ์ค€๋‹ค. ์˜ˆ์‹œ๋กœ ๋‚ด๊ฐ€ ์ž‘์„ฑํ•˜๊ณ  ์žˆ๋Š” ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ์—์„œ a๋งํฌ๋กœ ๊ฑธ๋ ค์žˆ๋Š” ๊ฐ’์„ ๋ชจ๋‘ ๊ฐ€์ ธ์™€ ๋ณด๋„๋ก ํ•˜์ž

ํ˜„์žฌ ์œ„์˜ ์ด๋ฏธ์ง€๋Š” ๋‚ด๊ฐ€ ์ž‘์„ฑํ•˜๊ณ  ์žˆ๋Š” ๊ธฐ์ˆ ๋ธ”๋กœ๊ทธ ํ˜„ํ™ฉํ™”๋ฉด์ด๋‹ค. ์—ฌ๊ธฐ์„œ ๊ฐœ๋ฐœ์ž ๋„๊ตฌ๋ฅผ ์—ด์–ด ๋ธ”๋กœ๊ทธ์˜ ๋ชจ๋“  ์ œ๋ชฉ์„ ๊ฐ€์ ธ์™€ ๋ณด๋ ค๊ณ  ํ•œ๋‹ค.

BeautifulSoup์—์„œ๋Š” html์š”์†Œ๋ฟ ์•„๋‹ˆ๋ผ css๋„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— '.index-item > .article-info > a' ์— ์†ํ•˜๋Š” ๋ชจ๋“  ํ•ญ๋ชฉ์„ ๊ฐ€์ ธ์˜ค๋ฉด ๋ธ”๋กœ๊ทธ์˜ ์ œ๋ชฉ์„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

from rest_framework.views import APIView
from rest_framework.response import Response

from pocket.models import List
from pocket.serializers import ListSerializer

import requests
from bs4 import BeautifulSoup

class ParseAPIView(APIView):
    def get(self, request):
        queryset = List.objects.all()
        serializer = ListSerializer(queryset.first())

        # Send an HTTP GET request
        req = requests.get('https://laagom.tistory.com/')

        # Get the HTML source of the response
        html = req.text

        ## Using BeautifulSoup
        # Parse the HTML source into a Python object
        soup = BeautifulSoup(html, 'html.parser')

        my_titles = soup.select(
            '.index-item > .article-info > a'
        )

        print(my_titles)

        return Response(serializer.data)

my_titles๋ผ๋Š” ๋ณ€์ˆ˜์— soup๊ฐ์ฒด์˜ selectํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด css๊ฐ์ฒด์— ์ ‘๊ทผํ•ด์„œ ๋‚ด๊ฐ€ ์›ํ•˜๋Š” ํƒœ๊ทธ์™€ ๋ธ”๋กœ๊ทธ์˜ ์ œ๋ชฉ์„ ๊ฐ€์ง€๊ณ  ์™€๋ดค๋‹ค.

 

์œ„์˜ base shell์—์„œ ์ถœ๋ ฅํ•œ ๊ฒฐ๊ณผ๋Š” my_titles์— ๋‹ด๊ธด ๋ชฉ๋ก์„ ๊ทธ๋Œ€๋กœ ๋ณด์—ฌ์ค€ ๊ฑฐ๋ผ ๊น”๋”ํ•˜์ง€ ์•Š๊ณ  ํƒœ๊ทธ์™€ ์š”์†Œ๊ฐ€ ์„ž์—ฌ ์žˆ์–ด ๋ณด๊ธฐ ์–ด๋ ต๋‹ค.

import requests
from bs4 import BeautifulSoup

# Send an HTTP GET request
req = requests.get('https://laagom.tistory.com/')

# Get the HTML source of the response
html = req.text

## Using BeautifulSoup
# Parse the HTML source into a Python object
soup = BeautifulSoup(html, 'html.parser')

my_titles = soup.select(
    '.index-item > .article-info > a'
)
# my_titles is a list of Tag objects
for title in my_titles:
    # Text inside the tag
    print(f'title = {title.text}')
    # Read an attribute of the tag (e.g. the href attribute)
    print(f'url = {title.get("href")}')
As shown above, loop over the list and pull out just each title's text and URL to check them.
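
Printing is fine for checking, but the extracted values are usually easier to reuse as structured data. A minimal sketch, assuming a hypothetical helper named extract_posts and made-up sample markup: it collects each title and URL into a dict and resolves relative href values against the blog's base URL with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://laagom.tistory.com/"

# Made-up markup that imitates the class names used in the selector
sample_html = """
<div class="index-item">
  <div class="article-info"><a href="/1">  First post </a></div>
</div>
<div class="index-item">
  <div class="article-info"><a href="https://example.com/x">External</a></div>
</div>
<div class="index-item">
  <div class="article-info"><a>No link</a></div>
</div>
"""


def extract_posts(html, base_url):
    """Return a list of {'title', 'url'} dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for link in soup.select(".index-item > .article-info > a"):
        href = link.get("href")
        if href is None:
            # Skip anchors that have no href attribute
            continue
        posts.append({
            # get_text(strip=True) trims surrounding whitespace
            "title": link.get_text(strip=True),
            # urljoin turns a relative path into an absolute URL
            "url": urljoin(base_url, href),
        })
    return posts


posts = extract_posts(sample_html, BASE_URL)
for post in posts:
    print(post)
```

A list of dicts like this can be passed straight to a serializer, written to JSON, or stored in a database instead of being read off the console.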

 
