[python] Crawling Board Text from a Classical Literature Site

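Anyway, there was this classical literature site, and it has bulletin boards on it. I brought along an example of how to crawl the text out of a board like that. It's nothing special and not a big piece of work, so think of it as a simple test. The content isn't hard, but there is one extra point worth covering: you can run into trouble when the Korean text comes out garbled.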


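A lot of you use BeautifulSoup when you crawl, right? Korean text sometimes comes out garbled, so think of this post as my notes on what to do when that happens. Garbled Korean usually comes down to two things: character encoding and character decoding. That's where the problem mostly shows up. So how should you handle encoding and decoding? If you write your code with just that in mind, most ordinary pages will crawl without any real trouble. I put together the code below. Take whatever parts you need; I hope it helps when you extract or crawl text from a board-style page on some site.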
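Before the full scripts, here is the decoding problem on its own, as a small sketch with no network access. The sample string is made up for illustration; on a real page the bytes would come from `raw.content`. As far as I understand it, one common cause of garbling is that the server doesn't declare a charset, so the client falls back to ISO-8859-1 or UTF-8 while the page is actually EUC-KR.

```python
# Assumed sample text; on the real site this would be the page HTML.
sample = "고전 문학 게시판"

# What an EUC-KR server sends over the wire is bytes, not text.
raw_bytes = sample.encode("euc-kr")

# Wrong codec: ISO-8859-1 never raises, it just yields unreadable characters.
as_latin1 = raw_bytes.decode("iso-8859-1")

# Wrong codec: UTF-8 with 'replace' swaps the invalid bytes for U+FFFD.
as_utf8 = raw_bytes.decode("utf-8", "replace")

# Right codec: the original text comes back unchanged.
as_euckr = raw_bytes.decode("euc-kr", "replace")

print(as_latin1 == sample)
print(as_utf8 == sample)
print(as_euckr == sample)
```

Only the last comparison is true. The `'replace'` argument means a stray invalid byte becomes a placeholder character instead of raising `UnicodeDecodeError` partway through a crawl.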

# Modern literature board (pages 1 to 70)
import requests
from bs4 import BeautifulSoup


# utf-8-sig writes a BOM so Excel opens the Korean text correctly
f = open('C:/Users/user/raw/literature/modern_literature.csv', 'w', encoding='utf-8-sig')
f.write("category,subject,wr_add\n")  # CSV header row

for page_num in range(1, 71):
    url = 'http://yoursite.kr/php/board.php?board=modern&no=&command=list&page={}'.format(page_num)

    raw = requests.get(url)
    # If Korean text comes out garbled in BeautifulSoup, decode the raw bytes
    # yourself with raw.content.decode('euc-kr', 'replace') before parsing.
    html = BeautifulSoup(raw.content.decode('euc-kr', 'replace'), 'lxml')

    # Alternative selectors, left commented out:
    #list_category = html.find_all("td", {"class" : "list_category"})  # category
    #list_subject = html.find_all("td", {"class" : "list_subject"})  # post title
    #table = html.find("table", {"id" : "mainIndexTable"})
    #tds = table.find_all('td')
    #results = html.select('#mainIndexTable > tbody')

    trs = html.find_all("tr", {"height" : "24"})  # rows that hold the board posts

    for tr in trs:
        list_category = tr.find("td", {"class" : "list_category"}).get_text().strip()  # category
        list_subject = tr.find("td", {"class" : "list_subject"}).get_text().strip()  # post title
        list_subject = list_subject.replace(',', '')  # drop commas so the CSV columns stay aligned
        list_wr_add = tr.find_all("td", {"class" : "list_wr_add"})  # keyword cells
        list_wr_add_2 = list_wr_add[1].get_text().strip()  # only the second cell is needed
        if list_wr_add_2 == '':
            list_wr_add_2 = '주제어 없음'  # placeholder meaning "no keyword"
        else:
            list_wr_add_2 = list_wr_add_2.replace(",", " ")
        f.write(list_category + ',' + list_subject + ',' + list_wr_add_2 + '\n')  # write one row to the CSV (opens in Excel)

f.close()
print('Done')

# Classic literature board (pages 1 to 20)
import requests
from bs4 import BeautifulSoup

# utf-8-sig writes a BOM so Excel opens the Korean text correctly
f = open('C:/Users/user/raw/literature/classic_literature.csv', 'w', encoding='utf-8-sig')
f.write("category,subject,wr_add\n")  # CSV header row

for page_num in range(1, 21):
    url = 'http://yoursite.kr/php/board.php?board=classic&no=&command=list&page={}'.format(page_num)

    raw = requests.get(url)
    # If Korean text comes out garbled in BeautifulSoup, decode the raw bytes
    # yourself with raw.content.decode('euc-kr', 'replace') before parsing.
    html = BeautifulSoup(raw.content.decode('euc-kr', 'replace'), 'lxml')

    # Alternative selectors, left commented out:
    #list_category = html.find_all("td", {"class" : "list_category"})  # category
    #list_subject = html.find_all("td", {"class" : "list_subject"})  # post title
    #table = html.find("table", {"id" : "mainIndexTable"})
    #tds = table.find_all('td')
    #results = html.select('#mainIndexTable > tbody')

    trs = html.find_all("tr", {"height" : "24"})  # rows that hold the board posts

    for tr in trs:
        list_category = tr.find("td", {"class" : "list_category"}).get_text().strip()  # category
        list_subject = tr.find("td", {"class" : "list_subject"}).get_text().strip()  # post title
        list_subject = list_subject.replace(',', '')  # drop commas so the CSV columns stay aligned
        list_wr_add = tr.find_all("td", {"class" : "list_wr_add"})  # keyword cells
        list_wr_add_2 = list_wr_add[1].get_text().strip()  # only the second cell is needed
        if list_wr_add_2 == '':
            list_wr_add_2 = '주제어 없음'  # placeholder meaning "no keyword"
        else:
            list_wr_add_2 = list_wr_add_2.replace(",", " ")
        f.write(list_category + ',' + list_subject + ',' + list_wr_add_2 + '\n')  # write one row to the CSV (opens in Excel)

f.close()
print('Done')
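The scripts above strip commas out of the title and keywords so the hand-built CSV lines don't break. If you'd rather keep the commas, Python's standard `csv` module can quote the fields for you. What follows is only a sketch of that alternative, not what the scripts above do: the `write_rows` helper and the sample rows are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical rows standing in for (category, subject, wr_add) values
rows = [
    ("poetry", "Title, with a comma", "keyword one, keyword two"),
    ("novel", "Plain title", ""),
]

def write_rows(out, rows):
    """Write the header and rows; csv.writer quotes any field containing a comma."""
    writer = csv.writer(out)
    writer.writerow(["category", "subject", "wr_add"])
    for category, subject, wr_add in rows:
        # Same placeholder as the scripts above when the keyword cell is empty
        writer.writerow([category, subject, wr_add or "주제어 없음"])

buffer = io.StringIO()  # stands in for the real CSV file
write_rows(buffer, rows)
print(buffer.getvalue())
```

With a real file, you would pass in something like `open(path, 'w', newline='', encoding='utf-8-sig')` instead of the buffer; `newline=''` is what the `csv` documentation recommends so that line endings aren't doubled on Windows.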
