[Naver] A Naver News Web Crawler Without the Open API

"""Naver news article web crawler module."""

import urllib.request

from bs4 import BeautifulSoup

# Output file name
OUTPUT_FILE_NAME = 'output.txt'

# URL to scrape (adjacent string literals inside parentheses are joined)
URL = ('http://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=103&oid=055'
       '&aid=0000445667')

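# Crawling function
def get_text(url):
    """Return the text nodes found in the article body of the page at url."""
    source_code_from_url = urllib.request.urlopen(url)
    soup = BeautifulSoup(source_code_from_url, 'html.parser', from_encoding='utf-8')
    text = ''
    # The article body is looked up as <div id="articleBodyContents">.
    for item in soup.find_all('div', id='articleBodyContents'):
        # find_all(text=True) returns a list of text nodes; str() keeps the list formatting.
        text = text + str(item.find_all(text=True))
    return text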

# Main function
def main():
    # The with block closes the file even if the request fails.
    with open(OUTPUT_FILE_NAME, 'w') as open_output_file:
        result_text = get_text(URL)
        open_output_file.write(result_text)


if __name__ == '__main__':
    main()
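
The crawler above only works while the live page still wraps the article in `div#articleBodyContents`, so it is hard to check offline. The following is a minimal sketch, not the original author's code: `SAMPLE_HTML` is a made-up snippet and `extract_article_text` is a hypothetical helper. It performs the same BeautifulSoup lookup, but joins the text nodes with `get_text` instead of stringifying the list returned by `find_all(text=True)`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded article page (assumption).
SAMPLE_HTML = """
<html><body>
<div id="articleBodyContents">
  첫 번째 문장입니다.<br>
  두 번째 문장입니다.
</div>
</body></html>
"""


def extract_article_text(html):
    """Hypothetical helper: return the article body as one plain string."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', id='articleBodyContents')
    if body is None:
        # The div is missing, e.g. because the page layout changed.
        return ''
    # Strip each text node and join the pieces with single spaces.
    return body.get_text(separator=' ', strip=True)


if __name__ == '__main__':
    print(extract_article_text(SAMPLE_HTML))
```

Returning an empty string when the div is missing is one possible choice; raising an error would make a layout change easier to notice.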

"""Text cleaning module.

Removes all English letters and special characters.
"""

import re

# Input and output file names
INPUT_FILE_NAME = 'output.txt'
OUTPUT_FILE_NAME = 'output_cleand.txt'
 
 
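# Cleaning function
def clean_text(text):
    # Remove English letters.
    cleaned_text = re.sub('[a-zA-Z]', '', text)
    # Remove special characters; the raw string avoids invalid-escape warnings.
    cleaned_text = re.sub(r'[{}\[\]/?.,;:|)*~`!^\-_+<>@#$%&\\=(\'"]',
                          '', cleaned_text)
    return cleaned_text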
    
 
# Main function
def main():
    # The with block closes both files even if cleaning fails.
    with open(INPUT_FILE_NAME, 'r') as read_file, \
            open(OUTPUT_FILE_NAME, 'w') as write_file:
        text = read_file.read()
        text = clean_text(text)
        write_file.write(text)


if __name__ == "__main__":
    main()
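
To see what the cleaning step does without any input file, here is a small sketch under stated assumptions: the sample sentence is invented, and `clean_and_collapse` is a hypothetical helper that is not part of the original module. It applies the same two substitutions as `clean_text`, then collapses the whitespace runs that are left where characters were removed.

```python
import re

# Same character classes as the cleaning module above.
LETTERS = re.compile('[a-zA-Z]')
SPECIALS = re.compile(r'[{}\[\]/?.,;:|)*~`!^\-_+<>@#$%&\\=(\'"]')


def clean_and_collapse(text):
    """Hypothetical helper: strip letters and symbols, then tidy whitespace."""
    text = LETTERS.sub('', text)
    text = SPECIALS.sub('', text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return ' '.join(text.split())


if __name__ == '__main__':
    sample = 'Naver 뉴스: 오늘의 날씨는 맑음! (기온 25도)'  # invented sample text
    print(clean_and_collapse(sample))  # 뉴스 오늘의 날씨는 맑음 기온 25도
```

Digits and Korean characters pass through untouched, because neither pattern matches them.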
