[python] Filtering special characters and other words with regular expressions

Filtering special characters and other words with regular expressions in Python

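These are personal notes on how to clean up special characters and similar noise with regular expressions when coding in Python. I process a lot of text, and that text tends to be full of special characters.

For example, the text may contain Hanja, or I may want to remove only English letters and digits, or only certain special characters. I also often needed to extract keywords, remove duplicates, or strip blanks and whitespace, so I collected these snippets to show how to handle those cases.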

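It is tiring work, to be honest. I kept wondering how to process this and that with regular expressions and how to handle the text, so I decided to keep these notes for personal use, mainly to check how to preprocess one list into another. The useful parts are organized in my own arbitrary way, so please treat them as reference only.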

import re
from hanspell import spell_checker

# Read the file as a list of lines; the loop below cleans one line at a time
with open(r"C:\Users\user\Desktop\text.txt", 'r', encoding='utf-8-sig') as f:
    content_texts = f.readlines()

content_text_renews = []

special_char = r'「」≪≫~〉[]〈\/*?"<>|-․‘’–'  # special characters to blank out

for content_text in content_texts:
    # Replace every special character with a space
    for c in special_char:
        content_text = content_text.replace(c, ' ')
    # Remove fixed markers
    for marker in ("(가)", "(나)", "(다)", "p."):
        content_text = content_text.replace(marker, "")
    # Add each line to the new list exactly once, after all replacements
    content_text_renews.append(content_text.strip())


content_text_renew_spells = []
for content_text_renew in content_text_renews:
    # Inside [...] a "|" is a literal character, not "or", so it is left out of the patterns
    content_text_renew = re.sub('[A-Za-z]+', '', content_text_renew)  # remove English letters
    content_text_renew = re.sub('[0-9]+', '', content_text_renew)  # remove digits
    content_text_renew = re.sub('[ㄱ-ㅎㅏ-ㅣ]+', '', content_text_renew)  # remove stray Hangul jamo
    content_text_renew = content_text_renew.strip()  # trim surrounding whitespace
    content_text_renew = content_text_renew.replace("(수능)", "")  # filter
    # Remove leftover punctuation one character at a time
    for ch in ",’‘-<()>/[]":
        content_text_renew = content_text_renew.replace(ch, "")
    content_text_renew_spells.append(content_text_renew)
    
    
checked_list = []
for i in content_text_renew_spells:
    result = spell_checker.check(i)  # spell-check each line with hanspell
    dict_result = result.as_dict()
    checked = dict_result['checked']  # the corrected text
    #checked = checked.replace("  ", "")
    #checked = checked.replace("\ue64a", "")
    checked_list.append(checked)
print("done")
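The chain of replace calls above can also be expressed as a handful of compiled patterns. The following is only a minimal sketch, not the code used in this post: the helper name clean_line and the sample string are made up for illustration, and the Hanja range \u4e00-\u9fff (the basic CJK Unified Ideographs block) is an assumption about which characters should count as Hanja.

```python
import re

# Hypothetical helper; pattern names and sample text are illustrative only.
SPECIAL = re.compile(r'[「」≪≫~〉〈\[\]\\/*?"<>|\-․‘’–(),]')
LATIN_DIGITS = re.compile(r'[A-Za-z0-9]+')
JAMO = re.compile(r'[ㄱ-ㅎㅏ-ㅣ]+')
HANJA = re.compile(r'[\u4e00-\u9fff]+')  # assumption: basic CJK block only
SPACES = re.compile(r'\s+')

def clean_line(text):
    """Blank out special characters, drop letters/digits/jamo/Hanja, tidy spaces."""
    text = SPECIAL.sub(' ', text)
    for pattern in (LATIN_DIGITS, JAMO, HANJA):
        text = pattern.sub('', text)
    return SPACES.sub(' ', text).strip()

sample = "「소설」 abc 123 ㅋㅋ 文學 작품 <해설>"
print(clean_line(sample))  # 소설 작품 해설
```

Keeping each category in its own pattern makes it easy to switch one off, for example when English words should stay in the text.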

Removing duplicates from a list in Python

##### Removing duplicates from a list

#my_list = ['A', 'B', 'C', 'D', 'B', 'D', 'E']
#new_list = []
#for v in my_list:
#    if v not in new_list:
#        new_list.append(v)
#print(new_list)
#Output: ['A', 'B', 'C', 'D', 'E']

new_list = []

for v in checked_list:
    if v not in new_list:  # keep only the first occurrence
        new_list.append(v)
print(new_list)
print(len(new_list))
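For longer lists, the same order-preserving deduplication can be done without the repeated "not in" scan. This is a small sketch using the sample values from the commented example above; the variable names items and unique_items are just placeholders standing in for checked_list and new_list.

```python
# Sample data from the commented example; stands in for checked_list
items = ['A', 'B', 'C', 'D', 'B', 'D', 'E']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first occurrence of each value
unique_items = list(dict.fromkeys(items))
print(unique_items)  # ['A', 'B', 'C', 'D', 'E']
```

The loop version compares each item against everything collected so far, so this form should scale better when the list gets large.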

Filtering specific special characters and whitespace

new_list_2 = []

for i in new_list:
    i = i.replace("  ", " ").lstrip()
    i = i.replace("– ", "").rstrip()
    i = i.replace("  ", " ").rstrip()
    i = i.replace("· ", " ").rstrip()
    i = i.replace("  ", " ").rstrip()

    if i.strip() == '':  # skip entries that are empty or only whitespace
        continue
    new_list_2.append(i)
    print(i)
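One thing to watch with replace("  ", " ") is that a single pass only halves a run of spaces, so four spaces still leave two behind. A regular expression can collapse runs of any length in one step. The sketch below assumes only whitespace needs handling (it does not remove "– " or "· "), and normalize_spaces and the sample list are hypothetical names for illustration.

```python
import re

def normalize_spaces(items):
    """Collapse whitespace runs to one space and drop entries that end up empty."""
    result = []
    for item in items:
        item = re.sub(r'\s+', ' ', item).strip()
        if item:  # empty string means the entry was blank
            result.append(item)
    return result

sample = ["  봄   봄 ", "   ", "", "운수  좋은 날"]
print(normalize_spaces(sample))  # ['봄 봄', '운수 좋은 날']
```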

Removing duplicates from the list once more in Python

new_list_final = []

for v in new_list_2 :
    if v not in new_list_final :
        new_list_final.append(v)
#print(new_list_final)
print(len(new_list_final))  # number of unique entries

Finally removing specific words and viewing the result in Python

for i in new_list_final :
    i = i.replace("수능 ", "").lstrip()
    i = i.replace("문학 ", "").lstrip()
    i = i.replace("현대", "").lstrip()
    i = i.replace("고전", "").lstrip()
    i = i.replace("소설", "").lstrip()
    i = i.replace("산문", "").lstrip()
    i = i.replace("시나리오", "").rstrip()
    i = i.replace("숙특 고시", "").lstrip()
    i = i.replace("수특 고시", "").lstrip()
    i = i.replace("숙특 ", "").lstrip()
    i = i.replace("수특 ", "").lstrip()
    i = i.replace("개념 ", "").lstrip()
    i = i.replace("작자 미상 ", "").lstrip()

    print(i)
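The loop above only prints each cleaned string. If the results need to be reused, the word list can be kept in one place and applied by a small function. This is a rough sketch under a few assumptions: REMOVE_WORDS is a hypothetical subset of the words above, strip_words is a made-up name, and the sample strings are invented for illustration.

```python
# Hypothetical subset of the words removed above. Longer phrases come first
# so that "수특 고시" is removed before the shorter "수특 ".
REMOVE_WORDS = ["수특 고시", "작자 미상 ", "수능 ", "문학 ", "수특 ",
                "현대", "고전", "소설"]

def strip_words(text, words=REMOVE_WORDS):
    """Remove each listed word from text, then trim surrounding whitespace."""
    for word in words:
        text = text.replace(word, "")
    return text.strip()

samples = ["수능 문학 현대소설 운수 좋은 날", "수특 고시 작자 미상 춘향전"]
cleaned = [strip_words(s) for s in samples]
print(cleaned)  # ['운수 좋은 날', '춘향전']
```

Plain replace removes the word wherever it appears, including inside a title, so the list is worth checking against the actual data before relying on it.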
