Crawling Naver News with Python (crawler)

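Hello, this is dev-woo, a developer with a 100-million-won salary.

Today we will use Python code to crawl the articles on Naver News and save them to a CSV file.

As you probably know, Naver periodically changes the attributes of its HTML tags to block crawling, but this code works as of March 2023. If it stops working, please leave a comment :)

The crawl targets are the title, body, publication date, and comments of Naver News articles.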

If you are also curious about crawling blogs, see the post linked below.

Crawling Naver blogs with Python (crawler)

developer-woo.tistory.com

How the news crawler works

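In my case, I use the Naver API to fetch the URLs of the news articles that match my keyword, then parse the HTML with BeautifulSoup and write the result to a file.

So the overall flow is as follows:

1. Use the Naver API to fetch the news URLs that match the search keyword.

2. Request each URL to get its HTML.

3. Extract title, links, contents, pubdate, and comments (title, original link, body, publication date, comments) from the HTML.

4. Save the result to a CSV file.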

Issuing the key needed to fetch news URLs for a search keyword with the Naver API

First, you need to get a Naver API key in order to use the Naver API.

https://developers.naver.com/apps/#/list

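Register an application on that page to have the key issued.

You can enter any name you like as the application name.

Under the APIs to use, select "검색" (Search).

Under the environment settings, select "WEB 설정" (WEB settings), enter "http://127.0.0.1" in the web service URL field, and register.

That completes the Naver API key issuance.

Very simple, right? Step 1 is done.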

Python code that fetches news URLs for a search keyword with the Naver API

Below is the Python code.

I tested it on Python 3.6 or later, so I recommend using 3.6 or later.

Installing Python is not covered here, but miniconda makes it easy.

from bs4 import BeautifulSoup
import requests
import re
import time
import os
import sys
import urllib.parse
import urllib.request
import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys

# WebDriver settings
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Selenium 4 style: pass the driver path through Service and actually apply the options
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.implicitly_wait(3)


# Lists that collect the article URLs and publication dates returned by the API
naver_urls = []
pub_dates = []

# Enter your Naver API key
client_id = ''
client_secret = ''

# Read the search keyword
keyword = input("Enter the keyword to search for: ")
encText = urllib.parse.quote(keyword)

# Read the last page to crawl
end = input("\nEnter the last page to crawl (default: 1, max: 100): ")
if end == "":
    end = 1
else:
    end = int(end)
print("\nCrawling pages 1 ~", end)

# Read how many results to fetch per request
display = input("\nEnter the number of results to fetch per request (default: 10, max: 100): ")
if display == "":
    display = 10
else:
    display = int(display)
print("\nResults per request:", display)


for page in range(end):
    # start is the 1-based index of the first result, so advance by display on each page
    start = page * display + 1
    url = "https://openapi.naver.com/v1/search/news?query=" + encText + "&start=" + str(start) + "&display=" + str(display)  # JSON result
    request = urllib.request.Request(url)
    request.add_header("X-Naver-Client-Id", client_id)
    request.add_header("X-Naver-Client-Secret", client_secret)
    response = urllib.request.urlopen(request)
    rescode = response.getcode()
    if rescode == 200:
        response_body = response.read()
        data = json.loads(response_body.decode('utf-8'))['items']
        for row in data:
            # Keep only articles hosted on Naver News
            if 'news.naver' in row['link']:
                naver_urls.append(row['link'])
                pub_dates.append(row['pubDate'])
        time.sleep(3)
    else:
        # rescode is an int, so convert it before concatenating
        print("Error Code:" + str(rescode))

Put the keys you were issued into client_id and client_secret. (Very important!)
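If you would rather not paste the keys into the source file, one option is to read them from environment variables. This is only a minimal sketch: the variable names NAVER_CLIENT_ID and NAVER_CLIENT_SECRET and the helper load_naver_keys are made up for this example and are not part of the Naver API.

```python
import os


def load_naver_keys(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    NAVER_CLIENT_ID and NAVER_CLIENT_SECRET are hypothetical names chosen
    for this sketch; use whatever names fit your setup.
    """
    client_id = env.get("NAVER_CLIENT_ID", "")
    client_secret = env.get("NAVER_CLIENT_SECRET", "")
    if not client_id or not client_secret:
        # Fail early instead of sending requests with empty credentials
        raise RuntimeError("Set NAVER_CLIENT_ID and NAVER_CLIENT_SECRET first")
    return client_id, client_secret
```

You would then write client_id, client_secret = load_naver_keys() in place of the two empty strings.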

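As the source shows, the keyword is read with input(), so after you run the script with Python you type the news search keyword you want, and that topic is crawled.

end, the position where crawling stops, is the value that ends the loop, and display is the number of items fetched per request.

So if end is 2 and display is 5, then 2 * 5 = 10, meaning up to 10 news results are requested in total. Only the links hosted on news.naver are kept, so the number of crawled articles can be smaller.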
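The arithmetic above can be sketched as a small helper. It assumes, as I understand the Naver search API, that start is the 1-based index of the first result rather than a page number. The names page_start and build_search_url are made up for this illustration.

```python
from urllib.parse import urlencode

API_URL = "https://openapi.naver.com/v1/search/news"


def page_start(page, display):
    """Return the 1-based index of the first result on a 0-based page."""
    return page * display + 1


def build_search_url(keyword, page, display):
    """Build the request URL for one page; urlencode also escapes the keyword."""
    params = {"query": keyword, "start": page_start(page, display), "display": display}
    return API_URL + "?" + urlencode(params)


# end = 2, display = 5: two requests covering results 1-5 and 6-10
starts = [page_start(page, 5) for page in range(2)]
```

With end = 2 and display = 5 the two requests start at results 1 and 6, so the pages do not overlap.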

At this stage the Naver URL and pubDate (publication date) of each item are also parsed and appended to lists.
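The post stores pubDate as the raw string returned by the API. If you want a real datetime for sorting or filtering, something like the following may work, assuming the value is an RFC 2822 style date string. The sample value below is made up for illustration and was not taken from an actual API response.

```python
from email.utils import parsedate_to_datetime


def parse_pub_date(pub_date):
    """Convert an RFC 2822 style date string into a timezone-aware datetime."""
    return parsedate_to_datetime(pub_date)


# Made-up sample value, assumed to match the format of the pubDate field
sample = "Mon, 06 Mar 2023 09:30:00 +0900"
parsed = parse_pub_date(sample)
iso = parsed.isoformat()
```

Storing the ISO string in the CSV instead of the raw value makes the date column sortable as plain text.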

Requesting each URL to get its HTML

### Fetch the Naver article body and title ###

# Prevent ConnectionError
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/98.0.4758.102"}

titles = []
contents = []
comments_texts = []
for i in naver_urls:
    original_html = requests.get(i, headers=headers)
    html = BeautifulSoup(original_html.text, "html.parser")

This is the step that fetches the HTML and prepares it for parsing.

The response text is turned into a parsed HTML tree with BeautifulSoup.

Extracting title, contents, and pubdate from the fetched HTML

    # Get the news title
    title = html.select("div#ct > div.media_end_head.go_trans > div.media_end_head_title > h2")
    # Join the matched elements into one string (str() on the whole list would leave [ ] in the text)
    title = ''.join(str(t) for t in title)
    # Remove HTML tags
    pattern1 = '<[^>]*>'
    title = re.sub(pattern=pattern1, repl='', string=title).strip()
    titles.append(title)

    # Get the news body

    content = html.select("div#dic_area")

    # Keep only the article text
    # Join the matched elements into one string
    content = ''.join(str(c) for c in content)

    # Remove HTML tags and tidy up the text
    content = re.sub(pattern=pattern1, repl='', string=content)
    pattern2 = """\n\n\n\n\n// flash 오류를 우회하기 위한 함수 추가\nfunction _flash_removeCallback() {}"""
    content = content.replace(pattern2, '').strip()
    contents.append(content)

At this step only the title and content are parsed and stored.

During preprocessing, unnecessary HTML code and characters such as \n are removed so that only the clean text is extracted.
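As a rough illustration of this kind of cleanup, here is a standalone sketch that uses only the standard library. The helper name clean_text and the sample string are made up for this example. It also collapses runs of whitespace and decodes HTML entities, which goes slightly beyond what the post's code does.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]*>")   # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including \n and non-breaking spaces


def clean_text(raw_html):
    """Strip tags, decode entities, and collapse whitespace into single spaces."""
    no_tags = TAG_RE.sub(" ", raw_html)
    decoded = unescape(no_tags)
    return SPACE_RE.sub(" ", decoded).strip()


# Made-up sample input
sample = "<div id='dic_area'>\n\n<b>Hello</b>&nbsp;world &amp; more\n</div>"
cleaned = clean_text(sample)
```

A regex is good enough for flattening text like this, but it is not a general HTML parser, so the element selection itself is better left to BeautifulSoup.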

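Getting comments with selenium

Naver News blocks the original comment content from being retrieved with a plain HTML request,

so the selenium library is used: the driver clicks the comment link like a real person would, opens the comment view, and the comments are read from the rendered page.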

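    # Get comments
    driver.get(i)
    time.sleep(1)  # wait time can be adjusted
    # Look for the comment link; find_elements returns an empty list instead of raising
    links = driver.find_elements(By.CSS_SELECTOR, 'a._COMMENT_COUNT_VIEW')

    # Click the link to open the comment view, if the article has one
    if links:
        links[0].click()
        # wait time can be adjusted
        time.sleep(3)

    # Read the Naver News comments from the rendered page
    comment_html = driver.page_source
    c_soup = BeautifulSoup(comment_html, "html.parser")

    comments = c_soup.select('span.u_cbox_contents')
    comments_text = ', '.join([comment.text.strip() for comment in comments])
    comments_texts.append(comments_text)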

The fetched comments are stripped, joined into one string per article, and stored separately in the comments_texts list.

Saving to a CSV file

# Collect everything into a DataFrame
import pandas as pd

news_df = pd.DataFrame({'title': titles, 'link': naver_urls, 'content': contents, 'comments': comments_texts,'date': pub_dates})

news_df.to_csv('news.csv', index=False, encoding='utf-8-sig')

The data is put into a DataFrame and saved as a CSV file.

I passed "news.csv" as the first parameter of .to_csv(), so the file is saved under that name.
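The encoding='utf-8-sig' argument writes a byte order mark at the start of the file, which helps spreadsheet programs such as Excel recognize the Korean text as UTF-8. The sketch below shows the effect with the standard csv module instead of pandas, so it is an illustration of the encoding rather than the code this post uses. The helper save_rows, the file name, and the sample row are made up.

```python
import csv
import os
import tempfile


def save_rows(path, rows, fieldnames):
    """Write dict rows to a CSV file with a UTF-8 byte order mark."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample row
rows = [{"title": "예시 제목", "link": "https://example.com/1"}]
path = os.path.join(tempfile.mkdtemp(), "news_sample.csv")
save_rows(path, rows, ["title", "link"])

# The file starts with the three BOM bytes
with open(path, "rb") as f:
    first_bytes = f.read(3)

# Reading with the same encoding strips the BOM again
with open(path, newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))
```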

Works well, right? For convenience, here is the complete code as well.

from bs4 import BeautifulSoup
import requests
import re
import time
import os
import sys
import urllib.parse
import urllib.request
import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys

# WebDriver settings
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Selenium 4 style: pass the driver path through Service and actually apply the options
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.implicitly_wait(3)


# Lists that collect the article URLs and publication dates returned by the API
naver_urls = []
pub_dates = []

# Enter your Naver API key
client_id = ''
client_secret = ''

# Read the search keyword
keyword = input("Enter the keyword to search for: ")
encText = urllib.parse.quote(keyword)

# Read the last page to crawl
end = input("\nEnter the last page to crawl (default: 1, max: 100): ")
if end == "":
    end = 1
else:
    end = int(end)
print("\nCrawling pages 1 ~", end)

# Read how many results to fetch per request
display = input("\nEnter the number of results to fetch per request (default: 10, max: 100): ")
if display == "":
    display = 10
else:
    display = int(display)
print("\nResults per request:", display)


for page in range(end):
    # start is the 1-based index of the first result, so advance by display on each page
    start = page * display + 1
    url = "https://openapi.naver.com/v1/search/news?query=" + encText + "&start=" + str(start) + "&display=" + str(display)  # JSON result
    request = urllib.request.Request(url)
    request.add_header("X-Naver-Client-Id", client_id)
    request.add_header("X-Naver-Client-Secret", client_secret)
    response = urllib.request.urlopen(request)
    rescode = response.getcode()
    if rescode == 200:
        response_body = response.read()
        data = json.loads(response_body.decode('utf-8'))['items']
        for row in data:
            # Keep only articles hosted on Naver News
            if 'news.naver' in row['link']:
                naver_urls.append(row['link'])
                pub_dates.append(row['pubDate'])
        time.sleep(3)
    else:
        # rescode is an int, so convert it before concatenating
        print("Error Code:" + str(rescode))


### Fetch the Naver article body and title ###

# Prevent ConnectionError
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/98.0.4758.102"}

titles = []
contents = []
comments_texts = []
for i in naver_urls:
    original_html = requests.get(i, headers=headers)
    html = BeautifulSoup(original_html.text, "html.parser")
    # Get the news title
    title = html.select("div#ct > div.media_end_head.go_trans > div.media_end_head_title > h2")
    # Join the matched elements into one string (str() on the whole list would leave [ ] in the text)
    title = ''.join(str(t) for t in title)
    # Remove HTML tags
    pattern1 = '<[^>]*>'
    title = re.sub(pattern=pattern1, repl='', string=title).strip()
    titles.append(title)

    # Get the news body

    content = html.select("div#dic_area")

    # Keep only the article text
    # Join the matched elements into one string
    content = ''.join(str(c) for c in content)

    # Remove HTML tags and tidy up the text
    content = re.sub(pattern=pattern1, repl='', string=content)
    pattern2 = """\n\n\n\n\n// flash 오류를 우회하기 위한 함수 추가\nfunction _flash_removeCallback() {}"""
    content = content.replace(pattern2, '').strip()
    contents.append(content)

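    # Get comments
    driver.get(i)
    time.sleep(1)  # wait time can be adjusted
    # Look for the comment link; find_elements returns an empty list instead of raising
    links = driver.find_elements(By.CSS_SELECTOR, 'a._COMMENT_COUNT_VIEW')

    # Click the link to open the comment view, if the article has one
    if links:
        links[0].click()
        # wait time can be adjusted
        time.sleep(3)

    # Read the Naver News comments from the rendered page
    comment_html = driver.page_source
    c_soup = BeautifulSoup(comment_html, "html.parser")

    comments = c_soup.select('span.u_cbox_contents')
    comments_text = ', '.join([comment.text.strip() for comment in comments])
    comments_texts.append(comments_text)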


# Collect everything into a DataFrame
import pandas as pd

news_df = pd.DataFrame({'title': titles, 'link': naver_urls, 'content': contents, 'comments': comments_texts,'date': pub_dates})

news_df.to_csv('news.csv', index=False, encoding='utf-8-sig')

That's all.

As you surely know, you run it with

python "filename" :)

ex:) python news.py

If anything does not work, leave a comment and I will help.

Have a nice day~
