Having fun with Physics, Math, and Tech (UBC): August 2016

When scraping a Chinese website with Python (on Windows 10) and requests, the page is often encoded in "gb2312". However, if one does not set the encoding of the response, requests uses the encoding it guesses from the HTTP headers, which is usually not "gb2312" (for a text response with no declared charset it falls back to ISO-8859-1). After reading the website, we would like to save the Chinese content as Unicode (UTF-8) in a data file, so that it is easier to read next time. The following example is a solution to this issue.

import requests
from bs4 import BeautifulSoup

# Python 3: sys.setdefaultencoding() no longer exists, so the output
# encoding is passed explicitly to open() further down instead.

# scrape the website with requests
url_to_scrape = 'http://www.mitbbs.com'

readOut = requests.get(url_to_scrape)
# requests exposes apparent_encoding, an attribute holding the encoding detected
# from the response body; assign it to encoding so that .text is decoded correctly
readOut.encoding = readOut.apparent_encoding

# use BeautifulSoup to parse the decoded text (the "lxml" parser must be installed)
textSoup = BeautifulSoup(readOut.text, "lxml")

# printing content from textSoup now displays the Chinese characters correctly
print(textSoup.title.string)

# write the title as UTF-8 so the Chinese characters are stored correctly in the txt file
with open("fileToWrite.txt", "w", encoding="utf-8") as fileToWrite:
    fileToWrite.write("%s" % textSoup.title.string)
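
To see what the encoding mismatch looks like without touching the network, here is a small offline sketch. It assumes Python 3; the sample string and the temporary file name are made up for illustration and are not taken from the website. It encodes some Chinese text as "gb2312" (as such a server would send it), decodes it once with the wrong codec and once with the right one, and then saves the result as UTF-8.

```python
import os
import tempfile

# Sample text is an assumption chosen for illustration.
sample = "中文网站"

# Bytes as a gb2312-encoded server would send them (2 bytes per character).
raw = sample.encode("gb2312")

# Decoding with the wrong codec does not raise; it silently yields garbled text.
garbled = raw.decode("iso-8859-1")

# Decoding with the matching codec recovers the original characters.
decoded = raw.decode("gb2312")

# Save the decoded text as UTF-8 and read it back with the same encoding.
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(decoded)
with open(path, encoding="utf-8") as f:
    restored = f.read()

print(garbled == sample, decoded == sample, restored == sample)
```

The wrong codec never raises an error, which is why the problem only shows up as unreadable characters on screen or in the saved file. Setting the response encoding before reading the text, and naming the encoding when opening the file, keeps both ends consistent.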


Tuesday, 30 August 2016

Python to scrape Chinese websites: gb2312 decoding issue solved