Python Problem Log 1: Garbled Text When Scraping Web Pages with BeautifulSoup

2021-01-26 · Development · Python

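Problem description

While trying to scrape a novel, I found that the chapter text came through correctly, but the chapter list came out garbled. After a careful search I finally solved it, so I am recording the fix here.

Source code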

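import requests
from bs4 import BeautifulSoup

# Runs inside a method of the scraper class; self.target is the page URL
req = requests.get(url=self.target)
bf = BeautifulSoup(req.text, 'html.parser')
div = bf.find_all('div', class_='listmain')
a_bf = BeautifulSoup(str(div[0]), "html.parser")
a = a_bf.find_all('a')
# The attribute was cut off in the original post; get_text() is assumed here
print(a[0].get_text())  # the link text comes out garbled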

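Solution

First determine the encoding of the page. You can check it in the browser's developer tools by entering the following in the console: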

document.charset

The result shows that this page uses GBK encoding.
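The same check can be approximated from Python by looking for a charset declaration in the page's raw bytes. The sketch below is a minimal illustration using only the standard library. `sniff_meta_charset` and the sample HTML are hypothetical, not part of requests or BeautifulSoup, and a real page may declare its encoding only in the HTTP headers, or not at all.

```python
import re
from typing import Optional

# Hypothetical helper: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, limit: int = 2048) -> Optional[str]:
    """Return the charset named in a <meta> tag near the top of the page, or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# Made-up sample standing in for the bytes of a GBK-encoded page.
sample = '<html><head><meta charset="gbk"><title>目录</title></head></html>'.encode("gbk")
print(sniff_meta_charset(sample))  # prints: gbk
```

With requests, the raw bytes would be `req.content`, so the detected name could be assigned to `req.encoding` before reading `req.text`.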

Add one line of code to set the response encoding to match the page:

req = requests.get(url=self.target)
req.encoding = 'GBK'  # set the encoding to match the page, GBK in this case
bf = BeautifulSoup(req.text, 'html.parser')
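To see why this one line matters, it helps to reproduce the garbling without any network access. One common cause, which is an assumption about this particular site, is that requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, so GBK bytes get decoded with the wrong codec. The sample string below is made up for illustration.

```python
# Stand-in for the raw bytes a GBK-encoded page would send (made-up sample text).
raw = "第一章 开始".encode("gbk")

# ISO-8859-1 maps every byte to a character, so the wrong decode never raises;
# it silently produces mojibake instead.
wrong = raw.decode("iso-8859-1")

# Decoding with the page's real encoding recovers the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)

# The bytes are not lost: re-encoding the mojibake and decoding it as GBK
# restores the original string.
assert wrong.encode("iso-8859-1").decode("gbk") == right
```

Setting `req.encoding` before touching `req.text` has the same effect as the second decode: the bytes stay the same, only the codec used to read them changes.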

Problem solved.

Permalink: https://maydaychen.github.io/2021/01/26/Python问题记录1--BeautifulSoup爬取网页过程中会出现乱码/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0 CN. Please credit the source when reposting.

Maydaychen · Developer & Infra engineer

Formerly an Android/Vue developer, now working in infrastructure, focusing on monitoring and AWS.