I have a list called urls that contains several URLs. I want to use a crawler to fetch the title and body of each web page, store them together in a TXT file, and then generate a word cloud from this group of pages. This is the first file (urls.py):
from urllib import request
import os
from bs4 import BeautifulSoup

def urlsgetword(url):
    response = request.urlopen(url)  # request the page
    content = response.read().decode('utf-8')  # read the body and decode it as utf-8
    soup = BeautifulSoup(content, 'lxml')
    title = soup.title  # the page's <title> tag
    article = soup.find('div', class_='wp_articlecontent')  # the article container
    title = title.text  # title text
    title = ''.join(title.split())  # strip all whitespace
    article = article.get_text(strip=True)  # article text; strip=True trims surrounding blank lines
    article = ''.join(article.split())
    info = title + '\n' + article
    if not os.path.exists("F:/python-file/"):
        os.mkdir("F:/python-file/")
    with open("urls.txt", 'w', encoding='utf-8') as f:
        f.write(info)
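For a single page I expect the function to leave the title and the body in urls.txt, which I can check like this (the URL is only a placeholder for one entry of urls):

urlsgetword('http://example.com/news/1.html')  # placeholder URL
print(open('urls.txt', encoding='utf-8').read())  # title on the first line, body on the second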
This is the second file (wordcloud.py):
import matplotlib.pyplot as plt
import wordcloud
import jieba

def wcloud():
    text = open('F:/python-file/urls.txt').read()
    wordlist_after_jieba = jieba.cut(text, cut_all=True)
    wl_space_split = " ".join(wordlist_after_jieba)
    my_wordcloud = wordcloud.WordCloud().generate(wl_space_split)
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
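For what it's worth, cut_all=True is jieba's full mode, which emits every word it can recognize at overlapping positions; I join the tokens with spaces so WordCloud can treat each one as a separate word. A quick check with the example sentence from jieba's README:

import jieba
print(" ".join(jieba.cut("我来到北京清华大学", cut_all=True)))
# full mode prints: 我 来到 北京 清华 清华大学 华大 大学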
Finally, I want the word cloud to be produced from the main file, so I write this:

from urls import urlsgetword
from wordcloud import wcloud

# urls is the list of page URLs mentioned above
for u in urls:
    urlsgetword(u)
wcloud()
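All three files are in the same folder, and I start the program from there (main.py is just what I named the third file):

python main.py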
When I run it, the program fails. Which file is wrong?