step1: Fetch the lyrics
A simple little crawler script is enough for this small goal, but one reminder for the reader: the song list is loaded by JS, so the song IDs are not sitting in the page source as plain markup, and this needs special attention. Also, when writing to a file, remember to convert the data type first; pass it through str() so it becomes a string that can be written out.
Enough preamble; here is the code.
# -*- coding: utf-8 -*-
import json
import os
import re

import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch a page with a browser User-Agent so the request is not rejected."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
    try:
        response = requests.get(url, headers=headers)
        return response.content
    except requests.RequestException:
        print('request error')


def download_by_music_id(music_id):
    """Download one song's lyric from the lyric API and strip the time tags."""
    lrc_url = 'http://music.163.com/api/song/lyric?' + 'id=' + str(music_id) + '&lv=1&kv=1&tv=-1'
    r = requests.get(lrc_url)
    j = json.loads(r.text)
    try:
        lrc = j['lrc']['lyric']
        pat = re.compile(r'\[.*?\]')   # time tags such as [00:12.34]
        lrc = re.sub(pat, '', lrc)
        return lrc.strip()
    except KeyError:
        pass


def get_music_ids_by_musician_id(singer_id):
    """The artist page embeds its song list as JSON inside a <textarea>; parse it."""
    singer_url = 'http://music.163.com/artist?id=' + str(singer_id)
    r = get_html(singer_url)
    soupObj = BeautifulSoup(r, 'lxml')
    song_ids = soupObj.find('textarea').text
    jobj = json.loads(song_ids)
    ids = {}
    for item in jobj:
        print(item['id'])
        ids[item['name']] = item['id']
    return ids


def download_lyric(uid):
    """Save every lyric of the artist into a folder named after the artist id."""
    try:
        os.mkdir(str(uid))
    except OSError:
        pass
    os.chdir(str(uid))
    music_ids = get_music_ids_by_musician_id(uid)
    for key in music_ids:
        text = download_by_music_id(music_ids[key])
        with open(key + '.txt', 'a', encoding='utf-8') as f:
            f.write(key + '\n')
            f.write(str(text))


download_lyric(10473)
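One detail worth calling out: the raw lyric text returned by the API prefixes every line with a time tag such as [00:12.34], which is why download_by_music_id removes anything in square brackets with the non-greedy pattern r'\[.*?\]'. A tiny check with a made-up lyric string:

import re

sample = '[00:12.34]first line\n[00:18.90]second line'
print(re.sub(r'\[.*?\]', '', sample))
# prints the two lines without the time tags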
The result looks like this:
Figure 1: Fetching the lyrics
step2: Word frequency statistics, analysis, and word cloud visualization
Note: the base image for the word cloud needs to be downloaded from the web and saved locally, and you should try to pick a picture with fairly high contrast, otherwise the outline of the resulting cloud may not come out well; ideally use a white background with a dark subject, so the contrast is higher.
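If the only picture you can find is not dark enough against its background, it can be binarized first so the mask has a clean white background and a solid dark shape. A minimal sketch, assuming Pillow and NumPy are installed; mask.jpg and mask_binary.png are placeholder names:

from PIL import Image
import numpy as np

img = Image.open('mask.jpg').convert('L')              # convert to grayscale
arr = np.array(img)
binary = np.where(arr > 128, 255, 0).astype(np.uint8)  # white background, dark shape
Image.fromarray(binary).save('mask_binary.png')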
The full code is as follows:
# -*- coding: utf-8 -*-
__author__ = 'lenovo'
import os
import json

import jieba.analyse
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator


def read_content(content_path):
    """Concatenate all lyric files in the folder into one string."""
    content = ''
    for f in os.listdir(content_path):
        file_fullpath = os.path.join(content_path, f)
        print('loading {}'.format(file_fullpath))
        content += open(file_fullpath, 'r', encoding='utf-8').read()
        content += '\n'
    print('done loading')
    return content


content = read_content('E:\\MaritimeData\\OnePiece\\Song\\yuanyawei')

# Extract keywords together with their TF-IDF weights
# (this step was missing from the original listing; topK is an arbitrary choice).
result = jieba.analyse.extract_tags(content, topK=200, withWeight=True)
keywords = dict()
for i in result:
    keywords[i[0]] = i[1]
yuanyawei = json.dumps(keywords, ensure_ascii=False)
print(yuanyawei)

# The mask image decides the outline of the word cloud.
image = Image.open('E:\\MaritimeData\\OnePiece\\Song\\image\\yuanyawei.jpg')
graph = np.array(image)
wc = WordCloud(font_path='C:/Windows/Fonts/STXINGKA.TTF',
               background_color='white', max_words=1000, mask=graph)
wc.generate_from_frequencies(keywords)

# Recolor the words with the colors of the mask image and display the result.
image_color = ImageColorGenerator(graph)
plt.imshow(wc.recolor(color_func=image_color))
plt.axis("off")
plt.show()
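If you also want to keep the rendered cloud as an image file rather than only showing it on screen, the wordcloud library's to_file method (or matplotlib's savefig, called before plt.show()) will write it to disk; the output filename below is just an example:

wc.to_file('yuanyawei_wordcloud.png')
# or, before plt.show():
# plt.savefig('yuanyawei_wordcloud.png', dpi=200, bbox_inches='tight')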
This code could also be merged with the code above into one coherent script, but it is kept separate here.
The final result looks like this:
Figure 2: Word frequency analysis
Figure 3: Word cloud visualization
Looking at the word cloud, the most frequent words are "世界" (world), "没有" (without), "不会" (won't), "绽放" (bloom), "感觉" (feel), "时间" (time), "相信" (believe), "蓝色" (blue) and "月亮" (moon).
How to read the word cloud is, of course, up to each reader, so I won't say more here; anyone interested can build on this and analyze further.