https://study.163.com/course/courseLearn.htm?courseId=1005913008#/learn/text?lessonId=1053258283&courseId=1005913008
把豆瓣“即将上映”的电影爬出来。
很简陋的代码,现在的知识只能写这样了,做个记录吧。以后有能力再完善。
1 # coding=gbk 2 # 豆瓣即将上映的电影 3 # https://movie.douban.com/cinema/later/changsha/爬取这个页面的信息吧 4 5 import requests 6 from bs4 import BeautifulSoup 7 import json 8 9 10 def get_page():11 url = 'https://movie.douban.com/cinema/later/changsha/'12 headers = { "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}13 response = requests.get(url, headers, verify=False)14 text = response.text15 return text16 17 18 def get_data(text):19 movies = []20 soup = BeautifulSoup(text, 'lxml')21 # class="item mod " class="item mod odd"有两个都是的。22 # 第一种思路,直接分开成两个来处理23 divList1 = soup.find_all('div', attrs={ 'class': "item mod "})24 divList2 = soup.find_all('div', attrs={ 'class': "item mod odd"})25 # 合并两个list26 divList = divList1+divList227 '''2829 30 31 32 '''41 for div in divList:42 movie = {}43 thumb = div.find('img')['src']44 # print(thumb)45 # 获取所有a标签,其中第二个a标签里是电影名46 title = div.find_all('a')[1].text47 # print(title)48 # 有些电影里没有类型这列,所有要考虑下,不能直接赋值类型49 li_tag = div.find_all('li')50 if len(li_tag)==4:51 date = li_tag[0].text52 type = li_tag[1].text53 area = li_tag[2].text54 want = li_tag[3].text55 else:56 date = li_tag[0].text57 type = ''58 area = li_tag[1].text59 want = li_tag[2].text60 61 # print(date)62 movie['title'] = title63 movie['thumb'] = thumb64 movie['date'] = date65 movie['type'] = type66 movie['area'] = area67 movie['want'] = want68 movies.append(movie)69 return movies70 71 72 def save_text(data):73 with open('douban2.json', 'w', encoding='utf-8') as fp:74 json.dump(data, fp, ensure_ascii=False)75 76 77 if __name__ == '__main__':78 text = get_page()79 data = get_data(text)80 save_text(data)
两个li标签的区别
或者以后可以参考https://www.jianshu.com/p/c64fe2a20bc9,这个代码写的可以借鉴。