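> TapTap is a mobile-game sharing community that recommends high-quality mobile games, syncs game rankings from major app stores worldwide in real time, and lets players around the world discuss and discover high-quality mobile games.

Scraping the TapTap rankings is slightly roundabout, so let's walk through how to do it.
The starting page is [https://www.taptap.com/top/download](https://www.taptap.com/top/download)
The ranking follows a typical list-plus-detail structure.
The list page looks like this: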
![微信截图_20200918105434.png](https://static.studygolang.com/200918/06f41d0a53186ee659cad900ec0c4a7c.png)
In Chrome, press F12 to open DevTools and locate the list in the DOM. Each list item is a `div` carrying the class ```taptap-top-card```,
as shown below:
![微信截图_20200918105834.png](https://static.studygolang.com/200918/bf82bf58b97e43301af749f20568e027.png)
So the `list_css` in the Digger crawler configuration is:
```yaml
list_css: div.taptap-top-card
```
In the same way, we get the CSS selector for the detail-page link:
```div.top-card-middle>a```
The list-page stage can then be configured like this:
```yaml
- name: list_games
  is_list: true
  is_unique: false
  list_xpath: ""
  list_css: div.taptap-top-card
  page_xpath: ""
  page_css: ""
  page_attr: ""
  plugin: extract_html@s2
  fields:
  - name: game_url
    is_array: false
    is_html: false
    xpath: ""
    css: div.top-card-middle>a
    attr: href
    plugin: ""
    remark: ""
    next_stage: game_detail
```
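
To make the list stage concrete, here is a minimal Python sketch of what the two selectors do: `list_css` picks out one node per game, and the `game_url` field is then resolved inside each node, with `attr: href` reading the link target. It is not Digger's implementation. The sample markup and the `/app/1001` style URLs are hypothetical, and the standard library's `ElementTree` path syntax stands in for a real CSS engine, so it only matches an exact `class` value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified list-page markup; the real TapTap HTML is more complex.
SAMPLE_LIST_HTML = """
<div id="top-list">
  <div class="taptap-top-card">
    <div class="top-card-middle"><a href="https://www.taptap.com/app/1001">Game A</a></div>
  </div>
  <div class="taptap-top-card">
    <div class="top-card-middle"><a href="https://www.taptap.com/app/1002">Game B</a></div>
  </div>
</div>
"""


def extract_game_urls(markup: str) -> list:
    """Return one detail-page URL per list item found in the markup."""
    root = ET.fromstring(markup)
    urls = []
    # Stand-in for list_css: div.taptap-top-card (exact class match only).
    for card in root.iterfind(".//div[@class='taptap-top-card']"):
        # Stand-in for css: div.top-card-middle>a, evaluated inside this card.
        link = card.find(".//div[@class='top-card-middle']/a")
        if link is not None and link.get("href"):
            urls.append(link.get("href"))  # attr: href
    return urls


game_urls = extract_game_urls(SAMPLE_LIST_HTML)
print(game_urls)
```

Each URL collected this way is what `next_stage: game_detail` would hand on to the detail stage.
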
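With the detail-page URL in hand, we can define a second stage, `game_detail`, to scrape each game's details. Its CSS selectors are found the same way as above. Note that `start_urls` in the final configuration points to the `/ajax/top/download` endpoint rather than the page URL given earlier, and that the list stage runs the `extract_html` plugin at `s2`. The complete configuration file is: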
```yaml
start_urls:
- https://www.taptap.com/ajax/top/download?total=30&page=1
start_stage: list_games
stages:
- name: list_games
  is_list: true
  is_unique: false
  list_xpath: ""
  list_css: div.taptap-top-card
  page_xpath: ""
  page_css: ""
  page_attr: ""
  plugin: extract_html@s2
  fields:
  - name: game_url
    is_array: false
    is_html: false
    xpath: ""
    css: div.top-card-middle>a
    attr: href
    plugin: ""
    remark: ""
    next_stage: game_detail
- name: game_detail
  is_list: false
  is_unique: false
  list_xpath: ""
  list_css: ""
  page_xpath: ""
  page_css: ""
  page_attr: ""
  plugin: get_user_reviews@s2
  fields:
  - name: icon
    is_array: false
    is_html: false
    xpath: ""
    css: div.show-main-header>div.main-header-icon>div.header-icon-body>img
    attr: src
    plugin: ""
    remark: ""
    next_stage: ""
  - name: name
    is_array: false
    is_html: false
    xpath: ""
    css: div.show-main-header>div.main-header-text>div.base-info-wrap>h1
    attr: ""
    plugin: ""
    remark: Game name
    next_stage: ""
  - name: author
    is_array: false
    is_html: false
    xpath: ""
    css: div.show-main-header>div.main-header-text>div.base-info-wrap>div.header-text-author>a>span:nth-last-child(1)
    attr: ""
    plugin: ""
    remark: Publisher
    next_stage: ""
  - name: rating
    is_array: false
    is_html: false
    xpath: ""
    css: span.app-rating-score
    attr: ""
    plugin: ""
    remark: TapTap rating
    next_stage: ""
  - name: install_count
    is_array: false
    is_html: false
    xpath: ""
    css: p.description>span:nth-child(1)
    attr: ""
    plugin: remove_suffix@s4
    remark: Install count
    next_stage: ""
  - name: star_count
    is_array: false
    is_html: false
    xpath: ""
    css: p.description>span:nth-child(2)
    attr: ""
    plugin: remove_suffix@s4
    remark: Follower count
    next_stage: ""
  - name: videos
    is_array: true
    is_html: false
    xpath: ""
    css: ul#imageShots>li video
    attr: data-src
    plugin: ""
    remark: Promotional videos
    next_stage: ""
  - name: screenshots
    is_array: true
    is_html: false
    xpath: ""
    css: ul#imageShots>li>a
    attr: href
    plugin: ""
    remark: Screenshots
    next_stage: ""
  - name: developer_speak
    is_array: false
    is_html: true
    xpath: ""
    css: div#developer-speak
    attr: ""
    plugin: ""
    remark: Note from the developer
    next_stage: ""
  - name: group_number
    is_array: false
    is_html: false
    xpath: ""
    css: div.main-body-number>p
    attr: ""
    plugin: ""
    remark: Community group
    next_stage: ""
  - name: additional
    is_array: true
    is_html: false
    xpath: ""
    css: ul.main-body-additional>li>span
    attr: ""
    plugin: ""
    remark: Additional info
    next_stage: ""
  - name: tags
    is_array: true
    is_html: false
    xpath: ""
    css: ul#appTag>li>a
    attr: ""
    plugin: ""
    remark: Game tags
    next_stage: ""
  - name: description
    is_array: false
    is_html: true
    xpath: ""
    css: div#description
    attr: ""
    plugin: ""
    remark: Game description
    next_stage: ""
  - name: game_size
    is_array: false
    is_html: false
    xpath: //span[@class="info-item-title" and text()="文件大小:"]/../span[@class="info-item-content"]
    css: ""
    attr: ""
    plugin: ""
    remark: Game size
    next_stage: ""
  - name: cur_version
    is_array: false
    is_html: false
    xpath: //span[@class="info-item-title" and text()="当前版本:"]/../span[@class="info-item-content"]
    css: ""
    attr: ""
    plugin: ""
    remark: Current version
    next_stage: ""
  - name: updated_date
    is_array: false
    is_html: false
    xpath: //span[@class="info-item-title" and text()="更新时间:"]/../span[@class="info-item-content"]
    css: ""
    attr: ""
    plugin: ""
    remark: Last updated
    next_stage: ""
  - name: user_reviews
    is_array: true
    is_html: false
    xpath: ""
    css: ul#review-label-list>li>div>a
    attr: ""
    plugin: ""
    remark: User reviews
    next_stage: ""
  - name: user_review_count
    is_array: false
    is_html: false
    xpath: //a[@data-taptap-tab="review"]/small
    css: ""
    attr: ""
    plugin: ""
    remark: User review count
    next_stage: ""
headers:
  User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,like
    Gecko) Chrome/78.0.3904.108 Safari/537.36
  accept-language: zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,zh-TW;q=0.6
settings:
  CONCURRENT_REQUESTS: "5"
node_affinity:
- ""
```
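
The `game_size`, `cur_version`, and `updated_date` fields in the configuration above all use the same XPath pattern: find the title `span` by its label text, step up to the parent, then take the sibling content `span`. The Python sketch below illustrates that pattern on hypothetical, simplified markup. It is not how Digger evaluates XPath. The standard library's `ElementTree` supports only a subset of XPath, so the `and` condition is written as two chained predicates, and the helper name `info_value` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified detail-page markup; values are invented for illustration.
SAMPLE_DETAIL_HTML = """
<ul class="info-list">
  <li><span class="info-item-title">文件大小:</span><span class="info-item-content">1.2GB</span></li>
  <li><span class="info-item-title">当前版本:</span><span class="info-item-content">3.4.5</span></li>
</ul>
"""


def info_value(root, label):
    """Return the content that sits next to the title span with the given label."""
    # Title span matched by class and text, then parent (..), then the content span.
    path = (
        ".//span[@class='info-item-title'][.='%s']"
        "/../span[@class='info-item-content']" % label
    )
    node = root.find(path)
    if node is None or not node.text:
        return None  # label not present on this page
    return node.text.strip()


detail_root = ET.fromstring(SAMPLE_DETAIL_HTML)
game_size = info_value(detail_root, "文件大小:")
print(game_size)
```

Matching on the label text rather than on position keeps the field correct even when a game page omits one of the info rows.
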
Save the configuration and start a new task to crawl the whole ranking. A sample of the crawl results:
![微信截图_20200918110749.png](https://static.studygolang.com/200918/14fb1f604e2cbb048deed6cde4838ac8.png)
---
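## About Digger
[Digger](https://github.com/hetianyi/digger) is a configuration-driven, distributed, cross-platform crawler system written in pure [Golang](https://golang.org). It supports plugins written in JavaScript, so you can extend it to do whatever you need. Digger and its related components run with very low resource overhead on inexpensive servers and development boards such as the Raspberry Pi.
Digger has no complex dependencies and is simple to deploy. It supports Linux and Windows, and the currently supported CPU architectures are ```amd64```, ```arm```, and ```arm64```.
You can try it out quickly in the [demo environment](https://demo.diggerit.me/).
> Resources are limited, so please use the demo environment responsibly. A scheduled job clears its data every day at 00:00.

GitHub: https://github.com/hetianyi/digger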
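## Features
- Supports CSS and XPath selectors
- Supports multiple result types: plain text, HTML, array, and more
- Online debugging of crawler configurations in the web UI to pinpoint problems
- Plugin support
- Real-time crawler log viewing
- Browse and export results online, and generate a database schema (Postgres and MySQL) with one click
- Scheduled tasks
- Tasks can be paused
- Distributed worker instances, which help keep the crawler from being blocked
- Label-based scheduling that matches tasks to workers
- Configuration import and export
- Email notifications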