Pros and Cons of PHP as a Web Crawler Language

laical · 754 views

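For web crawlers, Python and Java are the popular choices, but many other languages and frameworks can be used to write your own crawler. Which one to pick depends on your own situation. The workflow of a crawler is simple: analyze the target site, find its API URL, then download the data behind that URL. I chose PHP as my development language.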
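The workflow described above (find the API URL, then download its data) can be sketched in PHP roughly as follows. This is a minimal sketch, not a complete crawler: the helper names `fetchBody` and `decodeJsonBody` are hypothetical, and it assumes the API returns JSON. The httpbin.org URL in the usage comment is only a stand-in for whatever endpoint you find when analyzing the real site.

```php
<?php
// Minimal sketch: download the data behind an API URL and decode it.
// fetchBody() and decodeJsonBody() are hypothetical helper names.

/** Download the raw response body, or return null on failure. */
function fetchBody(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        "http" => [
            "method"  => "GET",
            "timeout" => $timeoutSeconds,
        ],
    ]);
    // @ suppresses the warning; failure is reported through the null return value.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

/** Decode a JSON body into an array, or return null if it is not a JSON object or array. */
function decodeJsonBody(string $body): ?array
{
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}

// Usage (assumed endpoint; replace it with the API URL found during analysis):
// $body = fetchBody("http://httpbin.org/ip");
// if ($body !== null) { var_dump(decodeJsonBody($body)); }
```

Keeping the download and the decoding in separate functions makes it easy to swap in a different transport later, such as a proxy-aware stream context, without touching the parsing step.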

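Pros and cons of PHP:

Pros:

1. Simple to work with and easy to use

2. Can call code written in C or Java (for example through extensions or bridges)

3. Runs on many platforms and works with many frameworks

4. Can collect many kinds of data

5. Low cost

Cons:

1. The syntax and coding conventions are somewhat inconsistent

2. No native multithreading in a standard build

3. Hard to scale, and operations and maintenance are complex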
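The lack of threads does not rule out concurrent downloads. As a minimal sketch, assuming the cURL extension is installed, `curl_multi` can drive several transfers at once from a single thread. `fetchAll` is a hypothetical helper name introduced here for illustration.

```php
<?php
// Sketch: fetch several URLs concurrently in one thread with curl_multi.
// Assumes ext-curl is available; fetchAll() is a hypothetical helper name.

/**
 * @param string[] $urls Distinct URLs (duplicates collapse into one entry).
 * @return array Body per URL, or null where the transfer failed.
 */
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the handles whose transfer did not finish cleanly.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info["result"] !== CURLE_OK) {
            $failed[] = $info["handle"];
        }
    }

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = in_array($ch, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}

// Usage (assumed test URLs):
// var_dump(fetchAll(["http://httpbin.org/ip", "https://httpbin.org/ip"]));
```

This is concurrency through non-blocking I/O rather than true multithreading, so it helps with network-bound crawling but not with CPU-heavy parsing.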

The following PHP code sends requests through a crawler proxy IP:

<?php
    // Target pages to request (one over HTTP, one over HTTPS)
    $url = "http://httpbin.org/ip";
    $urls = "https://httpbin.org/ip";

    // Proxy server (product site: www.16yun.cn)
    define("PROXY_SERVER", "tcp://t.16yun.cn:31111");

    // Proxy credentials
    define("PROXY_USER", "username");
    define("PROXY_PASS", "password");

    $proxyAuth = base64_encode(PROXY_USER . ":" . PROXY_PASS);

    // Set the Proxy-Tunnel value
    $tunnel = rand(1, 10000);

    $headers = implode("\r\n", [
        "Proxy-Authorization: Basic {$proxyAuth}",
        "Proxy-Tunnel: {$tunnel}",
    ]);
    $sniServer = parse_url($urls, PHP_URL_HOST);
    $options = [
        "http" => [
            "proxy"  => PROXY_SERVER,
            "header" => $headers,
            "method" => "GET",
            "request_fulluri" => true,
        ],
        "ssl" => [
            // Send the target host name via SNI when requesting HTTPS through the proxy
            "SNI_enabled" => true,
            // peer_name replaces SNI_server_name, which is deprecated as of PHP 5.6
            "peer_name" => $sniServer,
        ],
    ];
    // The same context serves both requests
    $context = stream_context_create($options);

    // Request the HTTP page
    print($url . PHP_EOL);
    $result = file_get_contents($url, false, $context);
    var_dump($result);

    // Request the HTTPS page
    print($urls . PHP_EOL);
    $result = file_get_contents($urls, false, $context);
    var_dump($result);
?>

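To avoid unnecessarily triggering a site's anti-crawling measures, you can use a crawler proxy to collect the data you need stably and effectively.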
