[Help] Go crawler cannot get cookies from the BYR forum (北邮人论坛)

freezer-glp · 2017-08-08 12:06:34 · 1398 views

Background

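Over the past couple of days I have been trying to write a Go crawler for the BYR forum. The goal is to log in, save the cookie, and send that cookie with every later request. The material I found recommends net/http/cookiejar.
Login currently succeeds and returns the login-success JSON. However, no post-login cookie is captured, so a later direct GET of a post body fails with "您未登录,请登录后继续操作" ("You are not logged in; please log in to continue").
Could anyone tell me what is going wrong here?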

Implementation

package main

import (
	"crypto/tls"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"net/http/cookiejar"
	"net/url"
	"strings"
)

func main() {
	// Create a cookie jar so the client stores and resends cookies.
	cookieJar, err := cookiejar.New(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Build the client with the jar. A non-nil, empty TLSNextProto map
	// disables HTTP/2 (original note: for nginx/1.10).
	httpClient := &http.Client{
		Jar: cookieJar,
		Transport: &http.Transport{
			TLSNextProto: make(map[string]func(authority string, c *tls.Conn) http.RoundTripper),
		},
	}

	// Login form parameters ("ID" and "PWD" are placeholders).
	postValues := url.Values{}
	postValues.Set("id", "ID")
	postValues.Set("passwd", "PWD")
	postValues.Set("s-mode", "0")
	postValues.Set("CookieDate", "3")

	// Build the login request.
	httpReq, err := http.NewRequest("POST", "https://bbs.byr.cn/user/ajax_login.json", strings.NewReader(postValues.Encode()))
	if err != nil {
		log.Fatal(err)
	}
	httpReq.Header.Set("Content-Type", "application/x-www-form-urlencoded; param=value")
	httpReq.Header.Add("X-Requested-With", "XMLHttpRequest")
	httpReq.Header.Add("Connection", "keep-alive")
	httpReq.Header.Add("User-Agent", "Mozilla/5.0")
	httpReq.Header.Add("Referer", "https://bbs.byr.cn")
	httpReq.Header.Add("Accept", "application/json, text/javascript, */*; q=0.01")
	httpReq.Header.Add("authority", "bbs.byr.cn")

	// Log in and print the cookies seen on the request and the response.
	httpResp, err := httpClient.Do(httpReq)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("req cookies: %s \n", httpReq.Cookies())
	fmt.Printf("resp cookies: %s \n", httpResp.Cookies())
	httpResp.Body.Close()

	// Fetch an article with the same client, expecting the jar to send the login cookies.
	httpReq1, err := http.NewRequest("GET", "https://bbs.byr.cn/article/Golang/842", nil)
	if err != nil {
		log.Fatal(err)
	}
	httpReq1.Header.Add("X-Requested-With", "XMLHttpRequest")
	httpResp1, err := httpClient.Do(httpReq1)
	if err != nil {
		log.Fatal(err)
	}
	defer httpResp1.Body.Close()

	body, err := ioutil.ReadAll(httpResp1.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}

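Output (the cookies are visibly empty; the HTML error reads "Possible causes of the error: you are not logged in, please log in to continue"):

req cookies: [] 
resp cookies: [] 
(... omitted ...)
<h5>产生错误的可能原因:</h5><ul><li><samp class="ico-pos-dot"></samp>您未登录,请登录后继续操作</li>
(... omitted ...)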

This has been bothering me for quite a while; any pointers would be appreciated.


5 replies  |  up to 2017-08-16 12:18:40
channel · #1 · 8 years ago

Does it work if you try another site?

LuYuChengProject · #2 · 8 years ago

Capture the cookies manually and then add them to the request yourself.
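
One way to read this suggestion, as a rough sketch only: take the raw Set-Cookie lines from the response header, keep the leading name=value pair of each, and join them into a Cookie header for later requests. The helper name cookieHeaderFromSetCookie is hypothetical (it is not part of net/http), the sample values are copied from the header dump later in this thread, and whether the forum accepts cookies sent this way has not been verified here.

```go
package main

import (
	"fmt"
	"strings"
)

// cookieHeaderFromSetCookie builds a Cookie request-header value from raw
// Set-Cookie lines. It keeps only the leading name=value pair of each line
// and does no name validation, so names such as nforum[UTMPKEY] survive.
// When a name repeats, the later value wins. Hypothetical helper.
func cookieHeaderFromSetCookie(lines []string) string {
	order := []string{}
	values := map[string]string{}
	for _, line := range lines {
		// Everything after the first ';' is attributes (path, domain, ...).
		pair := strings.TrimSpace(strings.SplitN(line, ";", 2)[0])
		eq := strings.Index(pair, "=")
		if eq <= 0 {
			continue // no "=" or empty name: skip the line
		}
		name, value := pair[:eq], pair[eq+1:]
		if _, seen := values[name]; !seen {
			order = append(order, name)
		}
		values[name] = value
	}
	parts := make([]string, 0, len(order))
	for _, name := range order {
		parts = append(parts, name+"="+values[name])
	}
	return strings.Join(parts, "; ")
}

func main() {
	raw := []string{
		"nforum[UTMPUSERID]=guest; path=/; domain=bbs.byr.cn",
		"nforum[UTMPKEY]=21970208; path=/; domain=bbs.byr.cn",
		"nforum[UTMPNUM]=29282; path=/; domain=bbs.byr.cn",
	}
	fmt.Println(cookieHeaderFromSetCookie(raw))
}
```

With a real response, the input would be httpResp.Header["Set-Cookie"], and the result could be applied with httpReq1.Header.Set("Cookie", ...) before calling httpClient.Do. This ignores path, domain, and expiry, so it is only suitable for a single-site crawler.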

marlonche · #3 · 8 years ago

It looks like Go treats [ as an invalid character when parsing Set-Cookie, so httpResp.Cookies() returns nothing. Below are the Set-Cookie headers obtained by printing the whole httpResp:

"Set-Cookie":[]string{"nforum[UTMPUSERID]=guest; path=/; domain=bbs.byr.cn", "nforum[UTMPKEY]=21970208; path=/; domain=bbs.byr.cn", "nforum[UTMPNUM]=29282; path=/; domain=bbs.byr.cn", "nforum[UTMPUSERID]=guest; path=/; domain=bbs.byr.cn", "nforum[UTMPKEY]=21970208; path=/; domain=bbs.byr.cn", "nforum[UTMPNUM]=29282; path=/; domain=bbs.byr.cn"}

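The gap between the raw header and httpResp.Cookies() can be reproduced without the forum. The following is a minimal sketch that builds an http.Response by hand; rawVsParsed is a hypothetical helper and the "plain" cookie is made up for contrast.

```go
package main

import (
	"fmt"
	"net/http"
)

// rawVsParsed reports how many Set-Cookie lines a response carries and how
// many of them survive net/http's parsing in (*http.Response).Cookies.
// Hypothetical helper for illustration.
func rawVsParsed(setCookie []string) (raw, parsed int) {
	resp := &http.Response{Header: http.Header{}}
	for _, line := range setCookie {
		resp.Header.Add("Set-Cookie", line)
	}
	return len(resp.Header["Set-Cookie"]), len(resp.Cookies())
}

func main() {
	raw, parsed := rawVsParsed([]string{
		"nforum[UTMPUSERID]=guest; path=/; domain=bbs.byr.cn", // name contains '['
		"plain=1; path=/", // made-up cookie with a valid name
	})
	fmt.Printf("raw Set-Cookie lines: %d, parsed cookies: %d\n", raw, parsed)
}
```

The raw lines stay available in the Header map even when Cookies() skips them, which is why printing the whole response reveals them.
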
Below is the RFC grammar that Go follows when parsing cookies (<http://tools.ietf.org/html/rfc6265>; token and separators are defined in RFC 2616, Section 2.2):

cookie-pair       = cookie-name "=" cookie-value
cookie-name       = token
cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                     ; US-ASCII characters excluding CTLs,
                     ; whitespace DQUOTE, comma, semicolon,
                     ; and backslash
token             = 1*<any CHAR except CTLs or separators>
separators        = "(" | ")" | "<" | ">" | "@"
                       | "," | ";" | ":" | "\" | <">
                       | "/" | "[" | "]" | "?" | "="
                       | "{" | "}" | SP | HT

freezer-glp · #4 · 8 years ago

In reply to marlonche (#3):

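I see the standard contains the following rules:

token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
                      | "," | ";" | ":" | "\" | <">
                      | "/" | "[" | "]" | "?" | "="
                      | "{" | "}" | SP | HT

In other words, a token cannot contain separators, and "[" and "]" happen to be separators, so the cookie name is judged invalid?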
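
The token rule quoted above can be written out as a small check. This is a sketch of the grammar only, not Go's actual implementation; isToken and the name nforum_UTMPUSERID are hypothetical and used purely for comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// separators lists the RFC 2616 separator characters quoted above,
// including SP and HT.
const separators = "()<>@,;:\\\"/[]?={} \t"

// isToken reports whether s matches: token = 1*<any CHAR except CTLs or separators>.
// Hypothetical helper written from the grammar.
func isToken(s string) bool {
	if s == "" {
		return false // "1*" requires at least one character
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c <= 31 || c >= 127 {
			return false // CTLs (0-31, 127) and non-ASCII bytes
		}
		if strings.IndexByte(separators, c) >= 0 {
			return false // separators such as '[' and ']'
		}
	}
	return true
}

func main() {
	for _, name := range []string{"nforum[UTMPUSERID]", "nforum_UTMPUSERID"} {
		fmt.Printf("%-20s token=%v\n", name, isToken(name))
	}
}
```

Under this rule the forum's cookie names fail because of the brackets, while the same name without them would pass.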

marlonche · #5 · 8 years ago

In reply to freezer-glp (#4):

Yes.
