net/html adds a body tag even if the source document doesn't have one?

blov · 466 views
<p>I&#39;m writing a web scraper in Go with the <code>net/html</code> package. In one unit test, I noticed that I couldn&#39;t make the test fail the way I intended, even though the source document used for the test doesn&#39;t have a <code>&lt;body&gt;</code> tag. The node looks like this:</p>
<pre><code>&amp;{Parent:0x18813100 FirstChild:&lt;nil&gt; LastChild:&lt;nil&gt; PrevSibling:0x18813240 NextSibling:&lt;nil&gt; Type:3 DataAtom:body Data:body Namespace: Attr:[]}
</code></pre>
<p>and this is what the source document used for that test looks like:</p>
<pre><code>&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;meta charset=&#39;utf-8&#39;&gt;
&lt;/head&gt;
&lt;/html&gt;
</code></pre>
<p>Is that how net/html is designed to work? I would like to know! :)</p>
<hr/>**Comments:**<br/><br/>HectorJ: <pre><p>It seems it does: <a href="https://github.com/golang/net/blob/master/html/parse.go#L678" rel="nofollow">https://github.com/golang/net/blob/master/html/parse.go#L678</a></p>
<pre><code>p.parseImpliedToken(StartTagToken, a.Body, a.Body.String())
</code></pre>
<p><a href="https://github.com/golang/net/blob/master/html/parse.go#L1956" rel="nofollow">https://github.com/golang/net/blob/master/html/parse.go#L1956</a></p>
<pre><code>// parseImpliedToken parses a token as though it had appeared in the parser&#39;s
// input.
</code></pre></pre>HadronHubbub: <pre><p>In HTML, <a href="https://html.spec.whatwg.org/multipage/semantics.html#the-body-element" rel="nofollow">the body tags are optional.</a> Any HTML parser that conforms to the specification will do the same thing and give you a body element even if there are no body tags. (Unfortunately, you&#39;ll find that many &#34;HTML&#34; parsing libraries get this wrong and basically act like they&#39;re parsing XML without draconian error handling.)</p></pre>
