htmlparse: a powerful Go tool for parsing HTML documents

htmlparse is a Go tool for parsing HTML documents.

It converts an HTML document into a tree. Each node in the tree is either a tag or a text. Given a tag, a programmer can easily get its original information, including its metadata, its children, its siblings and the text wrapped in it.

One can also modify a tree by writing content into a tag or deleting a tag.

It can be used in web crawlers, analysis, batch formatting, etc.
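The tree described above can be pictured with a small, self-contained sketch. The `node` type, its fields and its methods are hypothetical stand-ins, not htmlparse's actual types. The sketch only illustrates how a tree whose nodes are either tags or texts gives access to children, siblings and wrapped text.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a tree node: either a tag or a text.
type node struct {
	tagName  string            // empty for text nodes
	attrs    map[string]string // metadata of a tag
	text     string            // set only for text nodes
	parent   *node
	children []*node
}

// add appends child to n and records the parent link.
func (n *node) add(child *node) *node {
	child.parent = n
	n.children = append(n.children, child)
	return child
}

// index returns the position of n among its siblings, or -1 for the root.
func (n *node) index() int {
	if n.parent == nil {
		return -1
	}
	for i, c := range n.parent.children {
		if c == n {
			return i
		}
	}
	return -1
}

// innerText concatenates every text node below n, skipping the tags themselves.
func (n *node) innerText() string {
	if n.tagName == "" {
		return n.text
	}
	var b strings.Builder
	for _, c := range n.children {
		b.WriteString(c.innerText())
	}
	return b.String()
}

// buildSample builds the tree for <div class="product"><b>Tea</b> 3.50</div>.
func buildSample() (*node, *node) {
	div := &node{tagName: "div", attrs: map[string]string{"class": "product"}}
	b := div.add(&node{tagName: "b"})
	b.add(&node{text: "Tea"})
	div.add(&node{text: " 3.50"})
	return div, b
}

func main() {
	div, b := buildSample()
	fmt.Println(div.innerText()) // Tea 3.50
	fmt.Println(b.index())       // 0
}
```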


go get -u



  • Parse() *Tree

      The only method needed to convert the original bytes into a tree.  
              import (
                      "io/ioutil"
                      // plus the htmlparse package; its import path is not given here
              )
              content, _ := ioutil.ReadFile("index.html")
              parser := htmlparse.NewParse(content)
              tree := parser.Parse()


  • Filter(filter map[string]string) []*Tag

      Find tags in the document with a filter, which is a key-value map.  
              products := tree.Filter(map[string]string{"tagName": "div", "class": "product"})
  • Find(conditions map[string]string) *TagSets

      Similar to the Filter method, except that it returns a TagSets, which has some useful methods of its own.  
  • String() string

      Return the original document.  
  • Modify() string

      Return the modified document.  
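As a rough illustration of how a key-value filter can select tags, here is a minimal sketch. The `tag` type and the `matches` and `filterTags` helpers are hypothetical, and the matching rule is an assumption: the `"tagName"` key is compared with the element name and every other key must equal the attribute value exactly. The library itself may match differently, for example on multi-valued `class` attributes.

```go
package main

import "fmt"

// tag is a hypothetical, simplified view of a parsed tag.
type tag struct {
	name  string
	attrs map[string]string
}

// matches reports whether t satisfies every key-value pair in filter.
// Assumption: "tagName" is compared with the element name and every
// other key must equal the attribute value exactly.
func matches(t tag, filter map[string]string) bool {
	for key, want := range filter {
		if key == "tagName" {
			if t.name != want {
				return false
			}
			continue
		}
		if got, ok := t.attrs[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// filterTags keeps the tags that satisfy filter, preserving their order.
func filterTags(tags []tag, filter map[string]string) []tag {
	var out []tag
	for _, t := range tags {
		if matches(t, filter) {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	tags := []tag{
		{name: "div", attrs: map[string]string{"class": "product"}},
		{name: "div", attrs: map[string]string{"class": "footer"}},
		{name: "img", attrs: map[string]string{"class": "product"}},
	}
	products := filterTags(tags, map[string]string{"tagName": "div", "class": "product"})
	fmt.Println(len(products)) // 1
}
```

Under this rule an empty filter matches every tag, because no pair can fail.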


  • Find(map[string]string) *TagSets

      Return a set of tags found among a set of tags or their children, using a filter.  
      Calls can be chained.  
              photos := tree.Find(map[string]string{
                      "tagName": "div",
                      "class": "product",
              }).Find(map[string]string{
                      "tagName": "img",
                      "class": "photo",
              })
  • All() []*Tag

      Get all the tags in this set.  
  • GetAttributes(attr ...string) []map[string]string

      Get some attributes from each tag in this set.  
              inputs := form.Find(map[string]string{
                      "tagName": "input",
              }).GetAttributes("type", "name", "value", "data-id")
              for _, input := range inputs {
                      fmt.Println(input["type"], input["name"], input["value"], input["data-id"])
              }
  • String() string
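The shape of the value returned by GetAttributes, one map per tag holding only the requested attributes, can be sketched without the library. The `pickAttributes` helper below is hypothetical, and it assumes that an attribute missing from a tag is simply absent from that tag's result map. The library may behave differently.

```go
package main

import "fmt"

// pickAttributes returns, for each tag's attribute map, only the requested keys.
// Assumption: an attribute missing from a tag is absent from its result map.
func pickAttributes(tags []map[string]string, keys ...string) []map[string]string {
	out := make([]map[string]string, 0, len(tags))
	for _, attrs := range tags {
		picked := make(map[string]string, len(keys))
		for _, k := range keys {
			if v, ok := attrs[k]; ok {
				picked[k] = v
			}
		}
		out = append(out, picked)
	}
	return out
}

func main() {
	// Each map stands in for the attributes of one <input> tag.
	inputs := []map[string]string{
		{"type": "text", "name": "q", "class": "wide"},
		{"type": "submit", "value": "Go"},
	}
	for _, in := range pickAttributes(inputs, "type", "name", "value") {
		fmt.Println(in["type"], in["name"], in["value"])
	}
}
```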


  • Find(map[string]string) *TagSets

      Find the tags from a tag's children.  
  • GetContent() []byte

      Return the original bytes of a tag in the document, including the tag's metadata.  
      By design, each tag or text has a pair of pointers that determine its absolute position in the document.  
      So getting the original content of a tag or text only requires fetching the subslice document[head:tail],  
      which is about as fast as it can get, since slicing copies nothing.
  • String() string

      Satisfy the fmt.Stringer interface.  
  • Extract() []byte

      Extract the text from the original data of a tag. Tags won't be included.  
  • Index() int64

      Get the index of a tag among its siblings.  
  • Prev() *Tag

      Get the previous tag of a tag under the same parent.  
  • Next() *Tag

      Similar to Prev(), but for the next tag.  
  • Modify() string

      Return the modified data of a tag.  
      One should call this after writing to a tag.  
  • WriteText(position int64, data []byte) (*Text, error)

      Write text into a tag at the given index in the tag's children.  
              names := products.Find(map[string]string{"class": "productName"})
              //<div class="productName">Product1</div>
              //<div class="productName">Product2</div>
              //<div class="productName">Product3</div>
              for _, name := range names.All() {
                      name.WriteText(0, []byte("[ONSALE] "))
              }
              //<div class="productName">[ONSALE] Product1</div>
              //<div class="productName">[ONSALE] Product2</div>
              //<div class="productName">[ONSALE] Product3</div>
  • WriteTag(position int64, tagname string) (*Tag, error)

      Write a tag into a tag at the given index in the tag's children.  
              script, _ := body.WriteTag(1000, "script")
              // if the position is greater than the number of the tag's children, the new tag becomes the last child
              script.Attributes["src"] = ""
  • Delete() error

      Delete a tag.  
              garbage := tree.Find(map[string]string{"class": "advertisement"}).All()[0]
              garbage.Delete()
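The head/tail design described under GetContent() can be sketched in a few lines. The `span` type and the `locate` helper are hypothetical, not the library's internals. The point is only that once a node's offsets are known, its original bytes are a subslice of the document and nothing is copied.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a hypothetical record of where a tag or text sits in the document.
type span struct {
	head, tail int // document[head:tail] is the node's original content
}

// content returns the node's bytes without copying: the result shares
// its backing array with document.
func (s span) content(document []byte) []byte {
	return document[s.head:s.tail]
}

// locate finds the first occurrence of raw in document and returns its span.
// A real parser would record the offsets while scanning instead.
func locate(document, raw []byte) (span, bool) {
	i := bytes.Index(document, raw)
	if i < 0 {
		return span{}, false
	}
	return span{head: i, tail: i + len(raw)}, true
}

func main() {
	document := []byte(`<ul><li class="a">one</li><li>two</li></ul>`)
	s, ok := locate(document, []byte(`<li class="a">one</li>`))
	if !ok {
		return
	}
	fmt.Printf("%s\n", s.content(document)) // <li class="a">one</li>
}
```

Because the subslice shares memory with the document, a caller who wants to change the returned bytes without touching the document would need to copy them first.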


  • String() string

      Similar to Tag.
  • Index() int64

      Similar to Tag.
  • Modify() string

      Similar to Tag.
  • Delete()

      Similar to Tag.
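WriteText and WriteTag take a position in a tag's children, and a position past the end puts the new node last. A minimal sketch of that clamping, using a plain slice of strings in place of real child nodes, might look like this. The `insertAt` helper is hypothetical, and treating negative positions as 0 is an added assumption, not something the library documents.

```go
package main

import "fmt"

// insertAt inserts item into children at position. A position past the end
// is clamped so the item becomes the last child. Clamping negative positions
// to 0 is an assumption made for this sketch.
func insertAt(children []string, position int, item string) []string {
	if position < 0 {
		position = 0
	}
	if position > len(children) {
		position = len(children)
	}
	children = append(children, "")                  // grow by one slot
	copy(children[position+1:], children[position:]) // shift the tail right
	children[position] = item
	return children
}

func main() {
	body := []string{"<h1>", "<p>"}
	body = insertAt(body, 1000, "<script>") // out of range, so it lands last
	body = insertAt(body, 0, "[ONSALE] ")   // lands first
	fmt.Println(body)
}
```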


