XML parsing and memory usage

xuanbao · 439 views


Hey, I'm developing a pipeline that takes as input XML files with a reasonably complex structure and size. The nested structure might have several layers, and the files range from 900 MB to 3 GB. I can't change that, because how they are formed is not under my control; I just need to read them. Right now I have the pipeline working by mapping a nested group of structs to the XML file, something like this (the code already works; this is just to serve as an example of how I'm doing it): <pre><code>xmlFile, e := os.Open(f)
if e != nil {
	return e
}
defer xmlFile.Close()

// Reads the entire file into memory before decoding.
b, e := ioutil.ReadAll(xmlFile)
if e != nil {
	return e
}

var xmlHead XmlHeadStruct

reader := bytes.NewReader(b)
decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReader

if e = decoder.Decode(&xmlHead); e != nil {
	return e
}
</code></pre> Now, everything works perfectly with one exception: the memory usage! A file of approximately 900 MB consumes 4 GB of RAM, and one of 3.3 GB can consume almost 14 GB. Not cool! Do you guys have any suggestions on how to get around this? Thanks <hr/>**Comments:** jerf: <pre>The encoding/xml library in Go is, in my opinion, an undersung hero. Let me explain.

There are two basic types of XML parser: the DOM-type parser and the SAX-type parser.

The DOM-type parser takes a chunk of XML, basically as a string, and converts it into some objects all in one shot. You then navigate the set of objects it represents. I call it "DOM type" because the "correct" default is to present DOM objects that conform to the DOM object standard, but you can consider marshaling those into another, more local format as just another change.

There are also SAX-like parsers. These are named after the first (big, successful) library that implemented this style. In this style, an XML document is presented to a callback as a series of events: "here's a start tag with these attributes", "here's some text", "here's some more text", "here's another start tag", "here's some whitespace", etc.
This allows streaming of the XML with a very small memory footprint (provided that you're not being actively attacked by someone who sends you a 3 GB tag or something), but at the cost of being a much more painful way to navigate the document.

Go's default XML library is a rather nice hybrid. You're using it in the DOM mode, where you ask the library to parse the entire document in one shot into a single object. The "SAX"-style functionality is available via the <a href="https://golang.org/pkg/encoding/xml/#Decoder.Token">Token method</a>. If you feed an XML document to the parser and use nothing but the Token method to read it, you get a SAX-style parser.

But the hidden secret is that you can do both, and the parser does the sensible thing. You can use .Token() to read the first opening tag. Maybe you just skip it because it doesn't do anything useful on its own. Then you've got three basic cases: <ol>
<li>If you know what the next chunk of text is going to be, and you want to unmarshal it, call .Decode with a pointer to your target struct. The XML decoder will decode that element into your struct, and then await your next command after the matching end tag, having consumed only the matching portion of the document.</li>
<li>If you don't know what the next chunk is, call .Token() until you get a start tag. At that point, if you want to unmarshal that element, call .DecodeElement and pass it that start element. It will then work like case #1.</li>
<li>If you don't know what the next chunk is, and you call .Token() and get something you "don't want", call .Skip() on the decoder and it will correctly skip over the rest of the contents of that tag, regardless of what they are.</li>
</ol> You can mix and match a bit too; for instance, if you encounter a "list" of something, you can .Token your way past its start tag, then keep .DecodeElement'ing until you encounter the close tag for the list.
(You will need to use DecodeElement in this case, unless your XML format rigidly specifies the number of elements in this list somehow.)

You end up with pretty much the best of both worlds: you may have to use SAX-style parsing for the top level of the document, but you retain the convenience of DOM-style parsing for any smaller chunk of the document you can find.

While JSON is still easier when it works, I find Go to be one of the better environments for working with XML, and it's not really Go qua Go, it's the API offered by this nice library. You can tell that it was written by somebody who has clocked a lot of time with XML. (In much the same way, it is obvious to me that neither the original SAX API nor the original DOM API was written by anyone with lots of experience in XML. After all, how would they have gotten that much experience before writing the first APIs? DOM got better, especially with XPath in DOM3; I still think the API is klunky, but the worst omissions of the first standard were rectified. DOM1 had some really big holes, IMHO.)

You can also flip things around: in the middle of a DOM-style parse, if you have a struct that implements <a href="https://golang.org/pkg/encoding/xml/#Unmarshaler">Unmarshaler</a>, that struct gets direct access to the decoder and can use .Token() itself.</pre>prvst: <pre>Thanks a lot for that. This is an honest reply from someone truly committed and dedicated to a programming language like Go. Kudos, sir!</pre>jfarlow: <pre>That's really useful to know. And a really good idea!</pre>silviucm: <pre>As a small add-on to the previous two great answers by <a href="/u/jerf" rel="nofollow">/u/jerf</a> and <a href="/u/icholy" rel="nofollow">/u/icholy</a>, and without knowing what your real code looks like, note that the memory numbers you quoted do make sense, in a perverse way, if you store the entire file plus the corresponding Go structs in memory. That is due to the way Go allocates memory in its default mode.
For example, 900 MB for the ioutil.ReadAll + (can't tell exactly, but let's just say a similar amount for the deserialized Go structs) = 1.8 GB to 2 GB.

You can have a read here: <a href="https://blog.golang.org/go15gc" rel="nofollow">https://blog.golang.org/go15gc</a>

To quote: "The default value of 100 means that total heap size is now 100% bigger than (i.e., twice) the size of the reachable objects after the last collection. 200 means total heap size is 200% bigger than (i.e., three times) the size of the reachable objects. If you want to lower the total time spent in GC, increase GOGC. If you want to trade more GC time for less memory, lower GOGC."

So this sort of explains why you get that 4 GB of RAM usage: roughly 2 GB of reachable data, doubled under the default GOGC of 100.

I cannot comment on your need to keep all those structs in memory, as opposed to reusing just a pool of objects, because I don't know the requirements. Still, on a dev box, you can definitely play with adjusting GOGC and see if you can squeeze out any memory gains there as well, in addition to the previously mentioned buffered read instead of ReadAll.

Cheers</pre>prvst: <pre>Thanks, I'll take a look at those links you sent me</pre>icholy: <pre>Don't read the whole file into memory. <pre><code>f, err := os.Open(fname)
if err != nil {
	return err
}
defer f.Close()

var head XmlHeadStruct
decoder := xml.NewDecoder(f) // decode straight from the file
if err := decoder.Decode(&head); err != nil {
	return err
}
</code></pre></pre>prvst: <pre>That's another good tip, I'll try that. Thanks!!</pre>nsd433: <pre>You may want buffered IO rather than raw filesystem IO here. <pre><code>xml.NewDecoder(bufio.NewReader(f))
</code></pre> I'm not familiar with the xml decoder, but the json decoder does lots of tiny reads and benefits from an IO buffer. A benchmark will show whether or not it's a performance gain. The memory size of the buffer is trivial.</pre>prvst: <pre>Guys, thanks for all the great answers!!!</pre>nsd433: <pre>The suggestions to use the SAX side of the API to skip over more tags and to tune GOGC are both very good.
Once you've done those, if you want more, look at a runtime heap profile of the in-use objects and in-use space and you might find some low-hanging fruit.</pre>