<p>Here's my problem. I am trying to parse text files and extract portions of them. For the sake of argument, we can assume that the text files are small enough to fit in memory (say, 50 kilobytes max). The files are structured like this:</p>
<pre><code>-- section: foo
this is the content for section foo
-- end
-- section: bar
this is the content for section bar
-- end
</code></pre>
<p>The sections are <em>not</em> dynamic. That is, the file should always contain the sections 'foo' and 'bar', but may or may not contain a section 'baz'. The sections may appear in any order.</p>
<p>What I would like to get as output is:</p>
<p><code>(some parser object).getSection('foo') => []byte</code></p>
<p>or something similar.</p>
<p>I have already written a function which parses this, but I am not satisfied with how it operates. Basically, I read the file byte by byte and check whether the next n bytes match one of the tokens. If so, I save the start and end positions of the section. I am pretty sure there is a library which does such a thing; I just can't find it because I can't seem to name the class of problem I am trying to solve ;)</p>
<p>I've also thought of reading the file into memory and using a regexp, but that seems like the wrong approach. I have seen that Go has the <code>text/scanner</code> package, but I haven't been able to determine whether it's a good fit.</p>
<p>Thanks in advance!</p>
<hr/>**Comments:**<br/><br/>justinisrael: <pre><p>If this is the complete spec of the file then it doesn't seem complex enough to warrant a grammar/parser library. Looks like you can just use a Scanner to scan lines. If it's a start section, save the name. Then read lines into a body until you hit the end section. There are only two tokens to look for. </p></pre>
justinisrael: <pre><p><a href="/u/icholy" rel="nofollow">/u/icholy</a> beat me to it, but here is another version which keeps the results structured, with the ability to preserve section order in addition to looking up a specific section: <del><a href="https://play.golang.org/p/D5GvjR8ksV8" rel="nofollow">https://play.golang.org/p/D5GvjR8ksV8</a></del></p>
<p><em>Edit:</em> I liked the approach icholy took with a nested scan for the section body and end, so I cleaned up my version a bit more: <a href="https://play.golang.org/p/IStu-G5CuQd" rel="nofollow">https://play.golang.org/p/IStu-G5CuQd</a></p></pre>
icholy: <pre><p><a href="https://play.golang.org/p/rj-4e1hXjyo" rel="nofollow">https://play.golang.org/p/rj-4e1hXjyo</a></p></pre>
toudi: <pre><p>Thank you very much for the help!</p></pre>
Killing_Spark: <pre><p>If you run into problems with memory you could change your format slightly by requiring the first lines to be an 'index' of the sections available and where they start in the file. Then you don't need to read the whole file to find the last section.</p></pre>
justinisrael: <pre><p>But reading in the whole file isn't necessarily required in the first place. One could scan lines until reaching the desired section. </p></pre>
Killing_Spark: <pre><p>'last' section. Always assume the worst case that could happen. You could also assume that the first section takes up 90% of the file, so a search for any other section has to scan all of the lines of the first section. I know this is probably not really necessary, as he said the files would probably be in the KiB range, but thinking about scalability is never wrong ;) </p></pre>
justinisrael: <pre><p>Or you could parse the section names once and store the offsets in the Parser with the assumption they are valid for the life of the parser. Then you don't have to change the format and you can still scan the file once. </p></pre>
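<p>For readers who cannot open the playground links, the line-scanning approach described in the comments might look roughly like the sketch below. It is not the code behind those links. The names <code>Sections</code>, <code>GetSection</code>, <code>ParseSections</code> and <code>ParseSectionsString</code> are made up for illustration, and it assumes that the markers are exactly <code>-- section: NAME</code> and <code>-- end</code>, each on its own line, that lines outside a section can be ignored, and that a repeated section name overwrites the earlier one.</p>

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Marker strings assumed from the sample file in the question.
const (
	sectionPrefix = "-- section: "
	sectionEnd    = "-- end"
)

// Sections maps a section name to its body (hypothetical type).
type Sections map[string][]byte

// GetSection returns the body of the named section and whether it exists.
func (s Sections) GetSection(name string) ([]byte, bool) {
	body, ok := s[name]
	return body, ok
}

// ParseSections scans r line by line. When it sees a start marker it
// saves the name, then collects lines until the matching end marker.
func ParseSections(r io.Reader) (Sections, error) {
	sections := make(Sections)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, sectionPrefix) {
			continue // assumption: text outside a section is ignored
		}
		name := strings.TrimSpace(strings.TrimPrefix(line, sectionPrefix))

		// Nested scan: read the body until the end marker.
		var body []string
		closed := false
		for sc.Scan() {
			if strings.TrimSpace(sc.Text()) == sectionEnd {
				closed = true
				break
			}
			body = append(body, sc.Text())
		}
		if !closed {
			if err := sc.Err(); err != nil {
				return nil, err
			}
			return nil, fmt.Errorf("section %q has no %q line", name, sectionEnd)
		}
		sections[name] = []byte(strings.Join(body, "\n"))
	}
	return sections, sc.Err()
}

// ParseSectionsString is a convenience wrapper for in-memory input.
func ParseSectionsString(s string) (Sections, error) {
	return ParseSections(strings.NewReader(s))
}

func main() {
	input := "-- section: foo\n" +
		"this is the content for section foo\n" +
		"-- end\n" +
		"-- section: bar\n" +
		"this is the content for section bar\n" +
		"-- end\n"

	sections, err := ParseSectionsString(input)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if body, ok := sections.GetSection("foo"); ok {
		fmt.Printf("%s\n", body) // prints: this is the content for section foo
	}
}
```

<p>Because optional sections such as 'baz' may be missing, <code>GetSection</code> returns a second boolean value instead of an error. Note that <code>bufio.Scanner</code> rejects lines longer than its default 64 KiB token limit, which should not matter for files of the size mentioned in the question.</p>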