<p>Hello golang experts,</p>
<p>I'm trying to read a text file that is 2GB in size and has each JSON object on a new line. The idea is to use a 10-50MB buffer of RAM to read the file in batches, to reduce the load on storage and avoid loading everything into RAM.</p>
<p>How do I do this in Go?
I found this function, but it loads everything into memory and I get "fatal error: runtime: out of memory":</p>
<pre><code>func readLines(filename string) ([]string, error) {
    var lines []string
    // ReadFile loads the whole file into memory at once; this is what runs out of memory
    file, err := ioutil.ReadFile(filename)
    if err != nil {
        return lines, err
    }
    buf := bytes.NewBuffer(file)
    for {
        line, err := buf.ReadString('\n')
        if len(line) == 0 {
            if err != nil {
                if err == io.EOF {
                    break
                }
                return lines, err
            }
        }
        lines = append(lines, line)
        if err != nil && err != io.EOF {
            return lines, err
        }
    }
    return lines, nil
}
</code></pre>
<p>Edit: I found the buffer settings. If the buffer is smaller than one line, .Scan stops because it cannot find a \n:</p>
<pre><code>file, err := os.Open("json.txt")
if err != nil {
println(err)
}
defer file.Close()
scanner := bufio.NewScanner(file)
buf := make([]byte, 0, 1024*1024)
scanner.Buffer(buf, 10*1024*1024)
for scanner.Scan() {
print(scanner.Text())
}
</code></pre>
<hr/>**Comments:**<br/><br/>raff99: <pre><p>You should actually use "stream decoding" as described here:
<a href="https://golang.org/pkg/encoding/json/#example_Decoder_Decode_stream" rel="nofollow">https://golang.org/pkg/encoding/json/#example_Decoder_Decode_stream</a></p>
<p>If your JSON file is "well formed" (i.e. your entries are part of a big JSON array) you may need to skip the initial "[" and the "," between entries.</p></pre>binaryblade: <pre><p>Actually he said each object is on a new line, which means you can just call decode in a loop.</p></pre>
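<p>(A minimal sketch of that decode-in-a-loop idea; the <code>entry</code> struct, its fields, and the file name are illustrative assumptions, not from the thread:)</p>
<pre><code>package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "os"
)

// entry is a placeholder; give it the fields your JSON objects actually have.
type entry struct {
    Name string `json:"name"`
}

func main() {
    f, err := os.Open("json.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    dec := json.NewDecoder(f)
    // If the file were one big JSON array instead of one object per line,
    // you would first consume the opening "[" with dec.Token().
    for {
        var e entry
        if err := dec.Decode(&e); err == io.EOF {
            break // all objects decoded
        } else if err != nil {
            log.Fatal(err)
        }
        fmt.Println(e.Name) // process one object, then discard it
    }
}
</code></pre>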
<pre><code>file, err := os.Open("json.txt")
if err != nil {
println(err)
}
defer file.Close()
scanner := bufio.NewScanner(file)
buf := make([]byte, 0, 1024*1024)
scanner.Buffer(buf, 10*1024*1024)
for scanner.Scan() {
print(scanner.Text())
}
</code></pre></pre>rosencreuz: <pre><p>Why don't you use the scanner and limit the number of lines you read?</p>
<pre><code>file, err := os.Open(filename)
if err != nil {
return nil, err
}
defer file.Close()
lines := []string{}
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lines = append(lines, scanner.Text())
}
</code></pre></pre>simplewhite1: <pre><p>Thanks, I think this will work.</p></pre>simplewhite1: <pre><p>I do not know why, but .Scan only reads 103 lines from a file with 2700 lines.</p></pre>
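<p>(A likely explanation, worth checking with <code>scanner.Err()</code> after the Scan loop in the code above: when a line exceeds the Scanner's maximum token size, <code>Scan</code> simply returns false and the loop ends. A minimal sketch:)</p>
<pre><code>for scanner.Scan() {
    print(scanner.Text())
}
// Scan returns false on the first error, so a single too-long line
// (bufio.ErrTooLong) makes the file look shorter than it is.
if err := scanner.Err(); err != nil {
    println(err.Error()) // e.g. "bufio.Scanner: token too long"
}
</code></pre>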
aristofanio: <pre><p>To test, we used the following code:</p>
<pre><code>package main

import (
    "os"
)

// 33 bytes of JSON + 2 bytes of "\r\n" = 35 bytes per line
const data = `{"name": "ari", "hello": "world"}`
const newl = "\r\n"

func main() {
    f, _ := os.OpenFile("jsonfile.txt", os.O_CREATE|os.O_WRONLY, os.ModePerm)
    // 2GB / 35 bytes per line: roughly 61 million lines
    for i := 0; i < 2*1024*1024*1024/35; i++ {
        f.Write([]byte(data))
        f.Write([]byte(newl))
    }
    f.Close()
}
</code></pre>
<p>The result was ~6GB of RAM used (over ~3 minutes) with your code.
Following yeah-ok's suggestion, without goroutines, the result was ~500KB in 1-2 seconds.
Code used in the tests:</p>
<pre><code>package main

import (
    "io"
    "os"
)

func main() {
    tot := int64(0)
    buf := make([]byte, 35*1024) // room for 1024 of the 35-byte lines per read

    f, _ := os.OpenFile("jsonfile.txt", os.O_RDONLY, os.ModePerm)
    for {
        r, e := f.Read(buf)
        if r == 0 {
            if e == io.EOF {
                break // done
            }
            if e != nil {
                break // read error
            }
        }
        tot += int64(r)
        // process the up-to-1024 JSON lines in buf here
    }
    f.Close()

    print("Read ")
    println(tot)
}
</code></pre></pre>placeybordeaux: <pre><p>The variable <code>lines</code> will contain the entire file.</p></pre>simplewhite1: <pre><blockquote>
<p>file, err := ioutil.ReadFile(filename)</p>
</blockquote>
<p>It says there is not enough memory on this line: <code>file, err := ioutil.ReadFile(filename)</code></p></pre>nswshc: <pre><p>ReadFile reads the <em>whole</em> file into memory, so it won't work. You want a plain <code>os.Open</code>: it just opens a file handle, and then you wrap that handle in a <code>bufio.Scanner</code> to read the file line by line.</p></pre>kardianos: <pre><p>os.Open
<a href="https://godoc.org/bufio#Reader.ReadLine" rel="nofollow">https://godoc.org/bufio#Reader.ReadLine</a></p></pre>Frakturfreund: <pre><p>What exactly do you want to do with all this json objects?</p>
Frakturfreund: <pre><p>What exactly do you want to do with all these JSON objects?</p>
<p>If you want to Unmarshal them into structs (so that you can retrieve specific values), <a href="https://golang.org/pkg/encoding/json/#Decoder" rel="nofollow"><code>json.Decoder</code></a> (which decodes one object at a time from a stream) is probably a good fit for you. The Standard Library has a <a href="https://golang.org/pkg/encoding/json/#example_Decoder" rel="nofollow">simple</a> and an <a href="https://golang.org/pkg/encoding/json/#example_Decoder_Decode_stream" rel="nofollow">advanced</a> example.</p></pre>yeah-ok: <pre><p>Since your file can be split line by line, simply count the number of lines in your file, divide them into chunks that fit in the amount of RAM you have available, and process the file chunk by chunk.</p></pre>printf_hello_world: <pre><p>Your question has been answered by other commenters, but I thought it might be helpful to review some OS and filesystem basics. I know you're beyond this level, but others might not be.</p>
<p><strong>What happens when you open/read a file</strong></p>
<p>When you <em>open</em> a file, you are essentially asking the operating system to locate a particular file for you. Sometimes you're also asking the OS to create it for you. However, you are <em>not</em> asking it to load the whole file into memory.</p>
<p>Later, when you <em>read</em> the file, you are asking the operating system to fetch the next chunk of the file. Probably it's a 2 or 4 kilobyte chunk, but that's not the important thing. Rather, the important point is that you only need to hold a small part of the file in memory at any given time.</p>
<p><strong>What the Go standard library can do for you</strong></p>
<p>The simple option from the <code>ioutil</code> package (<code>ReadFile()</code>) loads the entire file into memory. This is convenient for small files, but it's not very good if your file is really big, as you discovered.</p>
<p>What you want is to <em>stream</em> the file, which means that you read data into a buffer, process that data, read new data into the buffer, process that data, and so on.</p>
<p>When you stream data, you have to decide how much to read each time before processing. For many applications, a single line is sufficient: hence, some commenters suggested <code>ReadLine()</code> or something similar. Your case appears to be a bit more complex than that though.</p>
<p><code>Decoder</code>, from the <code>encoding/json</code> package, will read enough to process one JSON document. That is probably ideal for your use case. Just remember to decode, process and then <em>throw away the decoded data</em> before decoding the next one. That will make your stream consume the least possible memory.</p>
<p><em>So finally, your real question: how to read 10-50 MB at a time?</em></p>
<p>Use the <code>bufio</code> package. Wrap your <code>*os.File</code> in <code>bufio.NewReaderSize()</code>. Pass the resulting <code>io.Reader</code> to the JSON stream decoder. Profit.</p></pre>
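<p>(A minimal sketch of that combination; the 32MB buffer size, the file name, and the <code>entry</code> struct are illustrative assumptions:)</p>
<pre><code>package main

import (
    "bufio"
    "encoding/json"
    "io"
    "log"
    "os"
)

// entry is a placeholder; use the fields your objects actually have.
type entry struct {
    Name string `json:"name"`
}

func main() {
    f, err := os.Open("json.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // One large buffered reader between the decoder and the disk, so the
    // file is consumed in big sequential reads rather than tiny ones.
    br := bufio.NewReaderSize(f, 32*1024*1024) // 32MB, inside the 10-50MB range

    dec := json.NewDecoder(br)
    for {
        var e entry
        if err := dec.Decode(&e); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        // process e here, then let it be garbage collected
    }
}
</code></pre>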
binaryblade: <pre><p><a href="https://play.golang.org/p/WND9YPik_M" rel="nofollow">do it in place</a></p></pre>dtoebe: <pre><p>Maybe this will help: <a href="https://sourcegraph.com/github.com/golang/go/-/info/GoPackage/os/-/File/Read" rel="nofollow">https://sourcegraph.com/github.com/golang/go/-/info/GoPackage/os/-/File/Read</a></p>
<p>I don't have time to look into it deeper, but I'll watch the thread, and if I have time and there's no solution yet I'll help out.</p>
<p>Also, depending on your runtime, maybe allocate some swap space so you can load the whole file. Loading it piece by piece like you want will make it very difficult to parse the data, and could make the binary itself use too much memory.</p></pre>