Removing Duplicate Values in a CSV File

polaris · 796 views
<p>Hello Reddit, I am trying to append one CSV file to another and then delete the duplicates from the result. The two CSV files should be nearly identical, and the goal is to keep all unique values from both files. The appending is no issue, and the deletion of duplicates works great, but only if the files are identical. If I add or remove a row in the appended CSV file, the program no longer removes the duplicates, which doesn't make any sense to me. I have tried a few functions that remove duplicates; the one currently in the code is:</p>
<pre><code>func RemoveDuplicates(lines *[]string) {
	found := make(map[string]bool)
	j := 0
	for i, x := range *lines {
		if !found[x] {
			found[x] = true
			(*lines)[j] = (*lines)[i]
			j++
		}
	}
	*lines = (*lines)[:j]
}
</code></pre>
<p>The code that actually removes the duplicates looks something like this (where input is a string containing the CSV data, and path is the name of the CSV file):</p>
<pre><code>lines := strings.Split(string(input), "\n")
RemoveDuplicates(&amp;lines)
output := strings.Join(lines, "\n")
err = ioutil.WriteFile(path, []byte(output), 0644)
if err != nil {
	log.Fatalln(err)
}
</code></pre>
<p>Any advice is greatly appreciated!</p>
<hr/>**Comments:**<br/><br/>
robertmeta: <pre><p>First thing I would do is use the CSV reader and writer: <a href="https://golang.org/pkg/encoding/csv/">https://golang.org/pkg/encoding/csv/</a></p> <p>Since your dataset seems small, load both files into two slices, loop over them, and write only the non-duplicates to a CSV writer. Done.</p> <p>EXAMPLE: <a href="https://play.golang.org/p/C5vvglCSeX">https://play.golang.org/p/C5vvglCSeX</a></p></pre>
klauspost: <pre><p>First of all, your main approach should be to write tests. 
A case like this is easy to test, so be sure to write them.</p> <pre><code>func RemoveDuplicates(lines *[]string) {
</code></pre> <p>You are complicating things a lot by passing a pointer to a slice. It is extremely rare that you should do that, and it should be a warning that you are probably doing something wrong, unless it is required by reflection (which is also a warning sign).</p> <p>If your function takes a slice and modifies it, <strong>return the modified slice</strong>. If you mangle the incoming slice as you do here, note it in the documentation, or simply return a new slice.</p> <p>My advice is "don't try to be clever". Create a new slice, append the unique entries, and return the new slice.</p> <p>A small side note is that you can use <code>map[string]struct{}</code>, and do <code>if _, ok := found[x]; ok {</code>, which is a common idiom for this type of operation.</p></pre>
lapingvino: <pre><p>Especially the map trick is one you should know, and it is actually pretty efficient :).</p></pre>
StabbyCutyou: <pre><p>If you're doing this as a one-off, and not something you <em>have</em> to use Go for, you might find it easier to use the cat, sort, and uniq shell commands (assuming you're using a *nix OS). It'd look something like:</p> <pre><code>cat f1 &gt;&gt; f2 &amp;&amp; sort f2 | uniq &gt; f3
</code></pre> <p>Of course, if you need to do this in Go, the guy with the CSV reader approach is the one to listen to, imo.</p></pre>
SportingSnow21: <pre><p>If keeping the order is important, you can create an index slice and define a sort that swaps both data and index values.</p> <pre><code>package main

import (
	"fmt"
	"sort"
)

// data holds the values; index tracks each value's original position.
var data, index []int

func main() {
	data = []int{5, 3, 3, 8, 6, 5, 7, 9, 1, 3, 6, 6, 8}
	index = make([]int, len(data))
	for i := range data {
		index[i] = i
	}
	// Sort by value so that duplicates become adjacent.
	sort.Sort(IntDup(data))
	for i, tmp := 1, data[0]; i &lt; len(data); i++ {
		if data[i] == tmp {
			data = append(data[:i], data[i+1:]...)
			index = append(index[:i], index[i+1:]...)
			i--
		} else {
			tmp = data[i]
		}
	}
	// Sort by original position to restore the input order.
	sort.Sort(IntDup(index))
	fmt.Println(data)
}

type IntDup []int

func (d IntDup) Len() int           { return len(d) }
func (d IntDup) Less(i, j int) bool { return d[i] &lt; d[j] }

// Swap moves both globals together so data and index stay paired.
func (d IntDup) Swap(i, j int) {
	data[i], data[j] = data[j], data[i]
	index[i], index[j] = index[j], index[i]
}
</code></pre></pre>
jeffrey_f: <pre><p>Get MySQL and import the CSV files into a table.</p> <p>Any time I need to de-dup a data file, this is my tool of choice because it can be used over and over and even automated.</p> <p>Query distinct from the table to an export file and you are done.</p> <p>Not sure what your distinct field would be; however, that can be figured out later.</p></pre>
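<p>For reference, the advice from robertmeta and klauspost can be combined into a minimal sketch, under some assumptions: parse both inputs with <code>encoding/csv</code>, then build and return a new slice of unique records using a <code>map[string]struct{}</code>. The names <code>readRecords</code> and <code>mergeUnique</code> and the sample data are hypothetical, not from the thread. Joining fields with a NUL byte to form the map key assumes no field contains that byte. One possible cause of the original symptom, which the thread does not confirm, is that splitting on "\n" leaves a trailing "\r" or a missing final newline on some lines, so rows that look identical compare as different strings; the CSV reader normalizes both cases.</p>

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
	"strings"
)

// readRecords parses CSV data; the reader accepts both "\n" and "\r\n"
// line endings and does not require a trailing newline.
func readRecords(r io.Reader) ([][]string, error) {
	return csv.NewReader(r).ReadAll()
}

// mergeUnique returns a new slice with the first occurrence of each record,
// in input order. The input slices are left unmodified.
func mergeUnique(files ...[][]string) [][]string {
	seen := make(map[string]struct{})
	var out [][]string
	for _, records := range files {
		for _, rec := range records {
			// Hypothetical key: assumes no field contains a NUL byte.
			key := strings.Join(rec, "\x00")
			if _, ok := seen[key]; ok {
				continue // already emitted this record
			}
			seen[key] = struct{}{}
			out = append(out, rec)
		}
	}
	return out
}

func main() {
	// Sample inputs (made up): different line endings, one overlapping row.
	a, err := readRecords(strings.NewReader("id,name\n1,ann\n2,bob"))
	if err != nil {
		log.Fatalln(err)
	}
	b, err := readRecords(strings.NewReader("id,name\r\n2,bob\r\n3,cy\r\n"))
	if err != nil {
		log.Fatalln(err)
	}
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(mergeUnique(a, b)); err != nil {
		log.Fatalln(err)
	}
}
```

<p>With the sample inputs above, the header row and the shared row <code>2,bob</code> appear once each in the output, even though the two inputs use different line endings. For real files, <code>os.Open</code> can supply the readers in place of <code>strings.NewReader</code>.</p>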
