Go and ETL

polaris · 1706 views
<p>I have an ETL project in Go that pulls in a data feed and a configuration and generates a new data feed. The configuration contains mappings (map this column to this JSON field) and transformers (trim this field to this length). I'm running into a lot of cases in our transformers where we have to do tons of type assertions. For example, because the JSON comes from clients, the "price" field is sometimes a string and sometimes a float.</p> <p>Is there a "right" way to handle these type assertions other than setting up a switch in each mapper/transformer that needs them?</p> <hr/>**Comments:**<br/><br/>
kardianos: <pre><p>In my opinion, the first step of ETL is normalization. In the first pipe or pass, rip through the data and unify the types. Then have all the later stages just assume the types.</p></pre>
weberc2: <pre><p>You'll still need to handle the case in which the data is not of the expected type.</p></pre>
kardianos: <pre><p>Yep. Often I'll first take the raw data, then run validations, not just on format but also on consistency and business rules, and only then actually ingest or process the data as needed. If any of the preconditions fails, the right action depends on the nature of the process. Sometimes you want to omit everything related to the bad record and note the error. Other times blowing up makes sense.</p></pre>
weberc2: <pre><p>I don't know how performant this needs to be, but you will be slowed down by the JSON parser. The JSON parser is going to take each value and stick it into an interface{} (assuming you don't know the structure of your blob at compile time), which is probably an allocation per cell in your data, and then you have to do a type assertion on each cell to get your data back as a concrete type. 
It would be better if you could get each JSON token and do the type switch <em>before</em> you parse it into a Go type (e.g., "if this token is an int token, parse the blob of bytes into an int and shove it into this Column"). If you build your Column abstraction right, you can make your allocations and type switches O(1) instead of O(n).</p></pre>
timetravelhunter: <pre><p>O(1) vs O(N) where n ~ 1</p></pre>
weberc2: <pre><p>No, O(1) irrespective of N.</p></pre>
Tikiatua: <pre><p>We had similar problems and in the end settled on a Java-based ETL tool. It seems to me that this is one of the use cases where Go is not really the best tool for the job.</p></pre>
