I have an ETL project in Go that pulls in a data feed and a configuration and generates a new data feed. The configuration contains mappings (map this column to this JSON field) and transformers (trim this field to this length). I'm running into a lot of cases in all of our transformers where we have to do tons of type assertions. For example: sometimes the "price" field is a string and sometimes it's a float, because the JSON comes from clients.
Is there a "right" way to handle these type assertions other than setting up a switch in each mapper/transformer that needs them?
Comments:
kardianos:
weberc2:In my opinion, the first step to ETL is normalization. In the first pipe or pass, rip through the data and unify the types. Then have all the other steps just assume the types.
kardianos:You'll still need to handle the case in which the data is not of the expected type.
weberc2:Yep. Often I'll first have the raw data, then do validations, not just on format but also on consistency, business rules, and then actually ingest or process the data as needed. If there is an error on any of the preconditions, the actions are specific to the nature of the process. Sometimes you want to omit everything that is related and note the error. Other times blowing up makes sense.
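The normalize-then-assume approach described above could look roughly like the following. This is only a minimal sketch: the helper names `toFloat64` and `normalizePrice` are made up for illustration, and it assumes records were decoded into `map[string]interface{}` with the standard `encoding/json` package. The single type switch lives in the normalization pass, and a value of an unexpected type comes back as an error so the caller can decide whether to skip the record or abort.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// toFloat64 (hypothetical helper) converts the shapes that encoding/json
// can produce for a loosely typed numeric field into a float64.
func toFloat64(v interface{}) (float64, error) {
	switch x := v.(type) {
	case float64: // default decoding of a JSON number
		return x, nil
	case json.Number: // decoding with Decoder.UseNumber enabled
		return x.Float64()
	case string: // client sent the number as a quoted string
		f, err := strconv.ParseFloat(strings.TrimSpace(x), 64)
		if err != nil {
			return 0, fmt.Errorf("not a numeric string: %q", x)
		}
		return f, nil
	case nil:
		return 0, fmt.Errorf("value is null")
	default:
		return 0, fmt.Errorf("unsupported type %T", v)
	}
}

// normalizePrice (hypothetical helper) rewrites record["price"] in place
// as a float64 so that later transformers can assume the type.
func normalizePrice(record map[string]interface{}) error {
	raw, ok := record["price"]
	if !ok {
		return fmt.Errorf("price: field missing")
	}
	f, err := toFloat64(raw)
	if err != nil {
		return fmt.Errorf("price: %w", err)
	}
	record["price"] = f
	return nil
}

func main() {
	inputs := []string{`{"price": "19.99"}`, `{"price": 19.99}`, `{"price": [1]}`}
	for _, in := range inputs {
		var rec map[string]interface{}
		if err := json.Unmarshal([]byte(in), &rec); err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		// What to do on failure is process-specific: here the record is skipped.
		if err := normalizePrice(rec); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("normalized:", rec["price"])
	}
}
```

After this pass, a transformer can use `record["price"].(float64)` directly, and the only place that has to know about the string-or-float ambiguity is the normalizer.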
timetravelhunter:I don't know how performant this needs to be, but you will be slowed down by the JSON parser. The JSON parser is going to take each value and stick it into an interface{} (assuming you don't know the structure of your blob at compile time) which is probably an allocation per cell in your data, and then you have to do a type assertion on each cell to get your data back as a type. It would be better if you could get each JSON token and do the type switch before you parse it into a Go type (e.g., "if this token is an int token, parse the blob of bytes into an int and shove it into this Column"). If you build your Column abstraction right, you can make your allocations and type switches O(1) instead of O(n).
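One way to approximate the "decide the type before parsing into a Go type" idea with only the standard library is sketched below. It is an illustration under assumptions, not a tuned implementation: `FloatColumn` and its `Append` method are hypothetical names, and decoding each row into `map[string]json.RawMessage` still allocates per row, so this does not reach the allocation profile described above. What it does show is the shape of the design: each cell stays as raw bytes, the column owns the parsing decision, and the values land in a single typed slice instead of one `interface{}` per cell.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strconv"
)

// FloatColumn is a hypothetical typed column: all values live in one
// []float64 rather than being boxed individually in interface{} values.
type FloatColumn struct {
	Name   string
	Values []float64
}

// Append looks at the raw JSON bytes of one cell and parses them straight
// into a float64. Both 19.99 and "19.99" are accepted; anything else
// (null, booleans, arrays, non-numeric strings) is reported as an error.
func (c *FloatColumn) Append(raw json.RawMessage) error {
	raw = bytes.TrimSpace(raw)
	if len(raw) == 0 {
		return fmt.Errorf("%s: missing value", c.Name)
	}
	text := string(raw)
	if raw[0] == '"' {
		// Quoted value: let encoding/json unquote it so escapes are handled.
		if err := json.Unmarshal(raw, &text); err != nil {
			return fmt.Errorf("%s: %w", c.Name, err)
		}
	}
	f, err := strconv.ParseFloat(text, 64)
	if err != nil {
		return fmt.Errorf("%s: cannot parse %s as a number", c.Name, raw)
	}
	c.Values = append(c.Values, f)
	return nil
}

func main() {
	feed := `[{"price": "19.99"}, {"price": 5}, {"price": "n/a"}]`

	// Keep every cell as undecoded bytes until a column claims it.
	var rows []map[string]json.RawMessage
	if err := json.Unmarshal([]byte(feed), &rows); err != nil {
		fmt.Println("decode error:", err)
		return
	}

	price := &FloatColumn{Name: "price"}
	for i, row := range rows {
		if err := price.Append(row["price"]); err != nil {
			fmt.Printf("row %d skipped: %v\n", i, err)
		}
	}
	fmt.Println(price.Values)
}
```

Getting closer to one allocation per column would likely mean streaming the input with `json.Decoder` or a third-party tokenizer rather than building a map per row; that part is left out here.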
weberc2:O(1) vs. O(N) where N ~ 1
Tikiatua:No, O(1) irrespective of N.
We had similar problems and in the end settled on a Java-based ETL tool. It seems to me that this is one of the use cases where Go is not really the best tool for the job.