Is it possible to create a file with utf-8 as charset?

<hr/>**Comments:**<br/><br/>_gall0ws: <pre><p>UTF-8 was literally invented by Rob Pike and Ken Thompson.</p></pre>TheMerovius: <pre><p>To unpack this question a bit (bear with me, genuinely trying to be helpful :) ):</p> <ul> <li>Text is (for the purpose of this comment) a string of abstract &#34;characters&#34;, devoid of any way to represent them. Unicode tries to be a collection of any character ever used, so we might very well talk about &#34;Unicode code points&#34; (which are integers in the range 0 to roughly 1.1M) as characters.</li> <li>Files, in general, don&#39;t have a &#34;charset&#34;, strictly speaking. They are simply a bunch of bytes. But</li> <li>charsets (or encodings) encode strings of characters into bytes and back, so a file <em>might</em> contain text encoded in a certain charset. Therefore</li> <li>The question should be whether we can write UTF-8 encoded text to a file.</li> <li>The obvious and unhelpful answer is &#34;yes, we can write any bytes we want, so we can also write the bytes you choose&#34;</li> </ul> <p>Here the question branches out a bit, depending on where you are coming from. You want to encode text into UTF-8, but where is this text coming from and how is it represented?</p> <ul> <li>Canonically, in Go, text is represented by a <code>string</code> containing a bunch of bytes representing the text in <code>UTF-8</code>. But as others in this thread pointed out, this is not a strict requirement, so depending on where the <code>string</code> comes from, it might or might not be valid UTF-8. But if you have a simple string literal, it is safe to assume that this will be represented by UTF-8 bytes.</li> <li>If you are reading the text from some other file or the network or something, you have to <em>decode</em> it, based on what encoding was used when writing it. Of course, that might well be UTF-8, in which case, great! 
You have UTF-8 encoded bytes.</li> <li>It might be a slice of runes, which literally is a series of Unicode code points. In this particular case, the language gives you a convenient tool to encode them to UTF-8: After <code>x := []rune{&#39;♥&#39;, &#39;þ&#39;, &#39;Σ&#39;}; s := string(x)</code>, <code>s</code> will contain the UTF-8 encoded string of those three characters.</li> <li>Lastly, if you know the encoding of a bunch of bytes, you can see if <a href="https://godoc.org/golang.org/x/text/encoding" rel="nofollow">this encoding package</a> contains a way to decode it; or search around for another one.</li> </ul> <p>All of this may seem terribly complex, but know that this is not complexity specific to Go: It is simply a consequence of how history worked out, in that we represent all information by bytes, but &#34;text&#34; is an abstract concept that predates computers. So every language needs to somehow deal with the de- and encoding problem of text :)</p> <p>The Go specific part is a) a bunch of bytes can either be a <code>[]byte</code> or a <code>string</code>, b) the latter <em>canonically</em> is UTF-8, but that&#39;s not checked, c) the language doesn&#39;t have a &#34;text&#34;-type that represents the abstract notion of text but d) <code>[]rune</code> probably comes closest, but is relatively uncommon to actually use.</p> <p>Most people end up just assuming UTF-8 or not caring; for most operations it doesn&#39;t actually matter <em>a lot</em> anyway.</p></pre>aNNdii: <pre><p>First of all, thank you for your detailed answer, but maybe I should have asked my question differently. I will try to explain my problem.</p> <p>I&#39;m writing a program which writes user input into a CSV file. The CSV file has ASCII encoding. My problem is that in Germany we have special characters like &#34;ü&#34;, &#34;ö&#34;, &#34;ä&#34; and &#34;ß&#34;. 
If I try to import the CSV file, which has ASCII as its encoding, into a database, the characters mentioned above get replaced with a &#34;?&#34;. I found out that if I change the encoding to UTF-8, the file gets imported correctly. I wanted to know if there is a way to set the encoding of a file while creating it. </p> <p>Kind regards.</p></pre>TheMerovius: <pre><blockquote> <p>I wanted to know if there is a way to set an encoding of a file while creating it.</p> </blockquote> <p>But the answer still stands. You can write UTF-8 to a file, no problem. For example <a href="https://play.golang.org/p/cO9cy1eTn4N" rel="nofollow">this code</a> will write some UTF-8 to &#34;foo.txt&#34;. The file will still only contain bytes and does not &#34;have&#34; an encoding.</p> <p>Your problems could be caused by many different things, but most likely not Go. Let&#39;s walk through it:</p> <ul> <li>You say you are writing user inputs into a CSV file. Where do you get the user inputs? Are you reading them from stdin (e.g. via fmt.Scan*)? In that case, maybe your terminal isn&#39;t writing UTF-8, because your locale isn&#39;t set to one that does. You&#39;d have to either switch to a UTF-8 locale, or decode the user input from the encoding that is specified in the locale (most likely ISO 8859-1 if it&#39;s not UTF-8?)</li> <li>If you are not reading them from the terminal, but e.g. via a web request or something, you have to rely on the <code>Content-Type</code> header or other out-of-band information that specifies the encoding used.</li> <li>Once you know what encoding the user input is in, you can try and decode it using the library I linked. 
You can also try and find good Go bindings for iconv, which is the C library that is usually used to de- and encode different character sets, or another, similar Go library.</li> <li>I assume you are not touching the bytes themselves after you read them and are just writing them out; that means that whatever encoding the user input is in, that&#39;s going to be the encoding of your CSV file</li> <li>The database <em>might</em> try and read a UTF-8 CSV file as ISO 8859-1 or vice versa (both would lead to the symptom you described), so even if the user input is correctly encoded and written correctly, it might still jumble it up.</li> </ul> <p>What do you mean by this:</p> <blockquote> <p>I found out that, if I change the encoding to utf-8 the file gets imported correctly.</p> </blockquote> <p>As I mentioned, files don&#39;t &#34;have&#34; an encoding, they only contain bytes. So this most likely means you are using some other program to convert the file from one encoding to another? Does it tell you which encoding it guessed?</p></pre>lhxtx: <pre><p>Yes. </p></pre>TheMerovius: <pre><p>The Go community already has a pretty negative reputation, let&#39;s not add to this :) Please try to imagine how you would feel if your first contact with a new community would be a comment like this. If you think the question is stupid (I don&#39;t), it&#39;s fine to simply not reply :)</p></pre>lhxtx: <pre><p>No hostility was intended. It was just a simple answer. The question isn’t stupid, but it also wasn’t “how do I write a UTF-8 file”. </p> <p>Similarly, if someone asked “does Go have an ORM equivalent to SQLAlchemy” I would reply “no”. </p></pre>shekelharmony: <pre><p>Yeah. UTF-8 is the default text encoding used by Go so you can just use <code>WriteString()</code>.</p> <p><a href="https://play.golang.org/p/nylS2MI1Qxt" rel="nofollow">https://play.golang.org/p/nylS2MI1Qxt</a></p> <p>(WriteString returns the number of bytes written. 
Each code point takes up 3 bytes in UTF-8. 5*3=15).</p></pre>metamatic: <pre><blockquote> <p>Each code point takes up 3 bytes in UTF-8. </p> </blockquote> <p>Each code point takes up from 1 to 4 bytes in UTF-8, depending on the code point.</p></pre>0xjnml: <pre><p>Go has no default text encoding, except that its source code is always UTF-8. Runtime strings have no text encoding at all. They&#39;re just a sequence of bytes. Any bytes. The WriteString method just writes those bytes, no encoding involved.</p></pre>TheMerovius: <pre><blockquote> <p>They&#39;re just a sequence of bytes.</p> </blockquote> <p>In general: Yes. In detail: No. There are several operations built into the language which make UTF-8 the de facto encoding for strings:</p> <ul> <li><code>range</code> assumes a UTF-8 encoded string</li> <li>Escaped Unicode characters in string literals are encoded as UTF-8: e.g. even though the literal <code>&#34;\u2665&#34;</code> is written in ASCII, it encodes to the bytes <code>[]byte{0xe2, 0x99, 0xa5}</code></li> <li>Similarly, <code>string(&#39;♥&#39;)</code> and (equivalently) <code>string(0x2665)</code> are both encoded as UTF-8, even though the rune and integer literals both denote Unicode code points, not UTF-8 bytes</li> </ul> <p>It may seem pedantic, but IMO it is incorrect to say either &#34;strings are UTF-8&#34; or &#34;strings are just a bunch of bytes&#34;. I think the most precise way to phrase it would be along the lines of &#34;strings are canonically UTF-8 encoded but this requirement is not enforced&#34;.</p></pre>
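To make the thread's advice concrete, here is a minimal, standard-library-only sketch. It assumes, as one of the replies speculates, that the user input arrives as ISO 8859-1 bytes; that is a guess about the original poster's setup, not something the thread confirms. The helper names `latin1ToUTF8` and `writeUTF8` and the file name `umlauts.csv` are hypothetical. The decoding relies on the fact that every ISO 8859-1 byte value equals the Unicode code point of its character; other encodings would need a package such as the one linked above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"unicode/utf8"
)

// latin1ToUTF8 (hypothetical helper) decodes ISO 8859-1 bytes into a Go string.
// Each ISO 8859-1 byte value is also the Unicode code point of its character,
// so converting every byte to a rune is sufficient. string([]rune) then
// encodes those code points as UTF-8.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// writeUTF8 (hypothetical helper) writes s to path and returns the number of
// bytes written. Go does not enforce that strings hold valid UTF-8, so the
// check is done explicitly before the file is created.
func writeUTF8(path, s string) (int, error) {
	if !utf8.ValidString(s) {
		return 0, errors.New("input is not valid UTF-8")
	}
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	n, err := f.WriteString(s)
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	return n, err
}

func main() {
	// Assumed input: the word "Grüße" as an ISO 8859-1 terminal would send it.
	input := []byte{'G', 'r', 0xfc, 0xdf, 'e'}
	s := latin1ToUTF8(input)

	// range decodes the string as UTF-8, yielding one code point per step.
	for _, r := range s {
		fmt.Printf("%c U+%04X %d byte(s)\n", r, r, utf8.RuneLen(r))
	}

	n, err := writeUTF8("umlauts.csv", "greeting\n"+s+"\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes\n", n)
}
```

The file written this way has no encoding marker of its own; it just contains UTF-8 bytes, so whatever imports it still has to be told (or has to guess) that the content is UTF-8.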
