Can I use a regex with a slice of runes?

polaris · 191 views
<p>It doesn't seem like there's a regex function to match a slice of runes - just <em>string</em> and <em>[]byte</em>. Is there a way to use a slice of runes instead?</p> <p>The reason I need this is that my parser has to recognise Unicode characters, so if there's another way to do it with a regex while still using a string or []byte, that'd be useful too.</p> <hr/>**Comments:**<br/><br/>
porkbonk: <pre><p>In addition to what singron said, keep in mind that Go doesn't use PCRE, so details of the syntax might differ.</p> <p><a href="https://github.com/google/re2/wiki/Syntax" rel="nofollow">Ctrl+f "Unicode" just to be sure</a> :)</p></pre>
singron: <pre><p>You can just use the normal string methods in the regexp package. They support Unicode.</p></pre>
zacgarby: <pre><p>Really? I thought I tried that - I guess my regexes just didn't support Unicode. Thanks :)</p></pre>
TheMerovius: <pre><p>To be clear: they support <em>UTF-8</em>. "unicode" is not well-defined in this context. If your strings are not UTF-8 (for example, UTF-16 is still very common on Windows), you are going to have to convert them first.</p></pre>
zacgarby: <pre><p>Oh yeah - what I meant was that my regexes don't match Unicode strings.</p></pre>
TheMerovius: <pre><p><a href="https://play.golang.org/p/nvSGagdufO" rel="nofollow">Seems to be working fine</a></p> <p>And to be clear: "unicode" is a character set - UTF-8 is an encoding. It is important to distinguish the two, because if you are not using UTF-8 but a different Unicode encoding (either in the regexp or in the searched string), it won't work.
So, if "you thought you tried that", that might be explained by an encoding-fubar :) It helps to be specific here about which Unicode encoding you tried to use in your regexp, which Unicode encoding you were trying to match against and - if it doesn't work - to give examples of specific strings/regexps where the results don't match your expectations. "unicode" just isn't the right term to use in this question :)</p></pre>
denise-bryson: <pre><p>When you say <code>my parser</code>, it sounds like you already have the data as a <code>[]rune</code>, possibly for reasons other than just the regexp matching.</p> <p>If that's the case, then you can also implement an <a href="https://golang.org/pkg/io/#RuneReader" rel="nofollow">io.RuneReader</a> and use <a href="https://golang.org/pkg/regexp/#Regexp.FindReaderSubmatchIndex" rel="nofollow">FindReaderSubmatchIndex</a>, <a href="https://golang.org/pkg/regexp/#Regexp.MatchReader" rel="nofollow">MatchReader</a> or <a href="https://golang.org/pkg/regexp/#Regexp.FindReaderIndex" rel="nofollow">FindReaderIndex</a>.</p> <p>play.golang.org doesn't allow me to share, so I'm attaching an untested and undocumented snippet below. Feel free to question anything that's not clear.</p> <pre><code>package main

import (
	"fmt"
	"io"
	"regexp"
)

// runeReader implements io.RuneReader over a []rune.
type runeReader struct {
	src []rune
	pos int
}

// ReadRune reports a size of 1 for every rune, so that the match indexes
// printed in main line up with positions in src rather than UTF-8 byte offsets.
func (r *runeReader) ReadRune() (rune, int, error) {
	if r.pos &gt;= len(r.src) {
		return -1, 0, io.EOF
	}
	nextRune := r.src[r.pos]
	r.pos++
	return nextRune, 1, nil
}

func main() {
	s := "Hello, 世界! 世 界 World 世界 World!"
	rs := []rune(s)
	re := regexp.MustCompile(`(?i)(\S+界 W\w+)`)

	fmt.Println("match:")
	fmt.Println(re.MatchString(s))
	fmt.Println(re.MatchReader(&amp;runeReader{src: rs}))

	fmt.Println("findIndex:")
	// Indexes into the string s (byte offsets).
	m := re.FindStringSubmatchIndex(s)
	fmt.Println(m, s[m[2]:m[3]])
	// Indexes into the rune slice rs.
	m = re.FindReaderSubmatchIndex(&amp;runeReader{src: rs})
	fmt.Println(m, string(rs[m[2]:m[3]]))
}
</code></pre></pre>
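For comparison, here is a minimal sketch of the simpler route singron describes: convert the `[]rune` to a string (which Go encodes as UTF-8) and use the ordinary string methods. The helper name `findRuneSpan` and the pattern `\p{L}+` are illustrative assumptions, not something from the thread. The one thing to watch is that the string methods return byte offsets, so the sketch translates them back into rune indexes in case the parser keeps working on the slice.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// identRe matches a run of Unicode letters; \p{L} is the RE2 class for "any letter".
// The pattern is only an example chosen for this sketch.
var identRe = regexp.MustCompile(`\p{L}+`)

// findRuneSpan is a hypothetical helper. It converts rs to a UTF-8 string,
// runs the regexp on it, and converts the byte offsets returned by
// FindStringIndex into indexes into rs. ok is false when nothing matches.
func findRuneSpan(re *regexp.Regexp, rs []rune) (start, end int, ok bool) {
	s := string(rs) // []rune -> UTF-8 encoded string
	loc := re.FindStringIndex(s)
	if loc == nil {
		return 0, 0, false
	}
	// Count the runes before the match and inside the match.
	start = utf8.RuneCountInString(s[:loc[0]])
	end = start + utf8.RuneCountInString(s[loc[0]:loc[1]])
	return start, end, true
}

func main() {
	rs := []rune("42 世界 hello")
	start, end, ok := findRuneSpan(identRe, rs)
	if !ok {
		fmt.Println("no match")
		return
	}
	fmt.Println(start, end, string(rs[start:end]))
}
```

The conversion copies the data once per call, so for a parser that matches repeatedly against a large input, keeping the text as a string (or using the io.RuneReader approach above) may be the better fit.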