Parsing text-based content / Large dictionary

Hey guys,

Hope you are all well!

I was wondering which keywords/concepts are used to concurrently detect known patterns in a file or a list of files.

Use case:
- Generate Alpine Dockerfiles based on all the dependencies included/required by a GitHub/GitLab repo, by scanning its source code.
- Enable the task through git hooks.

Context:
- 10,000 GitHub starred repositories
- 10,000 known patterns such as C++/Go/Node.js frameworks or packages, editors, or companies (e.g. https://github.com/porter-io/tagg-python/blob/master/tagg/default_defs.json)

Goals:
- Distributed and recursive file scanning of a GitHub repository
- Large dictionaries of words and combinations
- Create a report of matched patterns
- Generate a Dockerfile based on the detection, mapping each match to the corresponding apk package via the Alpine package repository (e.g. https://github.com/jessfraz/apk-file)

Questions:
- Are there any Go-based frameworks for detecting and tagging these known entities quickly and concurrently?
- Is there a better alternative to regex rules? Runes, perhaps?
- What are the key concepts I should study to understand the best approach to such a task?

References found:
- https://github.com/svent/sift
- https://github.com/jessfraz/apk-file
- https://github.com/DavidGamba/ffind

Thanks for any insights and advice!

Cheers,
Richard

---

**Comments:**

**epiris:**

Both your posts are pitched at too high a level: you're asking how to write a project rather than how to solve a specific problem you ran into while writing it. Effectively your "questions" amount to how to do software engineering in Go, and that doesn't have a single answer.

If you are just starting with Go, I suggest [getting started](https://golang.org/doc/install) and [Effective Go](https://golang.org/doc/effective_go.html), which may paint a partial picture of some of your questions. Then begin the project yourself, and if you get stuck, feel free to post here or join the Gophers Slack channel for a lower-latency feedback loop. Make sure not to pose an [XY problem](https://www.google.com/search?q=xy+problem); the best way to avoid that is to post a working playground example showing something as close as possible to what you want to do.

**dgryski:**

For the string-matching part, you haven't quite given enough details, but it sounds like you might want the Aho–Corasick algorithm: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm. There's an implementation from Cloudflare: https://github.com/cloudflare/ahocorasick
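To make that suggestion concrete, here is a minimal sketch (not from the thread) of matching a dictionary of known patterns against one file with the cloudflare/ahocorasick package. The dictionary entries and file path are placeholders, and the `NewStringMatcher`/`Match` calls reflect that package's API as I understand it; check the package docs before relying on it.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/cloudflare/ahocorasick"
)

func main() {
	// Hypothetical dictionary of known patterns (frameworks, packages, ...).
	// In practice this would be the ~10,000 entries loaded from something
	// like tagg-python's default_defs.json.
	dictionary := []string{"express", "gin-gonic/gin", "boost", "react"}

	// Building the automaton is the expensive step; do it once and reuse it.
	matcher := ahocorasick.NewStringMatcher(dictionary)

	// Placeholder file path for illustration.
	data, err := os.ReadFile("package.json")
	if err != nil {
		log.Fatal(err)
	}

	// Match returns the indices of dictionary entries found in the input.
	for _, idx := range matcher.Match(data) {
		fmt.Println("matched pattern:", dictionary[idx])
	}
}
```

The point of Aho–Corasick here is that one pass over each file checks all 10,000 patterns at once, instead of running 10,000 regexes per file.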
**HowardTheGrum:**

Distributed and recursive file scanning:

For the file scanning, my suggestion would be to look at the sift project: https://github.com/svent/sift

It is a code-focused grep alternative written in Go, and it already knows about things like reading .gitignore files, parallelising the search, ignoring irrelevant files, skipping binary files, etc.

It should give you at least an idea of how to concurrently consume a file set while matching regular expressions (see the worker-pool sketch after this comment). (Edited to add: I just saw that you do list sift in your references; I don't *think* it was there when I started writing this, but maybe I just missed it.)

As for the "distributed" part, this seems relatively straightforward, because the problem space is already chunked for you by being divided into files. You would have one app that receives the notification from a git hook that a new check-in has occurred (I assume; I'm not sure how else git hooks would be relevant). That request goes into a queue of apks to be produced. The app reads the repository's file list and hands the files off to your distributed workers as a JSON or gRPC request carrying a list of files plus a git URL+ref combo (or something similar) so the returned results can be rejoined. The workers submit their results either to that same master or to another worker that collates them into a master list of dependencies for that repository. Cache that somewhere, and have the master service recognise the git URL/ref pair and serve from the cache instead of streaming to the workers when it gets the same request again.

Then match those dependencies against your apks, find the ones missing from your list of already generated apks, and submit those back to the first service to be scanned for dependencies in turn. If there are NO missing items, the project is ready to be wrapped up into an apk, so either do that or hand it off to a service that does.

Finally, go through the design and add something to detect and break import loops, so you don't end up with infinite requests flowing between your services.
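As a rough, hypothetical illustration of the "concurrently consume a file set" part, the sketch below walks a repository with a fixed pool of goroutines and reports which dictionary patterns each file contains. The `scanRepo` helper and its parameters are invented for this example; a real version would swap the naive substring check for the Aho–Corasick matcher above and add .gitignore and binary-file handling the way sift does.

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// scanRepo walks root with a fixed number of worker goroutines and reports,
// per file, which dictionary patterns appear in it. Names are illustrative.
func scanRepo(root string, dictionary []string, workers int) map[string][]string {
	paths := make(chan string)
	report := make(map[string][]string) // file path -> matched patterns
	var mu sync.Mutex
	var wg sync.WaitGroup

	// Consumers: read each file and record any matching patterns.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				data, err := os.ReadFile(p)
				if err != nil {
					log.Printf("skip %s: %v", p, err)
					continue
				}
				var matches []string
				for _, pattern := range dictionary {
					// Naive substring check; use Aho–Corasick for
					// large dictionaries.
					if strings.Contains(string(data), pattern) {
						matches = append(matches, pattern)
					}
				}
				if len(matches) > 0 {
					mu.Lock()
					report[p] = matches
					mu.Unlock()
				}
			}
		}()
	}

	// Producer: recursively walk the repository and feed file paths.
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return filepath.SkipDir // skip git metadata
			}
			return nil
		}
		paths <- p
		return nil
	})
	close(paths)
	wg.Wait()
	if err != nil {
		log.Println("walk error:", err)
	}
	return report
}

func main() {
	// Placeholder repo path and dictionary.
	report := scanRepo(".", []string{"require(", "import ", "#include"}, 8)
	for file, patterns := range report {
		fmt.Println(file, "->", patterns)
	}
}
```

A distributed version would replace the in-process channel with the JSON/gRPC hand-off described in the comment above, each worker receiving a batch of file paths plus the git URL+ref so the collated results can be tied back to the right repository.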

