Download ClickHouse:
curl https://clickhouse.com/ | sh
Check out the documentation for git-import:
./clickhouse git-import --help
Then the tool can be run directly inside a git repository.
It will collect commits, file changes, and the changes to every
line of every file for further analysis.
It works well even on the largest repositories, such as Linux or Chromium.
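For a sense of what "further analysis" on the collected data can look like, here is a toy sketch outside ClickHouse itself, assuming a tab-separated commits export with author and files-changed columns (the column names here are an assumption for illustration, not the tool's exact output):

```python
import csv
import io
from collections import Counter

# Toy stand-in for a commits export; real git-import output columns may differ.
commits_tsv = """hash\tauthor\tfiles_changed
a1\talice\t3
b2\tbob\t1
c3\talice\t7
"""

reader = csv.DictReader(io.StringIO(commits_tsv), delimiter="\t")

# Count commits per author, the same aggregation the trivial query below does.
by_author = Counter(row["author"] for row in reader)

for author, n in by_author.most_common():
    print(f"{author}\t{n}")
```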
Example of a trivial query:
SELECT author AS k, count() AS c FROM line_changes WHERE
file_extension IN ('h', 'cpp') GROUP BY k ORDER BY c DESC LIMIT 20
Example of a non-trivial query: a matrix of authors showing how much code
written by one author is removed by another:
SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM (
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN (
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC LIMIT 500
Changing the content from an HTML page to a shell script based on the user agent is a pretty bad abuse of HTTP. Why not at least require `-H 'Accept: text/x-shellscript'`? Or be more basic and give the script its own URL.
These are totally legit concerns. That said, the site has behaved this way for quite some time, and many ClickHouse installation scripts may depend on it, so we will keep it for backward compatibility. We will add the usual install.sh URL later and start sharing it more often.
(Pull request is in ... it should be deployed on Monday, and then you can use https://clickhouse.com/install.sh.) Love the feedback, please keep it coming!
The concept is interesting but the data exposed makes it kind of hard to run interesting queries. Like, do we have a branch where the last commit was a year ago?
Also it seems up arrow doesn't work to recall the previous command?
We have a data lake with our commits at work. I use it to look at developer activity data, not for performance reasons but to see which of our code bases are active or not. I look at things like iteration time, how long a PR/CR sits, number of revisions, etc. This lets me see where we have opportunities to speed up development loops for engineers. It also lets me understand whether a service that is performing poorly or growing in cost is under active development. I'll find outliers and go dig into them. This also lets me justify work on tooling, because I can quantify the problems and thus the benefits.
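A minimal sketch of one metric the parent describes, computing how long PRs sit open per repo; the records and field names here are invented for illustration, not the commenter's actual schema:

```python
from datetime import datetime

# Hypothetical PR records; a real data lake would have many more fields.
prs = [
    {"repo": "svc-a", "opened": "2024-01-01", "merged": "2024-01-03", "revisions": 2},
    {"repo": "svc-a", "opened": "2024-01-05", "merged": "2024-01-15", "revisions": 6},
    {"repo": "svc-b", "opened": "2024-01-02", "merged": "2024-01-03", "revisions": 1},
]

def days_open(pr):
    """Number of days between opening and merging a PR."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(pr["merged"], fmt)
            - datetime.strptime(pr["opened"], fmt)).days

# Average days a PR sits, per repo: a quick proxy for iteration time.
by_repo = {}
for pr in prs:
    by_repo.setdefault(pr["repo"], []).append(days_open(pr))

avg_days = {repo: sum(d) / len(d) for repo, d in by_repo.items()}
```

Outliers then fall out of a simple sort of `avg_days`.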
SQLite’s source itself is being managed using Fossil, which uses an SQLite DB as a backing store (instead of a .git directory or similar). That’s a clever use, since the .git dir basically reinvents a database.
SQLite VTables are great, but I also want some customization inspired by MySQL, for example user-defined variables and functions, and maybe PUSH or PULL statements and a way to control which features are enabled.
I like the idea, and I like SQL, but I feel like this is a lot more typing for the same information you could extract using just git commands. Am I missing something? Take, for instance, the commit example:
select name, count(name) from commits group by name
These SQL-like commands are probably meant to be familiar and guessable, not terse. And you can change the query instead of using all sorts of utility commands in some ad hoc pipeline (or looking up whatever switch they threw in to toggle things like the SHA1 as the first column), if you don't like the latter.
‘select * from branches’ is explicitly and unambiguously going to give me information, not change state. I always feel apprehensive with ‘git branch’ and end up reading the git-branch man page with its 500 options. For an SQL native, GQL looks wonderful.
I mean, you intentionally chose the most basic of the queries and translated them into different git CLI commands.
SELECT name, COUNT(name) AS commit_num FROM commits GROUP BY name ORDER BY commit_num DESC LIMIT 10
Something like that is more complex and less easily gleaned from the git CLI. Or take something else that's dead simple to remember how to do in SQL, versus probably having to open up the git manpages or cobble some script together to do:
SELECT name, commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10
I used the examples they provided on the repo README. I would still argue I can get the same results with less typing.
git shortlog -sn | sort -r | head -n 10
If I run the 2nd command example against the Chromium repo, it takes several minutes, where a shell script using git for-each-ref takes about 8 seconds. I would also argue that this might be better expressed as commit_count <= 10, but I realize you're also using an example. This is the script I ran:
As weaksauce mentioned, the main goal at this point is not speed but correctness and supporting more SQL features. Performance is still very important, though: the next release should be about 20% faster because of the migration to gix, and once the language and engine features are stable we will work more on optimization.
> I used the examples they provided on the repo README.
I was using the more complex examples from their README as well (except added a name to the second one to make it more useful).
> I would still argue I can get the same results with less typing.
So what? We type words and sentences much more quickly than control characters like - and |. Also, you still have to know the invocations of sort and head off the top of your head without looking them up, whereas a simple SQL SELECT statement is something most programmers could write in their sleep.
I think we can agree this is much more complicated to write than:
SELECT name, commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10
> If I run the 2nd command example against the chromium repo, it takes several minutes where a shell script using git for-each-ref takes about 8 seconds. I would also argue that this might be better expressed as commit_count <= 10 but I realize you're also using an example. This is the script I ran:
If speed matters, then your script is fine to bang out when you need it. But for one-off queries that you know to be correct (look at how simple the second query is), it is probably fine, and most people aren't working on Chromium, so speed is probably never going to be an issue in the first place.
Surprised by the variety of answers here. My view on this is that I won't use this on a daily basis. I doubt most will. But I'd love to remember this exists when I want to extract a very specific piece of information from git.
The goal is to provide most SQL features, so even if a query can be replaced by a git command we should still support it, leaving the user free to choose either tool. Also, to filter git branches you need to get them into some format and use grep on Linux, but it's very easy for most developers to express the same filter in SQL.
The existence of this GQL is yet another example showing the failure of the unix model, where programs output text instead of structured data. Now we need a new query language and a new external adapter program just to do a basic select. Yet we call it progress.
Imagine instead if there was a standard xQL that could run on any tool.
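To make the "one query language over structured tool output" idea concrete, here is a sketch where rows (a stand-in for what a structured-output git could emit) are loaded into SQLite and queried; the table name and columns are invented, and SQLite merely stands in for the hypothetical shared engine:

```python
import sqlite3

# Stand-in for structured output from some tool: (hash, author) rows.
rows = [
    ("a1", "alice"),
    ("b2", "bob"),
    ("c3", "alice"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, author TEXT)")
conn.executemany("INSERT INTO commits VALUES (?, ?)", rows)

# The same SELECT would work no matter which tool produced the rows.
top = conn.execute(
    "SELECT author, COUNT(*) AS c FROM commits GROUP BY author ORDER BY c DESC"
).fetchall()
```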
We have a special language for monitoring our metrics at work. It's fussy and obscure. I replaced it with SQL on our team, which despite "not being good for timeseries data" works fine for this.
This is a super cool idea, I've been playing with Git internals and having this as a tool/reference will be super helpful. Thanks for sharing!
I don't see it in the code, but have you considered creating secondary indexes to speed up some operations? Things like looking up the path of a deleted file in a large repo?
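As a sketch of what such a secondary index buys, here is a toy in-memory version mapping file paths to the commits that touched them; the data and shape are invented for illustration and say nothing about how the tool stores things:

```python
from collections import defaultdict

# Toy commit log: (commit_hash, [paths touched]).
log = [
    ("a1", ["src/main.c", "README.md"]),
    ("b2", ["src/util.c"]),
    ("c3", ["src/main.c"]),
]

# Secondary index: path -> commits that touched it. Built once; lookups
# afterwards are dict hits instead of scans over every commit in the repo,
# which is what makes finding a deleted file's history fast in a large repo.
index = defaultdict(list)
for commit, paths in log:
    for path in paths:
        index[path].append(commit)
```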
What would a more complex query look like with that, for example, querying for any commit that affected any file in a list of files, by a specific author?
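One plausible shape for that query, sketched here against SQLite with a toy schema inspired by the thread's examples; the table and column names are my assumptions, not the tool's actual schema:

```python
import sqlite3

# Toy schema: commits plus a changes table linking commits to paths.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (hash TEXT, author TEXT, title TEXT);
CREATE TABLE changes (hash TEXT, path TEXT);
INSERT INTO commits VALUES
  ('a1', 'alice', 'fix parser'),
  ('b2', 'bob',   'add docs'),
  ('c3', 'alice', 'refactor io');
INSERT INTO changes VALUES
  ('a1', 'src/parser.c'),
  ('b2', 'README.md'),
  ('c3', 'src/io.c');
""")

# Commits by a specific author that touched any file in a given list.
hits = conn.execute("""
    SELECT DISTINCT c.hash, c.title
    FROM commits AS c
    JOIN changes AS ch ON ch.hash = c.hash
    WHERE c.author = 'alice'
      AND ch.path IN ('src/parser.c', 'src/io.c')
    ORDER BY c.hash
""").fetchall()
```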
Unfortunately this is going to need a name change, or at least modified initials, as GQL is for Graph Query Language [https://www.gqlstandards.org/] which is being standardized and will be the primary way the world uses Property Graphs like Neo4j. Can't compete with that.
"Famously bad" is an opinion cloaked in adverbial authority. Plenty of people like both, and even if they don't like them, they think better in one vs the other.
That git(1) (the user interface) is badly designed is an opinion shared by many people, including many on the Git mailing list. Backwards compatibility is mostly used to argue for the status quo, not that the status quo is good.