GQL – Git Query Language (github.com/amrdeveloper)
240 points by amrdeveloper on Dec 2, 2023 | hide | past | favorite | 64 comments



Lovely! I only learnt today from my colleagues at ClickHouse that ClickHouse has a git-import tool. So if you also want to give it a go:

Download clickhouse: curl https://clickhouse.com/ | sh

Check out documentation for git-import: ./clickhouse git-import --help

Then the tool can be run directly inside a git repository. It will collect data like commits, file changes, and changes of every line in every file for further analysis. It works well even on the largest repositories like Linux or Chromium.

Example of a trivial query:

    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE file_extension IN ('h', 'cpp')
    GROUP BY k ORDER BY c DESC LIMIT 20

Example of some non-trivial query - a matrix of authors, how much code of one author is removed by another:

    SELECT k, written_code.c, removed_code.c,
           round(removed_code.c * 100 / written_code.c) AS remove_ratio
    FROM (
        SELECT author AS k, count() AS c FROM line_changes
        WHERE sign = 1 AND file_extension IN ('h', 'cpp')
          AND line_type NOT IN ('Punct', 'Empty')
        GROUP BY k
    ) AS written_code
    INNER JOIN (
        SELECT prev_author AS k, count() AS c FROM line_changes
        WHERE sign = -1 AND file_extension IN ('h', 'cpp')
          AND line_type NOT IN ('Punct', 'Empty') AND author != prev_author
        GROUP BY k
    ) AS removed_code USING (k)
    WHERE written_code.c > 1000
    ORDER BY c DESC LIMIT 500
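The shape of that join can be checked on toy data with Python's built-in sqlite3 (the schema and values below are invented for illustration; ClickHouse's real line_changes table has many more columns, and sign = 1 / -1 marks added / removed lines as in the query above):

```python
import sqlite3

# Toy stand-in for ClickHouse's line_changes table (simplified schema).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE line_changes
               (author TEXT, prev_author TEXT, sign INTEGER)""")
con.executemany("INSERT INTO line_changes VALUES (?, ?, ?)", [
    ("alice", None,    1),   # alice writes two lines
    ("alice", None,    1),
    ("bob",   "alice", -1),  # bob removes one of alice's lines
])

# Same join shape as the "who removes whose code" matrix, minus the
# ClickHouse-specific filters (file_extension, line_type, c > 1000).
rows = con.execute("""
    SELECT k, w.c AS written, r.c AS removed,
           ROUND(r.c * 100.0 / w.c) AS remove_ratio
    FROM (SELECT author AS k, COUNT(*) AS c
          FROM line_changes WHERE sign = 1 GROUP BY k) AS w
    JOIN (SELECT prev_author AS k, COUNT(*) AS c
          FROM line_changes
          WHERE sign = -1 AND author != prev_author GROUP BY k) AS r
    USING (k)
""").fetchall()
print(rows)  # [('alice', 2, 1, 50.0)]
```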


> Download clickhouse: curl https://clickhouse.com/ | sh

Does this check the useragent to change the response? Clicking that link shows their home page.


That is exactly what it does ;) If you don't feel comfortable with curl | sh, you can download the ClickHouse binary from the releases page here: https://github.com/ClickHouse/ClickHouse/releases

;)


Changing the content from an HTML page to a shell script based on the User-Agent is a pretty bad abuse of HTTP. Why not at least require `-H 'Accept: text/x-shellscript'`? Or, more basically, give the script its own URL.
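The dispatch being objected to presumably looks something like this sketch (the function name and exact matching rules are made up here; ClickHouse's actual server logic isn't shown in the thread):

```python
def pick_response(user_agent: str) -> str:
    # Hypothetical User-Agent dispatch: CLI fetchers get the installer,
    # everything else gets the normal homepage.
    cli_tools = ("curl/", "Wget/", "wget/")
    if user_agent.startswith(cli_tools):
        return "install.sh"    # shell script for `curl ... | sh` users
    return "index.html"        # HTML homepage for browsers

print(pick_response("curl/8.4.0"))   # install.sh
print(pick_response("Mozilla/5.0"))  # index.html
```

The critique in the comment is that the resource a URL names shouldn't change type depending on who asks; a dedicated install.sh URL avoids the branching entirely.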


Based on what reasoning? (Honestly curious)


If I want to download your homepage with curl to read offline, I get a script? If I fetch the installer with a tool you don't recognize, do I execute HTML?

If I run curl on Windows, do I get this script? A PowerShell version?

Why not make it https://clickhouse.com/linux-installer?


These are totally legit concerns. The behaviour of the site has been around for quite some time, though, and many ClickHouse installation scripts may depend on it, so we will keep it for backward compatibility. We will add the usual install.sh URL later and start sharing it more often.

(Pull request is in ... it should be deployed on Monday and you can use https://clickhouse.com/install.sh ). Love the feedback, please keep it coming!


Because someone may want to preview the script in browser.

Because someone may not have curl and use another tool your server doesn't know.


To what resource does this URL (Uniform Resource Locator) refer? A web page or a script?


The concept is interesting but the data exposed makes it kind of hard to run interesting queries. Like, do we have a branch where the last commit was a year ago?

Also, it seems the up arrow doesn't work to recall the previous command?


We have a data lake with our commits at work. I use it to look at developer activity data, not for performance reasons but to see which of our code bases are active. I look at things like iteration time, how long a PR/CR sits, number of revisions, etc. This lets me see where we have opportunities to speed up development loops for engineers, and it also lets me understand whether a service that is performing poorly or growing in cost is under active development. I'll find outliers and go dig into them. It also lets me justify work on tooling, because I can quantify the problems and thus the benefits.
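The "how long a PR sits" metric mentioned here is a simple aggregation once the data is in a table. A hedged sketch with sqlite3 on invented data (the table and columns are made up; a real data lake would have richer timestamps):

```python
import sqlite3

# Toy PR table: one row per pull request, with open/merge days.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prs (repo TEXT, opened_day INTEGER, merged_day INTEGER)")
con.executemany("INSERT INTO prs VALUES (?, ?, ?)", [
    ("svc-a", 1, 2),   # merged the next day
    ("svc-a", 3, 9),   # sat for nearly a week
    ("svc-b", 1, 1),   # merged same day
])

# Average days a PR sits open, per repo -- an "iteration time" proxy.
rows = con.execute("""
    SELECT repo, AVG(merged_day - opened_day) AS avg_days_open
    FROM prs GROUP BY repo ORDER BY avg_days_open DESC
""").fetchall()
print(rows)  # [('svc-a', 3.5), ('svc-b', 0.0)]
```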


The diff information exposed is also quite anemic - just which files changed.


Have you considered using SQLite vtables for this?


+1 Being able to browse with an sqlite ui would be lovely.


SQLite’s source itself is being managed using Fossil, which uses an SQLite DB as a backing store (instead of a .git directory or similar). That’s a clever use, since the .git dir basically reinvents a database.


SQLite vtables are great, but I also want some customization inspired by MySQL, for example user-defined variables and functions, and maybe to add PUSH or PULL statements and control the features.


Tried something similar here: https://github.com/shayonj/branch_base

Lmk what you think :)


I like the idea, and I like sql, but I feel like this is a lot more typing for the same information you could extract using just git commands. Am I missing something? Take for instance:

The commit example:

    select name, count(name) from commits group by name
is actually:

    git shortlog -sn
The tag example:

    select * from tags
is actually:

    git tag
The branch example:

    select * from branches
is actually:

    git branch
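For what it's worth, the group-by in the first example really is just a frequency count, which is also what `git shortlog -sn` prints. A toy sqlite3 check on invented author data shows the equivalence:

```python
import sqlite3
from collections import Counter

# Invented commit authors standing in for a real repo.
authors = ["alice", "bob", "alice", "alice", "bob"]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE commits (name TEXT)")
con.executemany("INSERT INTO commits VALUES (?)", [(a,) for a in authors])

# The GROUP BY query computes the same per-author tallies as a Counter.
sql_counts = dict(con.execute(
    "SELECT name, COUNT(name) FROM commits GROUP BY name").fetchall())
assert sql_counts == Counter(authors)  # {'alice': 3, 'bob': 2}
```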


It may be more typing but I can write SQL in my sleep whereas the git CLI I cannot.

Though if I regularly needed the information that this tool retrieves I would probably have memorized the relevant CLI commands by now.


> git branch

Yep. And to list worktrees you use:

    $ git worktree
    error: need a subcommand
Oh woops. But to show all refs instead of just the branches you surely just:

    $ git show-ref
    0003692409f153dd725b3455dfc2e128276cfbe2 refs/branchless/0003692409f153dd725b3455dfc2e128276cfbe2
No. But I can just cut(1) out that SHA1...

These SQL-like commands are probably meant to be familiar and guessable. Not terse. And then you can change the query instead of using all sorts of utility commands in some ad hoc pipeline (or look up whatever switch they threw in to turn on and off things like SHA1 as first column). If you don’t like the latter.


‘select * from branches’ is explicitly and unambiguously going to give me information, not change state. I always feel apprehensive with ‘git branch’ and end up reading the git-branch man page with its 500 options. For an SQL native GQL looks wonderful.


I mean, you intentionally chose the most basic of the queries and translated them into different git CLI commands.

    SELECT name, COUNT(name) AS commit_num FROM commits GROUP BY name ORDER BY commit_num DESC LIMIT 10
something like that is more complex and less easily gleaned from the git CLI. or something else that's dead simple to remember how to do in SQL vs probably having to open up the git manpages or cobble some script together to do:

   SELECT name, commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10
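The `0 .. 10` range spelling above appears to be GQL's; standard SQL writes `BETWEEN 0 AND 10`. The same filter on an invented branches table in sqlite3:

```python
import sqlite3

# Toy branches table (names and counts invented for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE branches (name TEXT, commit_count INTEGER)")
con.executemany("INSERT INTO branches VALUES (?, ?)", [
    ("main", 5000), ("feature/tiny", 3), ("hotfix", 10),
])

# Inclusive range filter: keeps branches with 0..10 commits.
rows = con.execute("""
    SELECT name, commit_count FROM branches
    WHERE commit_count BETWEEN 0 AND 10
    ORDER BY commit_count
""").fetchall()
print(rows)  # [('feature/tiny', 3), ('hotfix', 10)]
```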


I used the examples they provided on the repo README. I would still argue I can get the same results with less typing.

  git shortlog -sn | sort -r | head -n 10
If I run the 2nd command example against the chromium repo, it takes several minutes, whereas a shell script using git for-each-ref takes about 8 seconds. I would also argue that this might be better expressed as commit_count <= 10, but I realize you're also using an example. This is the script I ran:

  git for-each-ref --format='%(refname:short)' refs/heads/ | while read branch; do
    commit_count=$(git rev-list --count "$branch")
    [ "$commit_count" -le 10 ] && echo "Branch: $branch, Commit Count: $commit_count"
  done


As weaksauce mentioned, the main goal at this point is not speed but correctness and supporting more SQL features. Performance is still very important, though: the next release should be about 20% faster because of the migration to gix, and once the language and engine features are stable we will work more on optimization.


> I used the examples they provided on the repo README.

I was using the more complex examples from their README as well (except added a name to the second one to make it more useful).

> I would still argue I can get the same results with less typing.

so what? we type words and sentences much quicker than having to type control characters like - and |. also, you still have to know the invocations to sort and head off the top of your head without looking them up whereas a simple sql select statement is something most programmers would know how to do in their sleep.

  git for-each-ref --format='%(refname:short)' refs/heads/ | while read branch; do
    commit_count=$(git rev-list --count "$branch")
    [ "$commit_count" -le 10 ] && echo "Branch: $branch, Commit Count: $commit_count"
  done
I think we can agree this is much more complicated to write than:

   SELECT name, commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10
> If I run the 2nd command example against the chromium repo, it takes several minutes where a shell script using git for-each-ref takes about 8 seconds. I would also argue that this might be better expressed as commit_count <= 10 but I realize you're also using an example. This is the script I ran:

if speed matters then your script is fine to bang out if you need it but for one off queries that you know to be correct (look at how simple the second query is) it is probably fine and most people aren't working on chromium so speed is probably never going to be an issue in the first place.


This approach falls flat when you start doing analytical queries, and it assumes you already know the git incantations, which most people don't.


The git CLI is notoriously irregular, as can be seen by your examples. Having a clean, regular language to query git objects seems valuable.


Surprised by the variety of answers here. My view on this is that I won't use this on a daily basis, and I doubt most will. But I'd love to remember this exists when I want to extract a very specific piece of information from git.


SQL is a more well-known DSL than the git CLI.


1. SQL is more consistent

2. SQL is more capable

3. SQL is more well-known


At least the SQL-like syntax is more consistent. It might be easier to figure out how to make any given query starting from the SQL “schema”.


The goal is to provide most SQL features, so even if a query can be replaced by a command we should support it, leaving the user free to choose either tool. Also, to filter git branches you need to format the output and use grep on Linux, but it's very easy for most developers to filter with SQL.


It reduces the amount to learn. If you know SQL, you only need to remember the commands to list the tables and describe them. Then you are productive.

In addition, SQL is a powerful declarative language with joins! So that is handy too.
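In SQLite, for instance, listing the tables is a single query against sqlite_master (GQL's own introspection, if any, may differ; this is just the SQLite analogue):

```python
import sqlite3

# Two toy tables, then introspect the schema catalog.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE commits (name TEXT)")
con.execute("CREATE TABLE branches (name TEXT)")

# sqlite_master holds one row per schema object; filter to tables.
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['branches', 'commits']
```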


The existence of this GQL is yet another example showing the failure of the unix model, where programs output text instead of structured data. Now we need a new query language and a new external adapter program just to do a basic select. Yet we call it progress.

Imagine instead if there was a standard xQL that could run on any tool.


Apache Calcite has this as an integration, as does SQLite through virtual tables.

Nonetheless, nice work.


YESSSS!! All those years of tuning queries paid off. I knew SQL was immortal.


We have a special language for monitoring our metrics at work. It's fussy and obscure. I replaced it with SQL on our team, which despite "not being good for timeseries data" works fine for this.


This is a super cool idea, I've been playing with Git internals and having this as a tool/reference will be super helpful. Thanks for sharing!

I don't see it in the code, but have you considered creating secondary indexes to speed up some operations? Things like looking up the path of a deleted file in a large repo?


Pardon my imprudence here, but it looks way worse than Mercurial's templates and revsets:

https://repo.mercurial-scm.org/hg/help/template

https://repo.mercurial-scm.org/hg/help/revset

(but arguably it enables a few things, like aggregations, that would require the churn extension)


What would a more complex query look like with that, for example, querying for any commit that affected any file in a list of files, by a specific author?


Amazing work! Just tried it out on a couple of my repos and it works great. Thank you, starred on Github ;)

I would like to request a feature: can you allow the output produced by GQL to be in different serializable formats, like JSON?

I see a lot of potential in this if the GQL output could be piped to C# LINQ queries.
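What JSON output from such a tool might look like: rows serialized as a list of objects keyed by column name (GQL's eventual format, if it adds one, could of course differ; the data here is invented):

```python
import json
import sqlite3

# Query rows as dicts, then serialize to JSON.
con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row  # rows become name-addressable
con.execute("CREATE TABLE commits (name TEXT, title TEXT)")
con.execute("INSERT INTO commits VALUES ('alice', 'fix parser')")

rows = [dict(r) for r in con.execute("SELECT * FROM commits")]
print(json.dumps(rows))  # [{"name": "alice", "title": "fix parser"}]
```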



Unfortunate name, given that GraphQL already exists.

https://en.wikipedia.org/wiki/Graph_Query_Language


You say "GraphQL" but link to "Graph Query Language (GQL)."

Those are two entirely separate things.

EDIT: And neither of these should be confused with "Graph Query Language (GSQL)" by TigerGraph: https://www.tigergraph.com/gsql/


my mistake points out just how bad a name gql is!


I've always seen it written as GraphQL, rather than GQL, though that wikipedia page indeed calls it GQL multiple times.


Because that one is apparently unrelated to GraphQL. Horrible naming.


Also an ISO track thing https://www.gqlstandards.org/


I guess it's not the only unfortunate name considering your link begins with the words "Not to be confused with GraphQL".


Looks very cool. Does this produce a shell snippet with the appropriate git commands? I like the idea of this being a very thin frontend to git.


It uses the gix library, which is a ground-up reimplementation of git in Rust, so the git CLI isn't involved at any point.


Unfortunately this is going to need a name change, or at least modified initials, as GQL is for Graph Query Language [https://www.gqlstandards.org/] which is being standardized and will be the primary way the world uses Property Graphs like Neo4j. Can't compete with that.


Looks pretty awesome. gonna try it!


Would be cool to have duckdb be able to read git commit history as a data source.


Looks cool. Reminds me of Steampipe except this works on a local git repository.


And it will be called "geequal".


This seems to be converting a graph/tree data model to a relational data model? Hmm...


Why would someone use SQL, famously bad language, as a query language for git, famously badly designed CLI?


"Famously bad" is an opinion cloaked in adverbial authority. Plenty of people like both, and even if they don't like them, they think better in one vs the other.


That git(1) (the user interface) is badly designed is an opinion shared by many people, including many on the Git mailing list. Backwards compatibility is mostly used to argue for the status quo, not that the status quo is good.


All the more reason for alternative interfaces.


SQL is OK, it just needs to be refined a bit. This was such an opportunity, but they didn't seize it.


I like your phrase “adverbial authority”


For me, I think SQL is great for most of the features here, and the most important thing is that people don't need to learn a full new language to use GitQL.



