Download ClickHouse:
curl https://clickhouse.com/ | sh
Check out the documentation for git-import:
./clickhouse git-import --help
Then the tool can be run directly inside a git repository.
It will collect commits, file changes, and the changes to every
line of every file for further analysis.
It works well even on the largest repositories, such as Linux or Chromium.
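For a sense of what "further analysis" on the collected data can look like, here is a toy sketch outside ClickHouse itself, assuming a tab-separated commits export with author and files-changed columns (the column names here are an assumption for illustration, not the tool's exact output):

```python
import csv
import io
from collections import Counter

# Toy stand-in for a commits export; real git-import output columns may differ.
commits_tsv = """hash\tauthor\tfiles_changed
a1\talice\t3
b2\tbob\t1
c3\talice\t7
"""

reader = csv.DictReader(io.StringIO(commits_tsv), delimiter="\t")

# Count commits per author, the same aggregation the trivial query below does.
by_author = Counter(row["author"] for row in reader)

for author, n in by_author.most_common():
    print(f"{author}\t{n}")
```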
Example of a trivial query:
SELECT author AS k, count() AS c FROM line_changes WHERE
file_extension IN ('h', 'cpp') GROUP BY k ORDER BY c DESC LIMIT 20
Example of a non-trivial query: a matrix of authors showing how much code
written by one author is removed by another:
SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM (
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN (
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC LIMIT 500
Changing the content from an HTML page to a shell script based on the user agent is a pretty bad abuse of HTTP. Why not at least require `-H 'Accept: text/x-shellscript'`? Or be more basic and give the script its own URL.
These are totally legit concerns. That said, the site has behaved this way for quite some time, and many ClickHouse installation scripts may depend on it, so we will keep it for backward compatibility. We will add the usual install.sh URL later and start sharing it more often.
(Pull request is in ... it should be deployed on Monday, and then you can use https://clickhouse.com/install.sh.) Love the feedback, please keep it coming!
The concept is interesting but the data exposed makes it kind of hard to run interesting queries. Like, do we have a branch where the last commit was a year ago?
Also it seems up arrow doesn't work to recall the previous command?
We have a data lake with our commits at work. I use it to look at developer activity data, not for performance reasons but to see which of our code bases are active or not. I look at things like iteration time, how long a PR/CR sits, number of revisions, etc. This lets me see where we have opportunities to speed up development loops for engineers. It also lets me understand whether a service that is performing poorly or growing in cost is under active development. I'll find outliers and go dig into them. This also lets me justify work on tooling, because I can quantify the problems and thus the benefits.
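A minimal sketch of one metric the parent describes, computing how long PRs sit open per repo; the records and field names here are invented for illustration, not the commenter's actual schema:

```python
from datetime import datetime

# Hypothetical PR records; a real data lake would have many more fields.
prs = [
    {"repo": "svc-a", "opened": "2024-01-01", "merged": "2024-01-03", "revisions": 2},
    {"repo": "svc-a", "opened": "2024-01-05", "merged": "2024-01-15", "revisions": 6},
    {"repo": "svc-b", "opened": "2024-01-02", "merged": "2024-01-03", "revisions": 1},
]

def days_open(pr):
    """Number of days between opening and merging a PR."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(pr["merged"], fmt)
            - datetime.strptime(pr["opened"], fmt)).days

# Average days a PR sits, per repo: a quick proxy for iteration time.
by_repo = {}
for pr in prs:
    by_repo.setdefault(pr["repo"], []).append(days_open(pr))

avg_days = {repo: sum(d) / len(d) for repo, d in by_repo.items()}
```

Outliers then fall out of a simple sort of `avg_days`.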
SQLite’s source itself is being managed using Fossil, which uses an SQLite DB as a backing store (instead of a .git directory or similar). That’s a clever use, since the .git dir basically reinvents a database.
SQLite VTables are great, but I also want some customization inspired by MySQL, for example user-defined variables and functions, and maybe PUSH or PULL statements and a way to control which features are enabled.
I like the idea, and I like SQL, but I feel like this is a lot more typing for the same information you could extract using just git commands. Am I missing something? Take, for instance, the commit example:
select name, count(name) from commits group by name
These SQL-like commands are probably meant to be familiar and guessable, not terse. And you can change the query instead of using all sorts of utility commands in some ad hoc pipeline (or looking up whatever switch they threw in to toggle things like the SHA1 as the first column), if you don't like the latter.
‘select * from branches’ is explicitly and unambiguously going to give me information, not change state. I always feel apprehensive with ‘git branch’ and end up reading the git-branch man page with its 500 options. For an SQL native, GQL looks wonderful.
I mean, you intentionally chose the most basic of the queries and translated them into different git CLI commands.
SELECT name, COUNT(name) AS commit_num FROM commits GROUP BY name ORDER BY commit_num DESC LIMIT 10
Something like that is more complex and less easily gleaned from the git CLI. Or take something else that's dead simple to remember how to do in SQL, versus probably having to open up the git manpages or cobble some script together to do:
SELECT name, commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10
I used the examples they provided on the repo README. I would still argue I can get the same results with less typing.
git shortlog -sn | sort -r | head -n 10
If I run the 2nd command example against the Chromium repo, it takes several minutes, where a shell script using git for-each-ref takes about 8 seconds. I would also argue that this might be better expressed as commit_count <= 10, but I realize you're also using an example. This is the script I ran:
As weaksauce mentioned, the main goal at this point is not speed but correctness and supporting more SQL features. Performance is still very important, though: the next release should be about 20% faster because of the migration to gix, and once the language and engine features are stable we will work more on optimization.
> I used the examples they provided on the repo README.
I was using the more complex examples from their README as well (except added a name to the second one to make it more useful).
> I would still argue I can get the same results with less typing.
So what? We type words and sentences much more quickly than control characters like - and |. Also, you still have to know the invocations of sort and head off the top of your head without looking them up, whereas a simple SQL SELECT statement is something most programmers could write in their sleep.
I think we can agree this is much more complicated to write than:
SELECT name, commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10
> If I run the 2nd command example against the chromium repo, it takes several minutes where a shell script using git for-each-ref takes about 8 seconds. I would also argue that this might be better expressed as commit_count <= 10 but I realize you're also using an example. This is the script I ran:
If speed matters, then your script is fine to bang out when you need it. But for one-off queries that you know to be correct (look at how simple the second query is), it is probably fine, and most people aren't working on Chromium, so speed is probably never going to be an issue in the first place.
Surprised by the variety of answers here. My view on this is that I won't use this on a daily basis. I doubt most will. But I'd love to remember this exists when I want to extract a very specific piece of information from git.
The goal is to provide most SQL features, so even if a query can be replaced by a git command we should still support it, leaving the user free to choose either tool. Also, to filter git branches you need to get them into some format and use grep on Linux, but it's very easy for most developers to express the same filter in SQL.
The existence of this GQL is yet another example showing the failure of the unix model, where programs output text instead of structured data. Now we need a new query language and a new external adapter program just to do a basic select. Yet we call it progress.
Imagine instead if there was a standard xQL that could run on any tool.
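To make the "one query language over structured tool output" idea concrete, here is a sketch where rows (a stand-in for what a structured-output git could emit) are loaded into SQLite and queried; the table name and columns are invented, and SQLite merely stands in for the hypothetical shared engine:

```python
import sqlite3

# Stand-in for structured output from some tool: (hash, author) rows.
rows = [
    ("a1", "alice"),
    ("b2", "bob"),
    ("c3", "alice"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, author TEXT)")
conn.executemany("INSERT INTO commits VALUES (?, ?)", rows)

# The same SELECT would work no matter which tool produced the rows.
top = conn.execute(
    "SELECT author, COUNT(*) AS c FROM commits GROUP BY author ORDER BY c DESC"
).fetchall()
```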
We have a special language for monitoring our metrics at work. It's fussy and obscure. I replaced it with SQL on our team, which despite "not being good for timeseries data" works fine for this.
This is a super cool idea, I've been playing with Git internals and having this as a tool/reference will be super helpful. Thanks for sharing!
I don't see it in the code, but have you considered creating secondary indexes to speed up some operations? Things like looking up the path of a deleted file in a large repo?
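As a sketch of what such a secondary index buys, here is a toy in-memory version mapping file paths to the commits that touched them; the data and shape are invented for illustration and say nothing about how the tool stores things:

```python
from collections import defaultdict

# Toy commit log: (commit_hash, [paths touched]).
log = [
    ("a1", ["src/main.c", "README.md"]),
    ("b2", ["src/util.c"]),
    ("c3", ["src/main.c"]),
]

# Secondary index: path -> commits that touched it. Built once; lookups
# afterwards are dict hits instead of scans over every commit in the repo,
# which is what makes finding a deleted file's history fast in a large repo.
index = defaultdict(list)
for commit, paths in log:
    for path in paths:
        index[path].append(commit)
```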
What would a more complex query look like with that, for example, querying for any commit that affected any file in a list of files, by a specific author?
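One plausible shape for that query, sketched here against SQLite with a toy schema inspired by the thread's examples; the table and column names are my assumptions, not the tool's actual schema:

```python
import sqlite3

# Toy schema: commits plus a changes table linking commits to paths.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (hash TEXT, author TEXT, title TEXT);
CREATE TABLE changes (hash TEXT, path TEXT);
INSERT INTO commits VALUES
  ('a1', 'alice', 'fix parser'),
  ('b2', 'bob',   'add docs'),
  ('c3', 'alice', 'refactor io');
INSERT INTO changes VALUES
  ('a1', 'src/parser.c'),
  ('b2', 'README.md'),
  ('c3', 'src/io.c');
""")

# Commits by a specific author that touched any file in a given list.
hits = conn.execute("""
    SELECT DISTINCT c.hash, c.title
    FROM commits AS c
    JOIN changes AS ch ON ch.hash = c.hash
    WHERE c.author = 'alice'
      AND ch.path IN ('src/parser.c', 'src/io.c')
    ORDER BY c.hash
""").fetchall()
```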
Unfortunately this is going to need a name change, or at least modified initials, as GQL is for Graph Query Language [https://www.gqlstandards.org/] which is being standardized and will be the primary way the world uses Property Graphs like Neo4j. Can't compete with that.
"Famously bad" is an opinion cloaked in adverbial authority. Plenty of people like both, and even if they don't like them, they think better in one vs the other.
That git(1) (the user interface) is badly designed is an opinion shared by many people, including many on the Git mailing list. Backwards compatibility is mostly used to argue for the status quo, not that the status quo is good.