2024-11-04 AI-supported extracting data from JSON
Simon Willison recently wrote a very interesting blog entry
about using an LLM to generate a jq program
that extracts data from a given JSON file. Of course I had
some fun reproducing that with the chatgpt
tool from my ChatGPT Toolbox -
with some twists and improvements.
The idea
Assume you have a JSON file that is somewhat complicated - I'll shamelessly re-use Simon Willison's
example from his blog - and you want to extract some data from
it - in this case, find the top 3 users by the number of issues they created. The versatile
command-line JSON processor jq can do that, but writing the filter isn't all that
easy or straightforward. So Simon Willison's idea was to feed the first 1000 characters of the JSON to an LLM, have
it generate the filter, and then run jq over the whole JSON
with the filter the LLM returned.
For that, he uses this system prompt:
```
Based on the example JSON snippet and the desired query, write a jq program

Return only the jq program to be executed as a raw string, no string delimiters
wrapping it, no yapping, no markdown, no fenced code blocks, what you return
will be passed to subprocess.check_output('jq', [...]) directly.

For example, if the user asks: extract the name of the first person

You return only: .people[0].name
```
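Just to make the shape of the pipeline concrete, here is a minimal shell sketch of that idea. It assumes Simon Willison's llm CLI (where -s sets the system prompt and piped stdin is combined with the prompt) and a local data.json; his actual script differs in the details:
```
# Minimal sketch, not Simon's actual script: ask the LLM for a jq filter
# based on a 1000-character sample, then run jq with it over the full file.
# (system prompt abbreviated here; use the full prompt from above)
system='Based on the example JSON snippet and the desired query, write a jq program ...'
filter=$(head -c 1000 data.json | llm -s "$system" "count by user login, top 3")
jq "$filter" data.json
```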
Using structured output
You will notice that quite a lot of the prompt is spent "arguing with the LLM" to just print the result without any
discussion. OpenAI structured output
somewhat reduces the need for that, since it is immediately clear that the output has to be machine-readable.
Second, it is well known that with current models the "chain of thought" pattern often helps with
difficult tasks. With structured output, you can request an explanation first and still easily grab the actual
result from the output. It does, however, require you to provide a JSON schema. You could use the "Generate" assistant
in OpenAI's playground with response_format "json_schema" to create one
for you. My chatgpt
tool makes that even easier: with the argument -ra attr1,...
it uses a simple
JSON schema for an object with the string attributes attr1, attr2, ... you provide, and -rar attr1,...
returns
an array of objects with these attributes. So, for instance:
```
curl -s https://api.github.com/repos/simonw/datasette/issues | head -c 1000 | \
  chatgpt -f - -ra explanationoffilter,jqfilter \
  "Print a JSON with a filter for the jq program to count by user login, top 3"
```
returns
```
{
  "explanationoffilter": "This jq filter processes the input JSON array, aggregates the issues by user login, and sorts them to find the top three users with the highest number of issues created, using a combination of grouping and sorting operations.",
  "jqfilter": "map(.user.login) | group_by(.) | map({user: .[0], count: length}) | sort_by(-.count) | .[0:3]"
}
```
from which a shell script can easily extract the filter by piping it through jq -r .jqfilter.
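Putting the pieces together, a small wrapper might look like this sketch, which fetches the data once, asks for the filter based on a 1000-character sample, and applies it to the full document:
```
url="https://api.github.com/repos/simonw/datasette/issues"
json=$(curl -s "$url")
# Ask the LLM for the filter, based on the first 1000 characters only:
filter=$(printf '%s' "$json" | head -c 1000 | \
  chatgpt -f - -ra explanationoffilter,jqfilter \
  "Print a JSON with a filter for the jq program to count by user login, top 3" | \
  jq -r .jqfilter)
# Apply the LLM-generated filter to the whole document:
printf '%s' "$json" | jq "$filter"
```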
If you are curious, the request sent to chatgpt looks like this. It uses my
"put it in the AI's mouth" pattern:
the JSON we want to parse goes into an assistant message, which makes the request a bit more resistant against prompt injection
from the JSON and prevents a modern version of Bobby Tables :-)
```
{
"model": "gpt-4o-mini",
"user": "chatgpt script",
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "responseschema",
"schema": {
"type": "object",
"properties": {
"explanationoffilter": {
"type": "string"
},
"jqfilter": {
"type": "string"
}
},
"required": [
"explanationoffilter",
"jqfilter"
],
"additionalProperties": false
},
"strict": true
}
},
"messages": [
{
"role": "user",
"content": "Print the raw input text the instructions will apply to."
},
{
"role": "assistant",
"content": "(here come the first 1000 characters of the JSON file, deleted for brevity)"
},
{
"role": "user",
"content": "Print a JSON with a filter for the jq program to count by user login, top 3"
}
]
}
```
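If you want to replay that request without my tool, plain curl works as well. This sketch assumes the JSON above is saved as request.json and OPENAI_API_KEY is set; the jq step unwraps the structured output (a JSON string inside message.content) and picks out the filter:
```
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json | \
  jq -r '.choices[0].message.content | fromjson | .jqfilter'
```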
With some more shell script fluff to make it more convenient, I ended up with chatgptjq. So,
```
curl -s https://api.github.com/repos/simonw/datasette/issues | chatgptjq "count by user login, top 3"
```
gives you:
```
Query as understood: Count the number of occurrences for each user login in the input data and return the top 3 users with the most occurrences.
Explanation: To extract the desired information, I will aggregate the occurrences of each user login from the input JSON. I will then sort this aggregated data based on the count in descending order and retrieve the top 3 users. The output will be formatted as a JSON array containing the top 3 user logins along with their respective counts.
jq command: [.[] | .user.login] | group_by(.) | map({login: .[0], count: length}) | sort_by(-.count) | .[0:3]
[
  {
    "login": "simonw",
    "count": 11
  },
  {
    "login": "king7532",
    "count": 5
  },
  {
    "login": "dependabot[bot]",
    "count": 2
  }
]
```
Of course, the explanations can be silenced with the switch -s.
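With -s, only the jq result should remain on stdout, so the command slots into further pipelines; a hypothetical example:
```
# List just the top 3 login names (assumes -s suppresses all explanatory lines):
curl -s https://api.github.com/repos/simonw/datasette/issues | \
  chatgptjq -s "count by user login, top 3" | jq -r '.[].login'
```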