2024-11-04 AI-supported extracting data from JSON
Simon Willison recently wrote a very interesting blog entry
about using an LLM to generate a jq program
that extracts data from a given JSON file. Of course I had
some fun reproducing that with the chatgpt
tool from my ChatGPT Toolbox -
with some twists and improvements.
The idea
Assume you have a JSON file that is somewhat complicated - I'll shamelessly re-use Simon Willison's
example from his blog - and you want to extract some data from
it - in this case, find the top 3 users by the number of issues they created. The versatile
command-line JSON processor jq can do that, but writing the filter isn't all that
easy or straightforward. So Simon Willison's idea was to feed the first 1000 characters of the JSON to an LLM, have
it generate the filter, and then run jq over the whole JSON
with the filter the LLM returned.
For that, he uses this system prompt:
```
Based on the example JSON snippet and the desired query, write a jq program

Return only the jq program to be executed as a raw string, no string delimiters
wrapping it, no yapping, no markdown, no fenced code blocks, what you return
will be passed to subprocess.check_output('jq', [...]) directly.

For example, if the user asks: extract the name of the first person

You return only: .people[0].name
```
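Just to make the shape of the pipeline concrete, here is a minimal shell sketch of that idea. It assumes Simon Willison's llm CLI (where -s sets the system prompt and piped stdin is combined with the prompt) and a local data.json; his actual script differs in the details:
```
# Minimal sketch, not Simon's actual script: ask the LLM for a jq filter
# based on a 1000-character sample, then run jq with it over the full file.
# (system prompt abbreviated here; use the full prompt from above)
system='Based on the example JSON snippet and the desired query, write a jq program ...'
filter=$(head -c 1000 data.json | llm -s "$system" "count by user login, top 3")
jq "$filter" data.json
```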
Using structured output
You will notice that quite a lot of the prompt is spent "arguing with the LLM" to just print the result without any
discussion. OpenAI structured output
somewhat reduces the need for that, since it is immediately clear that the output has to be machine-readable.
Second, it is well known that with current models the "chain of thought" pattern often helps with
difficult tasks. With structured output, you can request an explanation first and still easily grab the actual
result from the output. It does, however, require you to provide a JSON schema. You could use the "Generate" assistant
in OpenAI's playground with response_format "json_schema" to create one
for you. My chatgpt
tool makes that even easier: with the argument -ra attr1,...
it uses a simple
JSON schema for an object with the string attributes attr1, attr2, ... you provide, and -rar attr1,...
returns
an array of objects with these attributes. So, for instance:
```
curl -s https://api.github.com/repos/simonw/datasette/issues | head -c 1000 | \
  chatgpt -f - -ra explanationoffilter,jqfilter \
  "Print a JSON with a filter for the jq program to count by user login, top 3"
```
returns
```
{
  "explanationoffilter": "This jq filter processes the input JSON array, aggregates the issues by user login, and sorts them to find the top three users with the highest number of issues created, using a combination of grouping and sorting operations.",
  "jqfilter": "map(.user.login) | group_by(.) | map({user: .[0], count: length}) | sort_by(-.count) | .[0:3]"
}
```
from which a shell script can easily extract the filter by piping it through jq -r .jqfilter.
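Putting the pieces together, a small wrapper might look like this sketch, which fetches the data once, asks for the filter based on a 1000-character sample, and applies it to the full document:
```
url="https://api.github.com/repos/simonw/datasette/issues"
json=$(curl -s "$url")
# Ask the LLM for the filter, based on the first 1000 characters only:
filter=$(printf '%s' "$json" | head -c 1000 | \
  chatgpt -f - -ra explanationoffilter,jqfilter \
  "Print a JSON with a filter for the jq program to count by user login, top 3" | \
  jq -r .jqfilter)
# Apply the LLM-generated filter to the whole document:
printf '%s' "$json" | jq "$filter"
```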
If you are curious, the request sent to chatgpt looks like this. It uses my
"put it in the AI's mouth" pattern:
the JSON we want to parse goes into an assistant message, which makes the request a bit more resistant against prompt injection
from the JSON and prevents a modern version of Bobby Tables :-)
```
{
"model": "gpt-4o-mini",
"user": "chatgpt script",
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "responseschema",
"schema": {
"type": "object",
"properties": {
"explanationoffilter": {
"type": "string"
},
"jqfilter": {
"type": "string"
}
},
"required": [
"explanationoffilter",
"jqfilter"
],
"additionalProperties": false
},
"strict": true
}
},
"messages": [
{
"role": "user",
"content": "Print the raw input text the instructions will apply to."
},
{
"role": "assistant",
"content": "(here come the first 1000 characters of the JSON file, deleted for brevity)"
},
{
"role": "user",
"content": "Print a JSON with a filter for the jq program to count by user login, top 3"
}
]
}
```
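If you want to replay that request without my tool, plain curl works as well. This sketch assumes the JSON above is saved as request.json and OPENAI_API_KEY is set; the jq step unwraps the structured output (a JSON string inside message.content) and picks out the filter:
```
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json | \
  jq -r '.choices[0].message.content | fromjson | .jqfilter'
```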
With some more shell script fluff to make it more convenient, I ended up with chatgptjq. So,
```
curl -s https://api.github.com/repos/simonw/datasette/issues | chatgptjq "count by user login, top 3"
```
gives you:
```
Query as understood: Count the number of occurrences for each user login in the input data and return the top 3 users with the most occurrences.
Explanation: To extract the desired information, I will aggregate the occurrences of each user login from the input JSON. I will then sort this aggregated data based on the count in descending order and retrieve the top 3 users. The output will be formatted as a JSON array containing the top 3 user logins along with their respective counts.
jq command: [.[] | .user.login] | group_by(.) | map({login: .[0], count: length}) | sort_by(-.count) | .[0:3]
[
  {
    "login": "simonw",
    "count": 11
  },
  {
    "login": "king7532",
    "count": 5
  },
  {
    "login": "dependabot[bot]",
    "count": 2
  }
]
```
Of course, the explanations can be silenced with the switch -s.
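With -s, only the jq result should remain on stdout, so the command slots into further pipelines; a hypothetical example:
```
# List just the top 3 login names (assumes -s suppresses all explanatory lines):
curl -s https://api.github.com/repos/simonw/datasette/issues | \
  chatgptjq -s "count by user login, top 3" | jq -r '.[].login'
```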