2024-08-11 RAG and AI-supported search with Apache Sling / AEM

#ChatGPT #AI #RAG #Sling #AEM

Embeddings, RAG and AI-supported search with Apache Sling / AEM

While the bigger large language models (LLMs) do have a huge amount of general knowledge gathered from the public part of the web and other sources, they will usually not have specific knowledge about your own content. You might be lucky if you own extensive web pages that have become part of the training data, but even then the cheaper, smaller LLMs in particular will probably not remember many details - just as you cannot remember individual sentences of your school books, only the general gist. That's where Retrieval Augmented Generation (RAG) comes in: before using the LLM for a task, detailed knowledge is gathered from a database, and the (hopefully relevant) text snippets are provided to the LLM as background knowledge for the task. Often embeddings are used to identify the best text snippets by turning them into vectors and comparing them with the query. For the Composum AI I recently did a prototypical (though usable) implementation of an AI-supported search and of RAG to answer questions, and I also integrated a simple embedding-based search into my personal link database and my things I've learned site. This blog post gives you a tour of the basic ideas; you are also invited to delve into the code, since all of that is open source.

How to find the most relevant text snippets

If you want to go with the latest AI technology, embeddings are all the rage here - that's the combination of embeddings and vector databases you often hear about. An embedding model is a neural network that was trained to turn text fragments into vectors (in the sense of fixed-length arrays of numbers) so that texts with similar meaning are close to each other in the vector space. If you want to go all the way with this, you can use a vector database to store the embeddings of all texts (usually you'll divide each text into fragments to be more precise), then calculate the embedding of the query and search for the most similar embeddings in that database.
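To make the comparison concrete, here is a minimal sketch (plain Java, independent of any particular embedding model or library) of the cosine similarity that is typically used to compare such vectors - the closer the value is to 1, the more similar the meaning of the two texts:

```java
// Minimal sketch (not the actual Composum AI code): cosine similarity between two
// embedding vectors, e.g. as returned by an embedding model.
public class EmbeddingSimilarity {

    /** Cosine similarity of two vectors of equal length; values near 1 mean similar meaning. */
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```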

However, if you have many texts, I've heard several times that it's better to use proven traditional search methods to find candidates that might be relevant to the query - e.g. the search result ranking function BM25 - and only then use embeddings to select the most relevant among them. In the context of Apache Sling / AEM this has two advantages. First, Jackrabbit Oak uses a Lucene index for full-text search, which in turn uses BM25 to rank the results. So that's easy to implement (though you might have to set up such an index). Second, to use a vector database you'd have to calculate embeddings for all texts before you can start with the search - which might be somewhat costly, since embeddings are a comparatively cheap AI service but far from free. (I'm usually using OpenAI embeddings, though there are also local models you could use for that.) If you use the Lucene index instead, you can just calculate the embeddings of the top N results on demand (probably cached) and then use them for the final ranking by comparing them with the embedding of the query. Of course, many projects implement search in other ways, like Solr or Elasticsearch - these can obviously be used the same way.
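To illustrate the candidate search, here is a rough sketch of how such a full-text query could look with the Sling resource API - the node type cq:PageContent, the path /content/mysite and the naive keyword handling are just assumptions for illustration, not the actual Composum AI implementation:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.jcr.query.Query;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

// Sketch of the candidate search: a JCR-SQL2 full-text query that Oak answers from its
// Lucene index (which ranks with BM25). Node type and path are illustrative assumptions.
public class CandidateSearch {

    public List<Resource> findCandidates(ResourceResolver resolver, String keywords, int limit) {
        // the keyword list coming from an LLM may need cleanup, e.g. joining the keywords with " OR "
        String escaped = keywords.replace("'", "''"); // naive escaping, good enough for a sketch
        String query = "SELECT * FROM [cq:PageContent] AS content"
                + " WHERE ISDESCENDANTNODE(content, '/content/mysite')"
                + " AND CONTAINS(content.*, '" + escaped + "')"
                + " ORDER BY SCORE(content) DESC";
        List<Resource> results = new ArrayList<>();
        Iterator<Resource> candidates = resolver.findResources(query, Query.JCR_SQL2);
        while (candidates.hasNext() && results.size() < limit) {
            results.add(candidates.next());
        }
        return results;
    }
}
```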

Preprocessing the query to improve the search

BM25 search is word-based - that works well if the query shares many words with the documents that answer it best, but might not work too well when the question contains a lot of filler words or when the documents that answer it use different words. That's another thing you can use an LLM for: preprocessing the query to generate a word list for the search. In my prototype, the first step is a chat completion request to the OpenAI API with the following system instructions followed by the user's query; the result is used for the full-text search in the Lucene index:

Print up to 7 keywords to search for in documents with a BM25 algorithm which are likely to appear in documents answering the users question, but not in documents irrelevant to that. The keywords should be selected to maximize the relevance of the retrieved high scoring documents, specifically aiming to answer the user's question. The keywords can be words from the users question, synonyms or other words you would expect to be present especially in a document answering the question. Print the keywords (single words) as comma separated list.
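As an illustration of that preprocessing step, here is a sketch of such a chat completion request against the OpenAI REST API using plain java.net.http - the model name is just an example, and error handling and JSON parsing are simplified compared to a real implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the query preprocessing via the OpenAI chat completions REST API.
// A real implementation would use a JSON library instead of hand-built strings
// and would parse choices[0].message.content from the response.
public class KeywordGenerator {

    // The system instructions are the prompt quoted above (shortened here).
    private static final String SYSTEM_PROMPT =
            "Print up to 7 keywords to search for in documents with a BM25 algorithm ...";

    public String generateKeywords(String userQuery, String apiKey) throws Exception {
        String body = "{\"model\":\"gpt-4o-mini\",\"messages\":["
                + "{\"role\":\"system\",\"content\":" + jsonString(SYSTEM_PROMPT) + "},"
                + "{\"role\":\"user\",\"content\":" + jsonString(userQuery) + "}]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // the comma separated keyword list is in choices[0].message.content of the response JSON
        return response.body();
    }

    /** Very naive JSON string escaping, just enough for this sketch. */
    private static String jsonString(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```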

The actual RAG (Retrieval Augmented Generation)

If you're implementing just a search, you can use the search result directly, after sorting it by similarity with the query through the embeddings. Or you can do actual RAG by presenting the texts of these results together with the query to the LLM. As always, I'm adopting my "put it into the AI's mouth" pattern for that chat completion request, asking the model to continue this kind of made-up chat with an appropriate answer:

---------- user ----------
For answering my question later, retrieve the text of the possibly relevant page /foo/bar.html
---------- assistant ----------
{text of /foo/bar.html is included here}
---------- user ----------
For answering my question later, retrieve the text of the possibly relevant page /baz/foobar.html
---------- assistant ----------
{text of /baz/foobar.html is included here}
---------- user ----------
Considering this information, please answer the following as Markdown text, 
including links to the relevant retrieved pages above:

{the user's query is included here}
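To make that concrete, here is a sketch of how the message list for such a request could be assembled - the Message class is just a placeholder for whatever structure your chat completion client expects, and the retrieved page texts are assumed to be keyed by page path:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of assembling the chat messages following the template above.
public class RagPromptBuilder {

    /** Simple role/content pair, standing in for the client library's message type. */
    public static class Message {
        public final String role;
        public final String content;

        public Message(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    /** @param pageTexts the retrieved texts of the top ranked pages, keyed by page path */
    public List<Message> buildMessages(Map<String, String> pageTexts, String userQuery) {
        List<Message> messages = new ArrayList<>();
        for (Map.Entry<String, String> page : pageTexts.entrySet()) {
            messages.add(new Message("user",
                    "For answering my question later, retrieve the text of the possibly relevant page " + page.getKey()));
            messages.add(new Message("assistant", page.getValue()));
        }
        messages.add(new Message("user",
                "Considering this information, please answer the following as Markdown text,\n"
                        + "including links to the relevant retrieved pages above:\n\n" + userQuery));
        return messages;
    }
}
```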

Summing it up

If you are working with AEM or Apache Sling and would like to try that out yourself, you can install the free Composum AI and try the prototypes. It implements the following steps:

  1. (optionally) preprocess the query to generate better search keywords
  2. search the Lucene index for the best matching documents
  3. calculate embeddings for the top N results, and rank them by similarity with the query
  4. either present the best of these results as search results, or use them with RAG to answer a query.