By Dr. Karl Häfner and Max Kühn
"How do I have to adjust a component in my machine of a certain series to reduce vibrations?"
"What is the best way to repair it if a fault occurs?"
"Where can I find more information on how to solve a problem sustainably?"
In the past, one of our customers had to wade through mountains of documents to find answers to such questions, at the cost of considerable time and patience. Today, they put these questions to an interactive chatbot, quickly receive the right answer, and can turn their attention to actually solving the problem just as quickly.
The second article in our AI Deep Dive series looks at how artificial intelligence can help companies make the most of their in-house expertise, and at the technologies and steps required to do so.
Innovative knowledge management thanks to generative AI and RAG
A chatbot like the one described in the introduction is made possible by a technology currently attracting a great deal of media attention: generative artificial intelligence (GenAI), or more specifically large language models (LLMs). Made famous by applications such as ChatGPT and Microsoft Copilot, these models can give remarkably eloquent answers based on the publicly available information that was incorporated into their training.
The use case described, however, is aimed at internal company knowledge that has not been trained into the models. In such cases, Retrieval Augmented Generation (RAG) provides a remedy. RAG extends the capabilities of LLMs by using ad hoc external data in addition to the trained "knowledge" to answer user questions.
In the scenario described, the chatbot uses machine documentation, maintenance reports, error lists, etc. as external data, processes them and incorporates the information they contain into its response. In this way, linguistic interfaces to internal company data can be created comparatively easily and quickly. Fine-tuning the LLMs, which is extremely resource-intensive and requires huge amounts of data, is therefore unnecessary.
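To make this concrete, here is a minimal sketch of the query-time flow of such a RAG application, assuming access to an OpenAI-compatible endpoint; the model names, the toy documentation chunks and the in-memory store are illustrative assumptions, not our customer's actual setup.

```python
# Minimal RAG query flow: embed the question, retrieve the most similar
# documentation chunks, and let the LLM answer based only on them.
# Assumes the `openai` package and an API key; model names are examples.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# In production these chunks come from the prepared machine documentation;
# here a toy in-memory list stands in for the vector database.
chunks = [
    "To reduce vibrations on series X, re-torque the mounting bolts to 45 Nm.",
    "Error E-17 indicates a worn drive belt; replace it per maintenance manual ch. 4.",
]
chunk_vecs = embed(chunks)

def answer(question: str, k: int = 2) -> str:
    q = embed([question])[0]
    # Cosine similarity between the question and every stored chunk.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I reduce vibrations on a series X machine?"))
```

The retrieved chunks travel with the question into the prompt, so the model answers from the company's own documents rather than from its training data alone.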
Data preparation: the key to successful AI application
Data preparation is essential for a functioning RAG application. For an LLM to process the texts well, they must be cut into suitably sized sections, enriched with metadata and embedded (an embedding is a numerical representation of the semantic meaning of a text).
The requirements for data pre-processing vary greatly from customer to customer, as they depend on many factors such as the source formats and the quantity and consistency of the data. As implementations succeed and expectations of RAG applications rise, the range of data to be integrated within individual customer projects also keeps growing.
Another challenge when integrating your own data into RAG applications lies in the internal structure of the documents. Typical documentation texts contain not only plain text but also many other elements: illustrations such as diagrams, drawings or photos, as well as tables, cross-references, warnings and the like. In addition, many documents are full of abbreviations and cryptic designations that are initially incomprehensible to outsiders (and therefore also to LLMs).
All of these elements carry important information, and their meaning often only becomes apparent when the elements of a document are considered together, which makes editorial preparation essential. Illustrations can be verbalized, abbreviations written out, cryptic designations explained in short texts, and important metadata such as warnings stored alongside the content. And all of this must ultimately be automated and performed efficiently.
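As a small illustration of this kind of editorial preparation, the following sketch expands abbreviations from a glossary before the texts are embedded; the glossary entries are invented examples.

```python
import re

# Hypothetical glossary mapping internal abbreviations to readable long forms.
GLOSSARY = {
    "VFD": "variable frequency drive (VFD)",
    "MTBF": "mean time between failures (MTBF)",
}

def expand_abbreviations(text: str) -> str:
    # Replace each abbreviation only where it appears as a whole word.
    for abbr, long_form in GLOSSARY.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", long_form, text)
    return text

print(expand_abbreviations("Check the VFD if MTBF drops."))
# -> "Check the variable frequency drive (VFD) if mean time between failures (MTBF) drops."
```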
Four steps to optimal data processing
In order to tackle this challenge in a structured way and to be able to implement efficient processing routes quickly, we roughly divide the data processing process for RAG applications into four steps: Import, Cracking, Enrichment and Provisioning. We will provide a brief overview below.
- Import: Provision of the source data at an accessible location
In this step, the source data to be processed is made available in a known and accessible location. Depending on the source format, this can be an object store or a database, for example.
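Landing the raw files in an object store could look like the following sketch, which uses the Azure Blob Storage SDK; the container name, folder layout and connection string are placeholders.

```python
from pathlib import Path
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; in practice this comes from a secret store.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("raw-documents")

# Upload every PDF from a local folder into the landing zone.
for path in Path("docs").glob("*.pdf"):
    with path.open("rb") as f:
        container.upload_blob(name=path.name, data=f, overwrite=True)
```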
- Cracking: Structuring and extraction of texts, tables and figures
In the cracking step, the source data is broken down and given a basic structure to prepare it for further processing. Essentially, this means recognizing texts, tables and illustrations of all kinds as such, so that they can be treated separately in subsequent steps.
The effort required here can vary greatly. Structured or semi-structured formats often only require simple parsing scripts. For highly unstructured formats such as PDF files, the process is more involved: it may be necessary to extract the content using machine-learning-based methods such as optical character recognition (OCR) or computer-vision-based layout analysis.
It is not only important to extract content such as images, tables and text, but also to include contextual information. This can be, for example, image captions or the position of an element in the document hierarchy ((sub)headings).
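A minimal cracking sketch for PDFs, using the open-source library pdfplumber, might look like this; real documentation formats usually require considerably more logic (OCR, layout analysis) than shown here, and the file name is a placeholder.

```python
import pdfplumber

# Separate plain text from tables, page by page, and keep the page number
# as minimal contextual metadata for the later steps.
texts, tables = [], []
with pdfplumber.open("manual.pdf") as pdf:
    for page in pdf.pages:
        page_text = page.extract_text() or ""
        if page_text.strip():
            texts.append({"page": page.page_number, "text": page_text})
        for table in page.extract_tables():
            # Each table is a list of rows; each row is a list of cell strings.
            tables.append({"page": page.page_number, "rows": table})

print(f"Extracted {len(texts)} text pages and {len(tables)} tables")
```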
- Enrichment: Enrichment of the data for use by RAG applications
In the enrichment step, the data is enriched and shaped in such a way that it can later be used meaningfully by RAG applications. This step depends on the type of data and can become arbitrarily complex.
- Text: Essentially, texts should be divided into coherent, self-contained sections (segmentation), which can then be enriched with metadata. It can be useful, for example, to classify the content of a section or to identify specific text objects (entities) in order to filter for certain topics later or to display warnings programmatically.
- Tables: Basically the same applies to tables, except that they must first be converted into a form an LLM can process, such as a textual description or a format like Markdown (see the sketch after this list).
- Images: There are two ways to make images accessible to RAG applications. First, they can be embedded directly and thus, like texts, mapped into a vector space that represents their meaning. Second, they can be verbalized using multimodal LLMs such as GPT-4 Vision; the description obtained in this way is then treated like normal text.
The result of the enrichment step is self-contained, coherent segments enriched with as much meaningful metadata as possible.
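To illustrate, here is a sketch of two of these enrichment tasks: converting an extracted table to Markdown (via pandas, whose `to_markdown` needs the tabulate package) and verbalizing an image with a multimodal model. The model name and the prompt are assumptions, not a fixed recipe.

```python
import base64
import pandas as pd
from openai import OpenAI

client = OpenAI()

def table_to_markdown(rows: list[list[str]]) -> str:
    # The first row is assumed to be the header; to_markdown requires `tabulate`.
    df = pd.DataFrame(rows[1:], columns=rows[0])
    return df.to_markdown(index=False)

def verbalize_image(png_bytes: bytes) -> str:
    # Ask a multimodal LLM for a textual description that can then be
    # embedded and retrieved like any other documentation text.
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this technical drawing for a maintenance engineer."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```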
- Provisioning: Preparation and provision of the enriched segments
Finally, the enriched segments are prepared and made available for use by a specific LLM. The first task here is chunking: the segments are cut to a size suitable for the respective RAG application, which depends primarily on the token limit of the LLM used there. The segments are subdivided or combined accordingly (a sketch follows below).
The individual chunks are then embedded and made available in a vector database to which the RAG application has access. It makes sense to keep references to the source data so that the RAG application can refer to them later when generating the response. In addition to the vector database, there are other ways of providing the data, e.g. via graphs, database tables or similar.
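The chunking mentioned above could look roughly like this sketch, which splits a segment along a token budget using the tiktoken library and keeps a source reference on every chunk; the 500-token limit and the overlap are illustrative values, not fixed recommendations.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_segment(text: str, source: str,
                  max_tokens: int = 500, overlap: int = 50) -> list[dict]:
    # Encode once, then slide a window over the token sequence so that no
    # chunk exceeds the downstream LLM's budget; the overlap preserves context
    # across chunk boundaries.
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append({"text": enc.decode(window), "source": source})
        if start + max_tokens >= len(tokens):
            break
    return chunks

# Each chunk would then be embedded and written to the vector database
# together with its source reference.
```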
If large volumes of data are to be processed and these volumes grow over time, it can make sense for efficiency reasons to design the process described above incrementally, in order to avoid reprocessing data that has not changed (see the sketch below).
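A simple way to make the pipeline incremental is to remember a content hash per source file and skip anything unchanged on the next run, as in this sketch; the manifest location and folder layout are assumptions.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("processed.json")

def incremental_batch(folder: str) -> list[Path]:
    # Compare each file's content hash against the manifest from the last run
    # and return only the files that are new or have changed since then.
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    todo = []
    for path in sorted(Path(folder).glob("*")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen.get(path.name) != digest:
            todo.append(path)
            seen[path.name] = digest
    MANIFEST.write_text(json.dumps(seen, indent=2))
    return todo
```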
Implementing data pipelines: challenges and solutions
Whether incremental or not, implementing data pipelines that cover all of these steps can be quite a complex undertaking, depending on the initial situation. Suitable tools can help. When making a selection, it is important to remember that there is no single solution that does everything and fits every situation. A careful analysis of the initial situation is therefore essential.
Thanks to our broad portfolio, we can flexibly select and implement the most suitable technologies and tools for our customers. For the customer mentioned at the beginning of the article, for example, we implemented incremental pipelines on the Microsoft Azure platform with services such as Azure AI Search, Azure OpenAI and Azure Databricks, which fit in well with the customer's existing system landscape.
If you would also like to test the possibilities of LLM-based RAG applications for your use case or are already convinced and looking for an implementation partner, we look forward to hearing from you.
About the authors:
Dr. Karl Häfner studied economic geography and has been working as a data scientist in the field of natural language processing and AI since 2019. He has been supporting companies in integrating generative AI and large language models into their business processes since 2021. At the Dataciders subsidiary ixto GmbH, he heads the AI Solutions team.
Max Kühn has been involved with data science & BI since his studies in business informatics and focuses on data engineering in his work as a consultant. He supports companies in laying the foundations for data-driven decisions.