Memory integration for OpenWebUI (Image © OpenWebUI)

The memory system shares user messages and stored memories with the configured LLM so that memories can be retrieved and consolidated. The system can run entirely locally; if remote embedding models are used, however, the data is processed by the respective external providers.

Because the system uses complex prompts and embeds the retrieved memories into the chat context, token usage increases. To keep costs under control, the system is compatible with a range of models, including local LLMs and efficient public models such as qwen3-instruct, gpt-5-nano and gemini-2.5-flash-lite.

Technical architecture and retrieval method

The system uses a two-stage workflow to manage information. In the inbound phase, the software first decides whether a message should be skipped: it filters out technical data, code and mathematical queries using regex patterns and semantic classification. If the message is relevant, the system runs a semantic search for related memories. To improve accuracy, an LLM-assisted reranking step is triggered automatically when the number of initial candidates exceeds 50% of the retrieval limit.
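The inbound checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the regex patterns and the `threshold` parameter are hypothetical and not taken from the project, and the semantic classification step is left out entirely. Only the 50% trigger itself comes from the description.

```python
import re

# Hypothetical skip patterns; the patterns the real system uses are not
# given in the article.
SKIP_PATTERNS = [
    re.compile(r"```"),                                          # fenced code block
    re.compile(r"^\s*(def|class)\s+\w+\s*[(:]", re.MULTILINE),   # Python-like definitions
    re.compile(r"Traceback \(most recent call last\)"),          # stack traces
    re.compile(r"^\s*[\d\s.+\-*/^()=]*\d[\d\s.+\-*/^()=]*$"),    # bare arithmetic
]


def should_skip(message: str) -> bool:
    """Return True if the message looks like code, logs or plain math."""
    return any(pattern.search(message) for pattern in SKIP_PATTERNS)


def needs_rerank(candidate_count: int, retrieval_limit: int,
                 threshold: float = 0.5) -> bool:
    """Trigger reranking when candidates exceed a share of the retrieval limit.

    The default of 0.5 mirrors the 50% figure from the text; the parameter
    name is an assumption.
    """
    return candidate_count > threshold * retrieval_limit


if __name__ == "__main__":
    print(should_skip("2 + 2 * (3 - 1) ="))            # arithmetic only
    print(should_skip("I moved to Berlin last year"))  # personal fact
    print(needs_rerank(candidate_count=6, retrieval_limit=10))
```

A regex pre-filter like this is cheap and runs before any embedding or LLM call, which is presumably why it sits at the start of the inbound phase.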

After the LLM has responded, the screening phase begins. This background process analyzes the conversation to determine which memories need to be created, updated or deleted. This consolidation ensures that personal facts are enriched over time rather than duplicated.
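To make the create, update and delete step more concrete, here is a minimal sketch of how a list of consolidation operations could be applied to a memory store. The `MemoryOp` structure, the `apply_ops` function, the ID scheme and the exact-match duplicate check are all assumptions made for illustration; the article does not describe the format in which the LLM returns its decisions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryOp:
    """One hypothetical consolidation decision."""
    action: str                      # "create", "update" or "delete"
    memory_id: Optional[str] = None  # required for update and delete
    content: Optional[str] = None    # required for create and update


def apply_ops(store: Dict[str, str], ops: List[MemoryOp]) -> Dict[str, str]:
    """Return a new store with the operations applied; the input is not mutated."""
    result = dict(store)
    next_id = len(result) + 1
    for op in ops:
        if op.action == "create" and op.content:
            # Skip exact duplicates so the same fact is not stored twice.
            if op.content in result.values():
                continue
            while f"m{next_id}" in result:
                next_id += 1
            result[f"m{next_id}"] = op.content
        elif op.action == "update" and op.memory_id in result and op.content:
            result[op.memory_id] = op.content
        elif op.action == "delete" and op.memory_id in result:
            del result[op.memory_id]
        # Unknown actions or IDs are ignored in this sketch.
    return result


if __name__ == "__main__":
    store = {"m1": "Lives in Munich", "m2": "Likes tea"}
    ops = [
        MemoryOp("update", memory_id="m1", content="Lives in Berlin since 2024"),
        MemoryOp("delete", memory_id="m2"),
        MemoryOp("create", content="Has a dog named Bello"),
    ]
    print(apply_ops(store, ops))
```

Updating an existing entry instead of creating a second one is what lets a fact be enriched over time without being duplicated.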

Hardware and software optimization

To maintain performance, the system integrates three dedicated caches for embedding, retrieval and storage. These caches use an LRU (least recently used) eviction strategy so that memory usage stays bounded. In addition, the system uses stacked and normalized embeddings to speed up similarity calculations and reduce redundant API calls.
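The two ideas in this paragraph, LRU eviction and normalized embeddings, can be illustrated together in a small standard-library sketch. The class and function names are hypothetical, and `embed_fn` stands in for whatever embedding model is configured. The point of normalizing once at cache time is that cosine similarity between unit-length vectors reduces to a plain dot product.

```python
import math
from collections import OrderedDict
from typing import Callable, List, Optional, Sequence


class LRUCache:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data: "OrderedDict[str, object]" = OrderedDict()

    def get(self, key: str) -> Optional[object]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: object) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._data)


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cached_embedding(text: str,
                     embed_fn: Callable[[str], Sequence[float]],
                     cache: LRUCache) -> List[float]:
    """Return a normalized embedding, calling embed_fn only on a cache miss."""
    hit = cache.get(text)
    if hit is not None:
        return hit  # type: ignore[return-value]
    vec = normalize(embed_fn(text))
    cache.put(text, vec)
    return vec


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity only for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

In a setup like this, repeated lookups of the same text cost no additional embedding API call, and the cache size caps how much memory the stored vectors can occupy.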