Dataset
We test MemGPT and our fixed-context baselines on the Multi-Session Chat (MSC) dataset, which consists of multi-session chat logs created by human labelers, each of whom was asked to portray a consistent persona across all sessions. Each MSC conversation contains five sessions, with each session containing roughly a dozen messages.
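The structure described above can be sketched as a simple in-memory representation; the class and field names below are hypothetical, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical representation of one MSC conversation:
# a list of sessions, each session a list of (speaker, utterance) pairs.
@dataclass
class MSCConversation:
    sessions: list = field(default_factory=list)

    def total_messages(self):
        # Count messages across all sessions of the conversation.
        return sum(len(session) for session in self.sessions)

# Five sessions of about a dozen messages each, as in MSC.
conv = MSCConversation(sessions=[
    [("A", f"message {i}") for i in range(12)]
    for _ in range(5)
])
print(conv.total_messages())  # 60
```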
MemGPT for Document Analysis
Document analysis also runs up against the constrained context windows of current transformer models. For instance, the state-of-the-art open-source Llama 2 models accept a maximum of only 4k input tokens, while OpenAI’s (closed) GPT models, which power the popular ChatGPT consumer chatbot, top out at 32k input tokens. Stephen King’s bestselling novel “The Shining” contains about 150k words, or roughly 200k tokens (the word-to-token ratio varies with the specific tokenizer used), and legal or financial documents such as annual reports (SEC Form 10-K) can easily surpass a million tokens. Anthropic has released (closed) models that handle up to 100k tokens, but many documents exceed even that length.
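The arithmetic above can be checked with a rough back-of-the-envelope estimate; the ~1.33 tokens-per-word ratio is a heuristic assumption, since the true ratio depends on the tokenizer:

```python
# Rough token-count estimate for a long document, assuming ~1.33 tokens
# per word (heuristic; the exact ratio depends on the tokenizer used).
def estimate_tokens(word_count, tokens_per_word=1.33):
    return int(word_count * tokens_per_word)

# "The Shining": ~150k words -> roughly 200k tokens.
doc_tokens = estimate_tokens(150_000)

# Compare against the context windows mentioned in the text.
for model, ctx in {"Llama 2": 4_096, "GPT-4-32k": 32_768,
                   "Claude-100k": 100_000}.items():
    status = "fits" if doc_tokens <= ctx else f"exceeds by {doc_tokens - ctx} tokens"
    print(f"{model}: {status}")
```

All three windows are exceeded, which is the gap MemGPT's memory hierarchy is designed to bridge.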
Limitations of MemGPT
The OpenAI GPT-4 models have been specifically optimized for function calling. OpenAI’s API documentation states that for function fine-tuned models, the provided function schema is converted into a system message that the model has been trained to interpret; however, the inner workings of OpenAI’s models are proprietary and not publicly disclosed. We found that GPT-4 function fine-tuned models rarely made syntactic or semantic mistakes on the MemGPT function set, whereas GPT-3.5 fine-tuned models frequently generated incorrect function calls or attempted to use functions improperly. Even GPT models fine-tuned for function calling still require a parser to verify that their outputs are valid function syntax.
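A minimal sketch of such a verification parser, assuming JSON-formatted function calls; the function names and argument sets here are illustrative placeholders, not MemGPT's actual schema:

```python
import json

# Hypothetical function schema: maps each allowed function name
# to its required argument names.
SCHEMA = {
    "archival_memory_insert": {"content"},
    "conversation_search": {"query"},
}

def validate_call(raw_output):
    # Reject anything that is not valid JSON.
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    # Reject hallucinated functions outside the schema.
    name = call.get("name")
    if name not in SCHEMA:
        return False
    # Require exactly the arguments the schema specifies.
    args = call.get("arguments", {})
    return set(args) == SCHEMA[name]

print(validate_call('{"name": "conversation_search", "arguments": {"query": "hiking"}}'))  # True
print(validate_call('{"name": "made_up_fn", "arguments": {}}'))  # False
```

A check like this catches both malformed syntax and calls to nonexistent functions before they are executed.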
Additionally, we found that the most popular Llama 2 70B model variants (even those fine-tuned for function calling) would frequently generate incorrect function calls or even hallucinate functions outside the provided schema.
Conclusion
MemGPT is a novel, operating-system-inspired LLM system for managing the constrained context windows of large language models. By building a memory hierarchy and control flow reminiscent of traditional operating systems, MemGPT provides the illusion of larger context resources. We evaluated this OS-inspired approach in document analysis and conversational agents, two domains where the performance of existing LLMs is limited by finite context lengths. By efficiently paging relevant context in and out of memory, MemGPT can analyze lengthy documents that far exceed the context limits of current LLMs.
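The paging idea can be sketched as follows; this is a minimal illustration under a fixed token budget, not MemGPT's actual implementation, and all names are hypothetical:

```python
from collections import deque

# Minimal sketch of OS-style context paging: when text exceeds the
# fixed in-context token budget, the oldest chunks are evicted to an
# "archival" store and can be paged back in on demand.
class PagedContext:
    def __init__(self, budget_tokens):
        self.budget = budget_tokens
        self.main = deque()   # analogous to the LLM's context window
        self.archive = []     # analogous to disk / external storage

    def _used(self):
        return sum(n for _, n in self.main)

    def append(self, chunk, n_tokens):
        self.main.append((chunk, n_tokens))
        # Evict oldest chunks until we are back within the budget.
        while self._used() > self.budget:
            self.archive.append(self.main.popleft())

    def page_in(self, keyword):
        # Bring matching archived chunks back into main context.
        for item in list(self.archive):
            chunk, n = item
            if keyword in chunk:
                self.archive.remove(item)
                self.append(chunk, n)

ctx = PagedContext(budget_tokens=100)
for i in range(10):
    ctx.append(f"chunk-{i}", 20)          # 200 tokens total; half gets evicted
print(len(ctx.main), len(ctx.archive))    # 5 5
```

The eviction-plus-retrieval loop is what lets a bounded window appear to cover a document far larger than itself.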
For conversational agents, MemGPT enables long-term memory, consistency, and adaptability across extended dialogues. MemGPT demonstrates that even with fixed context lengths, operating-system techniques such as interrupts and hierarchical memory management can unlock the potential of LLMs.