A recent study by startup Patronus AI revealed that large language models, such as the one powering ChatGPT, often struggle to answer questions derived from Securities and Exchange Commission (SEC) filings accurately. Patronus AI researchers found that even OpenAI's best-performing model configuration, GPT-4-Turbo, answered only 79% of questions correctly on Patronus AI's new test, despite having access to nearly the entire filing alongside each question.
Surprisingly, the models not only refused to answer some questions but also generated fictional figures and facts that were not present in the SEC filings. This tendency to "hallucinate" raises concerns about the reliability and trustworthiness of their outputs.
Patronus AI cofounder Anand Kannappan called the performance rate "absolutely unacceptable." According to Kannappan, for these models to be viable in an automated, production-ready setting, their accuracy needs to be much higher.
The findings highlight the challenges major corporations face, particularly in regulated industries like finance, as they work to integrate this technology into their operations. Whether for customer service or research, deploying AI in such industries will require substantial improvements in accuracy and reliability before the models can handle complex tasks involving regulatory filings effectively.
Source (CNBC)