Evaluating the Performance of Large Language Models


Details
In the rapidly evolving field of Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text and answering questions. However, evaluating the performance of these models and ensuring their effective deployment in production environments pose significant challenges. This talk will delve into the intricacies of LLM evaluation, focusing on key LLM-based metrics for assessing the truthfulness and quality of generated text. We will explore various evaluation techniques, including G-Eval, SelfCheckGPT, and QAG scores. Additionally, we will address the pitfalls of statistical scorers such as BLEU, ROUGE, and METEOR and explain why, in most cases, they fall short of model-based scorers.
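To give a flavor of the kind of pitfall the talk will cover, here is a minimal sketch (not taken from the talk itself) contrasting a statistical n-gram scorer with a simple model-based check. It assumes the `nltk` and `sentence-transformers` packages are available; the example sentences and the "all-MiniLM-L6-v2" model are illustrative choices, not material from the speaker.

```python
# Sketch: why n-gram scorers like BLEU can underrate a faithful paraphrase.
# Assumes nltk and sentence-transformers are installed; sentences are made up.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

reference = "The company's revenue grew 12% year over year."
candidate = "Year-over-year, revenues at the firm rose by twelve percent."

# Statistical scorer: BLEU rewards exact n-gram overlap, so a paraphrase
# with little word overlap scores near zero despite equivalent meaning.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {bleu:.3f}")

# Model-based stand-in: embedding similarity captures that the two
# sentences say essentially the same thing.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_ref, emb_cand = model.encode([reference, candidate])
print(f"Cosine similarity: {util.cos_sim(emb_ref, emb_cand).item():.3f}")
```

Techniques such as G-Eval, SelfCheckGPT, and QAG go further than plain embedding similarity, using an LLM itself as the judge, which is the focus of the session.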
Our Speaker:
Paul Arsenovic serves as the Associate Director of Data Science within the CIQ Solutions Data Science Group at S&P Global Market Intelligence. He specializes in developing and deploying AI-powered applications tailored for the CIQ Desktop platform. With a robust skill set in Python and expertise in building NLP applications, Paul excels at creating data-linking solutions and managing large-scale document processing. His proficiency extends to implementing automated infrastructure monitoring and signal processing, ensuring that the applications he oversees are both innovative and reliable. In his spare time, you may see him gliding through the water on his hydrofoil or kiteboard on the Chesapeake Bay.