AI QA Test Engineering: Testing AI Applications with the Power of AI


Details
To participate, please complete your free registration here.
### Can every AI output be trusted?
During this event you'll explore methods for evaluating AI applications using three tools that help guard against generative AI mistakes.
1. Testing AI Applications with DeepEval
DeepEval is an open-source evaluation framework designed for structured and automated testing of AI outputs. It allows you to define custom metrics, set expectations, and benchmark responses from LLMs. In this session, we'll explore how QA engineers and developers can use DeepEval to test the quality, accuracy, and reliability of AI-generated responses across use cases like chatbots, summarization, and code generation (a short sketch follows after this list).
2. Testing AI Applications with LLM as Judge
LLM-as-a-Judge is a powerful technique in which an AI model evaluates the outputs of another model. Instead of relying solely on manual review or static metrics, we'll learn how to use trusted LLMs (like GPT-4) to provide qualitative assessments, grading correctness, coherence, tone, and factuality. This method enables scalable, human-like evaluation in real-time AI testing pipelines (see the judge sketch after this list).
3. Evaluating LLMs with Hugging Face Evaluate
Hugging Face's evaluate library offers a robust suite of prebuilt metrics and tools to measure the performance of LLMs and NLP models. This topic will cover how to integrate and use evaluate in your testing workflows to assess text generation, classification, translation, and more, using standardized metrics like BLEU, ROUGE, and accuracy alongside custom metrics for GenAI applications (see the evaluate sketch below).
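To make the three topics concrete, here is a minimal DeepEval-style sketch, assuming the `deepeval` package exposes `LLMTestCase`, `AnswerRelevancyMetric`, and `assert_test` as in recent releases; the example input, output, and threshold are illustrative, and the metric calls an evaluation LLM under the hood, so an API key (e.g. `OPENAI_API_KEY`) is typically required:

```python
# Hedged sketch: class and function names assume the deepeval package's
# documented pytest-style workflow; verify against the current docs.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_chatbot_answer_relevancy():
    # Describe one interaction with the AI application under test.
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # Fail the test if the judged relevancy score falls below 0.7.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Because it follows pytest conventions, a test like this can run alongside existing suites and slot into a CI pipeline.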
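Next, a minimal LLM-as-a-Judge sketch using the official `openai` Python client; the rubric prompt, judge model name, and 1-5 scale are assumptions for illustration rather than a fixed standard:

```python
# Hedged sketch: one LLM grades another model's answer against a rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a strict QA judge. Rate the ANSWER to the QUESTION
on a 1-5 scale for correctness, coherence, and tone. Reply with the number only.

QUESTION: {question}
ANSWER: {answer}"""


def judge(question: str, answer: str) -> int:
    """Return a 1-5 quality score assigned by the judge model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any trusted judge model
        temperature=0,   # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())


score = judge("What is the capital of France?", "Paris is the capital of France.")
assert score >= 4, f"Judge scored the answer too low: {score}"
```

The same pattern scales to batch evaluation, where each generated response is scored automatically instead of being reviewed by hand.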
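Finally, a minimal sketch of Hugging Face's `evaluate` library, assuming the prebuilt `rouge` and `accuracy` metrics are available (ROUGE typically needs the `rouge_score` package installed); the example texts and labels are placeholders:

```python
# Hedged sketch: evaluate.load() fetches prebuilt metrics by name.
import evaluate

# Text generation / summarization quality via ROUGE.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(rouge_scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ...}

# Classification quality via accuracy.
accuracy = evaluate.load("accuracy")
acc = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(acc)  # {'accuracy': 0.75}
```

Standardized metrics like these give a common baseline, and custom metrics can be added on top for GenAI-specific checks.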
A Q&A session with Karthik K.K. will follow. Prepare your questions!
