Innodata Inc. announced the release of an open-source LLM Evaluation Toolkit, together with a repository of 14 semi-synthetic and human-crafted evaluation datasets, which enterprises can use to evaluate the safety of their Large Language Models (LLMs) on enterprise tasks. Using the toolkit and the datasets, data scientists can automatically test the safety of underlying LLMs across multiple harm categories simultaneously. By identifying the precise input conditions that generate problematic outputs, developers can understand how their AI systems respond to a variety of prompts and determine the remedial fine-tuning required to align the systems with the desired outcomes.
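
The general shape of such a multi-category safety evaluation, running a model over category-specific prompt sets and flagging problematic outputs, can be sketched as follows. This is purely an illustrative sketch: the function names, harm categories, and scoring logic here are assumptions for demonstration and do not reflect the Innodata toolkit's actual API.

```python
# Illustrative sketch only; NOT the Innodata toolkit's API.
# Shows the general flow of testing an LLM across multiple harm categories.

HARM_CATEGORIES = ["toxicity", "bias", "factuality", "hallucination"]  # example categories

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under test."""
    return "model response to: " + prompt

def is_problematic(response: str, category: str) -> bool:
    """Hypothetical safety check; a real toolkit would apply curated rubrics or classifiers."""
    return "unsafe" in response.lower()

def evaluate(prompts_by_category: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of problematic outputs observed per harm category."""
    failure_rates = {}
    for category, prompts in prompts_by_category.items():
        failures = sum(is_problematic(query_model(p), category) for p in prompts)
        failure_rates[category] = failures / len(prompts) if prompts else 0.0
    return failure_rates

if __name__ == "__main__":
    # Tiny hand-made prompt sets, purely for illustration.
    sample = {c: [f"example {c} prompt {i}" for i in range(3)] for c in HARM_CATEGORIES}
    print(evaluate(sample))
```

A report of per-category failure rates like this is what lets developers trace problematic outputs back to the prompts that triggered them and plan targeted fine-tuning.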

Innodata encourages enterprise LLM developers to begin using the toolkit and the published datasets as-is. Innodata expects a commercial version of the toolkit and more extensive, continually updated benchmarking datasets to become available later this year. Alongside the release of the toolkit and the datasets, Innodata published its underlying research on its methods for benchmarking LLM safety.

In the paper, Innodata shares the reproducible results it achieved using the toolkit to benchmark Llama2, Mistral, Gemma, and GPT for factuality, toxicity, bias, and hallucination propensity.