Description
It is difficult to accurately evaluate what LLMs can do and the risks they pose. Measured performance is highly sensitive to how a model is prompted. Test data might have been part of a model's training data, leading to overestimated capabilities ("test-set contamination"). Evaluations can also be biased by the evaluators themselves, whether these are LLMs used to evaluate other LLMs or humans.
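As a rough illustration of test-set contamination, one common heuristic is to flag a test item when it shares a long word n-gram with the training data. The sketch below assumes the training corpus is available as plain strings, which is often not the case in practice. The function names `ngrams` and `contamination_rate` and the default `n=8` are illustrative choices, not part of any particular library or benchmark.

```python
def ngrams(text, n):
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, training_corpus, n=8):
    """Fraction of test items sharing at least one n-gram with the training corpus.

    Illustrative heuristic only: exact n-gram overlap misses paraphrased or
    translated leaks, and short or formulaic items can produce false positives.
    """
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc, n)

    # A test item is flagged if any of its n-grams also appears in training data.
    flagged = [item for item in test_items if ngrams(item, n) & train_ngrams]
    return len(flagged) / len(test_items) if test_items else 0.0
```

A high rate suggests that benchmark scores may overstate capability, while a low rate does not rule out contamination, since paraphrased test items would go undetected by exact matching.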