Description
It is hard to know exactly what LLMs can and cannot do. Their capabilities can differ sharply from human ones: they may perform inconsistently on tasks humans handle reliably, yet excel at tasks far beyond human reach in speed or scale (e.g., learning a new language in-context from a grammar book). Current evaluation methods (benchmarks) often fail to distinguish between a model genuinely lacking a capability, misunderstanding what is being asked, and choosing not to comply.