Description
After initial pretraining, LLMs are "finetuned" to be more helpful and harmless. However, this finetuning often does not fundamentally change the model's underlying knowledge, and undesirable capabilities can frequently be re-elicited through clever prompting ("jailbreaking") or further finetuning on problematic data.