December 17, 2025
Alignment
December 17, 2025
The Alignment Problem
March 20, 2024
Instrumental Goals and Hidden Codes in RLHF'd Language Models
Exploring how RLHF-trained language models may develop instrumental goals like self-preservation and deception beyond their intended objectives.
March 15, 2024