20. A Sketch of Helpfulness Theory With Equivocal Principals
You may rightly wonder what else I'm up to, if not just blogging and the occasional adventure. The short answer is that I spend most of my time and mental energy on a summer research fellowship in AI safety called PIBBSS, where my research direction concerns questions like "how does an agent help a principal when the agent doesn't know what the principal wants?" and "how much harder is it for the principal to teach the agent what they want, and how much worse are the outcomes, if the principal doesn't know for sure what they want themselves?" The long(ish) answer is this post. I'll keep jargon, mathematical notation, and pointers to long papers to a minimum, though some of each will be unavoidable.

Here's the basic setup, called a "Markov decision process" (sometimes with adjectives like "decentralized" or "partially observable" attached), which is the ground assumption for the entire subfield my research direction lies in, called "inverse reinforcement learning":...
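As a warm-up before the setup proper, here's a minimal sketch of a plain, fully observable MDP in Python. Everything in it (the `MDP` class, the `step` helper, the toy "home"/"work" example) is illustrative shorthand I'm inventing on the spot, not notation from any paper. The one thing worth noticing is the `reward` field: in inverse reinforcement learning, that is exactly the piece the agent doesn't get to see and has to infer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
import random

@dataclass
class MDP:
    """A plain, fully observable Markov decision process."""
    states: List[str]
    actions: List[str]
    # transition[(s, a)] is a list of (next_state, probability) pairs summing to 1
    transition: Dict[Tuple[str, str], List[Tuple[str, float]]]
    # The principal's payoff for taking action a in state s. In inverse
    # reinforcement learning, this is the part the agent has to infer.
    reward: Callable[[str, str], float]
    discount: float  # how much the future matters relative to the present

def step(mdp: MDP, state: str, action: str) -> Tuple[str, float]:
    """Sample one transition: return (next_state, reward collected)."""
    next_states, probs = zip(*mdp.transition[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, mdp.reward(state, action)

# A toy example: an agent that can stay put or commute between two states.
toy = MDP(
    states=["home", "work"],
    actions=["stay", "move"],
    transition={
        ("home", "stay"): [("home", 1.0)],
        ("home", "move"): [("work", 0.9), ("home", 0.1)],  # moving sometimes fails
        ("work", "stay"): [("work", 1.0)],
        ("work", "move"): [("home", 0.9), ("work", 0.1)],
    },
    reward=lambda s, a: 1.0 if s == "work" else 0.0,  # being at work pays off
    discount=0.95,
)

print(step(toy, "home", "move"))  # e.g. ('work', 0.0)
```

In the "partially observable" variant the agent receives noisy observations instead of the state itself, and in the "decentralized" variant several agents each receive their own; but this tuple of states, actions, transitions, rewards, and a discount is the common core.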