61. Atoms to Agents As Filtered Through Some Tame Research-Creature
(Epistemic status: Neat-seeming ideas that promise to start paradigmatizing agent foundations, as expounded by JSW... but I dunno, pal, I'm just some research-creature. I didn't run this by him and I didn't consult notes or videos, either, so it might be very wrong or desperately incomplete in places. But also: in neglecting/refusing/failing/setting-off without recourse to notes, maybe I'll say something new and worth poking at some more.)
In seeking a paradigm for AI safety or AI alignment, we sometimes find ourselves seeking a paradigm for agent foundations, the study of what mind-type things that take actions for reasons towards goals might do in maximal generality. But where do we even begin? We find solace in materialistic reductionism, shunning most of metaphysics in the process: agents are a phenomenon first and foremost of the world of things, so it's as a very special type of thing that we will try to understand them.
From reductionism we take a guiding principle: things are made of parts. Understand the parts and how the parts fit together, and you're most of the way to understanding the whole. But we observe agents in the world, like humans and cats and maybe bacteria and maybe even thermostats(???), and if we go down the ladder of abstraction far enough, we ground out abruptly in atoms and particles. Atoms and particles are not, as best we can tell, remotely agents. So... what's going on with that? Why can we keep putting non-agentic parts together in some unspecified way and, past some unspecified point, end up with things displaying goal-directed behavior at all?
This calls for recourse to a guiding principle of empiricism: what you postulate must be able to explain what you see, so what you see should guide what you postulate. Around us we see numerous examples of agents, and in nearly all cases, those agents aggregate into superagents - through markets, egregores, colonial organisms, and the like - and are themselves in turn composed mostly of some combination of subagents and objects displaying some kind of selection pressure towards being good for some evident purpose - even one as simple as structural support or material transport.
But we must always let the constraints of pure mathematics guide us: in this case, we care about well-foundedness. Structures cannot contain equally complex structures all the way down forever; we must ground out somewhere. This points us towards a sense that there are at least two types of agent and possibly more, and that they can be at least partially ordered by some measure of sophistication. Let's try to postulate as few types as possible.
In the Atoms to Agents frame, we postulate exactly three types of agent-ish thing, which I'll term pre-agents: tools, simple agents, and complex agents (a toy type sketch in code follows the list below). We treat homeostasis/stability as a trivial goal that might not count, because anything at all that we see must expend some minimal effort towards not disintegrating.
- Tools are not agents, but they're almost agents. They have no goal, but seem to be clearly designed for some purpose, or have been subjected to some kind of selection pressure comparable in optimization strength to intentional design. They are always composed of other tools, all the way down to atoms; they are the bedrock of our theory. I claim no constraint on their actionspace, or the actionspace expansion that they afford to an agent wielding them. Hammers, bones, pens, ribosomes, and spoons are all tools; viruses are marginal.
- Simple agents are true agents, taking actions within their actionspaces of their own will for reasons towards a goal, but they always have a single fixed goal to pursue, either on top of stability/homeostasis or with homeostasis as that goal itself. They are composed of tools and sometimes other simple agents. Bacteria, thermostats, sea sponges, and rice cookers are all simple agents; LLMs are marginal (and I'll talk about them later).
- Complex agents are also true agents, and they must have multiple subgoals, again on top of homeostasis. They prioritize, trade off, and swap among those subgoals, frequently wielding tools in pursuit of them. They are composed of tools and simple agents, and sometimes other complex agents. Humans, foxes, corporations, cats, and nation-states can all be considered complex agents; an AGI or ASI would have to be a complex agent.
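To make the taxonomy a bit more graspable, here's the promised toy type sketch in Python. Nothing about it comes from JSW or the Atoms to Agents material itself; the class names, fields, and the crude "chase the heaviest subgoal" rule are just my own illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    """Not an agent: no goal of its own, but shaped by design or selection for a purpose."""
    purpose: str                                        # what it was designed/selected for
    parts: list["Tool"] = field(default_factory=list)   # tools bottom out in tools ("atoms")


@dataclass
class SimpleAgent:
    """A true agent with exactly one fixed goal (possibly just homeostasis)."""
    goal: str
    parts: list["Tool | SimpleAgent"] = field(default_factory=list)

    def act(self, observation: str) -> str:
        # Always acts in service of its single goal; no prioritization needed.
        return f"pursue {self.goal!r} given {observation!r}"


@dataclass
class ComplexAgent:
    """A true agent with multiple subgoals it prioritizes, trades off, and swaps among."""
    subgoals: dict[str, float]                          # subgoal -> current priority weight
    parts: list["Tool | SimpleAgent | ComplexAgent"] = field(default_factory=list)

    def act(self, observation: str) -> str:
        # Crudest possible goal-switching rule: chase whichever subgoal currently weighs most.
        top = max(self.subgoals, key=self.subgoals.get)
        return f"pursue {top!r} given {observation!r}"


# A thermostat as a minimal simple agent, built out of tools.
thermostat = SimpleAgent(
    goal="keep temperature at 20C",
    parts=[Tool("sense temperature"), Tool("switch heater")],
)
```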
Of course, I've skipped over some complexity here. Sometimes, complex agents can aggregate into a simple agent - think of stampedes and crowd crushes. Sometimes, a tool can hijack the agent wielding it - think of a bonfire, or a contact poison. Sometimes, a simple agent can overcome and hijack a complex agent - think of toxoplasmosis. We won't worry about those more complicated cases today.

LLMs are also slightly fraught in this frame: we might see the behavioral constraints that labs impose on them - HHH, safety refusals, NSFW mitigation, and the like - as subgoals, or as some hybrid of design constraints and selective pressure. If we choose to see them as subgoals, then we might deem LLMs minimal complex agents, in the same way that a thermostat is a minimal simple agent.

Homeostasis itself, as a trivial goal, has much broader implications than might first present themselves. The only things we can generally observe are the kinds of things that stick around, and not all objects can do that; so the brute fact of having made it through some combination of selection pressure, design constraints, and the ceaseless march of entropy tells us at least a little about the kind of object that can be a pre-agent. Even tools must stick around for at least a little while.
A promising initial sign for the applicability of the Atoms to Agents frame to AI alignment is what it has to say about what kinds of tradeoffs between subgoals a complex agent might find it appropriate to make. In particular, we may observe that complex agents, as a descriptive matter, generally decline to take the kinds of sweeping actions which satisfy one subgoal - however important - at the expense of trampling on numerous other subgoals through broad or expensive physical changes. This happens to be a major fragment of corrigibility. The classic example is that humans don't generally set their houses on fire in pursuit of a freshly-baked cake, even if they lack an oven. They might purchase one, or make a mug cake in a microwave, or go without for now, but for one to burn down something both valuable and crucial to numerous other subgoals is rightly considered deeply off-spec behavior.
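As a throwaway illustration of that intuition (and nothing more), here's a toy scoring sketch with entirely made-up subgoal weights and effect numbers. The point is just that an action which maxes out one subgoal while trampling the others loses to a mild, narrow action once every subgoal gets a vote.

```python
# Toy sketch (made-up numbers): why "sweeping" actions lose for a multi-subgoal agent.
# Each action is scored against every subgoal; an action that fully satisfies one subgoal
# while trampling the rest ends up with a worse total than a mild, narrow action.

subgoal_weights = {"fresh cake": 1.0, "keep shelter": 5.0, "keep savings": 3.0, "stay out of jail": 5.0}

# action -> how well it serves each subgoal, on a rough -1..+1 scale
actions = {
    "burn the house down to bake over the embers": {
        "fresh cake": +1.0, "keep shelter": -1.0, "keep savings": -1.0, "stay out of jail": -1.0,
    },
    "make a mug cake in the microwave": {
        "fresh cake": +0.6, "keep shelter": 0.0, "keep savings": 0.0, "stay out of jail": 0.0,
    },
    "go without cake for now": {
        "fresh cake": -0.2, "keep shelter": 0.0, "keep savings": 0.0, "stay out of jail": 0.0,
    },
}

for name, effects in actions.items():
    score = sum(subgoal_weights[goal] * effects[goal] for goal in subgoal_weights)
    print(f"{score:+6.1f}  {name}")
# The arson option scores worst despite fully satisfying the cake subgoal.
```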
Three mysteries present themselves: goal-switching, goal-aggregation, and ontological scale.
- The mystery of goal-switching is: how does a complex agent prioritize, trade off among, and swap between its subgoals? We might postulate fictitious subagents, as in shard theory, or point at Nash bargaining, but neither of these answers seems fully satisfactory: shard theory seems purely post hoc, even if it's useful for modeling, and Nash bargaining neglects to explain goal-switching events at all, talking at best of proportions of effort expended (a toy bargaining sketch follows this list).
- The mystery of goal-aggregation is: how do the singular goals of the simple agents that make up a complex agent themselves get aggregated into the goals of the complex agent? This represents an especially thorny problem given that some of the goals of complex agents seem not to correspond well to the kinds of goals simple agents have at all - let alone the goals of the specific subagents that we observe to make up the complex agent in question. Nothing about white blood cells seems to point towards a desire to read a good book.
- The mystery of ontological scale is: whether or not a thing should be treated as an agent seems to depend on its context and the scale that we view it at. Yeast might be a simple agent when viewed on its own scale, but humans treat it as a tool for turning sugars into some combination of carbon dioxide and ethanol all the time. A mouse might be a complex agent in its own right, but cornered with no hope of survival by a cat, it might as well be a simple agent at best.
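To make the Nash-bargaining complaint in the first mystery concrete, here's the promised toy sketch: two invented subagents, invented utility curves, and a brute-force search for the Nash solution. Notice what comes out: a static split of effort, not any story about when the complex agent actually switches from one subgoal to the other.

```python
# Toy Nash bargaining between two fictitious subagents (all numbers invented).
# Each subagent's utility depends on the share of the complex agent's effort it gets;
# the Nash solution maximizes the product of gains over the disagreement point.

def nash_bargain(u1, u2, d1=0.0, d2=0.0, steps=10_001):
    """Grid-search the effort split x in [0, 1] maximizing (u1(x) - d1) * (u2(x) - d2)."""
    best_x, best_product = None, float("-inf")
    for i in range(steps):
        x = i / (steps - 1)
        g1, g2 = u1(x) - d1, u2(x) - d2
        if g1 >= 0 and g2 >= 0 and g1 * g2 > best_product:
            best_x, best_product = x, g1 * g2
    return best_x


def read_utility(x):        # subagent A: "read a good book"; diminishing returns on effort
    return x ** 0.5


def exercise_utility(x):    # subagent B: "get some exercise"; more easily satisfied
    return 2 * (1 - x) ** 0.5


split = nash_bargain(read_utility, exercise_utility)
print(f"Nash split: {split:.2f} of effort to reading, {1 - split:.2f} to exercise")
# Output is a proportion of effort - exactly the complaint above: no goal-switching events.
```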
More complications present themselves. Consider that there exist pairs of entities where it's not perfectly clear which one is the agent, and the answer changes bistably and completely depending on your reasonable choice of point of view.
- Between humanity on the one hand and morphine and all its chemical cousins on the other, which is the agent and which not? Maybe the agent here is humanity - after all, humans, its components, seem to be complex agents, and they cultivate opium poppies and refine their sap for a variety of toolful purposes, from analgesia and cough suppression to pure hedonicity. But on the other hand, maybe the agent is the morphine family, the sum total of all such molecules. How else can we explain the bizarre abundance of one specific plant, or the inability of a simple tool like humanity to rid itself of a recurring source of harm, internal conflict, and degradation?
- Between the human population of a region and a river like the Yellow River or the Mississippi River, which is the agent and which not? Do we count humanity and its many sapient fingers and their clever hydraulic engineering works as the complex agent and the river as a tool at best, perhaps even a mere object or background feature? Or should we instead treat the river as a simple agent with the clear goal of "get to the sea as fast as possible" and humanity as some kind of annoying environmental feature? After all, in the river-favoring frame, the river still "wants" to take the quickest path to the sea, and will happily try to bash straight through whatever dams and levees it might encounter.
To change gears entirely, what should we make of conflict between agents or subjection of one agent by another? When opposed agents fight rather than more straightforwardly aggregating, how much sense does it make to treat the entire cartoon-metaphorical fight-cloud as an agent in its own right? And what of the possibility of subgoal-switching as modeled by conflict between subagents resulting in bargaining and movement to a Pareto-improved state, or to a values handshake?
For one final novel note, the viable systems model from cybernetics (the Ashby/Beer/Allende one, not the cyborg one) presents itself as a possible conceptual handle on agents in this frame - especially complex agents - though a full explanation of its account of the operation of autonomous systems is outside the scope of this humble post. We might suppose that within a complex agent, numerous simple agents (or even simpler complex agents) fill the roles of the component Systems One through Five: some classically pursue and actuate towards a singular (sub)goal, others have the goal of coordinating, task-switching among, overseeing, or supplying the other connected simple agents, and still others have the explicit goal of observation or goal-production.
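For what it's worth, here's my own toy mapping - an assumption of this research-creature, not anything lifted from Beer or from JSW - of the five VSM systems onto roles that simpler agents might fill inside a complex agent:

```python
# Toy sketch (my own framing): Beer's Systems 1-5 recast as roles
# filled by simpler agents inside a complex agent.

from dataclasses import dataclass


@dataclass
class Role:
    vsm_system: int
    name: str
    goal: str    # the (sub)goal the agent filling this role pursues


VSM_ROLES = [
    Role(1, "operations",      "actuate towards a single concrete (sub)goal"),
    Role(2, "coordination",    "damp conflicts and schedule the System 1 units"),
    Role(3, "oversight",       "allocate resources to and audit the System 1 units"),
    Role(4, "observation",     "watch the environment and model the future"),
    Role(5, "goal-production", "set identity and produce or retire top-level subgoals"),
]

for role in VSM_ROLES:
    print(f"System {role.vsm_system}: {role.name} -> {role.goal}")
```

In this toy framing, the goal-production role is where the goal-aggregation mystery from earlier would presumably have to live.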