When is an event surprising enough that I should be confused?

Today, I was reading Mistakes with Conservation of Expected Evidence. For some reason, I was under the impression that the post was written by Rohin Shah; but it turns out it was written by Abram Demski.

In retrospect, I should have been surprised that “Rohin” kept talking about what Eliezer says in the Sequences. I wouldn’t have guessed that Rohin was that “culturally rationalist” or that he would be that interested in what Eliezer said in the Sequences. And indeed, I was updating that Rohin was more of a rationalist, with more rationalist interests, than I had thought. If I had been more surprised, I could have noticed my surprise / confusion, and made a better prediction.

But on the other hand, was my surprise so extreme that it should have triggered an error message (confusion), instead of merely an update? Maybe this was just fine reasoning after all?

From a Bayesian perspective, I should have observed this evidence, and increased my credence in both Rohin being more rationalist-y than I thought, and also in the hypothesis that this wasn’t written by Rohin. But practically, I would have needed to generate the second hypothesis, and I don’t think that I had strong enough reason to.
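To make the Bayesian bookkeeping concrete, here is a minimal sketch of that update. The three hypotheses, the priors, and the likelihoods are all made-up illustrative assumptions, not numbers from the actual situation; the point is only to show how one observation can raise credence in two competing explanations at once.

```python
# Toy Bayesian update. Every number here is an illustrative assumption.
PRIORS = {
    "rohin_as_expected": 0.80,       # Rohin wrote it and is about as rationalist-y as I thought
    "rohin_more_rationalist": 0.19,  # Rohin wrote it and is more rationalist-y than I thought
    "not_rohin": 0.01,               # someone else wrote it
}

# Assumed probability of "the author keeps citing the Sequences" under each hypothesis.
LIKELIHOODS = {
    "rohin_as_expected": 0.05,
    "rohin_more_rationalist": 0.60,
    "not_rohin": 0.30,
}


def posterior(priors, likelihoods):
    """Apply Bayes' rule: multiply prior by likelihood, then normalize."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


if __name__ == "__main__":
    for hypothesis, credence in posterior(PRIORS, LIKELIHOODS).items():
        print(f"{hypothesis}: {PRIORS[hypothesis]:.2f} -> {credence:.3f}")
```

With these particular assumed numbers, both “more rationalist-y” and “not written by Rohin” gain credence, but the second stays small in absolute terms because its prior was small. Different assumptions would tell a different story, which is part of the puzzle.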

I feel like there’s a semi-interesting epistemic puzzle here. What’s the threshold for a surprising enough observation that you should be confused (much less notice your confusion)?

First conclusions from reflections on my life

I spent some time over the past weekend reflecting on my life, over the past few years. 

I turned 28 a few months ago. The idea was that I wanted to reflect on what I’ve learned from my 20s, but do it a few years early, so that I can start doing better sooner rather than later. (In general, I think doing post mortems before a project has ended is underutilized.) 

I spent far less time than I had hoped. I budgeted 2 and a half days free from external distractions, but ended up largely overwhelmed by a major internal distraction. I might or might not do more thinking here, but for starters, here are some of my conclusions.

For now I’m going to focus on my specific regrets: things that I wish I had done differently, because I would be in a better place now if I had done them. There are plenty of things that I was wrong about, or mistakes that I made, which I don’t have a sense of disappointment in my heart about, because those mistakes were the sort of thing that either did help or could have helped propel me forward. But all of the things I list here held me back. I am worse today than I might have been, in a very tangible-to-me way, because of these errors.

I wish that I had made more things

I wish that, when I look over my life from this vantage point, it was “fuller”: that I could see more things that I accomplished, more specific value that my efforts produced.

I spent huge swaths of time thinking about a bunch of different things over the years, or planning / taking steps on various projects, but they rarely reached fruition. Like most people, I think, my history is littered with places where I started putting effort into something, but now have nothing to show for it.

This seems like a huge waste. 

I was looking through some really rough blog posts that I wrote in 2019 (for instance, this one, which rather than being any refined theory, is closer to a post mortem on a particular afternoon of work). And to my surprise, they were concretely helpful to me, more helpful to me than any blog post that I’ve read by someone else in a while. Past-Eli actually did figure out some stuff, but somehow, I had forgotten it.

I spend a lot of time thinking, but I think that if I don’t produce some kind of artifact, the thinking that I do is not just not shared with the world, but is lost to me. Creating something isn’t an extra step, it’s the crystallization of the cognitive work itself. If I don’t create an artifact, the cognitive work is transient; it leaves no impression on me or the world. It might as well not have happened.

And aside from that, I would feel better about my life now, if instead of a bunch of things that I thought about, there were a bunch of blog posts that I published, even if they were in shitty draft form. To the extent that I can look over my life and see a bunch of (small, bad) things that I did, I feel better about my life.

I would feel much better if every place where I had had a cool idea, I had made something, and I could look over them all, and see what I had done.

Going forward, I’m going to hold to the policy that every project should have a deliverable, even if it is something very simple: a shitty blogpost, a google doc, a test session, an explanation of what I learned (recorded, and posted on youtube), an MVP app.

And in support of this, I also want to have a policy that as soon as I feel like I have something that I could write up, I do that immediately, instead of adding it to my todo list. Often, I’ll do some thinking about something, and have the sketch of how to write it up in my head, but there’s a little bit of activation energy required to sit down and do it, and I have a bunch of things on my plate (including other essays that I want to write). But then I’ll wait too long, and by the time I come back to it, it doesn’t feel alive anymore.

This is what happened with some recent thinking that I did about ELK, for instance. I did clarify some things for myself, and intended to write it up, but by the time I went to do that, it felt stale. And so an ELK weekend that I participated in a while back is one more project where I had new thoughts but mostly nothing will come of them.

For this reason, I’m pushing myself to write up this document, right now. I want to create some crystallization of the meager thinking that I did when reflecting on my life, that puts a stake in the ground so that I don’t realize some things, and then just forget about them.

I wish that I made a point to write down the arguments that I was steering by

From 2018 to early 2020, I did not pursue a project that seemed to me like the obvious thing for me to be doing, because of a combination of constraints involving considerations of info security, some philosophy of uncertainty problems, and underlying both of those, some ego-attachments. I was instead sort of in a holding pattern: hoping/planning to go forward with something, but not actually taking action on it.

[I don’t want to undersell the ego stuff as my just being unvirtuous. I think it was tracking some things that were in fact bad, and if I had had sufficient skill, I could have untangled it, and had courage and agency. But I can’t think of what straightforward policy would have allowed me to do that, given the social context that I was in.]

In retrospect the arguments that I was steering my life by were…just not very good. I think if I had made a point to write them up, to clarify what I was doing, and why I was doing it, this would have caused me to notice that they didn’t really hold up. 

If for no other reason than that I would share my google docs, and people would argue against my points.

And in any case, I had the intention at the time of orienting to those arguments and trying to do original applied philosophy to find solutions, or at least better framings, for those problems. And I did this a teensy-weensy bit, but I didn’t make solid progress. And I think that I could have. And the main thing that I needed to do was actually write up what I was thinking so I could build on it (and secondarily, so other people could comment on it).

(I’m in particular thinking about some ideas I had in conversation with Scott G at an MSFP. There was a blog-post writing day during that workshop, and I contemplated writing it up (I think I actually had a vague intention to write it up sometime), but didn’t because I was tired or something.)

And I think this has been pretty generically true. A lot of my sense of what’s important or how things work seems to have drifted along a seemingly random walk, instead of being a series of specific updates for reasons.

After I panic-bought, during covid, I made a policy that I don’t move money without at least writing up a one-pager explaining what I’m doing and what my reason for doing it is. This allows me to notice if my reason is stupid (“I just saw some news articles and now I’m panicked”) and it allows me to reflect later on my actual thought process, not just the result of my thought process. (Come to think of it, I think it might be true that my most costly financial decision ever might be the only other time that I didn’t follow this policy! I should double check that!)

I think I should have a similar policy here. Any argument or consideration that I’m steering my life by, I should write up as a google doc, with good argumentative structure.

The thing that I need to implement this policy is a trigger: what would cause me to notice the arguments that I’m steering my life by?

I wish I had recorded myself more

[inspired by this tweet]

When I was younger, it was important to me to meet my wife early, so that she could have known me when I was young, to understand what I was like and where I grew from. 

I’ve recently started dating someone, and I wish she was able to know what younger Eli was like. She can read my writing, but the essays and diary entries that I wrote are low bandwidth for getting a sense of a person. 

If I had made a vlog or something, I would have lots and lots of video to watch which would help her get a sense of what I was like.

Similarly, for if I ever have kids, I would like them to be able to know what I was like at their age. 

Furthermore, I spent some time over the past day listening to audio recordings that I made over the last decade. I was shocked by the samples of the way my younger self was, and I wish that I had more of those recorded to compare against.

I feel like I’ve sort of permanently missed the boat on this one. I’ve permanently lost access to some information that I wish I had. But I have a heuristic on a much smaller scale: if I’m in a conversation, and I have the thought “I wish I had been recording this conversation”, I start recording right then. It seems like this same heuristic should apply at the macro scale: if I have the thought “I wish I had been regularly recording myself 10 years ago”, I should start doing that now.

I wish that I did more things with discrete time boxes, so that I could notice that I failed

There were very few places where I concretely failed at something, and drew attention to the fact that I failed. As noted, there were lots and lots of projects that never reached fruition, but mostly I just punted on those, intending to continue them. If I had a bad day, I was often afraid to cut my losses and just not do the thing that I had hoped to do.

There are lots of skills that I planned to learn, and then I would attempt (usually in an unfocused way) to learn them in some amount of time, and at the end of that period of time I would not have made much progress. But I would implicitly move out my timeline for learning those things; my failing to make progress did not cause me to give up or allow me to consider not making that skill a part of me at some point. I allowed myself to keep punting my plans to the indefinite future.

This was probably self-fulfilling. Since I knew that if I failed to do or learn something in the short term, I wouldn’t actually count that as a failure in any meaningful sense (I would still be planning to get it somehow), I wasn’t really incentivized to do or learn the thing in that short term.

I think that one thing that would have helped me was planning to do things on specific time horizons (this weekend, this week, this month, whatever), and scheduling a post mortem, ideally with another person, on my calendar, at the end of that time horizon.

Now, I don’t think that this would have worked directly. I think I still would have squandered that time, or made much slower progress than I hoped. But by having a crisp demarcation of when I wanted to have a project completed by, scheduled in such a way that I can’t just explain it away as no longer relevant (because I made less progress than I had hoped to make by the time it came around), I would more concretely notice and orient to the fact that something that I had tried to do hadn’t worked. And then I could iterate from there.

I intend to do this going forward. Which concretely means that I should look over my current projects, and timebox out at least one of them, and schedule with someone to postmortem with me.

I should have focused on learning by doing

Most of what I have tried to do over the past decade is acquire skills.

This has not been wholly unsuccessful. I do in fact now possess a number of specific skills that most people don’t, and I have gone from broadly incompetent (but undaunted) to broadly competent, in general.

But most of the specific skill learning that I tried to do seems to have been close to fruitless. Much of what I learned I learned in the process of just working on direct projects. (Though not all of it! I’ve recently noticed how much of my emotional communication and facilitation skills are downstream of doing a lot of Circling, and, I guess, from doing SAS in particular).

I think that I would have done much better to focus less on building skills and to focus more on just doing concrete things that seemed cool to me.

(And indeed, I knew this at the time, but didn’t act on it, for reasons related to “choosing projects felt like choosing my identity”, and maybe a general thing of not taking my obvious known mistakes seriously enough, and maybe something else.)

I’m going to have a firm rule for the next six months: I’m allowing myself to still try to acquire skills, but this is always to be in the context of a project in which I am building something:

Paternalism is about outrage

I’m listening to the Minds Almost Meeting podcast episode on Paternalism.

I think Robin is missing or misemphasizing something that is central to the puzzle that he’s investigating. Namely, I think most regulation (or most regulation that is not rooted in special interest groups creating moats around their rent streams), is made not with a focus on the customer, but rather with a focus on the business being regulated.

The psychological-causal story of how most regulation comes to be is not that the voter reflects on how to help the customer make good choices, and concludes that it is best to constrain their options. Instead the voter hears about or imagines a situation in which a company takes advantage of someone, and feels outraged. There’s a feeling of “that shouldn’t be allowed”, and that the government should stop people from doing things that shouldn’t be allowed.

Not much thought is given to the consideration that you might just inform people to make better choices. That doesn’t satisfy the sense of outrage at a powerful party taking advantage of a weaker party. The focus of attention is not on helping the party being taken advantage of, but on venting the outrage.

What You See Is All There Is, and the question of “what costs does this impose on other people in the system, who might or might not be being exploited”, doesn’t arise.

Most regulation (again, aside from the regulation that is simple rent-seeking) is the result of this sort of thing:

Thinking about how to orient to a hostile information environment, when you don’t have the skills or the inclination to become an epistemology nerd

Successfully propagandized people don’t think they’ve been propagandized; if you would expect to feel the same way in either case, you have to distinguish between the two possibilities using something other than your feelings.

Duncan Sabien

I wish my dad understood this point.

But it’s pretty emotionally stressful to live in a world where you can’t trust your info streams and you can’t really have a grasp on what’s going on.

Like, if I tell my dad not to trust the New York Times, because it will regularly misinform him, and that “science” as in “trust the science” is a fake buzzword, about as likely to be rooted in actual scientific epistemology as not, he has a few reactions. But one of them is “What do you want me to do? Become a rationalist?”

And he has a point. He’s just not going to read covid preprints himself, to piece together what’s going on. That would take hours and hours of time that he doesn’t want to spend, it would be hard and annoying and it isn’t like he would have calibrated Bayesian takes at the end.

(To be clear, I didn’t do that with covid either, but I could do it, at least somewhat, if I needed to, and I did do little pieces of it, which puts me on a firmer footing in knowing which epistemic processes to trust.)

Given that he’s not going to do that, and I don’t really think that he should do that, what should he do?

One answer is “just downgrade your confidence in everything. Have a blanket sense of ‘actually, I don’t really know what’s going on.’ ” A fundamental rationalist skill is not making stuff up, and saying “I don’t know.” I did spend a few hours trying to orient on the Ukraine situation, and forcing myself to get all the way to the point of making some quantitative predictions (so that I have the opportunity to be surprised, and notice that I am surprised). But my fundamental stance is “I don’t understand what’s going on, and I know that I don’t understand. (Also here are some specific things that I don’t know.)”

…Ok. Maybe that is feasible. It’s pretty hard to live in a world where you fundamentally don’t know what’s happening, where people assume you have some tribal opinion about stuff and your answer is “I don’t know, I think my views are basically informed by propaganda, and I’m not skilled enough or invested enough to try to do better, so I’m going to not believe or promote my takes.”

But maybe this becomes easier if the goal of your orientation in the world is less to have a take on what’s going on, but is instead to prioritize uncertainties: to figure out which questions seem most relevant for understanding, so that you have _some_ map to orient from, even if it is mostly just a map of your uncertainty.

Some un-edited writeups of conversations that I’ve had with Ben Hoffman, and Jessica Taylor, and Michael Vassar

Extended Paraphrase of Ben and Jessica’s general view [December 2019]

Eli’s Summary of a Conversation with Vassar [October 2020]

Overview of my conversation with Vassar on Wednesday Feb 10: Trauma patterns [February 2021]

Eli’s notes from Spencer’s interview with Vassar [February 2021]

Parsing Ben and Michael’s key points from that giant twitter conversation [May 2021]

Two interlocking control systems

When I was practicing touch typing I found that much of the skill was a matter of going as fast as I could, without letting my speed outpace my accuracy. If I could feel that the precision of my finger placements was high, I would put more “oomph” into my typing, pushing harder to go faster. 

But I would often fall into an attractor of “rushing” or “going off the rails”, where I was pushing to go fast in a way that caused my accuracy to fall apart, and I started to “trip over myself”. I made a point to notice this starting to happen and then intentionally slow down (and relax my shoulders) to focus on the precision of my finger placements. The goal was never to rush (because that is counterproductive), but to go as fast as possible within that constraint.

I think there might be an analogous thing in my personal productivity. 

When I have a largish amount to get done in a short amount of time, this can be energizing and motivating. My physiological arousal is higher. My personal tempo is faster. There’s a kind of energy or motivation that comes from having things that need to get done, with deadlines, and it boots me up into a higher energy orbital, where my default mental actions are geared towards making progress, instead of random “I don’t feel like it” sort of laziness. There’s a bit of a tailwind behind me.

(Indeed, this kind of pressure is exactly what was missing for most of 2020.)

However, sometimes this pressure gets overwhelming, and my intentionality collapses. It’s too much. Either I don’t have the spaciousness to let my attention fully engage with any given task (which is usually necessary for making progress) because of the competing goal threads, instead only managing a shallow superficial attention, or I’ll get overwhelmed and opt out of all of it by distracting myself.

There’s this important principle that I never want my tailwind to outpace my structure. Having some amount of pressure speeding me along is great, but only if my intentionality is high enough to still absorb everything that’s coming at me, taking in the input of what’s important, orienting to it, and taking action on it.

Too much tailwind and that intentionality collapses.

Which means that I need a control system that keeps those two metrics in sync. I need to notice when my intentionality is starting to collapse, and take actions to slow things down and to shore up my intentionality. 

However, my intentionality can collapse for another reason, other than getting outpaced by motivation-pressure. It also collapses when I’m low on energy and alertness.

My intentionality depends on my personal energy and alertness. When my energy and alertness is depleted, the inner structure of my intentionality tends to collapse. 

(There are some caveats here. For one thing, it is possible to maintain intentionality in a low energy state. Also, I can sometimes depend on external structure as a substitute for intentionality, and external structure depends much less on my personal energy and alertness. But to a first approximation, low energy -> low intentionality.)

As a consequence of this, the control system maintaining my intentionality propagates back to an earlier control system maintaining my energy level. I want to notice when my energy is flagging, when I’m just starting to run on fumes, and take action to shore up my energy, before my intentionality collapses.

Furthermore, because my personal energy and alertness is at the bottom of the stack, a lot of my energy and alertness maintenance is not structured as a control system. I employ strategies to get good sleep, and to exercise every day, independently of my current energy level, because high energy is self sustaining.

Having some practices that are “foundational” rather than implemented as control systems is costlier, because it means that I’ll sometimes engage in them when they are not strongly necessary. But foundational systems are more robust: they have more slack in the system to absorb perturbations.
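As a rough sketch of how the pieces above fit together, here is one way to write the stack down as code. The 0-to-1 scales, the threshold values, and every name below are hypothetical choices for illustration only; the real thing is a matter of noticing, not of numbers.

```python
from dataclasses import dataclass


@dataclass
class State:
    energy: float          # 0.0 (depleted) to 1.0 (fully rested); assumed scale
    intentionality: float  # 0.0 (collapsed) to 1.0 (fully intact); assumed scale
    pressure: float        # 0.0 (no deadlines) to 1.0 (overwhelming); assumed scale


# Hypothetical thresholds, chosen only for illustration.
ENERGY_FLOOR = 0.4
INTENTIONALITY_FLOOR = 0.5

# Foundational practices: done every day, independent of the current state.
FOUNDATIONAL = ["protect sleep", "exercise"]


def plan_actions(state):
    """Return the list of corrective actions suggested by the current state."""
    actions = list(FOUNDATIONAL)
    # Earlier control system: energy sits lower in the stack, so check it first.
    if state.energy < ENERGY_FLOOR:
        actions.append("rest and recover energy")
    # Later control system: never let the tailwind outpace the structure.
    if state.intentionality < INTENTIONALITY_FLOOR:
        actions.append("slow down and shore up intentionality")
    elif state.pressure > state.intentionality:
        actions.append("ease off pressure before intentionality collapses")
    return actions


if __name__ == "__main__":
    print(plan_actions(State(energy=0.3, intentionality=0.7, pressure=0.9)))
```

The design point this is meant to show is the asymmetry: the two control loops only fire when a reading crosses a threshold, while the foundational practices are appended unconditionally, which is what makes them costlier but more robust.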

Aversions inhibit slow focus

I’ve written elsewhere about how the biggest factor in my personal productivity is aversions, and skillfully engaging with aversions. It’s maybe not surprising that having an aversion to a task is relevant to effectively executing on that task. But it is a bit more surprising that having an aversion to some task or consideration makes it much, much less likely that I’ll effectively execute on anything.

The key insight, I think, is that engaging deeply in a task entails clearing some mental space.

Aversion to something increases my compulsiveness / distractibility. I’m more likely to take a bathroom break, or to make food for myself, or to reread old blog posts on my phone (without jotting down my thoughts in the way that makes reading more productive / creative), or to go check twitter and then get stuck in the twitter loop.[1]

I think this is because I’m feeling some small constant pain, and part of me is compulsively seeking positive stimulation to distract from the pain. Basically holding an aversion makes me more reactive to stray thoughts and affordances of the environment. My immediate actions are driven by a (subtle, but nevertheless dominating) clawing, grasping, drive for positive sensation, instead of flowing from “my deep values”, my sense of what seems cool or alive. 

Most, but not all, forms of creative work, involve making mental space, quieting those distractions so that I can give my full attention to the thing that I’m trying to do. The reason why aversions kill my productivity is that my compulsive stimulation-hunger is too graspy to settle down into any long-threaded thought. That part of me doesn’t want to be still, because it is seeking distraction from the sensation in me.

(The exception is some forms of work that “fit” this compulsiveness, where I can get sucked into compulsively doing some task as a way to distract from the sensation in my body. Sometimes an essay is of the right shape that it can be a hook in just the right way, but most of my work is not like this.)

Generally, when I notice an aversion, I’ll engage with it directly, either by sitting down and meditating, feeling into the sensation in a non-semantic way, or by doing focusing / journaling, which is more of a semantic “dialogue”, or something that is a mix of both approaches.

In doing this, I’m first just trying to make space for the sensation, to feel it without distraction, while also being welcoming towards the part of me that is doing the dissociation, and secondly hoping to get more understanding and context, so that I can start planning and taking action regarding the underlying concern of the aversion.

[1] I found myself doing all of these except the last one today, all the while vaguely / liminally aware of the agitation clench in my belly, before I sat down to engage with it directly.

On category I and category II problems

Google doc version, which is better for inline comments.

Please note that while I make claims about what I understand other people to believe, I don’t speak for them, and might be mistaken. I don’t represent MIRI or CFAR, and they might disagree. The opinions expressed are my own.

Someone asked me why I am broadly “pessimistic”. This post, which is an articulation of an important part of my world view, came out of my answer to that question.

Two kinds of problems: 

When I’m thinking about world scale problems, I think it makes sense to break them down into two categories. These categories are nebulous (like all categories in the real world), but I also think that these are natural clusters, and most world scale problems will clearly be in one or the other.

Category I Problems

If I ask myself “Will factory farming have been eliminated 100 years from now?”, I can give a pretty confident “Yes.”

It looks to me that we have a pretty clear understanding of what needs to happen to eliminate factory farming. Basically, we need some form of synthetic meat (and other animal products) that is comparable to real meat in taste and nutrition, and that costs less than real meat to produce. Once you have that, most people will prefer synthetic meat to real meat, demand for animal products will drop off markedly, and the incentive for large scale factory farming will vanish. At that point the problem of factory farming is basically solved.

Beyond and Impossible meat are already pretty good, maybe even good enough to satisfy the typical consumer. That gives us proof of concept. So eliminating factory farming is almost just a matter of making the cost of synthetic meat < the cost of real meat.

There’s an ecosystem of companies (notably Impossible and Beyond) that are iteratively making progress on exactly the problem of driving down the cost. Possibly, it will take longer than I expect, but unless something surprising happens, that ecosystem will succeed eventually.

In my terminology, I would say that there exists a machine that is systematically ratcheting towards a solution to factory farming. There is a system that is making incremental progress towards the goal, month by month, year by year. If things continue as they are, and there aren’t any yet-unseen fundamental blockers, I have strong reason to think that the problem will be fully solved eventually.

Now, this doesn’t mean that the problem is already solved. Figuring out ways to make this machine succeed marginally faster is still really high value, because approximately a billion life-years are spent in torturous conditions in factory farms every year. If you can spend your career speeding up the machine by six months, that prevents at least half a billion life-years of torture (more than that if you shift the rate of progress instead of just translating the curve to the left).
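The arithmetic behind that figure is simple enough to spell out. This is a back-of-the-envelope sketch, assuming the annual figure stays roughly constant until the problem is solved; the function name and the constant are just labels for the numbers already mentioned.

```python
# Rough annual figure from the text: about a billion life-years per year.
LIFE_YEARS_PER_YEAR = 1_000_000_000


def life_years_prevented(years_sped_up, life_years_per_year=LIFE_YEARS_PER_YEAR):
    """Life-years avoided if the solution arrives `years_sped_up` years earlier.

    Assumes (for simplicity) that the annual figure is roughly constant
    up to the point where the problem is solved.
    """
    return years_sped_up * life_years_per_year


if __name__ == "__main__":
    # Speeding the machine up by six months:
    print(life_years_prevented(0.5))  # 500000000.0
```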

Factory farming is an example of what (for lack of a better term), I will call a “category I” problem. 

Category II Problems

In contrast, suppose I ask “Will war have been eliminated 100 years from now?” To this question, my answer has to be “I don’t know, but probably not?”

That’s not because I think that ending war is impossible. I’m pretty confident that there exists some set of knowledge (of institution design? of game theory? of diplomacy?) with which we could construct a system that robustly avoided war, forever.

But in my current epistemic state, I don’t even have a sketch of how to do it. There isn’t a well specified target that if we hit it, we’ll have achieved victory (in the way that “cost of synthetic meat < cost of real meat” is a victory criterion).  And there is no machine that is systematically ratcheting towards progress on eliminating war.

That isn’t to say that there aren’t people working on all kinds of projects which are trying to help with the problem of “war”, or reducing the incidence of war (peace talks, education, better international communication, what have you). There are many people, in some sense, working hard at the problem. 

But their efforts don’t cohere into a mechanism that reliably produces incremental progress towards solving the problem. Or said differently, there are things that people can do that help with the problem on the margin, but those marginal improvements don’t “add up to” a full solution. It isn’t the case that if we do enough peace talks and enough education and enough translation software, war will be permanently solved. 

In a very important sense, humanity does not have traction on the problem of ending war, in the way that it does have traction on the problem of ending factory farming.

Ending war isn’t impossible in principle, but there are currently no levers that we can pull to make systematic progress towards a solution. We’re stuck doing things that might help a little on the margin, but we don’t know how to set up a system such that if that machine runs long enough, war will be permanently solved. 

I’m going to call problems like this, where there does not exist a machine that is making systematic progress, a “category II” problem.

Category I vs. Category II

Some category I problems: 

  • Factory Farming
  • Global Poverty
  • “Getting off the rock” (but possibly only because Elon Musk was born, and took personal responsibility for it)
  • Getting a computer to almost every human on earth
  • Legalizing / normalizing gay marriage 
  • Eradicating Malaria
  • Possibly solving aging?? (and if so, probably only because a few people in the transhumanism crowd took personal responsibility for it)

Some category II problems:

  • War
  • AI alignment
  • Global coordination
  • Civilizational sanity / institutional decision making (on the timescale of the next century)
  • Civilizational collapse
  • Achieving stable, widespread mental health
  • Political polarization
  • Cost disease

(If you can draw a graph that shows the problem more-or-less steadily getting better over time, it’s a category I problem. If progress is being made in some sense, but progress happens in stochastic jumps, or it’s non-obvious how much progress was made in a particular period, it’s a category II problem.)
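One toy way to operationalize the graph heuristic, for what it’s worth: given a series of progress measurements, check whether most periods show improvement and whether any single jump accounts for most of the total. The function name and both thresholds are arbitrary assumptions, and this says nothing about the “non-obvious how much progress was made” case, where you can’t draw the graph at all.

```python
def looks_like_category_one(progress, min_steady_fraction=0.8, max_jump_share=0.5):
    """Toy check: does a progress series improve more-or-less steadily?

    progress: measurements at equal time intervals, higher = closer to solved.
    min_steady_fraction: assumed minimum share of periods that must show improvement.
    max_jump_share: assumed maximum share of total progress from any one period.
    """
    steps = [later - earlier for earlier, later in zip(progress, progress[1:])]
    total = sum(steps)
    if not steps or total <= 0:
        return False  # no net progress, so no visible ratchet
    steady_fraction = sum(1 for step in steps if step > 0) / len(steps)
    jump_share = max(steps) / total
    return steady_fraction >= min_steady_fraction and jump_share <= max_jump_share


if __name__ == "__main__":
    steady = [0, 1, 2, 3, 4, 5]   # incremental progress every period
    jumpy = [0, 0, 0, 5, 5, 5]    # all progress arrives in one stochastic jump
    print(looks_like_category_one(steady), looks_like_category_one(jumpy))
```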

In order for factory farming to not be solved, something surprising, from outside of my current model, needs to happen. (Like maybe technological progress stopping, or a religious movement that shifts the direction of society.)

Whereas in order for war to be solved, something surprising needs to happen. Namely, there needs to be some breakthrough, a fundamental change in our understanding of the problem, that gives humanity traction on the problem, and enables the construction of a machine that can make systematic, incremental progress.  

(Occasionally, a problem will be in an in-between category, where there doesn’t yet exist a machine that is making reliable progress on the problem, but that isn’t because our understanding of the shape of the problem is insufficient. Sometimes the only reason there isn’t a machine doing this work is that no person or group of sufficient competence has taken heroic responsibility for getting a machine started. 

For instance, I would guess that our civilization is capable of making steady, incremental progress on the effectiveness of cryonics, up to the point of cryonics being a reliable, functional technology. But progress on “better cryonics” is mostly stagnant. I think that the only reason there isn’t a machine incrementally pushing on making cryonics work is that no one (or at least no one of sufficient skill) has taken it upon themselves to solve that problem, thereby jumpstarting a machine that makes steady incremental progress on it. [ 1 ]

It is good to keep in mind that these category 1.5 problems exist, but they mostly don’t bear on the rest of this analysis.)

Maybe the most important thing: Category I problems get solved as a matter of course. Category II problems get solved when we stumble into a breakthrough that turns them into category I problems.

Where in this context, a “breakthrough” is “some change (either in our understanding (the map) or in the world (the territory)) that causes the shape of the problem to shift, such that humanity can now make reliable systematic progress towards a solution, unless something surprising happens.”

Properties of Category I vs. Category II problems

| Category I problems | Category II problems |
| --- | --- |
| There exists a “machine” that is making systematic, reliable progress on the problem | There isn’t a “machine” making systematic, reliable progress on the problem, and humanity doesn’t yet know how to make such a machine |
| Marginal improvements can “add up” to a full solution to the problem | Marginal improvements don’t “add up” to a full solution to the problem |
| The problem will be solved, unless something surprising happens | The problem won’t be solved until something surprising happens |
| We’re mostly not depending on luck | We’re substantially depending on luck |
| Progress is usually incremental | Progress is usually stochastic |
| Progress is pretty “concrete”; it is relatively unlikely that some promising project will turn out to be completely useless | Progress is “speculative”; it is a live possibility that any given pieces of work that we consider progress will later prove completely useless in light of better understanding |
| The bottleneck to solving the problem is the machine going better or faster | The bottleneck to solving the problem is a breakthrough, defined as “some shift (either in our map or in the territory) that changes the shape of the problem enough that we can make reliable systematic progress on it” |
| Solving the problem does not require any major conceptual or ontological shifts; progress consists of many, constrained, engineering problems | Our understanding of the problem, or ontology of the problem, will change at least once, but most likely many times, on the path to a full solution |
| There might be graphs that show the problem getting better, more-or-less steadily, over time. | It’s quite hard to assess how much progress is made in a given unit of time, or even if exciting “milestones” actually constitute progress |
| We know how to train people to fill roles in which they can reliably contribute to progress on the problem, mostly what is needed is effective people to fill those roles | We have only shaky and tenuous knowledge of how to train people to make progress on the problem; mostly what’s needed is people who can figure out for themselves how to get orientation for themselves |
| If all the relevant inputs were free, the problem would be solved or very close to solved. (eg with a perpetual motion machine that produced arbitrarily large amounts of the ingredients to impossible meat, factory farming would be over) | If the inputs were free, this would not solve the problem (eg with a hypercomputer, AI safety would not be solved) |

Table: Properties of Category I vs Category II problems

Luck and Deadlines

In general, I’m optimistic about category I problems, and pessimistic about category II problems (at least in the short term). 

And crucially, humanity doesn’t know how to systematically make progress on intentionally turning a given category II problem into a category I problem. 

We’re not hopeless at it. It is not completely a random walk. Individual geniuses have sometimes applied themselves to a problem that we were so confused about as to not have traction on it, and produced a breakthrough that gives us traction. For instance, Newton giving birth to mathematicized physics. [Note: I’m not sure that this characterization is relevantly correct.]

But overall, when those breakthroughs occur, it tends to be in large part due to chance. We mostly don’t know how to make breakthroughs, on particular category II problems, on demand. 

Which is to say, any given problem’s transition from II to I depends on luck. 

And unfortunately, it seems like some of the category II problems I listed above 1) are crucial to the survival of civilization, and 2) have deadlines.

From my epistemic vantage point, it looks like if we don’t solve some subset of those problems before some unknown deadline (possibly as soon as this century), we die. That’s it. Game over.

Human survival depends on solving some problems for which we currently have no levers. There is nothing that we can push on to make reliable, systematic progress. And there’s no machine making reliable, systematic progress.

Absent a machine that is systematically ratcheting forward progress on the problem, there’s no strong reason to think that it will be solved. 

Or to state the implicit claim more explicitly: 

  1. Large scale problems are solved only when there is a machine incrementally moving towards a solution. 
  2. There are a handful of large scale problems that seem crucial to solve in the medium term. 
  3. There aren’t machines incrementally moving towards solutions to those problems.

So by default, unless something changes, I expect that those problems won’t be solved.

On AI alignment

There are people who dispute that AI risk is a category II problem, and they are accordingly more optimistic. I believe that Rohin Shah and Paul Christiano both think that there’s a pretty good chance that business-as-usual AI development will solve alignment as a matter of course, because alignment problems are on the critical path to making functioning AI. 

That is, they think that there is an existing machine that is solving the problem: the AI/ML field itself.

If I understand them correctly, they both think that there is a decent chance that their EA-focused alignment work won’t have been counterfactually relevant in retrospect, but it still seems like a good altruistic bet to lay groundwork for alignment research now. 

In the terms of my ontology here, they see themselves as shoring up the alignment-progress machine, or helping it along with some differential progress, just in case the machine turns out to have been inadequate to the task of solving the problem before the deadline. They think that there is a substantial chance that their work will, in retrospect, turn out to have been counterfactually irrelevant, but because getting the AI alignment problem right seems really high leverage for the value of the future, it is a good altruistic bet to do work that makes it more likely that the machine will succeed, on the margin. 

This is in marked contrast to how I imagine the MIRI leadership is orienting to the problem: When they look at the world, it looks to them like there isn’t a machine ratcheting towards safety at all. Yes, there are some alignment-like problems that will be solved in the course of AI development, but largely by patches that invite nearest-unblocked strategy problems, and which won’t generalize to extremely powerful systems. As such, MIRI is making a desperate effort to make or to be a machine that ratchets toward progress on safety.

I think this question of “is there a machine that is ratcheting the world towards more AI safety”, is one of the main cruxes between the non-MIRI optimists, and the MIRI-pessimists, which is often overshadowed by the related, but less crucial question of “how sharp will takeoff be?”

On “rationality”

Over the past 3 years, I have regularly taught at AIRCS workshops. These are mainly a recruitment vehicle for MIRI, run as a collaboration between MIRI and CFAR.

At AIRCS workshops, one thing that we say early on is that AI safety is a “preparadigmatic field”, which is more or less the same as saying that AI alignment is a category II problem. AI safety as a field hasn’t matured to the point that there are known levers for making progress.

And, we explain, we’re going to teach some rationality techniques at the workshop, because those are supposed to help one orient to a preparadigmatic field. 

Some people are skeptical that these funny rationality methods are helpful at all (which, to be clear, I think is a quite reasonable position). Other people give the opposite critique, “it seems like clear thinking and effective communication and generally making use of all your mind’s abilities, is useful in all fields, not just preparadigmatic ones.”

But this is missing the point slightly. It isn’t so much that these tools are particularly helpful for preparadigmatic fields, it’s that in preparadigmatic fields, this is the best we can provide.

More developed fields have standard methods that are known to be useful. We train aspiring physicists in calculus, because we have ample evidence that calculus is an extremely versatile tool for making progress on physics, for example. [another example would be helpful here]

We don’t have anything like that for AI safety. There are not yet standard tools in the AI safety toolbox that we know to be useful and that everyone should learn. We don’t have that much traction on the problems.

So as a backstop, we teach very general principles of thinking and problem solving, as a sort of “on ramp” to thinking about your own thinking and how you might improve your own process. The hope is that this will translate into skill in getting traction on a confusing domain that doesn’t yet have standard methods.

When you’re flailing, and you don’t have any kind of formula for making research progress, it can make sense to go meta and think about how to think about how to solve problems. But if you can just make progress on the object level, you’re almost certainly better off doing that.

People sometimes show up at AIRCS workshops expecting us to give them concrete technical problems that they can try and solve, and are sometimes discouraged that instead we’re teaching these woo-y or speculative psychological techniques. 

But, by and large, we DON’T have concrete, well-specified, technical problems to solve (note that this is a somewhat contentious claim, see the bit about AI safety above). The work that needs to be done is something like “wandering around in one’s confusion in such a way that one can crystallize well-specified technical problems.” And how to do that is very personal and idiosyncratic: we don’t have systematized methods for doing that, such that someone can just follow the instructions and get the desired result. But we’ve found that the woo-y tools seem to give people new levers and new perspectives for figuring out how to do this for themselves, so that’s what we have to share.

As a side note: I have a gripe that “rationality” has come to conflate two things: there’s “how to make progress on natural philosophy when you don’t have traction on the problem” and, separately, there’s “effective decision-making in the real world”. These two things have some overlap, but they are really pretty different. And I think that development on both of them has been hurt by lumping them under one label. 


If I were to offer a critique of Effective Altruism it would be this: EA in general doesn’t distinguish between category I and category II problems. 

Of course, this doesn’t apply to every person who is affiliated with EA. But many EAs, and especially EA movement builders, implicitly think of all problems as category I problems. That is, they are implicitly behaving as if there exists a machine that will convert resources (talent, money, attention) into progress on important problems. 

And, as I said, there are many problems for which there does exist a machine doing that. But in cases where there isn’t such a machine, because the crucial breakthrough that would turn the problem from category II to category I hasn’t occurred yet, this is counterproductive. 

The sort of inputs that allow a category I problem-solving machine to go better or faster, are just very different from the sort of inputs that make it more likely that humanity will get traction on a category II problem. 

Ease of Absorbing Talent

For one thing, having more people is often helpful for solving a category I problem, but is usually not helpful for getting traction on a category II problem. Machines solving category I problems can typically absorb people, because (by dint of having traction) they are able to train people to fill useful roles in the machine. 

Folks trying to get traction on a category II problem, by definition, don’t have systematic methods by which they can make progress. So they can’t easily train people to do known-to-be-useful work. 

I think there are clusters that are able to make non-0 progress on getting traction, and that those clusters can sometimes usefully absorb people, but they basically need to be people that have a non-trivial ability to get traction themselves. Because the work to be done is trying to get traction on the problem, it doesn’t help much to have more people who are waiting to be told what to do: the thing that they need to do is figure out what to do. [ 2 ]

Benefit of Marginal Improvements

For another thing, because machines solving category I problems can generally absorb resources in a useful way, and because they are making incremental progress, it can be useful to nudge people’s actions in the direction of a good thing, without them shifting all the way to doing the “optimal” thing. 

  • Maybe someone won’t go vegan, but they might go vegetarian. Or maybe a company won’t go vegetarian, but it can be persuaded to use humanely farmed eggs.
  • Maybe this person won’t change their whole career-choice, but they would be open to choosing more impact oriented projects within their career.
  • Maybe most people won’t become hard-core EAs, but if many people change their donations to be more effective on the margin, that seems like a win.
  • Maybe an AI development company won’t hold back the deployment of their AI system for years to find a way to ensure that it is aligned, but it can be convinced to hire a safety team.

For category I problems, interventions on the margin “add up to” significant progress on the problem. What it means for a problem to be category I is that there are at least some forms of marginal improvement that, in aggregate, solve the problem.

But in the domain of category II problems, marginal nudges like this are close to useless. 

Because there is not a machine, to which people can contribute, that will incrementally make progress on the problem, getting people to be somewhat more aware of the problem, or care a little about the problem, doesn’t do anything meaningful.

In the domain of a category II problem, the thing that is needed is a breakthrough (or a series of breakthroughs) that will turn it into a category I problem. 

I don’t know how to make this happen in full generality, but it looks a lot closer to a small number of highly talented, highly-invested people who are working obsessively on the problem than it looks like a large mass of people who are aware that the problem is important and will make marginal changes to their lives to help. 

A machine learning researcher who is not interested in really engaging with the hard part of the problem of AI safety, because that would require backchaining from bizarre-seeming science-fiction scenarios, but is working on a career-friendly paper that he has rationalized, by way of some hand-wavy logic, as “maybe being relevant to AI safety someday”, is, it seems to me, quite unlikely to uncover some crucial insight that leads to a breakthrough on the problem. 

Even a researcher who is sincerely trying to help with AI safety, whose impact model is “I will get a PhD, and then publish papers about AI safety” is, according to me, extremely unlikely to produce a breakthrough. Such a person is acting as if there is a role that they can fill, and if they execute well in that role, this will make progress on the problem. They’re acting as if they are part of an existing solution machine.

But, as discussed, that’s not the way it works: we don’t know how to make progress on AI safety, there aren’t straightforward things to do that will reliably make progress. The thing that is needed is people who will take responsibility for independently attempting to figure out how to make progress (which, incidentally, involves tackling the whole problem in its entirety).

If a person is not thinking at all about how to get traction on the core problem, but is just doing “the sort of activities that an AI safety researcher does”, I think it is very unlikely that their activity is going to lead to the kind of groundbreaking work that changes our understanding of the problem enough that AI alignment flips from being in category II to being in category I.

In these category II cases, shifts in behavior, on the margin, do not add up to progress on the problem. 

EA’s causal history

Not making a distinction between category I and category II problems, EA, as a culture, has a tendency to try to solve all problems as if they are category I problems, namely by recruiting more people and directing them at AI alignment or AI policy or whatever.

This isn’t that surprising to me, because I think a large part of the memetic energy of EA came from the identification of a specific category I problem: There was a claim that a first world person could easily save lives by donating to effective charities focused on the third world. 

Insofar as this is true, there’s a strong reason to recruit as many people to be engaged with EA as possible: the more people involved, the more money moved, the more lives saved.

However, in the years since EA was founded, the intellectual core of the movement updated in two ways: 

Firstly, it now seems more dubious (to me at least) that lives can be cheaply and easily saved that way. [I’m much less confident of this point, and it isn’t a crux for me, so I’ve removed discussion of it to an endnote.[ 3 ] ]

And more importantly, EA realized that there are vastly more important problems than global poverty. 

X-risk has been a thread in EA discourse from the beginning: one of the three main intellectual origins of EA was LessWrong (the other two being GiveWell coming from the finance world, and Giving What We Can / 80,000 Hours stemming from some Oxford philosophers). But sometime around 2014 the leadership of EA settled on long-termism and x-risk as the “most important thing”. (I was part of the volunteer team for EAG 2015, and I saw firsthand that there was an explicit push to prioritize x-risk.)

Over recent years that shift has taken form: deprioritizing earning to give, but promoting careers in AI policy, for instance. 

I claim that this pivot represents a more fundamental shift than most in EA realize. Namely, a shift from EA being the sort of thing that is attempting to make progress on a category I problem to EA being the sort of thing attempting to make progress on a category II problem. 

EA developed as a movement for making progress on a category I problem: It had as a premise that ordinary people can do a lot of good by moving money to (mostly) pre-existing charities, and by choosing high impact “careers” (where a “career” implies an already-mapped out trajectory). Category I orientation is implicit in EA’s DNA.

And so when the movement tries to make the pivot to focusing on x-risk, it implicitly orients to that problem as if it were a category I problem: “Where can we direct money and talent to make an impact on x-risk?”

For all of the reasons above, I think this is a fundamental error: the inputs that lead to progress on a category I problem are categorically different than those that lead to progress on a category II problem. 

To state my view starkly, if somewhat uncharitably, EA is essentially shoveling resources into a non-existent x-risk progress machine, imagining that if they shovel enough, this will lead to progress on x-risk, and the other core problems of the world. As I have said, I think that there isn’t yet a machine that can make consistent incremental progress in that way.

But it would be pretty hard, I think, for EA to do something different. This isn’t a trivial error to correct: changing this would entail changing the fundamental nature of what EA is and how it orients to the world.


[ 1 ] – I’ve heard Nate Soares refer to “the curse of cryonics”, which is that anyone who has enough clear-thinking, independent thought to realize that cryonics is important can also see that there are vastly more important problems.

[ 2 ] – I think that this is a little bit of an oversimplification. I think there are ways that people can contribute usefully in a mode that is close to “executing on what some people they trust think is a good idea”, but you do need a critical mass of people who are clawing for traction themselves, or this doesn’t work. Therefore the limiting reagent is people clawing for traction, and capacity to absorb conscientious ability-to-execute talent is limited.

[ 3 ] – The world is really complicated, and I’m not sure how confident to be that our charitable interventions are working. This post by Ben Hoffman, which points out that the expected value distribution for deworming interventions trends into the negative (though most EAs don’t seem aware of this), seems on point. 

Further (though this is my personal opinion, more than any kind of consensus), the argument Eliezer makes here is extremely important for anyone taking aim at eradicating poverty. If there is some kind of control system that keeps people poor, regardless of the productivity of society, this suggests that there might be some fundamental blocker to ending poverty: until that control system is addressed somehow, all of the value ostensibly created by global health interventions is being sucked away. 

(Admittedly, this argument holds somewhat less force if one is aiming simply to reduce human suffering in the next year, rather than any kind of long term goal like “permanently and robustly end poverty.”)

Considerations like that one suggest that we should be much less certain that our favored global poverty interventions are effective. Instead of (or perhaps, in addition to) rushing to move as many resources to those interventions as possible, it seems like the more important thing is to continue trying to puzzle out, via experiments and modeling and whatever tools we have, how the relevant social systems work, to verify that we’re actually having the positive effects that we’re aiming for. It seems to me that even in the domains of global poverty, we still need a good deal of exploration relative to exploitation of the interventions we’ve uncovered.

Relatedly, it seems to me that focusing on charity is somewhat myopic: it is true that there is a machine eradicating poverty, but that machine is called capitalism, not charity donation. Maybe the charity donations help, but I would guess that if you want to really have the most impact here, the actual thing to do is not give to charities but something more sophisticated that engages more with that machine. (I might be wrong about that. Maybe in fact global health interventions are actually the best way to unblock the economic engine so that capitalism can pull the third world out of poverty faster.)

My current high-level strategic picture of the world

Follow up to: My strategic picture of the work that needs to be done, A view of the main kinds of problems facing us

This post outlines my current epistemic state regarding the most crucial problems facing humanity and the leverage points that we could, at least in principle, intervene on to solve them. 

This is based on my current off-the-cuff impressions, as opposed to careful research. Some of the things that I say here are probably importantly wrong (and if you know that to be the case, please let me know). 

My next step here, is to more carefully research the different “legs” of this strategic outline, shoring up my understanding of each, and clarifying my sense of how tractable each one is as an intervention point.

None of this constitutes a plan. More like, this is a first sketch, to facilitate more detailed elaboration.

The Goal

My overall goal here is to explore the possible ways by which humanity achieves existential victory. By existential victory, I mean, 

The human race[1] survives the acute risk period, and enters a stable period in which it (or our descendants) are able to safely reflect on what a good universe entails, and then act to make the reachable universe good.

This entails humanity surviving all existential risk and getting to a state where existential risk is minimized (for instance, because we are now protected from most disasters by an aligned superintelligence, or a coalition of aligned super intelligences).

Possibly, there is an additional constraint that the human race not just survive, but remain “healthy” along some key dimensions, such as control over our world, intellectual vigor, freedom from oppressive power-structures, trauma, if detriments along those dimensions are irreparable and would therefore permanently limit our ability to reflect on what is Good.

This document describes the two basic trajectories that I can currently see, by which we might systematically achieve that goal (as opposed to succeeding by luck).

The Two Problems

In order to get to that kind of safe attractor state there appear to be two fundamental classes of problems facing humanity: technical AI alignment, and civilizational sanity.

By “technical AI alignment”, I mean the problem of discovering how to build and deploy super-humanly powerful AI systems (embodied either in a singleton, or an “ecosystem” of AIs), safely, in a way that doesn’t extinct humanity, and broadly leaves humans in control of the trajectory of the universe.

By “civilizational sanity”, I mean to point at the catch-all category of whatever causes high leverage decision makers to make wise, scope-sensitive, non-self-destructive, choices.

Civilizational Sanity includes whatever factors cause your society to do things like “saving ~500,000 lives by running human challenge trials on all existing COVID vaccines in February 2020, scaling up vaccine production in parallel with market mechanisms, and then administering vaccinations, en masse, to everyone who wants one, with minimal delay”, or something at least that effective, instead of what the US actually did.

It also includes whatever it takes for a government to successfully identify and carry through good macroeconomic policy (which I’ve heard is NGDP targeting, though I don’t personally know).

And it includes whatever factors cause it to be the case that your civilization suddenly acquiring god-like powers (via transformative AI or some other method), results in increased eudaimonia instead of in some kind of disaster.

I think that the only shot we have of exiting the critical risk period by something other than luck is sufficient success at solving AI alignment or sufficient success at solving civilizational sanity, and implementing our solution.

(The “Strategic Background” section of this post from MIRI outlines a similar perspective of the high level problem as I outline in this document. However it elaborates, in more detail, a path by which AI alignment would allow humanity to exit the acute risk period (minimally aligned AI -> AGI powered technological development -> risk mitigating technology -> pivotal act that stabilizes the world), and de-emphasizes broad-based civilizational sanity improvements as another path out of the acute risk period.)


To some degree, solutions to either technical alignment or civilizational sanity can substitute for each other, insofar as a full solution to one of these problems would approximately obviate the need for solving the other.

For instance, if we had a full and complete understanding of AI alignment, including rigorous proofs and safe demonstrations of alignment failures, fully-worked-out safe engineering approaches, and crisp theory tying it all together, we would be able to exit the critical risk period. 

Even if it wasn’t practical for a small team to code up an aligned AI and foom, with that level of detail, it would be easy to convince the existing AI community (or perhaps just the best equipped team) to build aligned AI, because one could make the case very strongly for the danger of conventional approaches, and provide a crisply-defined alternative.

On the flip side, at some sufficiently high level of global civilizational sanity, key actors would recognize the huge cost to unaligned AI, and successfully coordinate to prevent anyone from building unaligned AI until alignment theory has been worked out.

We can make partial progress on either of these problems. The task facing humanity as a whole is to make sufficient progress on one, the other, or both, of these problems in order to exit the acute risk period. Speaking allegorically, we need the total progress on both to “sum to 1.” [2]

A note on “sufficiency”

Above, I write “I think that the only shot we have of exiting the critical risk period by something other than luck is sufficient success at solving AI alignment or sufficient success at solving civilizational sanity…”.

I want to clearly highlight that the word “sufficient” is doing a lot of work in that sentence. “Sufficient” progress on AI alignment or “sufficient” progress on civilizational sanity is not yet operationalized enough to be a target. I don’t know what constitutes “enough” progress on either one of these, and I don’t know if I could recognize it if I saw it. 

Civilizational Sanity, in particular, is always a two place function: I can only judge a civilization to be insane relative to my own epistemic process. If societal decision making improves, but my own process improves even faster, the world will still seem mad to me, from my new more privileged vantage point. So in that sense, the goal posts should be constantly moving. 

My key claim is only that there is some frontier defined by these axes such that, if the world moves past that frontier, we will be out of the acute risk period, even though I don’t know where that frontier lies.  

A note on timelines

When I talk about civilizational sanity interventions as a line of attack on AI risk, folks often express skepticism that we have enough time: AI timelines are short, so short that it seems unlikely that plans that attempt to reform the decision making process of the whole world will bear fruit before the zero hour. [3]

I think that this is wrong-headed. It might very well be the case that we don’t have time for any sufficiently good general sanity boosting plans to reach fruition. But it might just as well be the case that we don’t have time for our technical AI alignment research to progress enough to be practically useful.

Our basic situation (I’m claiming), is that we either need to get to correct alignment theory, or to a generally sane civilization before the transformative AI countdown reaches 0. But we don’t know how long either of those projects will take. Reforming the decision processes of the powerful places in the world might take a century or more, but so might solving technical alignment.

Absent more detailed models about both approaches, I don’t think we can assume that one is more tractable, more reliable, or faster, than the other.

AI alignment in particular?

This breakdown is focused on the AI alignment problem in particular (it’s taking up half of the problem space), giving the impression that AI risk is the only, or perhaps the most dangerous, existential risk.

While AI risk does seem to me to pose the plurality of the risk to humanity, that isn’t the main reason for breaking things down in this way. 

Rather it’s more that every intervention that I can see that has a shot of moving us out of the acute risk period goes through either powerful AI, or much saner civilization, or both. [I would be excited to have counterexamples, if you can think of any.]

We need protection against bio-risk, nuclear war, and civilizational collapse / decline. But robust protection against any one of those doesn’t protect us from the others by default. Aligned AI and a robustly sane civilization are both general enough that a sufficiently good version of either one would eliminate or mitigate the other risks. Any other solution-areas that have that property, and don’t flow through aligned AI or a generally sane civilization, would deserve their own treatment in this strategic map, but as of yet, I can’t think of any.

Technical AI alignment

I don’t have much to say about the details of this project. In broad strokes, we’re hoping to get a correct enough philosophical understanding of the concepts relevant to AI alignment, formalize that understanding as math, and eventually develop those formalizations into practical engineering approaches. (Elaboration on this trajectory here.)

(There are some folks who are going straight for developing engineering frameworks [links], hoping that they’ll either work, or give us a more concrete, and more nuanced understanding of the problems that need to be solved.)

It seems quite important if there are better or faster ways to make progress here. But my current sense of things is that it is just a matter of people doing the research work + recruiting more people who can do the research work. See my diagram here

Civilizational Sanity

Follow up to: What are some Civilizational Sanity Interventions

This second category is much less straightforward. 

Within the broad problem space of “causing high-level human decision making to be systematically sane”, I can see a number of specific lines of attack, but I have wide error bars on how tractable each one is.

Those lines of attack are

  1. Unblocking governance innovation
  2. Powerful intelligence enhancement
  3. Reliable, scalable, highly effective resolution of psychological trauma
  4. Chinese ascendency

I’m sure this list isn’t exhaustive. These four are the only interventions that I currently know of that seem like (from my current epistemic state) they could transform society enough that we could, for instance, handle AI risk gracefully. 

Relationship between these legs

In particular, there’s an important open question of how these approaches relate to each other, and the broader civilizational sanity project. 

I described above that I think that “AI alignment” and “civilizational sanity” have an “or” or a “sum” relationship: sufficient progress on only one of them can allow us to exit the critical risk period.

There might be a similar relationship between the following civilizational sanity interventions: pushing on any one of them, far enough, leads to a large jump in civilizational sanity, kicking off a positive feedback loop. OR it might instead be that only some of these approaches attack the fundamental problem, and without success on that one front, we won’t see large effects from the others.

Unblocking Innovation in Governance

Better Governance is Possible

The most obvious way to improve the sanity of high-leverage decisions on planet earth is governmental reform.

Our governmental decision making processes are a mess. National politics is tribal politics writ large: instead of a societal-level epistemology trying to select the best policies, we have a bludgeoning match over which coalitions are best, and which people should be in charge. Politicians are selected on the basis of electability, not expertise, or even alignment with society, yet somehow we seem to be ending up with candidates that no one is enthusiastic about. Congress is famously in a semi-constant self-strangle-hold, unable to get anything done. And the constraints of politics force those politicians to say absurd things in contradiction with, for instance, basic economic theory, and to grandstand about things that don’t matter and (even worse) things that do.

The current system has all kinds of analytically demonstrable game theoretic drawbacks that make undesirable outcomes all but inevitable, including a two-party system that no one likes much, principal-agent problems between the populace and the government, and net societal losses as a result of allocation of benefits to special interest groups.

There hasn’t been a major innovation in high-level governance since the invention and wide-scale deployment of democracy in the 18th century. It seems like we can do better. We could, in principle, have governmental institutions that are effective epistemologies, able to identify problems and to determine and act on policies at the frontier of society’s various tradeoffs instead of at the frontier of the tradeoffs of political expediency.

And because governments have so much influence, more effective information processing in that sector could lead to better institution designs in all other sectors. Public policy is in part a matter of creating and regulating other institutions. Saner government decision making entails setting up efficient and socially-beneficial incentives for health, education, etc., which selects for effective institutions in those more specific sectors. In this way, government is a meta-institution that shapes other institutions. (It’s unclear to me to what degree this is true. How much does better policy at the governmental level automatically correct the inefficiencies of, say, the medical bureaucracy?)

One might therefore think that a particularly high leverage intervention is to develop new systems of governance. But humanity has a pretty large backlog of governance innovations that seem much better than our current setups on a number of dimensions, from the simple, like using Single Transferable Vote instead of First Past the Post, to the radical, like Futarchy, or the abolition of private property in favor of a COST system.
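To make the simplest of those examples concrete, here is a minimal sketch of how First Past the Post and a ranked method can pick different winners from the same ballots. In the single-winner case, Single Transferable Vote reduces to instant-runoff voting, which is what is implemented here. The ballot profile, the candidate names, and the function names are all hypothetical, chosen only for illustration, and the sketch assumes at least one non-empty ballot.

```python
from collections import Counter

def fptp_winner(ballots):
    """First Past the Post: only each ballot's first choice is counted."""
    counts = Counter(ballot[0] for ballot in ballots if ballot)
    return counts.most_common(1)[0][0]

def irv_winner(ballots):
    """Instant-runoff voting (single-winner STV).

    Repeatedly eliminate the candidate with the fewest top-preference
    votes, transferring those ballots to their next surviving choice,
    until some candidate holds a majority of the ballots still counted.
    """
    remaining = {candidate for ballot in ballots for candidate in ballot}
    while True:
        counts = Counter()
        for ballot in ballots:
            # Each ballot counts for its highest-ranked surviving candidate.
            for candidate in ballot:
                if candidate in remaining:
                    counts[candidate] += 1
                    break
        total = sum(counts.values())
        leader, votes = counts.most_common(1)[0]
        if votes * 2 > total or len(remaining) == 1:
            return leader
        # Ties for last place are broken alphabetically (an arbitrary choice).
        loser = min(remaining, key=lambda c: (counts[c], c))
        remaining.discard(loser)

# Hypothetical electorate: 40 voters rank A > B > C, 35 rank B > C > A,
# and 25 rank C > B > A.
ballots = (
    [["A", "B", "C"]] * 40
    + [["B", "C", "A"]] * 35
    + [["C", "B", "A"]] * 25
)

print("FPTP winner:", fptp_winner(ballots))
print("IRV winner: ", irv_winner(ballots))
```

In this made-up profile, A wins under First Past the Post with 40 of 100 first-choice votes, even though 60 voters rank A last. Under instant runoff, C is eliminated first, C's ballots transfer to B, and B wins 60 to 40.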

It seems to me that the bottleneck for better governmental systems is not possible alternatives, but rather the opportunity to experiment with those alternatives. Apparently, there are approximately no venues available for governmental innovation on planet earth.

This is not very surprising, because incumbents in power benefit from the existing power structure and therefore oppose replacing it with a different mechanism. In general, everyone who has the ability to gatekeep experiments with new governance mechanisms is threatened by those experiments, and so is incentivized to block them.

However, widespread experimentation and innovation in governance would likely be a huge deal, because it would allow humanity as a whole to identify the most successful mechanisms, which, having been shown to work, could be tried at larger scales, and eventually widely adopted.

Experimentation Leads to Eventual Wide Adoption

The basic argument that merely allowing experimentation will eventually lead to better governance on a global scale is as follows: 

Many governance mechanisms, if tried, will not only 1) surpass existing systems, but 2) do so in a legible way, both in aggregate outcomes (like economic productivity, employment, and tax-rate), and from direct engagement with those systems (for instance, once voters become familiar with Futarchy, it might seem absurd that you would elect individuals who are supposed to both represent your values and have good plans for achieving those values).

If the condition of “legible superiority” holds, there would be pressure to replicate those mechanisms elsewhere, at all different scales. Eventually, the best innovations simply become the new standard practices.

Similarly, for many incentive-aligning interventions, not using such methods is a stable attractor: it is in the interests of those in power to resist their adoption. But widespread use of such methods is also a stable attractor: once common, it is in the interests of those in power to keep using them. As Robin Hanson says of prediction markets:

I’d say if you look at the example of cost accounting, you can imagine a world where nobody does cost accounting. You say of your organization, “Let’s do cost accounting here.”

That’s a problem because you’d be heard as saying, “Somebody around here is stealing and we need to find out who.” So that might be discouraged.

In a world where everybody else does cost accounting, you say, “Let’s not do cost accounting here.” That will be heard as saying, “Could we steal and just not talk about it?” which will also seem negative.

Similarly, with prediction markets, you could imagine a world like ours where nobody does them, and then your proposing to do it will send a bad signal. You’re basically saying, “People are bullshitting around here. We need to find out who and get to the truth.”

But in a world where everybody was doing it, it would be similarly hard not to do it. If every project with a deadline had a betting market and you say, “Let’s not have a betting market on our project deadline,” you’d be basically saying, “We’re not going to make the deadline, folks. Can we just set that aside and not even talk about it?”

This may generalize to many institution designs that are better than the status quo.

For these reasons, finding ways around the general moratorium on governmental innovation, so that new governance mechanisms can be tried, could pay huge dividends.

Strategies to allow for Experimentation

Currently, the only approaches I’m aware of for creating spaces for governmental innovation are charter cities and seasteading.

Charter cities are bottlenecked on legal restrictions, and on the practical coordination problem of getting a critical mass of residents. But I’m hopeful that COVID has caused a permanent shift to remote work, which will give people more freedom in where to live, and increase competition-in-governance between cities and states that want to attract talent.

Seasteading is currently bottlenecked on the engineering problem of creating livable floating structures, cheaply enough to be scalable. [Double check if cost is actually the key concern.]

Repeatable reform templates

I wonder if there might be a third, more abstract, line of attack on unblocking governance innovation: developing a repeatable method to change existing governmental structures in a way that incentivizes powerful incumbents.

If it were possible to simply buy out incumbents and overhaul the system, that might be a huge opportunity. However, I guess that in most liberal democracies, this is both illegal and generally repugnant (plus politicians are beholden to their party which might object), such that existing power-holders would not accept a straightforward “money for institutional reform” trade.

But there may be some other version which, in practice, incentivizes power-holders to initiate governmental reform. Possibly by letting those power-holders keep their power for some length of time, and also receive the credit for the change. Or maybe a setup that targets those people before they take power, when they are more idealistic, and more inclined to make an agreement to cause reform, conditional on all their peers doing the same, in the style of a free-state agreement.

If we could find a repeatable “template” for making such deals, it might unlock the ability to iteratively improve existing governmental structures.

I’m not aware of any academic research in this area (both historical case studies of how these kinds of shifts have occurred in the past and analytic models of how to incentivize such changes seem quite useful to me), nor any practical projects aiming for something like this.

Intelligence enhancement

One might posit that the sort of incentive problems that lead to bizarre institutional policies are the inevitable result of the fact that doing better requires understanding many abstract, non-intuitive concepts and/or careful reasoning in complicated domains, and the average person is of average intelligence, which is insufficient to systematically identify better policies and institutional set-ups over worse ones at the current margin.

In this view, the fundamental problem is that our civilizational decision making processes are much worse than is theoretically possible, because we are collectively not smart enough to do better. Some of us can identify the best policies (or at least determine that one policy is better than another), some of the time, but that relies on understanding that is esoteric to many more people, including many crucial decision makers.

But if the average intelligence of the population as a whole was higher, more good ideas would seem obviously good to more people, and it would be substantially easier to get a critical mass of acceptance of sane policies on the object level, as well as better information processing mechanisms. (For instance, if the IQ curve was shifted 35 points to the right, many more people would be able to “see at a glance” why prediction markets are an elegant way of aggregating information.)
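As a rough illustration of the size of that shift, here is a minimal sketch. It assumes IQ is normally distributed with mean 100 and standard deviation 15 (the usual convention, not something stated above), and it uses a hypothetical cutoff of 130 as a stand-in for “can see the idea at a glance.” The cutoff, the constants, and the function name are all assumptions made for illustration.

```python
from statistics import NormalDist

IQ_SD = 15              # conventional IQ standard deviation (assumption)
BASELINE_MEAN = 100     # conventional population mean (assumption)
SHIFT = 35              # the shift mentioned in the text
CUTOFF = 130            # hypothetical "sees it at a glance" threshold

def share_above(cutoff, mean, sd=IQ_SD):
    """Fraction of a normal(mean, sd) population scoring above `cutoff`."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(cutoff)

before = share_above(CUTOFF, BASELINE_MEAN)
after = share_above(CUTOFF, BASELINE_MEAN + SHIFT)

print(f"Share above {CUTOFF} before the shift: {before:.1%}")
print(f"Share above {CUTOFF} after the shift:  {after:.1%}")
```

Under those assumptions, the share of people above the cutoff goes from roughly 2% to roughly 63%. The specific numbers depend entirely on the assumed cutoff, so this is only meant to show how strongly a shift in the mean moves the tail.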

More intelligence -> More understanding of important principles -> Saner policies

So it might be that the most effective lever on civilizational sanity is intervening on biological intelligence.

The most plausible way to do this is via widespread genetic enhancement, with either selection methods like iterated embryo selection, or direct gene editing using methods like CRISPR.

My current understanding is that these methods are bottlenecked on our current knowledge of the genetic predictors of intelligence: if we knew those more completely, we would basically be able to start human genetic engineering for intelligence. It seems like that knowledge is going to continue to trickle in as we get better at doing genomic analysis and collect larger and larger data sets. [Note: this is just my background belief. Double-check] Possibly, better machine learning methods will lead to a sudden jump in the rate of progress on this project?

On the face of it this suggests that any project that could provide a breakthrough in decoding the genetic predictors of intelligence could be high leverage.

Aside from that, there’s some risk that society will fork down a path in which human genetic enhancement is considered unethical and is banned. I’m not that worried about this possibility, because as long as some people / groups are doing this for their children, there is competitive pressure to do the same. I think it is pretty unlikely that China, which is competitive at the national level with the rest of the world, and in which families already regularly exert huge efforts to give their children competitive advantages relative to society at large, will forgo this opportunity. And if China invests in human genetic enhancement, the US will do the same out of a fear of Chinese dominance.

Some other avenues for human intelligence enhancement include nootropics, which seem much less promising because of the basic Algernonic argument, and brain-computer interfaces like Neuralink. Of the latter, it is currently unknown which dimensions of human cognition can be readily improved, and whether such augmentation will lend itself to wisdom or whatever the precursors to civilizational sanity are.

There’s also the possibility of using sufficiently aligned AI assistants to augment our effective intelligence and decision making. Absent our alignment research giving us very clear criteria for aligned systems, this seems like a very tricky proposition, because of the problems described in this post. But in worlds where AI technology continues to improve along its current trajectory, it might be that using limited AI systems as leverage for improving our decision making and research apparatus, to further improve our alignment technologies, is the best way to go.

A note on improving public understanding by methods other than intelligence enhancement:

Possibly there are other ways to substantially increase each person’s individual intellectual reach, so that we can all come to understand more, without increasing biological intelligence. Things in the vein of “better education”. 

I’m pretty dubious of these. 

I think I have far above average skill in communicating (both teaching and being taught) complex or abstract ideas. But even being pretty skilled, for a human, it is just hard. Even when the conditions are exceptional (a motivated student working one-on-one with a skilled tutor who understands the material and can model / pace to the student’s epistemic state), it just takes many focused hours to grasp many important concepts.

I think that any educational intervention effective enough to actually move the needle on civilizational sanity would have to be very radical: so transformative that it would be a general boost in a person’s learning ability, i.e. an increase in effective intelligence. That said, if anyone has ideas for interventions that could increase most people’s intellectual grasp, I would love to hear them.

(…Possibly a dedicated and well executed campaign to educate the public at large in some small set of extremely important concepts, with the goal of shifting what sorts of explanations sound plausible to most people (raising the standard for what kinds of economic claims people can make in public with a straight face, for instance), would be helpful on the margin. But this seems to me like an enormous undertaking which would require pedagogical and mass-communication knowledge that I don’t know if anyone has. And I’m not sure how helpful it would be. Even if the whole world understood econ 101, the real world is more complicated than econ 101, such that I don’t know how much that alone would aid people’s assessment of which policies are best. I suppose it would cut out some first-order class of mistakes.)

I do think there are definitely ways to increase our collective intellectual reach, so that societies can systematically land on correct conclusions without increasing any individual person’s intellectual reach or understanding. These include the governance mechanisms I alluded to in the last section. 

There might also exist society-wide “public services” that could do something like this while side-stepping government bureaucracy entirely, like the dream of Arbital. I’m not sure how optimistic I should be about those kinds of interventions. The only comparable historical examples that I can think of are Wikipedia and public libraries. Both of these seem like clearly beneficial public goods with huge flow-through effects, which make information easily available to people who want it and wouldn’t otherwise have access. But neither one seems to have obviously improved high-level civilizational decision making relative to the counterfactual.

Clearing “Trauma”??

[The following section is much more speculative, and I don’t yet know what to think of it.]

There’s another story in which the main source of our world’s dysfunction is self-perpetuating trauma patterns. 

There are many variations of this story, which differ in important details. I’ll outline one version here, noting that something like this could be true without this particular story being true.

According to this view…

virtually everyone is traumatized (or if you prefer, “socialized”), into dysfunctional and/or exploitative behavior patterns, to greater or lesser degrees, in the course of growing up. 

The central problem isn’t (just) that everyone is following their local self-interest in globally destructive systems. It is actually much worse than that: people are conditioned in such a way that they are not even acting in their narrow self-interest. Instead humans myopically focus on goals, and execute strategies, that are both 1) globally harmful and 2) not even aligned with their own “true” reflective preference, due to false assumptions underlying their engagement with the world. This myopia also inhibits their ability to think clearly about parts of the world that are related to their trauma.

(As a case in point, I think it is probably the case that there are lots of people aggressively pursuing AGI who instinctively flinch away from any thought that AGI might be dangerous, because they have a deep, unarticulated belief that if they can be successful at that, their parents will love them, or they won’t feel lonely any more, or something like that.)

They’ve been conditioned to feel threatened, or triggered by, a huge class of behaviors that are globally productive, like accurate tracking of harms, and many kinds of positive-sum arrangements.

Furthermore, the core reason why most people can’t seem to think or to have “beliefs in the sense of anticipations about the world” is not (mostly) a matter of intelligence, but rather that their default reasoning and sense-making functions have been damaged by the institutions and social contexts in which they participate (school, for instance).

Those traumatizing contexts are not designed by conscious malice, but they are also not necessarily incidental. It’s possible that they have been optimized to be traumatizing, via unconscious social hill-climbing.

This is because trauma-patterns are replicators: they have enough expressive power to recreate themselves in other humans, and are therefore subject to a selection pressure that gradually promotes the variations that are most effective at propagating themselves. (Furthermore, there’s a hypothesis that for a traumatized mind, one of the best ways to control the environment to make it safe is to similarly traumatize people in the environment.) The net result is horrendous systems of hurt people hurting people, as a way to pass on that particular flavor of hurt to future generations.

Part of the hypothesis here is that these trauma patterns have always been a thing in human societies, but there has also typically been a counter-force, namely that if you need to work together and have a good understanding of the physical world to survive in a harsh environment, your epistemology can’t be damaged too badly, or you’ll die. But in the modern world, we’ve become so wealthy, and most people have become so divorced from actual production, that that counter-force is much diminished.

Implications for Improving the World

If this story is true, governmental reform is likely to fail for seemingly-mysterious reasons, because there is selection pressure optimizing against good institutional epistemology, over and above bureaucratic inertia and the incentives of entrenched power-holders. If you don’t defuse the underlying trauma-patterns, any system that you try to build will either fail or be subverted by those trauma-patterns.

And under this story, it’s unclear how much intelligence enhancement would help. All else being equal, it seems (?) that being smarter helps in developmental work, and healing from one’s personal traumas, but it might also be the case that greater intelligence enables more efficient propagation of trauma patterns. 

If this story is largely correct, it implies that the actual bottleneck for the world is understanding trauma and trauma resolution methods well enough to heal trauma-patterns at scale. If we can do that, the agency and intellect of the world (which is currently mostly suppressed), will be unblocked, and most of the other problems of the world will approximately solve themselves.

I also don’t know to what extent there already exist methods for reliably and rapidly resolving trauma patterns, and the degree to which the bottleneck is actually one of 1 to n scaling rather than 0 to 1 discovery. Certainly there are various methods that at least some people have gotten at least some benefit from, though it remains unclear how much of the total potential benefit even the best methods provide to the people who have gotten the most from them.

I don’t know what to think of all of this yet, the degree to which trauma is at the root of the world’s ills, the degree to which things have actually been optimized to be traumatizing as opposed to ending up that way by accident, or even if “trauma” is a meaningful category pointing at a real phenomenon that is different from “learning” in a principled way.

I’ll note that even if the strong version of this story is not correct, it might still be the case that many people’s intellectual capability is handicapped by psychological baggage. So research into effective trauma-resolution methods might be an effective line of attack on improving the world’s intellectual capability. For instance, finding a non-scalable method for reliably resolving trauma might be an important win, because at minimum, we could apply it to all of the AI safety researchers. This might be one of the possible gains on the table for speeding progress on the alignment problem.

(Though this is also something to be careful of, since such methods would likely have some kind of psychological side effects, and we don’t necessarily want to reshape the psyches of earth’s contingent of alignment researchers all in the same way. I worry that we might have already done this to some degree with circling: Circling seems quite good and quite helpful, but I think that we should be concerned that if we make a mistake about what directions are good to push the culture of our small AI safety community, we’re likely to destroy a lot of value.)

The Rise of China??

In the first section describing what I meant by civilizational sanity up above, I noted “sensible response to COVID” as one indicator of civilizational sanity. Notably, China’s COVID response seems, overall, to have been much more effective than the West’s.

This doesn’t seem like an aberration, either. As a non-expert foreigner looking in, it looks like China’s society/government is overall more like an agent than the US government. It seems possible to imagine the PRC having a coherent “stance” on AI risk. If Xi Jinping came to the conclusion that AGI was an existential risk, I imagine that that could actually be propagated through the Chinese government, and Chinese society, in a way that has a pretty good chance of leading to strong constraints on AGI development (like the nationalization, or at least the auditing, of any AGI projects).

Whereas if Joe Biden, or Donald Trump, or anyone else who is anything close to a “leader of the US government”, got it into their head that AI risk was a problem…the issue would immediately be politicized, with everyone in the media taking sides on one of two lowest-common denominator narratives each straw-manning the other. One side would attempt to produce (probably senseless) legislation in the frame of preventing the bad guys from doing bad things, while the other side goes to absurd lengths to block them as a matter of principle, and in the end we’re left with some regulation on tech companies that doesn’t cleave to the actual shape of the problems at all, and pisses off researchers who are frustrated that this anthropomorphizing, “AI risk” hubbub, just made their lives much harder, alienating them.

(One might think that this is actually a national security issue, and it would be taken more seriously than that, but COVID was a huge public health issue, and we managed to politicize wearing masks.)

So, maybe it would be good for the world if China was the dominant world power?

I think that overall, China’s society and high-level decision making are currently saner than those of the western world. So maybe on the margin, the world would be better off if China were more dominant.

However, I have a number of reservations.

  1. China’s human rights record is not great. Apparently, there is an ongoing genocide of the Uighurs, happening right now. My deontology is pretty reluctant to put mass murderers in charge of the world.
    1. I’m not sure how to think about this. Genocide is extremely bad. And furthermore we have a strong, coordinated norm to censure and take action against it (although, obviously not that strong, since I don’t know of a single person who has taken any action other than (occasionally) tweeting news articles, in this case). But also, I’m not sure whether I should just parse this as standard practice for great powers / ruling empires. The US has committed similarly bad atrocities in its history (slavery and the extermination/relocation of the Indians come to mind), and as far as I know, continues to commit similar atrocities. And the stakes are literally astronomical. Does the specter of extinction and the weight of all future earth-originating civilization mean we should just neglect contemporary genocide in our realpolitik calculations? I’m not comfortable with that, but I don’t know what to think about it.
  2. I don’t have a strong reason to expect that China’s institutions are fundamentally better functioning than the US’s; I think they’re just younger. If China is exhibiting the kind of functionality and decisiveness that the US was enjoying 60 years ago, then it seems pretty plausible that 60 years from now (or maybe sooner than that, on the general principle that the world is moving faster now), the Chinese system will be similarly sclerotic and dysfunctional.
    1. Indeed, we might make a more specific argument that institutions are able to remain functional so long as there is growth, because a growing pie means everyone can win. But when growth slows or stops there’s no longer a selection pressure for effectiveness, and institutions entrench themselves because rent seeking is a better strategy. (Or maybe the causality goes the other way: there’s a continual, gradual, increase in rent-seeking as actors entrench their power-bases, which gradually cuts out production, until all (or almost all) that’s left is rent-seeking.) In any case, I think China has got to be nearing the top of its explosive s-curve, and I don’t expect its national agency to be robust to that.
  3. I would guess, not knowing much more than a stereotype of Chinese culture, that even if it is saner and more effective than western culture right now, the west has more of the generators that can lead to further increases in civilizational sanity. I might be totally off base here, but the East’s emphasis on conformity and social hierarchy seems like it would make it even MORE resistant to, say, the wide-scale adoption of prediction markets than the US is. (Though maybe the ruling party is enough of an unincentivized incentivizer to overcome this effect?) I suspect that it is even less likely to generate the kind of iconoclastic thinkers who would think up the idea of prediction markets in the first place. It would be quite bad if we got some boost in civilizational sanity with the rise of China, but Chinese dominance curtailed any further improvement on that dimension.
  4. It is currently unclear to me how much it matters which culture the intelligence explosion takes place in.
    1. Under the assumption of a strong attractor in the human CEV, it seems like it doesn’t matter much at all: we’re all, currently, so radically confused about Goodness that the apparently-huge cultural differences are just noise. And even if that’s not true, I would guess that the differences between my ideal future and some human-descended society are probably massively outweighed by the looming probability of extinction and a sterile universe. Chinese people live happy lives in China, now, and have lived happy lives throughout history, even if they tolerate a level of conformity and restriction-of-expression that I would find stifling, to say the least.
    2. However, I think it might not be an exaggeration to say that the CPC believes that thoughts should be censored to serve the state. I can imagine technologically augmented versions of thought control that are so severe as to permanently damage human civilization’s ability to think together, which might constitute the sort of irreparable “damage” that prevents us from deliberating to discover, and then executing on, a good future. If this sort of technology is more likely to come from China than from the west, Chinese supremacy might be disastrous.
    3. It does seem really important that AGI not lock the future into an inescapable immortal dictatorship (Probably? Maybe most people just live basically happy lives in an immortal dictatorship?). And I want to track if that is more likely to result from an intelligence explosion directed by China than by my native culture.

Summing up

  • The problem facing humanity in this era is figuring out how to exit the acute risk period, systematically, instead of by luck. 
  • The only ways that I can see to do this, depend on aligned AI or a much saner human civilization. 
  • So the problem breaks down into two subproblems: solve AI alignment or achieve enough civilizational sanity.
  • AI alignment research is going apace, and if there are ways to speed it up, that would be great.
  • I can currently see four lines of attack on civilizational sanity: unblocking innovation in governance, intelligence enhancement, possibly widespread trauma resolution, and Chinese ascendancy.
  • All of those plans might turn out to be on-net bad for the world, on further reflection.


Some of my questions for going forward:

  1. How long until transformative AI arrives?
  2. Are there tractable ways to speed up technical AI alignment substantially?
  3. Are there tractable ways to unblock governance experimentation?
  4. Follow up on charter city projects
  5. What’s blocking seasteading? Is it cost, as I believe?
  6. How large are the expected flow-through effects of governmental sanity interventions on other sectors?
  7. Conditional on unblocking innovation in governance, how long is it likely to take for the best innovations to propagate outward until they are standard best practices?
  8. What’s the bottleneck for human genetic intelligence augmentation?
  9. Along what dimensions would Neuralink improve human capabilities?
  10. Is “trauma” a natural kind? To what extent is it true that psychological trauma is driving exploitative and counter-productive organizational patterns in the world?
  11. How much saner is China? How long will the Chinese system remain “alive”?
  12. How different will the long term future be, if the intelligence explosion happens in one culture rather than another?


[1] –  Or some civilization or other mechanism, bearing human values.

[2] –  Though of course, there isn’t a linear relationship between the individual progress bars, and total victory. We might be “70%” of the way to a full solution to both problems (whatever that means), but between the two, not have enough of the right pieces to get a combined solution that lets us exit the critical risk period. That’s why it is only allegorical.

[3] – And, in contrast, I sometimes talk with people who are so pessimistic about alignment work, that they take it for granted that the thing to do is take over the world by conventional means.

Psychoanalyzing, people seem to gravitate to the line of attack that is within their skillset, and therefore feels more comfortable to think about. This seems like a perfectly good heuristic for specialization, but it doesn’t seem like a particularly good way to identify which approach is more tractable in the abstract.

How do we prepare for final crunch time? – Some initial thoughts

[epistemic status: Brainstorming and first draft thoughts.

Inspired by something that Ruby Bloom wrote and the Paul Christiano episode of the 80,000 Hours podcast.]

One claim I sometimes hear about AI alignment [paraphrase]:

“It is really hard to know what sorts of AI alignment work are good, this far out from transformative AI. As we get closer, we’ll have a clearer sense of what AGI / Transformative AI is likely to actually look like, and we’ll have much better traction on what kind of alignment work to do. In fact, it might be the case that MOST of the work of AI alignment is done in the final few years before AGI, when we’ve solved most of the hard capabilities problems already and we can work directly, with good feedback loops, on the sorts of systems that we want to align.”

Usually this is taken to mean that the alignment research that is being done today is primarily to enable or make easier future, more critical, alignment work. But “progress in the field” is only one dimension to consider in boosting the work of alignment researchers in final crunch time.

In this post I want to take the above posit seriously, and consider the implications. If most of the alignment work that will be done is going to be done in the final few years before the deadline, our job in 2021 is mostly to do everything that we can to enable the people working on the problem in the crucial period (which might be us, or our successors, or both) so that they are as well equipped as we can possibly make them.

What are all the ways that we can think of that we can prepare now, for our eventual final exam? What should we be investing in, to improve our efficacy in those final, crucial, years?

The following are some ideas.


For this to matter, our alignment researchers need to be at the cutting edge of AI capabilities, and they need to be positioned such that their work can actually be incorporated into AI systems as they are deployed.

A different kind of work

Most current AI alignment work is pretty abstract and theoretical, for two reasons. 

The first reason is a philosophical / methodological claim: There’s a fundamental “nearest unblocked strategy” / overfitting problem. Patches that correct clear and obvious alignment failures are unlikely to generalize fully; you’ll only have constrained unaligned optimization to channels that you can’t recognize. For this reason, some claim, we need to have an extremely robust, theoretical understanding of intelligence and alignment, ideally at the level of proofs.

The second reason is a practical consideration: we just don’t have powerful AI systems to work with, so there isn’t much in the way of tinkering and getting feedback.

The second objection becomes less relevant in final crunch time: in this scenario, we’ll have powerful systems 1) that will be built along the same lines as the systems that it is crucial to align and 2) that will have enough intellectual capability to pose at least semi-realistic “creative” alignment failures (i.e., current systems are so dumb, and live in such constrained environments, that it isn’t clear how much we can learn about aligning literal superintelligences from them).

And even if the first objection ultimately holds, theoretical understanding often (usually?) follows from practical engineering proficiency. It seems like it might be a fruitful path to tinker with semi-powerful systems trying out different alignment approaches empirically, and tinkering to discover new approaches, and then backing up to do robust theory-building given much richer data about what seems to work.

I could imagine sophisticated setups that enable this kind of tinkering and theory building. For instance, I imagine a setup that includes:

  • A “sandbox” that affords easy implementation of many different AI architectures and custom combinations of architectures, with a wide variety of easy-to-create, easy-to-adjust training schemes, and a full suite of interpretability tools. We could quickly try out different safety schemes, in different distributions, and observe what kinds of cognition and behavior result.
  • A meta AI that observes the sandbox, and all of the experiments therein, to learn general principles of alignment. We could use interpretability tools to use this AI as a “microscope” on the AI alignment problem itself, abstracting out patterns and dynamics that we couldn’t easily have teased out with only our own brains. This meta system might also play some role in designing the experiments to run in the sandbox, to allow it to get the best data to test its hypotheses.
  • A theorem prover that would formalize the properties and implications of those general alignment principles, to give us crisply specified alignment criteria by which we can evaluate AI designs.

Obviously, working with a full system like this is quite different from abstract, purely theoretical work on decision theory or logical uncertainty. It is closer to the sort of experiments that the OpenAI and DeepMind safety teams have published, but even that is a pretty far cry from the kind of rapid-feedback tinkering that I’m pointing at here.

Given that the kind of work that leads to research progress might be very different in final crunch time than it is now, it seems worth trying to forecast what shape that work will take and trying to see if there are ways to practice doing that kind of work before final crunch time.


Obviously, when we get to final crunch time, we don’t want to have to spend any time studying fields that we could have studied in the lead-up years. We want to have already learned all the information and ways of thinking that we’ll want to know then. It seems worth considering what fields we’ll wish we had known when the time comes.

The obvious contenders:

  • Machine Learning
  • Machine Learning interpretability
  • All the Math of Intelligence that humanity has yet amassed [Probability theory, Causality, etc.]

Some less obvious possibilities:

  • Neuroscience?
  • Geopolitics, if it turns out that which technical approach is ideal hinges on important facts about the balance of power?
  • Computer security?
  • Mechanism design in general?

Research methodology / Scientific “rationality”

We want the research teams tackling this problem in final crunch time to have the best scientific methodology and the best cognitive tools / habits for making research progress, that we can manage to provide them.

This might include skills or methods in the domains of:

  • Ways to notice as early as possible if you’re following an ultimately-fruitless research path
  • Noticing / Resolving / Avoiding blindspots
  • Effective research teams
  • Original seeing / overcoming theory blindness / hypothesis generation
  • ???


One obvious thing is to spend time now investing in habits and strategies for effective productivity. It seems senseless to waste precious hours in our acute crunch time due to procrastination or poor sleep. It is well worth it to solve those problems now. But aside from the general suggestion to get your shit in order and develop good habits now, I can think of two more specific things that seem good to do.

Practice no-cost-too-large productive periods

There may be trades that could make people more productive on the margin, but are too expensive in regular life. For instance, I think that I might conceivably benefit from having a dedicated person whose job is to always be near me, so that I can duck with them with 0 friction. I’ve experimented a little bit with similar ideas (like having a list of people on call to duck with), but it doesn’t seem worth it for me to pay a whole extra person-salary to have the person be on call, and in the same building, instead of on-call via zoom.

But it is worth it at final crunch time.

It might be worth it to spend some period of time, maybe a week, maybe a month, every year, optimizing unrestrainedly for research productivity, with no heed to cost at all, so that we can practice how to do that. This is possibly a good thing to do anyway, because it might uncover trades that actually, on reflection, are worth importing into my regular life.

Optimize rest

One particular subset of personal productivity that jumps out at me: each person should figure out their actual optimal cadence of rest.

There’s a failure mode that ambitious people commonly fall into, which is working past the point when marginal hours of work are negative. When the whole cosmic endowment is on the line, there will be a natural temptation to push yourself to work as hard as you can, and forgo rest. Obviously, this is a mistake. Rest isn’t just a luxury: it is one of the inputs to productive work.

There is a second level of this error in which one, grudgingly, takes the minimal amount of rest time, and gets back to work. But the amount of rest time required to stay functional is not the optimal amount of rest, the amount that maximizes productive output. Eliezer mused, years ago, that he felt kind of guilty about it, but maybe he should actually take two days off between research days, because the quality of his research seemed better on days when he happened to have had two rest days preceding.

In final crunch time, we want everyone to be resting the optimal amount that actually maximizes area under the curve, not the one that maximizes work-hours. We should do binary search now, to figure out what the optimum is.

Also, obviously, we should explore to discover highly effective methods of rest, instead of doing whatever random things seem good (unless, as it turns out, “whatever random thing seems good” is actually the best way to rest).

Picking up new tools

One thing that will be happening in this time, is there will be a flurry of new AI tools that can radically transform thinking and research, perhaps increasingly radical tools coming at a rate of once a month or faster.

Being able to take advantage of those tools and start using them for research immediately, with minimal learning curve, seems extremely high leverage.

If there are things that we can do that increase the ease of picking up new tools and using them to their full potential (instead of, as is common, using only the features afforded by your old tools and only very gradually

Some thoughts (probably bad):

  • Could we set up our workflows, somehow, such that it is easy to integrate new tools into them? Like if you already have a flexible, expressive research interface (something like Roam?), and you’re used to regular changes in capability to the backend of the interface?
  • Can we just practice? Can we have a competitive game of introducing new tools, and trying to orient to them and figure out how to exploit them as creatively as possible?
  • Probably it should be some people’s full time job to translate cutting edge developments in AI into useful tools and practical workflows, and then to teach those workflows to the researchers?
  • Can we design a meta-tool that helps us figure out how to exploit new tools? Is it possible to train an AI assistant specifically for helping us get the most out of our new AI tools?
  • Can we map out the sorts of constraints on human thinking and/or the sorts of tools that will be possible, in advance, so that we can practice with much weaker versions of those tools, and get a sense of how we would use them, so that we’re ready when they arrive?
  • Can we try out new tools on psychedelics, to boost neuroplasticity? Is there some other way to temporarily weaken our neural priors? Maybe some kind of training in original seeing?

Staying grounded and stable in spite of the stakes

Obviously, being one of the few hundred people on whom the whole future of the cosmos rests, while the singularity is happening around you, and you are confronted with the stark reality of how doomed we are, is scary and disorienting and destabilizing.

I imagine that that induces all kinds of psychological pressures, that might find release in any of a number of concerning outlets: by deluding one’s self about the situation, by becoming manic and frenetic, by sinking into immovable depression.

We need our people to have the virtue of being able to look the problem in the eye, with all of its terror and disorientation, and stay stable enough to make tough calls, and make them sanely.

We’re called to cultivate a virtue (or maybe a set of virtues) of which I don’t know the true name, but which involves courage and groundedness, and determination-without-denial.

I don’t know what is entailed in cultivating that virtue. Perhaps meditation? Maybe testing one’s self at literal risk to one’s life? I would guess that people in other times and places, who needed to face risk to their own lives and that of their families, did have this virtue, or some part of it, and it might be fruitful to investigate those cultures and how that virtue was cultivated.