levels of interp:
- probes: no causality
- attribution (e.g. integrated gradients): no interpretation

methods of causal interventions:
- activation patching / interchange interventions: record the activation on a source run, then swap it into a base run (and can thus read off the counterfactual output); see the patching sketch below
- distributed alignment search (DAS): features are not axis-aligned; finds the equality task efficiently after (a rotation?); see the DAS sketch below

three worlds of causal interventions:
- ...as interp: "can we find interpretable causal mechanisms?" That is, "searching for a rotation" and then running interchange interventions:
  - propose a causal model to align
  - figure out if the counterfactual matches
  - solve for the alignment
- ...as control: ReFT: "can we optimize our intervention for any task?" That is, can intervention be a good way to derive control? ReFT applies only limited interventions to prompt tokens, using the same notion of minor control; see the LoReFT-style sketch below
- ...as steering: AxBench tells us that most current steering objectives are untenable. However, we can steer better simply by contrastive learning on both positive and negative cases; see the steering sketch below. Via the notion of "negative steering", we find that negative steering
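
A minimal sketch of an interchange intervention (activation patching) using PyTorch forward hooks. The toy two-layer model and the hook site are assumptions for illustration; in practice you would hook a transformer layer's output instead.

```python
# Interchange intervention sketch: record an activation on a "source"
# run, then swap it into a "base" run and read off the output.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
layer = model[0]  # the intervention site (toy choice)

base = torch.randn(1, 4)    # "base" input whose run we patch
source = torch.randn(1, 4)  # "source" input whose activation we record

# 1) Record the activation at the site on the source run.
cache = {}
def record(module, inp, out):
    cache["act"] = out.detach()
handle = layer.register_forward_hook(record)
model(source)
handle.remove()

# 2) Swap the cached source activation into the base run.
def patch(module, inp, out):
    return cache["act"]  # returning a value replaces the module output
handle = layer.register_forward_hook(patch)
patched_out = model(base)  # counterfactual output under the intervention
handle.remove()

print(model(base), patched_out)  # compare clean vs. patched outputs
```

If the patched output differs from the clean one, the patched site is causally implicated in the behavior.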
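
Because DAS assumes the relevant features are not axis-aligned, the interchange happens in a learned rotated basis. A minimal sketch, assuming a toy hidden size: keep a rotation R orthogonal, swap the first k rotated coordinates between base and source, and train R so the intervened model matches the counterfactual label. `model_tail`, `loss_fn`, and `data` in the schematic loop are hypothetical placeholders, not from the source.

```python
# DAS sketch: learn an orthogonal rotation R so that swapping a small
# rotated subspace reproduces counterfactual behavior.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d, k = 8, 2                                    # hidden size, subspace size
rot = orthogonal(nn.Linear(d, d, bias=False))  # rot.weight stays orthogonal

def interchange(h_base, h_source):
    # Rotate, swap the first k coordinates, rotate back.
    z_b = h_base @ rot.weight.T
    z_s = h_source @ rot.weight.T
    z_b = torch.cat([z_s[:, :k], z_b[:, k:]], dim=-1)
    return z_b @ rot.weight

h_b, h_s = torch.randn(1, d), torch.randn(1, d)
h_patched = interchange(h_b, h_s)

opt = torch.optim.Adam(rot.parameters(), lr=1e-3)
# Schematic training loop: minimize the gap between the model's output
# under the intervention and the counterfactual label.
# for h_base, h_source, y_counterfactual in data:
#     logits = model_tail(interchange(h_base, h_source))
#     loss = loss_fn(logits, y_counterfactual)
#     opt.zero_grad(); loss.backward(); opt.step()
```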
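
ReFT's LoReFT intervention edits the hidden state only in a learned low-rank subspace, Phi(h) = h + R^T (W h + b - R h) with R having orthonormal rows, applied only at prompt-token positions. A minimal sketch, with illustrative dimensions; note that in ReFT R is kept orthonormal via a parametrization throughout training, while here it is only orthonormal at initialization.

```python
# LoReFT-style intervention sketch: Phi(h) = h + R^T (W h + b - R h).
import torch
import torch.nn as nn

class LoReFT(nn.Module):
    def __init__(self, d, r):
        super().__init__()
        # r x d projection with (approximately) orthonormal rows at init.
        self.R = nn.Parameter(torch.linalg.qr(torch.randn(d, r))[0].T)
        self.proj = nn.Linear(d, r)  # computes W h + b
    def forward(self, h):
        # Edit = (projected target - current subspace value), mapped back.
        return h + (self.proj(h) - h @ self.R.T) @ self.R

d, r = 16, 4
reft = LoReFT(d, r)
h = torch.randn(2, 10, d)        # (batch, seq, hidden)
prompt_len = 6
# Intervene on prompt tokens only, leaving later positions untouched.
h[:, :prompt_len] = reft(h[:, :prompt_len])
```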
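
The notes do not spell out the contrastive steering recipe, so this is a sketch of one standard instance of it: a difference-in-means vector between positive and negative example activations, added with a positive coefficient to steer toward the concept and a negative coefficient for negative steering. The stand-in activations are synthetic so the snippet runs on its own.

```python
# Contrastive steering sketch: steer along mean(pos) - mean(neg).
import torch

def steering_vector(pos_acts, neg_acts):
    # pos_acts, neg_acts: (n_examples, d) activations at a chosen layer
    return pos_acts.mean(0) - neg_acts.mean(0)

def steer(h, v, alpha):
    # h: (batch, seq, d); alpha > 0 steers toward, alpha < 0 away
    return h + alpha * v

d = 8
pos = torch.randn(32, d) + 1.0   # stand-in activations, positive cases
neg = torch.randn(32, d) - 1.0   # stand-in activations, negative cases
v = steering_vector(pos, neg)

h = torch.randn(2, 5, d)
h_pos = steer(h, v, alpha=4.0)   # positive steering
h_neg = steer(h, v, alpha=-4.0)  # negative steering
```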