# Goals

**Motivation:** It is very difficult to obtain an interpretable, causal trace of facts. Let's fix that.

## Facts

It is also difficult to pull apart what is a "fact" and what is a "syntactical relation". For instance, the prompt

> The Apple iPhone is made by American company `<mask>`.

is different from, and arguably more of a syntax-eliciting prompt than a fact-eliciting prompt compared to,

> The iPhone is made by American company `<mask>`.

For our purposes, however, we sidestep this problem by saying that both of these cases are a recall of the fact triplet `<iPhone, made_by, Apple>`. Even despite the syntactical relationship established by the first case, we define success as any intervention that edits this fact triplet without influencing other completions of the form:

> The [company] [product] is made by [country] company [company].

# The Probe

## Definition

**Maps**

- Hidden mappings $H^{(1)}, \dots, H^{(N)}$
- Output projection $W = W^{O} W^{I}$

**Spaces**

- Embedding space $U \subset \mathbb{R}^{\text{hidden}}$
- Vocab space $V \subset \mathbb{R}^{|V|}$, where $|V|$ is the vocab size

**LM:** $L = \left(W H^{(N)} \dots H^{(1)}\right): U \to V$, such that $L u \in V$, for some word embedding $u \in U$.

**LM's distribution:** $\sigma \circ L$, such that $\sigma(L u) \in \triangle_{|V|}$.
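A minimal numerical sketch of this composition, assuming toy dimensions and random linear maps standing in for the $H^{(i)}$ (real transformer blocks are nonlinear; everything here is illustrative, not any particular model's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (invented for illustration).
d_hidden, vocab_size, N = 16, 50, 4

# Hidden mappings H^{(1)}, ..., H^{(N)}: random linear stand-ins.
H = [rng.normal(size=(d_hidden, d_hidden)) / np.sqrt(d_hidden) for _ in range(N)]

# Output projection W: U -> V.
W = rng.normal(size=(vocab_size, d_hidden)) / np.sqrt(d_hidden)

def L(u):
    """L = W H^{(N)} ... H^{(1)}: embedding u in U -> logits in V."""
    h = u
    for H_i in H:              # apply H^{(1)} first, H^{(N)} last
        h = H_i @ h
    return W @ h

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

u = rng.normal(size=d_hidden)  # a word embedding u in U
probs = softmax(L(u))          # sigma(L u) lies in the simplex Delta_{|V|}
assert np.isclose(probs.sum(), 1.0)
```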

## The Logit Lens

The Logit Lens proposes that we can chop off the last few $H^{(i)}$ and still recover a distribution similar to the true output distribution. Empirically, given large enough $N$, it is likely that

\begin{equation}
\arg\max_{j} \left(W H^{(N)} \dots H^{(1)} u\right)_{j} = \arg\max_{j} \left(W H^{(N-1)} \dots H^{(1)} u\right)_{j} = \arg\max_{j} \left(W H^{(N-2)} \dots H^{(1)} u\right)_{j}
\end{equation}

up to some finite depth before this effect breaks down.
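Continuing the toy setup above, a sketch of the probe's mechanics (with random linear stand-ins the depth-wise agreement is of course not guaranteed; the point is only how the readout is computed):

```python
def logit_lens(u):
    """Argmax token after each prefix W H^{(i)} ... H^{(1)} u."""
    h = u
    tops = []
    for H_i in H:
        h = H_i @ h
        tops.append(int(np.argmax(W @ h)))  # project the intermediate state
    return tops

# The last entry is the model's actual top token; the Logit Lens claim is
# that the preceding few entries tend to agree with it.
print(logit_lens(u))
```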

# A Sketch

Evidence suggests that storage of "factual" information is not typically axis-aligned in $U$. That is, it is difficult to learn a binary mask $m$ such that $m \odot u \in U$ disrupts the downstream production of one fact without knocking out other knowledge. However, we know that due to the one-hot cross-entropy LM objective, "facts" (as defined above) are axis-aligned in $V$. After all, a word $v_{j}$ is represented by the $j$-th standard basis vector (i.e. one-hot vector) in $V$.
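To make the contrast concrete, a toy illustration continuing the setup above (the dense "fact direction" $r$ is an invented stand-in for however a fact is actually encoded in $U$):

```python
j = 7                                # suppose token j carries the fact in V

# In V: the fact is literally coordinate j of the logits, so a single
# axis-aligned edit suppresses it without touching any other coordinate.
logits = L(u)
logits[j] = -np.inf                  # token j can no longer be argmax

# In U: suppose the fact is instead read out along a dense direction r.
r = rng.normal(size=d_hidden)
r /= np.linalg.norm(r)
base = r @ u                         # strength of the fact readout
for i in range(d_hidden):
    masked = u.copy()
    masked[i] = 0.0                  # binary mask m zeroing one coordinate
    # each single-axis mask removes only the small r_i * u_i term, so no
    # axis-aligned edit in U cleanly deletes the fact readout
    print(i, abs(r @ masked - base) / abs(base))
```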
