ICLR 2025 Paper Notes

16 papers reviewed.

ICLR2025 Adaptive Computation

Talks

- ICLR2025 Snell: Optimality of Scaling LLM Test-Time Compute
- ICLR2025 Mathur: MIND Adaptive Thinking with Dynamic Computation
- ICLR2025 Yue: Inference Scaling for RAG

Full note

ICLR2025 Context and Retrieval

Talks

- ICLR2025 Wu: Retrieval Head Explains Long Context

Full note

ICLR2025 Friday Posters

- ICLR2025 Morris: contextual document embeddings. Take a bunch of sentence embeddings as input to produce a new sentence embedding that is now contextual.
- ICLR2025 Noukhovich: asynchronous reinforcement learning for language models. Roll out and tune concurrently.
- ICLR2025 Yao: CR-CTC. Consistency-regularized CTC: the CTC loss can be made more robust if you regularize two augmented views of the same mel spectrogram to have minimal difference.
- ICLR2025 Sun: ReDeEP. Detecting hallucination using mechanistic int...

Full note

ICLR2025 HAIC

ICLR2025 Koyejo Proposal: Focus AI measurements on the validity of specific terms. Five pillars of claim making:

- content validity: does your evaluation cover all valuable cases?
- criterion validity: does your evaluation correlate with a known validated standard?
- construct validity: does your evaluation measure the intended construct?
- external validity: does your evaluation generalize across different environments or settings?
- consequential validity: does your evaluation consider the real-world im...

Full note

ICLR2025 Jin: MOE++ zero computation experts

Motivation: a fixed number of experts is activated per task. Key Insight: MoE++ allows the expert allocation to be adaptive. Method: three key contributions:

- zero-computation experts: discard the input, E(x) = 0; copy the input, E(x) = x ("skip"); constant, E(x) = \alpha_{a} x + \alpha_{b} v_{\theta} (plus normal FFN experts)
- pathway-aware router (with an additional loss augmentation, where we learn a \tau_{\theta} to decide something else I missed)
- zero-computation e...
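A minimal sketch of the three zero-computation expert types, assuming the E(x) forms above (the function and parameter names here are mine, not the paper's):

```python
import numpy as np

def zero_expert(x):
    # discard the input entirely: E(x) = 0
    return np.zeros_like(x)

def copy_expert(x):
    # pass the input through unchanged ("skip"): E(x) = x
    return x

def make_const_expert(alpha_a, alpha_b, v):
    # mix the input with a learned constant vector v:
    # E(x) = alpha_a * x + alpha_b * v
    def const_expert(x):
        return alpha_a * x + alpha_b * v
    return const_expert
```

All three cost (nearly) zero FLOPs compared to a normal FFN expert, which is the point: the router can spend real compute only where it is needed.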

Full note

ICLR2025 Kilani: MrT5 Tokenizer-Free

Motivation: ByT5 is very expensive (because you have to have a residual on every damn token). MrT5: uses a soft attention masking gate at pretraining time to delete unused tokens; at inference time we use a hard cut. Cool: MrT5 learns per-language compression rates (different languages have different rates).
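A rough sketch of the soft-vs-hard deletion idea, with an illustrative sigmoid gate (the gate parameters and the 0.5 threshold are my assumptions, not MrT5's actual architecture):

```python
import numpy as np

def delete_gate(hidden, w, b=0.0, threshold=0.5, hard=False):
    # per-token keep probability from a linear gate (w, b are
    # illustrative parameters, not MrT5's exact ones);
    # soft mode (pretraining): downweight each token by its gate value;
    # hard mode (inference): actually drop tokens below the threshold
    logits = hidden @ w + b
    gate = 1.0 / (1.0 + np.exp(-logits))   # sigmoid keep-probability
    if hard:
        return hidden[gate >= threshold]    # hard cut at inference
    return hidden * gate[:, None]           # soft mask at pretraining
```

The soft version keeps the graph differentiable so the gate can be learned; the hard version is where the actual FLOPs savings come from.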

Full note

ICLR2025 Li: MoE is secretly an embedding

Motivation: can we directly extract embeddings from MoE routing weights during forwarding (i.e., compared to traditional residual-stream information)? Key Insight: using residual states vs. routing weights as semantic-search embeddings offers complementary strengths (i.e., when one method fails, the other one succeeds more often). Method: create an aggregate embedding:

\begin{equation} E_{j} = X_{j} + \alpha W_{j} \end{equation}

where W_{j} is the routing weight of the residual, and X_{j} is the residua...
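The aggregate embedding follows directly from the equation; the zero-padding to a common dimension below is my stand-in for whatever alignment the paper actually uses:

```python
import numpy as np

def aggregate_embedding(residual_state, routing_weights, alpha=1.0):
    # E_j = X_j + alpha * W_j, with the shorter vector zero-padded
    # to a common dimension (illustrative, not the paper's scheme)
    d = max(len(residual_state), len(routing_weights))
    x = np.pad(residual_state, (0, d - len(residual_state)))
    w = np.pad(routing_weights, (0, d - len(routing_weights)))
    return x + alpha * w
```

The scalar alpha trades off how much the routing signal contributes relative to the residual state.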

Full note

ICLR2025 Mathur: MIND Adaptive Thinking with Dynamic Computation

Motivation: standard computation doesn't adapt. Fixed-Point Iteration for Adaptation:

- method (CNN): for every layer, perform fixed-point iteration until convergence to mask out (what exactly?); also supervise an "introspection model" to skip the entire fixed point; loss: LM + supervision for the introspection model
- method (MIND-transformer): for every layer, perform fixed-point iteration until the attention activations converge; ditto introspection as above
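The per-layer loop can be sketched generically; the convergence test and the toy contraction below are illustrative, not MIND's actual update rule:

```python
import numpy as np

def fixed_point_iterate(f, x0, tol=1e-6, max_steps=100):
    # iterate x <- f(x) until successive iterates differ by < tol,
    # the generic stopping rule behind "iterate until convergence"
    x = x0
    for step in range(1, max_steps + 1):
        x_next = f(x)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next, step
        x = x_next
    return x, max_steps

# toy contraction: f(x) = 0.5*x + 1 has fixed point x = 2
x_star, steps = fixed_point_iterate(lambda x: 0.5 * x + 1.0, np.array([0.0]))
```

The adaptivity comes from `steps`: easy inputs converge in few iterations, hard ones take more, and the introspection model learns to predict when the loop can be skipped entirely.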

Full note

ICLR2025 MoE

Talks

- ICLR2025 Li: MoE is secretly an embedding
- ICLR2025 Jin: MOE++ zero computation experts

Full note

ICLR2025 Neitemeier: Hierarchical Autoregressive Transformers

“A byte-level transformer, with some compression.” Key insight: use a [CLS] token in front of every word to train a small “tokenizer”, then do a normal transformer on the [CLS] tokens, and then autoregressively decode out the single bytes. Method: Hierarchical Autoregressive Transformers. We put a [CLS] in front of every word, so the input looks like [CLS] M y _ [CLS] n a m e _ [CLS] i s. We then run a small encoder over each sequence, and then you take the encoded [CLS], a...
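The [CLS]-insertion step can be sketched as follows, using "_" as the word boundary, as in the example (the function name is mine):

```python
def insert_cls(text, cls="[CLS]"):
    # prepend a [CLS] marker to every word and separate words with
    # the "_" boundary byte, mirroring the example in the note
    out = []
    for word in text.split():
        if out:
            out.append("_")
        out.append(cls)
        out.extend(word)
    return out
```

For example, `insert_cls("My name is")` reproduces the `[CLS] M y _ [CLS] n a m e _ [CLS] i s` sequence from the note.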

Full note

ICLR2025 Saturday Posters

- ICLR2025 Cassidy: AssistanceZero. Train the reward predictor to also have rewards at test time; MCTS; learn to match the root-node KL.
- ICLR2025 Liu: synthesizing programmatic reinforcement learning policies with LLM-guided search. Hill climbing with partial mutations of LLM-generated programs.
- ICLR2025 Weller: Promptriever ??
- ICLR2025 Yu: robust LLM safeguard via refusal feature adversarial training. With mechanistic interpretability, we can find a subspace which is correlated with refusal, pull that u...

Full note

ICLR2025 Snell: Optimality of Scaling LLM Test-Time Compute

Compute-Optimal Scaling: the notion of selecting the optimal configuration (beam width, search budget, etc.) dynamically / per binned question. Approaches to “Scaling Test-Time Compute”: three primary approaches:

- best-of-n: roll out a bunch, reject
- beam search: check against intermediate steps
- lookahead search: MCTS-ish (do lookahead rollouts)

Key insight: on easy questions, beam search shows over-optimization and best-of-n is good; on medium/hard questions, beam searc...
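The best-of-n baseline is easy to sketch; `generate` and `score` here are placeholders for the sampler and the verifier/reward model, not anything from the paper:

```python
import random

def best_of_n(generate, score, prompt, n=8, seed=0):
    # the best-of-n baseline: sample n candidate answers,
    # keep the highest-scoring one, reject the rest
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

Compute-optimal scaling then amounts to choosing n (or swapping in beam/lookahead search) per difficulty bin rather than using one fixed budget everywhere.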

Full note

ICLR2025 Thursday Morning Posters

- ICLR2025 Hu: belief state transformer. Key insight: the residual stream at the last token can be thought of as a belief state encoding future tokens; that is, uncertainty in the last residual directly correlates with the diversity of the output. Method: train a transformer and a reverse transformer (like what Robert wanted), then correlate.
- ICLR2025 Lingam: diversity of thoughts. Key insight: use iterative sampling to achieve higher diversity in self-reflection, in order to get better outputs.
- ICLR2025 Gu: data...

Full note

ICLR2025 Tokenizer-Free Approaches

Talks

- ICLR2025 Kilani: MrT5 Tokenizer-Free
- ICLR2025 Neitemeier: Hierarchical Autoregressive Transformers

Downsides of Subword Tokenization:

- not learned end to end: the vocab is fixed and can’t adapt to difficulty
- non-smoothness: similar inputs get mapped to very different token sequences ([token][ization]; with a typo: [token][zi][ation] <- suddenly bad despite a small typo)
- huge vocabs
- non-adaptive compression ratio: you can’t decide how much to compress (affects FLOPs/document)

Full note

ICLR2025 Wu: Retrieval Head Explains Long Context

Motivation: previous works show “heads” that perform some specific mechanism for context retrieval. Retrieval Head: the authors show that retrieval heads exist in transformers, using the Needle-in-a-Haystack framework. Key Insight: there exist certain heads which perform retrieval, as measured by the retrieval score. Methods: Measuring Retrieval Behavior. “Retrieval score”: how often does a head engage in copy-paste behavior? Token inclusion: the current generated token w is in...
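One simplified reading of the retrieval score as a sketch (the exact matching and normalization in the paper may differ):

```python
def retrieval_score(generated, attended, needle):
    # fraction of generated tokens that look copy-pasted from the
    # needle via this head: the token appears in the needle AND the
    # head's most-attended context token is that same token
    # (a simplified reading of the paper's retrieval score)
    needle_set = set(needle)
    hits = sum(1 for w, a in zip(generated, attended)
               if w in needle_set and a == w)
    return hits / len(generated)
```

Here `attended[i]` is assumed to be the context token this head attends to most strongly when generating token i; heads with a high score across many needles are the retrieval heads.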

Full note

ICLR2025 Yue: Inference Scaling for Long-Context RAG

“RAG performance can scale almost linearly w.r.t. log inference FLOPs.” Demonstration-Based RAG (DRAG) Method: add demonstrations as k in-context examples. Prompt: documents, input query, final answer. Parameters: number of documents, number of in-context samples, upper bound on the number of iterations. Iterative Demonstration-Based RAG (IterDRAG) Method: DRAG as above, and then the model can generate a new sub-query; the model decides when to stop. Parameters: number of documents, number of in-context sam...
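The DRAG prompt layout described above can be sketched as follows (the field labels are illustrative, not the paper's exact template):

```python
def build_drag_prompt(documents, demos, query):
    # assemble a DRAG-style prompt: k in-context demonstrations
    # (each with its own documents, query, and final answer),
    # then the retrieved documents and the test query
    parts = []
    for demo in demos:
        parts.append("Documents:\n" + "\n".join(demo["documents"]))
        parts.append("Question: " + demo["query"])
        parts.append("Answer: " + demo["answer"])
    parts.append("Documents:\n" + "\n".join(documents))
    parts.append("Question: " + query)
    parts.append("Answer:")
    return "\n".join(parts)
```

Scaling inference compute then means turning up the knobs: more retrieved documents, more demonstrations, and (for IterDRAG) more sub-query iterations.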

Full note
