Feed both the chosen (accepted) and the rejected completion into your model, and get two scalars out, $r_{\text{chosen}}$ and $r_{\text{rejected}}$:

\begin{equation} \mathcal{L}_{RM} = \log \left(1 + e^{r_{\text{rejected}}-r_{\text{chosen}}}\right) \end{equation}
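A minimal sketch of the loss above in plain Python, assuming the two scalars have already been produced by the model. The function names `pairwise_rm_loss` and `batch_loss_and_accuracy` are hypothetical, and the accuracy definition (fraction of pairs where the chosen score is higher) is an assumption about what "accuracy" means here, not something the note specifies.

```python
import math

def pairwise_rm_loss(r_chosen, r_rejected):
    """Compute log(1 + exp(r_rejected - r_chosen)) without overflow."""
    x = r_rejected - r_chosen
    # Stable softplus: log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|})
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def batch_loss_and_accuracy(pairs):
    """pairs: list of (r_chosen, r_rejected) scalar tuples (hypothetical input format)."""
    losses = [pairwise_rm_loss(c, r) for c, r in pairs]
    # Assumed accuracy: a pair counts as correct when the chosen score is strictly higher.
    correct = sum(1 for c, r in pairs if c > r)
    return sum(losses) / len(pairs), correct / len(pairs)

if __name__ == "__main__":
    mean_loss, acc = batch_loss_and_accuracy([(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)])
    print(mean_loss, acc)
```

The loss is small when $r_{\text{chosen}}$ is well above $r_{\text{rejected}}$, equals $\log 2$ when the two scores are tied, and grows roughly linearly when the ranking is wrong.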

Train for only one epoch. You should expect low accuracy scores, so you may need to ensemble or use a margin loss. PPO gets the best model.
