Feed both the accepted (chosen) and the rejected response into your model, and get two scalars out, $r_{\text{chosen}}$ and $r_{\text{rejected}}$:
\begin{equation} \mathcal{L}_{RM} = \log \left(1 + e^{r_{\text{rejected}}-r_{\text{chosen}}}\right) \end{equation}
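As a rough illustration of the loss above, here is a minimal sketch in plain Python. The function names `pairwise_rm_loss` and `batch_loss_and_accuracy` are hypothetical, and the scalars are assumed to be ordinary floats already produced by the reward model. In practice you would compute the same quantity on tensors in your training framework.

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Return log(1 + exp(r_rejected - r_chosen)) for one preference pair."""
    x = r_rejected - r_chosen
    # Stable softplus: max(x, 0) + log1p(exp(-|x|)) avoids overflow in exp
    # when the two rewards are far apart.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def batch_loss_and_accuracy(chosen, rejected):
    """Mean loss and pairwise accuracy over a batch of reward pairs.

    Accuracy here is assumed to mean the fraction of pairs where the
    chosen response scores strictly higher than the rejected one.
    """
    pairs = list(zip(chosen, rejected))
    mean_loss = sum(pairwise_rm_loss(c, r) for c, r in pairs) / len(pairs)
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    return mean_loss, accuracy
```

The loss equals $\log 2$ when the two rewards are equal, and it shrinks toward zero as $r_{\text{chosen}}$ pulls ahead of $r_{\text{rejected}}$.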
Train for only one epoch. You should expect low accuracy scores, so you may need to ensemble or use a margin loss. PPO gets the best model.