What we have been building since ChatGPT at H4. No pretraining in any way.

Basic Three Steps

Goal: “helpful, harmless, honest, and huggy” bots.
1) Pretraining step: large-scale next-token prediction
2) In-context learning: few-shot learning without updating parameters
3) “Helpful” step: supervised fine-tuning on supervised data
4) “Harmless” step: training a classifier for result ranking
5) RLHF

Benchmarking

Before we started to train, we had a problem: most benchmarks cover generic reasoning, which evaluates steps 1) and 2). We therefore needed new metrics for steps 4) and 5).

Evaluating instruction-following and “chatty-ness”
- Pairwise Elo rating leaderboards from 🤗 + AlpacaEval. Both use GPT-4 as the automated evaluator, as well as humans.
- MT-Bench from LMSYS is a new benchmark for the same thing, but supports multi-turn evaluation.
- Three main effects observed:
  - results improve slightly the longer the prompt
  - GPT-4-judged MT-Bench assigns worse scores to GPT-4-like data
  - adding more data into fine-tuning had diminishing returns after thousands of samples
- TruthfulQA is the most differentiating benchmark; most others score about the same.

Evaluating the Reward Model
- There are no open-source reward models, nor anything on evaluating red teaming or datasets for it. The only dataset out there is Anthropic’s red-teaming data: https://huggingface.co/blog/red-teaming

Wackiness: GPT-4 as an Evaluator

Why is everybody using GPT-4 as a proxy for humans?
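A minimal sketch of one common way to use a model as a pairwise evaluator while controlling for its sensitivity to answer order: ask the judge twice with the answers swapped and keep only consistent verdicts. The `judge` callable and both toy judges are hypothetical stand-ins for a real API call, not something from the talk:

```python
def pairwise_verdict(judge, prompt, answer_a, answer_b):
    """Ask the judge twice with swapped answer order; keep only
    consistent verdicts, counting disagreements as ties.

    judge(prompt, first, second) -> "first", "second", or "tie".
    Returns "a", "b", or "tie".
    """
    v1 = judge(prompt, answer_a, answer_b)  # A shown in the first slot
    v2 = judge(prompt, answer_b, answer_a)  # B shown in the first slot
    if v1 == "first" and v2 == "second":
        return "a"
    if v1 == "second" and v2 == "first":
        return "b"
    return "tie"  # the judge contradicted itself: likely positional bias

# Toy judge with an extreme left positional bias: it always prefers
# whichever answer it sees first, so the swap reduces it to all ties.
def biased_judge(prompt, first, second):
    return "first"

# Toy judge with an order-independent criterion (longer answer wins),
# whose verdicts survive the swap.
def length_judge(prompt, first, second):
    return "first" if len(first) > len(second) else "second"
```

Running both orders doubles the evaluation cost, but it turns a pure positional preference into a tie instead of a wrong verdict.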
- GPT-4 has a left positional bias (if you admonish GPT about this, it will prefer the second one instead :/), while humans rate positions pretty much uniformly.
- “Doping”: GPT-4 prefers models trained on data that it itself generated.
- GPT-4 prefers a large variance in unique tokens.
- GPT-4 has bad correlation with humans on “low-entropy” factual tasks (QA, summarization, code); it has better correlation with humans on brainstorming and creative generation. (arXiv:2306.05685)

Supervised Fine-Tuning Data

“Self-Instruct” dataset (Wang et al. 2022) => “Surge Instruct” (Hugging Face 2023)
- Instruction (what to do)
- Input (what to do it on)
- Output (what you are supposed to produce)
Goal: “helpful and chatty”

Bootstrapping Data Generation
- 175 seed tasks: 1 instruction + 1 input/output pair
- Give them to a language model to generate more instructions

Human-in-the-Loop Data Generation (UltraChat, Ding et al. 2023)
1. A human does some research on the topic and creates a prompt
2. Ask an LLM to generate the output
3. If the result is not good, rephrase the prompt
4. Repeat until good

Roleplaying Data Generation
- Have two models role-play to get and correct data.

Hugging Face Surge-Instruct
- Humans write everything from scratch.
- With a pretrained model, diminishing returns are seen after a few thousand high-quality examples.

Task Distribution

What should the topics be? Use InstructGPT as guidance: the largest section is a generation task (12%), with OpenQA the second largest (12.4%). HF replaced the “Other” section of InstructGPT’s distribution (3.5%) with code work.

Length Distribution

How long should the prompts be? Collected distributions; Surge Instruct seems to be closest to InstructGPT’s. Both Anthropic and InstructGPT used a US-based task force, and so did 🤗:
- US-based taskforce
- roughly even gender split
- 19 to 62 years old
- primarily white
- technical degree to PhD

Only one turn was used. Multi-turn fine-tuning wasn’t a thing a few months ago.
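The Self-Instruct-style bootstrapping described above (seed tasks fed back to a model to propose new instructions) can be sketched as follows; `generate` and `toy_generate` are hypothetical stand-ins for the real language-model call:

```python
def bootstrap_instructions(seed_tasks, generate, rounds=10):
    """Grow an instruction pool from a small seed set, Self-Instruct style.

    seed_tasks: list of dicts with "instruction", "input", "output".
    generate(examples): stand-in for a language-model call that, given a
    few in-context example tasks, proposes one new task dict (or None).
    A candidate is kept only if its instruction is not already in the pool.
    """
    pool = list(seed_tasks)
    seen = {task["instruction"] for task in pool}
    for _ in range(rounds):
        candidate = generate(pool[:3])  # few-shot prompt built from the pool
        if candidate and candidate["instruction"] not in seen:
            seen.add(candidate["instruction"])
            pool.append(candidate)
    return pool

# Toy deterministic "model" that proposes numbered tasks, so the loop
# can run without a real LLM behind it.
_counter = iter(range(1000))
def toy_generate(examples):
    i = next(_counter)
    return {"instruction": f"Summarize text #{i}",
            "input": "some text", "output": "a summary"}
```

The deduplication step matters in practice: without it, a model sampling from its own outputs quickly floods the pool with near-identical instructions.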
Training

- Models: StarCoder, Falcon, Llama 2
- True fine-tuning + PEFT (LoRA)

The HF Part of RLHF

Scale agreed with Surge and H4 a lot more, but mostly no one agreed with anyone.
Goal: “safe and factual”

Task Distribution
- A lot more about factual things (because we want to encourage factualness), so much more math and code than general generation. It’s also easier to score.

Ranking Guidelines
- OpenAI has guidelines about how to rate.
- Rate every turn of the dialogue.
- Smaller total length (<2048 tokens).
- Helpfulness OVER honesty: this is the opposite of OpenAI, because the model wasn’t large enough to be very honest.
- Two-step selection: “which one is better” => “how much better”
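One standard way to turn “which one is better” rankings into a reward-model training signal is a Bradley-Terry pairwise loss; how H4 did it exactly is not stated in the talk, so this is an assumed sketch with plain floats standing in for the reward model’s scores:

```python
import math

def pairwise_ranking_loss(score_chosen, score_rejected):
    """Bradley-Terry ranking loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model assigns a higher score to
    the answer the human ranked as better, and grows as the ordering
    flips; minimizing it over many ranked pairs trains the scorer.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is log 2 ≈ 0.693; a large positive margin drives it toward zero, and an inverted ranking is penalized roughly linearly in the margin.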