Example RLHF guidelines

Context

We’re creating a high-quality dataset to fine-tune our large language model. Specifically, we want to understand which type of responses people prefer to see from the model, and why. This preference data will be used to align the model’s outputs with the expectations of users.

Therefore, your contributions will be crucial to developing a model that’s calibrated to behave according to human expectations. This means providing helpful, accurate, and safe responses.

The project is expected to span the next two months, with regular check-ins and feedback sessions to ensure high-quality data collection.

Workflow (including prompt creation)

Use this workflow if you want human annotators to create the prompt dataset used to generate the outputs, which the annotators will then rank.

You’ll be given a set of high-level topics in your area of expertise. You will first write a prompt about this topic to send to the model. You’ll then get a set of responses answering your prompt, which you will rank based on specific criteria. Along with your ranking, you will also provide in-depth text feedback explaining your ranking.

  1. Write a prompt and submit to the model.
    1. The prompt should fall under the topic(s) you’ve been assigned, and clearly relate to the key concepts and issues within these topic(s).
    2. The prompt should be written clearly and concisely, with correct spelling and grammar. The intended goal or purpose of the prompt should be clear.
    3. Include any necessary context to ensure the model understands the scope and intent of the prompt.
    4. The prompt should be written in an open-ended manner to encourage detailed responses. Avoid yes/no questions or overly specific queries that limit the depth of the response.
  2. Rank the outputs.
    1. Rank the outputs based on helpfulness, accuracy, and harmlessness — in that order.
    2. Helpfulness means that the output follows your instructions and helps solve your task.
    3. Accuracy means that the output contains accurate information, doesn’t mislead the user, and includes all relevant information.
    4. Harmlessness means that the output does not pose any sort of harm to people, whether physical, mental, or social.
  3. Provide your reasoning for the ranking.
    1. Explain why you ranked the outputs the way you did, based on the criteria of helpfulness, accuracy, and harmlessness.
    2. Include an explanation of any tradeoffs you made while ranking. For example, if response A is more accurate but could be considered somewhat harmful, and response B is less accurate but also less harmful, then mention this tradeoff when ranking A > B.
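To make the three steps above concrete, here is a minimal sketch of what a single completed annotation record might look like. The schema and field names (`topic`, `responses`, `ranking`, `rationale`) are illustrative assumptions, not a required format for this project.

```python
import json

# One hypothetical annotation record covering all three workflow steps:
# the annotator's prompt, the model outputs, the ranking, and the rationale.
record = {
    "topic": "nutrition",  # assigned high-level topic
    "prompt": "What are the trade-offs of intermittent fasting for adults?",
    "responses": {  # model outputs, keyed by label
        "A": "Intermittent fasting can reduce calorie intake, but ...",
        "B": "Fasting is always safe and works for everyone ...",
    },
    "ranking": ["A", "B"],  # best response first
    "rationale": (
        "A follows the instructions and is accurate. "
        "B overstates safety, which is both inaccurate and potentially "
        "harmful, so A > B despite B being shorter."
    ),
}

print(json.dumps(record, indent=2))
```

Note that the `rationale` field captures the tradeoff discussion from step 3 in free text, alongside the machine-readable `ranking` list.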

Workflow (only preference ranking)

Use this workflow if you already have a set of prompts you want to use to generate the outputs, which annotators will then rank.