More researchers are recognizing the significance of instruction data in the Supervised Fine-Tuning (SFT) stage. In June, I wrote a blog about data generation, but I believe it was somewhat superficial and incomplete. Since then, many new methods have emerged, so here I want to go through more of the papers I have read on instruction data generation and selection.
There are generally two ways to obtain instruction data: through human annotation or by using automatically generated data with LLMs. However, traditional annotation can be quite costly. As a result, many researchers are focusing on methods to automatically create high-quality synthetic instruction data.
In this blog, I will discuss automated methods for generating instruction data.
Most research on synthetic instruction data generation combines seed data with prompt engineering. Researchers prepare the seed data and use prompting techniques with LLMs to generate more instruction data.
As far as my limited exploration goes, Wang et al. (2022) were the first to present a semi-automated process, named Self-Instruct, for instruction-tuning a pretrained LLM. The process is as follows:
Figure 1. A high-level overview of SELF-INSTRUCT (Wang et al., 2022)
As shown, there are four steps in this process.
Creating task instructions. Self-instruct creates new instructions from a small set of initial human-written instructions through a bootstrapping method. The authors compile a task pool containing 175 seed tasks, each with one instruction and one instance. At each step, they select 8 task instructions from this pool to use as in-context examples. To maintain diversity, 6 of the 8 instructions are from human-written tasks, while 2 are from model-generated tasks from previous steps. The prompt template is as follows (a small code sketch of the sampling step follows the figure):
Figure 2. Prompt template for instruction generation (Wang et al., 2022)
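To make this step concrete, here is a minimal sketch of how in-context examples could be sampled from the task pool and assembled into a generation prompt. The function name and the exact prompt wording are my own simplifications, not the authors' template (which is shown in Figure 2).

```python
import random

def build_generation_prompt(human_instructions, model_instructions,
                            num_human=6, num_model=2):
    """Sample 6 human-written and 2 model-generated instructions and
    format them as a numbered list the LLM is asked to continue."""
    examples = random.sample(human_instructions, num_human)
    if len(model_instructions) >= num_model:
        examples += random.sample(model_instructions, num_model)
    random.shuffle(examples)

    lines = ["Come up with a series of tasks:"]
    for i, instruction in enumerate(examples, start=1):
        lines.append(f"Task {i}: {instruction}")
    # The LLM is expected to continue the numbered list with new tasks.
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)
```

In a full pipeline, newly generated instructions that survive the later filtering step would be appended to `model_instructions`, so the bootstrapping loop gradually mixes model-generated examples into the prompts.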
Identifying if the instruction is for a classification task. The authors prompt the LLM in a few-shot manner to determine whether the generated instruction describes a classification task. They include 12 classification instructions and 19 non-classification instructions from the seed tasks as examples in the prompt.
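A minimal sketch of this few-shot check might look like the following; the question wording and the helper name are illustrative, not the exact prompt from the paper.

```python
def build_classification_check_prompt(clf_examples, non_clf_examples,
                                      new_instruction):
    """Few-shot prompt asking whether an instruction is a classification
    task; clf_examples / non_clf_examples come from the seed tasks."""
    blocks = ["Can the following task be regarded as a classification "
              "task with finite output labels?"]
    for inst in clf_examples:
        blocks.append(f"Task: {inst}\nIs it classification? Yes")
    for inst in non_clf_examples:
        blocks.append(f"Task: {inst}\nIs it classification? No")
    # The model is expected to answer "Yes" or "No" for the new instruction.
    blocks.append(f"Task: {new_instruction}\nIs it classification?")
    return "\n\n".join(blocks)
```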
Generating instances using input-first or output-first methods. There are two ways to generate instances. The first is the input-first approach: the authors ask the LLM to first come up with the input fields based on the instruction and then produce the corresponding output. However, they find that this method can generate inputs biased toward one label, especially for classification tasks. For those tasks they therefore use the second, output-first approach: they first generate the possible class labels and then condition the input generation on each label. In summary, the output-first approach is applied to classification tasks and the input-first approach to non-classification tasks.
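The two strategies differ only in what the model is asked to produce first. Here is a rough sketch of the two prompt shapes, with illustrative wording rather than the paper's exact templates:

```python
def input_first_prompt(instruction):
    """Input-first (non-classification tasks): invent an input for the
    instruction, then produce the matching output."""
    return (f"Instruction: {instruction}\n"
            "Come up with an example input for this task, "
            "then give the correct output.\n"
            "Input:")

def output_first_prompt(instruction):
    """Output-first (classification tasks): enumerate the possible class
    labels first, then generate an input conditioned on each label, which
    avoids inputs biased toward a single label."""
    return (f"Instruction: {instruction}\n"
            "List the possible class labels for this task, and for each "
            "label generate an input that should receive that label.\n"
            "Class label:")
```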
Filtering out low-quality data. Ensuring the quality of the dataset is crucial. A newly generated instruction is added to the task pool only when it is sufficiently dissimilar from the instructions already in the pool, with ROUGE-L as the similarity metric. Additionally, the authors exclude instructions containing specific keywords (such as references to images or graphs) that LLMs cannot handle.
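As a sketch, this filtering step could be implemented with the `rouge_score` package. The keyword list below is illustrative, and the 0.7 threshold reflects my reading of the paper; adjust both to taste.

```python
from rouge_score import rouge_scorer

# Illustrative keyword list; the paper filters out instructions that
# mention things a text-only LLM cannot act on (e.g. images, graphs).
BANNED_KEYWORDS = {"image", "picture", "graph", "file", "plot", "draw"}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def keep_instruction(candidate, task_pool, max_rouge=0.7):
    """Keep a candidate only if it contains no banned keywords and its
    ROUGE-L overlap with every instruction already in the pool stays
    below the threshold."""
    if any(kw in candidate.lower() for kw in BANNED_KEYWORDS):
        return False
    for existing in task_pool:
        if scorer.score(existing, candidate)["rougeL"].fmeasure >= max_rouge:
            return False
    return True
```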
The authors use this pipeline to generate a large instruction dataset with GPT-3; the same self-instruct recipe was later used to build the well-known "Alpaca" dataset. The following is an overview of the generated data:
Figure 3. Dataset overview (Wang et al., 2022)
As noted, most of the generated instructions are meaningful, although the instances may contain some noise. They still offer valuable guidance for training models to follow instructions.
Xu et al. (2023) introduced a new method called "Evol-Instruct" for generating complex instruction data, starting from the Alpaca dataset. The pipeline of Evol-Instruct is shown below: