1-Introduction

More researchers are recognizing the significance of instruction data in the Supervised Fine-Tuning (SFT) stage. In June, I wrote a blog post about data generation, but it was somewhat superficial and incomplete, and many new methods have emerged since then. Therefore, I want to cover more of the papers I have read and discuss instruction data generation and selection in greater depth.

There are generally two ways to obtain instruction data: human annotation or automatic generation with LLMs. Because traditional annotation is quite costly, many researchers are focusing on methods that automatically create high-quality synthetic instruction data.

In this blog, I will discuss automated methods for generating instruction data.

2-Seed Data and Prompt Engineering

2-1-Self-Instruct

Most research on synthetic instruction data generation combines seed data with prompt engineering: researchers prepare a small set of seed examples and use prompting techniques with LLMs to generate more instruction data, as sketched below.
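To make this concrete, here is a minimal sketch of what such a generation prompt can look like. The wording and the number of in-context seed examples are my own illustrative choices, not the exact prompt from any particular paper:

```python
# Hypothetical few-shot prompt: seed instructions are shown as in-context
# examples, and the LLM is asked to continue the list with new instructions.
seed_instructions = [
    "Classify the sentiment of the following movie review.",
    "Summarize the article below in two sentences.",
    "Translate the following sentence into French.",
]

def build_generation_prompt(seeds: list[str]) -> str:
    """Assemble a few-shot prompt that asks an LLM for new instructions."""
    lines = ["Come up with a series of diverse task instructions:"]
    lines += [f"{i + 1}. {s}" for i, s in enumerate(seeds)]
    # Leave the next numbered slot open so the model continues the list.
    lines.append(f"{len(seeds) + 1}.")
    return "\n".join(lines)

print(build_generation_prompt(seed_instructions))
```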

To the best of my knowledge, Wang et al. (2022) were the first to present a semi-automated process, named self-instruct, for instruction-tuning a pretrained LLM. The process is as follows:

Figure 1. A high-level overview of SELF-INSTRUCT (Wang et al., 2022)

As shown, there are four steps in this process: (1) instruction generation, where the LLM is prompted with sampled tasks from a seed pool to propose new instructions; (2) classification task identification, which decides how instances should be generated; (3) instance generation for each instruction (output-first for classification tasks, input-first otherwise); and (4) filtering out low-quality and near-duplicate data before adding the rest back to the task pool.
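To make the control flow concrete, here is a minimal Python skeleton of that loop. The `llm.generate_instructions`, `llm.is_classification_task`, `llm.generate_instances`, and `passes_filters` calls are hypothetical placeholders for LLM queries and quality checks, not an actual API; only the overall structure follows the paper:

```python
import random

def self_instruct_loop(task_pool, llm, target_size=52_000):
    """Hypothetical skeleton of the four-step self-instruct loop."""
    while len(task_pool) < target_size:
        # Step 1: sample a few tasks from the pool as in-context examples
        # and ask the LLM to propose new instructions.
        demos = random.sample(task_pool, k=min(8, len(task_pool)))
        new_instructions = llm.generate_instructions(demos)

        for instruction in new_instructions:
            # Step 2: decide whether this is a classification task, which
            # changes how instances are generated.
            is_clf = llm.is_classification_task(instruction)

            # Step 3: generate input-output instances for the instruction
            # (output-first for classification tasks, input-first otherwise).
            instances = llm.generate_instances(instruction, output_first=is_clf)

            # Step 4: filter low-quality or near-duplicate data before
            # adding it back to the pool (see the ROUGE-L filter below).
            if passes_filters(instruction, instances, task_pool):
                task_pool.append({"instruction": instruction,
                                  "instances": instances})
    return task_pool
```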

The well-known "Alpaca" dataset (Taori et al., 2023) was later built with essentially this pipeline. The following is an examination of the dataset generated by Wang et al.:

Figure 3. Dataset Overview (Wang et al., 2022)

As noted, most of the generated instructions are meaningful, although the instances may contain some noise; they still offer valuable guidance for training models to follow instructions. One of the filters self-instruct uses to keep this noise in check is a similarity check on new instructions, sketched below.
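Concretely, the self-instruct paper adds a new instruction to the pool only when its ROUGE-L overlap with every existing instruction is below 0.7. A minimal sketch, assuming the `rouge-score` package:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(instruction: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Keep an instruction only if its ROUGE-L overlap with every
    existing instruction stays below the threshold."""
    for existing in pool:
        score = scorer.score(existing, instruction)["rougeL"].fmeasure
        if score >= threshold:
            return False
    return True

pool = ["Summarize the article below in two sentences."]
print(is_novel("Summarize the article below in three sentences.", pool))  # near-duplicate, rejected
print(is_novel("Write a haiku about autumn.", pool))  # sufficiently different, kept
```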

2-2-Evol-Instruct

Building on this Alpaca dataset, Xu et al. (2023) introduced a new method called "Evol-Instruct" for generating complex instruction data. The pipeline of Evol-Instruct is shown below: