Assignment 2: RNN-based Language Model

For the full code, see GitHub → link

Word-level language modeling with RNN

1) Preparing the data classes

import os
import torch

class Dictionary(object):
    def __init__(self):
        self.token2id = {}
        self.id2token = []

    def add_word(self, word):
        if word not in self.token2id:
            self.id2token.append(word)
            self.token2id[word] = len(self.id2token) - 1
        return self.token2id[word]

    def __len__(self):
        return len(self.id2token)

class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf-8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf-8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.token2id[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

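Dictionary class: the set of tokens that appear in the data. It maps each token in the set to a unique id.
Corpus class: prepares the inputs used when training and testing the model. It loads the data and builds the dictionary, then tokenizes the data and converts each token to its id using that dictionary.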

The provided wikitext-2 files are already tokenized, with every token separated by whitespace. Splitting on whitespace is therefore all the tokenization that is needed.
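To make the token-to-id mapping concrete, here is a minimal sketch of what Dictionary and the whitespace tokenization do to two short lines. The sample sentences are made up for illustration and are not taken from wikitext-2. The class is repeated in compact form only so that the snippet runs on its own.

```python
class Dictionary:
    """Compact copy of the Dictionary class above, repeated so this runs standalone."""

    def __init__(self):
        self.token2id = {}
        self.id2token = []

    def add_word(self, word):
        # A new token gets the next free id; a known token keeps its id.
        if word not in self.token2id:
            self.id2token.append(word)
            self.token2id[word] = len(self.id2token) - 1
        return self.token2id[word]


# Hypothetical sample lines (not taken from wikitext-2).
lines = ["the cat sat", "the dog sat"]

dictionary = Dictionary()
ids = []
for line in lines:
    # Split on whitespace, then mark the end of the line with <eos>.
    for word in line.split() + ["<eos>"]:
        ids.append(dictionary.add_word(word))

print(dictionary.id2token)  # ['the', 'cat', 'sat', '<eos>', 'dog']
print(ids)                  # [0, 1, 2, 3, 0, 4, 2, 3]
```

Repeated tokens such as "the" and "<eos>" reuse their existing ids, so the vocabulary holds 5 entries while the id sequence holds 8.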

# Check the corpus
path = './wikitext'
corpus = Corpus(path)

print(corpus.train.size())
print(corpus.valid.size())
print(corpus.test.size())


Checking the total number of tokens in the preprocessed data

2) Batchify

<aside> 💡 BPTT batchify function

Takes the data as one long 1-D sequence and arranges it into batches for backpropagation through time (BPTT).
Trailing tokens that do not fit a multiple of batch_size * sequence_length are cut off, and the trimmed data is then batched for BPTT.

Arguments:
data -- tensor holding the training data, dtype: torch.long, shape: [data_length]
batch_size -- batch size
sequence_length -- length of one sample

Return:
batches -- the batched tensor, dtype: torch.long, shape: [num_sample, batch_size, sequence_length]

</aside>
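The trimming rule described above can be checked with plain integer arithmetic. This is a small sketch with hypothetical sizes (a data length of 26, batch_size 4, sequence_length 2) chosen only for illustration.

```python
# Hypothetical sizes chosen for illustration.
data_length = 26
batch_size = 4
sequence_length = 2

# One sample is a [batch_size, sequence_length] slice, so it holds 8 tokens.
tokens_per_sample = batch_size * sequence_length

# Integer division keeps only the samples that can be filled completely.
num_sample = data_length // tokens_per_sample

# Tokens that survive the trim, and the trailing tokens that are cut off.
kept = num_sample * tokens_per_sample
dropped = data_length - kept

print(num_sample, kept, dropped)  # 3 24 2
```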

def bptt_batchify(data: torch.Tensor, batch_size: int, sequence_length: int):
    ### YOUR CODE HERE   
    ### Approach 1
    # Total number of tokens to batch = num_sample * batch_size * sequence_length
    # Trim to a multiple of batch_size * sequence_length, then adjust the shape with view
    batches: torch.Tensor = None
    num_sample = len(data) // (batch_size * sequence_length)
    end_idx = num_sample * batch_size * sequence_length
    data = data[:end_idx]

    # Wrong, even though the shape matches: with this view, row i of sample k+1
    # does not continue row i of sample k.
    # batches = data.view(-1, batch_size, sequence_length)
    batches = torch.transpose(data.view(batch_size, -1, sequence_length), 0, 1)       

    ### Reference solution
    length = (data.numel() // (batch_size * sequence_length)
              * (batch_size * sequence_length))
    batches = data[:length].reshape(batch_size, -1, sequence_length).transpose(0, 1)

    return batches

[Figure: step-by-step illustration of the batchify function]

The figure above draws the batchify function step by step.

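The 1-D data of length 24 is split into batch_size=4 rows. [24] → [4, 6]
Even after batching, a row that is too long makes backpropagation through the RNN difficult, so each row is cut into chunks of sequence_length. [4, 6] → [4, 3, 2]
The RNN consumes the data along the red arrows during training, so the tensor is rearranged (transposed) to **[num_sample, batch_size, sequence_length]**. [4, 3, 2] → [3, 4, 2]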
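The three steps above can be traced on a toy tensor. The sketch below uses `torch.arange(24)` as a stand-in for real token ids (an assumption made for readability) and also shows why a direct `view(-1, batch_size, sequence_length)` gives the right shape but the wrong ordering.

```python
import torch

batch_size, sequence_length = 4, 2
data = torch.arange(24)  # stand-in for token ids 0..23

# [24] -> [4, 6]: each row is one contiguous stream of the corpus.
streams = data.view(batch_size, -1)

# [4, 6] -> [4, 3, 2]: cut every stream into chunks of sequence_length.
chunks = streams.view(batch_size, -1, sequence_length)

# [4, 3, 2] -> [3, 4, 2]: put the chunk (sample) index first.
batches = chunks.transpose(0, 1)

print(batches.shape)  # torch.Size([3, 4, 2])
print(batches[0])     # first sample: the first chunk of every stream
print(batches[1, 0])  # tensor([2, 3]), continues stream 0 after [0, 1]

# The naive view has the same shape but breaks that continuity.
naive = data.view(-1, batch_size, sequence_length)
print(naive[1, 0])    # tensor([8, 9]), does not follow naive[0, 0] = [0, 1]
```

With the transposed layout, row i of each sample picks up exactly where row i of the previous sample ended, which is the property needed if the hidden state is carried from one sample to the next.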