joblib.Memory

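With Memory, the return value of a function is saved to a directory you specify. When the function is called again with the same arguments, the previously computed output is loaded from disk instead of being recomputed.

The cache persists until the user deletes it, and it is reportedly accessible from other processes as well (needs verification). In other words, this is a way of caching a function's results on the hard disk.

Usage is very simple, as shown below: pass a function I defined to cache() and use the wrapped function it returns in place of the original. With this approach the output is stored as a pkl file.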

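from joblib import Memory
# When caching is enabled, wrap extract_feature so its results are stored under ./cache.
if cache:
    extract_feature = Memory("./cache", verbose=0).cache(extract_feature)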
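To make the caching behaviour concrete, here is a minimal sketch that counts how often the function body actually runs. The body of extract_feature, the calls list, and the temporary cache directory are assumptions made for illustration; only Memory(...) and .cache(...) come from the snippet above.

```python
import tempfile
from joblib import Memory

calls = []  # records every real execution of the function body

def extract_feature(x):
    """Hypothetical stand-in for an expensive computation."""
    calls.append(x)
    return [v * 2 for v in x]

cache_dir = tempfile.mkdtemp()  # throwaway cache directory for this demo
memory = Memory(cache_dir, verbose=0)
cached_extract_feature = memory.cache(extract_feature)

first = cached_extract_feature([1, 2, 3])   # computed, then written to disk
second = cached_extract_feature([1, 2, 3])  # same arguments: loaded from the cache
third = cached_extract_feature([4, 5, 6])   # new arguments: computed again
```

The second call returns the stored result without running the body, so calls only grows when the arguments change.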

One caveat: [when applying this to a class method, define the function to be cached as a global (module-level) function](https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html) and call it from inside the class method. The reason, as far as I understand it, is that pickle does not support class methods; the joblib documentation likewise describes Memory as designed for pure functions and recommends caching a pure function and calling it from the class.

import soundfile

# Module-level (global) function, so joblib can cache it.
# sample_wave_sequence is a resampling helper defined elsewhere in the project.
def read_wave_with_resampling(file_path, target_sample_rate):
    wave_sequence, source_sample_rate = soundfile.read(file_path)
    return sample_wave_sequence(wave_sequence, source_sample_rate, target_sample_rate)

# Inside a class method: wrap the global function with the instance's Memory, then call it.
wave_data = self.memory.cache(read_wave_with_resampling)(wave_file_path, self.desired_sample_rate)
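The same pattern can be sketched as a self-contained example. The names load_and_scale and Dataset are hypothetical stand-ins (the real code reads audio with soundfile); the point is only that the cached function lives at module level and never receives self.

```python
import tempfile
from joblib import Memory

def load_and_scale(values, factor):
    """Module-level pure function: the only thing handed to joblib."""
    return [v * factor for v in values]

class Dataset:
    def __init__(self, cache_dir, factor):
        self.factor = factor
        self.memory = Memory(cache_dir, verbose=0)
        # Wrap once at construction; the cached function itself never sees self.
        self._load = self.memory.cache(load_and_scale)

    def get(self, values):
        # Only plain, picklable arguments are passed to the cached function.
        return self._load(values, self.factor)

dataset = Dataset(tempfile.mkdtemp(), factor=3)
result = dataset.get([1, 2, 3])
```

Wrapping once in __init__ rather than on every call is a small design choice here; wrapping on each call, as in the snippet above, should behave the same because the cache lives on disk.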

That said, if the entire dataset can be stored as a numpy array, something like numpy.memmap lets you read only the entries at the indices you need, very quickly.
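A minimal sketch of the numpy.memmap alternative, assuming the preprocessed data fits a fixed shape and dtype. The file name, shape, and row index are made up for illustration.

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features.dat")
shape = (1000, 16)  # hypothetical: 1000 samples, 16 features each

# Write once: create a disk-backed array and fill it.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
writer[:] = np.arange(shape[0] * shape[1], dtype=np.float32).reshape(shape)
writer.flush()
del writer

# Read later: map the file and touch only the row that is needed.
reader = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
row = np.array(reader[42])  # copies just this one row into memory
```

The shape and dtype are not stored in the raw file, so they have to be kept somewhere alongside it.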

Once you are completely done, the cached data can be deleted with memory.clear.

A convenient way to call memory.clear is Python's try/finally construct. The finally block always runs whether or not an error occurred, so it is a handy place to release resources.

try:
    start_loop()
finally:
    memory.clear(warn=False)

https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html

https://stackoverflow.com/questions/39020217/how-to-use-joblib-memory-of-cache-the-output-of-a-member-function-of-a-python-cl

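Caveats when using joblib.Memory

Class methods cannot be cached; only global functions can.
If an instance of a user-defined class is passed as an input argument, a pickling error can occur.
If the input arguments are very numerous or complex, calls become very slow. joblib hashes the input arguments on every call to build the key under which the result is stored, and hashing large or complex objects is itself expensive.

⇒ After using joblib.Memory a lot, I found it restrictive in many ways. All three limitations above are serious, so it seems better to simply build a dictionary of the preprocessing results by hand and save it with pickle.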
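The manual approach could look something like the sketch below. The function names, the preprocessing body, and the file name are hypothetical; only the idea of a dictionary saved with pickle comes from the note above.

```python
import os
import pickle
import tempfile

def preprocess(item):
    """Hypothetical stand-in for an expensive preprocessing step."""
    return item.upper()

def load_or_build(cache_path, items):
    # Reuse the saved dictionary if it exists; otherwise build it and save it.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    results = {item: preprocess(item) for item in items}
    with open(cache_path, "wb") as f:
        pickle.dump(results, f)
    return results

cache_path = os.path.join(tempfile.mkdtemp(), "preprocessed.pkl")
built = load_or_build(cache_path, ["a.wav", "b.wav"])   # computes and saves
loaded = load_or_build(cache_path, ["a.wav", "b.wav"])  # loads from disk
```

Unlike joblib.Memory, this sketch never invalidates the file: if the item list or the preprocessing code changes, the pickle has to be deleted by hand.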