Introduction

In this tutorial, we'll explore how to use the Moondream 2 vision model with llama-cpp-python to generate text descriptions from images. We'll cover the installation of required libraries, setting up the MoondreamVision class, and using it to process images and generate text.

Step 1: Install Required Libraries

To get started, we need to install the required libraries. Run the following command in your terminal or Python environment:

Update 2: DON’T FORGET accelerate!!!!! It takes the model from somewhere on the order of ~1m install / ~45s inference to ~1.5m install / 3s inference on a Colab T4. I had under 1s inference one time when I left the llama-cpp-python instance ( the MoondreamVision class ) lying around and ran from that, so I suspect ~1.5s is a safe estimate for the runtime during sustained, efficient operation. No promises, and I’m sure there are myriad ways to make that a lot faster, so don’t trust my clumsy hands too far.

This doesn’t sound like a big deal, but the exciting thing about this model is the ability to get detailed, verbose descriptions and in-depth visual question answering, all within ~1s, all with Colab-ready technologies. Think of what you could do if any picture or screenshot were an instant repository of information, for free, forever. And it could run on your Android phone ( that one’s coming soon ;)

!pip install -U llama-cpp-python huggingface-hub accelerate

This will install llama-cpp-python, huggingface-hub, and accelerate, all of which are required for this tutorial.
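
If you want to sanity-check the install before moving on, a quick version print is enough ( a minimal check; the exact version you see will differ ):

import llama_cpp
print(llama_cpp.__version__)  # any recent release should work for this tutorial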

Step 2: Import Libraries and Define the Class Structure

Next, we'll import the required libraries and define the skeleton of the MoondreamVision class:

import os
import requests
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

class MoondreamVision:
    def __init__(self, **kwargs):
        pass

    def __call__(self, **kwargs):
        pass

Step 3: Class Definition

Step 3.1: __init__ ( Python class constructor )

So, several things are happening here, nothing that should be too crazy ( a sample instantiation follows the code below ):

  1. We store our constructor arguments.
  2. We initialize our chat handler from its repo on the Hub.
  3. We do the same for our language/vision model ( Yes, both. )

def __init__(self, path: str, text_model: str, mmproj: str):
    self.path = path
    self.text_model = text_model
    self.mmproj = mmproj
    # Initialize the chat handler ( the vision half ) from its repo on the Hub.
    self.chat_handler = MoondreamChatHandler.from_pretrained(
        repo_id=self.path,
        filename=self.mmproj,
    )
    # Initialize our LLM instance with our chat handler.
    self.llm = Llama.from_pretrained(
        repo_id=self.path,
        filename=self.text_model,
        chat_handler=self.chat_handler,  # Trust me, you don't want to write one for a mere example.
        n_ctx=32768,  # n_ctx should be maxed out to accommodate the image embedding.
    )
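
With the constructor done, instantiation might look like this. A minimal sketch, assuming the vikhyatk/moondream2 GGUF repo layout used in the llama-cpp-python multimodal docs, where glob patterns pick out the mmproj and text-model files; swap in whatever repo and filenames you're actually using.

# Hypothetical instantiation; repo_id and the glob filenames are assumptions
# borrowed from the llama-cpp-python docs, not requirements of our class.
moondream = MoondreamVision(
    path="vikhyatk/moondream2",
    text_model="*text-model*",
    mmproj="*mmproj*",
)

The first run downloads the GGUF files from the Hub, which is where the install/load times from the update above come from; subsequent runs hit the local cache.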

Step 3.2: __call__ ( Python instance-call operator )

For this one, the prose would just be the tutorial description verbatim; so it’s a pure code explanation! Wheee!!!
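
Here's a minimal sketch of __call__, built on llama-cpp-python's OpenAI-style multimodal chat format ( create_chat_completion with an image_url content part, which is what the Moondream chat handler expects ). The image_url and question parameter names, and the default prompt, are my own choices:

def __call__(self, image_url: str, question: str = "Describe this image."):
    # Send the image and the question together as one user message in the
    # OpenAI-style multimodal format the chat handler understands.
    response = self.llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ]
    )
    # Return just the generated text from the first completion choice.
    return response["choices"][0]["message"]["content"]

Usage is then a one-liner, and reusing the same instance across calls is exactly the "leave it lying around" trick from the update at the top, since the model only loads once:

print(moondream("https://example.com/some-image.png", "What do you see?"))

Per the llama-cpp-python docs, local images can also be passed as base64-encoded data URIs in place of an http(s) URL.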