Private Vision AI: Run Reka Edge Entirely on Your Machine

Reka just released Reka Edge, a compact but powerful vision-language model that runs entirely on your own machine. No API keys, no cloud, no data leaving your computer. I work at Reka and putting together this tutorial was genuinely fun; I hope you enjoy running it as much as I did.

[Originally published at dev.to/reka]

In three steps, you'll go from zero to asking an AI what's in any image or video.

What You'll Need

  • A machine with enough RAM to run a 7B parameter model (~16 GB recommended)
  • Git
  • uv, a fast Python package manager. Install it with:
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    This works on macOS, Linux, and Windows (WSL). If you're on Windows without WSL, grab the Windows installer instead.
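
For a rough sense of where the ~16 GB recommendation comes from, here is a back-of-the-envelope estimate. It assumes the weights are held in a 16-bit format (2 bytes per parameter), which is my assumption and not something the model card is quoted on here; other precisions scale the number proportionally.

```python
# Back-of-the-envelope memory estimate for model weights.
# Assumption: 16-bit weights (2 bytes per parameter); not confirmed for Reka Edge.

def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    size = weights_gib(7e9)
    print(f"7B parameters at 2 bytes each: about {size:.1f} GiB")
```

Under that assumption the weights alone take about 13 GiB, which leaves a little headroom within 16 GB for the operating system and the model's working memory.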

Step 1: Get the Model and Inference Code

Clone the Reka Edge repository from Hugging Face. This includes both the model weights and the inference code:

git clone https://huggingface.co/RekaAI/reka-edge-2603
cd reka-edge-2603

Step 2: Fetch the Large Files

Hugging Face stores large files (model weights and media samples) using Git LFS. If Git LFS was not already set up when you cloned, these files exist on disk only as small pointer files, not the actual content.

First, make sure Git LFS is installed. The command varies by platform:

# macOS
brew install git-lfs

# Linux / WSL (Ubuntu/Debian)
sudo apt install git-lfs

Then initialize it:

git lfs install

Then pull all large files, including model weights and media samples:

git lfs pull

Grab a coffee while it downloads; the model weights are several GB.
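
If you want to confirm the pull actually replaced the pointers, a small check like the one below can help. It relies on the fact that a Git LFS pointer file starts with the line "version https://git-lfs.github.com/spec/v1"; the helper names (is_lfs_pointer, find_pointers) are mine and are not part of the Reka repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is still a Git LFS pointer, not real content."""
    with path.open("rb") as f:
        head = f.read(len(LFS_SIGNATURE))
    return head == LFS_SIGNATURE


def find_pointers(root: Path) -> list[Path]:
    """List files under root (skipping .git) that are still LFS pointers."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    leftovers = find_pointers(Path("."))
    if leftovers:
        print("Still pointers (try `git lfs pull` again):")
        for p in leftovers:
            print(" ", p)
    else:
        print("No LFS pointer files found.")
```

Run it from inside the cloned folder. An empty result suggests the weights and media samples were downloaded in full.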


Step 3: Ask the Model About an Image or Video

To analyze an image, use the sample included in the media/ folder:

uv run example.py \
  --image ./media/hamburger.jpg \
  --prompt "What is in this image?"

Or pass a video with --video:

uv run example.py \
  --video ./media/many_penguins.mp4 \
  --prompt "What is in this?"

The model will load, process your input, and print a description, all locally, all private.

Try different prompts to unlock more:

  • "Describe this scene in detail."
  • "What text is visible in this image?"
  • "Is there anything unusual or unexpected here?"

What's Actually Happening? 

You don't need this to use the model, but if you're anything like me and can't help wondering what's going on under the hood, here's the magic behind example.py:

1. It picks the best hardware available. The script checks whether your machine has a GPU (CUDA for NVIDIA, or Metal for Apple Silicon, which PyTorch calls "mps") and uses it automatically. If neither is available, it falls back to the CPU. This affects speed, not quality.

if torch.cuda.is_available():
    device = torch.device("cuda")
elif mps_ok:
    device = torch.device("mps")
else:
    device = torch.device("cpu")
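
The excerpt above uses mps_ok without showing where it comes from. Here is a minimal, self-contained sketch of the same priority order. It assumes mps_ok is derived from torch.backends.mps.is_available(); example.py may compute it differently, and the function names are hypothetical.

```python
def pick_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Mirror the priority order above: CUDA first, then MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect() -> str:
    """Probe PyTorch if it is installed; otherwise report CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    cuda_ok = torch.cuda.is_available()
    # Assumption: the MPS check is torch.backends.mps.is_available();
    # older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
    return pick_device_name(cuda_ok, mps_ok)


if __name__ == "__main__":
    print("Selected device:", detect())
```

Keeping the decision in a small pure function makes the fallback order easy to check without any GPU present.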

2. It loads the model into memory. The 7-billion-parameter model is read from the folder you cloned. This is the "weights": billions of numbers that encode everything the model has learned. Loading takes around 30 seconds, though that varies with your hardware.

processor = AutoProcessor.from_pretrained(args.model, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(args.model, ...).eval()

3. It packages your input into a structured message. Your image (or video) and your text prompt are wrapped together into a conversation-style format, the same way a chat message works, except one part is visual instead of text.

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": args.image},
        {"type": "text", "text": args.prompt},
    ],
}]
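
The excerpt shows the image case only. A sketch of a helper that handles either kind of input might look like this. The video entry shape ({"type": "video", "video": ...}) follows the usual Hugging Face chat-template convention and is an assumption about what example.py does; build_messages is a hypothetical name.

```python
def build_messages(prompt, image=None, video=None):
    """Wrap one media file and a text prompt into a single user message."""
    if (image is None) == (video is None):
        raise ValueError("pass exactly one of image or video")
    if image is not None:
        media = {"type": "image", "image": image}
    else:
        # Assumption: video entries mirror the image entry's shape.
        media = {"type": "video", "video": video}
    return [{
        "role": "user",
        "content": [media, {"type": "text", "text": prompt}],
    }]


if __name__ == "__main__":
    print(build_messages("What is in this image?", image="./media/hamburger.jpg"))
```

Placing the media entry before the text mirrors the ordering in the excerpt above.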

4. It converts everything into numbers. The processor translates your image into a grid of numerical patches and your prompt into tokens (small chunks of text, each mapped to a number). The model only understands numbers, so this step bridges the gap.

inputs = processor.apply_chat_template(
    messages, tokenize=True, return_tensors="pt", return_dict=True
)

5. The model generates a response, token by token. Starting from your input, the model predicts the most likely next token, then the next, for up to 256 new tokens. It stops earlier if it produces an end-of-response marker.

output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
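
To make the loop behind that one line concrete, here is a toy version of greedy generation. It is a sketch, not the library's implementation: toy_next_token is a hypothetical stand-in for the model, and token id 0 plays the role of the end-of-response marker.

```python
def greedy_generate(next_token, prompt_ids, max_new_tokens, eos_id):
    """Append one predicted token at a time until EOS or the token budget."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        ids.append(tok)
        if tok == eos_id:
            break  # the "natural end" of the response
    return ids


def toy_next_token(ids):
    """Hypothetical stand-in for the model: count up, then emit EOS (0)."""
    last = ids[-1]
    return 0 if last >= 5 else last + 1


if __name__ == "__main__":
    print(greedy_generate(toy_next_token, [1, 2], max_new_tokens=256, eos_id=0))
```

With do_sample=False the real model likewise takes the single highest-scoring token at each step, so the same input should give the same answer each time.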

6. It converts the numbers back into text and prints it. The token IDs are decoded back into human-readable words and printed to your terminal. No internet involved at any point.

output_text = processor.tokenizer.decode(new_tokens, skip_special_tokens=True)
print(output_text)
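
The excerpt uses new_tokens without showing where it comes from. A common pattern is to slice the prompt off the front of the generated sequence, since generate typically returns the prompt tokens followed by the new ones; with real tensors that would look something like output_ids[0, inputs["input_ids"].shape[1]:]. Whether example.py does exactly this is an assumption. The sketch below shows the idea with plain lists and a made-up vocabulary.

```python
# Toy vocabulary, invented for illustration only.
TOY_VOCAB = {0: "<eos>", 1: "What", 2: "is", 3: "this", 4: "?",
             5: "A", 6: "hamburger", 7: "."}
SPECIAL_IDS = {0}


def new_tokens_only(output_ids, prompt_len):
    """Drop the prompt tokens, keeping only what the model generated."""
    return output_ids[prompt_len:]


def toy_decode(ids, skip_special_tokens=True):
    """Map ids back to words, optionally hiding special markers."""
    words = [TOY_VOCAB[i] for i in ids
             if not (skip_special_tokens and i in SPECIAL_IDS)]
    return " ".join(words)


if __name__ == "__main__":
    prompt_ids = [1, 2, 3, 4]
    output_ids = prompt_ids + [5, 6, 7, 0]
    new_tokens = new_tokens_only(output_ids, len(prompt_ids))
    print(toy_decode(new_tokens))
```

Without the slice, the printed answer would start by repeating your own prompt.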


Here's the Video

If you prefer watching to reading, here is the video version:

 

That's Pretty Cool, Right?

A single script. No API key. No cloud. You just ran a 7-billion-parameter vision-language model entirely on your own machine, and it works whether you're on macOS, Linux, or Windows with WSL, which is what I was using when I wrote this.

This works great as a one-off script: drop in a file, ask a question, get an answer. But what if you wanted to build something on top of it? A web app, a tool that watches a folder, or anything that needs to talk to the model repeatedly?

That's exactly what the next post is about. I'll show you how to wrap Edge as a local API, so instead of running a script, you have a service running on your machine that any app can plug into. Same model, same privacy, but now it's a proper building block.


~frank 

Reading Notes #689

Another week, another batch of interesting reads. This edition covers AI video experiments, extending coding agents with .NET skills, open source contributions, and a few podcast episodes worth adding to your queue.


AI

Programming

Open Source

Podcasts

Sharing my Reading Notes is a habit I started a long time ago, where I share a list of all the articles, blog posts, podcasts and books that catch my interest during the week.

If you have interesting content, share it!

~frank


Reading Notes #688

I'm always on the lookout for innovative ideas to streamline my development workflow. This week, I stumbled upon some fascinating reads, among them an article about building an AI-powered pull request agent using the GitHub Copilot SDK, and another demonstrating the secure use of OpenClaw in Docker sandboxes.


AI

Programming

~frank