How I Went Local

As of today, I've moved some of my AI usage onto my own machine, yes, locally. This wasn't an attempt to be on the bleeding edge of technology; I'm notoriously a late adopter. But last week I hit an extreme pain point.

While at work, in my usual workflow, I looked at a feature and identified what it needed. I put it into Claude Code (Plan Mode), refined the plan it produced, and told it to implement. Then I repeated that once more for another task, and I was at 80% usage. Two prompts. Now, I won't start complaining about how my $20 should get me more, but fast forward to this week: Claude gave $20 in credit to me and other Pro users, and also started allowing the use of their APIs by third-party tools. A great apology for a terrible week, but by then I was well underway.

I had heard in passing about Gemma 4, and after my frustration I thought: let's see how feasible it is. I have an RTX 4050, so I should be able to handle it. Not comfortably, but handle it.

 

The Setup

Now, as someone in web development, I love Docker. I hate managing multiple dependencies and shifting environments, so Docker is my set-and-forget solution.

1. Create a Docker Compose file

I created a docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - ai-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "4077:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=mykey
    volumes:
      - open-webui_data:/app/backend/data
    restart: unless-stopped
    depends_on:
      - ollama
    networks:
      - ai-network

networks:
  ai-network:

volumes:
  ollama_data:
  open-webui_data:


So we're using Ollama to handle model selection, loading, and serving. Apparently it's the gold standard (according to Gemini). We pass my graphics card through to the container and create a network called ai-network.

Open WebUI is exposed on port 4077, which maps to the container's port 8080, and it doesn't start unless Ollama is running. We also have port 11434 open; this is for VS Code, which I'll get to later.
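Once the stack is up, it's worth sanity-checking that the GPU actually made it into the container and that the API is reachable on the host port. A quick sketch, assuming the services defined above are already running:

```shell
# Check that the NVIDIA GPU is visible inside the Ollama container
docker exec -it ollama nvidia-smi

# Confirm the Ollama API answers on the host-mapped port
curl http://localhost:11434/api/version
```

If nvidia-smi fails here, the NVIDIA Container Toolkit probably isn't installed on the host; the deploy.resources block in the compose file depends on it.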

2. Get the models

docker compose up -d

# Pull the 'Alfred' model pack inside the container
docker exec -it ollama ollama pull gemma4:e4b
docker exec -it ollama ollama pull qwen3.5:latest
docker exec -it ollama ollama pull deepseek-r1:8b
docker exec -it ollama ollama pull qwen2.5-coder:1.5b
docker exec -it ollama ollama pull nomic-embed-text

I ran these commands to download the models I'll be using. Gemma makes sense because, with quantization, it can fit in my VRAM; the others are for different uses: DeepSeek for complex reasoning and Qwen for coding.

This was pretty straightforward
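To check that a model actually loads and responds, you can hit Ollama's generate endpoint directly. A minimal sketch, assuming the gemma4:e4b tag pulled above (swap in any model you pulled):

```shell
# One-off, non-streaming prompt against the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "Say hello in five words.",
  "stream": false
}'
```

The first request will be slow while the model loads into VRAM; subsequent ones are much faster.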

To move away from generic "AI Assistant" responses, we created a permanent personality chip. This ensures the models remember they are Alfred, an executive assistant, and not just a Google or Alibaba model.

---
applyTo: "**/*"
---
# IDENTITY
Your name is Alfred. You are Boitu's sophisticated executive assistant.
- Never identify as a Google AI.
- No "yapping": provide code and logic instantly without filler.
- Explain technical concepts in great detail when asked.

 

And that's what it looks like.
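If you'd rather bake the personality into the model itself, so it applies everywhere and not just in the editor, Ollama supports a Modelfile with a SYSTEM prompt. A sketch, reusing the identity above; the alfred name and file path are my own choices, and gemma4:e4b is the tag pulled earlier:

```shell
# Write a Modelfile (hypothetical) carrying the Alfred system prompt
cat > Modelfile <<'EOF'
FROM gemma4:e4b
SYSTEM """
Your name is Alfred. You are Boitu's sophisticated executive assistant.
Never identify as a Google AI. No yapping: provide code and logic
instantly without filler.
"""
EOF

# Copy it into the container and build the custom model
docker cp Modelfile ollama:/tmp/Modelfile
docker exec -it ollama ollama create alfred -f /tmp/Modelfile
```

After that, `ollama run alfred` (or selecting "alfred" in Open WebUI) answers in character without any per-chat setup.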
