Tech Log — Browser Image Agent project
Tech Log Entry — Browser Image Agent: An AI-Powered Image Tagger and Saver
Category: AI / Computer Vision / Browser Automation / Local LLM / Python
Initial Goal
- I wanted a tool to automatically save images from open browser tabs to a local folder, with descriptive, search-friendly filenames and tags generated by an AI vision model — replacing manual save-and-rename workflows.
- All processing to be fully local: no cloud APIs, no subscriptions, no data leaving the machine.
- The tool should use only free open-source software, run on my Windows 11 desktop, and leverage my RTX 3060 GPU for local AI inference.
- Two operating modes: fully automatic (save all qualifying images without interruption) and human-in-the-loop review (confirm each image before saving).
Hardware Used
- Desktop: Windows 11 Home, NVIDIA GeForce RTX 3060 (12 GB VRAM)
- No additional hardware required — everything runs locally on the desktop
Software Stack
- Python 3.12
- Ollama 0.17.7 (local AI runtime)
- Vision model: llama3.2-vision:11b (11B-parameter multimodal model, ~7.9 GB)
- Playwright 1.58 (browser automation / Chrome CDP connection)
- Pillow (image processing and JPEG re-encoding)
- piexif (EXIF metadata writing)
- tkinter (GUI dialogs, stdlib)
- pytest (testing framework)
LLM Used (for development)
- Claude Sonnet 4.6 [Pro plan] (this chat session — requirements, architecture, all code, debugging)
Architecture Overview
The agent is structured as a set of independent modules wired together by the central main.py orchestrator:
- config.py — loads and validates config.toml
- ollama_client.py — all Ollama HTTP API calls and prompt templates
- browser.py — Playwright connector: tab enumeration, image extraction, screenshot fallback
- file_writer.py — filename construction, collision resolution, EXIF metadata writing
- session.py — run logging, tab outcome tracking, descriptor frequency map
- agent/gui/ — four tkinter dialogs: launch, progress, HITL confirm, summary
- main.py — orchestrator wiring all modules together
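The collision-resolution step in file_writer.py is mentioned but not shown. A minimal sketch of how it might work (the function name resolve_collision is my own, not the project's):

```python
from pathlib import Path

def resolve_collision(folder: Path, stem: str, ext: str = ".jpg") -> Path:
    """Return a path in `folder` that does not exist yet, appending
    _2, _3, ... to the stem on collision. Hypothetical helper, not
    the project's actual implementation."""
    candidate = folder / f"{stem}{ext}"
    counter = 2
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{ext}"
        counter += 1
    return candidate
```

Since two different images can legitimately produce the same five descriptors, some scheme like this is needed so a second blonde_athletic_woman_standing_young.jpg does not overwrite the first.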
How It Works
- User launches Chrome via setup_chrome.bat (opens Chrome with --remote-debugging-port=9222 --user-data-dir="C:\ChromeDebug")
- User opens image tabs in that Chrome window, then runs python main.py
- Agent verifies Ollama is reachable and the vision model is available
- Launch dialog appears — user selects Automatic or Review mode and confirms the output folder
- Playwright attaches to Chrome via CDP and enumerates all open tabs across all windows
- For each tab:
  - Direct image URL tabs (.jpg, .png, etc.) qualify automatically
  - Other tabs are DOM-scanned for a dominant <img> element ≥ 500 px
  - Image is downloaded directly from its source URL (screenshot fallback if blocked)
  - Vision model call 1: "Does this image contain a person? YES or NO"
  - Vision model call 2: "Describe the person with exactly 5 comma-separated descriptor terms"
  - Terms are parsed, sanitized, and joined into a filename (e.g. blonde_athletic_woman_standing_young.jpg)
- In Automatic mode: image is saved immediately
- In Review mode: confirmation dialog shows a thumbnail and an editable filename; user clicks Save, Skip, or Cancel
- All saved images get fresh EXIF metadata: current date/time and the 5 terms written as Windows-visible tags
- Summary dialog shows counts, duration, an output-folder link, and the most-used descriptor terms
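The two vision calls go through Ollama's /api/generate endpoint, which accepts base64-encoded images alongside the prompt. A sketch of how the requests might be built (the exact prompts and helper names are assumptions, not the project's code):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port
MODEL = "llama3.2-vision:11b"

def build_payload(prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate request with one attached image."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask(prompt: str, image_bytes: bytes) -> str:
    """POST the payload and return the model's text response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"].strip()

# Call 1 gates on person presence; Call 2 runs only if Call 1 says YES:
#   ask("Does this image contain a person? Answer YES or NO.", img)
#   ask("Describe the person with exactly 5 comma-separated descriptor terms.", img)
```

Keeping the gate question and the descriptor question as two separate calls makes each reply trivially parseable, at the cost of a second inference pass.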
Development Process
- Built iteratively in 8 steps, testing at every stage before proceeding:
  1. config.py + tests
  2. ollama_client.py + mocked tests + live Ollama tests
  3. browser.py + mocked tests + live Chrome tests
  4. file_writer.py + tests
  5. session.py + tests
  6. GUI modules (4 dialogs) + headless tests
  7. main.py (orchestrator)
  8. Live end-to-end runs with real image tabs
- Final test count: 197 passing tests across all modules
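The log doesn't show what a "mocked test" for ollama_client.py looks like. One plausible shape, patching urllib so no live Ollama is needed (query_model here is a simplified stand-in, not the real module's function):

```python
import io
import json
from unittest import mock
import urllib.request

def query_model(prompt: str) -> str:
    """Minimal stand-in for an ollama_client call (hypothetical; the
    real module also attaches an image and reads config.toml)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.2-vision:11b",
                         "prompt": prompt,
                         "stream": False}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def test_query_model_parses_response():
    """Mocked test: urlopen is patched, so it runs without Ollama."""
    fake = io.BytesIO(json.dumps({"response": "YES"}).encode("utf-8"))
    with mock.patch("urllib.request.urlopen") as fake_open:
        fake_open.return_value.__enter__.return_value = fake
        assert query_model("Does this image contain a person?") == "YES"
```

Tests of this style run fast under pytest and keep the 197-test suite independent of whether the model is loaded.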
Investigation and Bugs Fixed During Live Testing
- Chrome debug port not opening — Chrome hands off new instances to an existing process on Windows; fixed by using --user-data-dir="C:\ChromeDebug" to force an independent instance
- Trailing punctuation on model terms — model returned professional. with a trailing period; fixed by stripping trailing non-word characters before validation
- Multi-word phrases in model terms — model returned short hair and with hands in pockets; fixed by collapsing spaces within terms to hyphens (short-hair, with-hands-in-pockets)
- First-image thumbnail not displaying — PhotoImage objects were being garbage collected when multiple tk.Tk() instances were created and destroyed; fixed by switching confirm dialogs to tk.Toplevel with a single persistent hidden tk.Tk root kept alive for the entire session
- Original EXIF metadata preserved on download — one stock image had a 2018 creation date embedded; fixed by always re-encoding images with piexif, stripping old EXIF and writing fresh date/time and keyword tags
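The two term fixes above (trailing punctuation, multi-word phrases) compose naturally into one sanitizer. A sketch under those rules (function names are mine, and lowercasing is my assumption based on the example filenames):

```python
import re

def sanitize_term(raw: str) -> str:
    """Normalize one model descriptor: trim whitespace, lowercase,
    strip trailing non-word characters ('professional.' -> 'professional'),
    and collapse internal spaces to hyphens ('short hair' -> 'short-hair')."""
    term = raw.strip().lower()
    term = re.sub(r"\W+$", "", term)   # drop trailing punctuation
    term = re.sub(r"\s+", "-", term)   # spaces -> hyphens
    return term

def build_filename(terms: list[str], ext: str = ".jpg") -> str:
    """Join sanitized terms with underscores into the final filename."""
    return "_".join(sanitize_term(t) for t in terms) + ext
```

For example, build_filename(["Blonde", "athletic", "woman", "standing", "young"]) yields the blonde_athletic_woman_standing_young.jpg pattern shown earlier.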
Final Results
- 9/9 images processed correctly on the final test run, 0 errors
- Correct descriptive filenames generated for all images
- All saved files show today's date/time in Windows Explorer
- Windows Explorer Tags column shows the 5 descriptor terms for each file
- Both Automatic and Review modes are fully working
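Windows Explorer reads the Tags column from the XPKeywords EXIF field, which stores a semicolon-separated, UTF-16LE-encoded string. A sketch of preparing that value (the NUL terminator and the piexif call in the comment are my assumptions about a conventional write, not the project's exact code):

```python
def encode_xp_keywords(terms: list[str]) -> tuple[int, ...]:
    """Encode descriptor terms for the Windows XPKeywords EXIF field:
    semicolon-separated, UTF-16LE, NUL-terminated, returned as a tuple
    of byte values (the representation piexif uses for BYTE-typed tags)."""
    joined = ";".join(terms) + "\x00"
    return tuple(joined.encode("utf-16-le"))

# Writing it with piexif might look like (assumption, needs a real image):
#   import piexif
#   exif = {"0th": {piexif.ImageIFD.XPKeywords: encode_xp_keywords(terms)}}
#   img.save(path, "jpeg", exif=piexif.dump(exif))
```

Because the agent re-encodes every image and writes this field fresh, the five descriptor terms appear directly in Explorer's Tags column.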
Daily Use
- Open a command prompt, run ollama serve (leave open)
- Open a second prompt, run ollama run llama3.2-vision:11b (warms the model into VRAM, leave open)
- Run setup_chrome.bat (opens the debug Chrome session)
- Open image tabs in that Chrome window
- Run python main.py from the project folder
Watch Out For (Future)
- config.toml paths must use forward slashes (/) or double backslashes (\\) — single backslashes cause a TOML parse error
- Chrome must be launched via setup_chrome.bat (or the manual command with --user-data-dir), not as a normal Chrome window — otherwise port 9222 will not be open
- The vision model must be warmed into VRAM before running (ollama run llama3.2-vision:11b) — otherwise /api/generate calls will time out
- The model tends to tag walking people as "standing" — a known llama3.2-vision:11b pose-recognition tendency, not a bug
- If a term contains spaces (e.g. short hair), the agent converts it to a hyphenated term (short-hair) automatically — this is expected behavior
Inquiries for TO-DO items:
1. Can we enable the model to recognize that there are multiple people in the image? For instance, if 4 people are together, the model could output "4-people" in the filename and tags; if they are all women, "4-women"; if they are all men, "4-men". Is this possible?
2. Is there any workaround for the model's tendency to see walkers as standers? If not, we should remove that category from the filename and tags so that we aren't getting false filename elements and/or tags; the script won't nudge the model to determine it.
3. What other descriptors would be useful? "Stand" vs "sit" (leaving aside "walking" for now); "dark-skin" vs "light-skin"; hair colors "blonde", "brunette", "redhead", and "black-hair"; facial expressions like "smile", "neutral-expression", or "frown"; also "glasses" and/or "hat" if the subject is wearing eyeglasses and/or a hat.
4. Continuing #3: regarding the image being color or black-and-white — can the model identify black-and-white images and put "bw" in the filename and tags?
5. Is there a solution to the problem of needing to open a debug Chrome session via setup_chrome.bat? Can this browser image agent work with regular non-debug Chrome tabs and incognito tabs? If not, would another browser work correctly: Firefox, Safari, Edge, Brave, Tor, LibreWolf, Opera, or another?
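On the black-and-white question above: the "bw" tag may not need the vision model at all, since Pillow (already in the stack) can detect grayscale content deterministically. A sketch, with an arbitrary sampling grid and tolerance for JPEG noise (both numbers are my guesses, not tuned values):

```python
from PIL import Image

def is_black_and_white(img: Image.Image, tolerance: int = 8,
                       sample_step: int = 10) -> bool:
    """Heuristic grayscale check: sample pixels on a grid and require
    R, G and B to be nearly equal at every sampled pixel. `tolerance`
    absorbs JPEG compression noise; both parameters are assumptions."""
    rgb = img.convert("RGB")
    w, h = rgb.size
    for x in range(0, w, sample_step):
        for y in range(0, h, sample_step):
            r, g, b = rgb.getpixel((x, y))
            if max(r, g, b) - min(r, g, b) > tolerance:
                return False
    return True
```

If this returns True, the agent could simply append bw to the descriptor list before filename construction, avoiding another model call entirely.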