

Tech Log Entry — Browser Image Agent: An AI-Powered Image Tagger and Saver

Category: AI / Computer Vision / Browser Automation / Local LLM / Python


Initial Goal

  • I wanted a tool to automatically save images from open browser tabs to a local folder, with descriptive, search-friendly filenames and tags generated by an AI vision model — replacing manual save-and-rename workflows.
  • All processing to be fully local: no cloud APIs, no subscriptions, no data leaving the machine.
  • The tool should use only free open-source software, run on my Windows 11 desktop, and leverage my RTX 3060 GPU for local AI inference.
  • Two operating modes: fully automatic (save all qualifying images without interruption) and human-in-the-loop review (confirm each image before saving).

Hardware Used

  • Desktop: Windows 11 Home, NVIDIA GeForce RTX 3060 (12 GB VRAM)
  • No additional hardware required — everything runs locally on the desktop

Software Stack

  • Python 3.12
  • Ollama 0.17.7 (local AI runtime)
  • Vision model: llama3.2-vision:11b (11B parameter multimodal model, ~7.9 GB)
  • Playwright 1.58 (browser automation / Chrome CDP connection)
  • Pillow (image processing and JPEG re-encoding)
  • piexif (EXIF metadata writing)
  • tkinter (GUI dialogs, stdlib)
  • pytest (testing framework)

LLM Used (for development)

  • Claude Sonnet 4.6 [Pro plan] (this chat session — requirements, architecture, all code, debugging)

Architecture Overview

The agent is built as a set of independent modules wired together by the central main.py orchestrator:

  • config.py — loads and validates config.toml
  • ollama_client.py — all Ollama HTTP API calls and prompt templates
  • browser.py — Playwright connector: tab enumeration, image extraction, screenshot fallback
  • file_writer.py — filename construction, collision resolution, EXIF metadata writing
  • session.py — run logging, tab outcome tracking, descriptor frequency map
  • agent/gui/ — four tkinter dialogs: launch, progress, HITL confirm, summary
  • main.py — orchestrator wiring all modules together

How It Works

  1. User launches Chrome via setup_chrome.bat (opens Chrome with --remote-debugging-port=9222 --user-data-dir="C:\ChromeDebug")
  2. User opens image tabs in that Chrome window, then runs python main.py
  3. Agent verifies Ollama is reachable and the vision model is available
  4. Launch dialog appears — user selects Automatic or Review mode and confirms output folder
  5. Playwright attaches to Chrome via CDP and enumerates all open tabs across all windows
  6. For each tab:
    • Direct image URL tabs (.jpg, .png, etc.) qualify automatically
    • Other tabs are DOM-scanned for a dominant <img> element ≥ 500px
    • Image is downloaded directly from source URL (screenshot fallback if blocked)
    • Vision model Call 1: Does this image contain a person? YES or NO
    • Vision model Call 2: Describe the person with exactly 5 comma-separated descriptor terms
    • Terms are parsed, sanitized, and joined into a filename (e.g. blonde_athletic_woman_standing_young.jpg)
  7. In Automatic mode: image is saved immediately
  8. In Review mode: confirmation dialog shows thumbnail + editable filename; user clicks Save, Skip, or Cancel
  9. All saved images get fresh EXIF metadata: current date/time and the 5 terms written as Windows-visible tags
  10. Summary dialog shows counts, duration, output folder link, and most-used descriptor terms

Development Process

  • Built iteratively in 8 steps, testing at every stage before proceeding:
    1. config.py + tests
    2. ollama_client.py + mocked tests + live Ollama tests
    3. browser.py + mocked tests + live Chrome tests
    4. file_writer.py + tests
    5. session.py + tests
    6. GUI modules (4 dialogs) + headless tests
    7. main.py (orchestrator)
    8. Live end-to-end runs with real image tabs
  • Final test count: 197 passing tests across all modules

Investigation and Bugs Fixed During Live Testing

  • Chrome debug port not opening — Chrome hands off new instances to existing processes on Windows; fixed by using --user-data-dir="C:\ChromeDebug" to force an independent instance
  • Trailing punctuation on model terms — model returned professional. with a period; fixed by stripping trailing non-word characters before validation
  • Multi-word phrases in model terms — model returned short hair and with hands in pockets; fixed by collapsing spaces within terms to hyphens (short-hair, with-hands-in-pockets)
  • First-image thumbnail not displaying — PhotoImage objects were being garbage collected when multiple tk.Tk() instances were created and destroyed; fixed by switching confirm dialogs to tk.Toplevel with a single persistent hidden tk.Tk root kept alive for the entire session
  • Original EXIF metadata preserved on download — one stock image had a 2018 creation date embedded; fixed by always re-encoding images with piexif, stripping old EXIF and writing fresh date/time and keyword tags
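Taken together, the two term-related fixes amount to a small sanitizer. Roughly (function names are mine, not the project's):

```python
import re

def sanitize_term(term: str) -> str:
    t = term.strip().lower()
    t = re.sub(r"\W+$", "", t)   # strip trailing punctuation: "professional." -> "professional"
    t = re.sub(r"\s+", "-", t)   # hyphenate multi-word phrases: "short hair" -> "short-hair"
    return t

def build_filename(terms, ext=".jpg"):
    # Five sanitized terms joined by underscores, as in the saved filenames
    return "_".join(sanitize_term(t) for t in terms) + ext
```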

Final Results

  • 9/9 images processed correctly on the final test run, 0 errors
  • Correct descriptive filenames are generated for all images
  • All saved files show today's date/time in Windows Explorer
  • Windows Explorer Tags column shows the 5 descriptor terms for each file
  • Both Automatic and Review modes are fully working

Daily Use

  • Open a command prompt, run ollama serve (leave open)
  • Open a second prompt, run ollama run llama3.2-vision:11b (warms model into VRAM, leave open)
  • Run setup_chrome.bat (opens debug Chrome session)
  • Open image tabs in that Chrome window
  • Run python main.py from the project folder
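The preflight that main.py performs (step 3 of How It Works) can be approximated with a single call to Ollama's /api/tags endpoint, which lists locally pulled models. A sketch; the real check may differ:

```python
import json
import urllib.request

def ollama_ready(model: str = "llama3.2-vision:11b",
                 base: str = "http://localhost:11434") -> bool:
    # /api/tags lists locally available models; an unreachable server raises
    try:
        with urllib.request.urlopen(base + "/api/tags", timeout=5) as resp:
            tags = json.loads(resp.read())
    except OSError:
        return False
    return any(m.get("name") == model for m in tags.get("models", []))
```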

Watch Out For (Future)

  • config.toml paths must use forward slashes (/) or double backslashes (\\) — single backslashes cause a TOML parse error
  • Chrome must be launched via setup_chrome.bat (or the manual command with --user-data-dir), not as a normal Chrome window — otherwise port 9222 will not be open
  • The vision model must be warmed into VRAM before running (ollama run llama3.2-vision:11b) — otherwise /api/generate calls will time out
  • Model tends to tag walking people as "standing" — a known llama3.2-vision:11b pose-recognition tendency, not a bug
  • If a term contains spaces (e.g. short hair), the agent converts it to a hyphenated term (short-hair) automatically — this is expected behavior




Inquiries for TO-DO items:

  1. Can we enable the model to recognize that there are multiple people in the image? For instance, if 4 people are together, the model can output "4-people" in the file name and tags. If they are all women, "4-women"; if they are all men, "4-men". Is this possible?

  2. Is there any workaround for the model's tendency to label walking people as standing? If not, we should remove that category from the file name and tags, and the script should stop nudging the model to determine pose at all, so we don't get false file name elements and/or tags.

  3. What other descriptors would be useful? "Stand" vs "sit" [leaving aside "walking" for now]; "dark-skin" vs "light-skin"; "blonde", "brunette", "redhead" are hair colors, also "black-hair"; facial expressions like "smile" or "neutral-expression" or "frown"; also "glasses" and/or "hat" if the subject is wearing eyeglasses and/or a hat.

  4. Continuing #3: can the model identify whether an image is black-and-white and, if so, add "bw" to the file name and tags?

  5. Is there a solution to the problem of needing to open a debug Chrome session via setup_chrome.bat? Is there a way this browser image agent can work with regular non-debug Chrome tabs and incognito tabs? If not, will another browser work correctly for this? Firefox, Safari, Edge, Brave, Tor, LibreWolf, Opera, or other?

