Tech Log — Browser Image Agent project
Tech Log Entry — Browser Image Agent: An AI-Powered Image Tagger and Saver
Category: AI / Computer Vision / Browser Automation / Local LLM / Python
Initial Goal
- I wanted a tool to automatically save images from open browser tabs to a local folder, with descriptive, search-friendly filenames and tags generated by an AI vision model — replacing manual save-and-rename workflows.
- All processing to be fully local: no cloud APIs, no subscriptions, no data leaving the machine.
- The tool should use only free open-source software, run on my Windows 11 desktop, and leverage my RTX 3060 GPU for local AI inference.
- Two operating modes: fully automatic (save all qualifying images without interruption) and human-in-the-loop review (confirm each image before saving).
Hardware Used
- Desktop: Windows 11 Home, NVIDIA GeForce RTX 3060 (12 GB VRAM)
- No additional hardware required — everything runs locally on the desktop
Software Stack
- Python 3.12
- Ollama 0.17.7 (local AI runtime)
- Vision model: llama3.2-vision:11b (11B-parameter multimodal model, ~7.9 GB)
- Playwright 1.58 (browser automation / Chrome CDP connection)
- Pillow (image processing and JPEG re-encoding)
- piexif (EXIF metadata writing)
- tkinter (GUI dialogs, stdlib)
- pytest (testing framework)
LLM Used (for development)
- Claude Sonnet 4.6 [Pro plan] (this chat session — requirements, architecture, all code, debugging)
Architecture Overview
The agent is structured as a set of independent modules wired together by the central main.py orchestrator:
- config.py — loads and validates config.toml
- ollama_client.py — all Ollama HTTP API calls and prompt templates
- browser.py — Playwright connector: tab enumeration, image extraction, screenshot fallback
- file_writer.py — filename construction, collision resolution, EXIF metadata writing
- session.py — run logging, tab outcome tracking, descriptor frequency map
- agent/gui/ — four tkinter dialogs: launch, progress, HITL confirm, summary
- main.py — orchestrator wiring all modules together
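The collision-resolution step in file_writer.py is mentioned but not shown. A minimal sketch of how it might work (the function name resolve_collision is my own, not the project's):

```python
from pathlib import Path

def resolve_collision(folder: Path, stem: str, ext: str = ".jpg") -> Path:
    """Return a path in `folder` that does not exist yet, appending
    _2, _3, ... to the stem on collision. Hypothetical helper, not
    the project's actual implementation."""
    candidate = folder / f"{stem}{ext}"
    counter = 2
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{ext}"
        counter += 1
    return candidate
```

Since two different images can legitimately produce the same five descriptors, some scheme like this is needed so a second blonde_athletic_woman_standing_young.jpg does not overwrite the first.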
How It Works
- User launches Chrome via setup_chrome.bat (opens Chrome with --remote-debugging-port=9222 --user-data-dir="C:\ChromeDebug")
- User opens image tabs in that Chrome window, then runs python main.py
- Agent verifies Ollama is reachable and the vision model is available
- Launch dialog appears — user selects Automatic or Review mode and confirms the output folder
- Playwright attaches to Chrome via CDP and enumerates all open tabs across all windows
- For each tab:
  - Direct image URL tabs (.jpg, .png, etc.) qualify automatically
  - Other tabs are DOM-scanned for a dominant <img> element ≥ 500 px
  - Image is downloaded directly from its source URL (screenshot fallback if blocked)
  - Vision model call 1: "Does this image contain a person? YES or NO"
  - Vision model call 2: "Describe the person with exactly 5 comma-separated descriptor terms"
  - Terms are parsed, sanitized, and joined into a filename (e.g. blonde_athletic_woman_standing_young.jpg)
- In Automatic mode: image is saved immediately
- In Review mode: confirmation dialog shows a thumbnail and an editable filename; user clicks Save, Skip, or Cancel
- All saved images get fresh EXIF metadata: current date/time and the 5 terms written as Windows-visible tags
- Summary dialog shows counts, duration, an output-folder link, and the most-used descriptor terms
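The two vision calls go through Ollama's /api/generate endpoint, which accepts base64-encoded images alongside the prompt. A sketch of how the requests might be built (the exact prompts and helper names are assumptions, not the project's code):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port
MODEL = "llama3.2-vision:11b"

def build_payload(prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate request with one attached image."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask(prompt: str, image_bytes: bytes) -> str:
    """POST the payload and return the model's text response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"].strip()

# Call 1 gates on person presence; Call 2 runs only if Call 1 says YES:
#   ask("Does this image contain a person? Answer YES or NO.", img)
#   ask("Describe the person with exactly 5 comma-separated descriptor terms.", img)
```

Keeping the gate question and the descriptor question as two separate calls makes each reply trivially parseable, at the cost of a second inference pass.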
Development Process
- Built iteratively in 8 steps, testing at every stage before proceeding:
  1. config.py + tests
  2. ollama_client.py + mocked tests + live Ollama tests
  3. browser.py + mocked tests + live Chrome tests
  4. file_writer.py + tests
  5. session.py + tests
  6. GUI modules (4 dialogs) + headless tests
  7. main.py (orchestrator)
  8. Live end-to-end runs with real image tabs
- Final test count: 197 passing tests across all modules
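The log doesn't show what a "mocked test" for ollama_client.py looks like. One plausible shape, patching urllib so no live Ollama is needed (query_model here is a simplified stand-in, not the real module's function):

```python
import io
import json
from unittest import mock
import urllib.request

def query_model(prompt: str) -> str:
    """Minimal stand-in for an ollama_client call (hypothetical; the
    real module also attaches an image and reads config.toml)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.2-vision:11b",
                         "prompt": prompt,
                         "stream": False}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def test_query_model_parses_response():
    """Mocked test: urlopen is patched, so it runs without Ollama."""
    fake = io.BytesIO(json.dumps({"response": "YES"}).encode("utf-8"))
    with mock.patch("urllib.request.urlopen") as fake_open:
        fake_open.return_value.__enter__.return_value = fake
        assert query_model("Does this image contain a person?") == "YES"
```

Tests of this style run fast under pytest and keep the 197-test suite independent of whether the model is loaded.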
Investigation and Bugs Fixed During Live Testing
- Chrome debug port not opening — Chrome hands off new instances to an existing process on Windows; fixed by using --user-data-dir="C:\ChromeDebug" to force an independent instance
- Trailing punctuation on model terms — model returned professional. with a trailing period; fixed by stripping trailing non-word characters before validation
- Multi-word phrases in model terms — model returned short hair and with hands in pockets; fixed by collapsing spaces within terms to hyphens (short-hair, with-hands-in-pockets)
- First-image thumbnail not displaying — PhotoImage objects were being garbage collected when multiple tk.Tk() instances were created and destroyed; fixed by switching confirm dialogs to tk.Toplevel with a single persistent hidden tk.Tk root kept alive for the entire session
- Original EXIF metadata preserved on download — one stock image had a 2018 creation date embedded; fixed by always re-encoding images with piexif, stripping old EXIF and writing fresh date/time and keyword tags
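The two term fixes above (trailing punctuation, multi-word phrases) compose naturally into one sanitizer. A sketch under those rules (function names are mine, and lowercasing is my assumption based on the example filenames):

```python
import re

def sanitize_term(raw: str) -> str:
    """Normalize one model descriptor: trim whitespace, lowercase,
    strip trailing non-word characters ('professional.' -> 'professional'),
    and collapse internal spaces to hyphens ('short hair' -> 'short-hair')."""
    term = raw.strip().lower()
    term = re.sub(r"\W+$", "", term)   # drop trailing punctuation
    term = re.sub(r"\s+", "-", term)   # spaces -> hyphens
    return term

def build_filename(terms: list[str], ext: str = ".jpg") -> str:
    """Join sanitized terms with underscores into the final filename."""
    return "_".join(sanitize_term(t) for t in terms) + ext
```

For example, build_filename(["Blonde", "athletic", "woman", "standing", "young"]) yields the blonde_athletic_woman_standing_young.jpg pattern shown earlier.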
Final Results
- 9/9 images processed correctly on the final test run, 0 errors
- Correct descriptive filenames generated for all images
- All saved files show today's date/time in Windows Explorer
- Windows Explorer Tags column shows the 5 descriptor terms for each file
- Both Automatic and Review modes are fully working
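Windows Explorer reads the Tags column from the XPKeywords EXIF field, which stores a semicolon-separated, UTF-16LE-encoded string. A sketch of preparing that value (the NUL terminator and the piexif call in the comment are my assumptions about a conventional write, not the project's exact code):

```python
def encode_xp_keywords(terms: list[str]) -> tuple[int, ...]:
    """Encode descriptor terms for the Windows XPKeywords EXIF field:
    semicolon-separated, UTF-16LE, NUL-terminated, returned as a tuple
    of byte values (the representation piexif uses for BYTE-typed tags)."""
    joined = ";".join(terms) + "\x00"
    return tuple(joined.encode("utf-16-le"))

# Writing it with piexif might look like (assumption, needs a real image):
#   import piexif
#   exif = {"0th": {piexif.ImageIFD.XPKeywords: encode_xp_keywords(terms)}}
#   img.save(path, "jpeg", exif=piexif.dump(exif))
```

Because the agent re-encodes every image and writes this field fresh, the five descriptor terms appear directly in Explorer's Tags column.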
Daily Use
- Open a command prompt, run ollama serve (leave open)
- Open a second prompt, run ollama run llama3.2-vision:11b (warms the model into VRAM, leave open)
- Run setup_chrome.bat (opens the debug Chrome session)
- Open image tabs in that Chrome window
- Run python main.py from the project folder
Watch Out For (Future)
- config.toml paths must use forward slashes (/) or double backslashes (\\) — single backslashes cause a TOML parse error
- Chrome must be launched via setup_chrome.bat (or the manual command with --user-data-dir), not as a normal Chrome window — otherwise port 9222 will not be open
- The vision model must be warmed into VRAM before running (ollama run llama3.2-vision:11b) — otherwise /api/generate calls will time out
- The model tends to tag walking people as "standing" — a known llama3.2-vision:11b pose-recognition tendency, not a bug
- If a term contains spaces (e.g. short hair), the agent converts it to a hyphenated term (short-hair) automatically — this is expected behavior
Inquiries for TO-DO items:
1. Can we enable the model to recognize that there are multiple people in the image? For instance, if 4 people are together, the model could output "4-people" in the filename and tags; if they are all women, "4-women"; if they are all men, "4-men". Is this possible?
2. Is there any workaround for the model's tendency to see walkers as standers? If not, we should remove that category from the filename and tags so that we aren't getting false filename elements and/or tags; the script won't nudge the model to determine it.
3. What other descriptors would be useful? "Stand" vs "sit" (leaving aside "walking" for now); "dark-skin" vs "light-skin"; hair colors "blonde", "brunette", "redhead", and "black-hair"; facial expressions like "smile", "neutral-expression", or "frown"; also "glasses" and/or "hat" if the subject is wearing eyeglasses and/or a hat.
4. Continuing #3: regarding the image being color or black-and-white — can the model identify black-and-white images and put "bw" in the filename and tags?
5. Is there a solution to the problem of needing to open a debug Chrome session via setup_chrome.bat? Can this browser image agent work with regular non-debug Chrome tabs and incognito tabs? If not, would another browser work correctly: Firefox, Safari, Edge, Brave, Tor, LibreWolf, Opera, or another?
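On the black-and-white question above: the "bw" tag may not need the vision model at all, since Pillow (already in the stack) can detect grayscale content deterministically. A sketch, with an arbitrary sampling grid and tolerance for JPEG noise (both numbers are my guesses, not tuned values):

```python
from PIL import Image

def is_black_and_white(img: Image.Image, tolerance: int = 8,
                       sample_step: int = 10) -> bool:
    """Heuristic grayscale check: sample pixels on a grid and require
    R, G and B to be nearly equal at every sampled pixel. `tolerance`
    absorbs JPEG compression noise; both parameters are assumptions."""
    rgb = img.convert("RGB")
    w, h = rgb.size
    for x in range(0, w, sample_step):
        for y in range(0, h, sample_step):
            r, g, b = rgb.getpixel((x, y))
            if max(r, g, b) - min(r, g, b) > tolerance:
                return False
    return True
```

If this returns True, the agent could simply append bw to the descriptor list before filename construction, avoiding another model call entirely.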