Master Plan v2: Local-first LLM Knowledge Base, Cloud Portfolio, & Study Log System
Living markdown document for daily use, revision, and future planning
Introduction
This document is the current master plan for building a local-first, open-source, LLM-friendly knowledge base that will also become a cloud-hosted professional interface for recruiters, prospective employers, collaborators, and public readers.
This project is not only a software system. It is also part of a larger learning program. Its purpose is to help develop practical skill in:
- Python
- document processing
- LLM workflows
- retrieval and search
- cloud/web deployment
- software architecture
- technical writing
- reflective documentation
- portfolio presentation
The project will serve several roles at once:
- a personal knowledge base for study and project work
- a local document-processing and retrieval system
- a cloud-hosted portfolio and access interface
- a structured public record of learning, progress, and technical growth
- a platform that can later integrate multiple LLMs across different machines
This is a living document. It should be updated frequently as decisions are made, tools are tested, issues are discovered, and new requirements appear.
1. Project Purpose and Scope
1.1 Core Purpose
The core purpose of this project is to create a durable system for learning, building, documenting, and presenting technical work.
This includes:
- converting documents into LLM-friendly forms
- preserving them in structured formats
- enabling later retrieval and search
- using them in local and cloud-accessible systems
- publishing selected outputs as part of a professional portfolio
This project exists in service of:
- academic study
- hands-on project learning
- technical experimentation
- professional development
- future employability
1.2 Primary Outcomes
The system must support all of the following:
1.2.1 Local knowledge ingestion and processing
Documents should be parsed and stored locally first.
1.2.2 Structured archival
The knowledge base should preserve source documents, structured parsed outputs, chunked retrieval records, and metadata.
1.2.3 Local LLM use
The system should work with the user’s local machines and local LLMs.
1.2.4 Cloud-hosted interface
A cloud-hosted interface is required, not optional. It will be the outward-facing access point for:
- recruiters
- prospective employers
- collaborators
- public readers
- the user from remote locations
This interface will include some combination of:
- project pages
- blog or mini-report entries
- study logs
- selected searchable content
- summaries of technical projects
- evidence of progress and skill development
1.2.5 Multi-machine interoperability
The system should be designed with practical interfacing between:
- a low-end laptop with a smaller local LLM
- a mid-range desktop with larger local LLM options
- a future cloud-hosted service or model endpoint
1.2.6 Ongoing learning documentation
Every workday should produce some written record, even if brief. This is a formal requirement of the project.
Possible daily written outputs include:
- progress log entry
- decisions log entry
- issue/resolution note
- study reflection
- mini-report
- blog draft
- weekly summary
1.3 Success Criteria
This project is successful if it becomes a working system where:
- documents are ingested in an organized way
- Markdown and JSON are produced reliably
- chunks and metadata are searchable
- local LLM workflows become easier and less confusing
- the cloud site presents selected work clearly
- daily or near-daily written reflection accumulates
- the project demonstrates technical growth to outside readers
2. Guiding Principles
2.1 Local-first, cloud-accessible
Processing begins locally, but the system must be able to publish selected outputs to the cloud interface.
2.2 Open-source-first
Prefer tools that are open-source, scriptable, understandable, and replaceable.
2.3 One source of truth
Each raw document should exist once as a preserved original, with derived outputs generated from it.
2.4 Structure over flattening
Preserve headings, sections, tables, metadata, and document boundaries whenever possible.
2.5 Learning through doing
This is a study system as well as a product system. Decisions, experiments, mistakes, and revisions are part of the output.
2.6 Small repeatable steps
Prefer workflows that can be done incrementally and tested often.
2.7 Public professionalism
Selected outputs should eventually be suitable for public viewing by recruiters or future colleagues.
3. Chosen Technical Direction
3.1 Primary parser
Docling is the primary parser.
Why
- open-source
- local-first
- supports AI-oriented document workflows
- suitable for modern selectable-text PDFs
- exports structured output
- Python-friendly
3.2 Fallback parser for scanned PDFs
LlamaParse is the planned fallback for problematic scanned or image-based PDFs.
Use case
Only for documents that do not parse well enough locally.
Folder separation
Scanned/problematic files should be isolated in a separate subfolder from the beginning.
3.3 Canonical parsed outputs
The canonical parsed outputs are:
- Markdown
- JSON
Markdown is for
- human readability
- LLM readability
- website rendering
- documentation
- easier review
JSON is for
- structure
- metadata
- automation
- chunking
- search systems
- future integrations
3.4 Canonical workflow
The standard workflow is:
source file → parser → Markdown/JSON → semantic chunks with source metadata → vector DB or searchable archive → local/cloud interface
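This workflow can be sketched as composable stages. The following is a minimal illustration only; every function body here is a stub standing in for the real parser, chunker, and index code:

```python
# Sketch of the canonical workflow as composable stages.
# All names and return values here are illustrative stubs, not a fixed API.

def parse(source_path: str) -> dict:
    """Parser stage: in the real pipeline this would call Docling."""
    return {"document_id": "DOC-000001", "markdown": "# Example Title", "elements": []}

def chunk(parsed: dict) -> list[dict]:
    """Chunking stage: split parsed output into semantic chunks with metadata."""
    return [{
        "chunk_id": f"{parsed['document_id']}-CH-0001",
        "document_id": parsed["document_id"],
        "chunk_text": parsed["markdown"],
    }]

def index(chunks: list[dict]) -> dict:
    """Indexing stage: stubbed as an in-memory dict keyed by chunk ID."""
    return {c["chunk_id"]: c for c in chunks}

def run_pipeline(source_path: str) -> dict:
    """Chain the stages: source file -> parse -> chunk -> searchable archive."""
    return index(chunk(parse(source_path)))

archive = run_pipeline("originals/professional/example.pdf")
```

Keeping each stage a separate function means any stage can be swapped (for example, LlamaParse in place of Docling) without touching the others.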
4. System Architecture
4.1 Architecture layers
The system should have five main layers.
Layer A: source preservation
Original documents remain untouched.
Layer B: parsing and normalization
Documents are converted into canonical Markdown and JSON.
Layer C: retrieval preparation
Chunks, metadata, embeddings, and indexable records are generated.
Layer D: local interfaces
Used for local search, testing, and local LLM workflows.
Layer E: cloud interface
Used for public presentation, recruiter-facing portfolio pages, selected search, and possibly authenticated personal access.
4.2 Machine roles
Laptop
Primary use:
- light parsing
- testing
- small local LLM workflows
- note-taking
- writing
- remote work
Desktop
Primary use:
- heavier parsing
- embedding generation
- larger local LLM workflows
- more demanding experimentation
Cloud
Primary use:
- public portfolio interface
- selected private access
- publishing project summaries and logs
- later potential hosted inference or retrieval features
4.3 Interoperability goal
The system should allow files and outputs to move cleanly between machines.
At minimum, this requires:
- consistent folder naming
- stable document IDs
- clear manifests
- portable file formats
- machine-independent metadata
- sync/export rules
5. Folder Structure
knowledge_base/
│
├── originals/
│   ├── professional/
│   ├── hobby/
│   ├── gaming/
│   ├── scanned/
│   └── misc/
│
├── parsed/
│   ├── md/
│   ├── json/
│   ├── tables/
│   └── images/
│
├── chunks/
│   ├── jsonl/
│   ├── debug/
│   └── exported/
│
├── indexes/
│   ├── keyword/
│   ├── vector/
│   └── hybrid/
│
├── manifests/
│   ├── documents.csv
│   ├── processing_log.csv
│   ├── chunk_manifest.csv
│   ├── daily_log.csv
│   └── publishing_manifest.csv
│
├── logs/
│   ├── daily/
│   ├── weekly/
│   ├── issues/
│   └── decisions/
│
├── scripts/
│   ├── ingest/
│   ├── parse/
│   ├── chunk/
│   ├── embed/
│   ├── search/
│   ├── publish/
│   └── utilities/
│
├── tests/
│   ├── smoke/
│   ├── integration/
│   └── fixtures/
│
├── site/
│   ├── content/
│   ├── public/
│   ├── private/
│   ├── templates/
│   └── exports/
│
├── config/
│   ├── settings.yaml
│   ├── categories.yaml
│   ├── models.yaml
│   └── secrets_template.env
│
└── docs/
    ├── master_plan.md
    ├── architecture.md
    ├── roadmap.md
    ├── decisions.md
    ├── issues_and_resolutions.md
    └── study_notes.md
6. Required Data Standards
6.1 Document ID format
Each source document gets a stable ID.
Example:
DOC-000001
DOC-000002
DOC-000003
6.2 Chunk ID format
DOC-000001-CH-0001
DOC-000001-CH-0002
6.3 Daily log entry ID format
LOG-2026-03-09
LOG-2026-03-10
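These three formats are simple enough to implement as small helpers. A minimal sketch (function names are my own, not part of any existing script):

```python
# Helpers implementing the document, chunk, and daily log ID formats above.

def document_id(n: int) -> str:
    """Stable document ID, zero-padded to six digits."""
    return f"DOC-{n:06d}"

def chunk_id(doc_id: str, k: int) -> str:
    """Chunk ID derived from its parent document ID."""
    return f"{doc_id}-CH-{k:04d}"

def log_id(iso_date: str) -> str:
    """Daily log entry ID keyed by ISO date."""
    return f"LOG-{iso_date}"

print(document_id(1))               # DOC-000001
print(chunk_id(document_id(1), 2))  # DOC-000001-CH-0002
print(log_id("2026-03-09"))         # LOG-2026-03-09
```

Fixed-width zero padding keeps the IDs sortable as plain strings, which matters for manifests and filenames.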
6.4 Required document metadata
Each document should track:
- document_id
- title
- source_filename
- source_path
- category
- file_type
- visibility
- ingest_date
- last_processed_date
- parser_used
- parse_status
- language
- page_count
- has_selectable_text
- is_scanned
- contains_tables
- contains_images
- contains_ocr
- checksum_hash
- notes
6.5 Required chunk metadata
Each chunk should track:
- chunk_id
- document_id
- document_title
- section_path
- page_start
- page_end
- chunk_order
- chunk_type
- token_estimate
- embedding_status
- visibility
- chunk_text
6.6 Required daily progress metadata
Each workday should track:
- date
- time_spent
- project_area
- goal_for_day
- work_completed
- decision_made
- issue_found
- issue_resolved
- resources_used
- next_step
- public_summary_ready
- notes
7. File Templates
7.1 documents.csv
document_id,title,source_filename,source_path,category,file_type,visibility,ingest_date,last_processed_date,parser_used,parse_status,language,page_count,has_selectable_text,is_scanned,contains_tables,contains_images,contains_ocr,checksum_hash,notes
7.2 processing_log.csv
timestamp,document_id,script_name,action,status,message,output_md_path,output_json_path,duration_seconds
7.3 chunk_manifest.csv
chunk_id,document_id,document_title,section_path,page_start,page_end,chunk_order,chunk_type,token_estimate,embedding_status,visibility,chunk_path
7.4 daily_log.csv
date,time_spent_hours,project_area,goal_for_day,work_completed,decision_made,issue_found,issue_resolved,resources_used,next_step,public_summary_ready,notes
7.5 publishing_manifest.csv
content_id,source_type,source_ref,title,visibility,publish_status,target_location,last_reviewed_date,notes
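Initializing these manifests can be a one-off script. A sketch (shown here for `daily_log.csv`; the other manifests follow the same pattern with their own header rows; the demo writes to a temporary directory rather than `knowledge_base/manifests/`):

```python
import csv
import tempfile
from pathlib import Path

# Column names copied from the daily_log.csv template above.
DAILY_LOG_HEADER = [
    "date", "time_spent_hours", "project_area", "goal_for_day",
    "work_completed", "decision_made", "issue_found", "issue_resolved",
    "resources_used", "next_step", "public_summary_ready", "notes",
]

def init_manifest(path: Path, header: list[str]) -> None:
    """Create an empty manifest CSV with its header row, if it does not exist."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            csv.writer(f).writerow(header)

# Demo in a temporary directory standing in for knowledge_base/manifests/.
manifests_dir = Path(tempfile.mkdtemp())
init_manifest(manifests_dir / "daily_log.csv", DAILY_LOG_HEADER)
```

The existence check makes the script safe to rerun, so it can be part of routine project setup on any machine.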
7.6 Daily markdown progress note template
# Daily Progress Log - YYYY-MM-DD
## Goal
-
## Time Spent
-
## Work Completed
-
## Decisions Made
-
## Problems or Questions
-
## Resources Used
-
## What I Learned
-
## Next Step
-
## Public-Facing Summary Draft
-
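Stamping out this template each morning can also be scripted. A minimal sketch (the `daily_note` helper is hypothetical, not an existing script):

```python
from datetime import date

# Section headings mirror the daily progress note template above.
SECTIONS = [
    "Goal", "Time Spent", "Work Completed", "Decisions Made",
    "Problems or Questions", "Resources Used", "What I Learned",
    "Next Step", "Public-Facing Summary Draft",
]

def daily_note(day: date) -> str:
    """Render an empty daily progress note for the given date."""
    lines = [f"# Daily Progress Log - {day.isoformat()}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "-", ""]
    return "\n".join(lines)

note = daily_note(date(2026, 3, 9))
```

Writing `note` to `logs/daily/LOG-YYYY-MM-DD.md` would keep the daily writing requirement down to filling in bullets.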
7.7 Decision note template
# Decision Note - YYYY-MM-DD - [Short Title]
## Context
-
## Options Considered
-
## Decision
-
## Reason
-
## Tradeoffs
-
## Follow-up Needed
-
7.8 Issue/resolution template
# Issue Log - YYYY-MM-DD - [Short Title]
## Problem
-
## Symptoms
-
## Cause
-
## Fix Attempted
-
## Resolution
-
## Prevention / Lesson Learned
-
7.9 Weekly summary template
# Weekly Summary - Week of YYYY-MM-DD
## Main Accomplishments
-
## Documents Added
-
## Scripts Built or Updated
-
## Problems Resolved
-
## Study Progress
-
## Portfolio/Public Writing Progress
-
## Priorities for Next Week
-
8. Markdown and JSON Standards
8.1 Markdown rules
Markdown output should preserve:
- a single H1 title where possible
- heading hierarchy
- lists
- section boundaries
- tables where feasible
- captions if recoverable
- page markers when useful for debugging or later citation
8.2 JSON rules
JSON should preserve structured document elements whenever possible.
Suggested shape:
{
  "document_id": "DOC-000001",
  "title": "Example Title",
  "source_filename": "example.pdf",
  "parser": "docling",
  "processed_at": "2026-03-09",
  "language": "en",
  "elements": [
    {
      "type": "heading",
      "level": 1,
      "text": "Example Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "Example paragraph text.",
      "page": 1
    }
  ]
}
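One reason this element shape works well: it carries enough structure to regenerate readable Markdown on demand, so the two canonical outputs stay consistent. A sketch (handling only headings and paragraphs; other element types would be added the same way):

```python
# Rebuild Markdown from the JSON "elements" shape suggested above.

def elements_to_markdown(doc: dict) -> str:
    out = []
    for el in doc["elements"]:
        if el["type"] == "heading":
            out.append("#" * el["level"] + " " + el["text"])
        elif el["type"] == "paragraph":
            out.append(el["text"])
    return "\n\n".join(out)

doc = {
    "document_id": "DOC-000001",
    "elements": [
        {"type": "heading", "level": 1, "text": "Example Title", "page": 1},
        {"type": "paragraph", "text": "Example paragraph text.", "page": 1},
    ],
}
markdown = elements_to_markdown(doc)
```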
8.3 Table handling
Tables should be preserved structurally when possible.
Preferred order:
- CSV or JSON for actual structure
- HTML if needed for rendering
- Markdown table only when simple enough
9. Chunking and Retrieval Design
9.1 Chunking objective
The objective is to create chunks that remain meaningful when retrieved on their own.
9.2 Chunking strategy
Prefer chunking by:
- heading
- subsection
- paragraph group
- table boundary
- image/caption boundary where relevant
Avoid purely blind fixed-length splitting as the main approach.
9.3 Target chunk size
Initial target:
-
about 250–700 tokens
-
overlap only where useful
-
preserve heading context in each chunk
9.4 Retrieval modes
The system should eventually support:
- keyword search
- semantic search
- hybrid search
- filtered search by category, visibility, date, or project area
10. Local and Cloud Interfaces
10.1 Local interface goals
Support:
- script-driven processing
- local search
- local retrieval testing
- local LLM prompt assembly
10.2 Cloud interface goals
Support:
- portfolio presentation
- project pages
- public blog or mini-reports
- selected study logs
- selected searchable content
- future authenticated sections
10.3 Public/private content classes
Each item should eventually be marked as one of:
- public
- private
- restricted
- draft
10.4 Publishing workflow
A future publishing path should look like:
daily/weekly writing → review → mark visibility → add to publishing manifest → export to site content
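The export step at the end of this path reduces to filtering the publishing manifest. A sketch using the column names from section 7.5 (the `"approved"` status value is an assumed vocabulary, not yet defined in the manifest):

```python
# Select manifest rows that are safe to export to the public site.
# Column names follow publishing_manifest.csv; "approved" is an assumed
# publish_status value for illustration.

def publishable(rows: list[dict]) -> list[dict]:
    return [r for r in rows
            if r["visibility"] == "public" and r["publish_status"] == "approved"]

rows = [
    {"content_id": "LOG-2026-03-09", "visibility": "public",  "publish_status": "approved"},
    {"content_id": "LOG-2026-03-10", "visibility": "private", "publish_status": "approved"},
    {"content_id": "LOG-2026-03-11", "visibility": "public",  "publish_status": "draft"},
]
to_publish = publishable(rows)
```

Requiring both conditions means nothing reaches `site/content/` without passing the review and visibility steps first.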
11. Learning, Reflection, and Documentation Requirements
11.1 This is a learning system
This project is part of a larger study program that also includes CompTIA Network+ study.
The project should therefore produce evidence of:
- learning goals
- practice applied
- experiments performed
- tools evaluated
- problems encountered
- lessons learned
- revisions made
11.2 Daily writing requirement
Each workday should produce written output.
Minimum acceptable output:
- one short daily log entry
Preferred output:
- a short daily log, plus
- one decision note, issue note, or public-facing summary draft
11.3 Public-facing writing requirement
At least some daily or weekly work should eventually be shaped into public portfolio material.
Typical forms:
- mini-report
- project note
- architecture note
- study reflection
- “what I built / what I learned” post
Target size:
- usually under one page of text
11.4 Calendar integration
Weekly goals may be copied into Google Calendar or other planning tools.
This plan should therefore remain clear, modular, and easy to excerpt.
12. Minimal Checklists
12.1 Project setup checklist
- Create root folder structure
- Create `docs/` files
- Create manifest CSV templates
- Create daily log template
- Define document ID rules
- Define visibility categories
- Select initial sample documents
12.2 Parser setup checklist
- Install Docling
- Confirm environment works
- Parse one test PDF
- Save Markdown output
- Save JSON output
- Record result in `processing_log.csv`
- Inspect output quality manually
12.3 Metadata checklist
- Assign `document_id`
- Fill category
- Fill visibility
- Record parser used
- Record parse status
- Store checksum/hash
- Add notes if parsing is imperfect
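The checksum step is a few lines with the standard library. A sketch (streamed reads so large PDFs don't load fully into memory; the demo hashes a temporary file rather than a real original):

```python
import hashlib
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of file contents, streamed in 64 KiB blocks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a temporary file standing in for a preserved original.
p = Path(tempfile.mkdtemp()) / "example.txt"
p.write_bytes(b"hello")
digest = checksum(p)
```

Storing the digest in `checksum_hash` lets any machine verify that an original survived syncing unchanged.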
12.4 Chunking checklist
- Create chunk script
- Chunk by sections where possible
- Save chunk JSONL
- Fill `chunk_manifest.csv`
- Estimate token counts
- Inspect chunk quality
12.5 Search checklist
- Implement basic keyword search
- Test by title and keyword
- Show chunk text with metadata
- Confirm source traceability
- Log retrieval issues
12.6 Cloud publishing checklist
- Choose hosting path
- Set up domain or domain plan
- Create homepage content
- Create project page template
- Create study log page template
- Create publishing manifest
- Publish first public page
12.7 Daily work checklist
- Define goal for the day
- Complete one visible task
- Write daily progress note
- Record one decision, issue, or lesson
- Identify next step
13. Development Timeline
13.1 Time assumption
Original assumption: 1 hour per weekday.
Updated assumption: likely 10+ hours weekly, with this project and Network+ study taking most available time.
This timeline assumes concentrated weekday work, with optional weekend catch-up or writing.
13.2 Overall expected duration
Estimated initial completion of the first fully usable version:
8 to 10 weeks
This means:
- a working local parsing pipeline
- chunking and basic search
- a first cloud portfolio interface
- a repeatable documentation habit
- a publishable first set of public-facing project materials
14. Daily Timeline from March 09, 2026
This timeline is designed as a practical path to the first usable complete version.
Week 1: Foundation and standards
Monday, March 09, 2026
- Create project root folder structure
- Create `docs/`, `logs/`, `manifests/`, `scripts/`
- Save this master plan into project docs
- Create empty CSV templates
- Write first daily log
Tuesday, March 10, 2026
- Define document categories
- Define visibility categories
- Define document ID and chunk ID formats
- Create `categories.yaml`
- Write decision note on taxonomy choices
Wednesday, March 11, 2026
- Select 5–10 representative sample documents
- Separate modern PDFs from scanned/problematic PDFs
- Record them in `documents.csv`
- Note any uncertain cases
Thursday, March 12, 2026
- Define metadata standards in writing
- Create `architecture.md`
- Create `roadmap.md`
- Write daily log and one public-summary draft
Friday, March 13, 2026
- Review structure created this week
- Clean up filenames and folders
- Update master plan with any clarifications
- Write weekly summary
Week 2: Parser environment and first outputs
Monday, March 16, 2026
- Install or confirm Docling environment
- Create a simple parse test script
- Test environment on one document
Tuesday, March 17, 2026
- Parse first modern selectable-text PDF
- Save Markdown and JSON outputs
- Record processing result
Wednesday, March 18, 2026
- Inspect first outputs manually
- Note heading quality, table quality, missing sections
- Write issue log if needed
Thursday, March 19, 2026
- Parse second and third sample documents
- Compare output quality across documents
- Update notes on parser behavior
Friday, March 20, 2026
- Refine parse script naming/output paths
- Write summary of parser setup progress
- Create first “what I built this week” draft
Week 3: Stable ingestion and manifests
Monday, March 23, 2026
- Add automatic document ID assignment or manual procedure
- Confirm `documents.csv` workflow
Tuesday, March 24, 2026
- Add checksum/hash recording
- Add parse status tracking
- Test manifest updates
Wednesday, March 25, 2026
- Process another small batch of documents
- Verify all outputs land in correct folders
Thursday, March 26, 2026
- Create or refine `processing_log.csv` workflow
- Make logging more consistent
Friday, March 27, 2026
- Review all parsed files so far
- Identify naming problems and fix standards
- Write weekly summary
Week 4: Markdown/JSON quality and cleanup rules
Monday, March 30, 2026
- Define acceptable Markdown output standards
- Define acceptable JSON output standards
Tuesday, March 31, 2026
- Identify common cleanup needs
- Decide what should be automated vs. manual
Wednesday, April 01, 2026
- Create a cleanup utility or checklist
- Test on one or two parsed files
Thursday, April 02, 2026
- Document cleanup rules in `decisions.md`
- Update architecture notes
Friday, April 03, 2026
- Review whether current outputs are good enough to begin chunking
- Write weekly summary
Week 5: Chunking prototype
Monday, April 06, 2026
- Define chunk metadata fields
- Create `chunk_manifest.csv`
Tuesday, April 07, 2026
- Write first chunking script prototype
- Chunk one document by heading/section
Wednesday, April 08, 2026
- Inspect chunk outputs manually
- Check chunk size and coherence
Thursday, April 09, 2026
- Refine chunking logic
- Add section-path preservation
Friday, April 10, 2026
- Process chunks for several test documents
- Write weekly summary and note remaining issues
Week 6: Searchable archive prototype
Monday, April 13, 2026
- Define search output format
- Decide what a useful search result must display
Tuesday, April 14, 2026
- Build simple keyword search over chunk outputs
Wednesday, April 15, 2026
- Test search using realistic questions
- Record retrieval quality problems
Thursday, April 16, 2026
- Add filtering by category, title, or visibility
Friday, April 17, 2026
- Summarize the first useful archive/search version
- Write public-facing mini-report draft
Week 7: Embeddings and semantic retrieval
Monday, April 20, 2026
- Select initial local embedding approach
- Record decision note
Tuesday, April 21, 2026
- Generate embeddings for a small chunk sample
Wednesday, April 22, 2026
- Test first semantic retrieval examples
Thursday, April 23, 2026
- Compare keyword and semantic retrieval
- Note strengths and failures
Friday, April 24, 2026
- Decide whether to proceed immediately with hybrid search or defer
- Write weekly summary
Week 8: Local LLM workflow and prompt assembly
Monday, April 27, 2026
- Define prompt assembly structure
- Choose what metadata accompanies retrieved chunks
Tuesday, April 28, 2026
- Test small local LLM workflow on laptop
Wednesday, April 29, 2026
- Test larger local LLM workflow on desktop
Thursday, April 30, 2026
- Compare laptop and desktop results
- Note where smaller vs. larger models are sufficient
Friday, May 01, 2026
- Write a “local multi-machine workflow” note
- Weekly summary
Week 9: Cloud interface planning and first public content
Monday, May 04, 2026
- Define public site sections
- Decide what content belongs on the site first
Tuesday, May 05, 2026
- Create first project page draft
- Create first study-log or blog page draft
Wednesday, May 06, 2026
- Create `publishing_manifest.csv`
- Mark first publishable items
Thursday, May 07, 2026
- Choose site hosting direction and note cost assumptions
Friday, May 08, 2026
- Review cloud interface requirements and update plan
- Weekly summary
Week 10: First cloud-facing portfolio prototype
Monday, May 11, 2026
- Set up site skeleton
- Create homepage content draft
Tuesday, May 12, 2026
- Add project page template
- Add study-log page template
Wednesday, May 13, 2026
- Publish first test content locally or to staging
Thursday, May 14, 2026
- Review readability, clarity, and recruiter usefulness
Friday, May 15, 2026
- Mark first usable version complete
- Write milestone summary
- Update master plan for Phase 2 expansion
15. “Completed” Definition for Version 1
Version 1 is complete when all of the following are true:
- raw documents are stored in an organized structure
- Docling parsing works for modern PDFs
- Markdown and JSON outputs are generated consistently
- chunk records exist with metadata
- basic local search works
- daily logging habit is active
- at least a few public-facing writeups exist
- the cloud interface has at least a first usable portfolio prototype
16. Phase 2 After Version 1
After the first complete version, next priorities may include:
- better semantic retrieval
- hosted private search
- stronger auth
- richer public project pages
- cleaner sync between laptop, desktop, and cloud
- improved logging and blog workflows
- later cloud-hosted model access, if affordable and useful
17. Weekly Operating Pattern
A good weekly pattern is:
Monday
Plan and define one clear goal.
Tuesday
Build or implement.
Wednesday
Test and inspect.
Thursday
Refine and document.
Friday
Summarize, publish notes, and plan next week.
Optional weekend time can be used for:
- backlog cleanup
- reading
- extra writing
- Network+ study
- public post polishing
18. Risks and Mitigations
Risk: too many moving parts
Mitigation: keep the first version narrow and standards-driven.
Risk: public site work distracts from core pipeline
Mitigation: publish only selected outputs after local artifacts are stable.
Risk: inconsistent daily documentation
Mitigation: require a minimum daily log even on low-energy days.
Risk: multi-machine confusion
Mitigation: use stable IDs, manifests, and export rules.
Risk: expensive cloud growth
Mitigation: keep heavy processing local and publish selectively.
19. Immediate Next Steps
- Save this plan into a Google Doc and a local markdown file
- Create the folder structure
- Create the CSV templates
- Create the daily markdown templates
- Start the first daily log for March 09, 2026
- Select the initial sample documents
- Begin Week 1, Day 1 tasks
Conclusion
This revised plan defines a practical path for building an LLM-friendly knowledge base that is also a visible public-facing professional interface.
The core strategy remains:
- local-first processing
- open-source tools
- Docling as the primary parser
- Markdown + JSON as canonical outputs
- semantic chunks with metadata
- daily written reflection
- gradual publication to a cloud-hosted portfolio
The major refinement in this version is that the cloud interface is now a required part of the system, not an optional later addition. This matters because the project is not only for private utility; it also demonstrates skill, discipline, and growth to outside viewers.
The project should therefore be treated as both:
- a technical system
- a documented learning journey
That combination is a strength. It means the work itself becomes part of the portfolio.
Revision Log
### 2026-03-09
- Revised plan to Version 2.
- Made cloud-hosted interface a required component.
- Added multi-machine local/cloud architecture considerations.
- Added daily writing and reflection as formal project requirements.
- Added minimal checklists.
- Added file templates.
- Added day-by-day timeline from March 09, 2026 through first completion milestone.
### YYYY-MM-DD
- Updated:
- Reason:
- Next decision: