Master Plan v2: Local-first LLM Knowledge Base, Cloud Portfolio, & Study Log System

 

Living markdown document for daily use, revision, and future planning


Introduction

This document is the current master plan for building a local-first, open-source, LLM-friendly knowledge base that will also become a cloud-hosted professional interface for recruiters, prospective employers, collaborators, and public readers.

This project is not only a software system. It is also part of a larger learning program. Its purpose is to help develop practical skill in:

  • Python

  • document processing

  • LLM workflows

  • retrieval and search

  • cloud/web deployment

  • software architecture

  • technical writing

  • reflective documentation

  • portfolio presentation

The project will serve several roles at once:

  1. a personal knowledge base for study and project work

  2. a local document-processing and retrieval system

  3. a cloud-hosted portfolio and access interface

  4. a structured public record of learning, progress, and technical growth

  5. a platform that can later integrate multiple LLMs across different machines

This is a living document. It should be updated frequently as decisions are made, tools are tested, issues are discovered, and new requirements appear.


1. Project Purpose and Scope

1.1 Core Purpose

The core purpose of this project is to create a durable system for learning, building, documenting, and presenting technical work.

This includes:

  • converting documents into LLM-friendly forms

  • preserving them in structured formats

  • enabling later retrieval and search

  • using them in local and cloud-accessible systems

  • publishing selected outputs as part of a professional portfolio

This project exists in service of:

  • academic study

  • hands-on project learning

  • technical experimentation

  • professional development

  • future employability

1.2 Primary Outcomes

The system must support all of the following:

1.2.1 Local knowledge ingestion and processing

Documents should be parsed and stored locally first.

1.2.2 Structured archival

The knowledge base should preserve source documents, structured parsed outputs, chunked retrieval records, and metadata.

1.2.3 Local LLM use

The system should work with the user’s local machines and local LLMs.

1.2.4 Cloud-hosted interface

A cloud-hosted interface is required, not optional. It will be the outward-facing access point for:

  • recruiters

  • prospective employers

  • collaborators

  • public readers

  • the user from remote locations

This interface will include some combination of:

  • project pages

  • blog or mini-report entries

  • study logs

  • selected searchable content

  • summaries of technical projects

  • evidence of progress and skill development

1.2.5 Multi-machine interoperability

The system should be designed with practical interfacing between:

  • a low-end laptop with a smaller local LLM

  • a mid-range desktop with larger local LLM options

  • a future cloud-hosted service or model endpoint

1.2.6 Ongoing learning documentation

Every workday should produce some written record, even if brief. This is a formal requirement of the project.

Possible daily written outputs include:

  • progress log entry

  • decisions log entry

  • issue/resolution note

  • study reflection

  • mini-report

  • blog draft

  • weekly summary

1.3 Success Criteria

This project is successful if it becomes a working system where:

  • documents are ingested in an organized way

  • Markdown and JSON are produced reliably

  • chunks and metadata are searchable

  • local LLM workflows become easier and less confusing

  • the cloud site presents selected work clearly

  • daily or near-daily written reflection accumulates

  • the project demonstrates technical growth to outside readers


2. Guiding Principles

2.1 Local-first, cloud-accessible

Processing begins locally, but the system must be able to publish selected outputs to the cloud interface.

2.2 Open-source-first

Prefer tools that are open-source, scriptable, understandable, and replaceable.

2.3 One source of truth

Each raw document should exist once as a preserved original, with derived outputs generated from it.

2.4 Structure over flattening

Preserve headings, sections, tables, metadata, and document boundaries whenever possible.

2.5 Learning through doing

This is a study system as well as a product system. Decisions, experiments, mistakes, and revisions are part of the output.

2.6 Small repeatable steps

Prefer workflows that can be done incrementally and tested often.

2.7 Public professionalism

Selected outputs should eventually be suitable for public viewing by recruiters or future colleagues.


3. Chosen Technical Direction

3.1 Primary parser

Docling is the primary parser.

Why

  • open-source

  • local-first

  • supports AI-oriented document workflows

  • suitable for modern selectable-text PDFs

  • exports structured output

  • Python-friendly

3.2 Fallback parser for scanned PDFs

LlamaParse is the planned fallback for problematic scanned or image-based PDFs.

Use case

Only for documents that do not parse well enough locally.

Folder separation

Scanned/problematic files should be isolated in a separate subfolder from the beginning.

3.3 Canonical parsed outputs

The canonical parsed outputs are:

  • Markdown

  • JSON

Markdown is for

  • human readability

  • LLM readability

  • website rendering

  • documentation

  • easier review

JSON is for

  • structure

  • metadata

  • automation

  • chunking

  • search systems

  • future integrations

3.4 Canonical workflow

The standard workflow is:

source file → parser → Markdown/JSON → semantic chunks with source metadata → vector DB or searchable archive → local/cloud interface
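The workflow above can be sketched as a chain of small Python functions; the stage names and signatures here are illustrative assumptions, not a fixed API:

```python
from pathlib import Path

def parse(source: Path) -> dict:
    """Parser stage: convert a source file into a structured dict
    (a stand-in for the parser's real structured output)."""
    # Illustrative stub: a real implementation would invoke the parser here.
    return {"title": source.stem, "elements": [{"type": "paragraph", "text": "..."}]}

def to_chunks(doc: dict, doc_id: str) -> list[dict]:
    """Chunking stage: attach source metadata to each element."""
    return [
        {"chunk_id": f"{doc_id}-CH-{i + 1:04d}",
         "document_id": doc_id,
         "chunk_text": el["text"]}
        for i, el in enumerate(doc["elements"])
    ]

def run_pipeline(source: Path, doc_id: str) -> list[dict]:
    """source file -> parser -> structured doc -> chunks with metadata."""
    return to_chunks(parse(source), doc_id)
```

Each stage stays independently testable, which matches the small-repeatable-steps principle below.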


4. System Architecture

4.1 Architecture layers

The system should have five main layers.

Layer A: source preservation

Original documents remain untouched.

Layer B: parsing and normalization

Documents are converted into canonical Markdown and JSON.

Layer C: retrieval preparation

Chunks, metadata, embeddings, and indexable records are generated.

Layer D: local interfaces

Used for local search, testing, and local LLM workflows.

Layer E: cloud interface

Used for public presentation, recruiter-facing portfolio pages, selected search, and possibly authenticated personal access.

4.2 Machine roles

Laptop

Primary use:

  • light parsing

  • testing

  • small local LLM workflows

  • note-taking

  • writing

  • remote work

Desktop

Primary use:

  • heavier parsing

  • embedding generation

  • larger local LLM workflows

  • more demanding experimentation

Cloud

Primary use:

  • public portfolio interface

  • selected private access

  • publishing project summaries and logs

  • later potential hosted inference or retrieval features

4.3 Interoperability goal

The system should allow files and outputs to move cleanly between machines.

At minimum, this requires:

  • consistent folder naming

  • stable document IDs

  • clear manifests

  • portable file formats

  • machine-independent metadata

  • sync/export rules


5. Folder Structure

knowledge_base/

├── originals/
│ ├── professional/
│ ├── hobby/
│ ├── gaming/
│ ├── scanned/
│ └── misc/

├── parsed/
│ ├── md/
│ ├── json/
│ ├── tables/
│ └── images/

├── chunks/
│ ├── jsonl/
│ ├── debug/
│ └── exported/

├── indexes/
│ ├── keyword/
│ ├── vector/
│ └── hybrid/

├── manifests/
│ ├── documents.csv
│ ├── processing_log.csv
│ ├── chunk_manifest.csv
│ ├── daily_log.csv
│ └── publishing_manifest.csv

├── logs/
│ ├── daily/
│ ├── weekly/
│ ├── issues/
│ └── decisions/

├── scripts/
│ ├── ingest/
│ ├── parse/
│ ├── chunk/
│ ├── embed/
│ ├── search/
│ ├── publish/
│ └── utilities/

├── tests/
│ ├── smoke/
│ ├── integration/
│ └── fixtures/

├── site/
│ ├── content/
│ ├── public/
│ ├── private/
│ ├── templates/
│ └── exports/

├── config/
│ ├── settings.yaml
│ ├── categories.yaml
│ ├── models.yaml
│ └── secrets_template.env

└── docs/
    ├── master_plan.md
    ├── architecture.md
    ├── roadmap.md
    ├── decisions.md
    ├── issues_and_resolutions.md
    └── study_notes.md

6. Required Data Standards

6.1 Document ID format

Each source document gets a stable ID.

Example:

DOC-000001
DOC-000002
DOC-000003

6.2 Chunk ID format

DOC-000001-CH-0001
DOC-000001-CH-0002

6.3 Daily log entry ID format

LOG-2026-03-09
LOG-2026-03-10
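These three formats can be generated deterministically; a minimal sketch (the helper names are assumptions):

```python
import datetime

def doc_id(n: int) -> str:
    """Stable document ID, e.g. DOC-000001."""
    return f"DOC-{n:06d}"

def chunk_id(document_id: str, order: int) -> str:
    """Chunk ID derived from its parent document, e.g. DOC-000001-CH-0001."""
    return f"{document_id}-CH-{order:04d}"

def log_id(day: datetime.date) -> str:
    """Daily log ID, e.g. LOG-2026-03-09."""
    return f"LOG-{day.isoformat()}"
```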

6.4 Required document metadata

Each document should track:

  • document_id

  • title

  • source_filename

  • source_path

  • category

  • file_type

  • visibility

  • ingest_date

  • last_processed_date

  • parser_used

  • parse_status

  • language

  • page_count

  • has_selectable_text

  • is_scanned

  • contains_tables

  • contains_images

  • contains_ocr

  • checksum_hash

  • notes

6.5 Required chunk metadata

Each chunk should track:

  • chunk_id

  • document_id

  • document_title

  • section_path

  • page_start

  • page_end

  • chunk_order

  • chunk_type

  • token_estimate

  • embedding_status

  • visibility

  • chunk_text

6.6 Required daily progress metadata

Each workday should track:

  • date

  • time_spent

  • project_area

  • goal_for_day

  • work_completed

  • decision_made

  • issue_found

  • issue_resolved

  • resources_used

  • next_step

  • public_summary_ready

  • notes


7. File Templates

7.1 documents.csv

document_id,title,source_filename,source_path,category,file_type,visibility,ingest_date,last_processed_date,parser_used,parse_status,language,page_count,has_selectable_text,is_scanned,contains_tables,contains_images,contains_ocr,checksum_hash,notes
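Manifest headers like this can be managed with the standard csv module; a sketch of appending one row (a field subset is shown for brevity, and the helper name is an assumption):

```python
import csv
from pathlib import Path

# Subset of the documents.csv columns, for illustration.
FIELDS = ["document_id", "title", "source_filename", "category",
          "visibility", "ingest_date", "parser_used", "parse_status"]

def append_document_row(manifest: Path, row: dict) -> None:
    """Append one row, writing the header only when the file is new."""
    is_new = not manifest.exists()
    with manifest.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```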

7.2 processing_log.csv

timestamp,document_id,script_name,action,status,message,output_md_path,output_json_path,duration_seconds

7.3 chunk_manifest.csv

chunk_id,document_id,document_title,section_path,page_start,page_end,chunk_order,chunk_type,token_estimate,embedding_status,visibility,chunk_path

7.4 daily_log.csv

date,time_spent_hours,project_area,goal_for_day,work_completed,decision_made,issue_found,issue_resolved,resources_used,next_step,public_summary_ready,notes

7.5 publishing_manifest.csv

content_id,source_type,source_ref,title,visibility,publish_status,target_location,last_reviewed_date,notes

7.6 Daily markdown progress note template

# Daily Progress Log - YYYY-MM-DD

## Goal
-

## Time Spent
-

## Work Completed
-

## Decisions Made
-

## Problems or Questions
-

## Resources Used
-

## What I Learned
-

## Next Step
-

## Public-Facing Summary Draft
-

7.7 Decision note template

# Decision Note - YYYY-MM-DD - [Short Title]

## Context
-

## Options Considered
-

## Decision
-

## Reason
-

## Tradeoffs
-

## Follow-up Needed
-

7.8 Issue/resolution template

# Issue Log - YYYY-MM-DD - [Short Title]

## Problem
-

## Symptoms
-

## Cause
-

## Fix Attempted
-

## Resolution
-

## Prevention / Lesson Learned
-

7.9 Weekly summary template

# Weekly Summary - Week of YYYY-MM-DD

## Main Accomplishments
-

## Documents Added
-

## Scripts Built or Updated
-

## Problems Resolved
-

## Study Progress
-

## Portfolio/Public Writing Progress
-

## Priorities for Next Week
-

8. Markdown and JSON Standards

8.1 Markdown rules

Markdown output should preserve:

  • a single H1 title where possible

  • heading hierarchy

  • lists

  • section boundaries

  • tables where feasible

  • captions if recoverable

  • page markers when useful for debugging or later citation

8.2 JSON rules

JSON should preserve structured document elements whenever possible.

Suggested shape:

{
  "document_id": "DOC-000001",
  "title": "Example Title",
  "source_filename": "example.pdf",
  "parser": "docling",
  "processed_at": "2026-03-09",
  "language": "en",
  "elements": [
    {
      "type": "heading",
      "level": 1,
      "text": "Example Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "Example paragraph text.",
      "page": 1
    }
  ]
}

8.3 Table handling

Tables should be preserved structurally when possible.

Preferred order:

  1. CSV or JSON for actual structure

  2. HTML if needed for rendering

  3. Markdown table only when simple enough


9. Chunking and Retrieval Design

9.1 Chunking objective

The objective is to create chunks that remain meaningful when retrieved on their own.

9.2 Chunking strategy

Prefer chunking by:

  • heading

  • subsection

  • paragraph group

  • table boundary

  • image/caption boundary where relevant

Avoid blind fixed-length splitting as the primary strategy.

9.3 Target chunk size

Initial target:

  • about 250–700 tokens

  • overlap only where useful

  • preserve heading context in each chunk
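A minimal heading-based chunker over Markdown output, as one sketch of this strategy (the token estimate uses a rough characters-divided-by-four heuristic, an assumption rather than a real tokenizer):

```python
def chunk_by_heading(markdown: str, document_id: str) -> list[dict]:
    """Split Markdown on headings, keeping the heading text with each chunk."""
    chunks, heading, lines = [], "", []

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({
                "chunk_id": f"{document_id}-CH-{len(chunks) + 1:04d}",
                "document_id": document_id,
                "section_path": heading,
                # Preserve heading context inside the chunk text itself.
                "chunk_text": f"{heading}\n{text}" if heading else text,
                "token_estimate": len(text) // 4,  # rough heuristic only
            })

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Chunks that exceed the target size would still need a secondary split, but only within a section, never across one.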

9.4 Retrieval modes

The system should eventually support:

  • keyword search

  • semantic search

  • hybrid search

  • filtered search by category, visibility, date, or project area


10. Local and Cloud Interfaces

10.1 Local interface goals

Support:

  • script-driven processing

  • local search

  • local retrieval testing

  • local LLM prompt assembly

10.2 Cloud interface goals

Support:

  • portfolio presentation

  • project pages

  • public blog or mini-reports

  • selected study logs

  • selected searchable content

  • future authenticated sections

10.3 Public/private content classes

Each item should eventually be marked as one of:

  • public

  • private

  • restricted

  • draft

10.4 Publishing workflow

A future publishing path should look like:

daily/weekly writing → review → mark visibility → add to publishing manifest → export to site content


11. Learning, Reflection, and Documentation Requirements

11.1 This is a learning system

This project is part of a larger study program that also includes CompTIA Network+ study.

The project should therefore produce evidence of:

  • learning goals

  • practice applied

  • experiments performed

  • tools evaluated

  • problems encountered

  • lessons learned

  • revisions made

11.2 Daily writing requirement

Each workday should produce written output.

Minimum acceptable output:

  • one short daily log entry

Preferred output:

  • a short daily log plus

  • one decision note, issue note, or public-facing summary draft

11.3 Public-facing writing requirement

At least some daily or weekly work should eventually be shaped into public portfolio material.

Typical forms:

  • mini-report

  • project note

  • architecture note

  • study reflection

  • “what I built / what I learned” post

Target size:

  • usually under one page of text

11.4 Calendar integration

Weekly goals may be copied into Google Calendar or other planning tools.
This plan should therefore remain clear, modular, and easy to excerpt.


12. Minimal Checklists

12.1 Project setup checklist

  • Create root folder structure

  • Create docs/ files

  • Create manifest CSV templates

  • Create daily log template

  • Define document ID rules

  • Define visibility categories

  • Select initial sample documents

12.2 Parser setup checklist

  • Install Docling

  • Confirm environment works

  • Parse one test PDF

  • Save Markdown output

  • Save JSON output

  • Record result in processing_log.csv

  • Inspect output quality manually

12.3 Metadata checklist

  • Assign document_id

  • Fill category

  • Fill visibility

  • Record parser used

  • Record parse status

  • Store checksum/hash

  • Add notes if parsing is imperfect
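The checksum/hash step needs nothing beyond the standard library; a SHA-256 sketch:

```python
import hashlib
from pathlib import Path

def checksum(path: Path, block_size: int = 65536) -> str:
    """SHA-256 of a file, read in blocks so large PDFs never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```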

12.4 Chunking checklist

  • Create chunk script

  • Chunk by sections where possible

  • Save chunk JSONL

  • Fill chunk_manifest.csv

  • Estimate token counts

  • Inspect chunk quality

12.5 Search checklist

  • Implement basic keyword search

  • Test by title and keyword

  • Show chunk text with metadata

  • Confirm source traceability

  • Log retrieval issues
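Basic keyword search over the chunk JSONL can start this simply; the field names follow the chunk metadata standard above, while the naive term-count scoring is an assumption for the first version:

```python
import json
from pathlib import Path

def keyword_search(jsonl_path: Path, query: str, limit: int = 5) -> list[dict]:
    """Rank chunk records by how often the query terms appear in their text."""
    terms = query.lower().split()
    scored = []
    for line in jsonl_path.read_text(encoding="utf-8").splitlines():
        chunk = json.loads(line)
        text = chunk.get("chunk_text", "").lower()
        score = sum(text.count(t) for t in terms)
        if score:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:limit]]
```

Because each record carries chunk_id and document_id, every hit remains traceable to its source.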

12.6 Cloud publishing checklist

  • Choose hosting path

  • Set up domain or domain plan

  • Create homepage content

  • Create project page template

  • Create study log page template

  • Create publishing manifest

  • Publish first public page

12.7 Daily work checklist

  • Define goal for the day

  • Complete one visible task

  • Write daily progress note

  • Record one decision, issue, or lesson

  • Identify next step


13. Development Timeline

13.1 Time assumption

Original assumption: 1 hour per weekday.

Updated assumption: likely 10+ hours weekly, with this project and Network+ study taking most available time.

This timeline assumes concentrated weekday work, with optional weekend catch-up or writing.

13.2 Overall expected duration

Estimated initial completion of the first fully usable version:

8 to 10 weeks

This means:

  • a working local parsing pipeline

  • chunking and basic search

  • a first cloud portfolio interface

  • a repeatable documentation habit

  • a publishable first set of public-facing project materials


14. Daily Timeline from March 09, 2026

This timeline is designed as a practical path to the first usable complete version.

Week 1: Foundation and standards

Monday, March 09, 2026

  • Create project root folder structure

  • Create docs/, logs/, manifests/, scripts/

  • Save this master plan into project docs

  • Create empty CSV templates

  • Write first daily log

Tuesday, March 10, 2026

  • Define document categories

  • Define visibility categories

  • Define document ID and chunk ID formats

  • Create categories.yaml

  • Write decision note on taxonomy choices

Wednesday, March 11, 2026

  • Select 5–10 representative sample documents

  • Separate modern PDFs from scanned/problematic PDFs

  • Record them in documents.csv

  • Note any uncertain cases

Thursday, March 12, 2026

  • Define metadata standards in writing

  • Create architecture.md

  • Create roadmap.md

  • Write daily log and one public-summary draft

Friday, March 13, 2026

  • Review structure created this week

  • Clean up filenames and folders

  • Update master plan with any clarifications

  • Write weekly summary


Week 2: Parser environment and first outputs

Monday, March 16, 2026

  • Install or confirm Docling environment

  • Create a simple parse test script

  • Test environment on one document
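A first parse test script might look like the sketch below. The DocumentConverter usage reflects Docling's documented API at the time of writing, but should be verified against the installed version; the output paths follow this project's folder conventions:

```python
import json
from pathlib import Path

def output_paths(root: Path, doc_id: str) -> tuple[Path, Path]:
    """Canonical output locations under parsed/ for one document."""
    return (root / "parsed" / "md" / f"{doc_id}.md",
            root / "parsed" / "json" / f"{doc_id}.json")

def parse_one(source: Path, root: Path, doc_id: str) -> None:
    """Parse a single file with Docling and save Markdown + JSON outputs."""
    from docling.document_converter import DocumentConverter  # local import: heavy dependency

    md_path, json_path = output_paths(root, doc_id)
    md_path.parent.mkdir(parents=True, exist_ok=True)
    json_path.parent.mkdir(parents=True, exist_ok=True)

    result = DocumentConverter().convert(source)
    md_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
    json_path.write_text(json.dumps(result.document.export_to_dict(), indent=2),
                         encoding="utf-8")
```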

Tuesday, March 17, 2026

  • Parse first modern selectable-text PDF

  • Save Markdown and JSON outputs

  • Record processing result

Wednesday, March 18, 2026

  • Inspect first outputs manually

  • Note heading quality, table quality, missing sections

  • Write issue log if needed

Thursday, March 19, 2026

  • Parse second and third sample documents

  • Compare output quality across documents

  • Update notes on parser behavior

Friday, March 20, 2026

  • Refine parse script naming/output paths

  • Write summary of parser setup progress

  • Create first “what I built this week” draft


Week 3: Stable ingestion and manifests

Monday, March 23, 2026

  • Add automatic document ID assignment or manual procedure

  • Confirm documents.csv workflow

Tuesday, March 24, 2026

  • Add checksum/hash recording

  • Add parse status tracking

  • Test manifest updates

Wednesday, March 25, 2026

  • Process another small batch of documents

  • Verify all outputs land in correct folders

Thursday, March 26, 2026

  • Create or refine processing_log.csv workflow

  • Make logging more consistent

Friday, March 27, 2026

  • Review all parsed files so far

  • Identify naming problems and fix standards

  • Write weekly summary


Week 4: Markdown/JSON quality and cleanup rules

Monday, March 30, 2026

  • Define acceptable Markdown output standards

  • Define acceptable JSON output standards

Tuesday, March 31, 2026

  • Identify common cleanup needs

  • Decide what should be automated vs manual

Wednesday, April 01, 2026

  • Create a cleanup utility or checklist

  • Test on one or two parsed files

Thursday, April 02, 2026

  • Document cleanup rules in decisions.md

  • Update architecture notes

Friday, April 03, 2026

  • Review whether current outputs are good enough to begin chunking

  • Write weekly summary


Week 5: Chunking prototype

Monday, April 06, 2026

  • Define chunk metadata fields

  • Create chunk_manifest.csv

Tuesday, April 07, 2026

  • Write first chunking script prototype

  • Chunk one document by heading/section

Wednesday, April 08, 2026

  • Inspect chunk outputs manually

  • Check chunk size and coherence

Thursday, April 09, 2026

  • Refine chunking logic

  • Add section-path preservation

Friday, April 10, 2026

  • Process chunks for several test documents

  • Write weekly summary and note remaining issues


Week 6: Searchable archive prototype

Monday, April 13, 2026

  • Define search output format

  • Decide what a useful search result must display

Tuesday, April 14, 2026

  • Build simple keyword search over chunk outputs

Wednesday, April 15, 2026

  • Test search using realistic questions

  • Record retrieval quality problems

Thursday, April 16, 2026

  • Add filtering by category, title, or visibility

Friday, April 17, 2026

  • Summarize the first useful archive/search version

  • Write public-facing mini-report draft


Week 7: Embeddings and semantic retrieval

Monday, April 20, 2026

  • Select initial local embedding approach

  • Record decision note

Tuesday, April 21, 2026

  • Generate embeddings for a small chunk sample

Wednesday, April 22, 2026

  • Test first semantic retrieval examples
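Semantic retrieval ultimately reduces to comparing embedding vectors; cosine similarity in plain Python is enough for the first tests. The embeddings are assumed here to be precomputed lists of floats from whatever local model is chosen:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec: list[float], chunks: list[dict]) -> list[dict]:
    """Sort chunks by similarity of their 'embedding' field to the query vector."""
    return sorted(chunks,
                  key=lambda c: cosine_similarity(query_vec, c["embedding"]),
                  reverse=True)
```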

Thursday, April 23, 2026

  • Compare keyword and semantic retrieval

  • Note strengths and failures

Friday, April 24, 2026

  • Decide whether to proceed immediately with hybrid search or defer

  • Write weekly summary


Week 8: Local LLM workflow and prompt assembly

Monday, April 27, 2026

  • Define prompt assembly structure

  • Choose what metadata accompanies retrieved chunks
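One way prompt assembly could work, with each retrieved chunk tagged by its ID and section path so answers stay traceable; the template wording is an assumption:

```python
def assemble_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt from retrieved chunks, each labeled with its metadata."""
    parts = ["Answer using only the sources below. Cite chunk IDs.\n"]
    for c in chunks:
        parts.append(f"[{c['chunk_id']} | {c.get('section_path', '')}]\n{c['chunk_text']}\n")
    parts.append(f"Question: {question}")
    return "\n".join(parts)
```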

Tuesday, April 28, 2026

  • Test small local LLM workflow on laptop

Wednesday, April 29, 2026

  • Test larger local LLM workflow on desktop

Thursday, April 30, 2026

  • Compare laptop and desktop results

  • Note where smaller vs larger models are sufficient

Friday, May 01, 2026

  • Write a “local multi-machine workflow” note

  • Weekly summary


Week 9: Cloud interface planning and first public content

Monday, May 04, 2026

  • Define public site sections

  • Decide what content belongs on the site first

Tuesday, May 05, 2026

  • Create first project page draft

  • Create first study-log or blog page draft

Wednesday, May 06, 2026

  • Create publishing_manifest.csv

  • Mark first publishable items

Thursday, May 07, 2026

  • Choose site hosting direction and note cost assumptions

Friday, May 08, 2026

  • Review cloud interface requirements and update plan

  • Weekly summary


Week 10: First cloud-facing portfolio prototype

Monday, May 11, 2026

  • Set up site skeleton

  • Create homepage content draft

Tuesday, May 12, 2026

  • Add project page template

  • Add study-log page template

Wednesday, May 13, 2026

  • Publish first test content locally or to staging

Thursday, May 14, 2026

  • Review readability, clarity, and recruiter usefulness

Friday, May 15, 2026

  • Mark first usable version complete

  • Write milestone summary

  • Update master plan for Phase 2 expansion


15. “Completed” Definition for Version 1

Version 1 is complete when all of the following are true:

  • raw documents are stored in an organized structure

  • Docling parsing works for modern PDFs

  • Markdown and JSON outputs are generated consistently

  • chunk records exist with metadata

  • basic local search works

  • daily logging habit is active

  • at least a few public-facing writeups exist

  • the cloud interface has at least a first usable portfolio prototype


16. Phase 2 After Version 1

After the first complete version, next priorities may include:

  • better semantic retrieval

  • hosted private search

  • stronger auth

  • richer public project pages

  • cleaner sync between laptop, desktop, and cloud

  • improved logging and blog workflows

  • later cloud-hosted model access if affordable and useful


17. Weekly Operating Pattern

A good weekly pattern is:

Monday

Plan and define one clear goal.

Tuesday

Build or implement.

Wednesday

Test and inspect.

Thursday

Refine and document.

Friday

Summarize, publish notes, and plan next week.

Optional weekend time can be used for:

  • backlog cleanup

  • reading

  • extra writing

  • Network+ study

  • public post polishing


18. Risks and Mitigations

Risk: too many moving parts

Mitigation: keep the first version narrow and standards-driven.

Risk: public site work distracts from core pipeline

Mitigation: publish only selected outputs after local artifacts are stable.

Risk: inconsistent daily documentation

Mitigation: require a minimum daily log even on low-energy days.

Risk: multi-machine confusion

Mitigation: use stable IDs, manifests, and export rules.

Risk: expensive cloud growth

Mitigation: keep heavy processing local and publish selectively.


19. Immediate Next Steps

  • Save this plan into a Google Doc and local markdown file

  • Create the folder structure

  • Create the CSV templates

  • Create the daily markdown templates

  • Start the first daily log for March 09, 2026

  • Select the initial sample documents

  • Begin Week 1 Day 1 tasks


Conclusion

This revised plan defines a practical path for building an LLM-friendly knowledge base that is also a visible public-facing professional interface.

The core strategy remains:

  • local-first processing

  • open-source tools

  • Docling as the primary parser

  • Markdown + JSON as canonical outputs

  • semantic chunks with metadata

  • daily written reflection

  • gradual publication to a cloud-hosted portfolio

The major refinement in this version is that the cloud interface is now a required part of the system, not a possible later extra. This matters because the project is not only for private utility. It is also for demonstrating skill, discipline, and growth to outside viewers.

The project should therefore be treated as both:

  1. a technical system

  2. a documented learning journey

That combination is a strength. It means the work itself becomes part of the portfolio.


Revision Log

### 2026-03-09
- Revised plan to Version 2.
- Made cloud-hosted interface a required component.
- Added multi-machine local/cloud architecture considerations.
- Added daily writing and reflection as formal project requirements.
- Added minimal checklists.
- Added file templates.
- Added day-by-day timeline from March 09, 2026 through first completion milestone.

### YYYY-MM-DD
- Updated:
- Reason:
- Next decision:
