Master Plan v2: Local-first LLM Knowledge Base, Cloud Portfolio, & Study Log System
Living markdown document for daily use, revision, and future planning
Introduction
This document is the current master plan for building a local-first, open-source, LLM-friendly knowledge base that will also become a cloud-hosted professional interface for recruiters, prospective employers, collaborators, and public readers.
This project is not only a software system. It is also part of a larger learning program. Its purpose is to help develop practical skill in:
- Python
- document processing
- LLM workflows
- retrieval and search
- cloud/web deployment
- software architecture
- technical writing
- reflective documentation
- portfolio presentation
The project will serve several roles at once:
- a personal knowledge base for study and project work
- a local document-processing and retrieval system
- a cloud-hosted portfolio and access interface
- a structured public record of learning, progress, and technical growth
- a platform that can later integrate multiple LLMs across different machines
This is a living document. It should be updated frequently as decisions are made, tools are tested, issues are discovered, and new requirements appear.
1. Project Purpose and Scope
1.1 Core Purpose
The core purpose of this project is to create a durable system for learning, building, documenting, and presenting technical work.
This includes:
- converting documents into LLM-friendly forms
- preserving them in structured formats
- enabling later retrieval and search
- using them in local and cloud-accessible systems
- publishing selected outputs as part of a professional portfolio
This project exists in service of:
- academic study
- hands-on project learning
- technical experimentation
- professional development
- future employability
1.2 Primary Outcomes
The system must support all of the following:
1.2.1 Local knowledge ingestion and processing
Documents should be parsed and stored locally first.
1.2.2 Structured archival
The knowledge base should preserve source documents, structured parsed outputs, chunked retrieval records, and metadata.
1.2.3 Local LLM use
The system should work with the user’s local machines and local LLMs.
1.2.4 Cloud-hosted interface
A cloud-hosted interface is required, not optional. It will be the outward-facing access point for:
- recruiters
- prospective employers
- collaborators
- public readers
- the user from remote locations
This interface will include some combination of:
- project pages
- blog or mini-report entries
- study logs
- selected searchable content
- summaries of technical projects
- evidence of progress and skill development
1.2.5 Multi-machine interoperability
The system should be designed with practical interfacing between:
- a low-end laptop with a smaller local LLM
- a mid-range desktop with larger local LLM options
- a future cloud-hosted service or model endpoint
1.2.6 Ongoing learning documentation
Every workday should produce some written record, even if brief. This is a formal requirement of the project.
Possible daily written outputs include:
- progress log entry
- decisions log entry
- issue/resolution note
- study reflection
- mini-report
- blog draft
- weekly summary
1.3 Success Criteria
This project is successful if it becomes a working system where:
- documents are ingested in an organized way
- Markdown and JSON are produced reliably
- chunks and metadata are searchable
- local LLM workflows become easier and less confusing
- the cloud site presents selected work clearly
- daily or near-daily written reflection accumulates
- the project demonstrates technical growth to outside readers
2. Guiding Principles
2.1 Local-first, cloud-accessible
Processing begins locally, but the system must be able to publish selected outputs to the cloud interface.
2.2 Open-source-first
Prefer tools that are open-source, scriptable, understandable, and replaceable.
2.3 One source of truth
Each raw document should exist once as a preserved original, with derived outputs generated from it.
2.4 Structure over flattening
Preserve headings, sections, tables, metadata, and document boundaries whenever possible.
2.5 Learning through doing
This is a study system as well as a product system. Decisions, experiments, mistakes, and revisions are part of the output.
2.6 Small repeatable steps
Prefer workflows that can be done incrementally and tested often.
2.7 Public professionalism
Selected outputs should eventually be suitable for public viewing by recruiters or future colleagues.
3. Chosen Technical Direction
3.1 Primary parser
Docling is the primary parser.
Why
- open-source
- local-first
- supports AI-oriented document workflows
- suitable for modern selectable-text PDFs
- exports structured output
- Python-friendly
3.2 Fallback parser for scanned PDFs
LlamaParse is the planned fallback for problematic scanned or image-based PDFs.
Use case
Only for documents that do not parse well enough locally.
Folder separation
Scanned/problematic files should be isolated in a separate subfolder from the beginning.
3.3 Canonical parsed outputs
The canonical parsed outputs are:
- Markdown
- JSON
Markdown is for
- human readability
- LLM readability
- website rendering
- documentation
- easier review
JSON is for
- structure
- metadata
- automation
- chunking
- search systems
- future integrations
3.4 Canonical workflow
The standard workflow is:
source file → parser → Markdown/JSON → semantic chunks with source metadata → vector DB or searchable archive → local/cloud interface
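This workflow can be sketched as composable stages. The following is a minimal illustration only; every function body here is a stub standing in for the real parser, chunker, and index code:

```python
# Sketch of the canonical workflow as composable stages.
# All names and return values here are illustrative stubs, not a fixed API.

def parse(source_path: str) -> dict:
    """Parser stage: in the real pipeline this would call Docling."""
    return {"document_id": "DOC-000001", "markdown": "# Example Title", "elements": []}

def chunk(parsed: dict) -> list[dict]:
    """Chunking stage: split parsed output into semantic chunks with metadata."""
    return [{
        "chunk_id": f"{parsed['document_id']}-CH-0001",
        "document_id": parsed["document_id"],
        "chunk_text": parsed["markdown"],
    }]

def index(chunks: list[dict]) -> dict:
    """Indexing stage: stubbed as an in-memory dict keyed by chunk ID."""
    return {c["chunk_id"]: c for c in chunks}

def run_pipeline(source_path: str) -> dict:
    """Chain the stages: source file -> parse -> chunk -> searchable archive."""
    return index(chunk(parse(source_path)))

archive = run_pipeline("originals/professional/example.pdf")
```

Keeping each stage a separate function means any stage can be swapped (for example, LlamaParse in place of Docling) without touching the others.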
4. System Architecture
4.1 Architecture layers
The system should have five main layers.
Layer A: source preservation
Original documents remain untouched.
Layer B: parsing and normalization
Documents are converted into canonical Markdown and JSON.
Layer C: retrieval preparation
Chunks, metadata, embeddings, and indexable records are generated.
Layer D: local interfaces
Used for local search, testing, and local LLM workflows.
Layer E: cloud interface
Used for public presentation, recruiter-facing portfolio pages, selected search, and possibly authenticated personal access.
4.2 Machine roles
Laptop
Primary use:
- light parsing
- testing
- small local LLM workflows
- note-taking
- writing
- remote work
Desktop
Primary use:
- heavier parsing
- embedding generation
- larger local LLM workflows
- more demanding experimentation
Cloud
Primary use:
- public portfolio interface
- selected private access
- publishing project summaries and logs
- later potential hosted inference or retrieval features
4.3 Interoperability goal
The system should allow files and outputs to move cleanly between machines.
At minimum, this requires:
- consistent folder naming
- stable document IDs
- clear manifests
- portable file formats
- machine-independent metadata
- sync/export rules
5. Folder Structure
knowledge_base/
│
├── originals/
│   ├── professional/
│   ├── hobby/
│   ├── gaming/
│   ├── scanned/
│   └── misc/
│
├── parsed/
│   ├── md/
│   ├── json/
│   ├── tables/
│   └── images/
│
├── chunks/
│   ├── jsonl/
│   ├── debug/
│   └── exported/
│
├── indexes/
│   ├── keyword/
│   ├── vector/
│   └── hybrid/
│
├── manifests/
│   ├── documents.csv
│   ├── processing_log.csv
│   ├── chunk_manifest.csv
│   ├── daily_log.csv
│   └── publishing_manifest.csv
│
├── logs/
│   ├── daily/
│   ├── weekly/
│   ├── issues/
│   └── decisions/
│
├── scripts/
│   ├── ingest/
│   ├── parse/
│   ├── chunk/
│   ├── embed/
│   ├── search/
│   ├── publish/
│   └── utilities/
│
├── tests/
│   ├── smoke/
│   ├── integration/
│   └── fixtures/
│
├── site/
│   ├── content/
│   ├── public/
│   ├── private/
│   ├── templates/
│   └── exports/
│
├── config/
│   ├── settings.yaml
│   ├── categories.yaml
│   ├── models.yaml
│   └── secrets_template.env
│
└── docs/
    ├── master_plan.md
    ├── architecture.md
    ├── roadmap.md
    ├── decisions.md
    ├── issues_and_resolutions.md
    └── study_notes.md
6. Required Data Standards
6.1 Document ID format
Each source document gets a stable ID.
Example:
DOC-000001
DOC-000002
DOC-000003
6.2 Chunk ID format
DOC-000001-CH-0001
DOC-000001-CH-0002
6.3 Daily log entry ID format
LOG-2026-03-09
LOG-2026-03-10
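These three formats are simple enough to implement as small helpers. A minimal sketch (function names are my own, not part of any existing script):

```python
# Helpers implementing the document, chunk, and daily log ID formats above.

def document_id(n: int) -> str:
    """Stable document ID, zero-padded to six digits."""
    return f"DOC-{n:06d}"

def chunk_id(doc_id: str, k: int) -> str:
    """Chunk ID derived from its parent document ID."""
    return f"{doc_id}-CH-{k:04d}"

def log_id(iso_date: str) -> str:
    """Daily log entry ID keyed by ISO date."""
    return f"LOG-{iso_date}"

print(document_id(1))               # DOC-000001
print(chunk_id(document_id(1), 2))  # DOC-000001-CH-0002
print(log_id("2026-03-09"))         # LOG-2026-03-09
```

Fixed-width zero padding keeps the IDs sortable as plain strings, which matters for manifests and filenames.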
6.4 Required document metadata
Each document should track:
- document_id
- title
- source_filename
- source_path
- category
- file_type
- visibility
- ingest_date
- last_processed_date
- parser_used
- parse_status
- language
- page_count
- has_selectable_text
- is_scanned
- contains_tables
- contains_images
- contains_ocr
- checksum_hash
- notes
6.5 Required chunk metadata
Each chunk should track:
- chunk_id
- document_id
- document_title
- section_path
- page_start
- page_end
- chunk_order
- chunk_type
- token_estimate
- embedding_status
- visibility
- chunk_text
6.6 Required daily progress metadata
Each workday should track:
- date
- time_spent
- project_area
- goal_for_day
- work_completed
- decision_made
- issue_found
- issue_resolved
- resources_used
- next_step
- public_summary_ready
- notes
7. File Templates
7.1 documents.csv
document_id,title,source_filename,source_path,category,file_type,visibility,ingest_date,last_processed_date,parser_used,parse_status,language,page_count,has_selectable_text,is_scanned,contains_tables,contains_images,contains_ocr,checksum_hash,notes
7.2 processing_log.csv
timestamp,document_id,script_name,action,status,message,output_md_path,output_json_path,duration_seconds
7.3 chunk_manifest.csv
chunk_id,document_id,document_title,section_path,page_start,page_end,chunk_order,chunk_type,token_estimate,embedding_status,visibility,chunk_path
7.4 daily_log.csv
date,time_spent_hours,project_area,goal_for_day,work_completed,decision_made,issue_found,issue_resolved,resources_used,next_step,public_summary_ready,notes
7.5 publishing_manifest.csv
content_id,source_type,source_ref,title,visibility,publish_status,target_location,last_reviewed_date,notes
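Initializing these manifests can be a one-off script. A sketch (shown here for `daily_log.csv`; the other manifests follow the same pattern with their own header rows; the demo writes to a temporary directory rather than `knowledge_base/manifests/`):

```python
import csv
import tempfile
from pathlib import Path

# Column names copied from the daily_log.csv template above.
DAILY_LOG_HEADER = [
    "date", "time_spent_hours", "project_area", "goal_for_day",
    "work_completed", "decision_made", "issue_found", "issue_resolved",
    "resources_used", "next_step", "public_summary_ready", "notes",
]

def init_manifest(path: Path, header: list[str]) -> None:
    """Create an empty manifest CSV with its header row, if it does not exist."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            csv.writer(f).writerow(header)

# Demo in a temporary directory standing in for knowledge_base/manifests/.
manifests_dir = Path(tempfile.mkdtemp())
init_manifest(manifests_dir / "daily_log.csv", DAILY_LOG_HEADER)
```

The existence check makes the script safe to rerun, so it can be part of routine project setup on any machine.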
7.6 Daily markdown progress note template
# Daily Progress Log - YYYY-MM-DD
## Goal
-
## Time Spent
-
## Work Completed
-
## Decisions Made
-
## Problems or Questions
-
## Resources Used
-
## What I Learned
-
## Next Step
-
## Public-Facing Summary Draft
-
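Stamping out this template each morning can also be scripted. A minimal sketch (the `daily_note` helper is hypothetical, not an existing script):

```python
from datetime import date

# Section headings mirror the daily progress note template above.
SECTIONS = [
    "Goal", "Time Spent", "Work Completed", "Decisions Made",
    "Problems or Questions", "Resources Used", "What I Learned",
    "Next Step", "Public-Facing Summary Draft",
]

def daily_note(day: date) -> str:
    """Render an empty daily progress note for the given date."""
    lines = [f"# Daily Progress Log - {day.isoformat()}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "-", ""]
    return "\n".join(lines)

note = daily_note(date(2026, 3, 9))
```

Writing `note` to `logs/daily/LOG-YYYY-MM-DD.md` would keep the daily writing requirement down to filling in bullets.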
7.7 Decision note template
# Decision Note - YYYY-MM-DD - [Short Title]
## Context
-
## Options Considered
-
## Decision
-
## Reason
-
## Tradeoffs
-
## Follow-up Needed
-
7.8 Issue/resolution template
# Issue Log - YYYY-MM-DD - [Short Title]
## Problem
-
## Symptoms
-
## Cause
-
## Fix Attempted
-
## Resolution
-
## Prevention / Lesson Learned
-
7.9 Weekly summary template
# Weekly Summary - Week of YYYY-MM-DD
## Main Accomplishments
-
## Documents Added
-
## Scripts Built or Updated
-
## Problems Resolved
-
## Study Progress
-
## Portfolio/Public Writing Progress
-
## Priorities for Next Week
-
8. Markdown and JSON Standards
8.1 Markdown rules
Markdown output should preserve:
- a single H1 title where possible
- heading hierarchy
- lists
- section boundaries
- tables where feasible
- captions if recoverable
- page markers when useful for debugging or later citation
8.2 JSON rules
JSON should preserve structured document elements whenever possible.
Suggested shape:
{
  "document_id": "DOC-000001",
  "title": "Example Title",
  "source_filename": "example.pdf",
  "parser": "docling",
  "processed_at": "2026-03-09",
  "language": "en",
  "elements": [
    {
      "type": "heading",
      "level": 1,
      "text": "Example Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "Example paragraph text.",
      "page": 1
    }
  ]
}
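One reason this element shape works well: it carries enough structure to regenerate readable Markdown on demand, so the two canonical outputs stay consistent. A sketch (handling only headings and paragraphs; other element types would be added the same way):

```python
# Rebuild Markdown from the JSON "elements" shape suggested above.

def elements_to_markdown(doc: dict) -> str:
    out = []
    for el in doc["elements"]:
        if el["type"] == "heading":
            out.append("#" * el["level"] + " " + el["text"])
        elif el["type"] == "paragraph":
            out.append(el["text"])
    return "\n\n".join(out)

doc = {
    "document_id": "DOC-000001",
    "elements": [
        {"type": "heading", "level": 1, "text": "Example Title", "page": 1},
        {"type": "paragraph", "text": "Example paragraph text.", "page": 1},
    ],
}
markdown = elements_to_markdown(doc)
```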
8.3 Table handling
Tables should be preserved structurally when possible.
Preferred order:
- CSV or JSON for actual structure
- HTML if needed for rendering
- Markdown table only when simple enough
9. Chunking and Retrieval Design
9.1 Chunking objective
The objective is to create chunks that remain meaningful when retrieved on their own.
9.2 Chunking strategy
Prefer chunking by:
- heading
- subsection
- paragraph group
- table boundary
- image/caption boundary where relevant
Avoid purely blind fixed-length splitting as the main approach.
9.3 Target chunk size
Initial target:
-
about 250–700 tokens
-
overlap only where useful
-
preserve heading context in each chunk
9.4 Retrieval modes
The system should eventually support:
- keyword search
- semantic search
- hybrid search
- filtered search by category, visibility, date, or project area
10. Local and Cloud Interfaces
10.1 Local interface goals
Support:
- script-driven processing
- local search
- local retrieval testing
- local LLM prompt assembly
10.2 Cloud interface goals
Support:
- portfolio presentation
- project pages
- public blog or mini-reports
- selected study logs
- selected searchable content
- future authenticated sections
10.3 Public/private content classes
Each item should eventually be marked as one of:
- public
- private
- restricted
- draft
10.4 Publishing workflow
A future publishing path should look like:
daily/weekly writing → review → mark visibility → add to publishing manifest → export to site content
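The export step at the end of this path reduces to filtering the publishing manifest. A sketch using the column names from section 7.5 (the `"approved"` status value is an assumed vocabulary, not yet defined in the manifest):

```python
# Select manifest rows that are safe to export to the public site.
# Column names follow publishing_manifest.csv; "approved" is an assumed
# publish_status value for illustration.

def publishable(rows: list[dict]) -> list[dict]:
    return [r for r in rows
            if r["visibility"] == "public" and r["publish_status"] == "approved"]

rows = [
    {"content_id": "LOG-2026-03-09", "visibility": "public",  "publish_status": "approved"},
    {"content_id": "LOG-2026-03-10", "visibility": "private", "publish_status": "approved"},
    {"content_id": "LOG-2026-03-11", "visibility": "public",  "publish_status": "draft"},
]
to_publish = publishable(rows)
```

Requiring both conditions means nothing reaches `site/content/` without passing the review and visibility steps first.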
11. Learning, Reflection, and Documentation Requirements
11.1 This is a learning system
This project is part of a larger study program that also includes CompTIA Network+ study.
The project should therefore produce evidence of:
- learning goals
- practice applied
- experiments performed
- tools evaluated
- problems encountered
- lessons learned
- revisions made
11.2 Daily writing requirement
Each workday should produce written output.
Minimum acceptable output:
- one short daily log entry
Preferred output:
- a short daily log, plus
- one decision note, issue note, or public-facing summary draft
11.3 Public-facing writing requirement
At least some daily or weekly work should eventually be shaped into public portfolio material.
Typical forms:
- mini-report
- project note
- architecture note
- study reflection
- “what I built / what I learned” post
Target size:
- usually under one page of text
11.4 Calendar integration
Weekly goals may be copied into Google Calendar or other planning tools.
This plan should therefore remain clear, modular, and easy to excerpt.
12. Minimal Checklists
12.1 Project setup checklist
- Create root folder structure
- Create `docs/` files
- Create manifest CSV templates
- Create daily log template
- Define document ID rules
- Define visibility categories
- Select initial sample documents
12.2 Parser setup checklist
- Install Docling
- Confirm environment works
- Parse one test PDF
- Save Markdown output
- Save JSON output
- Record result in `processing_log.csv`
- Inspect output quality manually
12.3 Metadata checklist
- Assign `document_id`
- Fill category
- Fill visibility
- Record parser used
- Record parse status
- Store checksum/hash
- Add notes if parsing is imperfect
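The checksum step is a few lines with the standard library. A sketch (streamed reads so large PDFs don't load fully into memory; the demo hashes a temporary file rather than a real original):

```python
import hashlib
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of file contents, streamed in 64 KiB blocks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a temporary file standing in for a preserved original.
p = Path(tempfile.mkdtemp()) / "example.txt"
p.write_bytes(b"hello")
digest = checksum(p)
```

Storing the digest in `checksum_hash` lets any machine verify that an original survived syncing unchanged.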
12.4 Chunking checklist
- Create chunk script
- Chunk by sections where possible
- Save chunk JSONL
- Fill `chunk_manifest.csv`
- Estimate token counts
- Inspect chunk quality
12.5 Search checklist
- Implement basic keyword search
- Test by title and keyword
- Show chunk text with metadata
- Confirm source traceability
- Log retrieval issues
12.6 Cloud publishing checklist
- Choose hosting path
- Set up domain or domain plan
- Create homepage content
- Create project page template
- Create study log page template
- Create publishing manifest
- Publish first public page
12.7 Daily work checklist
- Define goal for the day
- Complete one visible task
- Write daily progress note
- Record one decision, issue, or lesson
- Identify next step
13. Development Timeline
13.1 Time assumption
Original assumption: 1 hour per weekday.
Updated assumption: likely 10+ hours weekly, with this project and Network+ study taking most available time.
This timeline assumes concentrated weekday work, with optional weekend catch-up or writing.
13.2 Overall expected duration
Estimated initial completion of the first fully usable version:
8 to 10 weeks
This means:
- a working local parsing pipeline
- chunking and basic search
- a first cloud portfolio interface
- a repeatable documentation habit
- a publishable first set of public-facing project materials
14. Daily Timeline from March 09, 2026
This timeline is designed as a practical path to the first usable complete version.
Week 1: Foundation and standards
Monday, March 09, 2026
- Create project root folder structure
- Create `docs/`, `logs/`, `manifests/`, `scripts/`
- Save this master plan into project docs
- Create empty CSV templates
- Write first daily log
Tuesday, March 10, 2026
- Define document categories
- Define visibility categories
- Define document ID and chunk ID formats
- Create `categories.yaml`
- Write decision note on taxonomy choices
Wednesday, March 11, 2026
- Select 5–10 representative sample documents
- Separate modern PDFs from scanned/problematic PDFs
- Record them in `documents.csv`
- Note any uncertain cases
Thursday, March 12, 2026
- Define metadata standards in writing
- Create `architecture.md`
- Create `roadmap.md`
- Write daily log and one public-summary draft
Friday, March 13, 2026
- Review structure created this week
- Clean up filenames and folders
- Update master plan with any clarifications
- Write weekly summary
Week 2: Parser environment and first outputs
Monday, March 16, 2026
- Install or confirm Docling environment
- Create a simple parse test script
- Test environment on one document
Tuesday, March 17, 2026
- Parse first modern selectable-text PDF
- Save Markdown and JSON outputs
- Record processing result
Wednesday, March 18, 2026
- Inspect first outputs manually
- Note heading quality, table quality, missing sections
- Write issue log if needed
Thursday, March 19, 2026
- Parse second and third sample documents
- Compare output quality across documents
- Update notes on parser behavior
Friday, March 20, 2026
- Refine parse script naming/output paths
- Write summary of parser setup progress
- Create first “what I built this week” draft
Week 3: Stable ingestion and manifests
Monday, March 23, 2026
- Add automatic document ID assignment or manual procedure
- Confirm `documents.csv` workflow
Tuesday, March 24, 2026
- Add checksum/hash recording
- Add parse status tracking
- Test manifest updates
Wednesday, March 25, 2026
- Process another small batch of documents
- Verify all outputs land in correct folders
Thursday, March 26, 2026
- Create or refine `processing_log.csv` workflow
- Make logging more consistent
Friday, March 27, 2026
- Review all parsed files so far
- Identify naming problems and fix standards
- Write weekly summary
Week 4: Markdown/JSON quality and cleanup rules
Monday, March 30, 2026
- Define acceptable Markdown output standards
- Define acceptable JSON output standards
Tuesday, March 31, 2026
- Identify common cleanup needs
- Decide what should be automated vs. manual
Wednesday, April 01, 2026
- Create a cleanup utility or checklist
- Test on one or two parsed files
Thursday, April 02, 2026
- Document cleanup rules in `decisions.md`
- Update architecture notes
Friday, April 03, 2026
- Review whether current outputs are good enough to begin chunking
- Write weekly summary
Week 5: Chunking prototype
Monday, April 06, 2026
- Define chunk metadata fields
- Create `chunk_manifest.csv`
Tuesday, April 07, 2026
- Write first chunking script prototype
- Chunk one document by heading/section
Wednesday, April 08, 2026
- Inspect chunk outputs manually
- Check chunk size and coherence
Thursday, April 09, 2026
- Refine chunking logic
- Add section-path preservation
Friday, April 10, 2026
- Process chunks for several test documents
- Write weekly summary and note remaining issues
Week 6: Searchable archive prototype
Monday, April 13, 2026
- Define search output format
- Decide what a useful search result must display
Tuesday, April 14, 2026
- Build simple keyword search over chunk outputs
Wednesday, April 15, 2026
- Test search using realistic questions
- Record retrieval quality problems
Thursday, April 16, 2026
- Add filtering by category, title, or visibility
Friday, April 17, 2026
- Summarize the first useful archive/search version
- Write public-facing mini-report draft
Week 7: Embeddings and semantic retrieval
Monday, April 20, 2026
- Select initial local embedding approach
- Record decision note
Tuesday, April 21, 2026
- Generate embeddings for a small chunk sample
Wednesday, April 22, 2026
- Test first semantic retrieval examples
Thursday, April 23, 2026
- Compare keyword and semantic retrieval
- Note strengths and failures
Friday, April 24, 2026
- Decide whether to proceed immediately with hybrid search or defer
- Write weekly summary
Week 8: Local LLM workflow and prompt assembly
Monday, April 27, 2026
- Define prompt assembly structure
- Choose what metadata accompanies retrieved chunks
Tuesday, April 28, 2026
- Test small local LLM workflow on laptop
Wednesday, April 29, 2026
- Test larger local LLM workflow on desktop
Thursday, April 30, 2026
- Compare laptop and desktop results
- Note where smaller vs. larger models are sufficient
Friday, May 01, 2026
- Write a “local multi-machine workflow” note
- Weekly summary
Week 9: Cloud interface planning and first public content
Monday, May 04, 2026
- Define public site sections
- Decide what content belongs on the site first
Tuesday, May 05, 2026
- Create first project page draft
- Create first study-log or blog page draft
Wednesday, May 06, 2026
- Create `publishing_manifest.csv`
- Mark first publishable items
Thursday, May 07, 2026
- Choose site hosting direction and note cost assumptions
Friday, May 08, 2026
- Review cloud interface requirements and update plan
- Weekly summary
Week 10: First cloud-facing portfolio prototype
Monday, May 11, 2026
- Set up site skeleton
- Create homepage content draft
Tuesday, May 12, 2026
- Add project page template
- Add study-log page template
Wednesday, May 13, 2026
- Publish first test content locally or to staging
Thursday, May 14, 2026
- Review readability, clarity, and recruiter usefulness
Friday, May 15, 2026
- Mark first usable version complete
- Write milestone summary
- Update master plan for Phase 2 expansion
15. “Completed” Definition for Version 1
Version 1 is complete when all of the following are true:
- raw documents are stored in an organized structure
- Docling parsing works for modern PDFs
- Markdown and JSON outputs are generated consistently
- chunk records exist with metadata
- basic local search works
- daily logging habit is active
- at least a few public-facing writeups exist
- the cloud interface has at least a first usable portfolio prototype
16. Phase 2 After Version 1
After the first complete version, next priorities may include:
- better semantic retrieval
- hosted private search
- stronger auth
- richer public project pages
- cleaner sync between laptop, desktop, and cloud
- improved logging and blog workflows
- later cloud-hosted model access, if affordable and useful
17. Weekly Operating Pattern
A good weekly pattern is:
Monday
Plan and define one clear goal.
Tuesday
Build or implement.
Wednesday
Test and inspect.
Thursday
Refine and document.
Friday
Summarize, publish notes, and plan next week.
Optional weekend time can be used for:
- backlog cleanup
- reading
- extra writing
- Network+ study
- public post polishing
18. Risks and Mitigations
Risk: too many moving parts
Mitigation: keep the first version narrow and standards-driven.
Risk: public site work distracts from core pipeline
Mitigation: publish only selected outputs after local artifacts are stable.
Risk: inconsistent daily documentation
Mitigation: require a minimum daily log even on low-energy days.
Risk: multi-machine confusion
Mitigation: use stable IDs, manifests, and export rules.
Risk: expensive cloud growth
Mitigation: keep heavy processing local and publish selectively.
19. Immediate Next Steps
- Save this plan into a Google Doc and a local markdown file
- Create the folder structure
- Create the CSV templates
- Create the daily markdown templates
- Start the first daily log for March 09, 2026
- Select the initial sample documents
- Begin Week 1, Day 1 tasks
Conclusion
This revised plan defines a practical path for building an LLM-friendly knowledge base that is also a visible public-facing professional interface.
The core strategy remains:
- local-first processing
- open-source tools
- Docling as the primary parser
- Markdown + JSON as canonical outputs
- semantic chunks with metadata
- daily written reflection
- gradual publication to a cloud-hosted portfolio
The major refinement in this version is that the cloud interface is now a required part of the system, not an optional later addition. This matters because the project is not only for private utility; it also demonstrates skill, discipline, and growth to outside viewers.
The project should therefore be treated as both:
- a technical system
- a documented learning journey
That combination is a strength. It means the work itself becomes part of the portfolio.
Revision Log
### 2026-03-09
- Revised plan to Version 2.
- Made cloud-hosted interface a required component.
- Added multi-machine local/cloud architecture considerations.
- Added daily writing and reflection as formal project requirements.
- Added minimal checklists.
- Added file templates.
- Added day-by-day timeline from March 09, 2026 through first completion milestone.
### YYYY-MM-DD
- Updated:
- Reason:
- Next decision: