System Architecture and Data Flow
This document provides a comprehensive overview of how the Mosquito Alert Model Monitor integrates with tracked projects and manages data flow.
🏗️ Architecture Overview
graph TB
subgraph "Tracked Projects (e.g., mosquito_model_data_prep, weather-data-collector)"
TP1[Project A Job Script]
TP2[Project B Job Script]
TP3[Project C Job Script]
end
subgraph "Monitor Repository (mosquito-alert-model-monitor)"
subgraph "Scripts"
UJS[update_job_status.sh]
LGS[locked_git_sync.sh]
SDS[slurm_dashboard_sync.sh]
end
subgraph "Data Storage"
SF[data/status/*.json]
HF[data/history/*.json]
LF[data/details/*.log]
end
subgraph "Dashboard"
IDX[index.qmd]
HIST[history.qmd]
JOBS[jobs.qmd]
ALERTS[alerts.qmd]
end
end
subgraph "GitHub"
GR[Git Repository]
GA[GitHub Actions]
GP[GitHub Pages]
end
subgraph "SLURM Cluster"
SC[Cron Jobs]
end
%% Data flow from tracked projects
TP1 -->|calls| UJS
TP2 -->|calls| UJS
TP3 -->|calls| UJS
%% Internal monitor flows
UJS -->|writes immediately| SF
UJS -->|queues for sync| LGS
LGS -->|git add/commit/push| GR
%% Cron sync flow
SC -->|every 15-30 min| SDS
SDS -->|uses| LGS
%% GitHub deployment
GR -->|triggers| GA
GA -->|renders| IDX
GA -->|renders| HIST
GA -->|renders| JOBS
GA -->|renders| ALERTS
GA -->|deploys| GP
%% Dashboard reads data
IDX -->|reads| SF
HIST -->|reads| HF
JOBS -->|reads| SF
ALERTS -->|reads| SF
%% Styling
classDef tracked fill:#e1f5fe
classDef monitor fill:#f3e5f5
classDef github fill:#fff3e0
classDef cluster fill:#e8f5e8
class TP1,TP2,TP3 tracked
class UJS,LGS,SDS,SF,HF,LF,IDX,HIST,JOBS,ALERTS monitor
class GR,GA,GP github
class SC cluster
🔄 Detailed Data Flow
1. Job Status Updates (Real-time)
sequenceDiagram
participant PA as ProjectA
participant PB as ProjectB
participant UJS as update_job_status
participant SF as StatusFiles
participant LGS as locked_git_sync
participant GIT as GitRepository
Note over PA,PB: Multiple projects can run simultaneously
PA->>+UJS: Call with job status
UJS->>SF: Write A.json immediately
UJS->>+LGS: Request git sync A
Note over LGS: Acquire git lock
PB->>+UJS: Call with job status
UJS->>SF: Write B.json immediately
UJS->>LGS: Request git sync B (waits for lock)
LGS->>GIT: git add commit push A changes
LGS-->>-UJS: Sync complete A
UJS-->>-PA: Status update complete
Note over LGS: Release git lock, then acquire it for B
UJS->>+LGS: Now process B sync
LGS->>GIT: git add commit push A and B changes
LGS-->>-UJS: Sync complete B
UJS-->>-PB: Status update complete
2. Periodic Dashboard Sync (Cron)
sequenceDiagram
participant CRON as SLURMCron
participant SDS as dashboard_sync
participant LGS as locked_git_sync
participant GIT as GitRepository
participant GA as GitHubActions
participant GP as GitHubPages
CRON->>+SDS: Every 15-30 minutes
SDS->>+LGS: Request bulk sync
Note over LGS: Acquire git lock (may wait)
LGS->>GIT: git add data (all accumulated changes)
LGS->>GIT: git commit Dashboard sync
LGS->>GIT: git push origin main
LGS-->>-SDS: Sync complete
SDS-->>-CRON: Cron job complete
GIT->>+GA: Trigger on push to main
GA->>GA: quarto render build dashboard
GA->>GP: Deploy to GitHub Pages
GA-->>-GIT: Deployment complete
🔐 Lock Mechanism Details
File Locking Strategy
graph LR
subgraph NoLock ["No Lock Required"]
A1[ProjectA writes A.json]
B1[ProjectB writes B.json]
C1[ProjectC writes C.json]
end
subgraph GitLock ["Git Lock Required"]
GL[git_sync_lock]
GO[Git Operations]
end
A1 -.->|immediate| A1
B1 -.->|immediate| B1
C1 -.->|immediate| C1
A1 -->|request sync| GL
B1 -->|request sync| GL
C1 -->|request sync| GL
GL -->|serialize| GO
classDef nolock fill:#e8f5e8
classDef lock fill:#ffebee
class A1,B1,C1 nolock
class GL,GO lock
Lock File Structure
mosquito-alert-model-monitor/
├── .git_sync_lock/ # Lock directory (created atomically)
│ ├── pid # Process ID holding lock
│ ├── timestamp # When lock was acquired
│ ├── operation # Description of operation
│ └── host # Hostname of locking process
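The lock directory layout above can be produced with a `mkdir`-based lock, since `mkdir` either creates the directory or fails, with nothing in between. The following is a minimal sketch rather than the actual `locked_git_sync.sh` code: the function names `acquire_lock` and `release_lock`, the default `LOCK_DIR` path, and the use of epoch seconds in the `timestamp` file are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of directory-based locking. The helper names, the default LOCK_DIR,
# and the epoch-seconds timestamp format are assumptions, not taken from
# the real locked_git_sync.sh.

LOCK_DIR="${LOCK_DIR:-./.git_sync_lock}"
LOCK_TIMEOUT="${LOCK_TIMEOUT:-30}"               # max wait for the lock (seconds)
LOCK_CHECK_INTERVAL="${LOCK_CHECK_INTERVAL:-1}"  # polling interval (seconds)

# acquire_lock OPERATION
# Returns 0 once the lock is held, 1 if LOCK_TIMEOUT elapses first.
acquire_lock() {
    operation="$1"
    waited=0
    # mkdir succeeds for exactly one caller; everyone else sees a failure
    # and polls until the directory disappears or the timeout is reached.
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        if [ "$waited" -ge "$LOCK_TIMEOUT" ]; then
            return 1
        fi
        sleep "$LOCK_CHECK_INTERVAL"
        waited=$((waited + LOCK_CHECK_INTERVAL))
    done
    # Record who holds the lock, using the file names from the layout above.
    echo "$$" > "$LOCK_DIR/pid"
    date +%s > "$LOCK_DIR/timestamp"
    echo "$operation" > "$LOCK_DIR/operation"
    uname -n > "$LOCK_DIR/host"
    return 0
}

# release_lock: remove the lock directory and its metadata files.
release_lock() {
    rm -rf "$LOCK_DIR"
}
```

A caller would wrap its git commands between `acquire_lock "status update"` and `release_lock`, and skip the sync (leaving the status files on disk) when `acquire_lock` returns non-zero.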
📊 Component Responsibilities
Tracked Projects
- ✅ Call monitoring via update_job_status.sh
- ✅ Continue execution regardless of monitoring success/failure
- ✅ Provide status updates (running, completed, failed)
- ✅ Include progress info (duration, percentage complete)
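One way a tracked project might satisfy these points is to wrap the monitoring call in a helper that swallows every failure. This is a sketch under stated assumptions: `MONITOR_DIR`, the `scripts/` subdirectory, the `report_status` name, and the argument order (job name, status, duration, progress) are hypothetical and should be matched to the real `update_job_status.sh` interface.

```shell
#!/bin/sh
# Sketch only: MONITOR_DIR, the scripts/ path, and the argument order are
# hypothetical; check the real update_job_status.sh before copying this.
MONITOR_DIR="${MONITOR_DIR:-$HOME/mosquito-alert-model-monitor}"

# report_status JOB_NAME STATUS DURATION_SECONDS PROGRESS_PERCENT
# Always returns 0, so a broken or missing monitor cannot stop the job.
report_status() {
    script="$MONITOR_DIR/scripts/update_job_status.sh"
    if [ -x "$script" ]; then
        # Discard output and exit status from the monitoring call.
        "$script" "$@" >/dev/null 2>&1 || true
    fi
    return 0
}

# Example use inside a job script (hypothetical job name and values):
#   report_status "weather-data-collector" "running" 0 0
#   ... run the actual work here ...
#   report_status "weather-data-collector" "completed" 120 100
```

Keeping the wrapper in the tracked project, rather than relying only on the monitor script's own exit code, also covers the case where the monitor repository is not checked out at all.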
update_job_status.sh
- ✅ Write status files immediately (no waiting)
- ✅ Queue git operations via locked_git_sync.sh
- ✅ Never fail (always exit 0)
- ✅ Handle missing directories gracefully
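The "write immediately, never fail" behaviour listed above can be sketched as a write-then-rename into the status directory. This is an illustration, not the real script: `STATUS_DIR`, the `write_status` name, and the JSON field names are assumptions, because the actual status schema is not shown in this document.

```shell
#!/bin/sh
# Sketch only: STATUS_DIR, the function name, and the JSON field names are
# assumptions; the real status schema is not shown in this document.
STATUS_DIR="${STATUS_DIR:-data/status}"

# write_status JOB STATUS DURATION_SECONDS PROGRESS_PERCENT
# Always returns 0 so a monitoring problem cannot fail the calling job.
write_status() {
    job="$1"; status="$2"; duration="$3"; progress="$4"

    # Create the directory if it is missing; give up quietly on failure.
    mkdir -p "$STATUS_DIR" 2>/dev/null || return 0

    tmp="$STATUS_DIR/.$job.json.$$"
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)

    # Write to a temporary file, then rename it into place. A rename within
    # one filesystem is atomic, so readers never see a half-written file.
    # Note: values are not JSON-escaped; duration and progress must be numbers.
    {
        printf '{"job_name": "%s", "status": "%s", "duration": %s, "progress": %s, "last_updated": "%s"}\n' \
            "$job" "$status" "$duration" "$progress" "$now" > "$tmp" &&
        mv "$tmp" "$STATUS_DIR/$job.json"
    } 2>/dev/null
    return 0
}
```

Because each project writes only its own `<job>.json`, concurrent callers never touch the same file, which is why this step needs no lock.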
locked_git_sync.sh
- ✅ Serialize git operations using file locks
- ✅ Handle lock timeouts (max 30 seconds)
- ✅ Clean up stale locks (> 5 minutes old)
- ✅ Retry failed pushes with rebase
- ✅ Preserve data locally if push fails
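Stale lock cleanup, as listed above, could look roughly like the following. It is a sketch that assumes the lock's `timestamp` file holds epoch seconds, which this document does not specify; the function name `clear_stale_lock` and the default `LOCK_DIR` are likewise hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes .git_sync_lock/timestamp contains epoch seconds.
# The function name and default LOCK_DIR are hypothetical.
LOCK_DIR="${LOCK_DIR:-./.git_sync_lock}"
STALE_LOCK_AGE="${STALE_LOCK_AGE:-300}"   # seconds; 300 = 5 minutes

# clear_stale_lock
# Returns 0 when no lock remains (absent or removed as stale),
# 1 when a lock exists and is left in place.
clear_stale_lock() {
    [ -d "$LOCK_DIR" ] || return 0

    acquired=$(cat "$LOCK_DIR/timestamp" 2>/dev/null)
    case "$acquired" in
        ''|*[!0-9]*)
            # No usable timestamp: the holder may still be writing its
            # metadata, so leave the lock alone rather than guess.
            return 1
            ;;
    esac

    age=$(( $(date +%s) - acquired ))
    if [ "$age" -gt "$STALE_LOCK_AGE" ]; then
        rm -rf "$LOCK_DIR"
        return 0
    fi
    return 1
}
```

One limitation of this sketch is that a process that crashes before writing `timestamp` leaves a lock that is never cleared; a production version might fall back to the directory's modification time in that case.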
slurm_dashboard_sync.sh
- ✅ Periodic bulk sync (every 15-30 minutes)
- ✅ Use same locking mechanism as individual updates
- ✅ Minimal resource usage (512MB RAM, 1 CPU, 5 min max)
- ✅ Handle conflicts gracefully
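For the periodic bulk sync, it can help to skip the commit entirely when nothing under `data/` has changed, so that idle periods do not produce errors or empty commits. The sketch below shows that check together with the add, commit, and push steps from the cron sequence diagram. The function names are hypothetical, whether the real `slurm_dashboard_sync.sh` performs such a check is an assumption, and locking is deliberately left out here.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and locking is omitted.
# The real script routes git operations through locked_git_sync.sh.

# has_pending_data_changes REPO_DIR
# Returns 0 when anything under data/ is modified, deleted, or untracked.
has_pending_data_changes() {
    repo_dir="$1"
    [ -n "$(git -C "$repo_dir" status --porcelain -- data 2>/dev/null)" ]
}

# sync_dashboard REPO_DIR
# Commits and pushes accumulated data changes; does nothing when clean.
sync_dashboard() {
    repo_dir="$1"
    has_pending_data_changes "$repo_dir" || return 0
    git -C "$repo_dir" add data &&
        git -C "$repo_dir" commit -q -m "Dashboard sync" &&
        git -C "$repo_dir" push -q origin main
}
```

`git status --porcelain` is used because its output format is stable across git versions, which makes an "is the output empty?" test reliable in scripts.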
Dashboard Pages
- ✅ Read status files for current state
- ✅ Read history files for trends
- ✅ Auto-refresh via GitHub Actions
- ✅ Handle missing data gracefully
🚨 Race Condition Prevention
Problem Scenarios
- Multiple projects updating simultaneously
- Cron sync during project updates
- Git conflicts from concurrent pushes
- Stale locks from crashed processes
Solutions Implemented
- Immediate file writes - No waiting for git operations
- Git-only locking - Files can be written concurrently
- Lock timeouts - Prevent deadlocks
- Stale lock cleanup - Handle crashed processes
- Graceful failures - Preserve data even if sync fails
📈 Performance Characteristics
File Operations
- Status file writes: < 1 second (no lock contention)
- Multiple projects: Can write simultaneously
- File size: ~1-2KB per status file
Git Operations
- Lock acquisition: Usually instant, max 30 seconds
- Commit/push time: 5-30 seconds depending on network
- Conflict resolution: Automatic via rebase
Resource Usage
- Individual updates: Minimal overhead
- Cron sync: 512MB RAM, 1 CPU core, ~1-2 minutes
- Dashboard build: Handled by GitHub Actions (free)
🔧 Configuration Points
Lock Timeouts
LOCK_TIMEOUT=30 # Max wait for git lock (seconds)
LOCK_CHECK_INTERVAL=1 # Check frequency (seconds)
STALE_LOCK_AGE=300 # Clean locks older than 5 minutes
Cron Frequency
*/15 * * * * # Every 15 minutes (active development)
*/30 * * * * # Every 30 minutes (production)
Resource Limits
#SBATCH --mem=512M # Memory limit
#SBATCH --cpus-per-task=1 # CPU limit
#SBATCH --time=00:05:00 # Time limit
🔍 Monitoring and Debugging
Lock Status Check
# Check if git lock is active
ls -la /path/to/monitor/.git_sync_lock/
# View lock details
cat /path/to/monitor/.git_sync_lock/pid
cat /path/to/monitor/.git_sync_lock/operation
Git Sync Logs
# Recent git operations
git log --oneline -10
# Check for conflicts
git status
Status File Verification
# Check latest status files
ls -lt data/status/*.json | head -5
# Validate JSON format
jq . data/status/project_name.json
This architecture ensures that tracked projects can always update their status immediately while git operations are properly serialized to prevent conflicts and data loss.