# Database Association Issue: PL11089 Files Show PL689 Content

## 🔍 Deeper Investigation

You've correctly identified that even after the SQL collation fix, the **actual .bin files being converted** are still showing PL689 content when querying PL11089. This indicates a **database association problem**, not just a query matching issue.

## 📊 The Data Flow

Here's the complete flow from database to PDF:

```
1. Query lr_source_document by document_number
   ↓ (FIXED with binary collation)
2. Query alf_node_properties to get content_url (store://...)
   ↓ (FIXED with binary collation)
3. Parse store:// URL to filesystem path
   ↓
4. Read .bin file from filesystem
   ↓
5. Convert .bin to PDF
   ↓
6. Return PDF to user
```

**The problem occurs at Step 2-4**: The database might be returning the **correct content_url** based on the query, but that URL might point to the **wrong file** (PL689 instead of PL11089).

## 🎯 Three Possible Scenarios

### Scenario A: Database Has Wrong Associations
**Symptom**: `alf_node_properties` table has PL689's files incorrectly tagged as "PL11089"

**Example**:
```sql
-- alf_node_properties might have:
node_id | string_value | content_url
--------|--------------|---------------------------
12345   | "PL11089"    | store://2014/11/3/.../PL689_file.bin  ← WRONG!
```

**Verification**: Run the diagnostic script
```bash
./diagnose_image_associations.py
```

**Solution**: Database correction required (see below)

---

### Scenario B: Cached PDF Serving Old Data
**Symptom**: Cache has old wrong PDF and keeps serving it

**Example**:
```bash
# Cache file might be:
/tmp/aumentum_pdfs/PL11089_doc123.pdf  ← Contains PL689 content from previous bug
```

**Verification**: Clear cache and regenerate
```bash
./clear_cache_and_test.sh
```

**Solution**: Clear cache and test fresh generation

---

### Scenario C: Filesystem Discovery Mixing Files
**Symptom**: `_discover_pages_by_filesystem()` finding wrong files based on timestamp proximity

**Example**:
```
PL689 and PL11089 were scanned at similar times
→ Filesystem discovery finds PL689 files when looking for PL11089 pages
```

**Verification**: Check the diagnostic output for file timestamps

**Solution**: Disable filesystem discovery or improve filtering

## 🧪 Diagnostic Commands

### 1. Run Full Diagnostic
```bash
cd /home/plagis/workspace/plagis_aumentum

# Check database associations
python3 diagnose_image_associations.py
```

**This will show**:
- What lr_source_document says about PL11089 vs PL689
- What alf_node_properties associates with each
- Actual filesystem paths and file existence
- File timestamps (to detect proximity issues)
- Cross-contamination warnings

### 2. Clear Cache and Test Fresh
```bash
./clear_cache_and_test.sh
```

**This will**:
- Clear all PL11089 and PL689 cached PDFs
- Query database for document IDs
- Run diagnostics
- Generate fresh PDF for PL11089
- Save to `/tmp/test_PL11089_fresh.pdf` for manual verification

### 3. Manual Verification
```bash
# After running clear_cache_and_test.sh, open the PDF:
xdg-open /tmp/test_PL11089_fresh.pdf
# or on Mac:
open /tmp/test_PL11089_fresh.pdf

# Verify the document content matches PL11089, not PL689
```

## 🔧 Solutions Based on Scenario

### If Scenario A (Database Association Wrong)

The `alf_node_properties` table has incorrect string_value entries. This requires database correction.

**Option 1: Identify and Fix Wrong Associations**

```sql
-- Find all nodes associated with PL11089
SELECT 
    np.node_id,
    np.string_value,
    q.local_name,
    cu.content_url,
    n.audit_created
FROM LRSAdmin.alf_node_properties np
JOIN LRSAdmin.alf_qname q ON q.id = np.qname_id
JOIN LRSAdmin.alf_node n ON n.id = np.node_id
JOIN LRSAdmin.alf_content_data cd ON cd.id = n.id
JOIN LRSAdmin.alf_content_url cu ON cu.id = cd.content_url_id
WHERE RTRIM(LTRIM(np.string_value)) COLLATE Latin1_General_BIN = 'PL11089'
AND q.local_name IN ('targetRids','sourceRids');

-- Check the creation dates and file paths
-- If they don't match the PL11089 acceptance date, they're wrong

-- To fix (BE CAREFUL - backup first!):
UPDATE LRSAdmin.alf_node_properties
SET string_value = 'PL689'  -- Correct value
WHERE node_id = <wrong_node_id>
AND qname_id = (SELECT id FROM LRSAdmin.alf_qname WHERE local_name = 'targetRids');
```

**Option 2: Exclude Known Wrong Nodes**

If you can identify specific wrong node IDs, add a filter:

```python
# In aumentum_browser_service.py, resolve_store_urls_by_document_number()
# Add exclusion list
KNOWN_WRONG_ASSOCIATIONS = {
    'PL11089': [12345, 67890],  # Node IDs that incorrectly reference PL11089
}

# In the query, add:
# AND n.id NOT IN (?, ?)  -- Exclude wrong nodes
```

### If Scenario B (Cache Issue)

**Solution**: Implemented in code already

The cache now uses `{document_number}_doc{document_id}.pdf` format to prevent collisions.

**Immediate fix**:
```bash
# Clear cache
rm -rf /tmp/aumentum_pdfs/*

# Or use API endpoint:
curl -X DELETE "http://localhost:8001/cache/clear-all"

# Then regenerate:
curl "http://localhost:8001/documents/pdf-by-document-number?document_number=PL11089&document_id=<doc_id>" \
  --output test_fresh.pdf
```

### If Scenario C (Filesystem Discovery Issue)

The `_discover_pages_by_filesystem()` method uses timestamp proximity to find multi-page documents. If PL689 and PL11089 were scanned at similar times, it might grab wrong files.

**Solution**: Add document_number validation

```python
# In aumentum_browser_service.py, _discover_pages_by_filesystem()
# After line 971, add validation:

# ADDITIONAL VALIDATION: Check if files actually belong to this document
# by verifying they're not associated with OTHER document numbers
validated_urls = []
for url in discovered_urls:
    # Check if this file is associated with a DIFFERENT document number
    # via alf_node_properties
    # If yes, skip it
    # If no, include it
    validated_urls.append(url)

return validated_urls
```

## 📋 Investigation Checklist

Run through this checklist to identify the root cause:

- [ ] Run `diagnose_image_associations.py` to see database associations
- [ ] Check if PL11089 and PL689 have DIFFERENT content_url values
- [ ] Verify file timestamps match document acceptance dates
- [ ] Clear cache with `clear_cache_and_test.sh`
- [ ] Generate fresh PDF and manually verify content
- [ ] Check if issue persists after cache clear (indicates database problem)
- [ ] Look for "⚠️ WARNING" messages in diagnostic output

## 🎯 Most Likely Root Cause

Based on your description, **Scenario A** (Database has wrong associations) is most likely:

**Evidence**:
- SQL collation fix applied correctly
- Issue persists in fresh PDF generation
- Files are being fetched from database, not filesystem discovery

**Hypothesis**: The Alfresco `alf_node_properties` table has entries where:
- `string_value = "PL11089"` 
- BUT the `content_url` points to files that actually contain PL689 content

This could happen if:
1. Documents were mislabeled during scanning
2. Manual correction was incomplete
3. Data migration had errors
4. Document number was changed but references weren't updated

## 🚀 Recommended Action Plan

**Phase 1: Diagnose** (5 minutes)
```bash
cd /home/plagis/workspace/plagis_aumentum
python3 diagnose_image_associations.py > diagnosis.txt
cat diagnosis.txt
```

**Phase 2: Clear Cache** (2 minutes)
```bash
./clear_cache_and_test.sh
```

**Phase 3: Manual Verification** (5 minutes)
```bash
# Open the generated PDF
xdg-open /tmp/test_PL11089_fresh.pdf

# Verify:
# - Does it show PL11089 content? ✅ Fixed!
# - Does it show PL689 content? ❌ Database problem
```

**Phase 4: If Still Wrong** (Database Correction Needed)
```sql
-- Backup first!
SELECT * INTO LRSAdmin.alf_node_properties_backup_20251103
FROM LRSAdmin.alf_node_properties;

-- Identify wrong associations
-- (See diagnostic output for specific node IDs)

-- Correct the associations
-- (See "Option 1" above)
```

## 📞 Need Help?

If the diagnostic scripts show database association problems, you'll need to:

1. **Identify** which node IDs are incorrectly associated
2. **Verify** the correct associations (check file contents or timestamps)
3. **Update** the `alf_node_properties.string_value` to fix associations
4. **Retest** to confirm fix

Or consider implementing **Option 2** (exclusion list) as a temporary workaround while database is being corrected.

---

**Next Step**: Run the diagnostic scripts and report the findings!

```bash
./diagnose_image_associations.py
./clear_cache_and_test.sh
```

