# Fix for "Random Images" Bug

## 🐛 The Bug

Our code was returning wrong images for documents (e.g., PL11089 showing PL689 content).

## 🔍 Root Cause - NOW UNDERSTOOD!

### What We Thought

```
Directory structure: YYYY/MM/DD/HOUR/MINUTE/
"All files in same directory = uploaded at same time = same document"
```

### What It Actually Is

```
Directory structure: YYYY/MM/DD/NODE_ID/BATCH_ID/
"Files in same directory = processed by same scanner node in same batch"
```

**Multiple different documents can share the same NODE_ID/BATCH_ID directory!**

---

## 💥 The Problem Explained

### Example: Directory `2025/11/4/10/1/`

This directory contains files from Node 10, Batch 1, which processed:

```
/contentstore/2025/11/4/10/1/
    ├── a1b2c3...bin  ← PL21825 Type 127, Page 1
    ├── d4e5f6...bin  ← PL21825 Type 127, Page 2
    ├── g7h8i9...bin  ← PL20886, Page 1
    ├── j0k1l2...bin  ← PL20886, Page 2
    ├── m3n4o5...bin  ← PL20886, Page 3
    └── p6q7r8...bin  ← Another document...

Total: 39 .bin files from MULTIPLE documents
```

### Our Buggy Code

```python
# ❌ WRONG APPROACH
def get_document_files(document_number):
    # 1. Get ONE content_url from database
    url = db.query("SELECT content_url WHERE doc = ?", document_number)[0]
    # Result: store://2025/11/4/10/1/a1b2c3...bin
    
    # 2. Extract directory
    directory = "2025/11/4/10/1/"
    
    # 3. Get ALL files in directory (MISTAKE!)
    all_files = os.listdir(f"/contentstore/{directory}")
    # Returns: 39 files including PL20886, PL21900, etc.!
    
    # 4. Use timestamp proximity to select "related" files
    files = find_closest_timestamps(all_files, reference_file, page_count=2)
    # Might select: [a1b2c3.bin, g7h8i9.bin]
    #                └─ PL21825    └─ PL20886! WRONG DOCUMENT!
    
    return files
```

**Result:** We return a mix of files from DIFFERENT documents!

---

## ✅ The Fix

### Correct Approach

```python
# ✅ CORRECT APPROACH
def get_document_files(document_number):
    # 1. Get ALL content_urls for this document from database
    urls = db.query("""
        SELECT cu.content_url
        FROM alf_node_properties np
        JOIN alf_node n ON n.id = np.node_id
        JOIN alf_content_data cd ON cd.id = n.id
        JOIN alf_content_url cu ON cu.id = cd.content_url_id
        WHERE np.string_value = ?
        AND cu.content_url IS NOT NULL
    """, document_number)
    
    # Results:
    # [
    #   "store://2025/11/4/10/1/a1b2c3...bin",
    #   "store://2025/11/4/10/1/d4e5f6...bin"
    # ]
    
    # 2. Convert each URL to filesystem path
    files = [parse_store_url(url) for url in urls]
    # Results:
    # [
    #   "/contentstore/2025/11/4/10/1/a1b2c3...bin",
    #   "/contentstore/2025/11/4/10/1/d4e5f6...bin"
    # ]
    
    # 3. Return ONLY these specific files
    return files
    
    # ✅ ONLY returns files for PL21825!
    # ✅ No contamination from other documents!
```

### Key Difference

| Buggy Code | Fixed Code |
|------------|------------|
| Gets ONE URL, extracts directory | Gets ALL URLs for document |
| Lists ALL files in directory | Uses exact files from URLs |
| Uses timestamp proximity | Uses database UUIDs |
| Returns mix of documents | Returns only correct document |

---

## 🔧 Implementation in `aumentum_browser_service.py`

### Update `resolve_store_urls_by_document_number()`

```python
def resolve_store_urls_by_document_number(self, document_number: str) -> List[Dict]:
    """
    Get ALL content URLs for a document number.
    
    CRITICAL: Returns ALL URLs from database, not filesystem discovery!
    """
    conn = self._get_db_connection()
    cursor = conn.cursor()
    
    # Get ALL document IDs for this number
    cursor.execute("""
        SELECT id AS document_id, document_type, page_count
        FROM LRSAdmin.lr_source_document
        WHERE RTRIM(LTRIM(document_number)) = ?
    """, (document_number.strip(),))
    
    docs = cursor.fetchall()
    
    results = []
    for doc_id, doc_type, page_count in docs:
        # Get ALL content URLs for this document ID
        cursor.execute("""
            SELECT DISTINCT
                n.id AS node_id,
                n.uuid,
                cu.content_url,
                cu.content_size
            FROM LRSAdmin.alf_node_properties np
            JOIN LRSAdmin.alf_qname q ON q.id = np.qname_id
            JOIN LRSAdmin.alf_node n ON n.id = np.node_id AND n.node_deleted = 0
            LEFT JOIN LRSAdmin.alf_content_data cd ON cd.id = n.id
            LEFT JOIN LRSAdmin.alf_content_url cu ON cu.id = cd.content_url_id
            WHERE RTRIM(LTRIM(np.string_value)) = ?
            AND q.local_name IN ('targetRids', 'sourceRids')
            AND cu.content_url IS NOT NULL
            ORDER BY cu.content_url
        """, (document_number.strip(),))
        
        images = [dict(zip([col[0] for col in cursor.description], row)) 
                  for row in cursor.fetchall()]
        
        results.append({
            "document_id": doc_id,
            "document_type": doc_type,
            "page_count": page_count,
            "images": images,  # ← ALL URLs from database!
        })
    
    cursor.close()
    conn.close()
    return results
```

### Remove Filesystem Discovery (if database has all URLs)

```python
# ❌ DELETE THIS (causes random image bug):
def _discover_pages_by_filesystem(self, reference_url, expected_page_count):
    # This was listing ALL files in node/batch directory
    # including files from OTHER documents!
    pass

# ✅ ONLY use this if database truly has incomplete data:
def _discover_pages_by_filesystem_SAFE(self, document_number, expected_page_count):
    # Only use if database has 0 URLs but files exist
    # AND only as last resort with clear warning
    pass
```

---

## 🧪 Testing the Fix

### Test Case 1: PL11089

```python
# Before fix:
result = service.resolve_store_urls_by_document_number("PL11089")
# Returns: 46 pages (wrong - includes PL689 content!)

# After fix:
result = service.resolve_store_urls_by_document_number("PL11089")
# Returns: Only PL11089's actual pages from database
```

### Test Case 2: PL21825

```python
# Once database linking completes:
result = service.resolve_store_urls_by_document_number("PL21825")

# Should return 3 groups:
# Group 1 (Type 103): 50 URLs from Node 9, Batch 15
# Group 2 (Type 127): 2 URLs from Node 10, Batch 1
# Group 3 (Type 126): 2 URLs from Node 10, Batch 4

# Total: 54 URLs, ALL correct, no contamination!
```

---

## 📋 Summary

### Why We Had Random Images

1. **Misunderstood directory structure**
   - Thought: HOUR/MINUTE (time-based)
   - Actually: NODE_ID/BATCH_ID (scanner-based)

2. **Wrong assumption**
   - Thought: "Same directory = same document"
   - Actually: "Same directory = same node/batch = MULTIPLE documents"

3. **Filesystem discovery contamination**
   - Listed ALL files in directory
   - Included files from other documents
   - Picked "closest" files by timestamp (wrong!)

### The Fix

1. **Always query database for ALL content URLs**
2. **Use exact UUIDs from database**
3. **Never list directory contents**
4. **Trust the database, not the filesystem**

### Result

- ✅ No more random images
- ✅ Correct files for each document
- ✅ Proper handling of multi-type documents
- ✅ No cross-contamination between documents

---

## 🎯 Key Takeaway

**The directory structure is for LOAD DISTRIBUTION, not DOCUMENT ORGANIZATION!**

```
Same NODE_ID/BATCH_ID directory can contain:
    ├── Document A files
    ├── Document B files
    ├── Document C files
    └── Document D files

To get Document A files: Query database for Document A's UUIDs
To get Document B files: Query database for Document B's UUIDs

NEVER list directory and assume all files are related!
```

**Always use the database as the source of truth!**

