
Entropy Analysis RAM Usage Optimization

Overview

The current implementation of the analyze entropy command can consume excessive RAM when processing large workloads (1GB+ datasets). This document analyzes the RAM bottlenecks and provides actionable recommendations for optimization without compromising functionality.

Current Architecture Analysis

RAM Usage Bottlenecks Identified

1. Complete Result Set Accumulation (Primary Issue)

Location: cli/commands/analyze.py:751-810, 821-846

The current flow accumulates ALL results in memory before displaying:

# In _analyze_files_with_single_progress() and _analyze_files_without_progress()
all_results: dict[str, dict[str, float | list[EntropyRegion] | str | None]] = {}

# For each file processed:
all_results[str(file_path)] = {
    "overall_entropy": overall_entropy,
    "regions": regions,  # <-- This can be huge for large files
    "mime_type": mime_type,
    "detected_language": detected_language,
}

RAM Impact: For a 1GB file with 64-byte analysis blocks and a 16-byte step size, this creates ~67M entropy regions, with each EntropyRegion object consuming roughly 200 bytes (Python object overhead plus the 32-byte data sample).

Calculation:
  - Regions per file: (file_size - analysis_block_size) / step_size + 1
  - For a 1GB file: (1,073,741,824 - 64) / 16 + 1 = 67,108,861 regions (~67M)
  - Memory per region: ~200 bytes (object + data sample)
  - Total RAM for regions: ~12.5 GB per 1GB file
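
A quick back-of-the-envelope sketch of this arithmetic (the 200-byte per-region figure is an estimate, not a measured value):

def estimate_region_ram(file_size: int, block_size: int = 64, step_size: int = 16,
                        bytes_per_region: int = 200) -> tuple[int, float]:
    """Estimate region count and RAM (in GiB) under the current accumulation strategy."""
    regions = (file_size - block_size) // step_size + 1
    ram_gib = regions * bytes_per_region / 1024**3
    return regions, ram_gib

regions, ram_gib = estimate_region_ram(1024**3)  # 1GB file
print(f"{regions:,} regions, ~{ram_gib:.1f} GiB")  # 67,108,861 regions, ~12.5 GiB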

2. Data Sample Storage in Regions (Secondary Issue)

Location: analyze/services/entropy/analyzer.py:481

Each EntropyRegion stores a 32-byte data sample:

region = EntropyRegion(
    # ... other fields ...
    data_sample=block_data[:32],  # 32 bytes per region
)

RAM Impact: For 67M regions × 32 bytes = ~2GB just for data samples.

3. Multiple Result Processing Passes (Tertiary Issue)

Location: cli/commands/analyze.py:972-1033, 1036-1104

Results are processed multiple times for different outputs:
  - Once for table display preparation
  - Once for Excel export (if enabled)
  - Once for JSON export (if enabled)
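
Illustratively, the pattern is roughly the following, with each pass walking the full all_results dict while it sits in memory (the helper names here are placeholders, not the actual functions in analyze.py):

# Hypothetical sketch of the multi-pass pattern over the accumulated results
table_rows = build_table_rows(all_results)             # pass 1: table display
if excel_export_enabled:
    export_results_to_excel(all_results, excel_path)   # pass 2: Excel export
if json_export_enabled:
    export_results_to_json(all_results, json_path)     # pass 3: JSON export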

Optimization Recommendations

Priority 1: Streaming Results Architecture

1.1 Implement Streaming Results Processing

File: cli/commands/analyze.py

Replace the current accumulation pattern with streaming:

# Instead of accumulating all_results, process immediately
def _analyze_files_streaming(
    files: list[Path],
    analyzer: EntropyAnalyzer,
    rich_output: RichOutputService,
    # ... other params
    summary_collector: ResultsSummaryCollector,  # New component
    streaming_output_enabled: bool = False,  # Emit per-file results as they are produced
) -> None:
    """Stream analysis results without holding all in memory."""

    for file_path in files:
        # Analyze single file
        overall_entropy, regions = analyzer.analyze_file_entropy(...)

        # Process results immediately
        file_result = FileAnalysisResult(
            file_path=file_path,
            overall_entropy=overall_entropy,
            # Process regions immediately rather than storing
        )

        # Update running summary statistics
        summary_collector.update_with_file_result(file_result)

        # Stream to output immediately (optional)
        if streaming_output_enabled:
            rich_output.stream_file_result(file_result)

1.2 Create Results Summary Collector

New File: analyze/services/summary_collector.py

from collections import defaultdict

class ResultsSummaryCollector:
    """Collects running statistics without storing full results."""

    def __init__(self):
        self.total_files = 0
        self.total_risky_files = 0
        self.risk_level_counts = defaultdict(int)
        # Other running statistics

    def update_with_file_result(self, result: FileAnalysisResult):
        """Update running statistics from single file result."""
        # Extract and accumulate key metrics
        # Classify risk level and increment counters
        # No need to store the full result

Priority 2: Selective Region Storage

2.1 Filter Regions by Risk Level

File: analyze/services/entropy/analyzer.py

Store only regions at or above a configurable risk threshold:

def _process_chunk_sliding_windows(
    # ... existing params
    risk_threshold: EntropyLevel = EntropyLevel.MEDIUM_HIGH,
    include_samples: bool = True,  # Allow callers to drop the 32-byte data samples
) -> tuple[list[EntropyRegion], int, bytes]:
    """Process sliding windows, keeping only risky regions."""

    regions: list[EntropyRegion] = []
    region_count = 0

    # ... existing processing logic ...

    # Only store region if it meets risk threshold
    if entropy_level.value >= risk_threshold.value:
        region = EntropyRegion(
            offset=block_offset,
            size=len(block_data),
            confidence=self._calculate_confidence(block_entropy, params.file_type),
            entropy=block_entropy,
            level=entropy_level,
            data_sample=block_data[:32] if include_samples else b"",
        )
        regions.append(region)

    # Always update region count for accurate statistics
    region_count += 1

RAM Reduction: Instead of 67M regions, store only high-risk regions (estimated 5-10% = ~3-7M regions).

2.2 Optional Data Sample Reduction

Add a configuration option to exclude data samples:

# In configuration
class AnalysisConfiguration:
    include_data_samples: bool = False  # Default off for large datasets
    max_regions_per_file: int = 10000   # Cap regions per file
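
A minimal sketch of how the max_regions_per_file cap could be enforced (the helper and its call site are hypothetical; the real analyzer decides where to hook this in):

def should_store_region(regions_stored: int, max_regions_per_file: int | None) -> bool:
    """Return True if another region may still be stored for the current file."""
    return max_regions_per_file is None or regions_stored < max_regions_per_file

# Inside the region-collection loop (sketch):
# if should_store_region(len(regions), config.max_regions_per_file):
#     regions.append(region)
# region_count += 1  # counting continues even when storage is capped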

Priority 3: Incremental Output Generation

3.1 Progressive Table Display

File: core/services/rich_output/service.py

class RichOutputService:
    def stream_analysis_progress(
        self,
        current_file: str,
        file_summary: FileSummary,
        running_totals: RunningTotals,
    ):
        """Update display progressively instead of final table."""
        # Update progress display
        # Show running statistics
        # Highlight concerning files immediately
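
Assuming RichOutputService is built on the Rich library, one way to sketch the progressive display is with rich.live.Live: rebuild a small running-summary table per file instead of one large final table. The loop below and analyze_and_count_risky are placeholders standing in for the streaming logic from Priority 1.

from rich.live import Live
from rich.table import Table

def build_summary_table(files_done: int, risky_files: int, current_file: str) -> Table:
    """Build a compact running-summary table for the live display."""
    table = Table(title="Entropy analysis (running summary)")
    table.add_column("Files analyzed")
    table.add_column("Risky files")
    table.add_column("Current file")
    table.add_row(str(files_done), str(risky_files), current_file)
    return table

# Illustrative usage: refresh one live region per file instead of building a final table
with Live(build_summary_table(0, 0, "-"), refresh_per_second=4) as live:
    for done, file_path in enumerate(files, start=1):    # `files` from the streaming loop
        risky = analyze_and_count_risky(file_path)        # hypothetical per-file analysis
        live.update(build_summary_table(done, risky, str(file_path)))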

3.2 Buffered Excel Export

File: core/services/excel_export/service.py

class ExcelExportService:
    def create_streaming_export(self, output_path: Path) -> ExcelStreamWriter:
        """Create Excel writer that streams data row-by-row."""
        return ExcelStreamWriter(output_path)

class ExcelStreamWriter:
    def write_file_result(self, result: FileAnalysisResult):
        """Write single file result immediately to Excel."""
        # Append to Excel without accumulating in memory
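
If the Excel layer is backed by openpyxl (an assumption; the ExcelExportService dependency is not shown here), its write-only workbook mode is one way to append rows without holding the whole sheet in memory. The sketch below takes flattened fields rather than a FileAnalysisResult object so it stays self-contained.

from pathlib import Path

from openpyxl import Workbook

class ExcelStreamWriter:
    """Minimal sketch of a row-by-row Excel writer using openpyxl's write-only mode."""

    def __init__(self, output_path: Path):
        self._path = output_path
        self._workbook = Workbook(write_only=True)  # avoids building the full sheet in memory
        self._sheet = self._workbook.create_sheet("Entropy Analysis")
        self._sheet.append(["File", "Overall entropy", "Risky regions"])  # header row

    def write_file_result(self, file_path: Path, overall_entropy: float, risky_regions: int) -> None:
        """Append a single file's summary row as soon as it is available."""
        self._sheet.append([str(file_path), overall_entropy, risky_regions])

    def close(self) -> None:
        """Finalize the workbook and write it to disk."""
        self._workbook.save(self._path)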

Priority 4: Memory-Conscious Configuration Options

4.1 Large Dataset Mode

Add a configuration preset for large datasets:

# In analyze command
@click.option(
    "--large-dataset-mode",
    is_flag=True,
    help="Optimize for large datasets (>100MB total): reduce memory usage, filter low-risk regions",
)

This mode would automatically:
  - Set include_data_samples=False
  - Set risk_threshold=MEDIUM_HIGH
  - Set max_regions_per_file=10000
  - Enable streaming output
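
One possible wiring of the flag to the configuration preset from section 2.2 (a sketch; risk_threshold and streaming_output_enabled are assumed additions to AnalysisConfiguration, not existing fields):

def apply_large_dataset_preset(config: AnalysisConfiguration) -> AnalysisConfiguration:
    """Apply the memory-conscious defaults behind --large-dataset-mode."""
    config.include_data_samples = False
    config.risk_threshold = EntropyLevel.MEDIUM_HIGH
    config.max_regions_per_file = 10000
    config.streaming_output_enabled = True
    return config

# In the analyze command (sketch):
# if large_dataset_mode:
#     config = apply_large_dataset_preset(config)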

4.2 Adaptive Block Size

Implement adaptive analysis block size based on file size:

def calculate_optimal_block_size(file_size: int, available_memory: int) -> int:
    """Calculate block size to stay within memory limits."""
    # available_memory is a hook for finer-grained tuning; this simple heuristic
    # looks only at file size.
    if file_size > 100 * 1024 * 1024:  # >100MB
        return 1024  # Larger blocks, fewer regions
    return 64  # Default for smaller files
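
As a usage sketch, the caller could pass the host's currently available memory via psutil (already used for monitoring below); the file path here is purely illustrative:

from pathlib import Path

import psutil

file_size = Path("/data/dump.bin").stat().st_size   # hypothetical input file
available = psutil.virtual_memory().available       # bytes of RAM currently available
block_size = calculate_optimal_block_size(file_size, available)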

Implementation Plan

Phase 1: Critical RAM Reduction (Immediate)

  1. Implement streaming results processing
  2. Add risk-level filtering for regions
  3. Create --large-dataset-mode flag

Expected RAM Reduction: 80-90% for large files

Phase 2: Enhanced Streaming (Short-term)

  1. Implement results summary collector
  2. Add progressive display updates
  3. Add buffered Excel export

Expected Benefits: Consistent RAM usage regardless of dataset size

Phase 3: Advanced Optimization (Medium-term)

  1. Implement adaptive block sizing
  2. Add memory usage monitoring
  3. Add configuration presets for different use cases

Backward Compatibility

All optimizations maintain backward compatibility:
  - Default behavior unchanged for small datasets
  - New flags are opt-in
  - All existing output formats supported
  - API compatibility maintained

Testing Strategy

  1. Memory Profiling: Add tests with memory_profiler for various dataset sizes
  2. Large Dataset Testing: Test with 1GB+, 5GB+, 10GB+ datasets
  3. Output Validation: Ensure streaming results match current full results
  4. Performance Benchmarks: Measure processing time vs memory usage tradeoffs
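
As a sketch of the output-validation and memory tests (pytest-style; the large_dataset_dir and run_entropy_analysis fixtures are placeholders for whatever harness the project uses), RSS growth can be sampled with psutil around a streaming run:

import psutil
import pytest

@pytest.mark.slow
def test_streaming_matches_full_results(large_dataset_dir, run_entropy_analysis):
    """Streaming and full-accumulation modes should report the same summary."""
    full = run_entropy_analysis(large_dataset_dir, streaming=False)
    streamed = run_entropy_analysis(large_dataset_dir, streaming=True)
    assert streamed.total_files == full.total_files
    assert streamed.risk_level_counts == full.risk_level_counts

@pytest.mark.slow
def test_memory_stays_bounded(large_dataset_dir, run_entropy_analysis):
    """RSS growth during a streaming run should stay under a fixed budget."""
    process = psutil.Process()
    before = process.memory_info().rss
    run_entropy_analysis(large_dataset_dir, streaming=True)
    after = process.memory_info().rss
    assert (after - before) < 500 * 1024 * 1024  # < 500 MB growth; budget is illustrative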

Monitoring and Metrics

Add runtime memory monitoring:

import logging

logger = logging.getLogger(__name__)

def monitor_memory_usage() -> None:
    """Log memory usage at key processing milestones."""
    import psutil  # local import keeps psutil optional at module import time

    process = psutil.Process()
    memory_mb = process.memory_info().rss / 1024 / 1024
    logger.info(f"Memory usage: {memory_mb:.1f} MB")

Configuration Examples

For large workloads:

analysis:
  large_dataset_mode: true
  risk_threshold: "medium_high"
  include_data_samples: false
  max_regions_per_file: 10000
  adaptive_block_size: true

For comprehensive analysis (current behavior):

analysis:
  large_dataset_mode: false
  risk_threshold: "low"  
  include_data_samples: true
  max_regions_per_file: null  # unlimited
  adaptive_block_size: false

Conclusion

The primary RAM bottleneck is storing complete result sets in memory. By implementing streaming results processing with selective region storage, RAM usage can be reduced by 80-90% while maintaining full functionality. The recommendations provide a clear path to handle multi-gigabyte workloads efficiently.