Entropy Analysis RAM Usage Optimization
Overview
The current analyze entropy implementation can consume excessive RAM when processing large workloads (1GB+ datasets). This document analyzes the RAM bottlenecks and provides actionable recommendations for optimization without compromising functionality.
Current Architecture Analysis
RAM Usage Bottlenecks Identified
1. Complete Result Set Accumulation (Primary Issue)
Location: cli/commands/analyze.py:751-810, 821-846
The current flow accumulates ALL results in memory before displaying:
# In _analyze_files_with_single_progress() and _analyze_files_without_progress()
all_results: dict[str, dict[str, float | list[EntropyRegion] | str | None]] = {}

# For each file processed:
all_results[str(file_path)] = {
    "overall_entropy": overall_entropy,
    "regions": regions,  # <-- This can be huge for large files
    "mime_type": mime_type,
    "detected_language": detected_language,
}
RAM Impact: For a 1GB file with 64-byte analysis blocks and a 16-byte step size, this creates ~67M entropy regions, with each EntropyRegion object consuming ~200+ bytes (including Python object overhead and the 32-byte data sample).
Calculation:
- Regions per file: (file_size - analysis_block_size) / step_size + 1
- For a 1GB file: (1,073,741,824 - 64) / 16 + 1 ≈ 67,108,861 regions
- Memory per region: ~200 bytes (object + data sample)
- Total RAM for regions: ~13.4 GB (≈12.5 GiB) per 1GB file
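The same arithmetic as a quick script, for re-checking the figures (the 200-byte per-region cost is an assumed average for Python object overhead, not a measured value):

```python
# Back-of-the-envelope estimate of region count and RAM for a 1 GiB file
file_size = 1 * 1024**3          # 1 GiB input
block_size, step_size = 64, 16   # analysis block and sliding-window step
bytes_per_region = 200           # assumed per-EntropyRegion overhead

regions = (file_size - block_size) // step_size + 1
print(f"regions: {regions:,}")                                  # 67,108,861
print(f"RAM: {regions * bytes_per_region / 1024**3:.1f} GiB")   # ~12.5 GiB
```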
2. Data Sample Storage in Regions (Secondary Issue)
Location: analyze/services/entropy/analyzer.py:481
Each EntropyRegion stores a 32-byte data sample:
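The sketch below shows what that region record plausibly looks like, inferred from the constructor arguments used later in this document; apart from data_sample, the exact field set is an assumption:

```python
from dataclasses import dataclass


@dataclass
class EntropyRegion:
    offset: int
    size: int
    entropy: float
    level: "EntropyLevel"   # entropy risk bucket, defined elsewhere in the analyzer
    confidence: float
    data_sample: bytes      # first 32 bytes of the analyzed block
```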
RAM Impact: For 67M regions × 32 bytes = ~2GB just for data samples.
3. Multiple Result Processing Passes (Tertiary Issue)
Location: cli/commands/analyze.py:972-1033, 1036-1104
Results are processed multiple times for different outputs:
- Once for table display preparation
- Once for Excel export (if enabled)
- Once for JSON export (if enabled)
Optimization Recommendations
Priority 1: Streaming Results Architecture
1.1 Implement Streaming Results Processing
File: cli/commands/analyze.py
Replace the current accumulation pattern with streaming:
# Instead of accumulating all_results, process each file immediately
def _analyze_files_streaming(
    files: list[Path],
    analyzer: EntropyAnalyzer,
    rich_output: RichOutputService,
    # ... other params
    summary_collector: ResultsSummaryCollector,  # New component
) -> None:
    """Stream analysis results without holding all of them in memory."""
    for file_path in files:
        # Analyze a single file
        overall_entropy, regions = analyzer.analyze_file_entropy(...)

        # Process results immediately
        file_result = FileAnalysisResult(
            file_path=file_path,
            overall_entropy=overall_entropy,
            # Process regions immediately rather than storing them
        )

        # Update running summary statistics
        summary_collector.update_with_file_result(file_result)

        # Stream to output immediately (optional)
        if streaming_output_enabled:
            rich_output.stream_file_result(file_result)
1.2 Create Results Summary Collector
New File: analyze/services/summary_collector.py
from collections import defaultdict


class ResultsSummaryCollector:
    """Collects running statistics without storing full results."""

    def __init__(self):
        self.total_files = 0
        self.total_risky_files = 0
        self.risk_level_counts = defaultdict(int)
        # Other running statistics

    def update_with_file_result(self, result: FileAnalysisResult):
        """Update running statistics from a single file result."""
        # Extract and accumulate key metrics
        # Classify the risk level and increment counters
        # No need to store the full result
Priority 2: Selective Region Storage
2.1 Filter Regions by Risk Level
File: analyze/services/entropy/analyzer.py
Only store regions above a certain risk threshold:
def _process_chunk_sliding_windows(
    # ... existing params
    risk_threshold: EntropyLevel = EntropyLevel.MEDIUM_HIGH,
) -> tuple[list[EntropyRegion], int, bytes]:
    """Process sliding windows, keeping only risky regions."""
    regions: list[EntropyRegion] = []

    # ... existing processing logic ...

    # Only store the region if it meets the risk threshold
    if entropy_level.value >= risk_threshold.value:
        region = EntropyRegion(
            offset=block_offset,
            size=len(block_data),
            confidence=self._calculate_confidence(block_entropy, params.file_type),
            entropy=block_entropy,
            level=entropy_level,
            data_sample=block_data[:32] if include_samples else b"",
        )
        regions.append(region)

    # Always update the region count for accurate statistics
    region_count += 1
RAM Reduction: Instead of 67M regions, store only high-risk regions (estimated 5-10% = ~3-7M regions).
2.2 Optional Data Sample Reduction
Add configuration option to exclude data samples:
# In configuration
class AnalysisConfiguration:
    include_data_samples: bool = False  # Default off for large datasets
    max_regions_per_file: int = 10000   # Cap regions per file
Priority 3: Incremental Output Generation
3.1 Progressive Table Display
File: core/services/rich_output/service.py
class RichOutputService:
    def stream_analysis_progress(
        self,
        current_file: str,
        file_summary: FileSummary,
        running_totals: RunningTotals,
    ):
        """Update the display progressively instead of rendering one final table."""
        # Update progress display
        # Show running statistics
        # Highlight concerning files immediately
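As an illustration only, here is a minimal sketch of how such progressive updates could be rendered with the rich library's Live display; the table layout and the analyze_one callback are assumptions, not part of the existing service:

```python
from pathlib import Path

from rich.live import Live
from rich.table import Table


def _render_running_table(processed: int, risky: int, current_file: str) -> Table:
    """Build a small status table that is re-rendered after every file."""
    table = Table(title="Entropy analysis (running totals)")
    table.add_column("Metric")
    table.add_column("Value", justify="right")
    table.add_row("Files processed", str(processed))
    table.add_row("Risky files", str(risky))
    table.add_row("Current file", current_file)
    return table


def stream_progress(files: list[Path], analyze_one) -> None:
    """Re-render a compact status table after each file instead of one final table."""
    risky = 0
    with Live(_render_running_table(0, 0, "-"), refresh_per_second=4) as live:
        for processed, path in enumerate(files, start=1):
            # analyze_one: caller-supplied callable returning True if the file is risky
            risky += 1 if analyze_one(path) else 0
            live.update(_render_running_table(processed, risky, path.name))
```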
3.2 Buffered Excel Export
File: core/services/excel_export/service.py
class ExcelExportService:
    def create_streaming_export(self, output_path: Path) -> ExcelStreamWriter:
        """Create an Excel writer that streams data row-by-row."""
        return ExcelStreamWriter(output_path)


class ExcelStreamWriter:
    def write_file_result(self, result: FileAnalysisResult):
        """Write a single file result immediately to Excel."""
        # Append to the workbook without accumulating results in memory
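One way to back such a writer without holding rows in memory is openpyxl's write-only mode, which appends rows as they arrive. The sketch below assumes a simple three-column layout; the actual columns would mirror the current export:

```python
from pathlib import Path

from openpyxl import Workbook


class ExcelStreamWriter:
    """Minimal sketch of a row-streaming Excel writer using openpyxl's write-only mode."""

    def __init__(self, output_path: Path):
        self._path = output_path
        self._workbook = Workbook(write_only=True)
        self._sheet = self._workbook.create_sheet("Entropy Analysis")
        self._sheet.append(["File", "Overall entropy", "Risky regions"])  # header row

    def write_row(self, file_path: str, overall_entropy: float, risky_regions: int) -> None:
        # Rows are handed to openpyxl incrementally instead of being accumulated in Python
        self._sheet.append([file_path, overall_entropy, risky_regions])

    def close(self) -> None:
        self._workbook.save(self._path)
```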
Priority 4: Memory-Conscious Configuration Options
4.1 Large Dataset Mode
Add a configuration preset for large datasets:
# In analyze command
@click.option(
"--large-dataset-mode",
is_flag=True,
help="Optimize for large datasets (>100MB total): reduce memory usage, filter low-risk regions",
)
This mode would automatically (see the sketch after this list):
- Set include_data_samples=False
- Set risk_threshold=MEDIUM_HIGH
- Set max_regions_per_file=10000
- Enable streaming output
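A sketch of how the flag could map onto configuration; the settings bundle and its field names are hypothetical stand-ins for the real configuration object:

```python
from dataclasses import dataclass


@dataclass
class LargeDatasetSettings:
    """Hypothetical bundle of the memory-related settings toggled by the flag."""
    include_data_samples: bool = True
    risk_threshold: str = "low"
    max_regions_per_file: int | None = None
    streaming_output: bool = False


def apply_large_dataset_mode(settings: LargeDatasetSettings) -> LargeDatasetSettings:
    """Tighten memory-related settings when --large-dataset-mode is passed."""
    settings.include_data_samples = False
    settings.risk_threshold = "medium_high"
    settings.max_regions_per_file = 10000
    settings.streaming_output = True
    return settings
```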
4.2 Adaptive Block Size
Implement adaptive analysis block size based on file size:
def calculate_optimal_block_size(file_size: int, available_memory: int) -> int:
    """Calculate an analysis block size that keeps region counts within memory limits."""
    # Worst-case region cost at the default 64-byte block / 16-byte step,
    # assuming ~200 bytes of overhead per region
    estimated_region_bytes = (file_size // 16) * 200
    if file_size > 100 * 1024 * 1024 or estimated_region_bytes > available_memory:
        return 1024  # Larger blocks, fewer regions
    return 64  # Default for smaller files
Implementation Plan
Phase 1: Critical RAM Reduction (Immediate)
- Implement streaming results processing
- Add risk-level filtering for regions
- Create the `--large-dataset-mode` flag
Expected RAM Reduction: 80-90% for large files
Phase 2: Enhanced Streaming (Short-term)
- Implement results summary collector
- Add progressive display updates
- Add buffered Excel export
Expected Benefits: Consistent RAM usage regardless of dataset size
Phase 3: Advanced Optimization (Medium-term)
- Implement adaptive block sizing
- Add memory usage monitoring
- Add configuration presets for different use cases
Backward Compatibility
All optimizations maintain backward compatibility:
- Default behavior unchanged for small datasets
- New flags are opt-in
- All existing output formats supported
- API compatibility maintained
Testing Strategy
- Memory Profiling: Add tests with `memory_profiler` for various dataset sizes
- Large Dataset Testing: Test with 1GB+, 5GB+, and 10GB+ datasets
- Output Validation: Ensure streaming results match current full results
- Performance Benchmarks: Measure processing time vs memory usage tradeoffs
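A sketch of what such a memory-profiling test could look like, using memory_profiler's memory_usage helper; the analyze_paths entry point, the scaled-down input size, and the 512 MiB budget are assumptions:

```python
import os

from memory_profiler import memory_usage


def test_large_dataset_memory_budget(tmp_path):
    """Memory growth while analyzing a synthetic dataset should stay under a fixed budget."""
    dataset = tmp_path / "random.bin"
    dataset.write_bytes(os.urandom(64 * 1024 * 1024))  # scaled-down high-entropy input

    # memory_usage samples the process RSS (in MiB) while the callable runs;
    # analyze_paths is a hypothetical entry point into the analyze command
    samples = memory_usage((analyze_paths, ([dataset],), {"large_dataset_mode": True}))
    assert max(samples) - min(samples) < 512  # assumed budget in MiB
```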
Monitoring and Metrics
Add runtime memory monitoring:
import logging

import psutil

logger = logging.getLogger(__name__)


def monitor_memory_usage():
    """Log memory usage at key processing milestones."""
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
    logger.info(f"Memory usage: {memory_mb:.1f} MB")
Configuration Examples
For large workloads:
analysis:
  large_dataset_mode: true
  risk_threshold: "medium_high"
  include_data_samples: false
  max_regions_per_file: 10000
  adaptive_block_size: true
For comprehensive analysis (current behavior):
analysis:
  large_dataset_mode: false
  risk_threshold: "low"
  include_data_samples: true
  max_regions_per_file: null  # unlimited
  adaptive_block_size: false
Conclusion
The primary RAM bottleneck is storing complete result sets in memory. By implementing streaming results processing with selective region storage, RAM usage can be reduced by 80-90% while maintaining full functionality. The recommendations provide a clear path to handle multi-gigabyte workloads efficiently.