Entropy Analysis - Streaming Excel Implementation

Overview

The entropy analysis tool has been refactored to use a streaming Excel export approach that eliminates memory bottlenecks when analyzing large files. The new implementation processes and writes results incrementally, maintaining constant memory usage regardless of file size.

Key Changes

Memory-Efficient Architecture

  1. Generator-Based Analysis: The analyze_file_generator() method yields results as they're computed, never accumulating all regions in memory.

  2. Selective Region Processing: Only regions meeting the risk threshold are created as objects and written to Excel. Lower-risk regions are counted but not stored.

  3. Direct Excel Streaming: Results are written to Excel immediately using XlsxWriter's constant_memory mode, which flushes each row to disk as it is written instead of holding the whole workbook in memory.
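A minimal sketch of the generator-based flow, assuming a simple Shannon-entropy scorer (the function and parameter names here are illustrative; the real analyzer.py API may differ, and for simplicity this sketch skips windows that cross chunk boundaries):

```python
import math

def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a block, in bits per byte (0.0-8.0)."""
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def analyze_file_generator(path, chunk_size=65536, block_size=64,
                           step=16, min_entropy=7.0):
    """Yield (offset, entropy) only for blocks at or above min_entropy.
    Lower-entropy blocks are never materialized, so memory stays flat."""
    offset = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for off in range(0, len(chunk) - block_size + 1, step):
                h = shannon_entropy(chunk[off:off + block_size])
                if h >= min_entropy:
                    yield offset + off, h
            offset += len(chunk)
```

Because the caller consumes one `(offset, entropy)` tuple at a time, peak memory is bounded by the chunk size, not the file size.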

Usage

Basic Command

ssf_tools analyze entropy <target>

Options

| Option | Default | Description |
|---|---|---|
| --risk-threshold | medium_high | Minimum risk level for regions to include (very_low, low, medium, medium_high, high, critical) |
| --file-block-size | 65536 | File I/O block size in bytes |
| --analysis-block-size | 64 | Analysis block size in bytes |
| --step-size | 16 | Step size for sliding window |
| --force-file-type | auto | Override file type detection |
| --no-recurse | false | Disable recursive directory analysis |
| --include-samples | false | Include data samples in region analysis |

Examples

# Analyze single file with default settings
ssf_tools analyze entropy sample.bin

# Analyze with higher threshold (fewer results, less memory)
ssf_tools analyze entropy large_file.bin --risk-threshold high

# Analyze directory non-recursively
ssf_tools analyze entropy data/ --no-recurse

# Custom analysis parameters
ssf_tools analyze entropy file.exe --analysis-block-size 128 --step-size 32

Excel Output Format

The tool generates an Excel file with two worksheets:

1. File Summary

Contains one row per analyzed file with:

  • File path and name
  • File size (MB)
  • Overall entropy score
  • Total regions analyzed
  • High-risk regions found
  • Risk level (LOW/MEDIUM/HIGH)
  • Processing time
  • MIME type
  • Detected language

2. High Risk Regions

Contains detailed information for regions meeting the risk threshold:

  • File path and name
  • Offset in file
  • Region size
  • Entropy value
  • Risk level
  • Confidence score
  • Data sample (optional, hex format)

Memory Usage

Before (Accumulation Approach)

  • 1GB file: ~12.5 GB RAM usage
  • 100MB file: ~1.3 GB RAM usage
  • Memory usage scales linearly with file size

After (Streaming Approach)

  • Any file size: ~10-20 MB RAM usage
  • Memory usage remains constant regardless of file size

How It Works

  1. File Processing: Files are read in chunks (default 64KB)
  2. Sliding Window Analysis: Each chunk is analyzed with a sliding window
  3. Selective Creation: Only high-risk regions create EntropyRegion objects
  4. Immediate Output: Results are written to Excel as soon as they're computed
  5. Garbage Collection: Objects become eligible for GC immediately after writing
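Steps 4 and 5 can be sketched as a small driver loop (the writer callable and names are illustrative, not the tool's actual API):

```python
def stream_regions(regions, write_row):
    """Write each high-risk region the moment the analyzer yields it.
    No list of regions is ever built, so each `region` tuple becomes
    eligible for garbage collection as soon as the next iteration begins."""
    written = 0
    for region in regions:      # `regions` is any generator or iterable
        write_row(region)       # immediate output (e.g. an Excel row)
        written += 1
    return written
```

Passing the analyzer's generator directly into such a loop is what keeps the whole pipeline at constant memory.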

Performance Considerations

Risk Threshold Selection

The --risk-threshold parameter significantly impacts both memory usage and output size:

| Threshold | Typical Regions | Use Case |
|---|---|---|
| very_low | 80-90% of all | Forensic analysis (may hit Excel limits) |
| low | 50-70% of all | Detailed analysis |
| medium | 20-40% of all | Standard analysis |
| medium_high | 5-15% of all | Recommended default |
| high | 1-5% of all | Quick scan for critical issues |
| critical | <1% of all | Emergency response |
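The threshold comparison itself is just an ordering over the named levels. A sketch (the level ordering comes from the documented options; the function name is illustrative):

```python
# Risk levels in ascending order, as documented for --risk-threshold.
RISK_LEVELS = ["very_low", "low", "medium", "medium_high", "high", "critical"]

def meets_threshold(region_risk: str, threshold: str = "medium_high") -> bool:
    """True if a region's risk level is at or above the CLI threshold."""
    return RISK_LEVELS.index(region_risk) >= RISK_LEVELS.index(threshold)
```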

Excel Limitations

Excel has a hard limit of 1,048,576 rows per worksheet. The tool will:

  1. Warn when approaching the limit (900,000 rows)
  2. Stop writing regions when the limit is reached
  3. Add a note to the summary if the limit was hit
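The limit handling can be sketched as a guard the exporter checks before each write (the constants come from the behavior described above; the function name is illustrative):

```python
EXCEL_MAX_ROWS = 1_048_576  # hard per-worksheet row limit in .xlsx
WARN_THRESHOLD = 900_000    # start warning well before the hard limit

def can_write_row(row_index: int, warnings: list) -> bool:
    """Return False once the worksheet is full; warn as the limit nears."""
    if row_index == WARN_THRESHOLD:
        warnings.append(f"approaching Excel row limit ({row_index:,} rows)")
    if row_index >= EXCEL_MAX_ROWS:
        return False  # caller stops writing regions and notes it in the summary
    return True
```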

For very large datasets with low thresholds, consider:

  • Using a higher --risk-threshold
  • Analyzing files separately
  • Breaking large files into chunks

Implementation Details

Key Components

  1. analyze_file_generator() (analyzer.py)
     • Yields results as they're computed
     • Only creates objects for high-risk regions
     • Maintains running statistics

  2. StreamingExcelExporter (excel_streamer.py)
     • Uses XlsxWriter with constant_memory=True
     • Writes rows immediately without buffering
     • Manages worksheet formatting and structure

  3. Refactored CLI (analyze.py)
     • Simplified to only support Excel export
     • Progress reporting per file
     • Memory-conscious file discovery
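A minimal sketch of the exporter built around XlsxWriter's constant_memory mode (the class shape and column layout are illustrative; only the Workbook options and write_row calls reflect the actual XlsxWriter API):

```python
import xlsxwriter

class StreamingExcelExporter:
    """Stream rows straight to disk. In constant_memory mode XlsxWriter
    keeps only the current row in memory, and rows must be written in
    strictly increasing row order."""

    def __init__(self, out_path: str):
        self.workbook = xlsxwriter.Workbook(out_path, {"constant_memory": True})
        self.sheet = self.workbook.add_worksheet("High Risk Regions")
        self.sheet.write_row(0, 0, ["File", "Offset", "Size", "Entropy", "Risk"])
        self.next_row = 1

    def write_region(self, *values) -> bool:
        """Write one region row; refuse once Excel's row limit is reached."""
        if self.next_row >= 1_048_576:
            return False
        self.sheet.write_row(self.next_row, 0, list(values))
        self.next_row += 1
        return True

    def close(self):
        self.workbook.close()  # finalizes the .xlsx file on disk
```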

Configuration

The tool respects existing configuration files but focuses on streaming:

# config/analysis.yaml
analysis:
  file_chunk_size: 65536      # I/O chunk size
  analysis_block_size: 64     # Analysis window size
  step_size: 16               # Sliding window step

Migration Notes

Removed Features

  • JSON output format (incompatible with streaming)
  • Table display output (requires full dataset in memory)
  • In-memory result accumulation

Backward Compatibility

  • Configuration files remain compatible
  • Command-line interface simplified but familiar
  • File discovery and validation unchanged

Troubleshooting

Excel File Won't Open

  • Ensure the analysis completed successfully
  • Check disk space (Excel files can be large)
  • Try opening with Excel 2016 or newer

Analysis Seems Slow

  • A larger --step-size means fewer windows to analyze
  • A higher --risk-threshold reduces output overhead
  • Consider analyzing files in parallel

Memory Still High

  • Verify you're using the latest version
  • Check for other memory-consuming processes
  • Ensure temp directory has sufficient space

Future Enhancements

Potential improvements for consideration:

  1. Parallel file processing for multi-core systems
  2. Resume capability for interrupted analyses
  3. Alternative output formats (Parquet, SQLite)
  4. Real-time analysis dashboard
  5. Cloud storage integration

Summary

The streaming Excel implementation provides:

  • Constant memory usage regardless of file size
  • Immediate output visibility during processing
  • Selective processing to reduce unnecessary computation
  • Excel format familiar to security analysts
  • Configurable thresholds for different use cases

This approach solves the memory bottleneck while maintaining all analytical capabilities and improving user experience through progressive output generation.