# Entropy Analysis - Streaming Excel Implementation

## Overview

The entropy analysis tool has been refactored to use a streaming Excel export approach that eliminates memory bottlenecks when analyzing large files. The new implementation processes and writes results incrementally, maintaining constant memory usage regardless of file size.
## Key Changes

### Memory-Efficient Architecture

- **Generator-Based Analysis**: The `analyze_file_generator()` method yields results as they're computed, never accumulating all regions in memory.
- **Selective Region Processing**: Only regions meeting the risk threshold are created as objects and written to Excel. Lower-risk regions are counted but not stored.
- **Direct Excel Streaming**: Results are written to Excel immediately using XlsxWriter's `constant_memory` mode, which streams data to disk without buffering.
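The selective-yield pattern can be sketched in a few lines. This is an illustrative stand-in, not the tool's actual API: `shannon_entropy` and `analyze_blocks` are hypothetical names, and the threshold here is a raw entropy value rather than the tool's named risk levels.

```python
import math

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def analyze_blocks(data: bytes, block_size=64, step=16, threshold=7.0):
    # Slide a window over the data; yield only blocks whose entropy
    # meets the threshold, so low-risk regions are never stored.
    for offset in range(0, max(len(data) - block_size + 1, 0), step):
        block = data[offset:offset + block_size]
        h = shannon_entropy(block)
        if h >= threshold:
            yield offset, h
```

Entropy near 8.0 bits per byte is characteristic of compressed or encrypted data, which is why high-entropy regions are flagged as risky.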
## Usage

### Basic Command
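The basic invocation takes a file or directory path (shown here with a placeholder path; see Examples below for concrete runs):

```shell
ssf_tools analyze entropy <path>
```

Results stream to an Excel workbook as the analysis proceeds.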
### Options

| Option | Default | Description |
|---|---|---|
| `--risk-threshold` | `medium_high` | Minimum risk level for regions to include (`very_low`, `low`, `medium`, `medium_high`, `high`, `critical`) |
| `--file-block-size` | 65536 | File I/O block size in bytes |
| `--analysis-block-size` | 64 | Analysis block size in bytes |
| `--step-size` | 16 | Step size for sliding window |
| `--force-file-type` | auto | Override file type detection |
| `--no-recurse` | false | Disable recursive directory analysis |
| `--include-samples` | false | Include data samples in region analysis |
### Examples

```shell
# Analyze single file with default settings
ssf_tools analyze entropy sample.bin

# Analyze with higher threshold (fewer results, less memory)
ssf_tools analyze entropy large_file.bin --risk-threshold high

# Analyze directory non-recursively
ssf_tools analyze entropy data/ --no-recurse

# Custom analysis parameters
ssf_tools analyze entropy file.exe --analysis-block-size 128 --step-size 32
```
## Excel Output Format

The tool generates an Excel file with two worksheets:

### 1. File Summary

Contains one row per analyzed file with:

- File path and name
- File size (MB)
- Overall entropy score
- Total regions analyzed
- High-risk regions found
- Risk level (LOW/MEDIUM/HIGH)
- Processing time
- MIME type
- Detected language

### 2. High Risk Regions

Contains detailed information for regions meeting the risk threshold:

- File path and name
- Offset in file
- Region size
- Entropy value
- Risk level
- Confidence score
- Data sample (optional, hex format)
## Memory Usage

### Before (Accumulation Approach)

- 1 GB file: ~12.5 GB RAM usage
- 100 MB file: ~1.3 GB RAM usage
- Memory usage scales linearly with file size
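The ~12.5 GB figure is consistent with accumulating one region object per 16-byte step; the ~200 bytes of per-object overhead assumed below is an illustrative round number, not a measured value:

```python
# Back-of-the-envelope cost of accumulating one object per window
# position (the 200-byte per-object overhead is an assumption).
file_size = 2**30           # 1 GiB input
step_size = 16              # sliding-window step
bytes_per_region = 200      # assumed CPython object overhead

regions = file_size // step_size        # window positions
ram_gib = regions * bytes_per_region / 2**30
print(f"{regions:,} regions -> ~{ram_gib:.1f} GiB")
# prints: 67,108,864 regions -> ~12.5 GiB
```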
### After (Streaming Approach)

- Any file size: ~10-20 MB RAM usage
- Memory usage remains constant regardless of file size
## How It Works

- **File Processing**: Files are read in chunks (default 64 KB)
- **Sliding Window Analysis**: Each chunk is analyzed with a sliding window
- **Selective Creation**: Only high-risk regions create `EntropyRegion` objects
- **Immediate Output**: Results are written to Excel as soon as they're computed
- **Garbage Collection**: Objects become eligible for GC immediately after writing
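One subtlety in chunked reading is that a sliding window can straddle two I/O chunks. A common fix is to carry the unprocessed tail of each chunk into the next read, so every window is analyzed exactly once. A stdlib-only sketch (the function name is illustrative, not the tool's actual code):

```python
def read_windows(stream, file_block_size=65536, block_size=64, step=16):
    # Read the file in large I/O chunks, but keep the unprocessed tail
    # of each chunk as carry so windows that straddle a chunk boundary
    # are still analyzed exactly once.
    carry = b""
    base = 0  # absolute file offset of carry[0]
    while True:
        chunk = stream.read(file_block_size)
        if not chunk:
            break
        data = carry + chunk
        last_start = len(data) - block_size
        offset = 0
        while offset <= last_start:
            yield base + offset, data[offset:offset + block_size]
            offset += step
        # keep the unprocessed tail for the next iteration
        carry = data[offset:]
        base += offset
```

This streams windows at constant memory: only one I/O chunk plus a small carry buffer is ever held at a time.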
## Performance Considerations

### Risk Threshold Selection

The `--risk-threshold` parameter significantly impacts both memory usage and output size:

| Threshold | Typical Regions | Use Case |
|---|---|---|
| `very_low` | 80-90% of all | Forensic analysis (may hit Excel limits) |
| `low` | 50-70% of all | Detailed analysis |
| `medium` | 20-40% of all | Standard analysis |
| `medium_high` | 5-15% of all | Recommended default |
| `high` | 1-5% of all | Quick scan for critical issues |
| `critical` | <1% of all | Emergency response |
### Excel Limitations

Excel has a hard limit of 1,048,576 rows per worksheet. The tool will:

1. Warn when approaching the limit (900,000 rows)
2. Stop writing regions when the limit is reached
3. Add a note to the summary if the limit was hit

For very large datasets with low thresholds, consider:

- Using a higher `--risk-threshold`
- Analyzing files separately
- Breaking large files into chunks
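The limit handling can be sketched as a small counter wrapped around the row writer. The class name and message are illustrative; the row limit itself is Excel's documented per-worksheet maximum:

```python
EXCEL_MAX_ROWS = 1_048_576
WARN_AT = 900_000

class RowLimitGuard:
    """Tracks rows written and enforces Excel's per-worksheet limit."""
    def __init__(self):
        self.rows = 0
        self.warned = False
        self.truncated = False

    def allow(self) -> bool:
        # Returns True if one more data row may be written.
        if self.rows >= EXCEL_MAX_ROWS - 1:  # reserve row 0 for the header
            self.truncated = True
            return False
        self.rows += 1
        if self.rows >= WARN_AT and not self.warned:
            self.warned = True
            print(f"warning: approaching Excel row limit ({self.rows} rows)")
        return True
```

After the run, `truncated` tells the summary writer whether to add the limit-hit note.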
## Implementation Details

### Key Components

- `analyze_file_generator()` (analyzer.py)
  - Yields results as they're computed
  - Only creates objects for high-risk regions
  - Maintains running statistics
- `StreamingExcelExporter` (excel_streamer.py)
  - Uses XlsxWriter with `constant_memory=True`
  - Writes rows immediately without buffering
  - Manages worksheet formatting and structure
- Refactored CLI (analyze.py)
  - Simplified to only support Excel export
  - Progress reporting per file
  - Memory-conscious file discovery
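A minimal sketch of such an exporter, assuming the XlsxWriter package (`pip install XlsxWriter`) is available; the class and column names are illustrative, not the actual excel_streamer.py API:

```python
import xlsxwriter  # third-party: pip install XlsxWriter

class StreamingExcelExporter:
    def __init__(self, path):
        # constant_memory=True flushes each completed row to disk,
        # so memory stays flat no matter how many rows are written.
        self._book = xlsxwriter.Workbook(path, {"constant_memory": True})
        self._sheet = self._book.add_worksheet("High Risk Regions")
        self._sheet.write_row(0, 0, ("File", "Offset", "Entropy", "Risk"))
        self._row = 1

    def write_region(self, file, offset, entropy, risk):
        # In constant_memory mode rows must be written top to bottom;
        # a generator pipeline naturally produces them in that order.
        self._sheet.write_row(self._row, 0, (file, offset, entropy, risk))
        self._row += 1

    def close(self):
        self._book.close()
```

Note the ordering constraint: constant-memory mode cannot revisit earlier rows, which is exactly why the analysis side must be a forward-only generator.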
## Configuration

The tool respects existing configuration files but focuses on streaming:

```yaml
# config/analysis.yaml
analysis:
  file_chunk_size: 65536     # I/O chunk size
  analysis_block_size: 64    # Analysis window size
  step_size: 16              # Sliding window step
```
## Migration Notes

### Removed Features

- JSON output format (incompatible with streaming)
- Table display output (requires full dataset in memory)
- In-memory result accumulation

### Backward Compatibility

- Configuration files remain compatible
- Command-line interface simplified but familiar
- File discovery and validation unchanged
## Troubleshooting

### Excel File Won't Open

- Ensure the analysis completed successfully
- Check disk space (Excel files can be large)
- Try opening with Excel 2016 or newer

### Analysis Seems Slow

- A larger `--step-size` processes fewer regions
- A higher `--risk-threshold` reduces output overhead
- Consider analyzing files in parallel

### Memory Still High

- Verify you're using the latest version
- Check for other memory-consuming processes
- Ensure the temp directory has sufficient space
## Future Enhancements

Potential improvements for consideration:

1. Parallel file processing for multi-core systems
2. Resume capability for interrupted analyses
3. Alternative output formats (Parquet, SQLite)
4. Real-time analysis dashboard
5. Cloud storage integration
## Summary

The streaming Excel implementation provides:

- Constant memory usage regardless of file size
- Immediate output visibility during processing
- Selective processing to reduce unnecessary computation
- Excel format familiar to security analysts
- Configurable thresholds for different use cases

This approach solves the memory bottleneck while maintaining all analytical capabilities and improving the user experience through progressive output generation.