Entropy Analysis - Streaming Excel Implementation

Overview

The entropy analysis tool has been refactored to use a streaming Excel export approach that eliminates memory bottlenecks when analyzing large files. The new implementation processes and writes results incrementally, maintaining constant memory usage regardless of file size.

Key Changes

Memory-Efficient Architecture

  1. Generator-Based Analysis: The analyze_file_generator() method yields results as they're computed, never accumulating all regions in memory.

  2. Selective Region Processing: Only regions meeting the risk threshold are created as objects and written to Excel. Lower-risk regions are counted but not stored.

  3. Direct Excel Streaming: Results are written to Excel immediately using XlsxWriter's constant_memory mode, which flushes each row to disk as it is written instead of holding the whole workbook in memory.
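A minimal sketch of the generator-based flow, assuming a simple Shannon-entropy scorer (the function and parameter names here are illustrative; the real analyzer.py API may differ, and for simplicity this sketch skips windows that cross chunk boundaries):

```python
import math

def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a block, in bits per byte (0.0-8.0)."""
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def analyze_file_generator(path, chunk_size=65536, block_size=64,
                           step=16, min_entropy=7.0):
    """Yield (offset, entropy) only for blocks at or above min_entropy.
    Lower-entropy blocks are never materialized, so memory stays flat."""
    offset = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for off in range(0, len(chunk) - block_size + 1, step):
                h = shannon_entropy(chunk[off:off + block_size])
                if h >= min_entropy:
                    yield offset + off, h
            offset += len(chunk)
```

Because the caller consumes one `(offset, entropy)` tuple at a time, peak memory is bounded by the chunk size, not the file size.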

Usage

Basic Command

ssf_tools analyze entropy <target>

Options

| Option | Default | Description |
|---|---|---|
| --risk-threshold | medium_high | Minimum risk level for regions to include (very_low, low, medium, medium_high, high, critical) |
| --file-block-size | 65536 | File I/O block size in bytes |
| --analysis-block-size | 64 | Analysis block size in bytes |
| --step-size | 16 | Step size for sliding window |
| --force-file-type | auto | Override file type detection |
| --no-recurse | false | Disable recursive directory analysis |
| --include-samples | false | Include data samples in region analysis |

Examples

# Analyze single file with default settings
ssf_tools analyze entropy sample.bin

# Analyze with higher threshold (fewer results, less memory)
ssf_tools analyze entropy large_file.bin --risk-threshold high

# Analyze directory non-recursively
ssf_tools analyze entropy data/ --no-recurse

# Custom analysis parameters
ssf_tools analyze entropy file.exe --analysis-block-size 128 --step-size 32

Excel Output Format

The tool generates an Excel file with two worksheets:

1. File Summary

Contains one row per analyzed file with:

  • File path and name
  • File size (MB)
  • Overall entropy score
  • Total regions analyzed
  • High-risk regions found
  • Risk level (LOW/MEDIUM/HIGH)
  • Processing time
  • MIME type
  • Detected language

2. High Risk Regions

Contains detailed information for regions meeting the risk threshold:

  • File path and name
  • Offset in file
  • Region size
  • Entropy value
  • Risk level
  • Confidence score
  • Data sample (optional, hex format)

Memory Usage

Before (Accumulation Approach)

  • 1GB file: ~12.5 GB RAM usage
  • 100MB file: ~1.3 GB RAM usage
  • Memory usage scales linearly with file size

After (Streaming Approach)

  • Any file size: ~10-20 MB RAM usage
  • Memory usage remains constant regardless of file size

How It Works

  1. File Processing: Files are read in chunks (default 64KB)
  2. Sliding Window Analysis: Each chunk is analyzed with a sliding window
  3. Selective Creation: Only high-risk regions create EntropyRegion objects
  4. Immediate Output: Results are written to Excel as soon as they're computed
  5. Garbage Collection: Objects become eligible for GC immediately after writing
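Steps 4 and 5 can be sketched as a small driver loop (the writer callable and names are illustrative, not the tool's actual API):

```python
def stream_regions(regions, write_row):
    """Write each high-risk region the moment the analyzer yields it.
    No list of regions is ever built, so each `region` tuple becomes
    eligible for garbage collection as soon as the next iteration begins."""
    written = 0
    for region in regions:      # `regions` is any generator or iterable
        write_row(region)       # immediate output (e.g. an Excel row)
        written += 1
    return written
```

Passing the analyzer's generator directly into such a loop is what keeps the whole pipeline at constant memory.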

Performance Considerations

Risk Threshold Selection

The --risk-threshold parameter significantly impacts both memory usage and output size:

| Threshold | Typical Regions | Use Case |
|---|---|---|
| very_low | 80-90% of all | Forensic analysis (may hit Excel limits) |
| low | 50-70% of all | Detailed analysis |
| medium | 20-40% of all | Standard analysis |
| medium_high | 5-15% of all | Recommended default |
| high | 1-5% of all | Quick scan for critical issues |
| critical | <1% of all | Emergency response |
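The threshold comparison itself is just an ordering over the named levels. A sketch (the level ordering comes from the documented options; the function name is illustrative):

```python
# Risk levels in ascending order, as documented for --risk-threshold.
RISK_LEVELS = ["very_low", "low", "medium", "medium_high", "high", "critical"]

def meets_threshold(region_risk: str, threshold: str = "medium_high") -> bool:
    """True if a region's risk level is at or above the CLI threshold."""
    return RISK_LEVELS.index(region_risk) >= RISK_LEVELS.index(threshold)
```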

Excel Limitations

Excel has a hard limit of 1,048,576 rows per worksheet. The tool will:

  1. Warn when approaching the limit (900,000 rows)
  2. Stop writing regions when the limit is reached
  3. Add a note to the summary if the limit was hit
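The limit handling can be sketched as a guard the exporter checks before each write (the constants come from the behavior described above; the function name is illustrative):

```python
EXCEL_MAX_ROWS = 1_048_576  # hard per-worksheet row limit in .xlsx
WARN_THRESHOLD = 900_000    # start warning well before the hard limit

def can_write_row(row_index: int, warnings: list) -> bool:
    """Return False once the worksheet is full; warn as the limit nears."""
    if row_index == WARN_THRESHOLD:
        warnings.append(f"approaching Excel row limit ({row_index:,} rows)")
    if row_index >= EXCEL_MAX_ROWS:
        return False  # caller stops writing regions and notes it in the summary
    return True
```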

For very large datasets with low thresholds, consider:

  • Using a higher --risk-threshold
  • Analyzing files separately
  • Breaking large files into chunks

Implementation Details

Key Components

  1. analyze_file_generator() (analyzer.py)
     • Yields results as they're computed
     • Only creates objects for high-risk regions
     • Maintains running statistics

  2. StreamingExcelExporter (excel_streamer.py)
     • Uses XlsxWriter with constant_memory=True
     • Writes rows immediately without buffering
     • Manages worksheet formatting and structure

  3. Refactored CLI (analyze.py)
     • Simplified to only support Excel export
     • Progress reporting per file
     • Memory-conscious file discovery
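A minimal sketch of the exporter built around XlsxWriter's constant_memory mode (the class shape and column layout are illustrative; only the Workbook options and write_row calls reflect the actual XlsxWriter API):

```python
import xlsxwriter

class StreamingExcelExporter:
    """Stream rows straight to disk. In constant_memory mode XlsxWriter
    keeps only the current row in memory, and rows must be written in
    strictly increasing row order."""

    def __init__(self, out_path: str):
        self.workbook = xlsxwriter.Workbook(out_path, {"constant_memory": True})
        self.sheet = self.workbook.add_worksheet("High Risk Regions")
        self.sheet.write_row(0, 0, ["File", "Offset", "Size", "Entropy", "Risk"])
        self.next_row = 1

    def write_region(self, *values) -> bool:
        """Write one region row; refuse once Excel's row limit is reached."""
        if self.next_row >= 1_048_576:
            return False
        self.sheet.write_row(self.next_row, 0, list(values))
        self.next_row += 1
        return True

    def close(self):
        self.workbook.close()  # finalizes the .xlsx file on disk
```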

Configuration

The tool respects existing configuration files but focuses on streaming:

# config/analysis.yaml
analysis:
  file_chunk_size: 65536      # I/O chunk size
  analysis_block_size: 64     # Analysis window size
  step_size: 16               # Sliding window step

Migration Notes

Removed Features

  • JSON output format (incompatible with streaming)
  • Table display output (requires full dataset in memory)
  • In-memory result accumulation

Backward Compatibility

  • Configuration files remain compatible
  • Command-line interface simplified but familiar
  • File discovery and validation unchanged

Troubleshooting

Excel File Won't Open

  • Ensure the analysis completed successfully
  • Check disk space (Excel files can be large)
  • Try opening with Excel 2016 or newer

Analysis Seems Slow

  • A larger --step-size means fewer windows to analyze
  • A higher --risk-threshold reduces output overhead
  • Consider analyzing files in parallel

Memory Still High

  • Verify you're using the latest version
  • Check for other memory-consuming processes
  • Ensure temp directory has sufficient space

Future Enhancements

Potential improvements for consideration:

  1. Parallel file processing for multi-core systems
  2. Resume capability for interrupted analyses
  3. Alternative output formats (Parquet, SQLite)
  4. Real-time analysis dashboard
  5. Cloud storage integration

Summary

The streaming Excel implementation provides:

  • Constant memory usage regardless of file size
  • Immediate output visibility during processing
  • Selective processing to reduce unnecessary computation
  • Excel format familiar to security analysts
  • Configurable thresholds for different use cases

This approach solves the memory bottleneck while maintaining all analytical capabilities and improving user experience through progressive output generation.