Streaming Excel Export Architecture

Overview

The refactored StreamingExcelExporter uses dependency injection to leverage existing Excel export services while adding streaming capabilities for handling massive files with minimal memory usage.

Key Components

1. StreamingExcelExporter (analyze/services/excel_streamer.py)

  • Uses dependency injection via @inject decorator
  • Leverages existing services:
      • SheetNameSanitizerProtocol - for Excel-safe sheet names
      • TitleFormatterProtocol - for formatting titles
      • WorkbookEngineProtocol - for creating Excel writers
      • RichOutputProtocol - for console output
  • Implements context manager for proper resource cleanup
  • Buffers writes for efficiency while maintaining low memory usage
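
The context-manager and buffering behaviour listed above could look roughly like the following. This is a minimal sketch and not the actual StreamingExcelExporter: RowSink, ListSink, and BufferedRowExporter are hypothetical names standing in for the injected workbook writer and the exporter, and buffer_size mirrors the configuration option of the same name.

```python
from typing import Protocol, Sequence


class RowSink(Protocol):
    """Hypothetical stand-in for the writer produced by the workbook engine."""

    def write_rows(self, rows: Sequence[Sequence[object]]) -> None: ...
    def close(self) -> None: ...


class ListSink:
    """In-memory sink used only to demonstrate the flushing behaviour."""

    def __init__(self) -> None:
        self.batches: list[list[Sequence[object]]] = []
        self.closed = False

    def write_rows(self, rows: Sequence[Sequence[object]]) -> None:
        self.batches.append(list(rows))

    def close(self) -> None:
        self.closed = True


class BufferedRowExporter:
    """Buffers rows and hands them to the sink in fixed-size batches."""

    def __init__(self, sink: RowSink, buffer_size: int = 1000) -> None:
        if buffer_size < 1:
            raise ValueError("buffer_size must be at least 1")
        self._sink = sink
        self._buffer_size = buffer_size
        self._buffer: list[Sequence[object]] = []

    def __enter__(self) -> "BufferedRowExporter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Flush whatever is left, then release the sink even if flushing fails.
        try:
            self.flush()
        finally:
            self._sink.close()

    def write(self, row: Sequence[object]) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._sink.write_rows(self._buffer)
            self._buffer = []


sink = ListSink()
with BufferedRowExporter(sink, buffer_size=2) as exporter:
    for i in range(5):
        exporter.write((i, f"region-{i}"))
```

With a buffer of two rows and five writes, the sink receives two full batches and one final partial batch on exit. Memory held by the exporter is bounded by buffer_size rows, whatever the total row count.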

2. StreamingWorkbookEngine

  • Custom engine that uses xlsxwriter in constant_memory mode
  • Provides streaming capabilities not available in the standard openpyxl engine
  • Memory usage: approximately constant regardless of row count, provided rows are written in sequential order
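
As an illustration of the engine described above, the sketch below opens a workbook with xlsxwriter's constant_memory option and writes rows strictly top to bottom. The class name StreamingWorkbookEngineSketch and the helper write_rows_in_order are hypothetical; only xlsxwriter.Workbook, add_worksheet, and write_row are real library calls, and the actual engine may be structured differently.

```python
from pathlib import Path
from typing import Iterable, Sequence


class StreamingWorkbookEngineSketch:
    """Hypothetical engine that opens xlsxwriter workbooks in constant_memory mode."""

    workbook_options = {"constant_memory": True}

    def create_workbook(self, path: Path):
        # Imported lazily so this module still loads where xlsxwriter is absent.
        import xlsxwriter

        return xlsxwriter.Workbook(str(path), dict(self.workbook_options))


def write_rows_in_order(workbook, sheet_name: str, rows: Iterable[Sequence[object]]) -> int:
    """Write rows in increasing row order and return how many were written.

    In constant_memory mode xlsxwriter flushes a row once a later row is
    started, so rows cannot be revisited after the writer has moved on.
    """
    worksheet = workbook.add_worksheet(sheet_name)
    count = 0
    for row_index, row in enumerate(rows):
        worksheet.write_row(row_index, 0, row)
        count += 1
    return count
```

A caller would create the workbook with create_workbook(Path("out.xlsx")), pass it to write_rows_in_order, and then call workbook.close() to finalise the file.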

3. Generator-based Analyzer (analyze/services/entropy/analyzer_streaming.py)

  • analyze_file_generator() method yields results as they are computed
  • Filters regions at creation time - never creates objects for low-risk regions
  • Memory usage: O(1) for any file size
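
The filter-at-creation idea above can be sketched as a generator that reads fixed-size windows and yields a region only when it passes the threshold. This is an assumption-laden illustration, not the real analyze_file_generator(): the Region dataclass, the window size, and the use of a raw Shannon entropy cutoff as the risk measure are all hypothetical.

```python
import io
import math
from collections import Counter
from dataclasses import dataclass
from typing import BinaryIO, Iterator


@dataclass(frozen=True)
class Region:
    offset: int
    length: int
    entropy: float


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (zero for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def analyze_stream(stream: BinaryIO, window: int = 256, min_entropy: float = 7.0) -> Iterator[Region]:
    """Yield only windows whose entropy reaches min_entropy.

    Windows below the cutoff never become Region objects, so memory use
    depends on the window size rather than on the file size.
    """
    offset = 0
    while chunk := stream.read(window):
        entropy = shannon_entropy(chunk)
        if entropy >= min_entropy:
            yield Region(offset, len(chunk), entropy)
        offset += len(chunk)


# One all-zero window (entropy 0) followed by one window containing every byte value once.
sample = io.BytesIO(bytes(256) + bytes(range(256)))
regions = list(analyze_stream(sample))
```

Only the second window is yielded here; the first is discarded before any object is created for it.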

Integration Points

Dependency Injection Container

# In containers/application.py or containers/core.py
streaming_excel_exporter = providers.Factory(
    StreamingExcelExporter,
    sheet_name_sanitizer=excel_export.sheet_name_sanitizer,
    title_formatter=excel_export.title_formatter,
    workbook_engine=streaming_workbook_engine,
    rich_output=rich_output,
)

CLI Command Integration

@inject
def entropy(
    excel_streaming: bool,
    risk_threshold: str,
    # ... files, output_path, and injected services such as analyzer
):
    if excel_streaming:
        # Leaving the with block flushes buffered rows and closes the workbook.
        with StreamingExcelExporter(output_path, risk_threshold) as exporter:
            for file in files:
                exporter.process_file_streaming(file, analyzer)

Memory Usage Comparison

| Component | Current Implementation | Streaming Implementation |
|---|---|---|
| Region Storage | All regions in memory | Only high-risk regions yielded |
| Excel Writing | Build DataFrame first | Direct cell writes |
| File Processing | Complete file in memory | Chunked streaming |
| 1GB File RAM Usage | ~12.5 GB | ~10 MB |

Benefits of Refactored Architecture

  1. Dependency Injection: Reuses existing, tested services
  2. Separation of Concerns: Streaming logic separate from Excel formatting
  3. Testability: Each component can be mocked/tested independently
  4. Extensibility: Easy to add new output formats or filters
  5. Configuration: Risk thresholds and options configurable via DI

Configuration Options

# In configuration files
analysis:
  streaming:
    enabled: true
    risk_threshold: "medium_high"
    exclude_data_samples: true
    buffer_size: 1000  # Rows to buffer before writing

excel:
  engine: "xlsxwriter"  # Required for streaming
  constant_memory: true
  max_rows_warning: 900000
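
One way to consume these options is to map the parsed configuration onto typed settings and reject combinations that cannot work. The sketch below assumes the YAML has already been parsed into a dict; StreamingSettings, ExcelSettings, and load_settings are hypothetical names, and the checks shown are illustrative rather than the project's actual validation.

```python
from dataclasses import dataclass

EXCEL_MAX_ROWS = 1_048_576  # hard per-sheet row limit of the .xlsx format


@dataclass(frozen=True)
class StreamingSettings:
    enabled: bool
    risk_threshold: str
    exclude_data_samples: bool
    buffer_size: int


@dataclass(frozen=True)
class ExcelSettings:
    engine: str
    constant_memory: bool
    max_rows_warning: int


def load_settings(config: dict) -> tuple[StreamingSettings, ExcelSettings]:
    """Build typed settings from the parsed config and reject inconsistent values."""
    streaming = StreamingSettings(**config["analysis"]["streaming"])
    excel = ExcelSettings(**config["excel"])
    if streaming.enabled and excel.engine != "xlsxwriter":
        raise ValueError("streaming requires the xlsxwriter engine")
    if streaming.buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    if excel.max_rows_warning > EXCEL_MAX_ROWS:
        raise ValueError("max_rows_warning exceeds the Excel sheet row limit")
    return streaming, excel


config = {
    "analysis": {"streaming": {"enabled": True, "risk_threshold": "medium_high",
                               "exclude_data_samples": True, "buffer_size": 1000}},
    "excel": {"engine": "xlsxwriter", "constant_memory": True, "max_rows_warning": 900000},
}
streaming, excel = load_settings(config)
```

The max_rows_warning value of 900000 sits below the 1,048,576-row sheet limit, which leaves room to warn before a sheet actually fills up.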

Usage Examples

Basic Streaming

ssf_tools analyze entropy /large/dataset --excel-streaming

With Risk Filtering

ssf_tools analyze entropy /large/dataset \
  --excel-streaming \
  --risk-threshold=high

Large Dataset Warning

# Automatically warns if estimated regions > 500k with low threshold
ssf_tools analyze entropy /massive/file \
  --excel-streaming \
  --risk-threshold=low
# Warning: Low risk threshold with 6,250,000 estimated regions...
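
The warning rule in the comment above (more than 500k estimated regions combined with a low threshold) could be checked along these lines. This is a sketch under assumptions: estimate_regions and low_threshold_warning are hypothetical helpers, the region window size is a free parameter because the document does not state it, and the message text simply echoes the truncated example output.

```python
from typing import Optional

REGION_WARNING_LIMIT = 500_000  # from the example above: warn beyond 500k estimated regions


def estimate_regions(file_size: int, window: int) -> int:
    """Number of fixed-size windows needed to cover the file (ceiling division)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return -(-file_size // window)


def low_threshold_warning(estimated_regions: int, risk_threshold: str) -> Optional[str]:
    """Return a warning message when a low threshold meets a very large region estimate."""
    if risk_threshold == "low" and estimated_regions > REGION_WARNING_LIMIT:
        return f"Warning: Low risk threshold with {estimated_regions:,} estimated regions..."
    return None
```

Running the check before analysis starts lets the command warn the user without reading the file, since the estimate only needs the file size.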

Implementation Checklist

  • [x] Create StreamingExcelExporter with dependency injection
  • [x] Implement StreamingWorkbookEngine for xlsxwriter support
  • [x] Add analyze_file_generator() method to analyzer
  • [x] Create container configuration for DI
  • [x] Document CLI integration approach
  • [ ] Add to main analyze.py command
  • [ ] Update container wiring in application.py
  • [ ] Add unit tests for streaming components
  • [ ] Add integration tests with large files
  • [ ] Update user documentation

Testing Strategy

  1. Unit Tests
      • Mock all injected services
      • Test buffer flushing logic
      • Test Excel row limit handling

  2. Integration Tests
      • Test with files of various sizes
      • Verify memory usage stays constant
      • Validate Excel output format

  3. Performance Tests
      • Benchmark vs current implementation
      • Measure memory usage over time
      • Test with 1GB+, 5GB+, 10GB+ files
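
For the memory checks in the strategy above, Python's standard tracemalloc module gives a simple way to compare peak allocations. The sketch below is illustrative only: fake_regions is a hypothetical stand-in for analyzer output, and a real benchmark would run the actual analyzer and exporter against large files instead.

```python
import tracemalloc
from typing import Callable, Iterator


def peak_bytes(work: Callable[[], object]) -> int:
    """Run work() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        work()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def fake_regions(count: int) -> Iterator[dict]:
    """Hypothetical stand-in for the regions an analyzer would produce."""
    for i in range(count):
        yield {"offset": i * 256, "entropy": 7.5}


def materialise(count: int) -> int:
    return len(list(fake_regions(count)))  # holds every region at once


def stream(count: int) -> int:
    return sum(1 for _ in fake_regions(count))  # holds one region at a time


eager_peak = peak_bytes(lambda: materialise(50_000))
streaming_peak = peak_bytes(lambda: stream(50_000))
```

Repeating the streaming measurement at several input sizes and asserting that the peak does not grow is one way to express "memory usage stays constant" as an automated test.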

Migration Path

  1. Phase 1: Add streaming as opt-in flag (--excel-streaming)
  2. Phase 2: Make streaming default for files > 100MB
  3. Phase 3: Deprecate non-streaming Excel export
  4. Phase 4: Remove legacy code

Backward Compatibility

  • Non-streaming mode remains available via standard --excel-export flag
  • Output format is identical between streaming and non-streaming modes
  • Configuration files remain backward compatible
  • No breaking changes to existing APIs