Large File Streaming
Memory-efficient processing of large Excel files using the streaming API with Consumer callbacks
The Spreadsheet Module provides true streaming capabilities for processing large Excel files without loading the entire workbook into memory. This is essential when working with spreadsheets containing thousands of rows that would otherwise cause memory issues.
Memory Efficient: The streaming API processes files row-by-row, keeping only ~100 rows in memory at any given time. This allows you to process spreadsheets with millions of rows on standard hardware.
🎯 Why Use Streaming?
Traditional Loading vs. Streaming
Traditional Approach (Loads Entire File):
// ⚠️ Loads ALL rows into memory at once
sheet = Spreadsheet( "huge-file.xlsx" );
data = sheet.toArray();  // Could consume gigabytes of RAM
for ( row in data ) {
    // Process each row
}
Streaming Approach (Memory Efficient):
// ✅ Processes rows one at a time
Spreadsheet().process( "huge-file.xlsx", ( row ) => {
    // Process row immediately, then discard from memory
    println( row );
});
When to Use Streaming
Use streaming for:
- Large files (more than ~10,000 rows) 
- Import operations where you process data sequentially 
- Memory-constrained environments where RAM is limited 
- Data transformation that doesn't require random access to all rows 
- Export to databases or other formats where you process row-by-row 
Use traditional loading for:
- Small files (< 10,000 rows) 
- Random access where you need to jump between rows 
- Modification operations where you need to write back to the file 
- Complex operations requiring multiple passes over the data 
🔧 How It Works
Technology Stack
The streaming implementation uses different approaches based on file format:
| Format | Reader | Rows in Memory | Notes |
| --- | --- | --- | --- |
| .xlsx | excel-streaming-reader | ~100 rows | True streaming with Apache POI wrapper |
| .xls | Apache POI (HSSF) | Full file | Legacy format requires full load |
Streaming Configuration
The streaming reader is configured with optimal defaults:
// Internal configuration (you don't need to set this)
StreamingReader.builder()
    .rowCacheSize( 100 )      // Keep 100 rows in memory
    .bufferSize( 4096 )       // 4KB read buffer
    .open( inputStream );
📚 API Methods
The streaming API provides three process() methods that accept Consumer callbacks:
Method 1: Process File by Path
Process all rows from the first sheet in a file:
Spreadsheet().process( filePath, consumer );
Parameters:
- filePath(String) - Absolute or relative path to Excel file
- consumer(Consumer) - Callback function receiving each row as an Array
Method 2: Process Specific Sheet
Process rows from a named sheet:
Spreadsheet().process( filePath, sheetName, consumer );
Parameters:
- filePath(String) - Absolute or relative path to Excel file
- sheetName(String) - Name of the sheet to process
- consumer(Consumer) - Callback function receiving each row as an Array
Method 3: Process Loaded Workbook
Process an already-loaded workbook (still memory-efficient for active sheet):
sheet = Spreadsheet( "file.xlsx" );
sheet.process( consumer );
Parameters:
- consumer(Consumer) - Callback function receiving each row as an Array
Method 3 Note: This method processes the currently active sheet from a workbook already loaded into memory. For truly large files, prefer Methods 1 or 2 which stream directly from disk.
💡 Examples
Basic Row Processing
Process each row and print its contents:
Spreadsheet().process( "sales-data.xlsx", ( row ) => {
    println( "Row: #row.toString()#" );
});
Data Import to Database
Stream large file directly into a database:
Spreadsheet().process( "customers.xlsx", ( row ) => {
    // Skip header row
    if ( row[1] == "Name" ) return;
    // Insert into database
    queryExecute(
        "INSERT INTO customers (name, email, phone) VALUES (?, ?, ?)",
        [ row[1], row[2], row[3] ]
    );
});
Processing Specific Sheet
Stream data from a named sheet:
Spreadsheet().process(
    "multi-sheet-report.xlsx",
    "Sales Data",
    ( row ) => {
        // Process only "Sales Data" sheet
        processOrder( row );
    }
);
Type-Safe Data Processing
Handle different cell value types:
Spreadsheet().process( "data.xlsx", ( row ) => {
    // Access cells by index (1-based)
    productName = row[1];        // String
    quantity = row[2];           // Number (always Double)
    price = row[3];              // Number (always Double)
    inStock = row[4];            // Boolean
    lastUpdated = row[5];        // LocalDateTime
    // Calculate total
    total = quantity * price;
    println( "#productName#: #total#" );
});
Filtering During Streaming
Filter rows during processing so that only the rows you need accumulate in memory:
results = [];
Spreadsheet().process( "orders.xlsx", ( row ) => {
    // Only keep high-value orders
    if ( row[3] > 10000 ) {
        results.append({
            orderId: row[1],
            customer: row[2],
            amount: row[3]
        });
    }
});
println( "Found #results.len()# high-value orders" );
Streaming with Row Counter
Track progress while processing:
rowCount = 0;
Spreadsheet().process( "large-file.xlsx", ( row ) => {
    rowCount++;
    // Log progress every 1000 rows
    if ( rowCount % 1000 == 0 ) {
        println( "Processed #rowCount# rows..." );
    }
    processRow( row );
});
println( "Total rows processed: #rowCount#" );
Error Handling in Streaming
Gracefully handle errors during streaming:
errorCount = 0;
try {
    Spreadsheet().process( "data.xlsx", ( row ) => {
        try {
            // Attempt to process row
            processBusinessLogic( row );
        } catch ( any e ) {
            // Log error but continue processing
            errorCount++;
            writeLog( "Error processing row: #e.message#" );
        }
    });
} catch ( any e ) {
    writeLog( "Fatal error reading file: #e.message#" );
}
println( "Processing complete. Errors: #errorCount#" );
Chaining with Fluent API
Combine streaming with other fluent operations:
// Create summary spreadsheet from streamed data
summary = [];
Spreadsheet()
    .process( "source-data.xlsx", ( row ) => {
        // Aggregate data during streaming
        summary.append( summarizeRow( row ) );
    });
// Create new spreadsheet from aggregated data
Spreadsheet()
    .addRow( [ "Summary", "Count", "Total" ] )
    .formatRow( 1, { bold: true, fgcolor: "blue" } )
    .addRows( summary )
    .autoSizeColumns()
    .save( "summary-report.xlsx" );
🎭 Cell Value Types
The streaming API returns strongly-typed values based on Excel cell types:
| Excel Cell Type | Returned Type | Example |
| --- | --- | --- |
| String | String | "Hello World" |
| Numeric | Double | 123.45 (always Double, even for integers) |
| Boolean | Boolean | true or false |
| Date/Time | LocalDateTime | 2025-01-15T10:30:00 |
| Formula | Evaluated result | Returns the cached formula result value |
| Blank | String | "" (empty string) |
| Error | String | Error code such as "#DIV/0!" |
Formula Handling
Formulas are automatically evaluated using cached results:
// Excel cell contains formula: =A1*B1
Spreadsheet().process( "formulas.xlsx", ( row ) => {
    result = row[3];  // Returns calculated Double value, not formula string
});
⚡ Performance Considerations
Memory Usage
| File Size | Traditional Loading | Streaming |
| --- | --- | --- |
| 10 MB / 50K rows | ~150 MB RAM | ~5 MB RAM |
| 50 MB / 250K rows | ~750 MB RAM | ~5 MB RAM |
| 200 MB / 1M rows | ~3 GB RAM | ~5 MB RAM |
Processing Speed
- Streaming is slightly slower than traditional loading due to on-demand parsing 
- Tradeoff: Slightly longer processing time for dramatically reduced memory usage 
- Best Practice: Use streaming for files > 10,000 rows where memory is a concern 
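To measure the tradeoff on your own data, a simple timing harness is enough. This is a minimal sketch: "huge-file.xlsx" is a placeholder for one of your own files, and getTickCount() is used only for rough wall-clock timing.

```
// Hypothetical timing comparison -- replace "huge-file.xlsx" with your own file
start    = getTickCount();
rowCount = 0;
Spreadsheet().process( "huge-file.xlsx", ( row ) => rowCount++ );
println( "Streaming: #rowCount# rows in #getTickCount() - start# ms" );

start = getTickCount();
data  = Spreadsheet( "huge-file.xlsx" ).toArray();
println( "Traditional: #data.len()# rows in #getTickCount() - start# ms" );
```

Run it once per approach on the same file; the streaming pass should show a modest time penalty against a much flatter memory profile.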
Optimization Tips
DO:
- ✅ Process data directly in the Consumer callback 
- ✅ Write to database or file immediately 
- ✅ Aggregate/summarize data during streaming 
- ✅ Use filtering to reduce memory accumulation 
DON'T:
- ❌ Store all rows in an array during streaming (defeats the purpose) 
- ❌ Make blocking I/O calls in the callback (slows processing) 
- ❌ Modify the source file during streaming 
- ❌ Use streaming for small files (< 10K rows) 
🔄 Streaming vs. Traditional Loading
Comparison Table
| Feature | Streaming | Traditional Loading |
| --- | --- | --- |
| Memory Usage | Constant (~5 MB) | Scales with file size |
| Random Access | ❌ Sequential only | ✅ Full random access |
| Modification | ❌ Read-only | ✅ Read and write |
| XLSX Support | ✅ True streaming | ✅ Full support |
| XLS Support | ⚠️ Full load required | ✅ Full support |
| Best For | Large imports | Small files, editing |
Decision Matrix
Use Streaming when:
- File size > 10,000 rows 
- Sequential processing is sufficient 
- Memory is limited 
- Read-only operations 
- Format is .xlsx 
Use Traditional Loading when:
- File size < 10,000 rows 
- Need to modify cells 
- Random access to rows required 
- Multiple passes over data needed 
- Format is .xls (no streaming available) 
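The matrix above can be folded into a small helper. This is a sketch, not part of the module's API; the 10,000-row threshold, the estimatedRows and needsWrite parameters, and the saveToDatabase() call are all assumptions you would replace with your own logic.

```
// Hypothetical helper -- not part of the module's API
function shouldStream( required string filePath, numeric estimatedRows = 0, boolean needsWrite = false ) {
    // .xls cannot stream, and modifications require a loaded workbook
    if ( filePath.endsWith( ".xls" ) || needsWrite ) return false;
    // Stream once the file is large enough for memory to matter
    return estimatedRows > 10000;
}

if ( shouldStream( "import.xlsx", 250000 ) ) {
    Spreadsheet().process( "import.xlsx", ( row ) => saveToDatabase( row ) );
} else {
    data = Spreadsheet( "import.xlsx" ).toArray();
}
```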
🔐 Best Practices
1. Always Use XLSX for Large Files
// ✅ Good: XLSX supports streaming
Spreadsheet().process( "large-data.xlsx", processor );
// ❌ Bad: XLS requires full load
Spreadsheet().process( "large-data.xls", processor );  // Loads entire file
2. Process Data Immediately
// ✅ Good: Process and discard
Spreadsheet().process( "data.xlsx", ( row ) => {
    saveToDatabase( row );  // Immediate processing
});
// ❌ Bad: Defeats streaming purpose
allRows = [];
Spreadsheet().process( "data.xlsx", ( row ) => {
    allRows.append( row );  // Accumulates in memory!
});
3. Handle Errors Gracefully
// ✅ Good: Error handling per row
Spreadsheet().process( "data.xlsx", ( row ) => {
    try {
        processRow( row );
    } catch ( any e ) {
        logError( e );  // Continue processing
    }
});
4. Use Filtering Early
// ✅ Good: Filter during streaming
Spreadsheet().process( "data.xlsx", ( row ) => {
    if ( row[1] != "Invalid" ) {  // Filter early
        processValid( row );
    }
});
5. Monitor Progress for Large Files
// ✅ Good: Progress tracking
counter = 0;
Spreadsheet().process( "huge-file.xlsx", ( row ) => {
    if ( ++counter % 10000 == 0 ) {
        println( "Processed #counter# rows" );
    }
});
🐛 Troubleshooting
Issue: OutOfMemoryError
Cause: Accumulating rows in memory during streaming
Solution: Process rows immediately without storing
// ❌ Wrong
results = [];
Spreadsheet().process( "data.xlsx", ( row ) => results.append( row ) );
// ✅ Correct
Spreadsheet().process( "data.xlsx", ( row ) => saveToDatabase( row ) );
Issue: Slow Processing
Cause: Blocking I/O or expensive operations in callback
Solution: Optimize callback logic
// ❌ Slow: Multiple database calls per row
Spreadsheet().process( "data.xlsx", ( row ) => {
    queryExecute( "INSERT..." );  // Slow per-row insert
});
// ✅ Fast: Batch inserts
batch = [];
Spreadsheet().process( "data.xlsx", ( row ) => {
    batch.append( row );
    if ( batch.len() >= 1000 ) {
        bulkInsert( batch );
        batch = [];
    }
});
bulkInsert( batch );  // Insert remaining rows
Issue: XLS Files Not Streaming
Cause: Legacy .xls format doesn't support streaming
Solution: Convert to .xlsx or accept full loading
// Convert XLS to XLSX first
Spreadsheet( "old-file.xls" ).save( "new-file.xlsx" );
// Now stream the XLSX version
Spreadsheet().process( "new-file.xlsx", processor );
📖 Related Documentation
- Fluent API Reference - Complete API documentation 
- Data Export - Exporting data to various formats 
- User Guide - General usage patterns 
- Examples - More code examples 
🎓 Summary
The streaming API provides:
- ✅ Memory-efficient processing of large files 
- ✅ Simple Consumer callback pattern 
- ✅ True streaming for .xlsx files via excel-streaming-reader 
- ✅ Type-safe cell value handling 
- ✅ Production-ready for files with millions of rows 
Remember: Use streaming for large files, traditional loading for small files or when you need to modify data.