Document Upload & Processing
Detailed guide on uploading and processing documents in BillionLens.
Supported File Formats
BillionLens supports a wide range of document formats:
Documents
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Native text extraction + OCR for scanned documents |
| Word | .docx, .doc | Microsoft Word documents |
| Excel | .xlsx, .xls | Spreadsheets |
| PowerPoint | .pptx, .ppt | Presentations |
| Text | .txt, .csv, .rtf | Plain text files |
Images (with OCR)
| Format | Extensions | Notes |
|---|---|---|
| Images | .png, .jpg, .jpeg | Text extracted via advanced OCR |
| TIFF | .tiff, .tif | Common for scanned documents |
| Other | .gif, .bmp, .webp | Also supported |
Email
| Format | Extensions | Notes |
|---|---|---|
| Individual Emails | .eml, .msg | Single email messages |
| Email Archives | .pst, .mbox | Automatically extracted to individual emails |
Uploading Documents
Folder Upload
BillionLens preserves your folder structure when you upload a folder:
- Click Select Folder to Upload
- Choose a folder from your computer
- All files and subfolders are uploaded
- Folder structure is preserved in the file browser
Organize your production by custodian or document type before uploading:
Production/
├── John_Smith/
│   ├── Emails/
│   └── Documents/
├── ABC_Corp/
│   ├── Contracts/
│   └── Financials/
Upload Limits
- Maximum file size: 100MB per file
- No limit on number of files
- Supported formats only - unsupported files are flagged
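A local pre-check can catch files that would hit these limits before you upload. The sketch below is not part of BillionLens. It assumes the extension list from the tables above and treats "100MB" as 100 * 1024 * 1024 bytes, which may not match the exact threshold the product uses. The names `check_file` and `scan_folder` are hypothetical.

```python
from pathlib import Path

# Assumption: extensions copied from the Supported File Formats tables.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
    ".txt", ".csv", ".rtf",
    ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".gif", ".bmp", ".webp",
    ".eml", ".msg", ".pst", ".mbox",
}
# Assumption: "100MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return the problems found for one file; an empty list means none."""
    problems = []
    if Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("exceeds 100MB")
    return problems


def scan_folder(root: str) -> dict[str, list[str]]:
    """Walk a folder tree and map each problematic file path to its problems."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            problems = check_file(path.name, path.stat().st_size)
            if problems:
                report[str(path)] = problems
    return report
```

Calling `scan_folder("Production")` on a local folder returns only the files that need attention, so an empty result suggests the folder is ready to upload.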
Duplicate Detection
BillionLens automatically detects duplicate files by comparing their SHA-256 hashes:
- Identical files are not uploaded twice
- Saves storage space
- Maintains single source of truth
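To illustrate how hash-based duplicate detection works in general, the sketch below groups local files by their SHA-256 digest using Python's standard `hashlib`. It is an illustration of the technique, not the BillionLens implementation, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Map each SHA-256 digest to the files sharing it, keeping only real duplicates."""
    by_hash: dict[str, list[Path]] = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash.setdefault(sha256_of_file(path), []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Because the hash is computed from file contents only, two files with different names or folders but identical bytes land in the same group.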
Processing Documents
What Happens During Processing
When you click Process Uploaded Files:
- Upload to Cloud - Files are stored in encrypted cloud storage
- Text Extraction - Text is extracted from each document
- Vector Indexing - Documents are chunked and indexed for semantic search
- Metadata Extraction - Email headers, dates, and other metadata are parsed
Text Extraction Methods
PDF Documents:
- Native text PDFs: Direct text extraction
- Scanned PDFs: Automatic OCR with intelligent layout detection
- Mixed PDFs: Combines both methods
Images:
- All images processed with advanced neural OCR
- Handwriting recognition supported
- Multi-column layouts handled
Email Messages:
- Body text extracted (HTML and plain text)
- Metadata parsed: From, To, CC, BCC, Subject, Date
- Attachments processed separately
Email Archives (PST/MBOX):
- Automatically extracted to individual EML files
- Each email becomes a separate indexed document
- Preserves folder structure from email client
Processing Time
Processing time depends on document count and complexity:
| Document Count | Approximate Time |
|---|---|
| 10 documents | ~30 seconds |
| 100 documents | ~5 minutes |
| 500 documents | ~20 minutes |
| 1,000 documents | ~45 minutes |
Processing runs in the background. You can navigate away and return to check progress.
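For document counts that fall between the rows of the table, a rough estimate can be obtained by interpolating linearly between the listed points. The sketch below assumes the table values are representative and that time grows linearly within each range. Neither assumption is confirmed by BillionLens, and real times depend on document complexity. The function name `estimate_seconds` is hypothetical.

```python
# Points taken from the Processing Time table: (document count, seconds).
# Assumption: (0, 0) is added so small counts scale down proportionally.
TABLE_POINTS = [(0, 0), (10, 30), (100, 300), (500, 1200), (1000, 2700)]


def estimate_seconds(document_count: int) -> float:
    """Piecewise-linear estimate of processing time in seconds."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    for (x0, y0), (x1, y1) in zip(TABLE_POINTS, TABLE_POINTS[1:]):
        if document_count <= x1:
            return y0 + (document_count - x0) * (y1 - y0) / (x1 - x0)
    # Beyond the last row: extend the slope of the final segment (an assumption).
    (x0, y0), (x1, y1) = TABLE_POINTS[-2], TABLE_POINTS[-1]
    return y1 + (document_count - x1) * (y1 - y0) / (x1 - x0)


if __name__ == "__main__":
    for count in (50, 250, 750):
        print(f"{count} documents: about {estimate_seconds(count) / 60:.1f} minutes")
```

Treat the result as a planning aid only. Scanned PDFs and images that need OCR are likely to take longer than native text documents.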
Progress Tracking
During processing, you'll see:
- Progress bar showing percentage complete
- Current file being processed
- Statistics updating in real time
File Metadata
Automatic Metadata Extraction
For each file, BillionLens tracks:
| Field | Description |
|---|---|
| File Name | Original filename |
| File Type | Document format |
| File Size | Size in bytes |
| Upload Date | When file was uploaded |
| Indexed Date | When file was processed |
Email Metadata
For email files (.eml, .msg), additional metadata is extracted:
| Field | Description |
|---|---|
| From | Sender email address |
| To | Recipient email addresses |
| CC | Carbon copy recipients |
| Subject | Email subject line |
| Date Sent | When email was sent |
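To preview which values these fields are likely to hold for a given .eml file, you can read the headers locally with Python's standard `email` package. This is a sketch of the general technique, not the BillionLens parser. The mapping from table fields to header names (for example, "Date Sent" to the `Date` header) is an assumption, and .msg files use a different Outlook format that this code does not read.

```python
from email import policy
from email.parser import BytesParser
from typing import Optional

# Assumption: each table field corresponds to this standard header name.
FIELD_TO_HEADER = {
    "From": "From",
    "To": "To",
    "CC": "Cc",
    "Subject": "Subject",
    "Date Sent": "Date",
}


def parse_eml_headers(raw: bytes) -> dict[str, Optional[str]]:
    """Extract the metadata fields from the raw bytes of an .eml file."""
    message = BytesParser(policy=policy.default).parsebytes(raw, headersonly=True)
    result = {}
    for field, header in FIELD_TO_HEADER.items():
        value = message[header]  # None when the header is absent
        result[field] = str(value) if value is not None else None
    return result


if __name__ == "__main__":
    sample = (
        b"From: alice@example.com\r\n"
        b"To: bob@example.com\r\n"
        b"Subject: Quarterly report\r\n"
        b"Date: Thu, 15 Aug 2024 10:00:00 +0000\r\n"
        b"\r\n"
        b"Body text\r\n"
    )
    print(parse_eml_headers(sample))
```

To check a real file, pass `Path("message.eml").read_bytes()` (with `pathlib.Path` imported) in place of the sample bytes.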
Production Numbers
BillionLens can recognize production/Bates numbers in filenames:
Supported formats:
- PROD001234.pdf → Production Number: PROD001234
- ABC-0001.pdf → Production Number: ABC-0001
- DEF_00123.pdf → Production Number: DEF_00123
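The three patterns above share a shape: an uppercase letter prefix, an optional hyphen or underscore, and a run of digits at the start of the filename. The sketch below encodes that shape as a regular expression so you can check filenames before uploading. The pattern is inferred from the examples only. The rules BillionLens actually applies are not documented here and may be broader or stricter, and `extract_production_number` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: prefix of uppercase letters, optional "-" or "_", then digits,
# followed by "_", ".", or the end of the name.
PRODUCTION_NUMBER = re.compile(r"^(?P<number>[A-Z]+[-_]?\d+)(?=[_.]|$)")


def extract_production_number(filename: str) -> Optional[str]:
    """Return the leading production number in a filename, or None if absent."""
    match = PRODUCTION_NUMBER.match(filename)
    return match.group("number") if match else None


if __name__ == "__main__":
    for name in ("PROD001234.pdf", "ABC-0001.pdf", "DEF_00123.pdf", "notes.txt"):
        print(name, "->", extract_production_number(name))
```

The lookahead at the end keeps descriptive suffixes such as `_Contract_Agreement` out of the captured number.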
Handling Failed Files
Common Failure Reasons
- Password-protected files - PDFs with passwords cannot be processed
- Corrupted files - Damaged files that cannot be read
- Unsupported encoding - Text files with unusual character encoding
- Empty files - Files with no extractable content
Retrying Failed Files
- Check the Failed count in file statistics
- Click Browse Files to see which files failed
- Fix the source file if possible (remove password, re-export)
- Re-upload the fixed file
- Run Process Uploaded Files again
Best Practices
Before Uploading
- Organize folders by custodian or document type
- Remove duplicates at the source if possible
- Include production numbers in filenames
- Remove password protection from PDFs
Naming Conventions
Recommended:
- PROD001234_Contract_Agreement.pdf
- JSmith_Email_2024-08-15.eml
- ABC-0001_Financial_Statement.xlsx
Avoid:
- Spaces in production numbers
- Very long filenames
- Special characters: < > : " / \ | ? *
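The naming guidance above can be checked automatically before an upload. The sketch below flags the three issues listed under "Avoid". The 100-character length threshold is a hypothetical value, since the guide does not define "very long", and the space check assumes production numbers follow the uppercase-prefix-plus-digits shape shown in the recommended examples.

```python
import re

# Characters listed under "Avoid" in the naming conventions.
FORBIDDEN_CHARACTERS = set('<>:"/\\|?*')
# Hypothetical threshold: the guide does not say how long is too long.
MAX_NAME_LENGTH = 100
# Assumption: a production number is an uppercase prefix followed by digits,
# so "PROD 001234" indicates a space inside the number.
SPACE_IN_NUMBER = re.compile(r"^[A-Z]+ +\d+")


def filename_warnings(filename: str) -> list[str]:
    """Return naming problems for one filename; an empty list means none found."""
    warnings = []
    found = sorted(FORBIDDEN_CHARACTERS.intersection(filename))
    if found:
        warnings.append("special characters: " + " ".join(found))
    if len(filename) > MAX_NAME_LENGTH:
        warnings.append("very long filename")
    if SPACE_IN_NUMBER.match(filename):
        warnings.append("space in production number")
    return warnings
```

Pass the bare filename rather than a full path, because path separators are themselves in the forbidden set.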
Large Productions
For productions over 1,000 documents:
- Upload in batches of 500-1,000 files
- Process each batch before uploading more
- Monitor for failed files
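If you prepare batches with a script, splitting a file list into fixed-size groups is straightforward. The sketch below is a generic helper and does not call any BillionLens API. The default of 500 comes from the lower end of the recommended range, and the name `batches` is hypothetical.

```python
from typing import Iterator, Sequence


def batches(files: Sequence[str], batch_size: int = 500) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size files, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


if __name__ == "__main__":
    production = [f"PROD{i:06d}.pdf" for i in range(1, 1201)]
    for number, batch in enumerate(batches(production), start=1):
        print(f"Batch {number}: {len(batch)} files, starting at {batch[0]}")
```

Sorting the file list by custodian folder before batching keeps related documents in the same batch, which makes failed files easier to trace.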
Next: Learn about AI Chat & Research.