Document Upload & Processing

A detailed guide to uploading and processing documents in BillionLens.

Supported File Formats

BillionLens supports a wide range of document formats:

Documents

Format	Extensions	Notes
PDF	`.pdf`	Native text extraction + OCR for scanned documents
Word	`.docx`, `.doc`	Microsoft Word documents
Excel	`.xlsx`, `.xls`	Spreadsheets
PowerPoint	`.pptx`, `.ppt`	Presentations
Text	`.txt`, `.csv`, `.rtf`	Plain text, CSV, and rich text files

Images (with OCR)

Format	Extensions	Notes
Images	`.png`, `.jpg`, `.jpeg`	Text extracted via advanced OCR
TIFF	`.tiff`, `.tif`	Common for scanned documents
Other	`.gif`, `.bmp`, `.webp`	Also supported

Email

Format	Extensions	Notes
Individual Emails	`.eml`, `.msg`	Single email messages
Email Archives	`.pst`, `.mbox`	Automatically extracted to individual emails
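
The extension tables above can be turned into a simple pre-upload check. The sketch below is illustrative only: `classify_file` and the category names are hypothetical, and BillionLens's own format detection may work differently (for example, by inspecting file contents rather than extensions).

```python
from pathlib import Path

# Extensions copied from the tables above; the category names are illustrative.
SUPPORTED_EXTENSIONS = {
    "document": {".pdf", ".docx", ".doc", ".xlsx", ".xls",
                 ".pptx", ".ppt", ".txt", ".csv", ".rtf"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".tif",
              ".gif", ".bmp", ".webp"},
    "email": {".eml", ".msg", ".pst", ".mbox"},
}


def classify_file(path):
    """Return the category for a path, or None if its extension is not listed."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

Running this over a folder before uploading gives an early list of files that are likely to be flagged as unsupported.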

Uploading Documents

Folder Upload

BillionLens preserves your folder structure when uploading:

Click Select Folder to Upload
Choose a folder from your computer
All files and subfolders are uploaded
Folder structure is preserved in the file browser

Folder Organization

Organize your production by custodian or document type before uploading:

Production/
├── John_Smith/
│   ├── Emails/
│   └── Documents/
└── ABC_Corp/
    ├── Contracts/
    └── Financials/

Upload Limits

Maximum file size: 100 MB per file
No limit on number of files
Supported formats only - unsupported files are flagged
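
To find files that exceed the size limit before uploading, a local folder can be scanned in advance. This is a minimal sketch, not part of BillionLens: the function names are hypothetical, and it assumes the 100 MB limit means 100 × 1024 × 1024 bytes (the product may count 100,000,000 bytes instead).

```python
import os
from pathlib import Path

# Assumption: the 100 MB limit is interpreted as 100 MiB.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def is_within_size_limit(size_bytes, limit=MAX_FILE_SIZE_BYTES):
    """True if a file of this size is at or under the limit."""
    return size_bytes <= limit


def find_oversized_files(root, limit=MAX_FILE_SIZE_BYTES):
    """Walk root and return the relative paths of files above the limit."""
    root = Path(root)
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not is_within_size_limit(path.stat().st_size, limit):
                # Relative paths mirror the folder structure shown in the file browser.
                oversized.append(path.relative_to(root).as_posix())
    return sorted(oversized)
```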

Duplicate Detection

BillionLens automatically detects duplicate files by comparing their SHA-256 hashes:

Identical files are not uploaded twice
Saves storage space
Maintains single source of truth
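
Hash-based duplicate detection can be reproduced locally to find duplicates at the source. The sketch below uses Python's standard `hashlib`; the function names are hypothetical, and it illustrates the general technique rather than BillionLens's internal implementation.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths by content hash and keep only groups with more than one file."""
    by_hash = {}
    for path in paths:
        by_hash.setdefault(sha256_of_file(path), []).append(path)
    return {h: group for h, group in by_hash.items() if len(group) > 1}
```

Two files with different names but byte-identical contents produce the same digest, so renaming a file does not hide it from this kind of check.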

Processing Documents

What Happens During Processing

When you click Process Uploaded Files, BillionLens performs the following steps:

Upload to Cloud - Files are stored in encrypted cloud storage
Text Extraction - Text is extracted from each document
Vector Indexing - Documents are chunked and indexed for semantic search
Metadata Extraction - Email headers, dates, and other metadata are parsed
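
To illustrate what "chunked" means in the Vector Indexing step, the sketch below splits text into overlapping word windows, a common approach before indexing for semantic search. The chunk size, the overlap, and the word-based splitting are all assumptions; this guide does not describe how BillionLens actually chunks documents.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words, sharing overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.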

Text Extraction Methods

PDF Documents:

Native text PDFs: Direct text extraction
Scanned PDFs: Automatic OCR with intelligent layout detection
Mixed PDFs: Combines both methods

Images:

All images processed with advanced neural OCR
Handwriting recognition supported
Multi-column layouts handled

Email Messages:

Body text extracted (HTML and plain text)
Metadata parsed: From, To, CC, BCC, Subject, Date
Attachments processed separately

Email Archives (PST/MBOX):

Automatically extracted to individual EML files
Each email becomes a separate indexed document
Preserves folder structure from email client
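
For MBOX archives, the extraction to individual messages can be approximated with Python's standard `mailbox` module. This is a sketch under stated assumptions: `split_mbox_to_eml` and the output file naming are hypothetical, it does not handle PST files (the standard library has no PST reader), and it does not reproduce folder structure.

```python
import mailbox
from pathlib import Path


def split_mbox_to_eml(mbox_path, output_dir):
    """Write each message in an MBOX archive to its own .eml file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False raises an error instead of silently creating a missing archive.
    archive = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, message in enumerate(archive, start=1):
            target = output_dir / f"message_{index:05d}.eml"  # hypothetical naming
            target.write_bytes(message.as_bytes())
            written.append(target)
    finally:
        archive.close()
    return written
```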

Processing Time

Processing time depends on document count and complexity:

Document Count	Approximate Time
10 documents	~30 seconds
100 documents	~5 minutes
500 documents	~20 minutes
1,000 documents	~45 minutes
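
For document counts between the rows of the table, a rough estimate can be obtained by linear interpolation. The sketch below treats the table values as exact data points, which they are not; actual times also depend on document complexity, so the result is only a planning aid. The function name and the extrapolation beyond 1,000 documents are assumptions.

```python
from bisect import bisect_left

# (document count, approximate seconds) taken from the table above.
BENCHMARKS = [(10, 30), (100, 300), (500, 1200), (1000, 2700)]


def estimate_seconds(count):
    """Estimate processing time in seconds by interpolating the table."""
    if count <= 0:
        return 0.0
    counts = [c for c, _ in BENCHMARKS]
    first_count, first_seconds = BENCHMARKS[0]
    last_count, last_seconds = BENCHMARKS[-1]
    if count <= first_count:
        return first_seconds * count / first_count  # scale down proportionally
    if count >= last_count:
        return last_seconds * count / last_count  # assumption: same rate beyond the table
    i = bisect_left(counts, count)
    (x0, y0), (x1, y1) = BENCHMARKS[i - 1], BENCHMARKS[i]
    return y0 + (y1 - y0) * (count - x0) / (x1 - x0)
```

For example, `estimate_seconds(250)` returns 637.5, or roughly 10 to 11 minutes.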

Background Processing

Processing runs in the background. You can navigate away and return to check progress.

Progress Tracking

During processing, you'll see:

Progress bar showing percentage complete
Current file being processed
Statistics updating in real time

File Metadata

Automatic Metadata Extraction

For each file, BillionLens tracks:

Field	Description
File Name	Original filename
File Type	Document format
File Size	Size in bytes
Upload Date	When file was uploaded
Indexed Date	When file was processed

Email Metadata

For email files (`.eml`, `.msg`), additional metadata is extracted:

Field	Description
From	Sender email address
To	Recipient email addresses
CC	Carbon copy recipients
Subject	Email subject line
Date Sent	When email was sent
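
The fields in this table correspond to standard email headers. As an illustration, the sketch below reads them from the raw bytes of an `.eml` file using Python's standard `email` package. The function name and the dictionary keys are hypothetical, `.msg` files (a proprietary Outlook format) are not covered, and BillionLens's own parser may behave differently.

```python
from email import policy
from email.parser import BytesParser
from email.utils import getaddresses, parsedate_to_datetime


def parse_eml_metadata(raw_bytes):
    """Extract From, To, CC, Subject, and Date from raw .eml bytes."""
    message = BytesParser(policy=policy.default).parsebytes(raw_bytes)

    def addresses(header):
        # A header may appear more than once; collect every address it lists.
        values = [str(value) for value in message.get_all(header, [])]
        return [address for _name, address in getaddresses(values)]

    date_header = message.get("Date")
    return {
        "from": addresses("From"),
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "subject": str(message.get("Subject", "")),
        "date_sent": parsedate_to_datetime(str(date_header)) if date_header else None,
    }
```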

Production Numbers

BillionLens can recognize production/Bates numbers in filenames:

Supported formats:

PROD001234.pdf → Production Number: PROD001234
ABC-0001.pdf → Production Number: ABC-0001
DEF_00123.pdf → Production Number: DEF_00123
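
The three examples share a pattern: a run of letters, an optional hyphen or underscore, and a run of digits at the start of the filename. The regular expression below is inferred from those examples only; BillionLens's actual recognition rules are not documented here, so treat `extract_production_number` as a hypothetical helper for checking filenames before upload.

```python
import re
from pathlib import Path

# Letters, an optional "-" or "_", then digits, not followed by another letter or digit.
PRODUCTION_NUMBER = re.compile(r"[A-Za-z]+[-_]?\d+(?![A-Za-z0-9])")


def extract_production_number(filename):
    """Return the leading production number in a filename, or None."""
    stem = Path(filename).stem  # drop the extension
    match = PRODUCTION_NUMBER.match(stem)
    return match.group(0) if match else None
```

With this pattern, `PROD001234_Contract_Agreement.pdf` yields `PROD001234`, while `JSmith_Email_2024-08-15.eml` yields no production number.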

Handling Failed Files

Common Failure Reasons

Password-protected files - PDFs that require a password cannot be processed
Corrupted files - Damaged files that cannot be read
Unsupported encoding - Text files with unusual character encoding
Empty files - Files with no extractable content

Retrying Failed Files

Check the Failed count in file statistics
Click Browse Files to see which files failed
Fix the source file if possible (remove password, re-export)
Re-upload the fixed file
Run Process Uploaded Files again

Best Practices

Before Uploading

Organize folders by custodian or document type
Remove duplicates at the source if possible
Include production numbers in filenames
Remove password protection from PDFs

Naming Conventions

Recommended:

PROD001234_Contract_Agreement.pdf
JSmith_Email_2024-08-15.eml
ABC-0001_Financial_Statement.xlsx

Avoid:

Spaces in production numbers
Very long filenames
Special characters: < > : " / \ | ? *
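
The naming guidance above can be checked automatically before uploading. In this sketch the character list comes from the guidance, but the 150-character threshold for "very long" is an arbitrary assumption (no specific limit is stated), and `filename_problems` is a hypothetical helper.

```python
# Characters listed under "Avoid" above.
FORBIDDEN_CHARACTERS = set('<>:"/\\|?*')


def filename_problems(filename, max_length=150):
    """Return a list of naming problems; an empty list means none were found."""
    problems = []
    bad = sorted(set(filename) & FORBIDDEN_CHARACTERS)
    if bad:
        problems.append("special characters: " + " ".join(bad))
    if len(filename) > max_length:  # assumption: 150 stands in for "very long"
        problems.append(f"longer than {max_length} characters")
    return problems
```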

Large Productions

For productions over 1,000 documents:

Upload in batches of 500-1,000 files
Process each batch before uploading more
Monitor for failed files
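
Splitting a large production into batches is simple to script. The sketch below groups a list of file paths into batches of at most 500, the lower end of the recommended range; `make_batches` is a hypothetical helper, and the upload itself still happens through the BillionLens interface.

```python
def make_batches(paths, batch_size=500):
    """Split a list of paths into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [paths[start:start + batch_size]
            for start in range(0, len(paths), batch_size)]
```

A production of 1,200 files would be split into batches of 500, 500, and 200.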

Next: Learn about AI Chat & Research.
