
Document Upload & Processing

Detailed guide on uploading and processing documents in BillionLens.

Supported File Formats

BillionLens supports a wide range of document formats:

Documents

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | Native text extraction + OCR for scanned documents |
| Word | .docx, .doc | Microsoft Word documents |
| Excel | .xlsx, .xls | Spreadsheets |
| PowerPoint | .pptx, .ppt | Presentations |
| Text | .txt, .csv, .rtf | Plain text files |

Images (with OCR)

| Format | Extensions | Notes |
| --- | --- | --- |
| Images | .png, .jpg, .jpeg | Text extracted via advanced OCR |
| TIFF | .tiff, .tif | Common for scanned documents |
| Other | .gif, .bmp, .webp | Also supported |

Email

| Format | Extensions | Notes |
| --- | --- | --- |
| Individual Emails | .eml, .msg | Single email messages |
| Email Archives | .pst, .mbox | Automatically extracted to individual emails |
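
Before uploading, it can help to check locally which files fall outside the supported list. The following is a minimal sketch, not part of BillionLens: the extension set is copied from the tables above, and the name `is_supported` is hypothetical.

```python
from pathlib import Path

# Extensions copied from the supported-format tables in this guide.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
    ".txt", ".csv", ".rtf",
    ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".gif", ".bmp", ".webp",
    ".eml", ".msg", ".pst", ".mbox",
}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported list (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("Report.PDF")` is true, while `is_supported("archive.zip")` is false, so the second file would be flagged on upload.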

Uploading Documents

Folder Upload

BillionLens preserves your folder structure when uploading:

  1. Click Select Folder to Upload
  2. Choose a folder from your computer
  3. All files and subfolders are uploaded
  4. Folder structure is preserved in the file browser

Folder Organization

Organize your production by custodian or document type before uploading:

Production/
├── John_Smith/
│   ├── Emails/
│   └── Documents/
└── ABC_Corp/
    ├── Contracts/
    └── Financials/

Upload Limits

  • Maximum file size: 100MB per file
  • No limit on number of files
  • Supported formats only - unsupported files are flagged
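
To find files that would exceed the size limit before starting an upload, a local scan along these lines may be useful. This is a sketch under assumptions: it treats "100MB" as 100 × 1024 × 1024 bytes (the guide does not say whether the limit is decimal or binary), and `find_oversized` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: "100MB" is interpreted as 100 MiB; adjust if the limit is decimal.
MAX_FILE_BYTES = 100 * 1024 * 1024


def find_oversized(root: str, limit: int = MAX_FILE_BYTES) -> list[Path]:
    """Return all files under root (recursively) larger than limit, sorted by path."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Calling `find_oversized("Production")` returns the paths that would need to be split or compressed before upload.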

Duplicate Detection

BillionLens automatically detects duplicate files by comparing SHA-256 hashes of their contents:

  • Identical files are not uploaded twice
  • Saves storage space
  • Maintains single source of truth
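
The same idea can be applied locally to see which files in a folder are byte-for-byte identical before uploading. This is an illustrative sketch, not BillionLens's own implementation; the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under root by SHA-256 digest; keep only groups with 2+ files."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Because the hash covers file contents only, two files with different names but identical bytes land in the same group.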

Processing Documents

What Happens During Processing

When you click Process Uploaded Files:

  1. Upload to Cloud - Files are stored in encrypted cloud storage
  2. Text Extraction - Text is extracted from each document
  3. Vector Indexing - Documents are chunked and indexed for semantic search
  4. Metadata Extraction - Email headers, dates, and other metadata are parsed

Text Extraction Methods

PDF Documents:

  • Native text PDFs: Direct text extraction
  • Scanned PDFs: Automatic OCR with intelligent layout detection
  • Mixed PDFs: Combines both methods

Images:

  • All images processed with advanced neural OCR
  • Handwriting recognition supported
  • Multi-column layouts handled

Email Messages:

  • Body text extracted (HTML and plain text)
  • Metadata parsed: From, To, CC, BCC, Subject, Date
  • Attachments processed separately

Email Archives (PST/MBOX):

  • Automatically extracted to individual EML files
  • Each email becomes a separate indexed document
  • Preserves folder structure from email client
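
For MBOX archives, the extraction step is conceptually similar to the following sketch, which uses Python's standard `mailbox` module to write each message to its own EML file. It is an illustration only: the output naming scheme and the `split_mbox` helper are assumptions, folder structure is not reconstructed, and PST archives are not covered because the standard library cannot read them.

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path: str, out_dir: str) -> list[Path]:
    """Write every message in an mbox archive to its own .eml file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    # create=False raises an error instead of silently creating an empty archive.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box, start=1):
            target = out / f"message_{index:05d}.eml"  # hypothetical naming scheme
            target.write_bytes(message.as_bytes())
            written.append(target)
    finally:
        box.close()
    return written
```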

Processing Time

Processing time depends on document count and complexity:

| Document Count | Approximate Time |
| --- | --- |
| 10 documents | ~30 seconds |
| 100 documents | ~5 minutes |
| 500 documents | ~20 minutes |
| 1,000 documents | ~45 minutes |
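
For counts that fall between the listed rows, a rough estimate can be obtained by interpolating linearly between the reference points. This is only a back-of-the-envelope sketch: it assumes time grows linearly between rows and extends the nearest segment outside the listed range, and actual times also depend on document complexity.

```python
from bisect import bisect_left

# (document count, approximate seconds), taken from the processing-time table.
REFERENCE_POINTS = [(10, 30), (100, 300), (500, 1200), (1000, 2700)]


def estimate_seconds(doc_count: int) -> float:
    """Piecewise-linear estimate of processing time for doc_count documents."""
    counts = [count for count, _ in REFERENCE_POINTS]
    # Choose the segment containing doc_count, or the nearest end segment.
    hi = min(max(bisect_left(counts, doc_count), 1), len(counts) - 1)
    (x0, y0), (x1, y1) = REFERENCE_POINTS[hi - 1], REFERENCE_POINTS[hi]
    return y0 + (doc_count - x0) * (y1 - y0) / (x1 - x0)
```

Under these assumptions, 300 documents comes out at about 750 seconds, or roughly 12.5 minutes.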

Background Processing

Processing runs in the background. You can navigate away and return to check progress.

Progress Tracking

During processing, you'll see:

  • Progress bar showing percentage complete
  • Current file being processed
  • Statistics updating in real-time

File Metadata

Automatic Metadata Extraction

For each file, BillionLens tracks:

| Field | Description |
| --- | --- |
| File Name | Original filename |
| File Type | Document format |
| File Size | Size in bytes |
| Upload Date | When file was uploaded |
| Indexed Date | When file was processed |

Email Metadata

For email files (.eml, .msg), additional metadata is extracted:

| Field | Description |
| --- | --- |
| From | Sender email address |
| To | Recipient email addresses |
| CC | Carbon copy recipients |
| Subject | Email subject line |
| Date Sent | When email was sent |
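
To preview which values these fields would hold for a given .eml file, the headers can be read with Python's standard `email` package. This is a sketch for illustration, not the parser BillionLens uses; the dictionary keys and the `read_email_metadata` name are hypothetical, and .msg files (Outlook's binary format) are not handled.

```python
from datetime import datetime
from email import policy
from email.parser import BytesParser
from email.utils import parsedate_to_datetime
from typing import Optional


def read_email_metadata(raw: bytes) -> dict[str, Optional[object]]:
    """Extract From, To, CC, Subject, and Date from raw .eml bytes."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)

    def header(name: str) -> Optional[str]:
        value = msg[name]
        return str(value) if value is not None else None

    date_text = header("Date")
    date_sent: Optional[datetime] = (
        parsedate_to_datetime(date_text) if date_text else None
    )
    return {
        "from": header("From"),
        "to": header("To"),
        "cc": header("Cc"),
        "subject": header("Subject"),
        "date_sent": date_sent,
    }
```

A typical call is `read_email_metadata(Path("message.eml").read_bytes())`, with `Path` imported from `pathlib`.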

Production Numbers

BillionLens can recognize production/Bates numbers in filenames:

Supported formats:

  • PROD001234.pdf → Production Number: PROD001234
  • ABC-0001.pdf → Production Number: ABC-0001
  • DEF_00123.pdf → Production Number: DEF_00123
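
The three examples share a shape: a letter prefix, an optional hyphen or underscore, then digits. A regular expression like the one below matches that shape. It is a guess at the pattern based only on the examples above, not the rule BillionLens actually applies, and `production_number` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional

# Letters, an optional "-" or "_", then digits, ending at "_" or the end of the name.
PRODUCTION_NUMBER = re.compile(r"^[A-Za-z]+[-_]?\d+(?=_|$)")


def production_number(filename: str) -> Optional[str]:
    """Return the leading production number in a filename, or None if absent."""
    match = PRODUCTION_NUMBER.match(Path(filename).stem)
    return match.group(0) if match else None
```

With this pattern, `PROD001234_Contract_Agreement.pdf` yields `PROD001234`, while a name such as `notes.txt` yields `None`.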

Handling Failed Files

Common Failure Reasons

  • Password-protected files - PDFs with passwords cannot be processed
  • Corrupted files - Damaged files that cannot be read
  • Unsupported encoding - Text files with unusual character encoding
  • Empty files - Files with no extractable content

Retrying Failed Files

  1. Check the Failed count in file statistics
  2. Click Browse Files to see which files failed
  3. Fix the source file if possible (remove password, re-export)
  4. Re-upload the fixed file
  5. Run Process Uploaded Files again

Best Practices

Before Uploading

  1. Organize folders by custodian or document type
  2. Remove duplicates at the source if possible
  3. Include production numbers in filenames
  4. Remove password protection from PDFs

Naming Conventions

Recommended:

  • PROD001234_Contract_Agreement.pdf
  • JSmith_Email_2024-08-15.eml
  • ABC-0001_Financial_Statement.xlsx

Avoid:

  • Spaces in production numbers
  • Very long filenames
  • Special characters: < > : " / \ | ? *
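
These checks are easy to run over a folder before uploading. The sketch below flags the special characters listed above and overly long names. The 100-character threshold is an assumption, since the guide only says "very long", and the check for spaces inside production numbers is left out.

```python
# Characters the guide recommends avoiding in filenames.
FORBIDDEN_CHARS = set('<>:"/\\|?*')

# Hypothetical threshold; the guide does not define "very long".
MAX_NAME_LENGTH = 100


def filename_problems(filename: str) -> list[str]:
    """Return a list of human-readable naming problems (empty if none found)."""
    problems: list[str] = []
    bad = sorted(set(filename) & FORBIDDEN_CHARS)
    if bad:
        problems.append("special characters: " + " ".join(bad))
    if len(filename) > MAX_NAME_LENGTH:
        problems.append(f"name longer than {MAX_NAME_LENGTH} characters")
    return problems
```

Pass only the file's name, not its full path, since path separators are themselves on the forbidden list.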

Large Productions

For productions over 1,000 documents:

  • Upload in batches of 500-1,000 files
  • Process each batch before uploading more
  • Monitor for failed files
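
Splitting a large file list into upload batches can be sketched as follows. The default of 500 comes from the lower end of the recommended range; the `batches` helper is hypothetical and only groups paths, leaving upload and processing to the BillionLens interface.

```python
from typing import Iterator, Sequence


def batches(files: Sequence[str], size: int = 500) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of files, each containing at most size items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(files), size):
        yield files[start:start + size]
```

For a production of 1,200 files, this yields batches of 500, 500, and 200 files.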
