How AI Agents Handle Unstructured Data from Customer Attachments

Handling unstructured data coming from your customers needs special treatment.

Even in the AI era.

The “impossible before, hard now” task of extracting structured data from unstructured customer attachments (usually PDFs) takes several layers of processing to get right in production.

1. Is the Document Even Relevant?

Before anything else, an agent needs to check whether the incoming file makes sense in context.

A customer who was asked for a bank statement may send you a scanned picture of a dog, intentionally or not.

An AI agent should be ready for this. Ideally it detects an irrelevant document early — before any expensive vision model tokens are burned on content that has nothing to do with the task.

This first gate is cheap to implement and saves compute and cost in every later step.
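A minimal sketch of such a gate, assuming the request was for a bank statement and that some cheap text is already available (for example from the PDF text layer). The keyword set, the threshold, and the function names are hypothetical and would need tuning per use case. A pure image with no text would need a different cheap check, such as a small classifier on a thumbnail, which this sketch does not cover.

```python
import re

# Hypothetical keyword set for a bank-statement request; tune it per use case.
EXPECTED_KEYWORDS = frozenset(
    {"statement", "balance", "account", "iban", "transaction"}
)


def looks_like_pdf(data: bytes) -> bool:
    """Well-formed PDF files begin with the '%PDF-' magic bytes."""
    return data.startswith(b"%PDF-")


def relevance_score(text: str, keywords: frozenset = EXPECTED_KEYWORDS) -> float:
    """Fraction of expected keywords that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & keywords) / len(keywords)


def is_relevant(text: str, threshold: float = 0.4) -> bool:
    """Cheap yes/no decision made before any vision model is called."""
    return relevance_score(text) >= threshold
```

Both checks run in microseconds, so they can sit in front of every upload. Anything that fails can be rejected or sent back to the customer with a request for the right file.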

2. Document Noise Filtering

Real-world documents are overloaded with content that is irrelevant to the use case:

GDPR disclaimers on agreements
Caution clauses and legal boilerplate
Repeated headers and footers

The challenge is that the agent must filter out this noise without accidentally cutting content that only looks like noise but is actually relevant.

This requires more care than it seems. A naive filter will occasionally remove something important.
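One way to approach repeated headers and footers is sketched below, assuming the document has already been split into pages and lines. A line counts as noise only if it repeats across most pages, and a protective rule keeps any line that contains a digit, since amounts and dates often repeat too. Both the 60% share and the digit rule are assumptions for illustration, not a recommended setting.

```python
import re
from collections import Counter

# Assumption: a line containing a digit may carry data (amounts, dates, IDs),
# so it is never removed, even when it repeats on many pages.
PROTECT = re.compile(r"\d")


def strip_repeated_lines(
    pages: list[list[str]], min_share: float = 0.6
) -> list[list[str]]:
    """Drop lines that repeat on most pages, unless they look like data."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page if line.strip()})

    # Require at least two occurrences so single-page documents stay intact.
    cutoff = max(2, min_share * len(pages))
    noise = {
        line
        for line, n in counts.items()
        if n >= cutoff and not PROTECT.search(line)
    }
    return [[line for line in page if line.strip() not in noise] for page in pages]
```

The filter errs on the side of keeping content: a page number such as "Page 1 of 3" survives because it contains digits. Leaving some noise in is usually the cheaper mistake compared with deleting a balance.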

3. Proper Structure Understanding

Extracting data from a PDF that was created by exporting a Word document is straightforward. The underlying text is there, parseable, clean.

Things get harder when the PDF is a scanned paper document — which is basically just an image. There is no text layer to read. The agent needs to understand the visual structure of the page: tables, columns, handwritten annotations, stamps.

This is where structure understanding becomes a core capability, not an afterthought.
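Telling the two kinds of PDF apart can be sketched as a simple text-density check. The sketch assumes the per-page text has already been pulled out by whatever PDF library is in use (for instance pypdf's page.extract_text()). The 50-character minimum and the "at least half of the pages" rule are hypothetical thresholds.

```python
def has_text_layer(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Return True when at least half of the pages carry extractable text.

    Scanned documents typically yield empty or near-empty strings here,
    because each page is only an image.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= 0.5
```

A document that fails this check is a candidate for the visual route, where tables, columns, stamps and handwriting have to be read from the image itself.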

4. Smooth Data Extraction

Whether you are working with a proper text-based PDF or a scanned image, the extraction step needs the right tool for each case.

Simple PDFs can be parsed with lightweight PDF reading libraries. But smooth extraction from image-based documents works best with LLMs that have strong vision capabilities — models like Gemini or GLM-4V handle this well.

Picking the right extraction method per document type matters a lot for accuracy and cost.
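The routing itself can stay small. The sketch below assumes two extractors are passed in as plain functions: parse_text stands in for a lightweight PDF library and parse_vision for a call to a vision-capable LLM. Both names are placeholders, not real APIs.

```python
from typing import Callable

# An extractor takes the raw file bytes and returns the extracted text.
Extractor = Callable[[bytes], str]


def route_extraction(
    data: bytes,
    text_layer: bool,
    parse_text: Extractor,
    parse_vision: Extractor,
) -> tuple[str, str]:
    """Return (method, text), trying the cheap parser before the vision model."""
    if text_layer:
        text = parse_text(data)
        if text.strip():
            return "text", text
    # No text layer, or the cheap parser came back empty: use the vision model.
    return "vision", parse_vision(data)
```

Returning the method name next to the text makes it easy to log how often the expensive path is taken, which is the number that drives cost.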

5. Specialization Beats Generalization

With LLMs it is technically possible to build a general-purpose agent that reads and understands all possible document types. That works well for a proof of concept.

Production systems with real customers are all about edge cases.

The best approach is to narrow down what your AI agent needs to handle as much as possible. The more specialized it becomes, the better it covers your customers’ actual use cases — and the more reliably it handles the weird inputs that will inevitably show up.
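One practical benefit of narrowing the scope is that the output can be checked against domain rules. As an illustration only, here is a sketch for an agent that handles nothing but bank statements. The BankStatement fields and the checks are hypothetical, not a complete schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BankStatement:
    """Hypothetical narrow schema: only what this one use case needs."""
    opening_balance: Decimal
    closing_balance: Decimal
    transactions: tuple[Decimal, ...]


def validate(stmt: BankStatement) -> list[str]:
    """Return a list of problems; an empty list means the extraction is consistent."""
    problems = []
    if not stmt.transactions:
        problems.append("no transactions found")
    expected = stmt.opening_balance + sum(stmt.transactions, Decimal("0"))
    if expected != stmt.closing_balance:
        problems.append(
            f"balance mismatch: expected {expected}, got {stmt.closing_balance}"
        )
    return problems
```

A general-purpose agent has no such invariant to lean on. A specialized one can catch a misread digit simply because the numbers no longer add up.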

An expert agent beats a generalist agent in production, every time.

Kamil Kwapisz