Building a Simple RAG + LLM Workflow for Procurement Documents

Recently, I worked on a small proof-of-concept project to explore how Retrieval-Augmented Generation (RAG) can help users search and reuse information from procurement and sustainment documents.

The goal was not to generate final procurement decisions automatically. Instead, the idea was to help users quickly find historical examples, related answers, and useful context from previous documents.

Project Goal

The main objectives were to:

Search a collection of procurement documents
Extract meaningful text
Chunk the content into smaller searchable pieces
Prepare the data for use with an LLM or Copilot-style assistant
Return relevant historical examples based on user questions

The dataset included:

Word documents
PDF files
Excel sheets
Mixed formatting and writing styles

Step 1: Extracting Text

The first step was converting the source files into structured text.

I used Python libraries such as:

python-docx
PyMuPDF (fitz)
pandas

Each document was converted into JSON records that keep the source file name and section alongside the extracted text.

Example structure:

{
  "document": "SBCA_Report.docx",
  "section": "Operational Considerations",
  "text": "The sustainment strategy should support..."
}

This made the documents easier to process and search later.
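As a rough illustration of this step, the sketch below groups paragraphs into one JSON record per section. It is not the project's actual code: the function names paragraphs_to_records and read_docx_paragraphs are hypothetical, and it assumes section titles use Word's built-in "Heading" styles, which real procurement documents may not do consistently. The python-docx import sits inside its own function so the rest runs without the library installed.

```python
import json


def read_docx_paragraphs(path):
    """Yield (style_name, text) pairs from a .docx file using python-docx."""
    from docx import Document  # third-party: pip install python-docx

    for paragraph in Document(path).paragraphs:
        yield paragraph.style.name, paragraph.text


def paragraphs_to_records(document_name, paragraphs):
    """Group (style_name, text) pairs into one record per section.

    Assumption: a paragraph whose style name starts with "Heading" opens a
    new section. Text before the first heading is filed under "Untitled".
    """
    records = []
    section, buffer = "Untitled", []

    def flush():
        # Emit the section collected so far, skipping empty sections.
        if buffer:
            records.append({
                "document": document_name,
                "section": section,
                "text": " ".join(buffer),
            })

    for style_name, text in paragraphs:
        text = text.strip()
        if not text:
            continue
        if style_name.startswith("Heading"):
            flush()
            section, buffer = text, []
        else:
            buffer.append(text)
    flush()
    return records


# Stand-in input so the example runs without a real file.
sample = [
    ("Heading 1", "Operational Considerations"),
    ("Normal", "The sustainment strategy should support..."),
]
print(json.dumps(paragraphs_to_records("SBCA_Report.docx", sample), indent=2))
```

With a real file, the same function could be fed from read_docx_paragraphs("SBCA_Report.docx"). PDF and Excel sources would need their own readers that produce the same (style, text) pairs or write records directly.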

Step 2: Chunking the Documents

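Large documents are difficult for LLMs to process directly: they can exceed the model's context window, and the few relevant passages get buried in unrelated text.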

To solve this, the text was divided into smaller chunks.

Each chunk contained:

Document name
Section information
Chunk text
Metadata

Example:

{
  "chunk_id": 15,
  "document": "SBCA_Report.docx",
  "text": "Lifecycle sustainment costs were evaluated..."
}

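This is an important RAG step because each chunk is retrieved as a unit, and a small, focused chunk matches a specific question more precisely than a long passage that covers several topics.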
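One simple way to do this is a sliding word window with a small overlap, so a sentence cut at a boundary still appears whole in a neighbouring chunk. The sketch below is an assumption about how such chunking could look, not the method used in the project: chunk_records, max_words, and overlap are hypothetical names and values, and splitting on sentences or headings may work better for some documents.

```python
def chunk_records(records, max_words=120, overlap=20):
    """Split each section record into overlapping word-window chunks."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    chunks = []
    step = max_words - overlap  # how far the window advances each time
    for record in records:
        words = record["text"].split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append({
                "chunk_id": len(chunks),
                "document": record["document"],
                "section": record.get("section", ""),
                "text": " ".join(window),
            })
            if start + max_words >= len(words):
                break  # this window already reached the end of the text
    return chunks


# Tiny window sizes so the overlap is visible in the output.
records = [{
    "document": "SBCA_Report.docx",
    "section": "Operational Considerations",
    "text": "Lifecycle sustainment costs were evaluated across several options",
}]
for chunk in chunk_records(records, max_words=5, overlap=2):
    print(chunk["chunk_id"], chunk["text"])
```

Carrying the document and section fields into every chunk is what later makes it possible to show users where an answer came from.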

Step 3: Preparing for Embeddings

After chunking, the data was prepared for embeddings.

The idea was:

Convert chunks into vector embeddings
Store embeddings in a searchable index
Compare user questions against embeddings
Return the most relevant chunks to the LLM

This allows the model to ground its answers in the actual internal documents instead of relying only on its general training knowledge.
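The four steps above can be sketched end to end in a few lines. To keep the example runnable without any model or service, it uses a toy "embedding" (a bag-of-words count vector) and cosine similarity. This is a deliberate stand-in: a real setup would call an embedding model and likely a vector index, and the function names embed, build_index, and search are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a sparse bag-of-words count vector (stand-in only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Store each chunk next to its vector: the 'searchable index'."""
    return [(chunk, embed(chunk["text"])) for chunk in chunks]


def search(index, question, top_k=3):
    """Return the top_k (score, chunk) pairs most similar to the question."""
    query_vector = embed(question)
    scored = [(cosine(query_vector, vector), chunk) for chunk, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    {"chunk_id": 15, "document": "SBCA_Report.docx",
     "text": "Lifecycle sustainment costs were evaluated..."},
    {"chunk_id": 16, "document": "SBCA_Report.docx",
     "text": "The sustainment strategy should support..."},
]
index = build_index(chunks)
for score, chunk in search(index, "How were sustainment costs evaluated?", top_k=1):
    print(round(score, 3), chunk["chunk_id"], chunk["document"])
```

Word counts only match exact terms, so this toy version misses paraphrases that a real embedding model would catch. The structure (embed, index, compare, return the top results) stays the same either way.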

Step 4: Testing with Copilot / LLM

The processed files were tested using:

Copilot-style prompts
SharePoint-hosted files
Manual prompt testing

The questions were open-ended and often varied in wording.

Example questions:

What sustainment risks were identified?
What flexibility requirements existed in previous projects?
What lessons learned were mentioned?

The system returned related historical examples from the source documents.
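For manual testing outside Copilot, the retrieved chunks have to be placed in front of the model together with the question. The sketch below shows one way to assemble such a prompt. The instruction wording and the build_prompt name are hypothetical and are not the prompts used in this project.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from a question and retrieved chunks."""
    lines = [
        "Answer the question using only the excerpts below.",
        "Cite the source document for each point.",
        "If the excerpts do not contain the answer, say so.",
        "",
    ]
    # Number each excerpt and name its source so answers can cite it.
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] {chunk['document']}: {chunk['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


retrieved = [
    {"document": "SBCA_Report.docx",
     "text": "Lifecycle sustainment costs were evaluated..."},
]
print(build_prompt("What sustainment risks were identified?", retrieved))
```

Keeping the document name next to each excerpt makes it easier to check whether the returned example really comes from the cited source.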

Challenges

Some practical challenges included:

Different document formats
Inconsistent section structures
OCR quality issues
Large paragraphs
Duplicate information
Mixed writing styles

Another challenge was balancing preprocessing effort against the size and complexity of the project.

For smaller datasets, over-engineering can waste time.
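Duplicate information is one challenge that a very small amount of code can reduce. The sketch below drops chunks whose text is identical after normalizing case and whitespace. It is an illustrative assumption rather than the project's approach, the names normalize and drop_duplicates are hypothetical, and it only catches exact repeats, not reworded ones.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_duplicates(chunks):
    """Keep the first occurrence of each distinct chunk text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


chunks = [
    {"chunk_id": 0, "text": "Lifecycle sustainment costs were evaluated..."},
    {"chunk_id": 1, "text": "lifecycle  sustainment costs were evaluated..."},
    {"chunk_id": 2, "text": "The sustainment strategy should support..."},
]
print([chunk["chunk_id"] for chunk in drop_duplicates(chunks)])
```

For a small dataset this kind of exact-match filter is often enough. Near-duplicate detection adds complexity that may not pay off.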

Lessons Learned

A few key lessons from this work:

Simple preprocessing helps a lot
Clean chunking improves retrieval quality
Metadata becomes very important later
Procurement documents often contain nuanced language
Retrieval quality matters more than complex prompting

Future Improvements

Possible future improvements include:

Better semantic search
Automatic metadata extraction
Vector database integration
Hybrid keyword + vector search
Better chunk ranking
Citation and source highlighting
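To make the hybrid search idea more concrete, one common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion, which only needs the rank positions and not comparable scores. This is a sketch of a possible future direction, not something built in this project. The chunk ids are made up for illustration, and k=60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    ranked highly by more than one method rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


keyword_ranking = [15, 3, 8]   # hypothetical ids from a keyword search
vector_ranking = [15, 42, 3]   # hypothetical ids from an embedding search
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# -> [15, 3, 42, 8]
```

Chunk 15 wins because both methods rank it first, and chunk 3 beats chunk 42 because it appears in both lists.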

Final Thoughts

This project was a good hands-on exercise for understanding practical RAG workflows.

The most interesting part was seeing how historical procurement knowledge could become searchable and reusable using relatively simple tools and preprocessing steps.

Even with a small dataset, the workflow demonstrated how AI assistants can help with research and decision-support tasks without replacing human expertise.