Legal-FAQ-Assistant

10/2025 — 11/2025
+20% domain accuracyPrivacy compliant

The Problem

Legal Q&A systems lacked domain-specific accuracy and data privacy compliance for sensitive legal queries.

The Solution

Fine-tuned Mistral 7B using QLoRA, achieving 20% increase in domain-specific accuracy. Architected a secure pipeline with AES-256 encryption and containerized deployment via Kubernetes. Optimized model inference for accurate legal responses.

Impact

+20% domain accuracy with production-ready legal Q&A and privacy compliance.

Architecture

A Mistral 7B base model was fine-tuned using QLoRA with PEFT adapters on a curated legal Q&A dataset, reducing GPU memory requirements by 60% vs full fine-tuning. The trained model is served behind a FastAPI inference endpoint containerized with Docker and orchestrated via Kubernetes for horizontal scaling. All query inputs and outputs pass through an AES-256 encryption layer before storage to meet data privacy requirements.

Key Challenges

  • QLoRA fine-tuning required careful rank and alpha hyperparameter tuning — too low and the model didn't adapt to legal vocabulary, too high and it overfit to training examples. Ran systematic ablations across r=8, r=16, r=32 before settling on r=16 with alpha=32 for the best accuracy/generalization tradeoff.
  • Legal queries often contain jurisdiction-specific terms that the base Mistral model had never seen. Solved this by augmenting the training dataset with synthetic Q&A pairs generated from legal glossaries, which improved out-of-distribution accuracy significantly.
  • Kubernetes pod scaling created cold-start latency spikes when new inference pods loaded the model weights. Fixed by implementing model weight caching at the node level and using readiness probes to ensure pods only receive traffic after the model is fully loaded.

Key Learnings

Fine-tuning a 7B model with QLoRA is accessible but unforgiving — the dataset quality matters far more than training duration. Half the accuracy gains came from cleaning and deduplicating the training data, not from model architecture choices. I also learned that encryption at the application layer adds measurable latency that needs to be budgeted into SLA targets from day one.

Technologies

Mistral 7BQLoRAPEFTKubernetesAES-256Python