Dataset Automation Pipeline

Category: AI Infrastructure · Automation
Stack: Python · PyQt · PostgreSQL · Enterprise NAS · PyTorch · Audio Processing


Background

Before this system existed, the team's workflow for preparing training data looked like this:

  • Audio files scattered across individual laptops and USB drives
  • No access control — anyone could modify or delete anything
  • Every dataset preparation required manually hunting files, converting formats, and cutting audio by hand
  • Non-technical staff couldn't participate at all, leaving engineers to do everything

This made every training cycle slow and error-prone, directly bottlenecking the AI model's iteration speed. I designed and built this dataset automation pipeline from scratch to solve all of it.


Architecture

Storage Layer: Enterprise NAS + PostgreSQL

Enterprise NAS stores the actual audio files:

  • Centralized storage — no more scattered files
  • Fine-grained user permission control
  • Stable LAN access speed

PostgreSQL manages the metadata:

  • Each audio file's ID, path, duration, label, creator, and timestamp
  • Complex queries supported (filter by label, date range, etc.)
  • Maps to physical files on the NAS
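The metadata model can be sketched as follows. This is an illustrative sketch only: the production system uses PostgreSQL, but SQLite is substituted here so the example is self-contained, and the table and column names are assumptions rather than the actual schema.

```python
import sqlite3

# Stand-in for the PostgreSQL metadata store; schema names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audio_files (
        id INTEGER PRIMARY KEY,
        path TEXT NOT NULL,          -- physical location of the file on the NAS
        duration_sec REAL NOT NULL,
        label TEXT NOT NULL,
        creator TEXT NOT NULL,
        created_at TEXT NOT NULL     -- ISO-8601 timestamp
    )
""")
conn.executemany(
    "INSERT INTO audio_files (path, duration_sec, label, creator, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("/nas/audio/a1.wav", 12.5, "speech", "alice", "2024-03-01T10:00:00"),
        ("/nas/audio/a2.wav", 8.0,  "noise",  "bob",   "2024-03-02T11:30:00"),
        ("/nas/audio/a3.wav", 30.2, "speech", "alice", "2024-03-05T09:15:00"),
    ],
)

# Example of a filtered query: all "speech" clips in a date range, newest first.
rows = conn.execute(
    "SELECT path, duration_sec FROM audio_files "
    "WHERE label = ? AND created_at >= ? ORDER BY created_at DESC",
    ("speech", "2024-03-01"),
).fetchall()
```

Because the database stores only paths, the query result maps directly to files on the NAS, which is what makes label- and date-based dataset selection a single query.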

Interface Layer: PyQt GUI

To make the system usable by non-technical staff, I built a desktop GUI with PyQt:

  • On login, the system automatically queries PostgreSQL and fetches the corresponding files from NAS
  • Users select the target label, quantity, and output format — no knowledge of the underlying storage structure required
  • All operations are logged in detail for traceability and debugging
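The hand-off from GUI to backend can be pictured as a small request object validated before any work starts. This is a hypothetical sketch: the field names and the set of allowed output formats are assumptions, not the actual code.

```python
from dataclasses import dataclass

# Hypothetical set of output formats the GUI offers.
ALLOWED_FORMATS = {"wav", "flac"}

@dataclass
class DatasetRequest:
    """What the user selects in the GUI: label, quantity, output format."""
    label: str
    quantity: int
    output_format: str

    def validate(self) -> None:
        # Reject bad input before the batch pipeline is launched.
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")

req = DatasetRequest(label="speech", quantity=500, output_format="wav")
req.validate()  # raises if the GUI passed something invalid
```

Validating at the boundary keeps storage details out of the GUI: users only ever express intent (label, quantity, format), never paths.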

Processing Layer: Fully Automated Batch Pipeline

This is the system's core value. When a user submits a request, the backend launches an end-to-end automated pipeline:

Step 1: Audio Splitting
Long audio files are split into training-sized segments based on silence detection or fixed duration.
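The fixed-duration strategy can be sketched in a few lines: chop the sample buffer into equal-length segments and drop a trailing remainder shorter than the target (the silence-detection path is omitted here).

```python
def split_fixed(samples, sample_rate, segment_sec):
    """Split a sample buffer into equal fixed-duration segments.

    A trailing remainder shorter than segment_sec is discarded.
    """
    seg_len = int(sample_rate * segment_sec)
    full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(full)]

# 2.5 s of fake audio at 8 kHz, split into 1 s segments -> 2 full segments
segments = split_fixed(list(range(20000)), 8000, 1.0)
```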

Step 2: Alignment & Validation
Each split segment is validated against model input specs (sample rate, channels, bit depth). Non-conforming files are automatically converted. Error checks are placed at every node — failed files are flagged, never silently skipped.
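The validation gate amounts to comparing each file's properties against the model's input spec and flagging, never dropping, anything that doesn't conform. A minimal sketch, with illustrative field names and spec values:

```python
# Assumed model input spec; the real values depend on the target model.
MODEL_SPEC = {"sample_rate": 16000, "channels": 1, "bit_depth": 16}

def check_spec(file_info: dict) -> list[str]:
    """Return a list of spec violations; an empty list means the file conforms."""
    return [
        f"{key}: expected {want}, got {file_info.get(key)}"
        for key, want in MODEL_SPEC.items()
        if file_info.get(key) != want
    ]

good = {"sample_rate": 16000, "channels": 1, "bit_depth": 16}
bad  = {"sample_rate": 44100, "channels": 2, "bit_depth": 16}
flagged = check_spec(bad)  # non-empty -> file is routed to automatic conversion
```

Returning the full list of violations, rather than a boolean, is what makes flagged files debuggable instead of silently skipped.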

Step 3: Data Augmentation
To improve dataset diversity, the system includes built-in audio augmentation:

  • Mixing: Blending voice with background audio to simulate real environments
  • Synthesis: Random variation of pitch, speed, and volume
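Both augmentations can be illustrated on plain sample lists. This is a toy sketch: real audio processing would operate on actual waveforms, and the gain and variation ranges shown are assumptions.

```python
import random

def mix(voice, background, bg_gain=0.3):
    """Blend voice with background audio at a reduced background level."""
    return [v + bg_gain * b for v, b in zip(voice, background)]

def vary_volume(samples, rng, low=0.8, high=1.2):
    """Scale the whole clip by one random gain factor."""
    gain = rng.uniform(low, high)
    return [gain * s for s in samples]

rng = random.Random(42)  # seeded so augmentation runs are reproducible
voice = [0.5, -0.5, 0.25, -0.25]
background = [0.1, 0.1, -0.1, -0.1]

mixed = mix(voice, background)   # voice plus attenuated background
louder = vary_volume(voice, rng) # same clip at a random volume
```

Pitch and speed variation follow the same pattern, only with resampling instead of a simple gain.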

Step 4: Automatic Metadata Generation
After processing, the system auto-generates the metadata file in the exact format the model expects, with no manual formatting required.
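The final step can be sketched as emitting a JSON-lines manifest, one record per segment. The exact schema a model consumes is project-specific; the keys used here are assumptions for illustration.

```python
import json

def write_manifest(segments, path):
    """Write one JSON record per processed segment, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for seg in segments:
            f.write(json.dumps({
                "audio_filepath": seg["path"],
                "duration": seg["duration"],
                "label": seg["label"],
            }) + "\n")

segments = [
    {"path": "out/seg_0001.wav", "duration": 1.0, "label": "speech"},
    {"path": "out/seg_0002.wav", "duration": 1.0, "label": "speech"},
]
write_manifest(segments, "manifest.jsonl")
```

Because the manifest is generated from the same records the pipeline just validated, it can never drift out of sync with the audio files it describes.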


Results

  • Non-technical staff can operate independently — engineers no longer act as intermediaries
  • Dataset preparation time dramatically reduced, accelerating the AI model iteration cycle
  • Consistent data quality: automated validation eliminates format errors from manual handling
  • Centralized management resolved the original storage chaos and security concerns

Takeaway

Good tool design isn't just about making engineers' lives easier — it's about making the entire team capable of doing the right thing safely. Adding a GUI, error checks, and automated validation looks like "extra work" up front. But it turns unpredictable manual steps into a reliable automated process. The time saved over the long run far exceeds the cost to build it.