AI Training Data Pipeline & Automated Web Scraping

Category: AI · Data Engineering · iOS
Stack: Python · Selenium · BeautifulSoup · RegEx · PyTorch Lite · iOS · Objective-C


Background

The company's core competitive advantage is its proprietary AI model, and model quality depends directly on the volume and cleanliness of training data. My role covered both ends of the data lifecycle: raw data acquisition and final model deployment on iOS.

This project spans the full pipeline from original data collection to live model inference on a mobile device.


Data Collection: Compliant Scraping Strategy

Compliance comes first. My approach:

  1. Always check robots.txt and respect the allowed scope
  2. Use official APIs first — more stable, more compliant, less fragile than scraping
  3. Scrape only when no API exists
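Step 1 can be automated with the standard library's robots.txt parser. A minimal sketch — the rules and the `MyCrawler/1.0` agent string are made up for illustration; in practice the file is fetched from the target site:

```python
import urllib.robotparser

# Sample robots.txt (in practice, fetched from https://<site>/robots.txt)
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Only fetch URLs the site's rules permit for our user agent
def may_fetch(url, agent="MyCrawler/1.0"):
    return rp.can_fetch(agent, url)
```

Wiring this check in front of every request makes the compliance rule structural rather than a matter of discipline.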

For sites without APIs, I analyze the rendering approach:

  • Static pages → requests fetches the HTML directly — fast and lightweight
  • JavaScript-rendered (SPA) → Selenium simulates real browser behavior to handle dynamic content and user interactions
  • Anti-bot mechanisms → diagnose the type (User-Agent detection, behavioral analysis, rate limiting) and adjust the strategy accordingly
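The static-vs-SPA decision can itself be automated. A sketch of one possible heuristic (the shell markers are my own illustrative choice, not an exhaustive list): if the initial HTML is an empty app shell, assume client-side rendering and fall back to Selenium, otherwise stay on the lightweight requests path:

```python
def choose_strategy(html_sample: str) -> str:
    """Decide how to fetch a site from a sample of its raw HTML.

    Empty mount points like <div id="root"></div> are a common sign
    of an SPA whose content only appears after JavaScript runs.
    """
    spa_markers = ('<div id="root"></div>', '<div id="app"></div>')
    if any(marker in html_sample for marker in spa_markers):
        return "selenium"   # drive a real browser for rendered content
    return "requests"       # plain HTTP fetch is enough
```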

Parsing and Cleaning

Once raw HTML is retrieved:

BeautifulSoup for structural parsing: understand the DOM structure, locate the target nodes, and extract them with BS4 selectors.
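A small sketch of the BS4 selector step — the markup and class names here are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="track">
  <a class="download" href="https://example.com/a.wav">Track A</a>
  <a class="download" href="https://example.com/b.mp3">Track B</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selector: every download link inside a track card
links = [a["href"] for a in soup.select("div.track a.download")]
```

Note that structural parsing alone pulls in both links; the format filter comes next.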

RegEx for precision extraction: this is the key to clean data. For example, when scraping audio file links, an exact pattern filters out everything but .wav files:

```python
import re

# Capture only direct links ending in .wav from the raw HTML
wav_links = re.findall(r'https?://[^\s"\']+\.wav', html_content)
```

This keeps the training pipeline clean: only the expected format gets through, with no noise and no mixed file types.
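An extension match alone doesn't prove a downloaded file is actually a WAV. A small post-download check on the RIFF header (a sketch; the function name is mine) also deduplicates and catches mislabeled files before they reach training:

```python
def keep_valid_wavs(paths):
    """Deduplicate and keep only files whose header is genuinely
    RIFF/WAVE, not merely files with a .wav extension."""
    seen, valid = set(), []
    for path in paths:
        if path in seen:
            continue
        seen.add(path)
        with open(path, "rb") as f:
            header = f.read(12)
        # A canonical WAV starts with "RIFF" and carries "WAVE" at offset 8
        if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
            valid.append(path)
    return valid
```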


Model Deployment: On-device iOS Inference

After training, I was responsible for deploying the model to run directly on iOS devices.

Format Calibration: PyTorch models are sensitive to input format. Before deployment, I strictly calibrated audio sample rate, channel count, and bit depth to match exactly what the training preprocessing used. Any deviation causes inference errors.
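This calibration can be enforced with a fail-fast check using the standard library's wave module. A sketch — the 16 kHz / mono / 16-bit values below are placeholder assumptions, not the project's actual spec:

```python
import wave

# Assumed training preprocessing spec (placeholder values)
EXPECTED = {"rate": 16000, "channels": 1, "sampwidth": 2}

def check_format(path):
    """Raise if a clip deviates from the training preprocessing spec."""
    with wave.open(path, "rb") as w:
        actual = {
            "rate": w.getframerate(),
            "channels": w.getnchannels(),
            "sampwidth": w.getsampwidth(),  # bytes per sample (2 = 16-bit)
        }
    mismatches = {k: v for k, v in actual.items() if v != EXPECTED[k]}
    if mismatches:
        raise ValueError(f"format mismatch: {mismatches}")
    return actual
```

Running this on every clip before it reaches the device surfaces format drift at ingestion time rather than as a cryptic inference failure.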

Objective-C Low-Level Integration: iOS's low-level system APIs are exposed in Objective-C, and calling PyTorch Mobile's C++ API from the native layer requires Objective-C++ (an .mm file that can mix both languages):

```objc
#import <Foundation/Foundation.h>
#include <LibTorch-Lite.h>

// Load the Lite-interpreter model bundled with the app
// (.ptl files need _load_for_mobile, not the full torch::jit::load)
NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"model" ofType:@"ptl"];
auto module = torch::jit::_load_for_mobile(modelPath.UTF8String);

// Run inference (this file must be compiled as Objective-C++, i.e. .mm)
auto output = module.forward({inputTensor});
```

This requires understanding both the PyTorch C++ API and iOS native development — two domains that rarely overlap.


Results

  • Established a repeatable data collection workflow supporting multiple source types
  • High data purity: consistent audio format entering the training pipeline, minimal post-processing required
  • Model runs stably on iOS devices, validating the full path from data engineering to edge deployment

Takeaway

Data quality sets the ceiling for model performance. The most sophisticated architecture produces mediocre results if training data is noisy or inconsistent. And deploying a model all the way to a mobile device requires crossing multiple technical layers at once — it's not a single skill, it's a chain.