Machine Learning Platform Documentation

Version: 1.0.0
Date Created: January 02, 2026


Table of Contents

  1. Purpose of the Platform
    1.1 Overview
    1.2 Why the Platform Exists
    1.3 What the Platform Manages
    1.4 Design Approach
    1.5 Intended Usage

  2. High-Level Architecture and Core Components
    2.1 Architectural Overview
    2.2 Core Components
    2.2.1 API and Service Layer
    2.2.2 Data Management Layer
    2.2.3 Template and Schema Management
    2.2.4 Preprocessing and Feature Engineering
    2.2.5 Training and Execution Engine
    2.2.6 Prediction and Inference Engine
    2.2.7 Distributed and Asynchronous Processing
    2.3 Cross-Cutting Concerns

  3. Data Ingestion and Dataset Lifecycle
    3.1 Overview
    3.2 Dataset Creation
    3.3 File-Level Handling for CSV Datasets
    3.4 Schema Validation
    3.5 Dataset Usability and State Tracking
    3.6 Dataset Statistics Generation
    3.7 Incremental Awareness and Reprocessing
    3.8 Dataset Retrieval and Visualization Support
    3.9 Role of Datasets in Downstream Workflows

  4. Template and Schema Design
    4.1 Role of Templates in the Platform
    4.2 Template Structure
    4.3 Data Type Governance
    4.4 Template Validation Rules
    4.5 Template Updates and Immutability Constraints
    4.6 Training-Specific Templates
    4.7 Templates as a Contract Across the Lifecycle
    4.8 Operational Benefits of the Template Model

  5. Preprocessing and Feature Engineering
    5.1 Purpose of the Preprocessing Layer
    5.2 Preprocess Sessions as Versioned Artifacts
    5.3 Data Access and Execution Strategy
    5.4 Column Categorization Using Templates
    5.5 Numerical Feature Processing
    5.6 Categorical Feature Encoding
    5.7 Datetime Feature Handling
    5.8 Missing Value Handling
    5.9 Train–Test Split and Data Persistence
    5.10 State Management and Reusability
    5.11 Incremental and Tune-Aware Preprocessing
    5.12 Error Handling and Observability
    5.13 Why Preprocessing Is Explicit in This Platform

  6. Model Training and Execution
    6.1 How Training Fits Into the Platform
    6.2 Problem Type as the Foundation
    6.3 Mapping Dataset Columns to Meaning
    6.4 Algorithm Resolution and Constraints
    6.5 Creating a Training Session
    6.6 Preprocessing and Execution Strategy
    6.7 Hyperparameter Tuning and Incremental Learning

  7. Inference and Prediction Workflow
    7.1 Overview
    7.2 Preconditions for Prediction
    7.3 Single-Record (Form-Based) Prediction
    7.4 Algorithm-Specific Inference Handling
    7.5 Batch Prediction Workflow
    7.6 Forecasting Inference
    7.7 Prediction Persistence and Auditability
    7.8 Error Handling and Stability
    7.9 Design Rationale

  8. Batch Prediction and Ray Execution
    8.1 Purpose and Design Intent
    8.2 Entry Conditions for Batch Prediction
    8.3 Dataset Validation and Preparation
    8.4 Feature Scaling and State Application
    8.5 Ray Job Submission Model
    8.6 Execution Inside Ray
    8.7 Completion, Status Updates, and Results

  9. Feature Importance and SHAP Analysis
    9.1 When Feature Importance Is Available
    9.2 Non-SHAP Feature Importance
    9.3 SHAP-Based Explainability
    9.4 Incremental Training and SHAP

  10. Forecasting and Time-Series Inference
    10.1 Supported Forecasting Models
    10.2 Frequency and Horizon Validation
    10.3 Handling of Unique Identifiers
    10.4 Exogenous Variable Processing
    10.5 Forecast Execution and Output
    10.6 Error Handling and Safety Guards

  11. Error Handling, Validation, and Guardrails
    11.1 Design Philosophy
    11.2 Validation at Data Ingestion
    11.3 Guardrails in Training Configuration
    11.4 Preprocessing and Feature-Level Validation
    11.5 Ray Job Execution and Failure Handling
    11.6 Inference-Time Validation
    11.7 Batch Prediction Safeguards
    11.8 Status Propagation and User Visibility
    11.9 Summary


1. Purpose of the Platform

1.1 Overview

This platform exists to provide a single, consistent way to build and operate machine learning models across the organization.

Instead of individual teams handling data preparation, model training, and prediction in their own scripts or services, the platform centralizes these workflows behind well-defined APIs and controlled execution paths. This allows teams to focus on model logic and use-case design, while the platform takes responsibility for data validation, preprocessing consistency, execution, and tracking.

From a technical perspective, the system acts as an orchestration layer over data, preprocessing pipelines, model training jobs, and inference workloads. From a user’s perspective, it offers a structured, repeatable process for moving from raw datasets to production-ready predictions.

1.2 Why the Platform Exists

As machine learning usage scales, a few recurring problems tend to emerge: inconsistent preprocessing between training and inference, difficulty reproducing past results, long-running jobs blocking application services, and limited visibility into what data or configuration produced a given model.

This platform is designed to address those problems directly.

All training and prediction operations run through a controlled lifecycle. Datasets are explicitly marked for training or prediction use. Feature mappings, target selection, and preprocessing steps are recorded at training time and reused verbatim during inference. Model artifacts and metrics are stored alongside the metadata needed to understand how they were produced.

The result is a system where models can be trained, evaluated, reused, and updated without relying on undocumented assumptions or external scripts.

1.3 What the Platform Manages

The platform manages the full machine learning workflow end-to-end.

It starts with data definition. Before any model can be trained, the structure of the data is defined using templates. These templates describe column names, data types, and default values, and they serve as the contract between datasets, preprocessing logic, and models. This approach ensures that training and prediction data always conform to an expected schema.

Once data is defined and validated, the platform manages model training. Training requests capture not only the algorithm and problem type, but also how each column is interpreted (for example, which column is the target, which represents time, or which acts as a unique identifier). The platform then coordinates preprocessing, model execution, metric collection, and artifact storage.

Finally, the platform manages prediction and forecasting. Predictions are always executed using the preprocessing state generated during training, ensuring consistency between training and inference. Both real-time (form-based) and batch prediction workflows are supported, as well as time-series forecasting with optional exogenous variables.

1.4 Design Approach

The platform is intentionally conservative in how it handles machine learning workflows.

Validation is enforced early and often. Invalid configurations are rejected before any compute-intensive work begins. This includes checks on dataset usage, problem type compatibility, feature mappings, and algorithm constraints. While this may feel restrictive, it significantly reduces runtime failures and ambiguous behavior.

Long-running operations such as training, batch prediction, and SHAP computation are executed asynchronously. The API layer remains responsive, while execution is handled by distributed workers. Results are reported back through well-defined callbacks and persisted in the database.

Throughout the system, emphasis is placed on traceability. Every dataset, training run, and prediction can be traced back to the inputs and configuration that produced it. This is essential for debugging, auditing, and long-term model maintenance.

1.5 Intended Usage

The platform is intended for teams that need to train and operate machine learning models in a controlled, repeatable manner.

It supports iterative model development, where models are retrained or incrementally updated as new data becomes available. It also supports large-scale batch inference workflows and forecasting use cases that require careful handling of time and frequency.

By enforcing structure and consistency, the platform allows machine learning workflows to scale without becoming brittle or opaque.


2. High-Level Architecture and Core Components

Figure 1: High-level system architecture of the ML platform

2.1 Architectural Overview

The ML platform is built as a service-oriented system where each stage of the machine learning lifecycle is handled by a clearly defined component. Rather than embedding training or inference logic directly into request-handling code paths, the platform separates orchestration from execution.

At a high level, the API layer acts as the control plane. It validates requests, records intent and metadata in the database, and triggers background execution. Actual compute-heavy tasks such as preprocessing, training, statistics generation, batch prediction, and SHAP analysis are executed asynchronously, primarily using distributed processing where required.

This separation allows the platform to remain responsive under load while still supporting large datasets, complex models, and long-running jobs.

2.2 Core Components

2.2.1 API and Service Layer

The service layer is responsible for enforcing business rules and coordinating workflows. Each major capability—templates, datasets, training, prediction, statistics, and feature importance—is implemented as a dedicated service with a clearly scoped responsibility.

This layer performs validation such as:

  • Whether a dataset is suitable for training or prediction
  • Whether a model configuration is compatible with the chosen problem type
  • Whether required attributes (target, date, unique ID) are correctly defined
  • Whether requested operations are allowed given the current state of a resource

The service layer does not perform heavy computation. Instead, it prepares execution context and delegates work to preprocessing pipelines or distributed jobs.

2.2.2 Data Management Layer

The platform supports two primary data sources: uploaded CSV datasets and data lake–backed tables. Regardless of the source, all data flows through a unified abstraction so that downstream components do not need to handle source-specific logic.

For CSV-based datasets, files are validated, versioned, and tracked at the file level. The system records which files were included in training or statistics generation to ensure reproducibility.

For data lake–based datasets, schema and metadata are retrieved dynamically from the underlying storage engine. This allows large datasets to be used without copying data into the application layer while still enforcing schema consistency.

2.2.3 Template and Schema Management

Templates define the structure of data used across the platform. They specify column names, data types, and default values, and they act as the authoritative schema for datasets.

During training, templates are extended with semantic meaning. Columns are explicitly marked as targets, dates, unique identifiers, or standard features. These mappings are stored with the training record and reused during inference and forecasting.

This approach ensures that preprocessing and feature handling remain consistent across runs, even as datasets evolve.

2.2.4 Preprocessing and Feature Engineering

Preprocessing is treated as a first-class, persistent step rather than a transient transformation.

When a model is trained, the preprocessing pipeline generates a state artifact that captures all transformations applied to the data. This includes scaling parameters for numerical features, encoding rules for categorical features, and any derived time-based features for forecasting or anomaly detection.

This preprocessing state is stored and later reloaded during inference and batch prediction. By design, the platform never re-fits preprocessing logic during prediction, eliminating training–inference skew.

2.2.5 Training and Execution Engine

Model training is executed asynchronously. Once a training request is validated and recorded, the platform submits a background job that performs preprocessing, model fitting, evaluation, and artifact persistence.

Training jobs produce:

  • Model artifacts (binary or directory-based, depending on algorithm)
  • Evaluation metrics such as accuracy or regression scores
  • Optional summaries and metadata required for downstream use

The system supports both standard training and incremental learning, with explicit constraints on which algorithms and problem types can be updated incrementally.

2.2.6 Prediction and Inference Engine

Inference is split into multiple execution paths depending on use case.

Form-based prediction supports low-latency, single-record inference using the trained model and preprocessing state. Batch prediction supports large datasets and is executed asynchronously, writing results either to files or back to the data lake.

Forecasting follows a specialized path that accounts for time frequency, horizon conversion, and optional exogenous inputs. The platform enforces strict validation around time continuity and frequency alignment to prevent incorrect forecasts.

2.2.7 Distributed and Asynchronous Processing

Long-running and resource-intensive tasks are executed outside the request lifecycle. This includes:

  • Dataset statistics generation
  • Model training and retraining
  • Batch prediction
  • SHAP-based feature importance computation

Each job is tracked using a unique identifier, and its status is persisted so that the platform can report progress, completion, or failure back to the user.

This design ensures that the platform can scale horizontally without blocking API requests or introducing timeouts.

2.3 Cross-Cutting Concerns

Across all components, the platform emphasizes a few core principles.

Consistency is enforced by reusing stored preprocessing state and schema mappings. Traceability is achieved by recording configuration, inputs, and outputs for every operation. Failure handling is explicit, with partial failures surfaced clearly rather than silently ignored.

Together, these principles allow the platform to support complex machine learning workflows while remaining predictable and maintainable.


3. Data Ingestion and Dataset Lifecycle

Figure 2: Dataset ingestion, template binding, and validation flow

3.1 Overview

Data is the foundation of the platform, and a significant portion of the system is designed around ensuring that datasets are ingested, validated, and managed in a controlled and reproducible manner. The platform treats datasets as long-lived assets rather than temporary inputs, allowing them to be reused across multiple trainings and prediction workflows.

Each dataset moves through a clearly defined lifecycle, from creation to validation, statistics generation, and eventual usage in training or prediction. At every stage, metadata is persisted so that downstream operations can rely on consistent assumptions about schema, quality, and provenance.

3.2 Dataset Creation

Datasets can be created from two supported sources:

  • Uploaded CSV files
  • Data lake–backed tables

Regardless of the source, dataset creation begins with associating the dataset to a template. The template acts as the contract that defines expected columns, data types, and default values. This association ensures that schema validation remains consistent across ingestion, preprocessing, and training.

When a dataset is created, the platform records:

  • Dataset name and usage type (training or prediction)
  • Associated template
  • Data source (CSV upload or data lake)
  • Physical storage location or table reference
  • Tenant and ownership metadata

At this point, the dataset is registered but not yet considered ready for downstream use.

3.3 File-Level Handling for CSV Datasets

For CSV-based datasets, ingestion operates at the file level rather than treating the dataset as a single monolithic entity. Each uploaded file is tracked independently, allowing the platform to support incremental training and partial reprocessing.

During upload:

  • Files are stored in a structured directory layout
  • Metadata such as record count, file size, and null presence is captured
  • Each file is marked as unvalidated until schema and data checks complete

If null values are detected, the platform may generate alternate file paths that explicitly separate clean and null-containing data. This distinction is later used during preprocessing and statistics computation.

3.4 Schema Validation

Once data is ingested, the platform validates it against the associated template.

This validation includes:

  • Ensuring all required columns are present
  • Verifying that column data types match the template definition
  • Applying default values where appropriate
  • Detecting unsupported or unexpected columns

Validation is strict by design. Datasets that fail schema validation are marked as invalid and cannot be used for training or prediction until corrected. This prevents downstream failures during preprocessing or model execution.

For data lake–backed datasets, schema validation is performed by inspecting the table metadata directly, ensuring that the template remains aligned with the source table.
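
To make these checks concrete, the sketch below shows how a template-driven schema check might look for a CSV dataset loaded with pandas. The template contents, type mapping, and function name are illustrative assumptions, not the platform's actual implementation.

    # Minimal sketch of template-driven schema validation (illustrative names and types).
    import pandas as pd

    TEMPLATE = {
        "customer_id": "integer",
        "signup_date": "datetime",
        "region": "string",
        "spend": "float",
    }

    # Rough mapping from canonical platform types to pandas dtype kinds (assumption).
    PANDAS_KINDS = {"integer": "i", "float": "f", "string": "O", "boolean": "b", "datetime": "M"}

    def validate_against_template(df: pd.DataFrame, template: dict) -> list[str]:
        errors = []
        # All required columns must be present.
        missing = set(template) - set(df.columns)
        errors += [f"missing column: {c}" for c in sorted(missing)]
        # Unexpected columns are flagged rather than silently accepted.
        extra = set(df.columns) - set(template)
        errors += [f"unexpected column: {c}" for c in sorted(extra)]
        # Column data types must match the template definition.
        for col, declared in template.items():
            if col in df.columns and df[col].dtype.kind != PANDAS_KINDS[declared]:
                errors.append(f"type mismatch in {col}: expected {declared}, got {df[col].dtype}")
        return errors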

3.5 Dataset Usability and State Tracking

Each dataset maintains an internal usability state. This state reflects whether the dataset is suitable for training, prediction, or neither.

Key usability checks include:

  • Whether the dataset has passed schema validation
  • Whether required files are present and accessible
  • Whether the dataset usage matches the requested operation (for example, preventing training on prediction-only datasets)

By enforcing these checks early, the platform avoids ambiguous behavior later in the workflow.

3.6 Dataset Statistics Generation

After validation, datasets can undergo statistics generation. This step provides both visibility into the data and a foundation for informed feature selection.

Statistics are computed at the column level and include:

  • Null counts and percentages
  • Unique value counts
  • Basic numeric summaries such as minimum, maximum, and mean
  • Temporal distributions for datetime columns
  • Boolean distributions where applicable

For smaller datasets, statistics are computed synchronously using an in-memory execution path. For larger datasets or data lake sources, statistics generation is delegated to a background job to ensure scalability.

The platform records statistics status per column, allowing partial success and transparent failure reporting when issues arise.
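
As an illustration of the column-level statistics described above, the following sketch computes comparable summaries with pandas. The metric names and exact set of summaries are assumptions; the platform's own implementation differs for large or data lake–backed datasets, where computation runs as a background job.

    # Illustrative per-column statistics for a small, in-memory dataset.
    import pandas as pd

    def column_statistics(df: pd.DataFrame) -> dict:
        stats = {}
        for col in df.columns:
            s = df[col]
            entry = {
                "null_count": int(s.isna().sum()),
                "null_pct": round(100.0 * s.isna().mean(), 2),
                "unique_count": int(s.nunique(dropna=True)),
            }
            if pd.api.types.is_bool_dtype(s):
                entry["true_ratio"] = float(s.mean())                     # boolean distribution
            elif pd.api.types.is_numeric_dtype(s):
                entry.update(min=float(s.min()), max=float(s.max()), mean=float(s.mean()))
            elif pd.api.types.is_datetime64_any_dtype(s):
                entry.update(earliest=str(s.min()), latest=str(s.max()))  # temporal range
            stats[col] = entry
        return stats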

3.7 Incremental Awareness and Reprocessing

The platform is designed to recognize when datasets change. For CSV datasets, newly uploaded files are tracked independently and marked as not yet included in statistics or training.

This enables:

  • Selective recomputation of statistics
  • Incremental model updates without retraining on previously processed data
  • Clear traceability of which data contributed to a given model version

Files are explicitly marked once they have been included in statistics or training runs, ensuring that repeated operations remain deterministic.

3.8 Dataset Retrieval and Visualization Support

Datasets expose structured views that allow users to:

  • Inspect schema and column metadata
  • Review validation status
  • View computed statistics and column summaries
  • Generate simple visualizations such as distributions and trends

These capabilities are not just for exploratory analysis. They also serve as safeguards, helping users confirm that data is suitable before committing to training or prediction workflows.

3.9 Role of Datasets in Downstream Workflows

Once validated and analyzed, datasets become reusable inputs across the platform. The same dataset can support multiple training runs with different configurations, or act as a prediction input for batch inference.

Crucially, the platform never mutates datasets during training or prediction. All transformations occur in derived artifacts such as preprocessing states or result outputs. This guarantees that datasets remain stable reference points across the entire machine learning lifecycle.


4. Template and Schema Design

4.1 Role of Templates in the Platform

Templates sit at the core of how the platform understands data. Rather than inferring structure dynamically at every stage, the system relies on templates as explicit, versioned definitions of schema and intent. A template describes what a dataset is expected to look like, how each column should be interpreted, and how that column may be used later in training or prediction.

This approach serves two purposes. First, it enforces consistency across ingestion, preprocessing, training, and inference. Second, it provides a stable contract that allows different parts of the system to operate independently without revalidating assumptions at every step.

In practice, templates are not just schemas; they are the bridge between raw data and machine learning semantics.

4.2 Template Structure

A template is composed of a template name and a collection of column definitions. Each column definition includes:

  • Column name
  • Data type
  • Default value
  • Tenant ownership and metadata

At creation time, templates are validated to ensure they meet minimum structural requirements. A template must contain at least two columns, column names must be unique, and each column must declare a valid data type supported by the platform.

The platform normalizes data types internally. For example, aliases such as int, integer, or float8 are resolved to canonical data types. This normalization ensures that schema behavior remains consistent even when users supply different naming conventions.
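
A minimal sketch of this alias resolution is shown below; the alias table and canonical type names are assumptions for illustration.

    # Hypothetical alias table; the platform's actual canonical type names may differ.
    TYPE_ALIASES = {
        "int": "integer", "integer": "integer", "bigint": "integer",
        "float": "float", "float8": "float", "double": "float",
        "str": "string", "string": "string", "text": "string",
        "bool": "boolean", "boolean": "boolean",
        "date": "datetime", "timestamp": "datetime", "datetime": "datetime",
    }

    def normalize_data_type(raw: str) -> str:
        canonical = TYPE_ALIASES.get(raw.strip().lower())
        if canonical is None:
            # Unsupported types are rejected at template creation time.
            raise ValueError(f"Unsupported data type: {raw!r}")
        return canonical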

4.3 Data Type Governance

Data types are centrally managed rather than being treated as free-form strings. Each column’s data type is resolved against a controlled set of supported types, such as integer, float, string, boolean, and datetime.

This design has two important implications:

  • Invalid or unsupported data types are rejected early, during template creation.
  • Downstream components, such as preprocessing and model training, can rely on a predictable set of data types without defensive checks.

Default values are also validated. Empty defaults are not permitted, as they introduce ambiguity during preprocessing and inference.

4.4 Template Validation Rules

Several safeguards are enforced during template creation and update:

  • Template names must be unique within a tenant and meet minimum length requirements.
  • Duplicate column names are not allowed.
  • Column definitions must include a valid data type and default value.
  • Templates associated with existing datasets cannot be modified.

These constraints are intentional. Templates are designed to be stable references. Once data has been ingested using a template, changing that template would invalidate assumptions already embedded in stored data and preprocessing states.

4.5 Template Updates and Immutability Constraints

The platform allows templates to be updated only when they are not yet associated with any dataset. This prevents accidental schema drift that could break existing pipelines.

When updates are allowed, the system performs a controlled reconciliation:

  • Columns removed from the updated schema are deleted.
  • New columns are added.
  • Existing columns are updated only if their data type or default value has changed.

Each change is explicitly applied at the column level, ensuring that schema evolution is transparent and traceable.

4.6 Training-Specific Templates

In addition to user-defined templates, the platform also generates training-specific templates internally. These templates extend the base schema with additional semantic flags that are critical for model execution.

Examples include:

  • Identifying the target column
  • Marking date or timestamp columns
  • Flagging unique identifiers
  • Distinguishing feature columns from labels

These training templates are not exposed for general editing. They are derived artifacts, created to capture the exact schema used during a specific training run. This ensures that every trained model is tied to a precise, reproducible schema definition.

4.7 Templates as a Contract Across the Lifecycle

Once defined, templates influence nearly every downstream operation:

  • Dataset validation checks incoming data against the template.
  • Statistics generation categorizes columns based on template data types.
  • Preprocessing pipelines decide scaling, encoding, and feature generation using template metadata.
  • Training logic determines target variables and special columns based on template flags.
  • Inference validates incoming prediction inputs against the same schema expectations.

Because of this, templates effectively act as the single source of truth for how data should be interpreted across the platform.

4.8 Operational Benefits of the Template Model

By centralizing schema definition and validation, the platform avoids a class of issues common in ad-hoc machine learning systems, such as silent schema mismatches, inconsistent preprocessing, or training–inference skew. Templates make the system more predictable, auditable, and maintainable. They also allow teams to reason about data and models at a higher level, without needing to inspect raw files or code paths for every operation.


5. Preprocessing and Feature Engineering

Figure 3: Dataset preprocessing and feature engineering workflow

5.1 Purpose of the Preprocessing Layer

Preprocessing is where raw, validated data is transformed into a form that machine learning models can reliably consume. In this platform, preprocessing is treated as a first-class, auditable step rather than a transient operation embedded inside training code.

Every preprocessing run produces explicit artifacts—transformed datasets, state files, and metadata—that are persisted and later reused during inference, batch prediction, and incremental training. This guarantees that the exact same transformations applied during training are reapplied during prediction, eliminating training–inference skew.

5.2 Preprocess Sessions as Versioned Artifacts

Each training request creates a dedicated preprocessing session. This session acts as a container for all preprocessing-related outputs, including:

  • Train and test datasets
  • Transformation state files
  • Model output paths
  • References to the original dataset and template

These sessions are stored in the database and linked to a specific training run. As a result, preprocessing is not an implicit step; it is a versioned, traceable operation that can be inspected independently of model training.

5.3 Data Access and Execution Strategy

The platform supports both CSV-based datasets and data lake–backed datasets. Preprocessing adapts its execution strategy based on dataset size and source:

  • Small CSV datasets are processed locally for faster turnaround.
  • Large datasets or data lake sources trigger distributed execution using Ray.

This decision is made automatically based on dataset characteristics, allowing users to work with both small experiments and large-scale production datasets without changing workflows.

5.4 Column Categorization Using Templates

Before any transformation is applied, columns are categorized using the template definition. Each column is assigned to one of the following groups:

  • Numerical columns
  • Categorical columns
  • Datetime columns
  • Boolean columns

This categorization is derived directly from the template schema and ensures that transformations are deterministic and consistent across runs.

5.5 Numerical Feature Processing

Numerical columns are standardized using Vaex’s StandardScaler. This scales features to zero mean and unit variance, which is especially important for models that are sensitive to feature magnitude, such as linear models and gradient-based algorithms.

The fitted scaler parameters are stored as part of the preprocessing state. These parameters are later reloaded during inference and batch prediction to ensure identical scaling behavior.
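
The sketch below shows how this scaling and state capture can be expressed with Vaex; the column names and file paths are illustrative.

    # Sketch: standardizing numerical columns with Vaex and persisting the fitted state.
    import vaex
    import vaex.ml

    df = vaex.open("training_data.csv")          # illustrative path
    numerical_cols = ["age", "income"]           # derived from the template in practice

    scaler = vaex.ml.StandardScaler(features=numerical_cols, prefix="scaled_")
    df = scaler.fit_transform(df)                # adds scaled_age, scaled_income as virtual columns

    # The fitted transformation is captured in the DataFrame state and written to disk,
    # so inference can replay exactly the same scaling later.
    df.state_write("preprocess_state.json")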

5.6 Categorical Feature Encoding

Categorical columns are processed using label encoding. Each unique category value is mapped to a numeric representation. This encoding is performed deterministically and persisted in the preprocessing state.

For datasets sourced from the data lake, label mappings may be retrieved directly from metadata tables rather than being recomputed. This allows consistent category handling even when predictions are performed on datasets different from the original training data.

Boolean columns are normalized to string representations where required to avoid ambiguity during encoding.

5.7 Datetime Feature Handling

Datetime columns play a special role, particularly for forecasting and unsupervised use cases. During preprocessing:

  • Datetime columns are validated and parsed into canonical formats.
  • For time-series models, datetime columns may be expanded into derived features such as day, week, or month components.
  • For unsupervised anomaly detection, datetime columns are used to infer temporal frequency and generate time-based features when necessary.

Datetime handling is tightly coupled with model type and problem definition, ensuring that time semantics are preserved correctly.

5.8 Missing Value Handling

Missing values are handled using predefined imputation strategies:

  • Numerical columns are imputed using the mean.
  • Categorical columns are imputed using the most frequent value.

These strategies are applied consistently and recorded in the preprocessing configuration. Any deviation, such as unresolved missing values after preprocessing, results in an explicit error rather than silent correction.
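
A minimal sketch of these imputation strategies, assuming pandas and illustrative column lists:

    # Mean imputation for numerical columns, most-frequent-value imputation for categorical ones.
    import pandas as pd

    def impute(df: pd.DataFrame, numerical_cols: list[str], categorical_cols: list[str]) -> pd.DataFrame:
        out = df.copy()
        for col in numerical_cols:
            out[col] = out[col].fillna(out[col].mean())
        for col in categorical_cols:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        # Columns that still contain nulls raise an explicit error instead of being silently corrected.
        remaining = out[numerical_cols + categorical_cols].isna().sum()
        if remaining.any():
            raise ValueError(f"Unresolved missing values: {remaining[remaining > 0].to_dict()}")
        return out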

5.9 Train–Test Split and Data Persistence

During preprocessing, datasets are split into training and testing subsets. These subsets are written to disk as versioned CSV files and linked to the preprocessing session.

The explicit persistence of train and test data serves multiple purposes:

  • Enables reproducibility of training results
  • Supports later inspection and debugging
  • Allows downstream components, such as feature importance or SHAP analysis, to reuse the same data

5.10 State Management and Reusability

A key output of preprocessing is the state file, which captures all transformations applied to the data. This includes scalers, encoders, and any derived feature logic.

At inference time, this state file is loaded and applied to incoming data before prediction. This design ensures that feature transformations are never reimplemented or approximated during prediction—they are replayed exactly as they were during training.
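
The sketch below illustrates replaying a stored Vaex state file on incoming data at inference time; the record values, column names, and state path are illustrative and follow on from the scaling example above.

    # Sketch: re-applying the training-time preprocessing state to new data.
    import vaex

    incoming = vaex.from_dict({"age": [42], "income": [55000.0]})   # illustrative single record
    incoming.state_load("preprocess_state.json")                    # replays scalers/encoders fitted at training time

    # The derived columns (e.g. scaled_age) now exist exactly as they did during training,
    # without re-fitting any transformation on the new data.
    features = incoming[["scaled_age", "scaled_income"]].to_pandas_df()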

5.11 Incremental and Tune-Aware Preprocessing

The preprocessing pipeline is aware of both incremental training and hyperparameter tuning modes:

  • Incremental training reuses existing preprocessing state while incorporating only new data files.
  • Hyperparameter tuning uses a minimal preprocessing flow to avoid redundant transformations and ensure compatibility with tuning frameworks.

These modes allow the platform to scale from experimentation to continuous learning without duplicating logic or compromising correctness.

5.12 Error Handling and Observability

Preprocessing failures are treated as first-order events. Errors such as schema mismatches, missing columns, invalid data types, or unresolved null values are surfaced immediately with clear diagnostics.

Each preprocessing run logs detailed metadata, making it possible to trace exactly how input data was transformed and why a particular run failed or succeeded.

5.13 Why Preprocessing Is Explicit in This Platform

By externalizing preprocessing from model code and treating it as a persistent, inspectable step, the platform avoids common issues such as:

  • Inconsistent transformations between training and inference
  • Silent data drift caused by schema changes
  • Difficulty reproducing historical model behavior

Preprocessing, in this system, is not just preparation—it is part of the model’s identity.


6. Model Training and Execution

Figure 4: Model training and execution pipeline

6.1 How Training Fits Into the Platform

Model training in this platform is designed as a managed workflow, not a one-off operation that simply outputs a model file. Every training run is treated as a durable entity with a lifecycle, metadata, and traceability across the rest of the system.

When a user starts training, the platform records more than just the algorithm choice. It captures the dataset used, how columns were interpreted, what preprocessing logic was applied, and which constraints governed the run. This information is persisted so that predictions, retraining, feature importance, and audits all operate from the same shared context.

At a high level, a training run always produces:

  • A preprocessing state that can be reused at inference time
  • A trained model artifact (or set of artifacts)
  • A training record containing metrics, configuration, and execution status

This separation between intent, execution, and artifacts is what allows the platform to scale across datasets, problem types, and execution backends.

6.2 Problem Type as the Foundation

The selected problem type determines how almost every part of training behaves. Rather than branching logic deep inside model code, the platform uses the problem type as an upfront contract.

Each problem type enforces a different set of expectations:

Regression and Classification

  • Require a single target column
  • Enforce strict data type compatibility for the target
  • Support hyperparameter tuning
  • Allow incremental learning (classification only)

Forecasting

  • Requires a timestamp column and a unique identifier
  • Optionally supports exogenous features
  • Uses frequency-aware preprocessing and inference
  • Disallows SHAP-based explanations

Unsupervised

  • Does not require a target
  • Requires a timestamp column for temporal context
  • Enforces SHAP usage for explainability

These validations happen before execution begins. If a configuration violates any of these rules, the training request is rejected early, avoiding partial execution or ambiguous results.
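
The sketch below encodes these rules as a hypothetical constraint table with an early validation step; the field names and dictionary layout are assumptions rather than the platform's internal representation.

    # Hypothetical problem-type constraint table mirroring the rules described above.
    PROBLEM_TYPE_RULES = {
        "regression":     {"requires_target": True, "allows_tuning": True},
        "classification": {"requires_target": True, "allows_tuning": True, "allows_incremental": True},
        "forecasting":    {"requires_timestamp": True, "requires_unique_id": True, "allows_shap": False},
        "unsupervised":   {"requires_target": False, "requires_timestamp": True},
    }

    def validate_training_config(problem_type: str, config: dict) -> None:
        rules = PROBLEM_TYPE_RULES[problem_type]
        if rules.get("requires_target") and not config.get("target_column"):
            raise ValueError(f"{problem_type} requires a target column")
        if rules.get("requires_timestamp") and not config.get("timestamp_column"):
            raise ValueError(f"{problem_type} requires a timestamp column")
        if rules.get("requires_unique_id") and not config.get("unique_id_column"):
            raise ValueError(f"{problem_type} requires a unique identifier column")
        if config.get("enable_tuning") and not rules.get("allows_tuning", False):
            raise ValueError(f"Hyperparameter tuning is not supported for {problem_type}")
        if config.get("incremental") and not rules.get("allows_incremental", False):
            raise ValueError(f"Incremental learning is not supported for {problem_type}")
        if config.get("enable_shap") and rules.get("allows_shap") is False:
            raise ValueError(f"SHAP is not supported for {problem_type}")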

6.3 Mapping Dataset Columns to Meaning

Training does not operate on raw columns alone. Every column in the dataset must be explicitly interpreted before it can participate in the model.

During training setup, users map dataset columns to semantic roles such as:

  • Target
  • Date or timestamp
  • Unique identifier
  • Domain-specific attributes
  • Additional numerical features

This mapping is validated against both the dataset schema and the selected problem type. For example, a column marked as a timestamp must be of a datetime type, and a unique identifier cannot be a floating-point value.

Once validated, the column mapping becomes part of the training definition. It is stored alongside the model and reused later during:

  • Inference (to reconstruct feature order)
  • Batch prediction (to validate schemas)
  • Feature importance generation
  • Incremental retraining

This approach avoids schema drift and ensures that predictions are always evaluated in the same feature space that the model was trained on.
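
As an illustration, the sketch below applies the role-specific checks described above using pandas; the role names and column names are assumptions.

    # Illustrative checks on column-to-role mappings.
    import pandas as pd

    def validate_column_roles(df: pd.DataFrame, roles: dict) -> None:
        # roles example: {"target": "churned", "timestamp": "event_time", "unique_id": "customer_id"}
        ts = roles.get("timestamp")
        if ts and not pd.api.types.is_datetime64_any_dtype(df[ts]):
            raise ValueError(f"Column {ts!r} is mapped as a timestamp but is not a datetime type")
        uid = roles.get("unique_id")
        if uid and pd.api.types.is_float_dtype(df[uid]):
            raise ValueError(f"Column {uid!r} is mapped as a unique identifier but is floating-point")
        target = roles.get("target")
        if target and target not in df.columns:
            raise ValueError(f"Target column {target!r} is not present in the dataset")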

6.4 Algorithm Resolution and Constraints

Algorithms are resolved dynamically rather than being hardcoded by the user. The platform determines the appropriate algorithm based on:

  • The selected problem type
  • User configuration flags (tuning, incremental learning)
  • Dataset source (CSV or data lake)

Not every algorithm is compatible with every configuration. The platform explicitly enforces these boundaries, for example:

  • Hyperparameter tuning is only enabled for regression and classification
  • Incremental learning is limited to classification
  • Forecasting models cannot use SHAP
  • Certain algorithms require additional metadata such as frequency or unique identifiers

By enforcing these constraints upfront, the system ensures that training failures are configuration-related rather than runtime surprises.

6.5 Creating a Training Session

Once all validations pass, the platform creates a training session. At this stage, no computation has started yet, but the intent to train is formally recorded.

The training session stores:

  • Predictor name and domain
  • Dataset and tenant information
  • Problem type and resolved algorithm
  • Column mappings and execution flags
  • Initial status (In Progress)

A dedicated directory structure is also created to hold:

  • Preprocessing outputs
  • Train and test splits
  • Model artifacts
  • SHAP outputs (if applicable)

This separation allows the platform to track training independently of execution. Even if execution fails later, the training record remains available for inspection and debugging.

6.6 Preprocessing and Execution Strategy

Preprocessing is handled using Vaex to efficiently support large datasets without loading everything into memory.

The preprocessing pipeline follows a consistent approach:

  • Numerical columns are standardized using Vaex’s StandardScaler
  • Categorical columns are encoded in a deterministic manner to ensure consistency between training and inference
  • Datetime columns are handled explicitly for forecasting and unsupervised use cases

The fitted preprocessing state is serialized and stored as a reusable artifact. During inference or batch prediction, this state is loaded instead of being recomputed, guaranteeing identical transformations.

Execution strategy depends on dataset size and configuration:

  • Small datasets may be processed synchronously
  • Large datasets, tuning-enabled runs, and data lake sources are executed as Ray jobs

This allows the platform to scale without changing the user-facing workflow.

6.7 Hyperparameter Tuning and Incremental Learning

When hyperparameter tuning is enabled, training follows a specialized execution path. Instead of retraining preprocessing for every trial, the platform reuses the preprocessed feature space and varies only model-level parameters.

This significantly reduces execution overhead while still allowing meaningful exploration of the parameter space.

Incremental learning follows a different pattern. Rather than retraining from scratch, the system:

  • Identifies new files that were not included in previous training
  • Applies the existing preprocessing state
  • Updates the model using only new data
  • Preserves historical performance metadata

Incremental learning is intentionally restricted to scenarios where model behavior is stable and well-understood, which is why it is limited to classification use cases.


7. Inference and Prediction Workflow

Figure 5: Real-time and batch inference workflow

7.1 Overview

Inference in this platform is not a simple “load model and predict” operation. It is a controlled workflow that guarantees predictions are produced using the exact same assumptions, transformations, and schema that were applied during training.

Every prediction—whether it comes from a form input, a batch dataset, or a forecasting request—is tightly bound to:

  • A completed training run
  • Its associated preprocessing state
  • The column mappings defined at training time

This design ensures that predictions remain reliable even as datasets evolve, models are retrained, or infrastructure changes underneath.

7.2 Preconditions for Prediction

Before any inference logic is executed, the platform performs a strict set of validations. These checks exist to prevent silent failures or misleading outputs.

A prediction request is allowed to proceed only if:

  • The referenced training run exists and has Completed successfully
  • The preprocessing session linked to the training run exists
  • A valid model artifact path is available
  • The dataset context (CSV or data lake) is compatible with the trained model

If any of these conditions fail, the request is rejected immediately with a descriptive error. This avoids scenarios where partially trained models or outdated preprocessing states are accidentally used.

7.3 Single-Record (Form-Based) Prediction

Form-based prediction is designed for real-time or interactive use cases, where users provide feature values directly.

When a request arrives, the platform performs the following sequence:

First, the input features are validated against the training schema. Any missing required features or unexpected fields are rejected early. If the target column was part of training, it is injected with a placeholder value to maintain schema consistency.

Next, the preprocessing state saved during training is loaded. This is a critical step. Instead of re-running preprocessing logic, the platform applies the serialized Vaex state to the incoming data. This guarantees:

  • Identical scaling for numerical columns
  • Consistent encoding for categorical columns
  • Stable feature ordering

For tuned models, additional cleanup is applied. Datetime columns are dropped if they were not part of the trained feature space, and boolean values are normalized to match the format expected by the model.

Once transformed, the data is passed to the appropriate inference backend based on the algorithm and problem type.

7.4 Algorithm-Specific Inference Handling

Inference logic is intentionally specialized per algorithm rather than generic.

For XGBoost, the platform loads the Booster directly and performs prediction using a DMatrix.
For classification, the raw output is converted into a boolean prediction with a confidence score derived from the probability.

For scikit-learn based models (such as SGD classifiers or regression models), the model is loaded using joblib. If probability outputs are available, confidence is calculated from predict_proba. Otherwise, only the raw prediction is returned.

For unsupervised models like Isolation Forest, inference returns semantic labels such as Normal or Anomaly instead of numeric outputs. Confidence is intentionally minimal here, reflecting the probabilistic nature of anomaly detection.
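
The sketch below shows what the XGBoost and scikit-learn paths might look like, assuming a preprocessed single-record feature frame and illustrative artifact paths.

    # Sketch of per-algorithm inference on an already-preprocessed feature frame.
    import joblib
    import numpy as np
    import pandas as pd
    import xgboost as xgb

    X = pd.DataFrame({"scaled_age": [0.3], "scaled_income": [-1.1]})   # illustrative features

    # XGBoost: load the Booster and predict through a DMatrix.
    booster = xgb.Booster()
    booster.load_model("model.xgb")                            # illustrative artifact path
    probability = float(booster.predict(xgb.DMatrix(X))[0])    # probability for a binary:logistic model
    prediction = probability >= 0.5                            # boolean output, probability as confidence

    # scikit-learn (e.g. SGDClassifier): load via joblib, use predict_proba when available.
    sk_model = joblib.load("model.joblib")                     # illustrative artifact path
    label = sk_model.predict(X)[0]
    if hasattr(sk_model, "predict_proba"):
        confidence = float(np.max(sk_model.predict_proba(X)[0]))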

Across all algorithms, the platform ensures that:

  • No NaN values enter the model
  • Feature order matches the trained schema exactly
  • Errors are surfaced clearly rather than masked

7.5 Batch Prediction Workflow

Batch prediction is designed for large datasets and asynchronous execution. It reuses the same inference logic as form-based prediction but applies it at scale.

The workflow begins by validating the prediction dataset:

  • Only datasets marked for prediction usage are allowed
  • Exactly one file must be present for CSV-based prediction
  • The dataset schema must match the training schema exactly

Extra columns are dropped, and missing columns cause the request to fail. This strict enforcement prevents schema drift from producing incorrect results.

The dataset is then processed in chunks. Each batch is converted to a Vaex DataFrame, transformed using the saved preprocessing state, and validated for null values. Any remaining nulls are handled explicitly to avoid runtime failures.

Rather than running predictions inline, the platform submits a Ray batch prediction job. This allows:

  • Horizontal scaling
  • Controlled resource usage
  • Fault isolation from the API layer

The API immediately returns a job identifier, and prediction results are stored asynchronously.

7.6 Forecasting Inference

Forecasting inference differs significantly from standard prediction and is treated as a dedicated workflow.

Forecasting requires:

  • A timestamp column
  • A unique identifier
  • A trained model frequency

Before forecasting begins, the platform validates that the requested forecast frequency is not finer than the frequency used during training. This prevents logically invalid forecasts.

For AutoARIMA, the platform:

  • Loads train and test splits
  • Reconstructs the full historical time series
  • Optionally validates and aligns exogenous datasets
  • Generates confidence intervals
  • Computes a confidence score based on interval width

For LightGBM forecasting, the platform generates future timestamps, creates time-based features, and optionally merges exogenous data before prediction.

Forecast outputs are returned as structured time series data, including:

  • Forecasted value
  • Timestamp
  • Confidence (where applicable)

7.7 Prediction Persistence and Auditability

Every prediction—form, batch, or forecast—is recorded in the system.

Stored prediction metadata includes:

  • Training ID and model name
  • Input features or dataset reference
  • Prediction output
  • Execution status
  • Confidence or accuracy (when applicable)

For batch predictions, results are stored either as:

  • Parquet files (CSV source)
  • Iceberg tables (data lake source)

This allows predictions to be reviewed, audited, and reused without rerunning inference.

7.8 Error Handling and Stability

Inference errors are handled conservatively. The platform avoids partial results and always fails loudly when assumptions are violated.

Common failure cases include:

  • Schema mismatches
  • Missing preprocessing state
  • Invalid frequency requests
  • Null values after transformation

Rather than returning ambiguous outputs, the system provides explicit error messages that help users correct inputs or configurations.

7.9 Design Rationale

The inference layer is intentionally strict. While this adds upfront validation overhead, it ensures that predictions remain:

  • Reproducible
  • Explainable
  • Consistent across environments

By tightly coupling inference to training artifacts and preprocessing state, the platform avoids the most common causes of production ML failures.


8. Batch Prediction and Ray Execution

Figure 6: Distributed batch prediction execution using Ray

8.1 Purpose and Design Intent

Batch prediction exists to support scenarios where inference must be applied to large datasets rather than individual records. Typical use cases include scoring an entire customer base, generating risk scores in bulk, or producing predictions that are later consumed by downstream systems.

From a design perspective, batch prediction is treated as a background, distributed workload, not an extension of synchronous API inference. This separation is intentional. It prevents long-running jobs from blocking API traffic and allows predictions to scale independently of user-facing services.

To achieve this, the platform relies on Ray as the execution engine for batch inference.

8.2 Entry Conditions for Batch Prediction

A batch prediction request is accepted only when a strict set of conditions is met.

The model referenced by the request must:

  • Exist in the system
  • Have completed training successfully
  • Have a valid preprocessing state and model artifact

The dataset selected for batch prediction must:

  • Be explicitly marked for prediction usage
  • Contain exactly one file in the case of CSV uploads
  • Match the training schema exactly (excluding the target column)

These checks are not optional. They exist to ensure that batch inference behaves deterministically and does not silently adapt to schema differences.

8.3 Dataset Validation and Preparation

Before any Ray job is submitted, the dataset is validated in the API layer.

For CSV-based datasets:

  • The file is opened using Vaex to avoid loading it fully into memory
  • Column names are compared against the expected schema derived from training
  • Missing columns cause the request to fail immediately
  • Extra columns are dropped explicitly and logged
  • Null values are checked column by column and rejected if present

For data lake–based datasets:

  • Column validation is performed against the Iceberg table schema
  • No local file materialization is performed at this stage
  • The system ensures that the table structure aligns with the trained feature set

This validation phase is critical. Once a Ray job is submitted, failures become harder to diagnose. Catching schema issues early reduces wasted compute and unclear failure states.

8.4 Feature Scaling and State Application

A defining characteristic of batch prediction in this platform is that preprocessing is never recomputed.

Instead, the preprocessing state saved during training is reused exactly as-is.

For CSV datasets:

  • Data is read in batches using Arrow and Vaex
  • Each batch is converted to a pandas DataFrame only after preprocessing
  • The saved Vaex state file is loaded and applied
  • The target column is injected temporarily if required and then removed
  • Final scaled features are validated for NaN values

For data lake datasets:

  • Scaling may be applied inside the Ray job depending on configuration
  • In some cases, preprocessing is bypassed if scaling was handled upstream

This approach guarantees that batch predictions are mathematically consistent with training, even months after the model was created.

8.5 Ray Job Submission Model

Once validation and preparation are complete, the system constructs a Ray job configuration.
This configuration includes:

  • Model path and algorithm type
  • Prediction problem type
  • Batch size
  • Dataset identifiers
  • Preprocessing state path (if applicable)
  • Data source type (CSV or data lake)
  • Iceberg metadata (for data lake predictions)

The job is submitted to Ray using the job submission client, and execution begins asynchronously.

At this point, the API does not wait for completion. Instead, it:

  • Stores a prediction record with status set to Pending
  • Persists the Ray job ID
  • Returns control to the caller immediately

This design ensures that batch prediction remains resilient even for very large datasets.
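
A minimal sketch of this submission flow using Ray's job submission client is shown below; the Ray address, entrypoint script, and metadata keys are assumptions for illustration.

    # Sketch: submitting a batch prediction job to Ray and returning immediately.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://ray-head:8265")        # illustrative cluster address

    job_id = client.submit_job(
        entrypoint="python batch_predict.py",                   # hypothetical worker entrypoint
        runtime_env={"working_dir": "./jobs"},
        metadata={
            "model_path": "models/churn/model.xgb",             # illustrative configuration values
            "problem_type": "classification",
            "batch_size": "10000",
            "state_path": "models/churn/preprocess_state.json",
            "source": "csv",
        },
    )

    # The API layer persists job_id on a Pending prediction record and returns to the caller.
    print("Submitted Ray job:", job_id)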

8.6 Execution Inside Ray

Within Ray, prediction execution is parallelized across batches.

Each worker:

  • Loads the trained model artifact
  • Applies preprocessing to its assigned batch
  • Performs inference using the correct algorithm-specific logic
  • Writes prediction outputs incrementally

For CSV-based datasets, results are typically written to Parquet files.
For data lake–based datasets, predictions are written back as Iceberg tables.

This separation allows batch prediction results to be consumed independently of the API and reused by analytics or downstream pipelines.

8.7 Completion, Status Updates, and Results

Once the Ray job completes, the platform updates the prediction record with:

  • Final status (Completed or Failed)
  • Output location (file path or Iceberg table name)

Users can retrieve batch prediction results through dedicated endpoints. Depending on the data source, results may be:

  • Previewed directly
  • Queried from the data lake
  • Downloaded as files

The system also supports polling for prediction status.
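
Status polling can be sketched with the same job submission client; the address and job ID below are placeholders.

    # Sketch: checking the status of a previously submitted Ray job.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://ray-head:8265")        # illustrative cluster address
    status = client.get_job_status("raysubmit_abc123")          # PENDING, RUNNING, SUCCEEDED, FAILED, ...
    print(status)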


9. Feature Importance and SHAP Analysis

Figure 7: SHAP-based model explainability workflow

Understanding why a model produces a particular output is a core requirement for trust, debugging, and regulatory review. This platform supports feature importance generation through both native model mechanisms and SHAP-based explainability, depending on the algorithm and configuration used during training.

Feature importance is always tied to a completed training run. It is generated against the exact model artifact, preprocessing logic, and feature schema that were used during training, ensuring consistency between training behavior and explainability results.

9.1 When Feature Importance Is Available

Feature importance can only be requested once a training job has successfully completed. The system enforces this constraint to avoid partial or misleading explanations. The availability also depends on the selected algorithm:

  • Tree-based models such as XGBoost and LightGBM support direct importance extraction.
  • Linear models such as SGD Classifier expose coefficients that are converted into relative importance.
  • Time-series models such as AutoARIMA support feature importance only when exogenous variables are used.
  • Unsupervised models expose limited interpretability and are handled separately.

The platform validates these conditions before proceeding, and users receive explicit errors if explainability is not supported for the selected configuration.

9.2 Non-SHAP Feature Importance

When SHAP is not enabled, feature importance is computed directly from the trained model:

  • For tree-based models, gain-based importance is extracted and normalized to percentages.
  • For linear models, absolute coefficient magnitudes are used and normalized.
  • For forecasting models, coefficients associated with exogenous variables are aggregated across series.

The resulting importance values are converted into human-readable impact levels such as High Impact, Medium Impact, or Low Impact. This mapping is intentionally coarse-grained to help users reason about influence without over-interpreting small numeric differences.

Once generated, feature importance is stored with the training record and reused unless the model is retrained or updated incrementally.
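
The sketch below illustrates gain-based importance extraction and coarse impact bucketing for an XGBoost model; the bucket thresholds and artifact path are assumptions.

    # Sketch: gain-based feature importance normalized to percentages.
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model("model.xgb")                        # illustrative artifact path

    gains = booster.get_score(importance_type="gain")      # {feature_name: gain}
    total = sum(gains.values()) or 1.0
    percentages = {f: 100.0 * g / total for f, g in gains.items()}

    def impact_level(pct: float) -> str:
        # Illustrative thresholds for the coarse-grained impact labels.
        if pct >= 20:
            return "High Impact"
        if pct >= 5:
            return "Medium Impact"
        return "Low Impact"

    importance = {f: {"percent": round(p, 2), "impact": impact_level(p)} for f, p in percentages.items()}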

9.3 SHAP-Based Explainability

When SHAP is enabled during training, feature importance generation follows a different execution path. Instead of computing importance inline, the system submits a dedicated Ray job to compute SHAP values asynchronously.

This design choice is intentional:

  • SHAP computation is resource-intensive.
  • It can scale poorly with large datasets.
  • It benefits from parallel execution and isolation from the main API workload.

The SHAP job loads:

  • The trained model artifact
  • The preprocessing state
  • The training or reference dataset
  • The mapped feature schema

Once computation completes, SHAP values are persisted as a structured JSON artifact and linked back to the training record. The training status is updated to reflect SHAP completion, failure, or pending state.

If SHAP values already exist and the model has not changed, the system avoids recomputation and serves the stored results directly.
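
A simplified sketch of the SHAP job's core computation, assuming a tree-based model and the persisted train split as reference data (paths and the JSON layout are illustrative):

    # Sketch: computing SHAP values for a tree-based model and persisting a JSON artifact.
    import json
    import pandas as pd
    import shap
    import xgboost as xgb

    model = xgb.Booster()
    model.load_model("model.xgb")                          # illustrative artifact path
    reference = pd.read_csv("train_split.csv")             # illustrative preprocessed train split

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(reference)         # shape: (rows, features) for a binary model

    artifact = {
        "features": list(reference.columns),
        "mean_abs_shap": [float(abs(shap_values[:, i]).mean()) for i in range(reference.shape[1])],
    }
    with open("shap_values.json", "w") as f:
        json.dump(artifact, f)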

9.4 Incremental Training and SHAP

For incremental training scenarios, feature importance must remain temporally consistent. The platform tracks when SHAP values were generated and compares that timestamp with the last model update. If the model has changed since SHAP computation, explainability is recalculated automatically to prevent stale interpretations.

This ensures that feature importance always reflects the current state of the model, even as new data is introduced incrementally.


10. Forecasting and Time-Series Inference

Figure 8: Forecasting and time-series inference pipeline

The platform supports time-series forecasting as a first-class capability, designed for both single-series and multi-series use cases. Forecasting workflows are distinct from regression and classification, as they require strict handling of temporal structure, frequency alignment, and optional exogenous inputs.

10.1 Supported Forecasting Models

Forecasting is currently implemented using:

  • AutoARIMA for statistical time-series modeling
  • LightGBM for machine-learning-based forecasting with engineered time features

Each model follows a different execution strategy but is unified under a consistent API and validation layer.

10.2 Frequency and Horizon Validation

Before any forecast is generated, the system validates that the requested forecast frequency and horizon are compatible with the frequency used during training.

The platform explicitly prevents:

  • Requesting forecasts at a finer granularity than the trained frequency
  • Generating horizons that exceed reasonable bounds for the trained model

All frequency conversions are normalized internally to minute-level representations, allowing consistent comparison across inputs such as hour, day, week, or compound intervals.
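
The sketch below illustrates the minute-level normalization and the finer-granularity check; the frequency labels and conversion values are assumptions.

    # Illustrative frequency-to-minutes table and validation check.
    FREQUENCY_MINUTES = {"minute": 1, "hour": 60, "day": 1440, "week": 10080, "month": 43200}

    def validate_forecast_frequency(trained_freq: str, requested_freq: str) -> None:
        if FREQUENCY_MINUTES[requested_freq] < FREQUENCY_MINUTES[trained_freq]:
            raise ValueError(
                f"Requested frequency {requested_freq!r} is finer than trained frequency {trained_freq!r}"
            )

    validate_forecast_frequency("hour", "day")     # OK: coarser than the trained frequency
    # validate_forecast_frequency("day", "hour")   # would raise: finer than the trained frequency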

10.3 Handling of Unique Identifiers

For multi-series forecasting, each time series is identified by a unique identifier. The forecasting request must specify the target unique ID, and the platform ensures that:

  • The identifier exists in the training data
  • The identifier is consistently typed and sanitized
  • Forecast output is scoped strictly to the requested series

This prevents cross-series contamination and ensures deterministic forecasts.
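
An illustrative identifier check, assuming the training data is available as a pandas DataFrame and the series identifier lives in a unique_id column:

    import pandas as pd


    def validate_unique_id(training_df: pd.DataFrame, unique_id,
                           id_column: str = "unique_id") -> str:
        uid = str(unique_id).strip()  # consistent typing and sanitization
        known_ids = set(training_df[id_column].astype(str).str.strip())
        if uid not in known_ids:
            raise ValueError(f"Unknown series identifier: {uid!r}")
        return uid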

10.4 Exogenous Variable Processing

When exogenous variables are used, the platform performs strict validation before forecasting:

  • All required non-target features must be present
  • No missing values are allowed
  • Dates must be continuous and aligned with the model frequency
  • Exogenous data must start immediately after the last observed training timestamp

For LightGBM-based forecasting, time-based features (such as day, week, or month indicators) are generated automatically and merged with exogenous inputs. For AutoARIMA, exogenous variables are passed directly into the statistical forecasting process.

If any of these conditions fail, the forecast request is rejected with a clear validation error.
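
A sketch of these checks, assuming exogenous inputs arrive as a pandas DataFrame with a ds timestamp column and the trained frequency is expressed as a pandas offset alias; column names and structure are assumptions:

    import pandas as pd


    def validate_exogenous(exog: pd.DataFrame, required_features: list,
                           last_train_ts: pd.Timestamp, freq: str) -> None:
        missing_cols = set(required_features) - set(exog.columns)
        if missing_cols:
            raise ValueError(f"Missing exogenous features: {sorted(missing_cols)}")

        if exog[required_features].isna().any().any():
            raise ValueError("Exogenous data contains missing values.")

        ts = pd.to_datetime(exog["ds"]).sort_values().reset_index(drop=True)
        expected = pd.date_range(
            start=last_train_ts + pd.tseries.frequencies.to_offset(freq),
            periods=len(ts), freq=freq)
        # Dates must be continuous, aligned to the model frequency, and start
        # immediately after the last observed training timestamp.
        if not ts.equals(pd.Series(expected)):
            raise ValueError("Exogenous dates are not continuous or not aligned "
                             "with the trained frequency.")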

10.5 Forecast Execution and Output

Forecast execution is performed synchronously for single requests and produces:

  • A sequence of future timestamps
  • Corresponding forecasted values
  • Optional confidence scores (where supported)

For AutoARIMA, confidence intervals are computed and converted into confidence scores based on interval width. For LightGBM, predictions are deterministic unless additional uncertainty estimation is enabled externally.

The final forecast output is formatted consistently, regardless of model type, allowing downstream consumers to treat forecasting results uniformly.
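
One plausible way to derive a bounded confidence score from interval width, as described above; the scaling rule is an assumption, not the platform's exact formula:

    import numpy as np


    def interval_to_confidence(mean: np.ndarray, lower: np.ndarray,
                               upper: np.ndarray) -> np.ndarray:
        width = upper - lower
        # Narrow intervals relative to the forecast magnitude map to higher confidence.
        relative_width = width / (np.abs(mean) + 1e-9)
        return np.clip(1.0 - relative_width, 0.0, 1.0)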

10.6 Error Handling and Safety Guards

The forecasting pipeline includes multiple safeguards to prevent invalid or misleading outputs:

  • Large horizon requests are capped
  • Non-continuous date ranges are rejected
  • Missing or duplicated timestamps trigger errors
  • Invalid frequency mappings are blocked early

These controls are designed to fail fast and transparently, reducing ambiguity for users and preventing silent data issues.


11. Error Handling, Validation, and Guardrails

Reliability in an ML platform is not achieved by model performance alone. It is enforced through a layered system of validation, defensive checks, and controlled failure modes that operate consistently across data ingestion, training, inference, and background execution. This platform is designed to fail early, fail clearly, and fail safely—ensuring that invalid states do not propagate downstream.

11.1 Design Philosophy

The platform follows three core principles when handling errors and validation:

Reject invalid inputs as early as possible
User-facing operations such as dataset uploads, training configuration, and prediction requests are validated before any expensive processing begins.

Preserve system consistency under failure
Partial state updates are avoided. Database updates, file writes, and Ray job submissions are coordinated so that failures do not leave orphaned or misleading records.

Expose actionable errors, not internal stack traces
Errors returned to users describe what failed and why, without leaking implementation details or internal paths.

This approach ensures that errors are understandable to users while remaining diagnosable by platform engineers.

11.2 Validation at Data Ingestion

Validation begins at dataset ingestion and continues throughout the lifecycle of the data.

When CSV-based datasets are uploaded, each file is validated against the associated template schema. This validation checks:

  • Column presence and ordering
  • Data type compatibility
  • Null constraints
  • File-level structural integrity (empty files, non-CSV inputs)

Files that fail validation are not discarded silently. Instead:

  • An error report is generated per file
  • The file is stored in a dedicated error location
  • The dataset is marked as invalid or partially usable, depending on the failure type

For datalake-backed datasets, schema validation is performed by comparing the Iceberg table schema against the expected template-derived structure. Preview data is sanitized to ensure that it can be safely rendered in downstream views.

At no point is invalid data allowed to enter the training or inference pipelines.
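
A simplified sketch of the per-file CSV checks, assuming a template represented as {"columns": [{"name", "dtype", "nullable"}, ...]}; the actual validation also covers data type compatibility:

    import pandas as pd


    def validate_csv_file(path: str, template: dict) -> list:
        errors = []
        try:
            df = pd.read_csv(path)
        except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
            return [f"Structural error: {exc}"]  # empty or non-CSV input

        expected = [c["name"] for c in template["columns"]]
        if list(df.columns) != expected:  # column presence and ordering
            errors.append(f"Columns {list(df.columns)} do not match template {expected}")

        for col in template["columns"]:
            if col["name"] not in df.columns:
                continue
            if not col.get("nullable", True) and df[col["name"]].isna().any():
                errors.append(f"Null values found in non-nullable column {col['name']}")
            # Data type compatibility checks would follow the same pattern.
        return errors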

11.3 Guardrails in Training Configuration

Before training begins, the platform validates the training configuration against both the dataset and the selected problem type.

This includes enforcing constraints such as:

  • Only one target column for supervised learning
  • Mandatory date and unique ID fields for forecasting
  • Compatible data types for targets and identifiers
  • Restrictions on incremental learning and hyperparameter tuning based on problem type

These checks are intentionally strict. A training job that violates configuration rules is rejected immediately, rather than allowing ambiguous behavior during preprocessing or model fitting.

Once training is accepted, the configuration is frozen and persisted. Subsequent operations—such as inference, feature importance generation, or incremental updates—are always evaluated against this recorded configuration.
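
An illustrative guardrail function capturing checks of this kind; the configuration field names and the specific incremental-learning restriction are assumptions:

    def validate_training_config(config: dict) -> None:
        # Exactly one target column for supervised problem types.
        if len(config.get("target_columns", [])) != 1:
            raise ValueError("Training requires exactly one target column.")

        # Forecasting additionally requires date and unique ID fields.
        if config["problem_type"] == "forecasting":
            for field in ("date_column", "unique_id_column"):
                if not config.get(field):
                    raise ValueError(f"Forecasting requires '{field}' to be set.")

        # Example restriction only; actual incremental/tuning rules vary by problem type.
        if config.get("incremental") and config["problem_type"] == "forecasting":
            raise ValueError("Incremental learning is not supported for this problem type.")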

11.4 Preprocessing and Feature-Level Validation

Preprocessing is one of the most failure-prone stages in any ML system, particularly when dealing with heterogeneous data sources.

To mitigate this, the platform applies multiple safeguards:

  • Feature transformations are deterministic and state-driven
  • Numerical scaling and categorical encoding are persisted as reusable state artifacts
  • Unexpected nulls, NaNs, or infinite values are detected before model execution
  • Date and time features are validated for continuity and frequency alignment

If preprocessing fails, the training job is marked as failed, associated file logs are updated accordingly, and no model artifact is registered. This prevents partially trained or inconsistent models from being used downstream.
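
A minimal post-transformation sanity check of the kind described above, written as a plain pandas/NumPy scan rather than the platform's exact implementation:

    import numpy as np
    import pandas as pd


    def assert_model_ready(features: pd.DataFrame) -> None:
        if features.isna().any().any():
            raise ValueError("Unexpected nulls remain after preprocessing.")

        numeric = features.select_dtypes(include=[np.number])
        if not numeric.empty and not np.isfinite(numeric.to_numpy()).all():
            raise ValueError("Non-finite values (inf/-inf) detected in numeric features.")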

11.5 Ray Job Execution and Failure Handling

Distributed execution via Ray introduces additional failure modes, including worker crashes, resource exhaustion, and job preemption.

The platform treats Ray jobs as asynchronous but accountable operations. Every Ray job is associated with:

  • A persistent job identifier
  • A corresponding database record
  • A well-defined lifecycle state

If a Ray job fails:

  • The training or prediction status is updated to Failed
  • Associated file logs are marked as unprocessed
  • Downstream operations (such as inference or feature importance) are blocked
  • The failure reason is recorded for audit and debugging

Importantly, the platform does not rely on background polling to detect failures. Instead, job state is reconciled during user-triggered refreshes and result callbacks, reducing operational overhead while maintaining correctness.
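
A sketch of this reconciliation using Ray's job submission API; the status mapping and the record fields are assumptions:

    from ray.job_submission import JobStatus, JobSubmissionClient


    def reconcile_job_status(record: dict, ray_address="http://127.0.0.1:8265") -> dict:
        client = JobSubmissionClient(ray_address)
        status = client.get_job_status(record["ray_job_id"])

        if status == JobStatus.SUCCEEDED:
            record["status"] = "Completed"
        elif status in (JobStatus.FAILED, JobStatus.STOPPED):
            record["status"] = "Failed"
            record["failure_reason"] = client.get_job_info(record["ray_job_id"]).message
        # PENDING / RUNNING leave the record untouched; this runs on user-triggered refresh.
        return record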

11.6 Inference-Time Validation

Inference endpoints apply a final layer of validation before executing predictions.

For real-time inference, the platform verifies:

  • Model existence and accessibility
  • Compatibility between input features and trained schema
  • Availability of preprocessing state
  • Absence of invalid values after transformation

For forecasting and batch inference, additional checks ensure:

  • Frequency compatibility
  • Sufficient historical and exogenous data
  • Schema alignment between training and prediction datasets

If validation fails at this stage, inference is aborted with a clear error message. No partial predictions are returned, and no prediction records are persisted.
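
An illustrative schema-compatibility check for real-time inference, assuming the trained schema is available as a mapping of expected feature names:

    def validate_inference_input(record: dict, trained_schema: dict) -> None:
        missing = set(trained_schema) - set(record)
        unexpected = set(record) - set(trained_schema)
        if missing:
            raise ValueError(f"Missing required features: {sorted(missing)}")
        if unexpected:
            raise ValueError(f"Unknown features not seen during training: {sorted(unexpected)}")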

11.7 Batch Prediction Safeguards

Batch prediction introduces risks related to scale, schema drift, and partial output generation.

To manage this, the platform enforces:

  • Strict schema matching before Ray job submission
  • Null and missing value checks at the batch level
  • Atomic result generation—either the batch completes successfully or it is marked as failed

Intermediate artifacts (such as scaled feature files) are written to controlled directories and cleaned up implicitly through lifecycle management. Prediction results are only exposed once the entire batch has completed successfully.

11.8 Status Propagation and User Visibility

Across all workflows, status propagation is explicit and consistent.

Each major entity—datasets, trainings, preprocess sessions, predictions—maintains a clearly defined status field. Transitions between statuses are controlled and validated, preventing invalid combinations such as “Completed” without metrics or “Pending” without an active job.

Users are never required to infer system state from logs or side effects. The platform surfaces status directly, allowing consumers to reason about progress, failure, and completion deterministically.
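
A minimal sketch of controlled status transitions; the transition table below is illustrative rather than the platform's actual state machine:

    ALLOWED_TRANSITIONS = {
        "Pending": {"Running", "Failed"},
        "Running": {"Completed", "Failed"},
        "Completed": set(),
        "Failed": {"Pending"},  # e.g. an explicit retry
    }


    def transition(current: str, new: str) -> str:
        if new not in ALLOWED_TRANSITIONS.get(current, set()):
            raise ValueError(f"Invalid status transition: {current} -> {new}")
        return new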

11.9 Summary

Error handling and validation in this platform are not treated as secondary concerns. They are embedded into every workflow as first-class design elements.

By enforcing strict validation, isolating failures, and maintaining consistent state transitions, the platform ensures that:

  • Invalid data does not silently corrupt models
  • Failed jobs do not leave residual or misleading artifacts
  • Users receive clear, actionable feedback at every stage

This foundation allows the platform to scale in complexity—across data sources, algorithms, and execution environments—without compromising reliability or trust.