Kubeflow Training Operator

Overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, XGBoost, MPI, Paddle, and others.

Training Operator lets you train large models as Kubernetes workloads, either through the Kubernetes Custom Resources APIs or through the Training Operator Python SDK.

Note: Before the v1.2 release, Kubeflow Training Operator supported only TFJob on Kubernetes.

For a complete reference of the custom resource definitions, please refer to the API Definition.
For details of the all-in-one operator design, please refer to the All-in-one Kubeflow Training Operator.
For details on its observability, please refer to the monitoring design doc.

Prerequisites

Kubernetes cluster and kubectl, both version 1.25 or later

Installation

Master Branch

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

Stable Release

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

TensorFlow Release Only

For users who prefer the original TensorFlow controllers, please check out the v1.2-branch. Patches for bug fixes will still be accepted on this branch.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
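
The three install commands above differ only in the ?ref= query appended to the kustomize path. As a minimal sketch, the helper below builds the kubectl argument list for a chosen ref. The function names kustomize_target and install_command are hypothetical and not part of the project, and the code only constructs the command; it does not run it.

```python
from typing import List, Optional

# Kustomize path copied from the install commands in this README.
BASE = "github.com/kubeflow/training-operator/manifests/overlays/standalone"


def kustomize_target(ref: Optional[str] = None) -> str:
    """Return the kustomize target; omitting ref means the default branch."""
    return BASE if ref is None else f"{BASE}?ref={ref}"


def install_command(ref: Optional[str] = None) -> List[str]:
    """Build the kubectl argument list without executing it."""
    return ["kubectl", "apply", "-k", kustomize_target(ref)]


if __name__ == "__main__":
    # Show the command for the stable release used above.
    print(" ".join(install_command("v1.7.0")))
```

To apply the manifests from Python, the returned list could be passed to subprocess.run(install_command("v1.7.0"), check=True), assuming kubectl is on the PATH and configured for the target cluster.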

Python SDK for Kubeflow Training Operator

Training Operator provides a Python SDK for the custom resources. To learn more about the available SDK APIs, check the TrainingClient.

Use the pip install command to install the latest release of the SDK:

pip install kubeflow-training

Training Operator controller and Python SDK have the same release versions.
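
Because the controller and the SDK share release versions, the SDK can be pinned to match the installed controller. The sketch below assumes that operator release tags look like v1.7.0 and that the matching SDK release is published under the same number without the leading v. The helper name sdk_requirement is hypothetical.

```python
def sdk_requirement(operator_tag: str) -> str:
    """Map an operator release tag such as 'v1.7.0' to a pip requirement."""
    version = operator_tag.strip()
    if version.startswith("v"):
        version = version[1:]
    parts = version.split(".")
    # Expect exactly MAJOR.MINOR.PATCH with numeric components.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected release tag: {operator_tag!r}")
    return f"kubeflow-training=={version}"


if __name__ == "__main__":
    # Requirement string for the stable release mentioned in this README.
    print(sdk_requirement("v1.7.0"))
```

The resulting string can be given to pip directly, for example pip install "kubeflow-training==1.7.0". Whether a given version is actually published should be checked on the package index.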

Quickstart

Please refer to the getting started guide to quickly create your first Training Operator job using the Python SDK.

If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow the PyTorchJob MNIST guide.
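
For orientation, a PyTorchJob custom resource can be sketched as a plain Python dictionary and printed as JSON, which kubectl accepts in place of YAML. This is a minimal illustration, not the manifest from the MNIST guide: the job name and image are placeholders, the helper name pytorchjob_manifest is hypothetical, and the field layout reflects the kubeflow.org/v1 API as commonly used, so it should be checked against the API documentation.

```python
import json


def pytorchjob_manifest(name: str, image: str, workers: int = 1) -> dict:
    """Build a minimal kubeflow.org/v1 PyTorchJob with one master and N workers."""

    def replica(count: int) -> dict:
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        # The training container is conventionally named "pytorch".
                        {"name": "pytorch", "image": image}
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; replace with your own training image.
    job = pytorchjob_manifest("pytorch-example", "example.com/train:latest", 2)
    print(json.dumps(job, indent=2))
```

Saved as a script, the output could be piped to kubectl apply -f - to create the job in the current namespace.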

API Documentation

Please refer to the following API documentation:

Kubeflow.org v1 API Documentation

Community

The following links provide information about getting involved in the community:

Attend the AutoML and Training Working Group community meeting.
Join our Slack channel.
Check out who is using the Training Operator.

This is a part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.

Contributing

Please refer to the DEVELOPMENT guide.

Change Log

Please refer to the CHANGELOG.

Version Matrix

The following table lists the most recent versions of the operator.

Operator Version	API Version	Kubernetes Version
`v1.0.x`	`v1`	1.16+
`v1.1.x`	`v1`	1.16+
`v1.2.x`	`v1`	1.16+
`v1.3.x`	`v1`	1.18+
`v1.4.x`	`v1`	1.23+
`v1.5.x`	`v1`	1.23+
`v1.6.x`	`v1`	1.23+
`v1.7.x`	`v1`	1.25+
`latest` (master HEAD)	`v1`	1.25+
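
The table above can be turned into a quick compatibility check. The sketch below copies the minimum Kubernetes versions from the table into a dictionary and compares them against a cluster version string. The names MIN_KUBERNETES, parse_minor, and is_supported are hypothetical, and operator versions missing from the table raise a KeyError rather than being guessed.

```python
import re
from typing import Tuple

# Minimum Kubernetes (major, minor) per operator minor version, copied from the table above.
MIN_KUBERNETES = {
    "v1.0": (1, 16),
    "v1.1": (1, 16),
    "v1.2": (1, 16),
    "v1.3": (1, 18),
    "v1.4": (1, 23),
    "v1.5": (1, 23),
    "v1.6": (1, 23),
    "v1.7": (1, 25),
    "latest": (1, 25),
}


def parse_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.25.3' or '1.25'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(operator: str, kubernetes: str) -> bool:
    """Return True if the Kubernetes version meets the operator's minimum."""
    key = operator if operator == "latest" else "v%d.%d" % parse_minor(operator)
    return parse_minor(kubernetes) >= MIN_KUBERNETES[key]
```

For example, is_supported("v1.7.0", "v1.24.9") evaluates to False because the table lists 1.25+ for v1.7.x.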

Acknowledgement

This project originally started as a distributed training operator for TensorFlow. We later merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to everyone who filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions. We would also like to thank everyone who has contributed to and maintained the original operators.

PyTorch Operator: list of contributors and maintainers.
MPI Operator: list of contributors and maintainers.
XGBoost Operator: list of contributors and maintainers.
MXNet Operator: list of contributors and maintainers.
Common library: list of contributors and maintainers.
