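Introduction: IaC as the Foundation of MLOps
As AI models grow more complex, the infrastructure behind them has to be more elastic and more reproducible. Kubeflow, a leading MLOps platform on Kubernetes, manages the machine learning lifecycle end to end. Deploying Kubeflow and its underlying dependencies (the Kubernetes cluster, storage, networking) by hand, however, is slow and error-prone. This article shows how to use Terraform, an Infrastructure as Code (IaC) tool, to automate the deployment of a Kubeflow environment on AWS EKS (Elastic Kubernetes Service).
The work is split into two stages:
1. Provision the AWS EKS cluster with Terraform.
2. Use Terraform's Kubernetes provider together with a null_resource to deploy the core Kubeflow components.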
Prerequisites
Install the following tools and configure your AWS credentials:
* Terraform (v1.0+)
* AWS CLI
* kubectl
* kustomize (used to process the Kubeflow manifests)
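Before running anything, it can help to confirm that the tools listed above are actually on the PATH. The following is a minimal sketch of such a check; `missing_tools` is a hypothetical helper name introduced here for illustration, not part of any of these CLIs.

```shell
# Report any of the given commands that are not on PATH.
# Prints one missing command per line; returns non-zero if any are missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

It could be called as `missing_tools terraform aws kubectl kustomize`, with the argument list taken from the prerequisites above.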
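Stage One: Deploying the EKS Cluster with Terraform
To simplify the EKS deployment, we recommend an official or community-maintained EKS module. The example below defines the EKS cluster and one managed node group. The module creates the IAM roles the cluster needs, while the VPC and subnets are existing ones referenced by ID.
Create a file named main.tf: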
# 1. Configure the AWS provider
provider "aws" {
  region = "us-west-2"
}

# 2. Deploy the EKS cluster (the module simplifies the configuration)
# Note: replace the placeholder VPC and subnet IDs with your own
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.0.0"

  cluster_name    = "kubeflow-mlops-cluster"
  cluster_version = "1.23"

  vpc_id     = "vpc-xxxxxxxxxxxx"
  subnet_ids = ["subnet-xxxxxxx1", "subnet-xxxxxxx2"]

  eks_managed_node_groups = {
    general = {
      min_size       = 2
      max_size       = 5
      desired_size   = 3
      instance_types = ["m5.large"]

      # Give the worker nodes the extra permissions they need, e.g. S3 read access
      iam_role_additional_policies = [
        "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
      ]
    }
  }
}

# 3. Configure the Kubernetes provider for the later Kubeflow deployment
provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_id]
  }
}
Initialize Terraform and apply the configuration:
terraform init
terraform apply -auto-approve
Verifying the EKS connection
Once the EKS cluster has been created, make sure your local kubectl configuration points at the new cluster:
aws eks update-kubeconfig --name kubeflow-mlops-cluster --region us-west-2
kubectl get nodes
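Stage Two: Deploying Kubeflow with Terraform
Kubeflow deployments normally rely on Kustomize to manage a large and complex set of manifests. Terraform's kubernetes provider handles large numbers of interdependent resources poorly, so we take a pragmatic approach: a null_resource with a local-exec provisioner runs kubectl apply -k, which builds each directory with kubectl's built-in Kustomize and applies the result (the same idea as kustomize build | kubectl apply). Keep in mind that a provisioner runs only when its resource is created, so Terraform does not track the Kubeflow resources themselves.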
First, download the Kubeflow manifests. Using Kubeflow v1.6.1 as an example:
mkdir kubeflow-manifests && cd kubeflow-manifests
KUBEFLOW_VERSION="v1.6.1"
wget "https://github.com/kubeflow/manifests/archive/refs/tags/${KUBEFLOW_VERSION}.tar.gz"
tar -xzf "${KUBEFLOW_VERSION}.tar.gz"
# GitHub drops the leading "v" of the tag in the extracted directory name
mv "manifests-${KUBEFLOW_VERSION#v}" manifests
cd ..
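The rename step depends on how GitHub names the directory inside a tag archive: normally the repository name followed by the tag, with a leading "v" removed. As a small sketch of that rule, the hypothetical helper `archive_dir` below (a name introduced here for illustration) derives the expected directory name from a repository and tag, so the name can be checked before running mv.

```shell
# Derive the top-level directory name of a GitHub tag archive:
# "<repo>-<tag>", with a single leading "v" stripped from the tag.
archive_dir() {
  repo="$1"
  tag="$2"
  echo "${repo}-${tag#v}"
}
```

For example, `archive_dir manifests "$KUBEFLOW_VERSION"` gives the name to compare against the output of `tar -tzf` on the downloaded archive.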
Next, add the following configuration to your main.tf file:
# Use a null_resource to run the kubectl/Kustomize commands
resource "null_resource" "kubeflow_deployment" {
  # Run only after the EKS cluster exists
  depends_on = [module.eks]

  provisioner "local-exec" {
    command = <<-EOT
      set -e
      # Write a dedicated kubeconfig for the new cluster and point kubectl at it.
      # (Version 18 of the EKS module no longer exports a kubeconfig output,
      # so the file is generated with the AWS CLI instead.)
      aws eks update-kubeconfig --name ${module.eks.cluster_id} --region us-west-2 --kubeconfig ./kubeconfig-kf.yaml
      export KUBECONFIG=./kubeconfig-kf.yaml
      echo "Starting Kubeflow base deployment..."

      # NOTE: directory names differ between Kubeflow releases (the Istio
      # directory embeds the Istio version); check every path below against
      # the README in the extracted manifests before applying.

      # 1. Deploy Istio (required by the Kubeflow gateway)
      kubectl apply -k ./kubeflow-manifests/manifests/common/istio-1-16/istio-crds/
      kubectl apply -k ./kubeflow-manifests/manifests/common/istio-1-16/istio-install/

      # 2. Deploy cert-manager
      kubectl apply -k ./kubeflow-manifests/manifests/common/cert-manager/cert-manager/
      kubectl apply -k ./kubeflow-manifests/manifests/common/cert-manager/kubeflow-issuer/

      # 3. Deploy the namespace, roles, and cluster-local gateway
      kubectl apply -k ./kubeflow-manifests/manifests/common/kubeflow-namespace/
      kubectl apply -k ./kubeflow-manifests/manifests/common/kubeflow-roles/
      kubectl apply -k ./kubeflow-manifests/manifests/common/istio-1-16/cluster-local-gateway/

      # 4. Deploy the ML components (Pipelines, Notebooks, Training Operator, etc.)
      kubectl apply -k ./kubeflow-manifests/manifests/apps/pipeline/installs/kubernetes/env/platform-agnostic
      kubectl apply -k ./kubeflow-manifests/manifests/apps/notebooks/setup/

      echo "Kubeflow deployment triggered. Wait for initialization."
    EOT

    # Directory the commands run in
    working_dir = path.root
  }
}
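Applying many manifests in one pass can fail on the first attempt, for example when a custom resource is submitted before its CRD has been registered. One way to cope is to retry each apply a few times. The sketch below is a generic retry helper; `retry` and its arguments are hypothetical names introduced here, not features of kubectl or Terraform.

```shell
# Run a command until it succeeds, at most max_attempts times.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Inside the provisioner script, a call might look like `retry 5 10 kubectl apply -k <directory>`, where the attempt count and delay are illustrative values to tune for your cluster.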
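Deployment and Verification
Run terraform apply again to deploy the Kubeflow components:
terraform apply -auto-approve
When it finishes, check the status of the Pods in the kubeflow namespace:
kubectl -n kubeflow get pods
If every Pod is in the Running state, your Terraform-based cloud-native AI platform is up. The next step is to configure a domain name and Ingress so that the Kubeflow Dashboard can be reached through the Istio Gateway.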
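Checking a long Pod list by eye is error-prone, so a small filter can help. The sketch below counts Pods whose STATUS column is neither Running nor Completed; it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...), and `count_unsettled_pods` is a hypothetical helper name.

```shell
# Count pods that are neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin (STATUS is column 3).
count_unsettled_pods() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}
```

It would be used as `kubectl -n kubeflow get pods --no-headers | count_unsettled_pods`; a result of 0 means no Pod is still pending or failing. Note that Running does not guarantee that a Pod's containers are ready, so the READY column is still worth a look.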