Labels
kind/bug (Categorizes issue or PR as related to a bug.)
Description
What happened:
While a RayJob with autoscaling enabled is running, if a worker pod is killed, a replacement pod is created but remains stuck in the scheduling-gated state. The new pod never transitions to Running.
What you expected to happen:
After the new pod is created, Kueue should ungate it so that it can be scheduled and transition to the Running state.
How to reproduce it (as minimally and precisely as possible):
- Start a RayJob with autoscaling
- Trigger autoscaling (e.g. scaling-up)
- Delete a worker pod using kubectl delete
- Observe that the replacement worker pod is created in the scheduling-gated state
- Keep watching the new worker pod: it remains scheduling-gated and never transitions to Running
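The steps above can be sketched from the command line. A rough reproduction, assuming the default namespace, a KubeRay-style worker pod name (hypothetical here), and Kueue's standard pod scheduling gate:

```shell
# Delete one worker pod of the running RayJob (pod name is hypothetical;
# substitute the actual worker pod from `kubectl get pods`):
kubectl delete pod rayjob-sample-raycluster-worker-group-xxxxx

# Watch the replacement pod: it stays in SchedulingGated instead of
# progressing to Running.
kubectl get pods -w

# Inspect the scheduling gate that Kueue should have removed; if the pod
# is stuck, the kueue.x-k8s.io/admission gate is still present:
kubectl get pod <new-worker-pod> -o jsonpath='{.spec.schedulingGates}'
```

These commands require a live cluster with Kueue and KubeRay installed, so the exact pod names and output will differ per environment.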
Anything else we need to know?:
RayService seems to have a similar issue.
Environment:
- Kubernetes version (use `kubectl version`): any
- Kueue version (use `git describe --tags --dirty --always`): 0.15, probably also in 0.16 and latest
- Cloud provider or hardware configuration: any
- OS (e.g. `cat /etc/os-release`): any
- Kernel (e.g. `uname -a`): any
- Install tools: KubeRay
- Others: