Deleted/restarted pod for autoscaling RayJob stuck in scheduling gated status #9986

**What happened**:

While a RayJob with [autoscaling enabled](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html) is running, if a worker pod is killed, a replacement pod is created but stays stuck in the scheduling-gated status. The replacement pod never changes to the Running state.

**What you expected to happen**:

After the replacement pod is created, Kueue should ungate it so that it can be scheduled and switch to the Running state.

**How to reproduce it (as minimally and precisely as possible)**:

1. Start a RayJob with autoscaling enabled.
2. Trigger autoscaling (e.g. a scale-up).
3. Delete a worker pod using `kubectl delete`.
4. Observe that the replacement worker pod is created with a scheduling gate.
5. Keep watching the replacement worker pod: it stays scheduling-gated and never changes to Running.
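
To make steps 4 and 5 easier to check, here is a minimal sketch that lists worker pods still carrying a scheduling gate. It is not part of Kueue or KubeRay. It assumes `kubectl` is on the PATH, that worker pods carry the KubeRay label `ray.io/node-type=worker`, and that the RayJob runs in the `default` namespace. The helper names `gated_pods` and `fetch_worker_pods` are hypothetical.

```python
import json
import subprocess


def gated_pods(pod_list):
    """Return {pod name: [gate names]} for pods whose spec.schedulingGates is non-empty."""
    gated = {}
    for pod in pod_list.get("items", []):
        gates = pod.get("spec", {}).get("schedulingGates") or []
        if gates:
            gated[pod["metadata"]["name"]] = [gate["name"] for gate in gates]
    return gated


def fetch_worker_pods(namespace="default"):
    """Fetch Ray worker pods as parsed JSON via kubectl.

    Assumption: workers are labelled ray.io/node-type=worker and live in
    the given namespace; adjust both to match your cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", "ray.io/node-type=worker", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Example usage against a live cluster (commented out so the sketch
# can be imported without a cluster available):
# for name, gates in gated_pods(fetch_worker_pods()).items():
#     print(f"{name}: {', '.join(gates)}")
```

Running the commented-out loop repeatedly after step 3 should show the replacement pod with its gate still present if the bug is reproduced.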

**Anything else we need to know?**:

RayService seems to have a similar issue.

**Environment**:
- Kubernetes version (use `kubectl version`): any
- Kueue version (use `git describe --tags --dirty --always`): 0.15, probably also 0.16 and latest
- Cloud provider or hardware configuration: any
- OS (e.g: `cat /etc/os-release`): any
- Kernel (e.g. `uname -a`): any
- Install tools: KubeRay
- Others:
