Resource Request: Accelerator quotas for AI Conformance CI jobs #9145

@janetkuo

We are currently designing an automated testing framework for Kubernetes AI Conformance, so that we can move away from the current process of manual self-attestation (see our design doc).

As part of this effort, we plan to set up CI jobs (presubmits and periodics) that require access to specialized hardware. Specifically, we need small accelerator quotas (e.g., nvidia-tesla-t4 or L4), just enough to verify basic functionality (e.g., secure accelerator access) without incurring high costs.

I found existing GPU jobs in GCP and AWS. Should our new AI conformance jobs share the approach and resources those jobs already use, or should we provision new ones?

Cost estimate for accelerators:

| Variable | Example Value (T4 GPU) |
| --- | --- |
| Cost per Hour | ~$0.35 (on-demand) or ~$0.11 (preemptible/spot) per instance |
| Job Frequency | 1 periodic + ~0.1 presubmit per day |
| Duration per Job | ~30 min |
| Monthly Estimate | ($0.11/hr * 2 instances * 1.1 jobs/day * 0.5 hours) * 30 days = ~$3.63 / month |

Note: the monthly estimate assumes spot instances. The duration per job might increase as we add more AI conformance tests.
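For anyone who wants to vary the inputs, the monthly estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `monthly_cost` and its parameter names are made up for illustration, and the hourly rates are the approximate T4 figures quoted in this issue, not authoritative pricing.

```python
def monthly_cost(hourly_rate, instances, jobs_per_day, hours_per_job, days=30):
    """Estimated monthly spend in USD for a recurring accelerator CI job.

    All arguments are rough estimates; the helper itself is hypothetical
    and simply multiplies the factors from the cost table.
    """
    return hourly_rate * instances * jobs_per_day * hours_per_job * days


# Figures from the cost estimate: 2 instances, 1 periodic + ~0.1 presubmit
# per day, ~30 minutes per job, 30 days per month.
spot = monthly_cost(hourly_rate=0.11, instances=2, jobs_per_day=1.1, hours_per_job=0.5)
on_demand = monthly_cost(hourly_rate=0.35, instances=2, jobs_per_day=1.1, hours_per_job=0.5)

print(f"spot:      ~${spot:.2f} / month")
print(f"on-demand: ~${on_demand:.2f} / month")
```

With the same inputs, on-demand pricing comes out to roughly $11.55 per month, so the estimate stays small even if spot capacity is unavailable.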

@ritazh @mfahlandt @terrytangyuan @dims @BenTheElder @ameukam

    Labels

    sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.