Frequently Asked Questions

Why does my job have an unexpected retry count?

OpenPAI distinguishes three types of errors: transient, permanent, and unknown. Transient errors are always retried, and permanent errors are never retried. If an unknown error occurs, PAI retries the job according to the job's retry policy, which is set by the user.
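The decision rule above can be sketched in a few lines of Python. This is only an illustration, not OpenPAI's actual implementation: the names ErrorType and should_retry are hypothetical, and the user's retry policy is assumed here to be a simple maximum number of retries.

```python
from enum import Enum


class ErrorType(Enum):
    # Hypothetical labels for the three error types described above.
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNKNOWN = "unknown"


def should_retry(error_type, retries_so_far, max_retries):
    """Return True if the job should be retried after a failure.

    Assumption: the user's retry policy is modeled as `max_retries`,
    an upper bound that applies only to unknown errors.
    """
    if error_type is ErrorType.TRANSIENT:
        # Transient errors are always retried, regardless of the policy.
        return True
    if error_type is ErrorType.PERMANENT:
        # Permanent errors are never retried.
        return False
    # Unknown errors follow the user's retry policy.
    return retries_so_far < max_retries
```

Under this model, a job whose policy allows no retries can still be retried when it hits a transient error, which is why the observed retry count may be higher than the configured one.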

If you have not set a retry policy but the job still shows an unexpected retry count, the likely cause is a transient error, e.g. memory issues, disk pressure, or a power failure on the node. Preemption is another kind of transient error: jobs with higher priority can preempt jobs with lower priority. The priority of a job is defined by the jobPriorityClass field in OpenPAI's job protocol.