I've been running into an issue where, every once in a while, Knative becomes unable to create new Deployments, then spontaneously recovers within a few hours and creates them. Until then, the following errors keep recurring in the serving components. It looks to me like requests to the Kubernetes service are timing out, but I cannot tell why.
Expected Behavior
After updating a service, I expect the new revision to deploy.
Actual Behavior
Occasionally, after a valid change (for example, changing the value of an annotation), Knative becomes unable to deploy a new revision and gets stuck constantly trying to reconcile it for hours before spontaneously recovering.
$ kn revision list -A
NAMESPACE   NAME            SERVICE   TRAFFIC   TAGS      GENERATION   AGE         CONDITIONS   READY     REASON
knative     service-00033   service                       33           <invalid>   0 OK / 3     Unknown   Deploying
knative     service-00032   service   100%      primary   32           <invalid>   4 OK / 4     True
In the controller logs I see the following context deadline exceeded error while trying to post to the Kubernetes service IP:
{
"insertId": "plhs429mzmf9nh5f",
"jsonPayload": {
"logger": "controller.event-broadcaster",
"caller": "record/event.go:285",
"knative.dev/pod": "controller-8c6b99cb7-7zg6n",
"commit": "484e848",
"message": "Event(v1.ObjectReference{Kind:\"Revision\", Namespace:\"knative\", Name:\"service-00033\", UID:\"8a09a3ff-655e-4e5f-b8d4-1a4886ab0678\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"1844291799\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to create deployment \"service-api-00033-deployment\": Post \"https://10.123.20.1:443/apis/apps/v1/namespaces/knative/deployments\": context deadline exceeded",
"timestamp": "2023-06-30T09:57:08.7332053Z"
}
And right before it, the following appears in the webhook logs:
{
"insertId": "k078pd2dmx16qrr7",
"jsonPayload": {
"knative.dev/pod": "webhook-d44b476b8-89gbx",
"message": "Failed the resource specific validation",
"knative.dev/operation": "UPDATE",
"logger": "webhook",
"knative.dev/name": "service",
"knative.dev/subresource": "",
"knative.dev/namespace": "knative",
"knative.dev/kind": "serving.knative.dev/v1, Kind=Service",
"knative.dev/resource": "serving.knative.dev/v1, Resource=services",
"commit": "484e848",
"knative.dev/userinfo": "system:serviceaccount:service:default",
"timestamp": "2023-06-30T09:56:38.327880939Z",
"caller": "validation/validation_admit.go:183",
"stacktrace": "knative.dev/pkg/webhook/resourcesemantics/validation.validate\n\tknative.dev/[email protected]/webhook/resourcesemantics/validation/validation_admit.go:183\nknative.dev/pkg/webhook/resourcesemantics/validation.(*reconciler).Admit\n\tknative.dev/[email protected]/webhook/resourcesemantics/validation/validation_admit.go:79\nknative.dev/pkg/webhook.admissionHandler.func1\n\tknative.dev/[email protected]/webhook/admission.go:123\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2487\nknative.dev/pkg/webhook.(*Webhook).ServeHTTP\n\tknative.dev/[email protected]/webhook/webhook.go:263\nknative.dev/pkg/network/handlers.(*Drainer).ServeHTTP\n\tknative.dev/[email protected]/network/handlers/drain.go:113\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"
}
I'm at a complete loss at this point.
Steps to Reproduce the Problem
Unknown
I haven't looked at your service YAML, but my hypothesis is that this might be related to slow tag-to-digest resolution. You can try the following:
Monitor latency for registry operations, particularly GET operations.
Use image digests when referencing images. These look like @sha256:... rather than :latest, and ensure that the image does not change after deployment.
Disable tag-to-digest resolution. Note that this can lead to unpredictable behavior if a referenced tag is moved. Some instances may pick up the new image, while other instances may use an earlier image.
If this is tag-to-digest resolution and you're using public Docker Hub images, adding pull credentials to the service account that runs the Knative Service might give you higher rate limits.
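If it helps to find which images would still go through tag resolution, here is a minimal Python sketch that flags references not pinned by a digest. It assumes a digest-pinned reference ends in `@sha256:` followed by 64 hex characters. The function names and the sample references are hypothetical, not taken from your service.

```python
import re

# Matches a trailing "@sha256:<64 lowercase hex chars>" digest on an image reference.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image: str) -> bool:
    """Return True if the image reference ends in a sha256 digest."""
    return bool(_DIGEST_RE.search(image))


def unpinned_images(images):
    """Return the references that are tag-based (or untagged) rather than digest-pinned."""
    return [img for img in images if not is_pinned_by_digest(img)]


if __name__ == "__main__":
    # Hypothetical image references, for illustration only.
    refs = [
        "docker.io/library/nginx:latest",
        "gcr.io/example/app@sha256:" + "0" * 64,
    ]
    for ref in unpinned_images(refs):
        print("not pinned:", ref)
```

You could feed it the image fields from your service spec. Anything it reports is a candidate for switching to a digest.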