kured | Kubernetes Reboot Daemon
kandi X-RAY | kured Summary
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
Community Discussions
Trending Discussions on kured
QUESTION
I am using Kured to perform safe reboots of our nodes to upgrade the OS and kernel versions. In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version. After the reboot, the nodes are uncordoned and back to the ready state and the temporary worker nodes get deleted.
It was perfectly fine until yesterday when one of the nodes failed to upgrade to the latest kernel version. It was on 5.4.0-1058-azure last week after a successful upgrade and it should be on 5.4.0-1059-azure yesterday after the latest patch, but it is using the old version 5.4.0-1047-azure (which I think is the version of the temporary node that got created).
Upon checking Log Analytics on Azure, it says that it failed to scale down.
Reason: ScaleDownFailed
Message: failed to drain the node, aborting ScaleDown
Any idea on why this is happening?
ANSWER
Answered 2021-Sep-29 at 22:20
Firstly, there is a little misunderstanding of the OS and kernel patching process.
In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version.
Any new node that is added should come with the latest node image version, including the latest security patches available for the node pool (which usually does not fall back to an older kernel version). You can check out the AKS node image releases for reference.
However, it is not necessary that the pod(s) evicted by the drain operation from the node being rebooted land on the surge node. Evicted pod(s) might very well be scheduled on an existing node, should that node fit the bill for scheduling these pods.
For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described in the Kubernetes documentation.
The documentation, at the time of writing, might be a little misleading on this.
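As a quick way to verify this behaviour (a minimal sketch, assuming kubectl access and a hypothetical workload label app=myapp), you can check which nodes the evicted Pods were actually rescheduled onto:

# The NODE column shows where each replacement Pod landed; it is not necessarily the surge node.
kubectl get pods -l app=myapp -o wide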
About the error:
Reason: ScaleDownFailed
Message: failed to drain the node, aborting ScaleDown
This might happen due to a number of reasons. Common ones might be:
The scheduler could not find a suitable node to place evicted pods and the node pool could not scale up due to insufficient compute quota available. [Reference]
The scheduler could not find a suitable node to place evicted pods and the cluster could not scale up due to insufficient IP addresses in the node pool's subnet. [Reference]
PodDisruptionBudgets (PDBs) did not allow for at least 1 pod replica to be moved at a time, causing the drain/evict operation to fail (see the sketch after this list). [Reference]
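As an illustration of the PDB case, here is a minimal sketch (hypothetical names nginx-pdb and label app=nginx, not taken from the cluster in question): a budget whose minAvailable equals the number of running replicas leaves zero allowed disruptions, so every eviction attempted by the drain is refused.

# Hypothetical PDB: with exactly 2 running replicas of app=nginx and minAvailable: 2,
# ALLOWED DISRUPTIONS is 0 and the drain/evict operation cannot move these pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx

You can confirm the effect with kubectl get pdb, whose ALLOWED DISRUPTIONS column shows 0 in this situation.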
In general,
The Eviction API can respond in one of three ways:
- If the eviction is granted, then the Pod is deleted as if you sent a DELETE request to the Pod's URL and received back 200 OK.
- If the current state of affairs wouldn't allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests, but here we mean that this request isn't allowed right now but it may be allowed later.
- If there is some kind of misconfiguration, for example multiple PodDisruptionBudgets that refer to the same Pod, you get a 500 Internal Server Error response.
For a given eviction request, there are two cases:
- There is no budget that matches this pod. In this case, the server always returns 200 OK.
- There is at least one budget. In this case, any of the three above responses may apply.
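For concreteness, a minimal sketch of such an eviction request (hypothetical pod name my-pod in namespace default; assumes kubectl proxy is used to reach the API server): POST an Eviction object to the pod's eviction subresource and inspect the HTTP status code that comes back.

# Hypothetical pod name/namespace; kubectl proxy exposes the API server on localhost:8001.
kubectl proxy &
curl -v -H 'Content-Type: application/json' \
  http://localhost:8001/api/v1/namespaces/default/pods/my-pod/eviction \
  -d '{"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "my-pod", "namespace": "default"}}'
# 200 OK means the eviction was granted, 429 means a PodDisruptionBudget is blocking it,
# and 500 points to a misconfiguration such as overlapping budgets.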
Stuck evictions
In some cases, an application may reach a broken state, one where unless you intervene the eviction API will never return anything other than 429 or 500.
For example: this can happen if a ReplicaSet is creating Pods for your application but the replacement Pods do not become Ready. You can also see similar symptoms if the last Pod evicted has a very long termination grace period.
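To spot this kind of stuck state (a minimal sketch, assuming kubectl access), it helps to compare each budget's allowed disruptions with the readiness of the replacement Pods:

# ALLOWED DISRUPTIONS stuck at 0 means evictions will keep returning 429 Too Many Requests.
kubectl get pdb --all-namespaces
# READY counts that never reach the desired replica count indicate replacement Pods that are not becoming Ready.
kubectl get deployments --all-namespaces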
How to investigate further?
On the Azure Portal, navigate to your AKS cluster.
Go to Resource Health in the left-hand menu and click on Diagnose and solve problems.
You should see a list of diagnostic options.
If you click on each of the options, you should see a number of checks loading. You can set the time frame of impact in the top right-hand corner of the screen (press the Enter key after you have set the correct timeframe). You can click on the More Info link on the right-hand side of each entry for detailed information and the recommended action.
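As a complementary cluster-side check (a minimal sketch, assuming kubectl access), you can also confirm which kernel each node is actually running; the KERNEL-VERSION column shows whether the reboot picked up the expected patch:

# Lists nodes with their OS image and kernel version.
kubectl get nodes -o wide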
How to mitigate the issue?
Once you have identified the issue and followed the recommendations to fix it, please perform an az aks upgrade on the AKS cluster to the same Kubernetes version it is currently running. This should initiate a reconcile operation wherever required under the hood.
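A sketch of that reconcile-style upgrade, using placeholder resource group and cluster names (myResourceGroup, myAKSCluster):

# Look up the Kubernetes version the cluster is currently running.
az aks show --resource-group myResourceGroup --name myAKSCluster --query kubernetesVersion -o tsv
# Re-apply the same version; upgrading to the version already in use triggers a reconcile of the cluster.
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version <current-version>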
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported