kured | Kubernetes Reboot Daemon

by weaveworks | Language: Go | Version: 1.9.2 | License: Apache-2.0

kandi X-RAY | kured Summary


kured is a Go project typically used in Internet of Things (IoT) and Raspberry Pi applications. It has no reported bugs or vulnerabilities, a permissive license, and medium support. You can download it from GitHub.

Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
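As a rough illustration of what "indicated by the package management system" means in practice: on Debian- and Ubuntu-family systems the package tooling creates a sentinel file when an installed update needs a reboot, and kured's default configuration watches `/var/run/reboot-required`. The sketch below only shows the idea of the check; the function name is hypothetical and this is not kured's own code.

```python
from pathlib import Path

# Assumption: the sentinel path kured watches by default. Other distributions
# may need a different path or a custom check command.
DEFAULT_SENTINEL = Path("/var/run/reboot-required")


def reboot_required(sentinel: Path = DEFAULT_SENTINEL) -> bool:
    """Return True when the sentinel file exists, i.e. the OS asks for a reboot."""
    return sentinel.exists()


if __name__ == "__main__":
    print("reboot required:", reboot_required())
```

A daemon built on this idea would poll the check periodically and only then start the cordon, drain, and reboot sequence.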

            Support

              kured has a medium active ecosystem.
              It has 1415 star(s) with 167 fork(s). There are 68 watchers for this library.
              It had no major release in the last 12 months.
              There are 27 open issues and 167 have been closed. On average issues are closed in 65 days. There are 5 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of kured is 1.9.2

            Quality

              kured has 0 bugs and 0 code smells.

            Security

              kured has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              kured code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              kured is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              kured releases are available to install and integrate.
              Installation instructions, examples and code snippets are available.
              It has 1557 lines of code, 66 functions and 13 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.


            Community Discussions

            QUESTION

            Aks Error Failed to drain the node, aborting scale down
            Asked 2021-Sep-30 at 13:46

            I am using Kured to perform safe reboots of our nodes to upgrade the OS and kernel versions. In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version. After the reboot, the nodes are uncordoned and back to the ready state and the temporary worker nodes get deleted.

            It was perfectly fine until yesterday when one of the nodes failed to upgrade to the latest kernel version. It was on 5.4.0-1058-azure last week after a successful upgrade and it should be on 5.4.0-1059-azure yesterday after the latest patch, but it is using the old version 5.4.0-1047-azure (which I think is the version of the temporary node that got created).
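A quick way to spot a node that is lagging behind, as described above, is to compare the kernel release strings reported for each node. The sketch below assumes the `major.minor.patch-build-flavour` shape seen in this question; the node names are hypothetical and the helper is not part of kured or AKS.

```python
import re


def kernel_key(release: str) -> tuple[int, ...]:
    """Turn '5.4.0-1058-azure' into (5, 4, 0, 1058) so releases can be ordered."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def lagging_nodes(kernels: dict[str, str]) -> list[str]:
    """Return the names of nodes whose kernel is older than the newest one seen."""
    newest = max(kernel_key(release) for release in kernels.values())
    return sorted(
        name for name, release in kernels.items() if kernel_key(release) < newest
    )


if __name__ == "__main__":
    # Hypothetical node names; kernel strings taken from the question.
    print(lagging_nodes({
        "node-a": "5.4.0-1059-azure",
        "node-b": "5.4.0-1047-azure",
    }))
```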

            Upon checking the log analytics on azure, it says that it failed to scale down.

            Reason: ScaleDownFailed

            Message: failed to drain the node, aborting ScaleDown


            Any idea on why this is happening?

            ...

            ANSWER

            Answered 2021-Sep-29 at 22:20

            Firstly, there is a little misunderstanding of the OS and Kernel patching process.

            In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version.

            Any new node that is added should come with the latest node image version available for the node pool, including the latest security patches (a new node image usually does not fall back to an older kernel version). You can check the AKS node image releases. [Reference]

            However, pods evicted by the drain operation from the node being rebooted do not have to land on the surge node. Evicted pods may well be scheduled on an existing node if that node meets their scheduling requirements.

            For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described here.

            The documentation, at the time of writing, might be a little misleading on this.

            About the error:

            Reason: ScaleDownFailed
            Message: failed to drain the node, aborting ScaleDown

            This might happen due to a number of reasons. Common ones might be:

            • The scheduler could not find a suitable node to place evicted pods and the node pool could not scale up due to insufficient compute quota available. [Reference]

            • The scheduler could not find a suitable node to place evicted pods and the cluster could not scale up due to insufficient IP addresses in the node pool's subnet. [Reference]

            • PodDisruptionBudgets (PDBs) did not allow for at least 1 pod replica to be moved at a time causing the drain/evict operation to fail. [Reference]
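To make the PodDisruptionBudget case concrete, here is a minimal sketch of the arithmetic behind "allowed disruptions". It assumes integer `minAvailable` or `maxUnavailable` values (percentages are not handled), and the function name is hypothetical. When the result is 0, an eviction of a matching pod is refused and the drain cannot proceed.

```python
from typing import Optional


def disruptions_allowed(
    expected: int,
    healthy: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Estimate how many matching pods may be evicted right now.

    expected: number of pods the owning controller wants to run
    healthy:  number of those pods that are currently Ready
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    # Never report a negative budget.
    return max(0, healthy - desired_healthy)


if __name__ == "__main__":
    # Three replicas with minAvailable=3 leave no room for a drain.
    print(disruptions_allowed(expected=3, healthy=3, min_available=3))
```

A budget that requires every replica to stay available is a common cause of a drain that never completes.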

            In general,

            The Eviction API can respond in one of three ways:

            • If the eviction is granted, then the Pod is deleted as if you sent a DELETE request to the Pod's URL and received back 200 OK.
            • If the current state of affairs wouldn't allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests, but here we mean that this request isn't allowed right now but it may be allowed later.
            • If there is some kind of misconfiguration, for example multiple PodDisruptionBudgets that refer to the same Pod, you get a 500 Internal Server Error response.

            For a given eviction request, there are two cases:

            • There is no budget that matches this pod. In this case, the server always returns 200 OK.
            • There is at least one budget. In this case, any of the three above responses may apply.
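The three responses described above can be summarised in a small lookup. This is a sketch that follows only the status codes quoted in this answer; the enum and function names are hypothetical, and any other code is reported as unexpected rather than interpreted.

```python
from enum import Enum


class EvictionOutcome(Enum):
    GRANTED = "eviction granted, pod is being deleted"
    BLOCKED_BY_BUDGET = "budget does not allow it now, retry later"
    MISCONFIGURED = "misconfiguration, e.g. overlapping budgets"
    UNEXPECTED = "status not covered by the three documented cases"


def classify_eviction(status_code: int) -> EvictionOutcome:
    """Map an Eviction API HTTP status code to the outcome described above."""
    if status_code == 200:
        return EvictionOutcome.GRANTED
    if status_code == 429:
        return EvictionOutcome.BLOCKED_BY_BUDGET
    if status_code == 500:
        return EvictionOutcome.MISCONFIGURED
    return EvictionOutcome.UNEXPECTED


if __name__ == "__main__":
    for code in (200, 429, 500):
        print(code, "->", classify_eviction(code).value)
```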

            Stuck evictions
            In some cases, an application may reach a broken state, one where unless you intervene the eviction API will never return anything other than 429 or 500.

            For example, this can happen if a ReplicaSet is creating Pods for your application but the replacement Pods do not become Ready. You can also see similar symptoms if the last Pod evicted has a very long termination grace period.
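One way to notice a stuck eviction is to retry with a deadline and give up when only 429 or 500 responses come back. The sketch below assumes a caller-supplied `try_evict` callable (hypothetical) that performs one eviction attempt and returns the HTTP status code; it does not talk to a cluster itself.

```python
import time
from typing import Callable


def evict_with_deadline(
    try_evict: Callable[[], int],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry an eviction until it is granted or the deadline passes.

    Returns True when the eviction was granted (200) and False when only
    429/500 responses were seen before the deadline, which suggests a
    stuck eviction that needs manual intervention.
    """
    deadline = clock() + timeout_s
    while True:
        status = try_evict()
        if status == 200:
            return True
        if status not in (429, 500):
            raise RuntimeError(f"unexpected eviction status {status}")
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults apply.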

            How to investigate further?

            1. On the Azure Portal navigate to your AKS cluster

            2. Go to Resource Health in the left-hand menu and click Diagnose and solve problems

            3. You should see a set of diagnostic options

            4. If you click on each of the options, you should see a number of checks loading. You can set the time frame of impact in the top right-hand corner of the screen (press the Enter key after you have set the correct time frame). You can click the More Info link on the right-hand side of each entry for detailed information and the recommended action.

            How to mitigate the issue?

            Once you have identified the issue and followed the recommendations to fix it, run az aks upgrade on the AKS cluster, targeting the same Kubernetes version it is currently running. This should initiate a reconcile operation under the hood wherever one is required.
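The mitigation can be sketched as two Azure CLI invocations: one to read the cluster's current Kubernetes version and one to upgrade to that same version. The helpers below only build the argument lists and do not run anything; the function names, resource group, and cluster name are placeholders, and the version string is an example value.

```python
def show_version_cmd(resource_group: str, cluster: str) -> list[str]:
    """Build the az command that prints the cluster's Kubernetes version."""
    return [
        "az", "aks", "show",
        "--resource-group", resource_group,
        "--name", cluster,
        "--query", "kubernetesVersion",
        "--output", "tsv",
    ]


def reconcile_cmd(resource_group: str, cluster: str, version: str) -> list[str]:
    """Build the az command that upgrades the cluster to the given version.

    Passing the version the cluster already runs is what triggers the
    reconcile described above. The CLI asks for confirmation before it starts.
    """
    return [
        "az", "aks", "upgrade",
        "--resource-group", resource_group,
        "--name", cluster,
        "--kubernetes-version", version,
    ]


if __name__ == "__main__":
    # Placeholder names and version; substitute your own values.
    print(" ".join(show_version_cmd("my-rg", "my-aks")))
    print(" ".join(reconcile_cmd("my-rg", "my-aks", "1.21.2")))
```

Each list can be passed to `subprocess.run` once the values have been checked against the real cluster.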

            Source https://stackoverflow.com/questions/69382905

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install kured

            The default installation comes without the Prometheus alerting interlock or Slack notifications. If you want to customise the installation, download the manifest and edit it as described in the project's configuration documentation before applying it.

            Support

            If you have any questions about, feedback for, or problems with kured, get in touch with the project. We follow the CNCF Code of Conduct.
            CLONE
          • HTTPS

            https://github.com/weaveworks/kured.git

          • CLI

            gh repo clone weaveworks/kured

          • sshUrl

            git@github.com:weaveworks/kured.git
