
Improving kernel uptime via hot upgrade of modules (part II)

June 4, 2019

In the previous post, I discussed the benefits of hot upgrades and elaborated on two methods for performing them. In this article I will discuss how we apply all this wisdom to our own software and conclude with some of the key challenges of hot upgrading.

Hot Upgrading the NVMesh client module

Excelero’s NVMesh components include a few kernel modules (read our white paper to learn more about the NVMesh architecture). For this post, I will focus on the client module.

Our client module is in charge of the distributed block device algorithms. It presents a few disks under /dev/ that are used by databases, user-space apps and file systems. Naturally, the entire module cannot be removed while those disks are in use, but most of it can.

By keeping only the basic functionality in a separate tiny module, the entire algorithmic code (with networking and infrastructure) can be removed via rmmod and later reincarnated in the updated version. This basic functionality (open, close, I/O elevators, etc.) is implemented in a separate module named Atom (due to its indivisible nature). The Atom module, however, supports two additional methods: being abandoned by the old client code and being adopted by the newer version. Every line of code that resides in the client module is hot upgradeable in a straightforward fashion, but the code of Atom itself is much more difficult to upgrade.
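The abandon/adopt idea can be illustrated with a small user-space sketch. This is not the NVMesh code: every name here (struct atom, struct client_ops, atom_adopt, atom_abandon, atom_submit) is hypothetical, and returning -EAGAIN while no client is attached is an assumption made for brevity; a real block driver would more likely hold or queue the I/O until the new version arrives.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table supplied by the upgradeable client code. */
struct client_ops {
    int version;
    int (*submit_io)(int request);   /* returns a result for the request */
};

/* Hypothetical stand-in for the tiny, long-lived Atom module. */
struct atom {
    const struct client_ops *ops;    /* NULL while no client is attached */
    int open_count;                  /* state that must survive upgrades */
};

/* A new client version attaches itself to the surviving Atom state. */
static int atom_adopt(struct atom *a, const struct client_ops *ops)
{
    assert(ops != NULL);
    if (a->ops != NULL)
        return -EBUSY;               /* already owned by a client */
    a->ops = ops;
    return 0;
}

/* The old client version detaches before it is unloaded. */
static void atom_abandon(struct atom *a)
{
    a->ops = NULL;
}

/* Basic functionality that lives in Atom itself. */
static int atom_open(struct atom *a)
{
    return ++a->open_count;
}

/* I/O entry point: Atom forwards to whichever client currently owns it. */
static int atom_submit(struct atom *a, int request)
{
    if (a->ops == NULL)
        return -EAGAIN;              /* abandoned: assumed retry-later policy */
    return a->ops->submit_io(request);
}

/* Two hypothetical client versions with visibly different behaviour. */
static int submit_v1(int request) { return request + 1; }
static int submit_v2(int request) { return request + 2; }

static const struct client_ops client_v1 = { 1, submit_v1 };
static const struct client_ops client_v2 = { 2, submit_v2 };
```

In this sketch an upgrade is the sequence atom_abandon() followed by atom_adopt() with the new table. The open count stored in Atom is untouched in between, which is the point of keeping that state out of the replaceable code.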

Hot-upgrade challenges

So what are the challenges in hot upgrading kernel modules?

As a developer, I’d say that the trickiest part is getting the system architecture right. Naturally, for the greater good, we all keep encapsulating pieces of code (classes, design patterns, source files). Encapsulation has huge benefits, but it is not always possible: in inheritance, for example, the encapsulation is usually unidirectional. The parent class is not aware of the child’s implementation, but the child is typically aware of the parent. A good example of encapsulation is that a class’s constructor and its to_string() function are typically kept together: a class may know how to construct itself and how to print its status. But when it does not know how to build itself (virtual constructor), it usually does not know how to print or describe itself either.

In our case, the main consideration is making the Atom module as small as possible, because any dependency it has will make future upgrades more difficult. So we stripped the Atom module to the bare minimum. As a result, the system architecture becomes tricky. For example, we need logs to debug the module, but the logging framework itself can be hot upgraded, so Atom cannot rely on it. And what about memory allocation or optimized algorithms? When the client module is removed, the machine code of those frameworks disappears, and Atom cannot rely on any of it. So we either start breaking encapsulation or replicating huge chunks of code. Neither solution is ideal.
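To make the logging dilemma concrete, here is a minimal user-space sketch of the "replicate a little code" option. It assumes a hypothetical design in which Atom keeps its own tiny fixed-size ring buffer and forwards to the full logging framework only while that framework is loaded. None of these names (atom_log, atom_set_sink, framework_sink) come from NVMesh; they are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ATOM_LOG_SLOTS 4
#define ATOM_LOG_LEN   32

/* Hypothetical hook implemented by the upgradeable logging framework. */
typedef void (*log_sink_fn)(const char *msg);

static log_sink_fn atom_sink;                 /* NULL while the framework is unloaded */
static char atom_ring[ATOM_LOG_SLOTS][ATOM_LOG_LEN];
static unsigned atom_ring_next;               /* number of messages stored locally */

/* Called when the framework is loaded (sink) or about to be removed (NULL). */
static void atom_set_sink(log_sink_fn sink)
{
    atom_sink = sink;
}

/* Log through the framework if present, else into Atom's own ring buffer. */
static void atom_log(const char *msg)
{
    char *slot;

    assert(msg != NULL);
    if (atom_sink != NULL) {
        atom_sink(msg);
        return;
    }
    /* Dependency-free fallback: overwrite the oldest slot, truncating long messages. */
    slot = atom_ring[atom_ring_next % ATOM_LOG_SLOTS];
    strncpy(slot, msg, ATOM_LOG_LEN - 1);
    slot[ATOM_LOG_LEN - 1] = '\0';
    atom_ring_next++;
}

/* Read back a locally stored message by its sequence number. */
static const char *atom_ring_at(unsigned i)
{
    return atom_ring[i % ATOM_LOG_SLOTS];
}

/* Hypothetical stand-in for the full framework: it only counts lines. */
static unsigned framework_lines;
static void framework_sink(const char *msg)
{
    (void)msg;
    framework_lines++;
}
```

The cost of this choice is visible even at this scale: the fallback duplicates a sliver of what the real framework does, and the sink pointer must be cleared before the framework's code is unloaded, otherwise Atom would jump into memory that no longer exists.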

In conclusion, I think that when implementing “hot upgrade” functionality, there is no 100% clean solution, no correct design pattern to choose. It is more of a tradeoff between various considerations. And in that sense kernel modules are probably not special in any way. Developing hot upgrade for any piece of software really sucks, but oh boy, how the customers appreciate it.

If you liked this post, I recommend a good read about probably the most widespread operating system (Minix). It includes advanced strategies for hot upgrading the micro-kernel without stopping the file system. If you are not into reading, watch this remarkable video about Minix (hot upgrades are discussed at 43:30-53:00). Cheers!

Yaniv Romem
