JoelFernandes.org

Hello! I’m Joel and this is my personal website, built with Jekyll! I currently work at Nvidia. My interests include the scheduler, RCU, tracing, synchronization, memory models and other kernel internals. I also love contributing to the upstream Linux kernel and other open source projects. In prior jobs, I worked at Google, Amazon and TI.

Connect with me on Twitter and LinkedIn. Or drop me an email at: joel at joelfernandes dot org

Look for my name in the kernel git log to find my upstream kernel patches. Check out my resume for full details of my work experience. I also actively present at conferences; see the list of my past talks, presentations and publications.

Full list of all blog posts on this site:

25 Jun 2023 SVM and vectors for the curious

10 Jun 2023 SELinux Debugging on ChromeOS

28 Apr 2023 Understanding Hazard Pointers

25 Apr 2023 PowerPC stack guard false positives in Linux kernel

24 Feb 2023 Getting YouCompleteMe working for kernel development

29 Jan 2023 Figuring out herd7 memory models

13 Nov 2022 On workings of hrtimer's slack time functionality

25 Oct 2020 C++ rvalue references

06 Mar 2020 SRCU state double scan

18 Oct 2019 Modeling (lack of) store ordering using PlusCal - and a wishlist

02 Sep 2019 Making sense of scheduler deadlocks in RCU

22 Dec 2018 Dumping User and Kernel stacks on Kernel events

15 Jun 2018 RCU and dynticks-idle mode

10 Jun 2018 Single-stepping the kernel's C code

10 May 2018 RCU-preempt: What happens on a context switch

11 Feb 2018 USDT for reliable Userspace event tracing

08 Jan 2018 BPFd- Running BCC tools remotely across systems

01 Jan 2017 ARMv8: flamegraph and NMI support

19 Jun 2016 Ftrace events mechanism

20 Mar 2016 TIF_NEED_RESCHED: why is it needed

25 Dec 2015 Tying 2 voltage sources/signals together

04 Jun 2014 MicroSD card remote switch

07 May 2014 Linux Spinlock Internals

24 Apr 2014 Studying cache-line sharing effects on SMP systems

23 Apr 2014 Design of fork followed by exec in Linux

Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power, in the commit below. The idea is that:

Normal hrtimers will have both a soft and hard expiry which are equal to each other.
But hrtimers with timer slack will have a soft expiry and a hard expiry which is the soft expiry + delta.

The slack/delay effect is achieved by splitting the execution of the timer function and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (hrtimer_run_queues()). However, for programming the next timer interrupt, we only look at the hard expiry (hrtimer_update_next_event() -> __hrtimer_get_next_event() -> __hrtimer_next_event_base() -> hrtimer_get_expires()). As a result, the only way a slack-based timer will execute before its slack time elapses is if another timer without any slack time gets queued such that it hard-expires before the slack time of the slack-based timer passes.
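
To make the two steps concrete, here is a minimal toy model in C. It is not kernel code: struct toy_timer and the toy_* helpers are names made up for illustration, time is a plain integer, and toy_next_event() scans a flat array, whereas the kernel looks at the first node of each clock base's timerqueue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a range timer; NOT the kernel's struct hrtimer. */
struct toy_timer {
    int64_t soft_expires; /* earliest time the callback may run */
    int64_t hard_expires; /* soft_expires + slack */
};

/* Build a timer; slack == 0 gives a normal timer (soft == hard). */
static struct toy_timer toy_timer_init(int64_t expires, int64_t slack)
{
    struct toy_timer t = {
        .soft_expires = expires,
        .hard_expires = expires + slack,
    };
    return t;
}

/* Step 1: a timer may run as soon as its soft expiry has passed. */
static bool toy_timer_should_run(const struct toy_timer *t, int64_t now)
{
    return now >= t->soft_expires;
}

/* Step 2: the next interrupt is chosen from hard expiries only. */
static int64_t toy_next_event(const struct toy_timer *timers, size_t n)
{
    int64_t next = INT64_MAX; /* "no timer pending" */

    for (size_t i = 0; i < n; i++) {
        if (timers[i].hard_expires < next)
            next = timers[i].hard_expires;
    }
    return next;
}
```

In this model, a timer created with toy_timer_init(10, 50) is allowed to run from t = 10 onwards, but on its own it only forces an interrupt at t = 60. Any earlier interrupt caused by another timer gives it a chance to run sooner.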

The commit containing the original code added for range timers is:

commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Mon Sep 1 15:47:08 2008 -0700

    hrtimer: turn hrtimers into range timers
   
    this patch turns hrtimers into range timers;
    they have 2 expire points
    1) the soft expire point
    2) the hard expire point
   
    the kernel will do it's regular best effort attempt to get the timer run at
    the hard expire point. However, if some other time fires after the soft expire
    point, the kernel now has the freedom to fire this timer at this point, and
    thus grouping the events and preventing a power-expensive wakeup in the future.

The original code seems a bit buggy. I could not work out how or where hrtimer_interrupt() handles normal timers that expire before a slack timer's slack time elapses, that is, how the next timer interrupt gets programmed so that it goes off before the slack time passes.

To see the issue, consider the case where we have 2 timers queued:

The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.
The second one is a normal timer, so its soft and hard expiry are both at t = 30.

Now say an hrtimer interrupt happens at t = 5, courtesy of an unrelated expiring timer. In the code below, we notice that the next expiring timer is the one with slack, which has not soft-expired yet, so we have no reason to run it. However, we reprogram the next timer interrupt to t = 60, which is its hard expiry time (this is stored in expires_next, the value used to program the next timer interrupt). Now we have a big problem: the timer expiring at t = 30 will not run on time and will instead run much later.

As shown below, the loop in hrtimer_interrupt() goes through the active timers in the timerqueue. With range timers, _softexpires is made to be the real expiry, and the old _expires now becomes _softexpires + slack.

      while ((node = timerqueue_getnext(&base->active))) {
              struct hrtimer *timer;

              timer = container_of(node, struct hrtimer, node);

              /*
               * The immediate goal for using the softexpires is
               * minimizing wakeups, not running timers at the
               * earliest interrupt after their soft expiration.
               * This allows us to avoid using a Priority Search
               * Tree, which can answer a stabbing querry for
               * overlapping intervals and instead use the simple
               * BST we already have.
               * We don't add extra wakeups by delaying timers that
               * are right-of a not yet expired timer, because that
               * timer will have to trigger a wakeup anyway.
               */

              if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
                      ktime_t expires;

                      expires = ktime_sub(hrtimer_get_expires(timer),
                                          base->offset);
                      if (expires.tv64 < expires_next.tv64)
                              expires_next = expires;
                      break;
              }

              __run_hrtimer(timer, &basenow);
      }
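
The scenario can be replayed against a toy version of this loop. The sketch below is only an approximation written for this post: struct toy_timer and toy_old_interrupt() are hypothetical names, time is a plain integer, and the array order stands in for the order in which timerqueue_getnext() would hand out timers. That order is an input to the sketch rather than something it models, so the result depends on which timer is assumed to come first.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a queued hrtimer; field names are illustrative. */
struct toy_timer {
    int64_t soft_expires;
    int64_t hard_expires;
    int ran; /* set to 1 when the "callback" runs */
};

/*
 * Mimics the shape of the loop above: visit timers in queue order, run
 * the ones whose soft expiry has passed, and stop at the first one that
 * has not soft-expired, taking its hard expiry as the next event.
 * Returns the time the next interrupt would be programmed for.
 */
static int64_t toy_old_interrupt(struct toy_timer *queue, size_t n,
                                 int64_t now)
{
    int64_t expires_next = INT64_MAX; /* "nothing to program" */

    for (size_t i = 0; i < n; i++) {
        if (now < queue[i].soft_expires) {
            if (queue[i].hard_expires < expires_next)
                expires_next = queue[i].hard_expires;
            break; /* later timers in the queue are not examined */
        }
        queue[i].ran = 1;
    }
    return expires_next;
}
```

With the queue given as { {10, 60, 0}, {30, 30, 0} } and now = 5, the function returns 60, which is the late wakeup described above. With the two entries swapped, so that the normal timer is visited first, it returns 30. The outcome therefore hinges on the assumption that the slack timer is the first one the loop sees.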

However, this seems to be an issue only in old kernels. In upstream v6.0, I believe the next hrtimer interrupt is programmed correctly, because __hrtimer_next_event_base() calls hrtimer_get_expires(), which correctly uses the “hard expiry” times to do the programming.