JoelFernandes.org

Age is of no importance unless you're a cheese -- Billie Burke

Hello! I’m Joel and this my personal website built with Jekyll! I currently work at Google. My interests are scheduler, RCU, tracing, synchronization, memory models and other kernel internals. I also love contributing to the upstream Linux kernel and other open source projects.

Connect with me on Twitter, and LinkedIn. Or, drop me an email at: joel at joelfernandes dot org

Look for my name in the kernel git log to find my upstream kernel patches. Check out my resume for full details of my work experience. I also actively present at conferences, see a list of my past talks, presentations and publications.

Full list of all blog posts on this site:
  • 25 Jun 2023   SVM and vectors for the curious
  • 10 Jun 2023   SELinux Debugging on ChromeOS
  • 28 Apr 2023   Understanding Hazard Pointers
  • 25 Apr 2023   PowerPC stack guard false positives in Linux kernel
  • 24 Feb 2023   Getting YouCompleteMe working for kernel development
  • 29 Jan 2023   Figuring out herd7 memory models
  • 13 Nov 2022   On workings of hrtimer's slack time functionality
  • 25 Oct 2020   C++ rvalue references
  • 06 Mar 2020   SRCU state double scan
  • 18 Oct 2019   Modeling (lack of) store ordering using PlusCal - and a wishlist
  • 02 Sep 2019   Making sense of scheduler deadlocks in RCU
  • 23 Dec 2018   Dumping User and Kernel stacks on Kernel events
  • 15 Jun 2018   RCU and dynticks-idle mode
  • 10 Jun 2018   Single-stepping the kernel's C code
  • 10 May 2018   RCU-preempt: What happens on a context switch
  • 11 Feb 2018   USDT for reliable Userspace event tracing
  • 08 Jan 2018   BPFd- Running BCC tools remotely across systems
  • 01 Jan 2017   ARMv8: flamegraph and NMI support
  • 19 Jun 2016   Ftrace events mechanism
  • 20 Mar 2016   TIF_NEED_RESCHED: why is it needed
  • 25 Dec 2015   Tying 2 voltage sources/signals together
  • 04 Jun 2014   MicroSD card remote switch
  • 08 May 2014   Linux Spinlock Internals
  • 25 Apr 2014   Studying cache-line sharing effects on SMP systems
  • 23 Apr 2014   Design of fork followed by exec in Linux
  • Most Recept Post:

    Making sense of scheduler deadlocks in RCU

    | Comments

    Note: At the time of this writing, it is kernel v5.3 release. RCU moves fast and can change in the future, so some details in this article may be obsolete.

    The RCU subsystem and the task scheduler are inter-dependent. They both depend on each other to function correctly. The scheduler has many data structures that are protected by RCU. And, RCU may need to wake up threads to perform things like completing grace periods and callback execution. One such case where RCU does a wake up and enters the scheduler is rcu_read_unlock_special().

    Recently Paul McKenney consolidated RCU flavors. What does this mean?

    Consider the following code executing in CPU 0:

    preempt_disable();
    rcu_read_lock();
    rcu_read_unlock();
    preempt_enable();
    

    And, consider the following code executing in CPU 1:

    a = 1;
    synchronize_rcu();  // Assume synchronize_rcu
                        // executes after CPU0's rcu_read_lock
    b = 2;
    

    CPU 0’s execution path shows 2 flavors of RCU readers, one nested into another. The preempt_{disable,enable} pair is an RCU-sched flavor RCU reader section, while the rcu_read_{lock,unlock} pair is an RCU-preempt flavor RCU reader section.

    In older kernels (before v4.20), CPU 1’s synchronize_rcu() could return after CPU 0’s rcu_read_unlock() but before CPU 0’s preempt_enable(). This is because synchronize_rcu() only needs to wait for the “RCU-preempt” flavor of the RCU grace period to end.

    In newer kernels (v4.20 and above), the RCU-preempt and RCU-sched flavors have been consolidated. This means CPU 1’s synchronize_rcu() is guaranteed to wait for both of CPU 1’s rcu_read_unlock() and preempt_enable() to complete.

    Now, lets get a bit more detailed. That rcu_read_unlock() most likely does very little. However, there are cases where it needs to do more, by calling rcu_read_unlock_special(). One such case is if the reader section was preempted. A few more cases are:

    • The RCU reader is blocking an expedited grace period, so it needed to report a quiescent state quickly.
    • The RCU reader is blocking a grace period for too long (~100 jiffies on my system, that’s the default but can be set with rcutree.jiffies_till_sched_qs parameter).

    In all these cases, the rcu_read_unlock() needs to do more work. However, care must be taken when calling rcu_read_unlock() from the scheduler, that’s why this article on scheduler deadlocks.

    One of the reasons rcu_read_unlock_special() needs to call into the scheduler is priority de-boosting: A task getting preempted in the middle of an RCU read-side critical section results in blocking the completion of the critical section and hence could prevent current and future grace periods from ending. So the priority of the RCU reader may need to be boosted so that it gets enough CPU time to make progress, and have the grace period end soon. But it also needs to be de-boosted after the reader section completes. This de-boosting happens by calling of the rcu_read_unlock_special() function in the outer most rcu_read_unlock().

    What could go wrong with the scheduler using RCU? Let us see this in action. Consider the following piece of code executed in the scheduler:

      reader()
    	{
    		rcu_read_lock();
    		do_something();     // Preemption happened
                    /* Preempted task got boosted */
    		task_rq_lock();     // Disables interrupts
                    rcu_read_unlock();  // Need to de-boost
    		task_rq_unlock();   // Re-enables interrupts
    	}
    

    Assume that the rcu_read_unlock() needs to de-boost the task’s priority. This may cause it to enter the scheduler and cause a deadlock due to recursive locking of RQ/PI locks.

    Because of these kind of issues, there has traditionally been a rule that RCU usage in the scheduler must follow:

    “Thou shall not hold RQ/PI locks across an rcu_read_unlock() if thou not
    holding it or disabling IRQ across both both the rcu_read_lock() +
    rcu_read_unlock().”
    

    More on this rule can be read here as well.

    Obviously, acquiring RQ/PI locks across the whole rcu_read_lock() and rcu_read_unlock() pair would resolve the above situation. Since preemption and interrupts are disabled across the whole rcu_read_lock() and rcu_read_unlock() pair; there is no question of task preemption.

    Anyway, the point is rcu_read_unlock() needs to be careful about scheduler wake-ups; either by avoiding calls to rcu_read_unlock_special() altogether (as is the case if interrupts are disabled across the entire RCU reader), or by detecting situations where a wake up is unsafe. Peter Ziljstra says there’s no way to know when the scheduler uses RCU, so “generic” detection of the unsafe condition is a bit tricky.

    Now with RCU consolidation, the above situation actually improves. Even if the scheduler RQ/PI locks are not held across the whole read-side critical sectoin, but just across that of the rcu_read_unlock(), then that itself may be enough to prevent a scheduler deadlock. The reasoning is: during the rcu_read_unlock(), we cannot yet report a QS until the RQ/PI lock is itself released since the act of holding the lock itself means preemption is disabled and that would cause a QS deferral. As a result, the act of priority de-boosting would also be deferred and prevent a possible scheduler deadlock.

    However, RCU consolidation introduces even newer scenarios where the rcu_read_unlock() has to enter the scheduler, if the “scheduler rules” above is not honored, as explained below:

    Consider the previous code example. Now also assume that the RCU reader is blocking an expedited RCU grace period. That is just a fancy term for a grace period that needs to end fast. These grace periods have to complete much more quickly than normal grace period. An expedited grace period causes currently running RCU reader sections to receive IPIs that set a hint. Setting of this hint results in the outermost rcu_read_unlock() calling rcu_read_unlock_special(), which otherwise would not occur. When rcu_read_unlock_special() gets called in this scenario, it tries to get more aggressive once it notices that the reader has blocked an expedited RCU grace period. In particular, it notices that preemption is disabled and so the grace period cannot end due to RCU consolidation. Out of desperation, it raises a softirq (raise_softirq()) in the hope that the next time the softirq runs, the grace period could be ended quickly before the scheduler tick occurs. But that can cause a scheduler deadlock by way of entry into the scheduler due to a ksoftirqd-wakeup.

    The cure for this problem is the same, holding the RQ/PI locks across the entire reader section results in no question of a scheduler related deadlock due to recursively acquiring of these locks; because there would be no question of expedited-grace-period IPIs, hence no question of setting of any hints, and hence no question of calling rcu_read_unlock_special() from scheduler code. For a twist of the IPI problem, see special note.

    However, the RCU consolidation throws yet another curve ball. Paul McKenney explained on LKML that there is yet another situation now due to RCU consolidation that can cause scheduler deadlocks.

    Consider the following code, where previous_reader() and current_reader() execute in quick succession in the context of the same task:

           previous_reader()
    	{
    		rcu_read_lock();
    		do_something();      // Preemption or IPI happened
    		local_irq_disable(); // Cannot be the scheduler
    		do_something_else();
    		rcu_read_unlock();  // As IRQs are off, defer QS report
                                        //but set deferred_qs bit in 
                                        //rcu_read_unlock_special
    		do_some_other_thing();
    		local_irq_enable();
    	}
    
            // QS from previous_reader() is still deferred.
    	current_reader() 
    	{
    		local_irq_disable();  // Might be the scheduler.
    		do_whatever();
    		rcu_read_lock();
    		do_whatever_else();
    		rcu_read_unlock();    // Must still defer reporting QS
    		do_whatever_comes_to_mind();
    		local_irq_enable();
    	}
    

    Here previous_reader() had a preemption; even though the current_reader() did not - but the current_reader() still needs to call rcu_read_unlock_special() from the scheduler! This situation would not happen in the pre-consolidated-RCU world because previous_reader()’s rcu_read_unlock() would have taken care of it.

    As you can see, just following the scheduler rule of disabling interrupts across the entire reader section does not help. To detect the above scenario; a new bitfield deferred_qs has been added to the task_struct::rcu_read_unlock_special union. Now what happens is, at rcu_read_unlock()-time, the previous reader() sets this bit, and the current_reader() checks this bit. If set, the call to raise_softirq() is avoided thus eliminating the possibility of a scheduler deadlock.

    Hopefully no other scheduler deadlock issue is lurking!

    Coming back to the scheduler rule, I have been running overnight rcutorture tests to detect if this rule is ever violated. Here is the test patch checking for the unsafe condition. So far I have not seen this condition occur which is a good sign.

    I may need to check with Paul McKenney about whether proposing this checking for mainline is worth it. Thankfully, LPC 2019 is right around the corner! ;-) ——–

    Special Note

    [1] The expedited IPI interrupting an RCU reader has a variation. For an example see below where the IPI was not received, but we still have a problem because the ->need_qs bit in the rcu_read_unlock_special union got set even though the expedited grace period started after IRQs were disabled. The start of the expedited grace period would set the rnp->expmask bit for the CPU. In the unlock path, because the ->need_qs bit is set, it will call rcu_read_unlock_special() and risk a deadlock by way of a ksoftirqd wakeup because exp in that function is true.

    CPU 0                         CPU 1
    preempt_disable();
    rcu_read_lock();
    
    // do something real long
    
    // Scheduler-tick sets
    // ->need_qs as reader is
    // held for too long.
    
    local_irq_disable();
                                  // Expedited GP started
    // Exp IPI not received
    // because IRQs are off.
    
    local_irq_enable();
    
    // Here rcu_read_unlock will
    // still call ..._special()
    // as ->need_qs got set.
    rcu_read_unlock();
    
    preempt_enable();
    

    The fix for this issue is the same as described earlier, disabling interrupts across both rcu_read_lock() and rcu_read_unlock() in the scheduler path.