<h4>Description</h4>
<p>The nrfmin radio driver can get stuck in a state where it no longer receives packets over the radio.</p>
<p>The issue triggers sporadically, anywhere from seconds to hours into operation. A larger number of nodes trying to send concurrently seems to make it more likely.</p>
<p>My initial investigation showed that when the driver is in this state, a packet has been received but apparently is never passed on and cleared, which prevents the driver from switching the peripheral into receive mode. That is, the second condition here fails:</p>
<p><a href="https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L160">https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L160</a></p>
<p>For more information, see the attempted fixes at the end.</p>
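<p>To make the failure mode concrete, here is a minimal host-side sketch of the condition described above. It is a simplified model, not the driver's code: <code>radio_model_t</code> and the <code>model_*</code> functions are hypothetical names, and <code>rx_len</code> merely stands in for <code>rx_buf.pkt.hdr.len</code>. It assumes, as my observations suggest, that RX is only re-entered when the receive buffer is empty.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model; names are illustrative, not driver symbols. */
typedef enum { MODEL_IDLE, MODEL_RX } model_state_t;

typedef struct {
    model_state_t state; /* current radio state in the model */
    uint8_t rx_len;      /* stands in for rx_buf.pkt.hdr.len */
} radio_model_t;

/* Go idle, then re-enter RX only if the receive buffer is free.
 * Returns true when RX was entered. */
static bool model_goto_rx(radio_model_t *m)
{
    assert(m != NULL);
    m->state = MODEL_IDLE;
    if (m->rx_len == 0) {
        m->state = MODEL_RX;
        return true;
    }
    return false;
}

/* A packet arrives: the buffer becomes occupied and the radio leaves RX. */
static void model_packet_received(radio_model_t *m, uint8_t len)
{
    assert(m != NULL);
    m->rx_len = len;
    m->state = MODEL_IDLE;
}

/* The upper layer consumes the packet and frees the buffer. */
static void model_packet_consumed(radio_model_t *m)
{
    assert(m != NULL);
    m->rx_len = 0;
}
```

<p>In this model, if a received packet is never consumed, every later call to <code>model_goto_rx()</code> fails, which matches the symptom of TX working while RX never resumes.</p>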
<h4>Steps to reproduce the issue</h4>
<ol>
<li>Use nrfmin, 6LoWPAN, and GNRC in a program that periodically (broad|multi)casts.</li>
<li>Leave two or more nodes on.</li>
</ol>
<h4>Expected results</h4>
<p>Packets should be received until conditions become unsatisfactory for chip operation or radio connection.</p>
<h4>Actual results</h4>
<p>Packets can be sent over the radio, but receiving does not work.</p>
<h4>Versions</h4>
<p>(a version of RIOT that is a few weeks old; will be filled in later)</p>
<p>My particular NRF51 chip <strong>is not affected by</strong> PAN 20 (which means the <code>STATE</code> register can be used in my case).</p>
<h4>Attempted fixes</h4>
<p>I attempted to fix this by moving some of the handling into the interrupt handler (assuming that is the most appropriate place, in an effort to avoid overflowing some other stack):</p>
<div class="highlight highlight-source-c"><pre><span class="pl-k">void</span> <span class="pl-en">isr_radio</span>(<span class="pl-k">void</span>)
{
    <span class="pl-k">if</span> (NRF_RADIO-><span class="pl-smi">EVENTS_END</span> == <span class="pl-c1">1</span>) {
        NRF_RADIO-><span class="pl-smi">EVENTS_END</span> = <span class="pl-c1">0</span>;
        <span class="pl-c"><span class="pl-c">/*</span> did we just send or receive something? <span class="pl-c">*/</span></span>
        <span class="pl-k">if</span> (state == STATE_RX || rx_buf.<span class="pl-smi">pkt</span>.<span class="pl-smi">hdr</span>.<span class="pl-smi">len</span> > <span class="pl-c1">0</span>) {
            <span class="pl-c"><span class="pl-c">/*</span> drop packet on invalid CRC <span class="pl-c">*/</span></span>
            <span class="pl-k">if</span> ((NRF_RADIO-><span class="pl-smi">CRCSTATUS</span> != <span class="pl-c1">1</span>) || !(nrfmin_dev.<span class="pl-smi">event_callback</span>)) {
                rx_buf.<span class="pl-smi">pkt</span>.<span class="pl-smi">hdr</span>.<span class="pl-smi">len</span> = <span class="pl-c1">0</span>;
                NRF_RADIO-><span class="pl-smi">TASKS_START</span> = <span class="pl-c1">1</span>;
            } <span class="pl-k">else</span> {
                rx_lock = <span class="pl-c1">0</span>;
                nrfmin_dev.<span class="pl-c1">event_callback</span>(&nrfmin_dev, NETDEV_EVENT_ISR);
            }
        }
        <span class="pl-k">if</span> (state == STATE_TX) {
            <span class="pl-c1">goto_target_state</span>();
        }
    }

    <span class="pl-c1">cortexm_isr_end</span>();
}</pre></div>
<p>This greatly improved reliability, but the send function would now sometimes spin forever waiting for <code>state</code> to change from <code>STATE_TX</code>:</p>
<p><a href="https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L335">https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L335</a></p>
<p>After a few attempts I managed to avoid that by adding code that waits for <code>EVENTS_READY</code> after triggering <code>TASKS_TXEN</code> and <code>TASKS_RXEN</code>:</p>
<p><a href="https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L166">https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L166</a></p>
<p><a href="https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L357">https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L357</a></p>
<p>Changed to:</p>
<div class="highlight highlight-source-c"><pre>        NRF_RADIO->EVENTS_READY = <span class="pl-c1">0</span>;
        NRF_RADIO->TASKS_RXEN = <span class="pl-c1">1</span>;
        <span class="pl-k">while</span> (NRF_RADIO->EVENTS_READY == <span class="pl-c1">0</span>) {}</pre></div>
<p>With these changes, after running for a long time, the system sometimes hard faults. I was unable to recover the details because the UART had disconnected before that.</p>
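<p>The busy-wait on <code>EVENTS_READY</code> shown above has no upper bound, so a missed event would hang the caller. A bounded variant could be sketched as follows. <code>wait_for_event()</code> and its <code>max_spins</code> parameter are hypothetical names of mine, not part of RIOT, and the spin count is an arbitrary poll budget rather than a calibrated timeout.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: poll an event register until it becomes non-zero
 * or until max_spins polls have been made.
 * Returns true if the event was observed, false if the budget ran out. */
static bool wait_for_event(const volatile uint32_t *event_reg,
                           uint32_t max_spins)
{
    assert(event_reg != NULL);
    while (max_spins-- > 0) {
        if (*event_reg != 0) {
            return true;
        }
    }
    return false;
}
```

<p>In the driver this would presumably be called as <code>wait_for_event(&amp;NRF_RADIO-&gt;EVENTS_READY, N)</code> right after setting <code>TASKS_RXEN</code> or <code>TASKS_TXEN</code>, with the failure case handled explicitly (for example by going idle and retrying) instead of spinning forever.</p>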
<h4>Other observations</h4>
<p><code>cortexm_isr_end()</code> is documented as something that must be called at the end of every interrupt routine, but in this driver, <code>isr_radio</code> can return early at line <a href="https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L315">https://github.com/RIOT-OS/RIOT/blob/3e6336ce89d64d58ab07764ef7f65fc86800cb85/cpu/nrf5x_common/radio/nrfmin/nrfmin.c#L315</a> and so does not fulfill that requirement. Should this be corrected?</p>
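<p>One way to guarantee the end-of-ISR call is to give the handler a single exit path. The sketch below illustrates the pattern on a host, under the assumption that the early return sits in a "drop packet" branch. <code>isr_model_t</code>, <code>stub_isr_end()</code> and <code>model_isr_radio()</code> are hypothetical stand-ins, not RIOT functions.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters so the control flow can be observed on a host. */
typedef struct {
    unsigned delivered;     /* packets handed to the upper layer */
    unsigned dropped;       /* packets discarded, e.g. on a bad CRC */
    unsigned isr_end_calls; /* how often the end-of-ISR hook ran */
} isr_model_t;

/* Stand-in for the end-of-ISR hook (cortexm_isr_end() in the driver). */
static void stub_isr_end(isr_model_t *c)
{
    c->isr_end_calls++;
}

/* Single-exit handler: the drop branch falls through instead of
 * returning, so the end-of-ISR hook runs on every path. */
static void model_isr_radio(isr_model_t *c, bool event_pending, bool crc_ok)
{
    assert(c != NULL);
    if (event_pending) {
        if (!crc_ok) {
            c->dropped++;   /* no early return here */
        }
        else {
            c->delivered++;
        }
    }
    stub_isr_end(c);
}
```

<p>Whatever the branch taken, <code>isr_end_calls</code> increases by exactly one per invocation in this model.</p>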
<p>Also, when <code>NRF_RADIO->POWER</code> is set to <code>0</code> (off), it appears to reset some of the registers. When waking the peripheral up, <code>nrfmin_init</code> should probably be called again.</p>
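<p>If powering off really does reset configuration registers, as it appears to, the wake-up path would need to reapply the configuration. A minimal host-side model of that idea follows. <code>periph_model_t</code>, the <code>channel</code> field and the <code>model_*</code> functions are hypothetical, and the reset value of <code>0</code> is an assumption made only for the sketch.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a peripheral whose configuration is lost
 * when it is powered off. */
typedef struct {
    bool powered;
    uint32_t channel; /* stands in for any configuration register */
} periph_model_t;

/* Power on and apply the configuration (the role nrfmin_init would play). */
static void model_init(periph_model_t *p, uint32_t channel)
{
    assert(p != NULL);
    p->powered = true;
    p->channel = channel;
}

/* Power off: in this model the configuration falls back to 0. */
static void model_power_off(periph_model_t *p)
{
    assert(p != NULL);
    p->powered = false;
    p->channel = 0;
}

/* Wake up: re-run the init step only if the peripheral was off. */
static void model_wake(periph_model_t *p, uint32_t channel)
{
    assert(p != NULL);
    if (!p->powered) {
        model_init(p, channel);
    }
}
```

<p>Calling <code>model_wake()</code> on an already powered peripheral leaves its configuration untouched, so it is safe to call unconditionally before use.</p>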

