A solution is now at hand for interrupts being lost when they occur close to a system timer tick interrupt on the AM335X CPU. THE LATEST WEC2013 UPDATE NOW HAS THIS SOLUTION.

Background

Last year I observed a bug with serial port interrupts in an OS built from a Compact 2013 TI AM335X BSP that I had ported from Compact 7. The bug does not occur with a Compact 7 OS on the same target hardware. The target hardware is a Variscite AM3352/4 SOM. The bug has also been reported on the BeagleBone Black running Compact 2013, which has a TI processor from the same family (and on which I was also able to demonstrate it).

When a serial port was "hammered" by an application, the application would eventually lock up waiting for a receive to complete. By hammered I mean: send a message over a serial loopback and synchronously receive the message over the same port, with both actions occurring repeatedly, one after the other, on the same thread. Variations of the application, such as send and receive running in separate threads with signalling between the threads, exhibited the same failure. Eventually the receive function would time out. Changing the serial port timeout configuration had no impact upon the outcome. Closing and reopening the serial port did not resolve the issue. Unloading the driver and reloading it did reset things and enable the application to run again, but only until the same serial receive timeout occurred.
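The "hammer" loop described above can be sketched off-target as follows. This is only an illustrative model, not the actual test application: port_io, hammer_loopback, fake_loop and demo_round_trips are hypothetical names, and on the device the send and recv callbacks would presumably wrap WriteFile() and ReadFile() on the opened COM port. The fake loopback simply swallows messages after a chosen count to imitate the receive that never completes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical port abstraction (not part of any BSP). */
typedef struct {
    size_t (*send)(void *ctx, const unsigned char *buf, size_t len);
    /* Returns bytes received before the timeout; 0 means the read timed out. */
    size_t (*recv)(void *ctx, unsigned char *buf, size_t len);
    void *ctx;
} port_io;

/* Send msg and synchronously read it back, repeatedly, on one thread.
 * Returns the number of completed round trips; fewer than max_iterations
 * means a receive timed out or the data came back different. */
static unsigned long hammer_loopback(const port_io *io,
                                     const unsigned char *msg, size_t len,
                                     unsigned long max_iterations)
{
    unsigned char rx[64];
    unsigned long i;

    assert(io != NULL && msg != NULL);
    if (len == 0 || len > sizeof rx)
        return 0;

    for (i = 0; i < max_iterations; ++i) {
        size_t got = 0;
        if (io->send(io->ctx, msg, len) != len)
            break;
        /* Keep reading until the whole message is back or a read times out. */
        while (got < len) {
            size_t n = io->recv(io->ctx, rx + got, len - got);
            if (n == 0)
                break;
            got += n;
        }
        if (got != len || memcmp(rx, msg, len) != 0)
            break;
    }
    return i;
}

/* Fake loopback for trying the loop on a desktop machine (an assumption
 * made for this sketch): every message after fail_after is swallowed. */
typedef struct {
    unsigned char buf[64];
    size_t len;               /* bytes waiting to be read */
    unsigned long sends;      /* messages sent so far */
    unsigned long fail_after; /* messages delivered before the "lock up" */
} fake_loop;

static size_t fake_send(void *ctx, const unsigned char *b, size_t len)
{
    fake_loop *f = ctx;
    f->sends++;
    if (f->sends > f->fail_after) {
        f->len = 0;           /* receiver never sees this message */
        return len;
    }
    memcpy(f->buf, b, len);
    f->len = len;
    return len;
}

static size_t fake_recv(void *ctx, unsigned char *b, size_t len)
{
    fake_loop *f = ctx;
    size_t n = f->len < len ? f->len : len;
    memcpy(b, f->buf, n);
    memmove(f->buf, f->buf + n, f->len - n);
    f->len -= n;
    return n;                 /* 0 models a receive timeout */
}

static unsigned long demo_round_trips(unsigned long fail_after,
                                      unsigned long max_iterations)
{
    fake_loop f = { {0}, 0, 0, 0 };
    port_io io = { fake_send, fake_recv, &f };
    f.fail_after = fail_after;
    return hammer_loopback(&io, (const unsigned char *)"ping", 4,
                           max_iterations);
}
```

Calling demo_round_trips(5, 100) stops after five round trips, which mirrors how the real application runs for a while and then stalls on a receive.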

There is a thread about this issue on the Microsoft Platform Builder Forum:

https://social.msdn.microsoft.com/Forums/en-US/9d022b9c-5e5f-4543-b73f-e5bf3e5792e4/my-ist-is-no-longer-called-no-further-interrupts-could-be-handled?forum=winembplatdev

Discussion

The bug was investigated using Kernel Tracker. It was observed that when serial receive interrupts occurred within about 60 µs of a system timer tick interrupt, those serial interrupts did not complete. In the Kernel Tracker screen below, Int #0 is the timer tick and Int #23 is the serial receive interrupt:

image

The ss.exe thread waits for the serial receive interrupt, which then gets handled, and the thread then waits again. These interrupts are more frequent than the timer ticks. The same pattern occurs in the next screen, but there a receive interrupt occurs just before a timer tick (as circled):

image

This serial interrupt never completes.

Discussion suggested that what happens is that InterruptDone is not called by the IST for the circled serial receive. Indeed, a simple fix is to add an explicit InterruptDone in the serial handler/application, but that would have impacts upon various usage scenarios. This was done as a test and it did "fix" the problem.

Discussion also suggested that it was a more general interrupt issue, not specific to the serial port driver. If so, the problem is probably in the Microsoft (Private) code and not in the Adeneo code upon which the AM335X BSPs are based. It was concluded that this was an ARM-specific problem, not manifested in x86 systems. Is it due to the change to the Thumb2 compiler with Compact 2013? It is also possibly specific to the TI AM33X BSPs.


The InterruptDone Issue:

Function: SerialDispatchThread( )
File: SERIAL\COM_MDD2\mdd.c
Code:

while ( !pSerialHead->KillRxThread ) {
    DEBUGMSG (ZONE_THREAD, (TEXT("Event %X, %d\r\n"),
                            pSerialHead->hSerialEvent,
                            pSerialHead->pHWObj->dwIntID ));
    WaitReturn = WaitForSingleObject(pSerialHead->hSerialEvent, INFINITE);

    SerialEventHandler(pSerialHead);
    InterruptDone(pSerialHead->pHWObj->dwIntID);
}

The issue was that “WaitForSingleObject” was not returning because pSerialHead->hSerialEvent was not being set by the kernel.

This was easy to prove.

  1. Create a program that continually reads from and writes to the serial port.
  2. Change the WaitForSingleObject to time out, i.e. change INFINITE to 10000 (time out after 10 seconds).
  3. Check the “WaitReturn” value to see whether the WaitForSingleObject was signalled or timed out:
    a. If signalled, all good!
    b. If timed out, the kernel has failed!

The “work-around” is to just continue the loop; this resets the required flags and things “appear to continue to work”. The problem with this is that we needed to set quite a small timeout value, e.g. 10 ms, meaning we ended up with a polled rather than an event-driven system :-(
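The timeout check and the work-around can be modelled off-target roughly as follows. This is a sketch under stated assumptions, not the mdd.c change itself: wait_fn stands in for WaitForSingleObject, and fake_event, fake_wait and demo_count_timeouts are hypothetical helpers that make every Nth wait time out. The two constants carry the same values as the Win32 WAIT_OBJECT_0 and WAIT_TIMEOUT and are only defined here so the sketch compiles without the Windows headers.

```c
#include <assert.h>
#include <stddef.h>

#ifndef WAIT_OBJECT_0
#define WAIT_OBJECT_0 0x00000000UL   /* wait was satisfied by the event */
#endif
#ifndef WAIT_TIMEOUT
#define WAIT_TIMEOUT  0x00000102UL   /* timeout elapsed, event never set */
#endif

/* Stand-in for WaitForSingleObject(event, timeout_ms) (assumption). */
typedef unsigned long (*wait_fn)(void *event, unsigned long timeout_ms);

typedef struct {
    unsigned long signalled;   /* waits that returned because the event was set */
    unsigned long timed_out;   /* waits where the kernel never set the event */
} wait_stats;

/* Model of the dispatch loop with a finite timeout instead of INFINITE. */
static void dispatch_loop(wait_fn wait, void *event,
                          unsigned long timeout_ms, unsigned long rounds,
                          wait_stats *stats)
{
    unsigned long i;

    assert(wait != NULL && stats != NULL);
    for (i = 0; i < rounds; ++i) {
        unsigned long wait_return = wait(event, timeout_ms);

        if (wait_return == WAIT_TIMEOUT)
            stats->timed_out++;      /* step 3b: the interrupt was lost */
        else if (wait_return == WAIT_OBJECT_0)
            stats->signalled++;      /* step 3a: all good */

        /* In the driver, the loop would carry on here with
         * SerialEventHandler() and InterruptDone() in both cases;
         * that is the "just continue the loop" work-around. */
    }
}

/* Hypothetical event that fails to get signalled every lose_every-th wait. */
typedef struct {
    unsigned long calls;
    unsigned long lose_every;  /* 0 means never lose an interrupt */
} fake_event;

static unsigned long fake_wait(void *event, unsigned long timeout_ms)
{
    fake_event *e = event;
    (void)timeout_ms;
    e->calls++;
    if (e->lose_every != 0 && e->calls % e->lose_every == 0)
        return WAIT_TIMEOUT;
    return WAIT_OBJECT_0;
}

static unsigned long demo_count_timeouts(unsigned long rounds,
                                         unsigned long lose_every)
{
    fake_event e = { 0, 0 };
    wait_stats s = { 0, 0 };
    e.lose_every = lose_every;
    dispatch_loop(fake_wait, &e, 10000, rounds, &s);
    return s.timed_out;
}
```

With a 10000 ms timeout the counter is a diagnostic; shrinking the timeout to around 10 ms turns the same loop into the polling work-around, which is why it was not a satisfactory fix.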

Thx to CH for this.


Further Activity

The bug was reported to Microsoft, who were able to demonstrate it and accepted it as an issue. I was informed about one week ago that there is a solution at hand, which I have just been able to test. The solution involves replacing two lib files with updated ones. Without the "fix" the serial app fails after a few minutes, although some variants ran for hours. One test ran for 23.5 hours, and I thought I had solved it! With this fix, the application has been running for 24 hours without failure. The test was performed by first building a Retail build of the OS without KITL using the original libs and running the serial app (which fails). The OS is then rebuilt with the updated libs and booted on the same target, and the serial application is run, with success (no failures thus far).

Just in:

I ran for 4 days + …no problems. DV

Mine is now past 2 days :-)

Resolution

I recently received the two updated lib files from Microsoft to test.

The files are nkmain.lib and nkprmain.lib.

Although two others have successfully tested these updates, they are not for distribution yet. The correction is still undergoing testing.

The updates will be part of a subsequent Compact 2013 Monthly Update. Watch this space.

The problem code is in the Private tree and involves a misplaced label in (.s) assembler source.

File: C:\WINCE800\Private\winceos\COREOS\nk\kernel\arm\armtrap.s

Errant Code:

        mov     r3, #0                          ; done, no need to restart once reached here
Done1   mov     r0, r12                         ; (r0) = return original value

Correct Code:

Done1   mov     r3, #0                          ; done, no need to restart once reached here
        mov     r0, r12                         ; (r0) = return original value

On a single-core device, r3 is an address marker for restarting the InterlockedExchange() operation when it is interrupted. Because the label is misplaced, when an interrupt happens at “mov r3, #0” (i.e. the exchange is completed but r3 is not yet set to 0), IRQHandler() will mistakenly restart the InterlockedExchange(), which then returns the value that the first pass had just stored rather than the original value. The thread scheduler is one such victim. NextThread() issues:

InterlockedExchange((PLONG)&PendEvents1(ppcb), 0) to retrieve and reset the pending interrupt(s). Since the mistakenly restarted InterlockedExchange() returns 0 in this case, all the pending interrupts at that moment are lost. As a result, the corresponding IST(s) are not signalled to process the interrupt(s).
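The effect of the misplaced label can be illustrated with a small C model of the restartable exchange. This is a simplified sketch, not the kernel code: the real IRQHandler() restart test is not shown in this post, so the model assumes the handler restarts the sequence whenever the interrupted instruction lies before the Done1 label. The step names, exchange_with_irq, demo_fetch_pending and the use of bit 23 for the pending serial interrupt are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Instruction positions in the modelled exchange sequence. */
enum {
    STEP_LOAD,      /* r12 = *target               */
    STEP_STORE,     /* *target = new value         */
    STEP_CLEAR_R3,  /* mov r3, #0                  */
    STEP_RETURN     /* mov r0, r12                 */
};

/* Where the Done1 label sits in each version of armtrap.s. */
#define DONE1_ERRANT  STEP_RETURN    /* label after "mov r3, #0" */
#define DONE1_CORRECT STEP_CLEAR_R3  /* label on "mov r3, #0"    */

/* Run one exchange while a single interrupt arrives just before the
 * instruction at irq_before. Assumed handler behaviour: if the interrupted
 * instruction is before Done1, execution restarts at STEP_LOAD. */
static uint32_t exchange_with_irq(uint32_t *target, uint32_t new_value,
                                  int done1_at, int irq_before)
{
    uint32_t r12 = 0;
    int pc = STEP_LOAD;
    int irq_pending = 1;

    assert(target != 0);
    while (pc < STEP_RETURN) {
        if (irq_pending && pc == irq_before) {
            irq_pending = 0;
            if (pc < done1_at) {
                pc = STEP_LOAD;      /* handler restarts the exchange */
                continue;
            }
        }
        switch (pc) {
        case STEP_LOAD:
            r12 = *target;
            break;
        case STEP_STORE:
            *target = new_value;
            break;
        case STEP_CLEAR_R3:
            break;                   /* only clears the restart marker */
        }
        pc++;
    }
    return r12;                      /* mov r0, r12 */
}

/* Models NextThread() fetching and clearing the pending events while an
 * interrupt lands exactly on "mov r3, #0". */
static uint32_t demo_fetch_pending(int done1_at)
{
    uint32_t pend_events = 1u << 23; /* assumed encoding of Int #23 */
    return exchange_with_irq(&pend_events, 0, done1_at, STEP_CLEAR_R3);
}
```

In this model demo_fetch_pending(DONE1_ERRANT) returns 0, so the pending interrupt is lost, while demo_fetch_pending(DONE1_CORRECT) returns the original pending bit. An interrupt arriving before the store is restarted safely in both versions, which fits the observation that only a narrow timing window triggers the failure.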