[Ed. Note:  Originally posted by Jesse Off on April 17, 2012]

I thought I'd step in and clarify some things on XUARTs. I designed both the XUART hardware FPGA core and the xuartctl software about 5 years ago. There's often confusion about what average response latency to expect from the XUART hardware + software, since the XUART software (xuartctl) is a 100% userspace device driver.


Firstly, the RX FIFO in the XUART hardware is deep enough that we don't even need a real interrupt for 90% of applications out there. This is why, by default, the xuartctl server process simply polls at 100 Hz (every 10 ms) -- even if all 8 possible channels are actively receiving data at 115200 baud, the hardware is in no danger of FIFO overflow. In polling mode, you can expect around 10 ms of latency before your app sees the RX data, because that's how long data can sit in the hardware RX FIFO before being consumed and sent out to pipes/sockets for downstream application code. Anything beyond 10 ms is due to process competition for the CPU -- if other threads/processes want the CPU but you need your application to see XUART serial data promptly, it's a simple matter of setting process priorities. It's worth noting this latency is the norm for kernel serial drivers too: they post their data to the kernel termios layer, which holds on to bytes and does not present them to userspace for 10-100 ms unless a not-well-known low-latency mode ioctl() is run on the serial driver.


The XUART hardware and software can also be told to use IRQs. There are really only two reasons to use IRQs: 1) you are using baud rates above 115200, or 2) you need minimal or sub-10 ms latency on RX characters. To use IRQs on the TS-7550, edit /linuxrc and change the line that invokes the xuartctl server from "xuartctl --server" to "xuartctl --server --irq=29".


Once you allow the software to use IRQs, xuartctl starts paying attention to a couple of extra parameters -- specifically, "rlw=" (receive low-water mark) and "ithr=" (idle threshold). The "rlw=" argument takes a number between 1 and 255 and is the minimum number of RX characters that will accumulate in the RX FIFO before an IRQ is sent and the hardware consumer process is woken up. When you enable IRQs, it defaults to "1", meaning an IRQ is sent after every single RX character received. The "ithr=" argument tunes the number of bit-times the RX line has to be idle before an IRQ is sent even though the rlw= threshold has not been met.


The simple latency test we run as part of "xuartctl --test" is run with TX and RX looped back to each other. The test transmits a single character, waits to receive it on the RX line, then sends it again, over and over. This emulates a system with a 1-byte, 0-latency request/response protocol and serves as our benchmark. The XUART polling at 100 Hz predictably gives us approximately 100 request/response transactions per second; when using IRQs, we get approximately 600 per second. When using the CPU's internal UART with the low-latency ioctl, we get 1500. A modern multi-GHz Linux x86 PC on a motherboard (not USB) serial port gets 2000.


Why did we implement XUARTs in userspace over TCP? There are actually several reasons:

  1. Talking to customers over the years, I've found a very large portion of people use our boards as serial-to-network (TCP/IP) converters, so I figured I'd save them some development by allowing that out-of-the-box.
  2. Linux kernel version independence -- xuartctl doesn't need to be reworked when internal kernel APIs change.
  3. Linux kernel serial port API deficiencies -- XUART hardware has features that cannot be represented in the normal serial API: 9-bit serial, timed idle/break, non-standard arbitrary baud rates, pipelined baud rate changes, etc.
  4. Debuggability -- since we designed both the hardware and software, we had to debug both at the same time. This sort of design flow is much faster and more efficient when you have good debug tools and don't have to worry about crashing/deadlocking the kernel.
  5. As a userspace process, its priority can be tweaked lower if there are more important things to process than serial ports. A kernel driver is always on and interrupting at its full rate while it's open.
  6. Exportability -- xuartctl can export its serial ports to other UNIX machines across the internet to serve as virtual "local" serial ports, since xuartctl.c compiles on any UNIX.


Now, there is a disadvantage on the ARM9 processor that is amplified on system designs with a large number of distinct userspace processes. Since the ARM9 uses a VIVT (virtually indexed, virtually tagged) L1 CPU cache, as you add processes competing for the CPU and performing heavy IPC, the effectiveness of the L1 cache is reduced and total CPU performance drops. Newer ARM CPUs do not have this architectural deficiency, but there is a simple tactic for those wanting to push the performance limits on these low-end processors: by avoiding multiple "processes" and instead creating multiple "threads", L1 cache effectiveness is retained. By incorporating the xuartctl.c sources into your own application, renaming its main() to xuartctl_main(), and invoking its services from a new pthread, these performance improvements can be realized without rewriting any code.


More information on XUARTs: http://wiki.embeddedTS.com/wiki/Xuartctl