An Assembly Language Delay Routine for ARM Microcontrollers

I'm in the midst of writing some code to initialize an LCD screen. The initialization requires me to send a command, then wait for at least 50 milliseconds, then send more commands.

If you're using an Arduino, this is easy: you call the delay() function. If you're graduating from the Arduino to a more 'bare metal' development environment, you may find yourself looking for this function that doesn't exist on your platform.

Furthermore, if you do a Google search for delay routines on an ARM microcontroller, the near-universal reply is, "you should use a hardware timer for that!"

I disagree with that answer. For some instances, yes, it makes sense to use a hardware timer. In other instances, I'm not as convinced. (Feel free to post your arguments in the comments.) In my case, I want to write a library for my LCD that I can later incorporate into other projects. There's no way for me to know what hardware timers might already be in use in those other projects, so choosing one particular hardware timer for my LCD library has challenges.

I might want to port this LCD library to another microcontroller family that uses a completely different timer architecture.

But even more to the point, it seems wasteful for me to initialize and spin up a hardware timer only to initialize an LCD, when my code will never use that timer again.

So let's look harder at using a software delay.

At its core, a delay routine is pretty simple. If we know which instructions are executing, and we know how many clock cycles each instruction takes, and we know the frequency of the instruction clock, we should have no problem coding up a delay routine, right?

Well... it's a little more complicated than that. On many 8-bit micros, that's largely true. For many ARM parts, however, there are some complications that we need to take into account.

The first complication is interrupts. If our delay code is interrupted, our code will have no idea how long the interrupt service routine takes. We won't be able to take that into account in our delay routine.

A second complication is wait states. Many faster processors, including the STM32F4 that I'm using, have flash memory that is significantly slower than the CPU. The CPU is able to execute instructions faster than they can be fetched from memory.

So we just incorporate the wait states into our calculations, right? Well, no. The STM32F4, and other platforms, incorporate caching to help alleviate the problem of wait states. In a best case scenario, all of the instructions in our delay routine are cached and execute at the instruction clock frequency. In other cases, some instructions are cached and other instructions aren't. In a worst case, some instructions might be cached on one run through the delay loop but not on another run through the delay loop, causing each iteration through the loop to take a different amount of time. (Granted, that's probably a pathological case that we'd have to really really try to implement, but it's at least possible.)

Another complication is that other peripherals might be tying up access to memory. The STM32F4 has a pretty complex DMA system. If a DMA channel is accessing memory, it could keep the CPU from fetching an instruction for a few clock cycles.

The takeaway from this is that we simply don't know how long it's going to take to execute our delay code. In many cases, we can't know.

That doesn't mean that all is lost, however! My requirements don't specify that I delay for a precise period of time. My LCD requires that I delay at least a certain amount of time. Interrupts, wait states, and memory contention can only stretch the delay out; none of them can make the loop finish early. So a minimum delay is something we can guarantee with a software delay function! (If you need a precise delay, you really should use a hardware timer.)

So without further ado, here's my delay routine:

void delay_ms(unsigned int msDelay, unsigned int f_clk) {
    unsigned int loopsPerMillisecond = (f_clk/1000) / 3; //3 clock cycles per loop
    for (; msDelay > 0; msDelay--) {
        asm volatile ( //this routine waits (approximately) one millisecond
            "mov r3, %[loopsPerMillisecond] \n\t" //load the initial loop counter
            "loop: \n\t"
                "subs r3, #1 \n\t"
                "bne loop \n\t"

            : //empty output list
            : [loopsPerMillisecond] "r" (loopsPerMillisecond) //input to the asm routine
            : "r3", "cc" //clobber list
        );
    }
}

If you've never used inline assembly code before, this code probably looks pretty foreign to you. Let's go through it line by line. Note that I'm using the GCC compiler here, so the syntax is particular to this compiler. Your compiler may have different syntax. Read the documentation.

The function prototype is pretty simple: it's a function that accepts a number of milliseconds to delay, and the frequency of the instruction clock, and doesn't return a value.

We set up the variable loopsPerMillisecond by dividing the number of instruction cycles in a millisecond by three. We divide by three because the two assembly instructions used to implement the delay loop take 3 instruction cycles on my processor. If you're trying to port this code to another architecture, you'll have to determine how many instruction cycles these instructions take on your processor. If you're unsure, err toward a lower number: that produces more loop iterations and therefore a longer delay, which is the safe direction when the requirement is a minimum.
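To make the arithmetic concrete, here's a small sketch of the same calculation pulled out into its own function. The name loops_per_ms is hypothetical, and the 168 MHz and 16 MHz clocks mentioned in the comments are just assumed examples; plug in your own clock frequency and cycle count.

```c
/* Hypothetical helper: the same arithmetic delay_ms does, pulled out for clarity. */
unsigned int loops_per_ms(unsigned int f_clk, unsigned int cycles_per_loop)
{
    /* f_clk / 1000 is the number of clock cycles in one millisecond. */
    unsigned int cycles_per_ms = f_clk / 1000u;

    /* Each trip through the loop burns cycles_per_loop cycles. */
    return cycles_per_ms / cycles_per_loop;
}

/* With an assumed 168 MHz clock and 3 cycles per loop:
   168000000 / 1000 = 168000 cycles per ms, and 168000 / 3 = 56000 loops.
   With an assumed 16 MHz clock: 16000 / 3 = 5333 loops (integer division). */
```

Note that integer division rounds down, so the computed loop count can come out slightly low. At realistic clock speeds the difference is a fraction of a percent.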

The next line just sets up a loop, one iteration per millisecond of delay.

The next line tells the compiler that we're putting assembly instructions here. The asm volatile block is surrounded by parentheses and ends with a semicolon.

The first block of information within those parentheses is the assembly code itself. In this case, there are four lines. Each line is surrounded by quotation marks and ends with the escaped formatting characters \n\t. The compiler joins the quoted strings into one long string before handing it to the assembler, so the newline is what separates one instruction from the next; the tab just keeps the generated assembly listing tidy.
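If the quoting looks odd, it may help to see that this is ordinary C string behavior and nothing special to inline assembly. The sketch below (the names loop_text and print_loop_text are hypothetical) shows two adjacent string literals being merged into a single string, which is the form the assembly text takes inside an asm block.

```c
#include <stdio.h>
#include <string.h>

/* Adjacent string literals are merged by the compiler into one string.
   The \n ends each assembly line; the \t indents the line that follows. */
const char loop_text[] =
    "subs r3, #1 \n\t"
    "bne loop \n\t";

/* Hypothetical helper: print the merged text the way an assembler would see it. */
void print_loop_text(void)
{
    fputs(loop_text, stdout);
}
```

Leave out the \n and the two instructions would land on the same line, which the assembler would reject.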

The first instruction, mov r3, %[loopsPerMillisecond], moves the value loopsPerMillisecond into the r3 register.

The second line isn't actually an instruction. It's a label called loop.

The third line, subs r3, #1, subtracts 1 from register r3.

The fourth line branches back to the loop label if r3 isn't equal to zero. (Strictly speaking, the branch is taken if the Z flag isn't set. The Z, or zero, flag, is set when the result of an arithmetic operation is zero. Because the most recent arithmetic operation is the subtraction of one from r3 (the result of which is stored back into r3), we can say that the branch is taken if r3 isn't zero.)

Hopefully, you can see that this assembly code loop takes one millisecond to execute. Well, more or less. Actually a tiny bit more, because we load r3 before the loop starts and the surrounding for loop has its own overhead, neither of which is accounted for in the calculations. But all in all, the code should make straightforward sense to you.
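One edge case is worth knowing about: because the loop subtracts first and tests afterward, a starting count of zero doesn't skip the loop. The sketch below models the subs/bne pair in plain C so you can see how many times the loop body runs for a given starting value. The function name count_iterations is hypothetical, and this is a model of the loop's logic, not of its timing.

```c
#include <stdint.h>

/* Hypothetical C model of the two-instruction loop: returns how many times
   the loop body runs for a given starting value of r3. */
uint64_t count_iterations(uint32_t r3)
{
    uint64_t iterations = 0;
    do {
        r3 -= 1u;        /* subs r3, #1 (unsigned math wraps, like the register) */
        iterations++;
    } while (r3 != 0u);  /* bne loop: branch back while the result is nonzero */
    return iterations;
}
```

A starting count of 5 runs the body 5 times. A starting count of 0 wraps around to 0xFFFFFFFF on the first subtraction and keeps going for 2^32 iterations. In delay_ms, that would only happen if f_clk were below 3000, which makes loopsPerMillisecond compute to zero.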

After the inlined assembly code, there's a colon followed by a comment. This is the output list. The output list tells the GCC tools which registers are the output of our assembly code, and how to map those registers to variables in the C code. Since we don't have an output value to our delay routine, this list is empty.

After that is a colon and the input list, where we map the C variable loopsPerMillisecond to a symbolic operand name, also called loopsPerMillisecond, that the assembly code refers to as %[loopsPerMillisecond]. GCC picks a register to hold the value. (The assembler and the compiler use different symbol tables.) I don't want to dive into the particulars of the syntax here; do a Google search for GCC inline assembly to learn more.

After that is a colon with our "clobber list," which includes r3 and cc. This tells GCC that when our assembly finishes running, these two registers may have different values than when our code started. So GCC won't count on either of them holding a value across our assembly code. In our case, r3 is the loop counter, and cc is the processor condition register, which includes the zero (Z) flag that is affected by the subtraction statement and is the trigger for our conditional branch.
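As a variation on the register handling, GCC can also pick the scratch register itself if the counter is declared as a read-write operand ("+r") instead of being copied into a hard-coded r3. The sketch below is my guess at how that might look; it is not the routine used in this article. The name delay_ms_alt is hypothetical. It assumes a Thumb-2 core such as the Cortex-M4 in the STM32F4, and it uses numeric local labels (1: and 1b) so the label can't collide if the code ends up inlined more than once. When compiled for a non-ARM target it falls back to a plain C loop whose timing means nothing.

```c
/* Hypothetical variant of delay_ms: the compiler picks the counter register. */
void delay_ms_alt(unsigned int msDelay, unsigned int f_clk)
{
    for (; msDelay > 0; msDelay--) {
        unsigned int count = (f_clk / 1000u) / 3u; /* assumes 3 cycles per loop */
        if (count == 0u) {
            count = 1u; /* avoid wrapping around to 2^32 iterations */
        }
#if defined(__arm__) || defined(__thumb__)
        __asm__ volatile (
            "1: \n\t"
            "subs %[count], %[count], #1 \n\t"
            "bne 1b \n\t"
            : [count] "+r" (count) /* read-write operand: both input and output */
            :                      /* no separate inputs */
            : "cc"                 /* subs modifies the condition flags */
        );
#else
        /* Non-ARM fallback so the sketch compiles elsewhere; timing is meaningless. */
        volatile unsigned int n = count;
        while (n > 0u) {
            n--;
        }
#endif
    }
}
```

The trade-off is readability: the hard-coded r3 version is easier to follow line by line, while this one leaves register allocation to the compiler and drops r3 from the clobber list.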

All in all, pretty simple, eh?

Alas, there is one further complication. The compiler might look at this code and optimize it away. We need to tell the compiler not to do that. If you're using the GCC compiler, you can do this by putting an attribute on the function's declaration, like so:

void delay_ms(unsigned int msDelay, unsigned int f_clk) __attribute__ ((optimize(0)));

Don't take my word for it, though. Check for yourself. Compile the code with the debug configuration (which typically is set to no optimization), then compile in the release configuration (which is typically optimized), and compare the compiler's output. I did and found that both compile to the same instructions. In my case, the compiler left my loop intact even without the optimize attribute. (The volatile keyword on the asm block is itself a request to GCC not to remove it.) But since there's no guarantee that this will always be the case, I left the attribute in my code.

One more time, folks: remember, I'm not an expert in any of this. My college degree is in biology. Any idiot, even me, can put a blog up on the web and sound like he knows what he's talking about. Much more experienced and knowledgeable people are likely to post in the comments, so read what they have to say. Lastly, for goodness' sake, don't take ANYTHING I say as gospel.

I hope this helps some of you, and happy coding!