Thread-Local Storage

From SEGGER Knowledge Base

Thread-Local Storage (TLS) allows global and static local variables to be unique to each thread. The best-known thread-local variable is `errno`.

Thread-local variables in the standard library

C standard libraries can support usage of thread-local storage.

Examples of library objects that need thread-local storage when used from multiple tasks:

  • error functions - errno, strerror.
  • locale functions - localeconv, setlocale.
  • time functions - asctime, localtime, gmtime, mktime.
  • multibyte functions - mbrlen, mbrtowc, mbsrtowc, mbtowc, wcrtomb, wcsrtomb, wctomb.
  • rand functions - rand, srand.
  • other functions - atexit, strtok.
  • C++ exception engine.
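
To illustrate why such functions need thread-local storage, the following minimal sketch shows a strtok-style tokenizer that keeps its scan position between calls. The name `demo_strtok` and its single-delimiter interface are hypothetical and chosen only for illustration; this is not the SEGGER Runtime Library implementation. Declaring the saved position `_Thread_local` (C11) gives every thread its own copy, so two tasks tokenizing different strings do not disturb each other.

```c
#include <assert.h>
#include <stddef.h>

/* Scan position kept between calls. _Thread_local gives each thread its
   own instance, so tokenizing in two tasks at once does not interfere. */
static _Thread_local char *demo_next = NULL;

/* Hypothetical strtok-like helper: splits on a single delimiter character. */
char *demo_strtok(char *s, char delim) {
  char *start;

  assert(delim != '\0');         /* the terminator cannot be a delimiter */
  if (s == NULL) {
    s = demo_next;               /* continue where the previous call stopped */
  }
  if (s == NULL) {
    return NULL;                 /* nothing left to scan */
  }
  while (*s == delim) {          /* skip leading delimiters */
    s++;
  }
  if (*s == '\0') {
    demo_next = NULL;
    return NULL;
  }
  start = s;
  while (*s != '\0' && *s != delim) {
    s++;
  }
  if (*s == delim) {
    *s = '\0';                   /* terminate the token in place */
    demo_next = s + 1;
  } else {
    demo_next = NULL;            /* reached the end of the string */
  }
  return start;
}
```

With a plain `static` variable instead of `_Thread_local`, a second task calling `demo_strtok` would overwrite the position saved by the first one.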

Locales in the standard library

Locales might not be used or changed directly, but the current locale is used by standard library functions that support localization. According to the C standard and the POSIX.1 extension, locales may be thread-local. Thread-local storage handling is therefore required when using these functions.

The functions with support for localization are: fprintf, isprint, iswdigit, localeconv, tolower, fscanf, ispunct, iswgraph, mblen, toupper, isalnum, isspace, iswlower, mbstowcs, towlower, isalpha, isupper, iswprint, mbtowc, towupper, isblank, iswalnum, iswpunct, setlocale, wcscoll, iscntrl, iswalpha, iswspace, strcoll, wcstod, isdigit, iswblank, iswupper, strerror, wcstombs, isgraph, iswcntrl, iswxdigit, strtod, wcsxfrm, islower, iswctype, isxdigit.
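
The indirect dependency can be seen in a short sketch, assuming a library that provides `setlocale` (for example a full locale implementation). The helper names `use_c_locale`, `current_decimal_point` and `parse_number` are hypothetical. None of the helpers is passed a locale, yet the result of each depends on the locale that is currently active.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Selects the minimal "C" locale. */
void use_c_locale(void) {
  const char *name = setlocale(LC_ALL, "C");
  assert(name != NULL);          /* the "C" locale is always available */
  (void)name;
}

/* No locale is passed in: the decimal point comes from the current locale. */
const char *current_decimal_point(void) {
  return localeconv()->decimal_point;
}

/* strtod does not take a locale argument either, but it accepts the
   decimal point of the current locale. */
double parse_number(const char *s) {
  return strtod(s, NULL);
}
```

In the "C" locale the decimal point is ".", so `parse_number("1.5")` yields 1.5. If locales are thread-local, a different task may see a different decimal point for the same call.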


Locales in Embedded Studio

In its full implementation, the SEGGER Runtime Library supports locales, including per-thread locales.

If localization is not required, i.e. only the minimum "C" locale (7-bit ASCII characters) shall be used, the implementation can be set to minimal. The minimal implementation no longer requires TLS. It does not support some functions, such as setlocale, uselocale, or the POSIX *_l xlocale convenience functions.

The locale implementation can be set with the project option Code -> Library -> Enable Locales.

Handling thread-local storage

When thread-local storage is used, it needs to be handled by the system or OS.

With multiple tasks or threads, the TLS blocks need to be initialized per thread, and the OS is responsible for providing information about the currently active thread.

If there is only one thread, the system can treat thread-local variables like regular variables and share them across the whole system. The system only needs to provide information about where the "global thread-local block" is located, but does not need to explicitly initialize it.

embOS

embOS is prepared to support TLS, but does not enable it by default. This has the advantage of no additional overhead as long as TLS is not needed by the application. The embOS implementation of thread-local storage allows activation of TLS separately for every task. Only tasks that call functions using TLS need to activate it by calling an initialization function when the task is started.

No OS / Single Thread System

Even when there are no threads, the compiler will generate code to access thread-local variables like in a multi-tasking system. The system needs to provide the information which matches the architecture-specific implementation and the memory layout.

Some architectures have or define a register to be the "thread pointer". The thread pointer needs to change when there is a task or thread change and points to the data relevant to the current thread.

On Arm, there is no dedicated thread pointer. Instead, the compiler uses the function __aeabi_read_tp, which needs to be implemented.

When the memory layout has the sections .tbss and .tdata (in this order), __aeabi_read_tp can simply return the start address of .tbss - 8.

 .section .text.__aeabi_read_tp, "ax", %progbits
 .type __aeabi_read_tp, function
 __aeabi_read_tp:
         ldr     R0, =__tbss_start__-8
         bx      LR

Why .tbss - 8?

__aeabi_read_tp is not directly intended to return a pointer to the thread-local data. Instead it shall return the "thread pointer", which points to a structure describing the data of the thread.

With dynamically loaded libraries, the structure content needs to be evaluated to get to the actual data of a variable. With statically linked and loaded applications, which is usually the case for embedded firmware, the compiler can take a shortcut.

In the TLS structure the thread-local data is stored after the "task control block" at a known offset. The compiler knows the offset of a variable in the thread-local storage section and the fixed offset. It can therefore directly use the offset to the thread pointer to address a thread-local variable.

Since firmware does not use the "task control block", it is a purely virtual construct. The bytes preceding the thread-local storage section do not have to exist anywhere in memory; only the size of the block needs to be known in order to calculate the thread pointer.

For Arm the known offset from thread pointer to start of data is: 8.
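
The address calculation can be modelled on a host as follows. This is a minimal sketch: the names `TLS_TCB_SIZE`, `model_thread_pointer`, `model_var_address` and `model_example` are hypothetical, the addresses are made up, and the value 8 is the Arm offset stated above. It shows that the subtraction done in `__aeabi_read_tp` and the fixed offset added by the compiler-generated access cancel out, so the bytes in front of the TLS block are never touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_TCB_SIZE 8u  /* Arm: offset from thread pointer to start of TLS data */

/* Models __aeabi_read_tp: the thread pointer lies TLS_TCB_SIZE bytes
   before the first byte of the TLS block. Only the number is computed,
   the memory in front of the block is never accessed. */
uintptr_t model_thread_pointer(uintptr_t tls_block_start) {
  return tls_block_start - TLS_TCB_SIZE;
}

/* Models the compiler-generated access: thread pointer + fixed offset
   + offset of the variable inside the TLS block. */
uintptr_t model_var_address(uintptr_t thread_pointer, size_t var_offset) {
  return thread_pointer + TLS_TCB_SIZE + var_offset;
}

/* Example: a variable at offset 4 in a TLS block starting at 0x20001000. */
void model_example(void) {
  uintptr_t tp = model_thread_pointer(0x20001000u);

  assert(tp == 0x20000FF8u);
  assert(model_var_address(tp, 4) == 0x20001004u);
}
```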

Linker configuration

Thread-local data and thread-local bss need to be placed in memory in a block and order which is known to the OS. The OS creates a copy of the block for each thread or task and on access of a thread-local variable points to the block copy belonging to the active thread.
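
What an OS does with this block can be sketched in C. This is an illustration under assumptions, not the embOS implementation: the type `TLS_TEMPLATE` and the functions `tls_init_block`, `tls_on_task_switch` and `tls_var` are hypothetical, and the template fields stand in for values that would normally come from linker symbols. The sketch assumes the order tbss followed by tdata and ignores any alignment padding between the two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of the TLS block as laid out by the linker. */
typedef struct {
  size_t      tbss_size;    /* bytes of zero-initialized thread-local data */
  size_t      tdata_size;   /* bytes of initialized thread-local data      */
  const void *tdata_image;  /* initial values of tdata (load image)        */
} TLS_TEMPLATE;

/* Initializes one task's copy of the TLS block: tbss first, then tdata. */
void tls_init_block(void *block, const TLS_TEMPLATE *t) {
  unsigned char *p = block;

  assert(block != NULL && t != NULL);
  memset(p, 0, t->tbss_size);                 /* tbss part is zero-filled */
  if (t->tdata_size != 0) {                   /* tdata part gets its initial values */
    memcpy(p + t->tbss_size, t->tdata_image, t->tdata_size);
  }
}

static void *active_block;  /* TLS block of the currently running task */

/* Hypothetical hook called by the scheduler on every task switch. */
void tls_on_task_switch(void *block_of_next_task) {
  active_block = block_of_next_task;
}

/* Address of the thread-local variable at 'offset' for the active task. */
void *tls_var(size_t offset) {
  return (unsigned char *)active_block + offset;
}
```

Each task owns one buffer of `tbss_size + tdata_size` bytes. Because every access goes through the block of the active task, a write in one task is not visible in any other.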

SEGGER Linker

With the SEGGER Linker thread-local data and thread-local bss can be put into a block, which can then be placed regularly in RAM.

  define block tls with fixed order { block tbss, block tdata };
  
  place in RAM with auto order      { block tls, readwrite, zeroinit };

Linker warning "thread-local and non-thread-local sections cannot be mixed"

When the tls block is declared without a specific ordering, the SEGGER Linker uses auto order to reduce the loss due to alignment, which may interfere with the ordering expected by the OS.

To resolve this warning, define the block which contains tbss and tdata with fixed order.

GNU Linker

With the GNU Linker thread-local data and thread-local bss need to be placed with fixed order and layout, too.

In Embedded Studio the section placement file can take care of this:

 <MemorySegment name="$(FLASH_NAME:FLASH);FLASH1">
   ...
   <ProgramSection alignment="4" load="Yes" runin=".data_run" name=".data" />
   <ProgramSection alignment="4" load="Yes" runin=".tdata_run" name=".tdata" />
   ...
 </MemorySegment>
 <MemorySegment name="$(RAM_NAME:RAM);SRAM;RAM1">
   ...
   <ProgramSection alignment="4" load="No" name=".data_run" />
   <ProgramSection alignment="4" load="No" name=".bss" />
   <ProgramSection alignment="4" load="No" name=".tbss" />
   <ProgramSection alignment="4" load="No" name=".tdata_run" />
   ...
 </MemorySegment>

Note: The default *placement.xml in Embedded Studio prior to version 6.30 created mixed TLS and non-TLS sections and might need to be updated in existing projects.

In a manually created linker script the ordering can look like this:

 __tbss_load_start__ = ALIGN(__bss_end__ , 4);
 /* Create area for "global" thread-local storage tbss section. */
 .tbss ALIGN(__bss_end__ , 4) (NOLOAD) : AT(ALIGN(__bss_end__ , 4))
 {
   __tbss_start__ = .;
   *(.tbss .tbss.*)
 }
 __tbss_end__ = __tbss_start__ + SIZEOF(.tbss);
 __tbss_size__ = SIZEOF(.tbss);
 __tbss_load_end__ = __tbss_end__;

 /* Create load image in Flash for initialization of thread-local storage tdata section. */
 __tdata_load_start__ = ALIGN(__data_load_start__ + SIZEOF(.data) , 4);
 .tdata ALIGN(__tbss_end__ , 4) : AT(ALIGN(__data_load_start__ + SIZEOF(.data) , 4))
 {
   __tdata_start__ = .;
   *(.tdata .tdata.*)
 }
 __tdata_end__ = __tdata_start__ + SIZEOF(.tdata);
 __tdata_size__ = SIZEOF(.tdata);
 __tdata_load_end__ = __tdata_load_start__ + SIZEOF(.tdata);
 
 /* Create area for "global" thread-local storage tdata section after tbss section. */ 
 .tdata_run ALIGN(__tbss_end__ , 4) (NOLOAD) :
 {
   __tdata_run_start__ = .;
 }
 __tdata_run_end__ = __tdata_run_start__ + SIZEOF(.tdata);
 __tdata_run_size__ = __tdata_run_end__ - __tdata_run_start__;
 __tdata_run_load_end__ = __tdata_run_end__;

The runtime init code initializes the areas:

   ldr r0, =__tdata_load_start__
   ldr r1, =__tdata_start__
   ldr r2, =__tdata_end__
   bl memory_copy
   ldr r0, =__tbss_start__
   ldr r1, =__tbss_end__
   movs r2, #0
   bl memory_set

In externally created linker scripts, such as the ones coming from STM32CubeMX, the additional sections and their ordering can look like this:

 /* Create load image in Flash for initialization of thread-local storage areas. */
 . = ALIGN(4);
 .tdata_load (READONLY):
 {
   __tbss_load_start__ = .;
   *(.tbss .tbss.*);
   __tbss_load_end__ = .;
   __tdata_load_start__ = .;
   *(.tdata .tdata.*);
   __tdata_load_end__ = .;
 } >ROM
 
 /* Create area for "global" thread-local storage */
 . = ALIGN(8);
 .tls :
 {
   __tbss_start__ = .;
   . += __tbss_load_end__ - __tbss_load_start__;
   __tbss_end__ = .;
   __tdata_start__ = .;
   . += __tdata_load_end__ - __tdata_load_start__;
   __tdata_end__ = .;
 } >RAM

The startup code initializes the global areas. Note: The code might need to be added to the startup file.

 /* Copy the tdata segment initializers from flash to SRAM */
   ldr r0, =__tdata_start__
   ldr r1, =__tdata_end__
   ldr r2, =__tdata_load_start__
   movs r3, #0
   b LoopCopyTDataInit
 
 CopyTDataInit:
   ldr r4, [r2, r3]
   str r4, [r0, r3]
   adds r3, r3, #4
 
 LoopCopyTDataInit:
   adds r4, r0, r3
   cmp r4, r1
   bcc CopyTDataInit
 
 /* Zero fill the tbss segment. */
   ldr r2, =__tbss_start__
   ldr r4, =__tbss_end__
   movs r3, #0
   b LoopFillZeroTbss
 
 FillZeroTbss:
   str  r3, [r2]
   adds r2, r2, #4
 
 LoopFillZeroTbss:
   cmp r2, r4
   bcc FillZeroTbss

Note: Other section ordering might lead to the error message "TLS sections are not adjacent:"