Sunday, 4 March 2018

ARMv7-a vs Cortex-A7 vs Cortex-A8 vs TI DRA62x +VFP +NEON

"Linaro focuses on the use of the ARM instruction set in its versions 7a (32-bit) and 8 (64-bit) including concrete implementations of these, such as SoCs that contain Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15, Cortex-A53 or Cortex-A57 processor(s)."(

"The ARM Cortex-A8 is a 32-bit processor core licensed by ARM Holdings implementing the ARMv7-A architecture." (

"What is VFP?
VFP is a floating point hardware accelerator. It is not a parallel architecture like Neon. Basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating point calculations. If a processor like ARM does not have floating point hardware, then it relies on software math libraries, which can be prohibitively slow for floating point calculations. The VFP supports both single and double precision floating point calculations compliant with IEEE 754. Further, the VFP is not fully pipelined like Neon, so it will not have equivalent performance to Neon.
Neon and VFP both support floating point, which should I use?
The VFPv3 is fully compliant with IEEE 754
Neon is not fully compliant with IEEE 754, so it is mainly targeted for multimedia applications
... example showing how Neon pipelining will outperform VFP ...
Compile the above function for both Neon and VFP and compare results:
arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp"
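
The example function itself is not reproduced in the quote above. As a stand-in, here is a minimal sketch (my own illustration, not the original) of the kind of loop those two command lines are meant to compile; with -mfpu=neon and -ftree-vectorize GCC can auto-vectorize it (for floating point this typically also needs -ffast-math or -funsafe-math-optimizations, since NEON is not fully IEEE 754 compliant), whereas with -mfpu=vfp every addition goes through the scalar VFP pipeline:

// add.c - hypothetical example, not the function from the quoted article
// __restrict promises no aliasing, which helps the vectorizer.
void add_floats(float * __restrict dst, const float * __restrict a,
                const float * __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];   // element-wise loop: a prime candidate for NEON SIMD
}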

"Using NEON and VFPv3 on Cortex-A8
The compiler supports two different options to control NEON and VFPv3.

--float_support=VFPv3 --neon

The --float_support=VFPv3 option instructs the compiler to generate code that utilizes the VFPv3 coprocessor for both double and single precision floating point operations. The option is also used to enable the assembler to accept VFPv3 instructions in assembly source. To enable VFPv3, the EABI mode must also be enabled through the --abi=eabi option. This is necessary because the calling convention for floating point parameters changes when VFPv3 is enabled and that convention is only supported in EABI mode.

The --neon option instructs the compiler to automatically vectorize loops to use the NEON instructions. To get benefit from this option you should be using --opt_level=2 or higher and be generating code for performance by using the --opt_for_speed=[3-5] option.
Combining options
The TI ARM compiler supports four modes related to Cortex-A8, NEON, and VFPv3. By default neither NEON nor VFPv3 is enabled. In addition to the default, the following three modes are supported:
VFP enabled without NEON
The compiler will generate VFPv3 instructions for single and double precision floating point operations
NEON enabled without VFP
In this mode the compiler will generate NEON instructions for SIMD integer operations. It will not generate NEON instructions to vectorize floating point operations. The motivation for not allowing floating point NEON instructions if VFP is not enabled is that it is possible to implement an integer-only variant of NEON. In order for the NEON unit to support floating point operations the VFPv3 coprocessor must be present.
NEON enabled and VFP enabled
In this mode the compiler will generate a mix of NEON and VFP instructions. The NEON instructions can be either integer or floating point.
VFPv3 vs. NEON performance
A common question with regard to the TI ARM compiler's support for NEON is how to get more floating point operations onto the NEON unit instead of the VFPv3. The reason this is desirable is that the VFPv3 coprocessor is not a pipelined architecture on the Cortex-A8, but the NEON unit is. The compiler will always use VFP instructions for scalar floating point operations, even if the --neon option is used. The hardware is capable of issuing VFP instructions on the NEON coprocessor if the following conditions are met:

The instruction must be a single precision data processing instruction
The processor must be in flush-to-zero mode. In this mode the processor will treat all denormalized numbers as zero.
The processor must be in default NaN mode. In this mode the operation will return the default NaN regardless of the input, whereas in full-compliance mode the returned NaN follows the rules in the ARM Architecture Reference Manual.
The FPEXC.EX bit must be set to 0. This tells the processor that there is no additional state that must be handled by a context switch."
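
Putting the quoted options together, a command line that enables both NEON and VFPv3 with aggressive optimization could look roughly as follows (a sketch only: it assumes the TI compiler driver is invoked as armcl and omits the target-device selection options):

armcl --abi=eabi --float_support=VFPv3 --neon --opt_level=3 --opt_for_speed=5 source.c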

DRA62x Automotive Application DSP + ARM Processors
The ARM Cortex-A8 processor has a Harvard architecture and provides a complete high-performance subsystem, including:
• ARM Cortex-A8 Integer Core
• Superscalar ARMv7 Instruction Set
• Thumb-2 Instruction Set
• Jazelle RCT Acceleration
• CP14 Debug Coprocessor
• CP15 System Control Coprocessor
• NEON™ 64-/128-bit Hybrid SIMD Engine for Multimedia
• Enhanced VFPv3 Floating-Point Coprocessor
• Enhanced Memory Management Unit (MMU)
• Separate Level-1 Instruction and Data Caches
• Integrated Level-2 Cache
• 128-bit Interconnect with Level 3 Fast (L3) System Memories and Peripherals
• Embedded Trace Module (ETM).

Saturday, 12 November 2016

base class operator== detection

Consider a class for item storage. Items can be added, read and updated. In case of modification, a callback notification is sent to all listeners.
To detect modification, operator==() can be used. It seems to be a natural, non-intrusive solution.
This seems fine, but a problem comes with inheritance.
Consider types:

struct A {
    A(int v) : v(v) {}

    int v;

    friend bool operator==(const A &lhs, const A &rhs)
    { return lhs.v == rhs.v; }
};

struct B : A {
    B(int v, int w) : A(v), w(w) {}

    int w;
};
Note that operator==() is defined as a friend (therefore not a class member - see: ADL, friend name injection, Barton–Nackman trick) but it can also be defined as a member function or a global function.

The potential problem is visible here:

A a1{1}, a2{1};
std::cout << (a1 == a2);

B b1{1,1}, b2{1,2};
std::cout << (b1 == b2);

Both print '1', but is 'b1' really equal to 'b2'?
According to the current implementation, yes, because only the A part of a B-type object is compared.
The real problem is that nothing signals this pitfall. When generic code is used to compare objects, and the types are defined in different files, it is easy to forget about an operator==() definition for class B.
To avoid the potential problem it might be better to decide that such a situation is prohibited and shall end in a compile-time error.
What is required is a compile-time check whether some type (some, because the comparison will be used in generic code) has operator== defined.
A tool to solve the problem can be found e.g. in Boost (has_equal_to from Boost.TypeTraits or the Concept Check Library), but a simple solution is presented below.
For C++98:

namespace CHECK {
    class No { bool b[2]; };

    // Fallback operator==, chosen only when no better match exists for T == Arg.
    template <typename T, typename Arg> No operator== (const T&, const Arg&);

    bool Check (...);
    No& Check (const No&);

    template <typename T, typename Arg = T>
    struct EqualExists {
        enum { value = (sizeof(Check(*(T*)(0) == *(Arg*)(0))) != sizeof(No)) };
    };
}

Simplified version for C++11:

#include <type_traits> // std::is_same

namespace CHECK {
    struct No {};
    template <typename T, typename Arg> No operator== (const T&, const Arg&);

    template <typename T, typename Arg = T>
    struct EqualExists {
        enum { value = !std::is_same<decltype(*(T*)(0) == *(Arg*)(0)), No>::value };
    };
}

Using CHECK::EqualExists<T>::value with static_assert allows the potential problem to be detected at compile time, as shown below.
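
A minimal usage sketch (my own example, using the C++11 version; A and B are the types from above, struct C is a hypothetical type with no operator== at all):

struct C {};

static_assert( CHECK::EqualExists<A>::value, "A defines operator==");
static_assert(!CHECK::EqualExists<B>::value, "B only inherits operator==(const A&, const A&)");
static_assert(!CHECK::EqualExists<C>::value, "C has no operator== at all");

Note that B is reported as lacking operator==: inside the check the fallback CHECK::operator==(const B&, const B&) is an exact match, so overload resolution prefers it over the inherited comparison, which needs a derived-to-base conversion - exactly the situation we want to turn into a compile-time error.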

Sunday, 12 June 2016

Old dog, old tricks (in C++)

Compile time safety, compile time error handling - all about the type system, constness, ...

C-array safety

If a function uses a constant-length array it is tempting to write:

void f(int t[3])
{
    t[2] = ...;
}

and call it with a matching array:

int t[3];

However, as the array size in the parameter declaration is merely a hint for the human programmer (the parameter decays to int*), it is also possible to use:

int t[4];

which perhaps is ok (but not nice).
But it is also possible to use:

int t[2];

which is definitely wrong - f() writes out of bounds.

A solution that enforces the exact array size is to take a reference to an array:

void f(int (&t)[3])

Now only three-element arrays are accepted; anything else is a compile-time error.
BTW - the above construct is frequently used for C-array handling in templates, as sketched below.
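
A minimal sketch (my own illustration) of such template usage - deducing a C-array's size from a reference-to-array parameter:

#include <cstddef>

// N is deduced from the argument's type, so the size always matches the actual array.
template <typename T, std::size_t N>
std::size_t arraySize(T (&)[N])
{
    return N;
}

// usage:
// int t[3];
// std::size_t n = arraySize(t); // n == 3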

Monday, 7 December 2015

C++11 and initialization

Uniform initialization in C++11 can be tricky.

The following invocations mean something completely different and give different results:

std::vector<int> v(1); // "normal" constructor invocation with (int) param - creates vector<int> with 1 element initialized to default value (i.e. 0)
std::vector<int> v{1}; // invocation of constructor with initializer_list<> param - content of initializer_list is copied into vector (i.e. one value of 1)

The following invocations also mean something completely different and give different results:

std::vector<int> v(1, 1); // "normal" constructor invocation with (int,int) params - creates vector<int> with 1 element initialized to the specified value (i.e. 1)
std::vector<int> v{1, 1}; // invocation of constructor with initializer_list<> param - content of initializer_list is copied into vector (i.e. two values of 1)

But the following mean something completely different yet give the same results:

std::vector<int> v(2, 2); // "normal" constructor invocation with (int,int) params - creates vector<int> with 2 elements initialized to the specified value (i.e. 2)
std::vector<int> v{2, 2}; // invocation of constructor with initializer_list<> param - content of initializer_list is copied into vector (i.e. two values of 2)

Note that the following will not compile:

std::vector<int> v(1, 1, 1); // no such "normal" constructor

whereas the following is a perfectly valid C++11 statement:

std::vector<int> v{1, 1, 1}; // invocation of constructor with initializer_list<> param - content of initializer_list is copied into vector (i.e. three values of 1)

Also note - what is perhaps even more surprising - that when the container value type cannot be initialized from the values in the list (no such conversion exists), the following construction will invoke the "normal" constructor:

std::vector<std::string> v{1}; // same as - std::vector<std::string> v(1);

To avoid such behavior the = form (copy-list-initialization) can be used in initialization (this is still construction, not assignment):

std::vector<std::string> v = {1}; // this will fail to compile
std::vector<int> v = {1}; // this will invoke initializer_list<> constructor as for std::vector<int> v{1};

There is even more to uniform initialization, especially around the auto keyword - see e.g. "Effective Modern C++" by Scott Meyers; one example is sketched below.
Also see the related discussions on stackoverflow.
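
One example of the auto interaction (my own sketch; the deduction for the last form depends on whether the compiler applies the N3922 change, which most modern compilers do retroactively):

auto a = {1};      // always deduced as std::initializer_list<int>
auto b = {1, 2.0}; // error: no single element type can be deduced
auto c{1};         // std::initializer_list<int> under the original C++11 rules, plain int after N3922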

Monday, 9 November 2015

Thread-safe Catch

The Catch test framework is nice, but not thread-safe.

There is a thread-safe fork of Catch.
At the time of writing, the original Catch is at version 1.2.1 whilst the thread-safe fork is at 1.1.

Friday, 6 November 2015

With or without you (exceptions)

To exception, or not to exception, that is the question.
What about the cyclomatic complexity argument? E.g. compare the two sketches below.
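
This is my own illustrative sketch of that argument: with error codes every call site adds a branch (and so raises cyclomatic complexity), while with exceptions the happy path stays linear and the branching is concentrated in a single handler. All helper functions are hypothetical stubs, added only so the sketch compiles.

#include <string>

// Hypothetical helpers, stubbed out; real ones would do I/O and could fail.
static bool readInput(std::string& s)                { s = "42"; return true; }
static bool parseValue(const std::string& s, int& v) { v = std::stoi(s); return true; }
static bool store(int)                               { return true; }

// Error-code style: every call adds an if-branch the caller must write and test.
int processWithErrorCodes()
{
    std::string s;
    if (!readInput(s))     return -1;
    int v = 0;
    if (!parseValue(s, v)) return -2;
    if (!store(v))         return -3;
    return 0;
}

// Throwing variants of the same helpers (again hypothetical stubs).
static std::string readInputEx()               { return "42"; }
static int  parseValueEx(const std::string& s) { return std::stoi(s); }
static void storeEx(int)                       {}

// Exception style: the happy path has no branches; failures propagate to one catch block
// placed at whatever level actually cares about them.
int processWithExceptions()
{
    storeEx(parseValueEx(readInputEx()));
    return 0;
}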

Friday, 28 August 2015

upgradeable RW-locks - no, no

Never to be forgotten - naive in-place upgrades of RW-locks lead to deadlock.
I.e. when a reader upgrades to a writer without first releasing the read lock, another thread doing the same leads to deadlock - neither can proceed while the other still holds its read lock.
Therefore:
- do not think about upgradeable RW-locks, or
- if you do, then a try_upgrade() function might be a good approach, or
- deadlock detection, or
- a third user type - beside reader and writer - an upgradeableReader (like EnterUpgradeableReadLock from .NET). There can be only one upgradeableReader at a time (mutually exclusive with other upgradeable readers). Normal readers cannot upgrade, therefore it is certain that only one user will try to upgrade. A sketch of this approach, using Boost.Thread, follows below.
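
A minimal sketch of the third-user-type approach using Boost.Thread, where boost::upgrade_lock plays the upgradeableReader role (the protected data and its check are my own, made-up example):

#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>

boost::shared_mutex rwMutex;   // protects someSharedData
int someSharedData = 0;

void readAndMaybeWrite()
{
    // At most one thread can hold upgrade ownership at a time, so only one
    // reader can ever attempt an upgrade - no upgrade/upgrade deadlock.
    boost::upgrade_lock<boost::shared_mutex> readLock(rwMutex);

    if (someSharedData == 0)   // inspect the data under shared (upgradeable) ownership
    {
        // Atomically exchange upgrade ownership for exclusive ownership;
        // this waits for plain readers to leave, but cannot deadlock with another upgrader.
        boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(readLock);
        someSharedData = 42;
    }   // exclusive ownership is released here, upgrade ownership restored
}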