Sunday, 4 March 2018

ARMv7-a vs Cortex-A7 vs Cortex-A8 vs TI DRA62x +VFP +NEON


"Linaro focuses on the use of the ARM instruction set in its versions 7a (32-bit) and 8 (64-bit) including concrete implementations of these, such as SoCs that contain Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15, Cortex-A53 or Cortex-A57 processor(s)." (https://en.wikipedia.org/wiki/Linaro)

"The ARM Cortex-A8 is a 32-bit processor core licensed by ARM Holdings implementing the ARMv7-A architecture." (https://en.wikipedia.org/wiki/ARM_Cortex-A8)

https://en.wikipedia.org/wiki/Comparison_of_ARMv7-A_cores

https://en.wikipedia.org/wiki/ARM_architecture#VFP


"What is VFP?
VFP is a floating point hardware accelerator. It is not a parallel architecture like Neon. Basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating point calculations. If a processor like ARM does not have floating point hardware, then it relies on software math libraries which can prohibitively slow down floating point calculations. The VFP supports both single and double precision floating point calculations compliant with IEEE754. Further, the VFP is not fully pipelined like Neon, so it will not have equivalent performance to Neon.
Neon and VFP both support floating point, which should I use?
The VFPv3 is fully compliant with IEEE 754
Neon is not fully compliant with IEEE 754, so it is mainly targeted for multimedia applications
.. example of showing how Neon pipelining will outperform VFP...
Compile the above function for both Neon and VFP and compare results:
arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp"
(http://processors.wiki.ti.com/index.php/Cortex-A8)
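
The function that the TI page compiles is elided in the quote above, so here is a stand-in of my own, not the original: a minimal single-precision loop of the kind -ftree-vectorize can turn into NEON code. The name scale_add and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not the one from the TI wiki page):
 * dst[i] = a[i] * k + b[i] on single-precision floats.
 * `restrict` promises the compiler that the arrays do not overlap,
 * which makes the loop easier to auto-vectorize. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] * k + b[i];
    }
}
```

Saved as, say, scale_add.c, it can be appended with -c to either of the two command lines above (older toolchains may also need -std=c99). One thing to keep in mind: according to the GCC documentation, the auto-vectorizer does not emit NEON floating-point instructions unless -funsafe-math-optimizations is also given, precisely because NEON is not fully IEEE 754 compliant. Without that flag the two builds may well come out the same.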

"Using NEON and VFPv3 on Cortex-A8
The compiler supports two different options to control NEON and VFPv3.

--float_support=VFPv3 --neon

The --float_support=VFPv3 option instructs the compiler to generate code that utilizes the VFPv3 coprocessor for both double and single precision floating point operations. The option is also used to enable the assembler to accept VFPv3 instructions in assembly source. To enable VFPv3 the EABI mode must also be enabled through the --abi=eabi option. This is necessary because the calling convention for floating point parameters changes when VFPv3 is enabled and that convention is only supported in EABI mode.

The --neon option instructs the compiler to automatically vectorize loops to use the NEON instructions. To get benefit from this option you should be using --opt_level=2 or higher and be generating code for performance by using the --opt_for_speed=[3-5] option.
Combining options
The TI ARM compiler supports four modes related to Cortex-A8, NEON, and VFPv3. By default neither NEON nor VFPv3 is enabled. In addition to the default, the following 3 modes are supported:
VFP enabled without NEON
The compiler will generate VFPv3 instructions for single and double precision floating point operations
NEON enabled without VFP
In this mode the compiler will generate NEON instructions for SIMD integer operations. It will not generate NEON instructions to vectorize floating point operations. Floating point NEON instructions are not allowed when VFP is disabled because it is possible to have an integer only variant of NEON implemented. In order for the NEON unit to support floating point operations the VFPv3 coprocessor must be present.
NEON enabled and VFP enabled
In this mode the compiler will generate a mix of NEON and VFP instructions. The NEON instructions can be either integer or floating point.
VFPv3 vs. NEON performance
A common question with regard to TI ARM compiler's support for NEON is how to get more floating point operations on the NEON unit instead of the VFPv3. This is desirable because the VFPv3 coprocessor is not a pipelined architecture on the Cortex-A8, but the NEON is. The compiler will always use VFP instructions for scalar floating point operations, even if the --neon option is used. The hardware is capable of issuing VFP instructions on the NEON coprocessor if the following conditions are met:

The instruction must be a single precision data processing instruction
The processor must be in flush-to-zero mode. In this mode the processor will treat all denormalized numbers as zero.
The processor must be in default NaN mode. In this mode the operation will return the default NaN regardless of the input, whereas in full-compliance mode the returned NaN follows the rules in the ARM Architecture Reference Manual.
The FPEXC.EX bit must be set to 0. This tells the processor that there is no additional state that must be handled by a context switch."
(http://processors.wiki.ti.com/index.php/Using_NEON_and_VFPv3_on_Cortex-A8)
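
Two of the conditions above, flush-to-zero and default NaN, are controlled by bits in the FPSCR register (FZ is bit 24, DN is bit 25 in ARMv7-A). A minimal sketch of how user code might set them with GCC inline assembly follows. The helper names are my own, and the preprocessor guard is an assumption about which macros the toolchain defines when VFP code generation is on. FPEXC.EX is a privileged bit, so it is left to the kernel and not touched here.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR bit positions in ARMv7-A:
 * bit 24 = FZ (flush-to-zero), bit 25 = DN (default NaN). */
#define FPSCR_FZ (UINT32_C(1) << 24)
#define FPSCR_DN (UINT32_C(1) << 25)

/* Pure helper (hypothetical name): FPSCR value with FZ and DN set,
 * all other bits left unchanged. */
uint32_t fpscr_with_fz_dn(uint32_t fpscr)
{
    return fpscr | FPSCR_FZ | FPSCR_DN;
}

/* Hypothetical helper: set FZ and DN for the current thread.
 * Returns 1 if FPSCR was written, 0 when built for a target without
 * VFP code generation. The guard macros are an assumption about GCC. */
int enable_fz_dn(void)
{
#if defined(__arm__) && defined(__VFP_FP__) && !defined(__SOFTFP__)
    uint32_t fpscr;
    __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
    fpscr = fpscr_with_fz_dn(fpscr);
    __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
    /* Read back to confirm both bits are now set. */
    __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
    assert((fpscr & (FPSCR_FZ | FPSCR_DN)) == (FPSCR_FZ | FPSCR_DN));
    return 1;
#else
    return 0;
#endif
}
```

Calling enable_fz_dn() once at the start of a thread that does single-precision math would cover the two FPSCR conditions. With these bits set, denormals are treated as zero and NaN payloads are not preserved, so this is only suitable where strict IEEE 754 behaviour is not required.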

DRA62x Automotive Application DSP + ARM Processors
The ARM Cortex-A8 processor has a Harvard architecture and provides a complete high-performance subsystem, including:
• ARM Cortex-A8 Integer Core
• Superscalar ARMv7 Instruction Set
• Thumb-2 Instruction Set
• Jazelle RCT Acceleration
• CP14 Debug Coprocessor
• CP15 System Control Coprocessor
• NEON™ 64-/128-bit Hybrid SIMD Engine for Multimedia
• Enhanced VFPv3 Floating-Point Coprocessor
• Enhanced Memory Management Unit (MMU)
• Separate Level-1 Instruction and Data Caches
• Integrated Level-2 Cache
• 128-bit Interconnect with Level 3 Fast (L3) System Memories and Peripherals
• Embedded Trace Module (ETM).
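
The list above says the core has both NEON and VFPv3. On a Linux system one way to double-check this is the "Features" line of /proc/cpuinfo. Below is a small sketch of a whole-word lookup in such a line. The function name has_feature is hypothetical, and the sample string in the note afterwards is only an illustration of the usual format, not output captured from a DRA62x.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if `name` appears as a whole whitespace-separated word in
 * `features`, else 0. Whole-word matching keeps "vfp" from matching
 * inside "vfpv3". */
int has_feature(const char *features, const char *name)
{
    assert(features != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0) {
        return 0;
    }
    const char *p = features;
    while (*p != '\0') {
        p += strspn(p, " \t\n");            /* skip separators */
        size_t word = strcspn(p, " \t\n");  /* length of the next word */
        if (word == len && strncmp(p, name, len) == 0) {
            return 1;
        }
        p += word;
    }
    return 0;
}
```

For a line such as "half thumb fastmult vfp edsp neon vfpv3 tls", has_feature(line, "neon") and has_feature(line, "vfpv3") both return 1. Reading the line from /proc/cpuinfo itself is left out to keep the sketch short.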