GNOME Bugzilla – Bug 742843
ORC compiler is disabled on the iOS devices
Last modified: 2018-11-03 10:47:30 UTC
Here’s the ORC output in the log. Init stage:

    ORC: INFO: orcdebug.c(70): void _orc_debug_init()(): orc-0.4.23.1 debug init
    ORC: INFO: orcprogram-neon.c(129): void orc_neon_init()(): marking neon backend non-executable

and then there are continuous warnings like this one during pipeline execution:

    ORC: WARNING: orccompiler.c(392): OrcCompileResult orc_program_compile_full(OrcProgram *, OrcTarget *, unsigned int)(): program orc_combine4_12xn_u8 failed to compile, reason: Compilation disabled, using emulation

There’s nothing more specific about why NEON is disabled, but tracing with a debugger shows that orc_arm_get_cpu_flags in orccpu-arm.c has practically no executable code for iOS (one branch is #if'd for Linux only and one for Android only) and will always return 0 (which means no NEON support). Verified on iPad 3rd gen and iPad mini 1st gen.

If I apply a hack to orc_arm_get_cpu_flags so that it reports NEON support, the ORC compiler is enabled and works well on the iPad mini 1st gen. On the iPad 3rd gen I’m getting segfaults from different gstreamer->orc bridges such as video_orc_chroma_up_v2_u8 (videoscale plugin), video_test_src_orc_splat_u32 (videotestsrc), etc. Per the hardware info, the iPad 3rd gen uses the A5X chip while the iPad mini (1st gen) uses the A5, so it is really strange that ORC works OK on the latter but not on the former. Example stack trace for the iPad 3 segfault:
+ Trace 234547
Variables for frame 1:

    d1      guint8 *                  ""           0x03181000
    p1      int                       -2139034625  -2139034625
    n       int                       91           91
    _ex     OrcExecutor
      program       OrcProgram *      NULL         0x00000000
      n             int               91           91
      counter1      int               108279884    108279884
      counter2      int               460800       460800
      counter3      int               16           16
      arrays        void *[64]
      params        int [64]
      accumulators  int [4]
    ex      OrcExecutor *             NULL         0x00000000
      program       OrcProgram *      NULL
      n             int
      counter1      int
      counter2      int
      counter3      int
      arrays        void *[64]
      params        int [64]
      accumulators  int [4]
    func    void (*)(OrcExecutor *)   NULL
    p       OrcProgram *              NULL

Code:

    void
    video_test_src_orc_splat_u32 (guint8 * ORC_RESTRICT d1, int p1, int n)
    {
      OrcExecutor _ex, *ex = &_ex;
      static volatile int p_inited = 0;
      static OrcCode *c = 0;
      void (*func) (OrcExecutor *);

      if (!p_inited) {
        orc_once_mutex_lock ();
        if (!p_inited) {
          OrcProgram *p;

    #if 1
          static const orc_uint8 bc[] = {
            1, 9, 28, 118, 105, 100, 101, 111, 95, 116, 101, 115, 116, 95, 115, 114,
            99, 95, 111, 114, 99, 95, 115, 112, 108, 97, 116, 95, 117, 51, 50, 11,
            4, 4, 16, 4, 128, 0, 24, 2, 0,
          };
          p = orc_program_new_from_static_bytecode (bc);
          orc_program_set_backup_function (p, _backup_video_test_src_orc_splat_u32);
    #else
          p = orc_program_new ();
          orc_program_set_name (p, "video_test_src_orc_splat_u32");
          orc_program_set_backup_function (p, _backup_video_test_src_orc_splat_u32);
          orc_program_add_destination (p, 4, "d1");
          orc_program_add_parameter (p, 4, "p1");

          orc_program_append_2 (p, "storel", 0, ORC_VAR_D1, ORC_VAR_P1, ORC_VAR_D1, ORC_VAR_D1);
    #endif

          orc_program_compile (p);
          c = orc_program_take_code (p);
          orc_program_free (p);
        }
        p_inited = TRUE;
        orc_once_mutex_unlock ();
      }
      ex->arrays[ORC_VAR_A2] = c;
      ex->program = 0;

      ex->n = n;
      ex->arrays[ORC_VAR_D1] = d1;
      ex->params[ORC_VAR_P1] = p1;

      func = c->exec;
      func (ex);    /* <---------------- SEGFAULT HERE */
    }
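The report above notes that orc_arm_get_cpu_flags has no branch that applies to iOS and therefore always returns 0. The following is a minimal sketch of how a platform-gated flag query falls through to 0 when no branch matches, and of what an Apple branch might look like. It is not ORC's code: the function name, the flag constant, and the use of the "hw.optional.neon" sysctl key on the target device are all assumptions made for illustration.

```c
#include <stddef.h>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical flag bit for this sketch; not ORC's own constant. */
#define SKETCH_ARM_FLAG_NEON (1u << 0)

unsigned int
sketch_arm_get_cpu_flags (void)
{
  unsigned int flags = 0;

#if defined(__APPLE__)
  /* Apple platforms: ask the kernel whether NEON is present.
   * Assumption: the "hw.optional.neon" key exists on the device;
   * if the lookup fails, the flag simply stays unset. */
  int has_neon = 0;
  size_t len = sizeof (has_neon);

  if (sysctlbyname ("hw.optional.neon", &has_neon, &len, NULL, 0) == 0
      && has_neon)
    flags |= SKETCH_ARM_FLAG_NEON;
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  /* Other ARM builds: trust what the compiler was told to target. */
  flags |= SKETCH_ARM_FLAG_NEON;
#endif

  /* If no branch above is compiled in, flags is still 0 here, which is
   * the "no NEON support" result described in the report. */
  return flags;
}
```

Note that, as the rest of this thread shows, reporting NEON support only enables the compiler; it does not by itself make the generated code executable on the device.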
Do you know if it actually executed anything from the function? Or did it just explode trying to execute the first instruction (because the memory was not executable maybe)?
Update: it turns out the ORC segfault issue is not device related. The iPad mini was initially running iOS 7.1.1. After I updated it to 8.1.2, I get the same segfaults on the iPad mini as well. So something changed in iOS 8 that broke the ORC compiler!
The crash is reported on first reaching the last line of the bridge functions mentioned above. That is, when I step in the debugger, the segfault is raised as soon as I reach this line: func (ex);
That's expected, as there is no debug information or anything else about the function. It is dynamically generated at runtime (which might be the problem). I think if you set the debugger to assembly mode it would be possible to step in there.
Stepping into the 'func(ex)' function call with (lldb) si shows:

    0x2fb6e90: add r1, r0, #0x174
    0x2fb6e94: vld1.32 {d4[], d5[]}, [r1]
    0x2fb6e98: ldr r2, [r0, #4]
    0x2fb6e9c: cmp r2, #0x40
    0x2fb6ea0: bgt 0x2fb6ed8
    0x2fb6ea4: asr r1, r2, #2
    0x2fb6ea8: str r1, [r0, #12]
    0x2fb6eac: and r2, r2, #0x3

then (lldb) ni and I get the segfault.
Ok, so we don't actually get executable memory here I guess. For iOS we should IMHO also just let orc output the assembly during compilation and use that directly instead of going through the orc JIT at runtime. But that requires build system changes.
The strange thing is that this worked with iOS 7.1.1.
Maybe we have to use a new way for allocating executable memory. Officially this is not supported by iOS to prevent JITs for performance reasons or something like that.
Trying to locally change the orcc command line in orc.mak to generate asm code during the recipe build. The loadupdb implementation is missing for NEON, which breaks the asm code generation for a bunch of functions:

    Failed to compile assembly for 'video_orc_unpack_I420'
    Failed to compile assembly for 'video_orc_unpack_YUV9'
    Failed to compile assembly for 'video_orc_unpack_A420'
    Failed to compile assembly for 'video_orc_resample_bilinear_u32'
    Failed to compile assembly for 'video_orc_convert_I420_AYUV'
    Failed to compile assembly for 'video_orc_convert_I420_BGRA'
    Failed to compile assembly for 'video_orc_resample_h_near_u32_lq'
    Failed to compile assembly for 'video_orc_resample_h_2tap_4u8_lq'
    Failed to compile assembly for 'video_orc_chroma_down_v2_u16'
    Failed to compile assembly for 'video_orc_chroma_down_v4_u16'
Yeah, for those functions you would then have to use the backup C code, which can also be generated. Maybe orcc should output that automatically if compiling the assembly failed. Of course implementing loadupdb for NEON would be even better, but making it work with a fallback to the backup functions would be good to have in any case.
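To make the fallback idea concrete, here is a minimal sketch of a build-time choice between a static assembly routine and a plain C backup. The backup mirrors what the single "storel" opcode of video_test_src_orc_splat_u32 does (store p1 into every 32-bit element of d1). All names and the SKETCH_HAVE_STATIC_ASM macro are hypothetical; this is not what orcc emits.

```c
#include <stdint.h>
#include <string.h>

/* Plain C backup: write the 32-bit value p1 into each of n slots.
 * memcpy is used so that no alignment of d1 is assumed. */
static void
sketch_splat_u32_backup (uint8_t *d1, int p1, int n)
{
  int i;

  for (i = 0; i < n; i++)
    memcpy (d1 + 4 * (size_t) i, &p1, 4);
}

#ifdef SKETCH_HAVE_STATIC_ASM
/* Would be provided by a separately assembled file (assumption). */
void sketch_splat_u32_asm (uint8_t *d1, int p1, int n);
#endif

/* Public entry point: use the assembly when the build produced it,
 * otherwise fall back to the C backup. */
void
sketch_splat_u32 (uint8_t *d1, int p1, int n)
{
#ifdef SKETCH_HAVE_STATIC_ASM
  sketch_splat_u32_asm (d1, p1, n);
#else
  sketch_splat_u32_backup (d1, p1, n);
#endif
}
```

With this shape, a function whose assembly failed to generate (for example because of a missing opcode) still links and runs, just without the SIMD speed-up.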
There's also a problem with *.orc files generating asm code with labels. The compiler spits out a number of errors like:

    tmp-orc.s:6058:1: error: invalid symbol redefinition .L10:

The fix is to generate labels containing only digits. This is explained here:
http://stackoverflow.com/questions/3898435/labels-in-gcc-inline-assembly
and here:
http://stackoverflow.com/questions/14506151/invalid-symbol-redefinition-in-inline-asm-on-llvm

This will require a patch in the orc component.
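As a rough illustration of the "labels containing only digits" fix: GNU-style assemblers accept numeric local labels such as "10:", which may be defined more than once, and a branch refers to the nearest one backward ("10b") or forward ("10f"). The sketch below shows how a code generator could format such labels and branches. The helper names are invented for this example and are not part of orc.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a numeric local label definition, e.g. "10:". */
int
sketch_format_label (char *buf, size_t size, int label)
{
  return snprintf (buf, size, "%d:", label);
}

/* Format a branch to the nearest numeric label.
 * cond is an ARM condition suffix ("" for unconditional, "gt", ...);
 * forward != 0 selects the next definition ("10f"), otherwise the
 * previous one ("10b"). */
int
sketch_format_branch (char *buf, size_t size, const char *cond,
    int label, int forward)
{
  return snprintf (buf, size, "b%s %d%c", cond, label, forward ? 'f' : 'b');
}
```

Because numeric labels can be reused, emitting the same label number in several functions of one .s file no longer triggers a symbol redefinition error, at the cost of the generator having to know whether each branch target lies before or after the branch.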
Also, NEON is missing implementations for: muld convfd convld convdf convdl mulslq
ldreslinl ldresnearl splitql
Having enabled static assembly compilation by orc for the videotestsrc component (I chose this one since it has just one, very trivial, orc-implemented function), I've made a successful iOS SDK build. But running the "videotestsrc ! autovideosink" pipeline, I instantly get an EXC_ARM_DA_ALIGN from inside video_test_src_orc_splat_u32. The error is raised at this instruction:

    vst1.64 { d4, d5 }, [r2,:128]

The address referenced in the error message is not 16-byte aligned, as the instruction requires. I've tried setting the clang flag -mno-unaligned-access, but no luck. I have no idea how to trace back which piece of memory in the C code lacks the specified alignment.

Also, is there any way to generate at least close to NEON-optimized assembly code in clang/gcc, so that we won't have to start from scratch with an asm implementation for the functions that are missing it?
(In reply to comment #14)
> Having enabled static assembly compilation by orc for videotestsrc component
> (chose this one since it has just one most trivial orc-implemented function)
> i've made a successful iOS sdk build. But running the "videotestsrc !
> autovideosink" pipeline i instantly get a EXC_ARM_DA_ALIGN from inside of
> video_test_src_orc_splat_u32.
> Error is dumped at this instruction.
> vst1.64 { d4, d5 }, [r2,:128]
> The address referenced in error message is not 16-bytes aligned as required by
> instruction.
> I've tried setting clang flag -mno-unaligned-access but no luck. No idea how to
> track back in C code what piece of memory lacks alignment specified.

In theory ORC should emit some assembly that makes sure the memory is correctly aligned, i.e. that processes the first part of the memory without the SIMD instructions and only then uses the vector instructions. However, it assumes that the memory is at least aligned to the "unit size", i.e. in this case 4 bytes. Maybe that assembly code is not emitted when outputting static assembly and is only used for the JIT, or there's a bug in the handling of that instruction. The C code should have proper alignment in any case.

> Also, is there any way to generate at least close to neon-optimized assembly
> code in clang/gcc so that we won't have to start from scratch with asm
> implementation for the functions that missing it?

All these ORC opcodes should map to <10 SIMD instructions, or would not be possible to implement at all for one backend. Often there's a 1:1 mapping between ORC opcodes and the SIMD instructions.

Note that support for generating something (C backup code, which quite often can be vectorized by compilers too) for functions that use unimplemented opcodes must be there. There will always be opcodes that are not implemented in one backend or another.
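The alignment handling described above (scalar stores until the pointer is aligned, then vector stores, then a scalar tail) can be sketched in plain C as follows. This is an illustration of the general head/body/tail pattern, not ORC's generated code; the function name is hypothetical, and the 4-element inner loop merely stands in for a 16-byte store such as vst1.64 { d4, d5 }, [r2,:128]. Like ORC, it assumes dst is at least 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill n 32-bit elements with value.
 * Returns the number of elements written by the scalar head loop. */
size_t
sketch_splat_u32_aligned (uint32_t *dst, uint32_t value, size_t n)
{
  size_t i = 0;
  size_t head;

  /* Head: one element at a time until dst + i is 16-byte aligned. */
  while (i < n && ((uintptr_t) (dst + i) & 15u) != 0)
    dst[i++] = value;
  head = i;

  /* Body: dst + i is now 16-byte aligned, so each group of four
   * elements is a legal target for a 128-bit aligned store. */
  for (; i + 4 <= n; i += 4) {
    dst[i + 0] = value;
    dst[i + 1] = value;
    dst[i + 2] = value;
    dst[i + 3] = value;
  }

  /* Tail: fewer than four elements remain. */
  for (; i < n; i++)
    dst[i] = value;

  return head;
}
```

If the head loop is missing from the statically generated assembly, the first 128-bit store hits whatever alignment the caller's buffer happens to have, which would match the EXC_ARM_DA_ALIGN seen above.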
Thanks for working on this btw!
Denis, any progress on this?
I think the best here would be to let orcc output C backup code and inline assembly (or inside a separate file is ok too), and then let it select at runtime which one to use. That way this can also be used on e.g. Android.
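A minimal sketch of the runtime selection suggested here: both implementations are compiled in, and a function pointer is picked once based on a CPU query. Everything below is hypothetical naming for illustration. The CPU query is a stub that always reports "no NEON", and the NEON routine is only referenced when SKETCH_HAVE_NEON_ASM is defined by the build.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*SketchSplatFunc) (uint8_t *d1, int p1, int n);

/* C backup: store p1 into each of n 32-bit slots. */
static void
sketch_splat_c (uint8_t *d1, int p1, int n)
{
  int i;

  for (i = 0; i < n; i++)
    memcpy (d1 + 4 * (size_t) i, &p1, 4);
}

/* Stand-in for a real runtime CPU check (assumption: a real build
 * would query the OS here instead of returning a constant). */
int
sketch_cpu_has_neon (void)
{
  return 0;
}

#ifdef SKETCH_HAVE_NEON_ASM
/* Would come from a separately assembled .S file (assumption). */
void sketch_splat_neon (uint8_t *d1, int p1, int n);
#endif

/* Pick the best implementation available on this CPU. */
SketchSplatFunc
sketch_select_splat (void)
{
#ifdef SKETCH_HAVE_NEON_ASM
  if (sketch_cpu_has_neon ())
    return sketch_splat_neon;
#endif
  return sketch_splat_c;
}
```

A caller would typically run sketch_select_splat () once, cache the returned pointer, and call through it afterwards, which keeps the per-call overhead to one indirect call.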
Created attachment 300959
Support for static code compilation for iOS compatibility and various orc assembly fixes.

I'm hereby submitting an initial patch that encapsulates the work I've been doing on my own orc fork at https://github.com/ijsf/OpenWebRTC-orc. This fork was used to add support for static code compilation, as described in this issue, which is necessary for proper performance on iOS.

The patch contains a couple of things:

* Run-time NEON detection for ARM CPUs on iOS. Not really relevant for static code compilation, but may prove to be useful for future reference.
* Assembly fixes for ARM NEON, including the addition of alignment directives, label refactoring and minor documentation changes.
* Fix for invalid offset pointers in NEON assembly code due to orc's erroneous use of "sizeof" at compile-time (see https://github.com/ijsf/OpenWebRTC-orc/issues/4).
* Support for --static-implementation and --outputasm arguments. This will effectively assemble the orc code at compile-time into a separate .S file, accompanied by the usual .c file wrapping the functions. These can both be linked into the project.

The separate .S file construction was absolutely necessary due to bugs in Apple's LLVM compiler (these have been filed with Apple) that make inline assembly impossible. Intrinsics would have been a good alternative, but they would require a rewrite of the entire NEON compiler. In the end, this was the easiest and least invasive solution.
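To illustrate why "sizeof" evaluated at compile-time can yield wrong offsets, the sketch below compares the offset of a parameter slot as seen by the machine running the generator with the offset on a 32-bit ARM target. The field list follows the OrcExecutor members shown in the debugger dump earlier in this bug; the struct names, the helper functions, and the parameter index 24 are assumptions for this example (the index is consistent with the "24" in the bytecode and with the disassembly, but is not taken from orc's headers here). This is one plausible reading of the linked issue, not a description of the actual patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VAR_P1 24   /* assumed index of the first parameter */

/* Layout as seen by the host that runs the code generator:
 * pointer size follows the host (8 bytes on a 64-bit machine). */
typedef struct {
  void *program;
  int n;
  int counter1, counter2, counter3;
  void *arrays[64];
  int params[64];
  int accumulators[4];
} SketchExecutorHost;

/* Layout on the ARMv7 target: pointers are 4 bytes wide. */
typedef struct {
  uint32_t program;
  int32_t n;
  int32_t counter1, counter2, counter3;
  uint32_t arrays[64];
  int32_t params[64];
  int32_t accumulators[4];
} SketchExecutorArm32;

/* Offset of params[var] computed with the host's layout. */
size_t
sketch_param_offset_host (int var)
{
  return offsetof (SketchExecutorHost, params) + (size_t) var * sizeof (int);
}

/* Offset of params[var] computed with the 32-bit target layout. */
size_t
sketch_param_offset_arm32 (int var)
{
  return offsetof (SketchExecutorArm32, params) + (size_t) var * 4;
}
```

With the target layout, sketch_param_offset_arm32 (SKETCH_VAR_P1) evaluates to 0x174, which is the immediate in the "add r1, r0, #0x174" instruction of the JIT output shown earlier. On a 64-bit build host the host-layout function gives a larger value, so baking the host's number into static ARMv7 assembly would read the parameter from the wrong place.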
Great work :) Does the wrapping C file do the CPU detection you mentioned and, in the worst case, fall back to the C implementations of the functions? And do you support creating static assembly for multiple targets, and then selecting the best available one at runtime?
There's no support for any run-time type checking; this patch does everything fully static, assuming the CPU specifications are known at compile-time, as is the case for iOS. Your points would certainly improve upon that though. Since iOS is currently the only platform requiring these changes, I would not advise the use of this option for anything else (e.g. Android) as is.
Created attachment 301025
Support for static code compilation for iOS compatibility and various orc assembly fixes. REV 2.0

Forgot proper file closure, updated patch (REV 2.0).
Created attachment 301574
Support for static code compilation for iOS compatibility and various orc assembly fixes. REV 3.0

Here's a critical update that fixes the corruption of immediate values read out of the OrcExecutor structure. The implementation has been tested further and the orc functions seem to be working. Note that this implementation only covers ARMv7 NEON, not AArch64.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/orc/issues/5.