Bug 742843 - ORC compiler is disabled on the iOS devices
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: orc
Version: 1.4.5
Hardware: Other Mac OS
Importance: Normal normal
Target Milestone: git master
Assigned To: GStreamer Maintainers
Reported: 2015-01-13 12:41 UTC by Denis
Modified: 2018-11-03 10:47 UTC
Attachments
Support for static code compilation for iOS compatibility and various orc assembly fixes. (27.90 KB, patch)
2015-04-04 20:48 UTC, Cecill Etheredge (ijsf)
Support for static code compilation for iOS compatibility and various orc assembly fixes. REV 2.0 (28.17 KB, patch)
2015-04-06 15:07 UTC, Cecill Etheredge (ijsf)
Support for static code compilation for iOS compatibility and various orc assembly fixes. REV 3.0 (28.08 KB, patch)
2015-04-14 20:58 UTC, Cecill Etheredge (ijsf)

Description Denis 2015-01-13 12:41:52 UTC
Here’s the ORC output in the log. Init stage:

ORC: INFO: orcdebug.c(70): void _orc_debug_init()(): orc-0.4.23.1 debug init
ORC: INFO: orcprogram-neon.c(129): void orc_neon_init()(): marking neon backend non-executable

and then there are continuous warnings like this one during pipeline execution:

ORC: WARNING: orccompiler.c(392): OrcCompileResult orc_program_compile_full(OrcProgram *, OrcTarget *, unsigned int)(): program orc_combine4_12xn_u8 failed to compile, reason: Compilation disabled, using emulation

There’s nothing more specific about why NEON is disabled, but tracing with a debugger shows that orc_arm_get_cpu_flags in orccpu-arm.c has practically no executable code for iOS (one branch is #if'd for Linux only and one for Android only), so it always returns 0 (which means no NEON support).
Verified on an iPad 3rd gen and an iPad mini 1st gen.
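
As an illustration of the gap described above, here is a minimal sketch of a compile-time check that could stand in for the missing iOS branch. The function name and the flag constant are hypothetical and are not ORC's real identifiers; the only real names used are the compiler-predefined macros `__APPLE__`, `__ARM_NEON` and `__ARM_NEON__`. It assumes the build target already guarantees NEON, which is a different approach from run-time detection.

```c
#include <stdint.h>

/* Hypothetical flag bit for this sketch; not ORC's real constant. */
#define SKETCH_ARM_FLAG_NEON (1u << 0)

/* Returns CPU feature flags decided at compile time.
 * __ARM_NEON and __ARM_NEON__ are predefined by GCC and Clang
 * when the target has NEON enabled. */
static uint32_t
sketch_arm_get_cpu_flags (void)
{
  uint32_t flags = 0;

#if defined(__APPLE__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
  /* Assumption: the Apple ARM build target already guarantees NEON. */
  flags |= SKETCH_ARM_FLAG_NEON;
#endif

  return flags;
}
```

On any other platform this sketch returns 0, which mirrors the behaviour reported above.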

If I apply a hack to orc_arm_get_cpu_flags so that it returns the NEON support flag, the ORC compiler is enabled and works well on the iPad mini 1st gen. On the iPad 3rd gen I’m getting segfaults from different gstreamer->orc bridges like video_orc_chroma_up_v2_u8 (videoscale plugin), video_test_src_orc_splat_u32 (videotestsrc) etc.
Per the hardware info, the iPad 3rd gen uses the A5X chip while the iPad mini (1st gen) uses the A5, so it’s really strange that orc works OK with the latter but not with the former.

Example stack trace for the iPad 3 segfault:

  • #0 0x02ec8e90
  • #1 video_test_src_orc_splat_u32 at /Users/D/cerbero/sources/ios_universal/armv7/gst-plugins-base-1.0-static-1.5/gst/videotestsrc/tmp-orc.c:215
  • #2 gst_video_test_src_smpte at /Users/D/cerbero/sources/ios_universal/armv7/gst-plugins-base-1.0-static-1.5/gst/videotestsrc/videotestsrc.c:350
  • #3 gst_video_test_src_fill at /Users/D/cerbero/sources/ios_universal/armv7/gst-plugins-base-1.0-static-1.5/gst/videotestsrc/gstvideotestsrc.c:951
  • #4 gst_base_src_default_create at /Users/D/cerbero/sources/ios_universal/armv7/gstreamer-1.0-1.5/libs/gst/base/gstbasesrc.c:1482
  • #5 gst_base_src_get_range at /Users/D/cerbero/sources/ios_universal/armv7/gstreamer-1.0-1.5/libs/gst/base/gstbasesrc.c:2455
  • #6 gst_base_src_loop at /Users/D/cerbero/sources/ios_universal/armv7/gstreamer-1.0-1.5/libs/gst/base/gstbasesrc.c:2731
  • #7 gst_task_func at /Users/D/cerbero/sources/ios_universal/armv7/gstreamer-1.0-1.5/gst/gsttask.c:316
  • #8 g_thread_pool_thread_proxy at /Users/D/cerbero/sources/ios_universal/armv7/glib-2.42.0/glib/gthreadpool.c:307
  • #9 g_thread_proxy at /Users/D/cerbero/sources/ios_universal/armv7/glib-2.42.0/glib/gthread.c:764
  • #10 _pthread_body
  • #11 _pthread_start

Variables for frame 1:

d1	guint8 *	""	0x03181000
p1	int	-2139034625	-2139034625
n	int	91	91
_ex	OrcExecutor		
program	OrcProgram *	NULL	0x00000000
n	int	91	91
counter1	int	108279884	108279884
counter2	int	460800	460800
counter3	int	16	16
arrays	void *[64]		
params	int [64]		
accumulators	int [4]		
ex	OrcExecutor *	NULL	0x00000000
program	OrcProgram *	NULL	
n	int		
counter1	int		
counter2	int		
counter3	int		
arrays	void *[64]		
params	int [64]		
accumulators	int [4]		
func	void (*)(OrcExecutor *)	NULL	
p	OrcProgram *	NULL	

Code 

void
video_test_src_orc_splat_u32 (guint8 * ORC_RESTRICT d1, int p1, int n)
{
  OrcExecutor _ex, *ex = &_ex;
  static volatile int p_inited = 0;
  static OrcCode *c = 0;
  void (*func) (OrcExecutor *);

  if (!p_inited) {
    orc_once_mutex_lock ();
    if (!p_inited) {
      OrcProgram *p;

#if 1
      static const orc_uint8 bc[] = {
        1, 9, 28, 118, 105, 100, 101, 111, 95, 116, 101, 115, 116, 95, 115, 114, 
        99, 95, 111, 114, 99, 95, 115, 112, 108, 97, 116, 95, 117, 51, 50, 11, 
        4, 4, 16, 4, 128, 0, 24, 2, 0, 
      };
      p = orc_program_new_from_static_bytecode (bc);
      orc_program_set_backup_function (p, _backup_video_test_src_orc_splat_u32);
#else
      p = orc_program_new ();
      orc_program_set_name (p, "video_test_src_orc_splat_u32");
      orc_program_set_backup_function (p, _backup_video_test_src_orc_splat_u32);
      orc_program_add_destination (p, 4, "d1");
      orc_program_add_parameter (p, 4, "p1");

      orc_program_append_2 (p, "storel", 0, ORC_VAR_D1, ORC_VAR_P1, ORC_VAR_D1, ORC_VAR_D1);
#endif

      orc_program_compile (p);
      c = orc_program_take_code (p);
      orc_program_free (p);
    }
    p_inited = TRUE;
    orc_once_mutex_unlock ();
  }
  ex->arrays[ORC_VAR_A2] = c;
  ex->program = 0;

  ex->n = n;
  ex->arrays[ORC_VAR_D1] = d1;
  ex->params[ORC_VAR_P1] = p1;

  func = c->exec; 
  func (ex);                 /* <---------------- SEGFAULT HERE */
}
Comment 1 Sebastian Dröge (slomo) 2015-01-13 12:51:00 UTC
Do you know if it actually executed anything from the function? Or did it just explode trying to execute the first instruction (because the memory was not executable maybe)?
Comment 2 Denis 2015-01-13 13:21:06 UTC
Update: it turns out the ORC segfault issue is not device related. The iPad mini was
initially running iOS 7.1.1. After I updated to 8.1.2 I get the same
segfaults on the iPad mini as well. So something changed in iOS 8 that broke the ORC
compiler!
Comment 3 Denis 2015-01-13 13:23:34 UTC
The crash is reported on first reaching the last line of the mentioned bridge functions, i.e. when I step in the debugger, the segfault is reported as soon as I reach this line:

func (ex);
Comment 4 Sebastian Dröge (slomo) 2015-01-13 13:37:30 UTC
That's expected, as there is no debug information or anything else about the function. It is dynamically generated at runtime (which might be the problem).

I think if you set the debugger to assembly mode it would be possible to step in there.
Comment 5 Denis 2015-01-13 13:49:44 UTC
Stepping into the 'func(ex)' function call with 

(lldb)si 

shows:

0x2fb6e90:  add    r1, r0, #0x174
0x2fb6e94:  vld1.32 {d4[], d5[]}, [r1]
0x2fb6e98:  ldr    r2, [r0, #4]
0x2fb6e9c:  cmp    r2, #0x40
0x2fb6ea0:  bgt    0x2fb6ed8
0x2fb6ea4:  asr    r1, r2, #2
0x2fb6ea8:  str    r1, [r0, #12]
0x2fb6eac:  and    r2, r2, #0x3

then 

(lldb)ni

and i get the segfault.
Comment 6 Sebastian Dröge (slomo) 2015-01-13 14:02:39 UTC
Ok, so we don't actually get executable memory here I guess. For iOS we should IMHO also just let orc output the assembly during compilation and use that directly instead of going through the orc JIT at runtime. But that requires build system changes.
Comment 7 Denis 2015-01-13 14:34:37 UTC
The strange thing is that this worked with iOS 7.1.1.
Comment 8 Sebastian Dröge (slomo) 2015-01-13 14:38:51 UTC
Maybe we have to use a new way for allocating executable memory. Officially this is not supported by iOS to prevent JITs for performance reasons or something like that.
Comment 9 Denis 2015-01-14 20:58:04 UTC
I'm trying to locally change the orcc command line in orc.mak to generate asm code during the recipe build.
The loadupdb implementation is missing for NEON, which breaks asm code generation for a bunch of functions:

Failed to compile assembly for 'video_orc_unpack_I420'
Failed to compile assembly for 'video_orc_unpack_YUV9'
Failed to compile assembly for 'video_orc_unpack_A420'
Failed to compile assembly for 'video_orc_resample_bilinear_u32'
Failed to compile assembly for 'video_orc_convert_I420_AYUV'
Failed to compile assembly for 'video_orc_convert_I420_BGRA'
Failed to compile assembly for 'video_orc_resample_h_near_u32_lq'
Failed to compile assembly for 'video_orc_resample_h_2tap_4u8_lq'
Failed to compile assembly for 'video_orc_chroma_down_v2_u16'
Failed to compile assembly for 'video_orc_chroma_down_v4_u16'
Comment 10 Sebastian Dröge (slomo) 2015-01-15 10:16:32 UTC
Yeah, for those functions you would then have to use the backup C code, which can also be generated. Maybe orcc should output that automatically if compiling the assembly failed.

Of course implementing loadupdb for NEON would be even better, but making it work with fallback to the backup functions would be good to have in any case.
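
For reference, a plain C loop of the kind a backup function could contain for this opcode might look like the sketch below. It assumes that loadupdb means "load upsampled duplicate", i.e. each source byte is used for two consecutive output elements; that reading and the function name are assumptions of this sketch and should be checked against the ORC opcode documentation.

```c
#include <stdint.h>

/* Plain C loop for a "load upsampled duplicate" step: every source byte
 * is written to two consecutive destination bytes.
 * Assumed semantics; hypothetical helper, not generated by orcc. */
static void
sketch_loadupdb_u8 (uint8_t *dst, const uint8_t *src, int n)
{
  int i;

  for (i = 0; i < n; i++)
    dst[i] = src[i >> 1];
}
```

A loop this simple is also a reasonable candidate for the compiler's own auto-vectorizer.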
Comment 11 Denis 2015-01-15 12:56:03 UTC
There's also a problem with *.orc files generating asm code with labels. The compiler spits out a number of errors like:

tmp-orc.s:6058:1: error: invalid symbol redefinition
.L10:


The fix is to generate labels containing only digits. This is explained here: 

http://stackoverflow.com/questions/3898435/labels-in-gcc-inline-assembly

and here

http://stackoverflow.com/questions/14506151/invalid-symbol-redefinition-in-inline-asm-on-llvm

This will require a patch in the orc component.
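
A minimal sketch of what digit-only labels look like from the emitter's side follows. The function name is hypothetical and this is not orc's actual code generator; it only shows the assembler convention that a numeric local label such as `1:` may be defined many times in one file, with `1b` referring to the nearest earlier definition (and `1f` to the nearest later one).

```c
#include <stdio.h>

/* Writes a tiny ARM loop into buf, using a numeric local label.
 * "1:" can be redefined freely within one file; "1b" refers to the
 * nearest definition looking backward. Returns snprintf's result.
 * Hypothetical emitter for this sketch; not orc's code generator. */
static int
sketch_emit_loop (char *buf, size_t size, const char *func_name)
{
  return snprintf (buf, size,
      "%s:\n"
      "1:\n"
      "  subs r2, r2, #1\n"
      "  bne 1b\n"
      "  bx lr\n",
      func_name);
}
```

Emitting this for several functions into the same file repeats `1:` without triggering the "invalid symbol redefinition" error that a named label like `.L10:` causes.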
Comment 12 Denis 2015-01-15 14:26:58 UTC
NEON is also missing implementations for:
muld
convfd
convld
convdf
convdl
mulslq
Comment 13 Denis 2015-01-15 15:51:04 UTC
ldreslinl
ldresnearl
splitql
Comment 14 Denis 2015-01-16 18:45:07 UTC
Having enabled static assembly compilation by orc for the videotestsrc component (I chose this one since it has just one, very trivial, orc-implemented function), I've made a successful iOS SDK build. But when running the "videotestsrc ! autovideosink" pipeline I instantly get an EXC_ARM_DA_ALIGN from inside video_test_src_orc_splat_u32.
The error is raised at this instruction:
vst1.64 { d4, d5 }, [r2,:128]
The address referenced in the error message is not 16-byte aligned as required by the instruction.
I've tried setting the clang flag -mno-unaligned-access, but no luck. I have no idea how to trace back in the C code which piece of memory lacks the specified alignment.

Also, is there any way to generate something at least close to NEON-optimized assembly code with clang/gcc, so that we won't have to start from scratch with an asm implementation for the functions that are missing it?
Comment 15 Sebastian Dröge (slomo) 2015-01-17 11:14:06 UTC
(In reply to comment #14)
> Having enabled static assembly compilation by orc for videotestsrc component
> (chose this one since it has just one most trivial orc-implemented function)
> I've made a successful iOS SDK build. But running the "videotestsrc !
> autovideosink" pipeline I instantly get an EXC_ARM_DA_ALIGN from inside of
> video_test_src_orc_splat_u32.
> Error is dumped at this instruction. 
> vst1.64 { d4, d5 }, [r2,:128]
> The address referenced in error message is not 16-byte aligned as required by
> instruction.
> I've tried setting clang flag -mno-unaligned-access but no luck. No idea how to
> track back in C code what piece of memory lacks alignment specified.

In theory ORC should emit some assembly that makes sure the memory is correctly aligned, i.e. that processes the first part of the memory without the SIMD instructions and only then uses the vector instructions.
However it assumes that the memory is at least aligned to the "unit size", i.e. in this case 4 bytes.

Maybe that assembly code is not emitted when outputting static assembly and only used for the JIT, or there's a bug in the handling of that instruction.

The C code should have proper alignment in any case.
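
The head/body/tail structure described above can be sketched in portable C as follows. This is not the code ORC emits; the function name is hypothetical, and the "vector" body is written as four scalar stores so that the sketch runs anywhere. It relies on the same assumption stated above, that the destination is aligned to the 4-byte unit size.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills n 32-bit words at d with value. Assumes d is 4-byte aligned.
 * Hypothetical illustration; not ORC output. */
static void
sketch_splat_u32 (uint32_t *d, uint32_t value, size_t n)
{
  size_t i = 0;

  /* Head: scalar stores until d + i sits on a 16-byte boundary. */
  while (i < n && ((uintptr_t) (d + i) & 15) != 0)
    d[i++] = value;

  /* Body: four words per step; this is where a 128-bit aligned
   * vector store such as vst1.64 {d4, d5}, [r2,:128] would be safe. */
  for (; i + 4 <= n; i += 4) {
    d[i] = value;
    d[i + 1] = value;
    d[i + 2] = value;
    d[i + 3] = value;
  }

  /* Tail: remaining words, scalar again. */
  for (; i < n; i++)
    d[i] = value;
}
```

If the statically generated assembly skips the head loop, the first aligned store can hit an address that is only 4-byte aligned, which would match the EXC_ARM_DA_ALIGN symptom.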

> Also, is there any way to generate at least close to neon-optimized assembly
> code in clang/gcc so that we won't have to start from scratch with asm
> implementation for the functions that are missing it?

All these ORC opcodes should map to <10 SIMD instructions, or would not be possible to implement at all for one backend. Often there's a 1:1 mapping between ORC opcodes and the SIMD instructions.


Note that support for generating something (C backup code, which quite often can be vectorized by compilers too) for functions that use unimplemented opcodes must be there. There will always be opcodes that are not implemented in one backend or another.
Comment 16 Sebastian Dröge (slomo) 2015-01-17 11:14:39 UTC
Thanks for working on this btw!
Comment 17 Sebastian Dröge (slomo) 2015-02-17 07:45:30 UTC
Denis, any progress on this?
Comment 18 Sebastian Dröge (slomo) 2015-02-17 07:50:21 UTC
I think the best approach here would be to let orcc output C backup code and inline assembly (a separate file is OK too), and then let it select at runtime which one to use. That way this can also be used on e.g. Android.
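
One possible shape for that runtime selection is sketched below. All names are hypothetical and none of this is orcc output: `asm_impl` stands for the symbol from a statically assembled file (or NULL when assembly generation failed for that function), and `have_neon` stands for the result of whatever CPU check the platform offers.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*SketchSplatFunc) (uint32_t *d, uint32_t value, size_t n);

/* C backup implementation, always available. */
static void
sketch_splat_u32_backup (uint32_t *d, uint32_t value, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    d[i] = value;
}

/* Picks an implementation once. asm_impl would be the symbol from the
 * statically assembled file, or NULL if it could not be generated;
 * have_neon would come from a CPU check. Both are assumptions here. */
static SketchSplatFunc
sketch_select_splat (SketchSplatFunc asm_impl, int have_neon)
{
  if (asm_impl != NULL && have_neon)
    return asm_impl;
  return sketch_splat_u32_backup;
}
```

Storing the returned pointer once per function keeps the per-call cost to a single indirect call, and functions that use unimplemented opcodes simply always resolve to the backup.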
Comment 19 Cecill Etheredge (ijsf) 2015-04-04 20:48:18 UTC
Created attachment 300959
Support for static code compilation for iOS compatibility and various orc assembly fixes.

I'm hereby submitting an initial patch that encapsulates the work I've been doing on my own orc fork at https://github.com/ijsf/OpenWebRTC-orc. This fork was used to add support for static code compilation, as described in this issue, which is necessary for proper performance on iOS.

The patch contains a couple of things:

* Run-time NEON detection for ARM CPUs on iOS. Not really relevant for static code compilation, but may prove to be useful for future reference.
* Assembly fixes for ARM NEON including addition of alignment directives, label refactoring and minor documentation changes.
* Fix for invalid offset pointers in NEON assembly code due to orc's erroneous use of "sizeof" at compile-time (see https://github.com/ijsf/OpenWebRTC-orc/issues/4).
* Support for --static-implementation and --outputasm arguments. This will effectively assemble the orc code at compile-time into a separate .S file, accompanied by the usual .c file wrapping the functions. These can both be linked into the project.
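
To illustrate why a host-side "sizeof" can produce wrong offsets when assembly is generated at build time, here is a small sketch that computes the byte offset of a parameter slot for an explicit target pointer size. The layout follows the OrcExecutor fields shown in the debugger dump earlier in this report (program, n, counter1..3, arrays[64], params[64]); the helper name, the assumption that int is 4 bytes, and the value 24 for the first parameter index are assumptions of this sketch. Whether this is exactly the mismatch described in the linked issue has not been confirmed here.

```c
#include <stddef.h>

#define SKETCH_N_VARIABLES 64   /* matches arrays[64] / params[64] */
#define SKETCH_VAR_P1 24        /* assumed index of the first parameter */

/* Byte offset of params[var] inside an OrcExecutor-like struct for a
 * target whose pointers are target_ptr_size bytes wide (4 or 8).
 * Hypothetical helper; assumes 4-byte int. */
static size_t
sketch_param_offset (size_t target_ptr_size, int var)
{
  /* program pointer, then n, counter1, counter2, counter3 (4 bytes each) */
  size_t offset = target_ptr_size + 4 * 4;

  /* arrays[] holds pointers, so round up to pointer alignment */
  offset = (offset + target_ptr_size - 1) / target_ptr_size * target_ptr_size;

  offset += SKETCH_N_VARIABLES * target_ptr_size;  /* skip arrays[] */
  return offset + (size_t) var * 4;                /* index into params[] */
}
```

With 4-byte pointers the first parameter lands at 0x174, which agrees with the `add r1, r0, #0x174` in the disassembly from comment 5. With 8-byte pointers, as on a typical 64-bit build host, the same slot would be at 632, so an offset taken from the host's struct layout would point the ARMv7 code at the wrong memory.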

The separate .S file construction was absolutely necessary due to bugs in Apple's LLVM compiler (these have been filed with Apple) that make inline assembly impossible. Intrinsics would have been a good alternative, but that would require a rewrite of the entire NEON compiler. In the end, this was the easiest and least invasive solution.
Comment 20 Sebastian Dröge (slomo) 2015-04-05 01:12:52 UTC
Great work :)

Does the C wrapper file do the CPU detection you mentioned and, in the worst case, fall back to the C implementations of the functions? And do you support creating static assembly for multiple targets and then selecting the best available one at runtime?
Comment 21 Cecill Etheredge (ijsf) 2015-04-05 12:40:30 UTC
There's no support for any run-time type checking; this patch does everything fully static, assuming the CPU specifications are known at compile-time, as is the case for iOS.

Your points would certainly improve upon that though. Since iOS is currently the only platform requiring these changes, I would not advise the use of this option for anything else (e.g. Android) as is.
Comment 22 Cecill Etheredge (ijsf) 2015-04-06 15:07:28 UTC
Created attachment 301025
Support for static code compilation for iOS compatibility and various orc assembly fixes. REV 2.0

Forgot proper file closure, updated patch (REV 2.0).
Comment 23 Cecill Etheredge (ijsf) 2015-04-14 20:58:05 UTC
Created attachment 301574
Support for static code compilation for iOS compatibility and various orc assembly fixes. REV 3.0

Here's a critical update that fixes the corruption of immediate values read out of the OrcExecutor structure. The implementation has been tested further and the orc functions seem to be working.

Note that this implementation only covers ARMv7 NEON, and not AArch64 support.
Comment 24 GStreamer system administrator 2018-11-03 10:47:30 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/orc/issues/5.