Bug 753531 - GNOME shell crashing on VMware

Status: RESOLVED NOTGNOME
Product: cogl
Classification: Platform
Component: EGL
Version: git master
OS/Hardware: Other Windows
Importance: Normal normal
Target Milestone: ---
Assigned To: Cogl maintainer(s)
QA Contact: Cogl maintainer(s)
Depends on:
Blocks:
Reported: 2015-08-11 21:38 UTC by Jimmy Jones
Modified: 2015-09-03 19:35 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments

defer texture access until we have master (8.90 KB, patch) - 2015-08-25 19:45 UTC, Ray Strode [halfline] - rejected
kms-winsys: drop support for old gbm versions (3.43 KB, patch) - 2015-08-27 18:49 UTC, Ray Strode [halfline] - none
kms-winsys: use correct surface format (5.31 KB, patch) - 2015-08-27 18:49 UTC, Ray Strode [halfline] - none
kms-winsys: clean up error handling in _cogl_winsys_renderer_connect (6.16 KB, patch) - 2015-08-27 18:49 UTC, Ray Strode [halfline] - none
kms-winsys: rename device_name to card_device_name (4.60 KB, patch) - 2015-08-27 18:49 UTC, Ray Strode [halfline] - none
kms-winsys: rework handling of the drm fd (29.22 KB, patch) - 2015-08-27 18:49 UTC, Ray Strode [halfline] - none
kms-winsys: support drm render nodes (24.82 KB, patch) - 2015-08-27 18:49 UTC, Ray Strode [halfline] - none
kms-winsys: don't disable page flipping on EACCES (2.39 KB, patch) - 2015-08-27 20:07 UTC, Ray Strode [halfline] - needs-work
i think this will fix gbm (8.60 KB, text/plain) - 2015-08-28 14:33 UTC, Ray Strode [halfline]

Description Jimmy Jones 2015-08-11 21:38:03 UTC
There are a number of downstream bug reports of gnome shell crashing sporadically when running in VMware. To quote one of the kernel module developers, "I think the root problem is gnome-shell(gdm) dropping its master privileges and then trying to render."

Downstream bug reports:

https://bugzilla.redhat.com/show_bug.cgi?id=1227193
https://bugs.freedesktop.org/show_bug.cgi?id=91098
http://lists.opensuse.org/opensuse-bugs/2015-06/msg01786.html
Comment 1 Ray Strode [halfline] 2015-08-25 13:38:04 UTC
trace from freedesktop bug:


Core was generated by `gnome-shell --mode=gdm --wayland --display-server'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  vmw_region_size at vmw_screen_ioctl.c line 76
#1  vmw_svga_winsys_surface_create
#2  svga_screen_surface_create at svga_screen_cache.c line 449
#3  svga_texture_create at svga_resource_texture.c line 729
#4  st_texture_create at state_tracker/st_texture.c line 97
#5  guess_and_alloc_texture at state_tracker/st_cb_texture.c line 464
#6  st_AllocTextureImageBuffer at state_tracker/st_cb_texture.c line 517
#7  st_TexImage at state_tracker/st_cb_texture.c line 875
#8  teximage at main/teximage.c line 3364
#9  _mesa_TexImage2D at main/teximage.c line 3403
#10 ?? from /usr/lib/libcogl.so.20
#11 ?? from /usr/lib/libcogl.so.20
#12 cogl_texture_allocate from /usr/lib/libcogl.so.20
#13 cogl_texture_2d_new_from_data from /usr/lib/libcogl.so.20
#14 pixbuf_to_cogl_texture at st/st-texture-cache.c line 473
#15 finish_texture_load at st/st-texture-cache.c line 518
#16 on_symbolic_icon_loaded at st/st-texture-cache.c line 553
#17 ?? from /usr/lib/libgio-2.0.so.0
#18 ?? from /usr/lib/libgio-2.0.so.0
#19 g_main_context_dispatch from /usr/lib/libglib-2.0.so.0
#20 ?? from /usr/lib/libglib-2.0.so.0
#21 g_main_loop_run from /usr/lib/libglib-2.0.so.0
#22 meta_run from /usr/lib/libmutter.so.0
#23 main at main.c line 463

Comment 2 Ray Strode [halfline] 2015-08-25 13:51:18 UTC
Thomas, the trace looks like it's only trying to allocate textures, not render to the scan out buffer, or set the scan out buffer (unless I'm missing something?)

I don't think allocating a texture in the background should require drm master. as I understand it, drm master is supposed to denote who can effect on screen drawing, e.g. call drmModeSetCrtc or drmModePageFlip

Right now we freeze the clutter master clock when VT switched away, so most rendering won't happen. I think the above trace is happening because it's in response to the icon getting loaded instead of in response to the master clock updating.

Still, going forward we're ultimately going to want to load sessions in the background.  See the discussion here:

https://github.com/magcius/weston/commit/410db59b64f5802ee57c6b508b6f7113fd2de7e3#commitcomment-12754597

(and the discussion linked from it)
Comment 3 Thomas Hellström 2015-08-25 14:12:30 UTC
(In reply to Ray Strode [halfline] from comment #2)
> Thomas, the trace looks like it's only trying to allocate textures, not
> render to the scan out buffer, or set the scan out buffer (unless I'm
> missing something?)

In the old drm authentication model, the master is also responsible for authenticating clients who then have full access to buffer objects and GPU memory. That means that when we switch master we must also protect the new master's data from access by the old master or the old master's clients. This is what's happening here.


> 
> I don't think allocating a texture in the background should require drm
> master. as I understand it, drm master is supposed to denote who can effect
> on screen drawing, e.g. call drmModeSetCrtc or drmModePageFlip

All ioctls requiring authentication are blocked on vmware for the above reason, although clients are only suspended, whereas previous masters receive an error instead (otherwise they'd deadlock and could never call drmSetMaster).
FWIW, I see invalid calls to the execbuf ioctl in the kernel log as well (GPU command submission).

It's not that VMware is particularly vulnerable to this security issue. Only that other driver writers haven't cared to implement a test for it.


> 
> Right now we freeze the clutter master clock when VT switched away, so most
> rendering won't happen. I think the above trace is happening because it's in
> response to the icon getting loaded instead of in response to the master
> clock updating.
> 
> Still, going forward we're ultimately going to want to load sessions in the
> background.  See the discussion here:
> 
> https://github.com/magcius/weston/commit/
> 410db59b64f5802ee57c6b508b6f7113fd2de7e3#commitcomment-12754597

I think the only correct way to do rendering or accessing the GPU when the master is switched away is to use DRM render nodes. That is, never do rendering through the modesetting fd when master is switched away, but do that through render nodes.

> 
> (and the discussion linked from it)

/Thomas
Comment 4 Ray Strode [halfline] 2015-08-25 14:56:47 UTC
(In reply to Thomas Hellström from comment #3)
> In the old drm authentication model, the master is also responsible for
> authenticating clients who then have full access to buffer objects and GPU
> memory. That means that when we switch master we must also protect the new
> master's data from access by the old master or the old master's clients.
> This is what's happening here.

are you talking about auth magic? My understanding is authenticating with any master using auth magic gives you access to all buffers. I think this security limitation was one of the reasons render nodes were created.

> All ioctls requiring authentication are blocked on vmware for the above
> reason, although clients are only suspended as opposed to previous masters
> that receive an error instead (Otherwise they'd deadlock and can never call
> drmSetMaster). 
I guess X also does glxSuspendClients, which is why we're seeing this primarily on the login screen (which doesn't use X)

> FWIW, I see invalid calls to the execbuf ioctl in the kernel log as well
> (GPU command submission).
okay, we're going to need to do offscreen rendering in the background going forward though.

> It's not that VMware is particularly vulnerable to this security issue. Only
> that other driver writers haven't cared to implement a test for it.
I could be wrong, but my understanding is this security issue is well known, and we just live with it.

> I think the only correct way to do rendering or accessing the GPU when the
> master is switched away is to use DRM render nodes. That is, never do
> rendering through the modesetting fd when master is swithched away, but do
> that through render nodes.
Right, I guess using a render node is the right way forward.  Still, I think VMware is behaving differently than the other drivers, trying to close up a hole that the other drivers live with (could be wrong, but that's my recollection of past discussions)
Comment 5 Thomas Hellström 2015-08-25 16:24:35 UTC
(In reply to Ray Strode [halfline] from comment #4)
> (In reply to Thomas Hellström from comment #3)
> > In the old drm authentication model, the master is also responsible for
> > authenticating clients who then have full access to buffer objects and GPU
> > memory. That means that when we switch master we must also protect the new
> > master's data from access by the old master or the old master's clients.
> > This is what's happening here.
> 
> are you talking about auth magic? My understanding is authenticating with
> any master using auth magic gives you access to all buffers. I think this
> security limitation was one of the reasons render nodes were created.

Yes, the auth magic mechanism.

Perhaps. Render-nodes were explicitly created to allow master-less rendering, and to my knowledge no user-space clients other than gnome-shell + gdm rely on being able to do it with legacy nodes.

> 
> > All ioctls requiring authentication are blocked on vmware for the above
> > reason, although clients are only suspended as opposed to previous masters
> > that receive an error instead (Otherwise they'd deadlock and can never call
> > drmSetMaster). 
> I guess X also does glxSuspendClients, which is why we're seeing this
> primarily on the login screen (which doesn't use X)

The error should only be visible when a master drops privileges and tries to render. In the X case it doesn't render when master drops and its dri clients are simply suspended by the vmwgfx kernel driver if they happen to do so. They are woken up when the X server regains master privileges, or killed if it dies.

> 
> > FWIW, I see invalid calls to the execbuf ioctl in the kernel log as well
> > (GPU command submission).
> okay, we're going to need to do offscreen rendering in the background going
> forward though.

That's perfectly fine if using render nodes.

> 
> > It's not that VMware is particularly vulnerable to this security issue. Only
> > that other driver writers haven't cared to implement a test for it.
> I could be wrong, but my understanding is this security issue is well known,
> and we just live with it.

It might be well known, but there are ways to fix this issue in the drivers. Some drivers apparently don't care. The Radeon driver is reported safe against this issue, as is the vmwgfx driver.

> 
> > I think the only correct way to do rendering or accessing the GPU when the
> > master is switched away is to use DRM render nodes. That is, never do
> > rendering through the modesetting fd when master is swithched away, but do
> > that through render nodes.
> Right, I guess using a render node is the right way forward.  Still, I think
> VMWare is behaving differently than the other drivers, trying to close up a
> hole that the other drivers live with (could be wrong, but that's my
> recollection of past discussions)

See above. It might be that we are more paranoid than others, but I do think that we should avoid at all costs introducing new user-space clients that rely upon a known security flaw and that break attempts to plug it!

/Thomas
Comment 6 Jasper St. Pierre (not reading bugmail) 2015-08-25 16:31:49 UTC
Unfortunately, render nodes aren't turned on by default yet, so I'm not sure how we should fix this.

Additionally, and unfortunately, gbm only lets you pass in a single fd, which is used for both importing scanout buffers, and rendering, so it's not easy to fix up userspace to pass around and plug through render nodes everywhere.

I'd imagine Weston has the same issue with clients rendering to the legacy node while the master is dropped, since it doesn't try to pause clients while master is gone.
Comment 7 Thomas Hellström 2015-08-25 16:44:06 UTC
(In reply to Jasper St. Pierre from comment #6)
> Unfortunately, render nodes aren't turned on by default yet, so I'm not sure
> how we should fix this.


> 
> Additionally, and unfortunately, gbm only lets you pass in a single fd,
> which is used for both importing scanout buffers, and rendering, so it's not
> easy to fix up userspace to pass around and plug through render nodes
> everywhere.

Is it perhaps possible to just have gnome-shell flush its rendering and avoid rendering when switched away?

> 
> I'd imagine Weston has the same issue with clients rendering to the legacy
> node while the master is dropped, since it doesn't try to pause clients
> while master is gone.

Clients are generally not the problem. In the vmwgfx case, the kernel driver can and will suspend them if they try to render. The problem is the master itself.

/Thomas
Comment 8 Ray Strode [halfline] 2015-08-25 18:17:05 UTC
Hi,

> > > It's not that VMware is particularly vulnerable to this security issue. Only
> > > that other driver writers haven't cared to implement a test for it.
> > I could be wrong, but my understanding is this security issue is well known,
> > and we just live with it.
> 
> It might be well known, but there are ways to fix this issue in the drivers.
> Some drivers apparently don't care. The Radeon driver is reported safe
> against this issue, as is the vmwgfx driver.
Let's make sure we're talking about the same issue. As I understand it the problem is:

1) One process needs to make a buffer available to another process
2) It accomplishes this by "naming" the buffer using the GEM FLINK ioctl
3) The naming pool is assigned from a global, incrementing counter that starts at 0.
4) Given that integer, other clients can read the buffer
5) this includes clients on other VTs
6) so a VT running in the background can look at the window contents of the session a user is actively using

Is this the security issue you're talking about? If so, how does the vmware driver deal with the opposite problem (a user in the active session looks at buffers from inactive sessions)? If this isn't the same issue you're talking about, can you explain to me the issue the vmware and radeon drivers have addressed that other drivers haven't?
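For reference, steps 2-5 above boil down to roughly the following against libdrm (untested illustration; the helper name and the 1024 cut-off are made up, and drm_fd must be an authenticated legacy-node fd):

#include <stdint.h>
#include <stdio.h>
#include <xf86drm.h>   /* drmIoctl(); pulls in drm.h for the GEM ioctls */

/* Walk the small, global FLINK name space and open whatever buffers
 * other clients have named. */
static void
snoop_flink_names (int drm_fd)
{
  uint32_t name;

  for (name = 1; name < 1024; name++)
    {
      struct drm_gem_open open_arg = { .name = name };

      if (drmIoctl (drm_fd, DRM_IOCTL_GEM_OPEN, &open_arg) == 0)
        printf ("name %u -> handle %u, %llu bytes of someone else's buffer\n",
                name, open_arg.handle, (unsigned long long) open_arg.size);
    }
}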

> See above. It might be that we are more paranoid than others but I do think
> that we should avoid at all cost introducing new user-space clients that
> rely upon a known security flaw and that breaks attempts to plug it!
Well, I think the only reason we've avoided the problem before is because the X server suspends all clients when VT switched away (and I guess the vmware driver does too). One idea is, as a near-term hack, we could do the same for mutter in wayland: block clients and block ourselves. There's an unavoidable race, though, I think.
Comment 9 Ray Strode [halfline] 2015-08-25 18:21:58 UTC
> 6) so a VT running in the background can look at the window
> contents of the session a user is actively using
oops, I didn't finish this sentence. Should end... " the easily guessable name assigned in step 3"
Comment 10 Thomas Hellström 2015-08-25 18:56:21 UTC
(In reply to Ray Strode [halfline] from comment #8)
> Hi,
> 
> > > > It's not that VMware is particularly vulnerable to this security issue. Only
> > > > that other driver writers haven't cared to implement a test for it.
> > > I could be wrong, but my understanding is this security issue is well known,
> > > and we just live with it.
> > 
> > It might be well known, but there are ways to fix this issue in the drivers.
> > Some drivers apparently don't care. The Radeon driver is reported safe
> > against this issue, as is the vmwgfx driver.
> Let's make sure we're talking about the same issue. As I understand it the
> problem is:
> 
> 1) One process needs to make a buffer available to another process
> 2) It accomplishes this by "naming" the buffer using the GEM FLINK ioctl
> 3) The naming pool is assigned from a global, incrementing counter that
> starts at 0.
> 4) Given that integer, other clients can read the buffer
> 5) this includes clients on other VTs
> 6) so a VT running in the background can look at the window contents of the
> session a user is actively using
> 
> Is this the security issue you're talking about?  

Correct.

> If so, how does the vmware
> driver deal with the opposite problem (user in active session looks at
> buffers from inactive sessions) ?

First, vmwgfx is not using gem, and recent driver versions have fixed the CPU access part of the issue by only allowing buffer sharing within the same "master realm" (clients with the same master). The initial idea for GPU access was to simply throw out buffer contents from GPU memory on VT switch, making it invisible to the new master realm. This has now been replaced by careful command stream validation making sure that any GPU memory accessed belongs to buffer objects also within the same "master realm". However, if fbdev is not enabled we still need to empty VRAM when leaving the VT because the device may use it for other purposes.

So strictly speaking, at least for fairly new kernel versions, I believe the driver is safe from this security issue even if we don't block old masters in the way we do. As a last resort, for recent kernel versions, we can probably hack the kernel driver to not error in this case.


> If this isn't the same issue you're
> talking about, can you explain to me the issue the vmware and radeon driver
> have addressed that others drivers haven't?

It is the same issue. I'm not sure how Radeon addresses the issue with the gem open ioctl, but according to Jerome Glisse they do and they also have an elaborate command verifier that makes sure the GPU doesn't access stuff it may not access.

> 
> > See above. It might be that we are more paranoid than others but I do think
> > that we should avoid at all cost introducing new user-space clients that
> > rely upon a known security flaw and that breaks attempts to plug it!
> Well, I think the only reason we've avoided the problem before is because
> the X server suspends all clients when vt switched away (and I guess the
> vmware driver does too). One idea is, as a near term hack, we could do the
> same for mutter in wayland. block clients and block ourselves. there's an
> unavoidable race, though, I think.

Note that X servers can only block clients when they interact with the X server, like in copyBuffers, so drivers that really care about this issue must deal with the clients in the kernel anyway; there's no need to block clients from user-space. The reason we can't suspend the old master like its clients is
that we would block it from calling drmSetMaster....

/Thomas
Comment 11 Ray Strode [halfline] 2015-08-25 19:16:12 UTC
Hi,

> > Is this the security issue you're talking about?  
> Correct.
So this issue doesn't affect wayland sessions, in practice, since wayland sessions don't ever use globally named buffers.

I guess wayland sessions can, in theory, still peep at buffers from X sessions, which are globally named.

> So strictly, at least for fairly new kernel versions I believe the driver is
> safe from this security issue even if we don't block old masters in the way
> we do, so as a last resort, for recent kernel versions, we can probably hack
> the kernel driver to not error in this case.
Okay probably worth doing that, in addition to whatever other changes we come up with.

> Note that X servers can only block clients when they interact with the X
> server, like in copyBuffers, so drivers really caring about this issue must
> deal with the clients in the kernel anyway, so no need to block clients from
> user-space. The reason we can't suspend the old master like its clients is
> that we would block it from calling drmSetMaster....
okay, but not all drivers block clients, so if we block the display server in userspace and don't block the clients in userspace, then on some machines the session will crash as the client send queues overflow. (see the discussion around bug 753891 comment 18 )
Comment 12 Jasper St. Pierre (not reading bugmail) 2015-08-25 19:18:54 UTC
Also, I'm going to admit that I never saw glTexImage2D as "rendering". What would running on a render node do differently if we call glTexImage2D at a random time?
Comment 13 Ray Strode [halfline] 2015-08-25 19:28:22 UTC
actually I wonder if bug 753891 is the cause of the drawing you alluded to in comment 3 

It could be that fixing it and adding some meta_later_add calls to st-texture-cache.c is enough to make this problem go away.
Comment 14 Ray Strode [halfline] 2015-08-25 19:30:55 UTC
Actually, more than likely the issue in comment 3 is just a race, where we don't freeze the master clock in time, since we freeze it in response to the VT change, not in preparation for the VT change.
Comment 15 Jasper St. Pierre (not reading bugmail) 2015-08-25 19:35:14 UTC
I don't see why freezing the master clock would have any influence on a file load completing and us calling cogl_texture_2d_new_for_data.
Comment 16 Thomas Hellström 2015-08-25 19:35:37 UTC
(In reply to Jasper St. Pierre from comment #12)
> Also, I'm going to admit that I never saw glTexImage2D as "rendering". What
> would running on a render node do differently if we call glTexImage2D at a
> random time?

As mentioned previously, the kernel logs show numerous calls to submit GPU command buffers as well. The problem specific to glTexImage2D was that it caused the crash, and that buffer allocation is listed as an ioctl requiring authentication.

Now in principle, however, if a driver were to implement GPU data privacy for legacy nodes by throwing out buffers at VT switch, a glTexImage2D would have (on hw like vmware at least) caused the data to be read back again for DMA purposes, which wouldn't be allowed. A render node implementation would have had to perform command validation or use per-client virtual GPU memory.

/Thomas
Comment 17 Thomas Hellström 2015-08-25 19:39:28 UTC
(In reply to Jasper St. Pierre from comment #15)
> I don't see why freezing the master clock would have any influence on a file
> load completing and us calling cogl_texture_2d_new_for_data.

If you look at the corresponding redhat bug, there are backtraces that seem to originate from the master clock (although I'm clueless about gnome-shell internals).

/Thomas
Comment 18 Ray Strode [halfline] 2015-08-25 19:45:18 UTC
Created attachment 309981 [details] [review]
defer texture access until we have master

(In reply to Jasper St. Pierre from comment #15)
> I don't see why freezing the master clock would have any influence on a file
> load completing and us calling cogl_texture_2d_new_for_data.

It doesn't; that's what caused the stack trace. But there is more than one issue: Thomas says he saw a completely different issue in comment 3.

The above attachment (modulo any fixes it probably needs, since I haven't tested it) could fix the original crash, but it won't solve all the problems.
Comment 19 Owen Taylor 2015-08-25 19:51:08 UTC
Very generally, approaches are:

 1) Make the vmwgfx driver not fail ioctls when Mutter drops master, except for ones that *have* to be failed, like switching the mode or front buffer contents.

 2) Make logind synchronize with Mutter so that Mutter reliably doesn't try to make ioctls when suspended.

 The fact that there is no such synchronization is handy for development, since you can switch away from a suspended or hung copy of Mutter to a different VT. I would prefer if we didn't have to give that up.

 3) Make Mutter robust against ioctls failing. This is probably best handled in the context of ARB_robustness - treating a failed ioctl as a reset, rather than trying to propagate error status back up on individual calls.

 This is a *lot* of work - the "reset" idea makes it a bit easier, but there's still a need to find all the places in the GNOME Shell/Mutter/Clutter code where a GPU object is cached.

 4) Make Mutter do most of its DRM interactions as a "client" - either with render node or a non-master connection to /dev/dri/cardN, and only use the master connection for stuff that *has* to be on a master connection. (Is that just modesetting and page-flipping?) This greatly reduces the scope for dealing with failure. Since Mutter doesn't call drmSetMaster() itself, it's basically OK if it gets suspended.

I think we would certainly appreciate it if we weren't doing extra work just for one driver, if all other drivers behave in a different fashion. I don't see how restricting GPU access to one dri master domain is the future-looking approach, especially since we are looking to have effective sandboxing between clients running in the same Wayland session in the future. So if 1) could be done, that would be great! Beyond that, approach 4) has the most appeal, but might require API changes at the Clutter/Cogl level.
Comment 20 Owen Taylor 2015-08-25 19:56:29 UTC
Review of attachment 309981 [details] [review]:

This is only useful if we added synchronization at the logind level so that the master clock was reliably suspended whenever master privileges are dropped; right now the suspending of the master clock is just approximate.
Comment 21 Ray Strode [halfline] 2015-08-25 20:18:09 UTC
>  4) Make Mutter do most of its DRM interactions as a "client" - either with
> render node or a non-master connection to /dev/dri/cardN, and only use the
> master connection for stuff that *has* to be on a master connection. (Is
> that just modesetting and page-flipping?) This greatly reduces the scope for
> dealing with failure. Since Mutter doesn't call drmSetMaster() itself, it's
> basically OK if it gets suspended.
This is clearly the best mid-term option, since it has other benefits too:

1) uses shiny new technology
2) gets us halfway to optimus support and gives us gpu offloading for "free".

For F22/gnome 3.16 we're sort of screwed. If vmware can work around it in the driver near term, great; otherwise I'll probably push a change to gdm to force the login screen to use Xorg if vmware is in use.
Comment 22 Ray Strode [halfline] 2015-08-25 20:20:01 UTC
(In reply to Owen Taylor from comment #20)
> This is only useful if we added synchronization at the logind level so that
> the master clock was reliably suspended whenever master privileges are
> dropped; right now the suspending of the master clock is just approximate.
well, i think it's useful in that it will make crashes less frequent, but it's perhaps not useful enough to push, since there will still be crashes even with it.
Comment 23 Owen Taylor 2015-08-25 21:24:16 UTC
I talked to Jerome Glisse some about this:

 * He said that the Radeon driver is not significantly different from the Intel or Nouveau drivers. What the command stream checker does is check that clients can't access DRM objects that they don't have a handle to through the GPU - it doesn't affect the ability of one DRM master to open an object named by another master via FLINK.

 * Once we have masters that are not "trusted" (Wayland compositors running as a user as compared to Xservers running as root, for example), the only planned way to handle the FLINK security issues is for people to stop using FLINK.

 * He didn't immediately see any security benefit to the vmwgfx policy of blocking DRM access for clients that are not currently the master.

 * His take (Rob Clark chimed in and agreed) is that in the modern world, ignoring legacy, mastership should specifically be about modesetting, and not anything else.
Comment 24 Ray Strode [halfline] 2015-08-26 01:06:23 UTC
so given that vmware is apparently the only driver to lock down rendering/texture access like it does, it would be great if it could be changed to behave like the other drivers.

of course, we should do the render node split, too.
Comment 25 Thomas Hellström 2015-08-26 06:15:32 UTC
(In reply to Owen Taylor from comment #23)
> I talked to Jerome Glisse some about this:
> 
>  * He said that the Radeon driver is not significantly different from the
> Intel or Nouveau drivers. What the command stream checker does is check that
> clients can't access DRM objects that they don't have a handle to through
> the GPU - it doesn't affect the ability of one DRM master to open an object
> named by another master via FLINK.

So in fact, the GPU command stream checker can be pretty much bypassed as long as the FLINK security issue remains even on Radeon.
I haven't looked at Intel for quite some time, but in the old days they relied on low-priority command buffers that were designed to not let you do nasty things. But they'd allow you to access the common GTT including blitting data into the high-priority ring... Hopefully that has changed since.

> 
>  * Once we have masters that are not "trusted" (Wayland compositors running
> as a user as compared to Xservers running as root, for example), the only
> planned way to handle the FLINK security issues is for people to stop using
> FLINK.

I agree stopping using FLINK is good. However, we're not there yet and this is not about "trusted" masters. Anyone could easily craft a daemon that becomes master: VT switch and start it, VT switch back, log out and leave the workstation, and it has full background access to anything subsequently displayed on the screen. I'm pretty sure most of our high-end customers wouldn't like that.

> 
>  * He didn't immediately see any security benefit to the vmwgfx policy of
> blocking DRM access for clients that are not currently the master.

The blocking doesn't fix the FLINK issue, but in the absence of a good command verifier, throwing out GPU memory and blocking old masters is a simple and cheap way to restrict GPU visibility to the current master realm. It might also be needed for some hardware in situations where fbdev is disabled and vesafb is allowed to survive a drm module load, but that's a different problem. VMware probably doesn't strictly need it going forward, but I think it's important to know the implications if you start to rely on non-blocking DRMs.

> 
>  * His take (Rob Clark chimed in and agreed) is that in the modern world,
> ignoring legacy, mastership should specifically be about modesetting, and
> not anything else.

I agree, but "ignoring legacy" is the key word here.

/Thomas
Comment 26 Thomas Hellström 2015-08-26 06:21:29 UTC
(In reply to Ray Strode [halfline] from comment #24)
> so given that vmware is apparently the only driver to lock down
> rendering/texture access like it does, it would be great if it could be
> changed to behave like the other drivers.
> 
> of course, we should do the render node split, too.

I'll look at the driver today and see what we can do. 

/Thomas
Comment 27 David Herrmann 2015-08-26 09:37:23 UTC
So there are two security issues: FLINK per se is insecure, as it allows guessing buffers of someone else. The second issue is that you can access VMEM that you don't have buffer handles to (by crafting malicious command-streams).
The first can be protected against by using dma-buf instead of FLINK (and subsequently providing an option to no-op FLINK). The second is protected against by doing command-stream validation, using IOMMUs, or "realm-protection" as Thomas described.
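A minimal sketch of the dma-buf alternative using libdrm's PRIME wrappers (helper names invented; error handling trimmed):

#include <fcntl.h>
#include <stdint.h>
#include <xf86drm.h>

/* Exporter: turn a GEM handle into a dma-buf fd.  Unlike a FLINK name,
 * the fd is only usable by whoever it is explicitly passed to
 * (e.g. over a unix socket). */
static int
export_buffer (int drm_fd, uint32_t gem_handle)
{
  int dmabuf_fd = -1;

  if (drmPrimeHandleToFD (drm_fd, gem_handle, DRM_CLOEXEC, &dmabuf_fd) != 0)
    return -1;

  return dmabuf_fd;
}

/* Importer: convert the received fd back into a local GEM handle. */
static int
import_buffer (int drm_fd, int dmabuf_fd, uint32_t *gem_handle)
{
  return drmPrimeFDToHandle (drm_fd, dmabuf_fd, gem_handle);
}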

The "realm-protection" is the oldest of the ideas and basically means you can only access the GPU while your 'realm' is active. It's unsuitable for off-screen rendering / GPGPU, sandboxing and alike, though. In those cases, there always must be multiple independent (and untrusted!) users at the same time accessing the GPU.

Of the drivers in use, only legacy-radeon (UMS) and vmwgfx implement realm-protection. Everyone else just doesn't care (or implements proper cs validation and employs MMU protection). More importantly, everyone implementing render-nodes cannot support realm-protection, as the two concepts are incompatible.

(@Jasper, side note: render nodes *are* enabled by default)

My suggestion is twofold:

1) Always use render nodes for any real rendering. This will eventually allow off-loading it to other GPUs (once x-gpu buffer allocation is solved).

2) Use the 'drm-master' concept *only* for access to mode-setting resources. Always assume you might get revoked asynchronously. Handle EACCES as 'probably revoked' and deal with it gracefully. Additionally, never assume you're master during startup.

@Thomas: If vmwgfx supports render-nodes, you must already employ some other protection than realm-based. Why is this not used for the primary node? Or in other words, why is rendering / texture-allocation on render nodes *different* from the same access on the master node? The only issue I see is FLINK (which is disabled on render nodes). If that's the only issue, how about a mode that disables FLINK but then uses render-node behavior on the master node?

Last, but not least: logind allows *synchronous* device revocation. As long as the compositor did not acknowledge it, DRM-master is *not* revoked. However, I do plan to provide a way to override this so to force a VT switch.
Comment 28 Thomas Hellström 2015-08-26 10:25:33 UTC
(In reply to David Herrmann from comment #27)

> @Thomas: If vmgfx supports render-nodes, you must already employ some other
> protection than realm-based. Why is this not used for the primary node? Or
> in other words, why is rendering / texture-allocation on render nodes
> *different* from the same access on the master node? The only issue I see is
> FLINK (which is disabled on render nodes). If that's the only issue, how
> about a mode that disables FLINK but then uses render-node behavior on the
> master node?

This is exactly what I planned to implement as a workaround. I'll let render-node-like behaviour through for dropped masters.

Thanks,
Thomas
Comment 29 Owen Taylor 2015-08-26 13:41:08 UTC
(In reply to Thomas Hellström from comment #25)
 
> >  * Once we have masters that are not "trusted" (Wayland compositors running
> > as a user as compared to Xservers running as root, for example), the only
> > planned way to handle the FLINK security issues is for people to stop using
> > FLINK.
> 
> I agree stopping using FLINK is good. However, we're not there yet and this
> is not about "trusted" masters. Anyone could easily craft a daemon master,
> VT switch and start it, vt switch back, log out and leave the workstation,
> having full background access to anything subsequently displayed on the
> screen. I'm pretty sure most of our high end customers wouldn't like that.

drmSetMaster() requires CAP_SYS_ADMIN, so in the past, any master could be assumed to have CAP_SYS_ADMIN, and be "trusted". And on most Linux systems that was just Xorg.

With logind, that's no longer the case, yes, anybody can write a program that calls the TakeControl and TakeDevice methods of logind to get master privileges. There are restrictions - the session has to be running on the seat that the device belongs to - but nothing that prevents arbitrary code from running. (Not to mention the ability to run arbitrary code within an existing Wayland compositor by various means.)

So basically, with logind we've introduced the problem of untrusted masters, and made the FLINK issue more of an issue. (Though it sounds like vmwgfx may do better than some other drivers.)
Comment 30 Owen Taylor 2015-08-26 13:42:30 UTC
(In reply to David Herrmann from comment #27)
> Last, but not least: logind allows *synchronous* device revocation. As long
> as the compositor did not acknowledge it, DRM-master is *not* revoked.
> However, I do plan to provide a way to override this so to force a VT switch.

This is only the behavior for seats without VTs, right?
Comment 31 Thomas Hellström 2015-08-26 14:05:26 UTC
(In reply to Owen Taylor from comment #29)
> (In reply to Thomas Hellström from comment #25)
>  
> > >  * Once we have masters that are not "trusted" (Wayland compositors running
> > > as a user as compared to Xservers running as root, for example), the only
> > > planned way to handle the FLINK security issues is for people to stop using
> > > FLINK.
> > 
> > I agree stopping using FLINK is good. However, we're not there yet and this
> > is not about "trusted" masters. Anyone could easily craft a daemon master,
> > VT switch and start it, vt switch back, log out and leave the workstation,
> > having full background access to anything subsequently displayed on the
> > screen. I'm pretty sure most of our high end customers wouldn't like that.
> 
> drmSetMaster() requires CAP_SYSADMIN, so in the past, any master could be
> assumed to have CAP_SYSADMIN, and be "trusted". And on most Linux systems
> that was just Xorg.
> 

But if you open an fd to drm and there is no current master, your connection automatically becomes master regardless of whether you have CAP_SYS_ADMIN or not (think non-root EGL). (drm_fops.c)

This means a malicious program just needs to repeatedly open an fd to drm and try to authenticate itself to verify it's master. In the old world a malicious user would have to switch VT to succeed. In the new world the malicious program just has to wait for gdm to switch away and be fast enough to authenticate itself before the X server grabs master privileges.

/Thomas
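A sketch of such a master check using only calls available at the time; drmAuthMagic() is a master-only ioctl, so authenticating your own magic succeeds only while you hold master (illustrative helper, not from any patch here):

#include <xf86drm.h>

/* Returns non-zero if drm_fd currently holds DRM master. */
static int
fd_is_master (int drm_fd)
{
  drm_magic_t magic;

  return drmGetMagic (drm_fd, &magic) == 0 &&
         drmAuthMagic (drm_fd, magic) == 0;
}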
Comment 32 David Herrmann 2015-08-26 14:07:07 UTC
(In reply to Owen Taylor from comment #30)
> (In reply to David Herrmann from comment #27)
> > Last, but not least: logind allows *synchronous* device revocation. As long
> > as the compositor did not acknowledge it, DRM-master is *not* revoked.
> > However, I do plan to provide a way to override this so to force a VT switch.
> 
> This is only the behavior for seats without VTs, right?

Nope. On seats without VTs you use the dbus api; on VTs you just override logind's VT_SETMODE handler. So it's really synchronous on all seats, but I dislike advertising that feature as it shouldn't be used. It's an ugly way to fix issues, and I'd prefer if asynchronous switches become the default.
Comment 33 David Herrmann 2015-08-26 14:08:18 UTC
(In reply to Thomas Hellström from comment #31)
> This means a malicious program just needs to repeatedly open an fd to drm
> and try to authenticate itself to verify it's master. In the old world a
> malicious user would have to switch VT to succeed. In the new world the
> malicious program just have to wait for gdm to switch away and be fast
> enough to authenticate itself before the X server grabs master privileges. 

Sure, but that default should be disabled, just like FLINK. So once we support a sysctl that disables legacy DRM, it should cover "automatic DRM-Master" *and* FLINK, imho.
Comment 34 Ray Strode [halfline] 2015-08-27 00:25:55 UTC
(In reply to Ray Strode [halfline] from comment #24)
> of course, we should do the render node split, too.
I started to sketch this out today here:

https://git.gnome.org/browse/cogl/log/?h=halfline/wip/render-node

probably not functional yet, will test it and iterate in the next couple days
Comment 35 Thomas Hellström 2015-08-27 11:50:56 UTC
Workaround vmwgfx kernel patches are now posted on the fedora bug.
Comment 36 Ray Strode [halfline] 2015-08-27 18:46:50 UTC
okay I fleshed out the branch a bit and got it working (in brief testing). i'll post the patchset in a few minutes.

There's a couple things pending, but they can happen later:

1) if drmModePageFlip fails we disable page flipping. Because of the races mentioned in this bug, it's "normal" for drmModePageFlip to fail when VT switching.  We need to only disable page flipping if it fails while the VT is still active, I guess.

2) the separation between cogl and mutter isn't really ideal. we call some drm api in cogl and some in mutter and the abstraction between them is strange.  I think it might be worth investigating consolidating it all in mutter, making the cogl side very thin.
Comment 37 Ray Strode [halfline] 2015-08-27 18:49:29 UTC
Created attachment 310119 [details] [review]
kms-winsys: drop support for old gbm versions

Right now cogl tries to accommodate older gbm versions that
have the function gbm_bo_get_pitch() instead of the more
recently named gbm_bo_get_stride().

This adds an ugly #ifdef in the code. Furthermore, we are
going to rely on newer gbm for dma-buf support anyway.

This commit drops the #ifdef.
Comment 38 Ray Strode [halfline] 2015-08-27 18:49:34 UTC
Created attachment 310120 [details] [review]
kms-winsys: use correct surface format

gbm confusingly has two different format types, and cogl
is using the wrong one in some of its calls to gbm_surface_create

This commit fixes the calls that are wrong.
Comment 39 Ray Strode [halfline] 2015-08-27 18:49:40 UTC
Created attachment 310121 [details] [review]
kms-winsys: clean up error handling in _cogl_winsys_renderer_connect

If cogl fails to open the drm device, initialize gbm, or open the
egl display, then it closes the drm fd, uninitializes gbm, closes the
display and then calls _cogl_winsys_renderer_disconnect which does
most of those things again on the now-deinitialized members.

This commit removes the explicit failure handling in renderer_connect and
defers cleanup to disconnect.
Comment 40 Ray Strode [halfline] 2015-08-27 18:49:45 UTC
Created attachment 310122 [details] [review]
kms-winsys: rename device_name to card_device_name

The variable device_name is currently used to hold
the name of the drm device to open ("/dev/dri/card0").
We're going to be opening other drm devices in the future
(render nodes), so device_name will become ambiguous.

This commit renames it to card_device_name
Comment 41 Ray Strode [halfline] 2015-08-27 18:49:51 UTC
Created attachment 310123 [details] [review]
kms-winsys: rework handling of the drm fd

At the moment the drm fd is stored in the renderer structure twice:
once for reading and once for closing. This is a little messy, and
will only get worse when we start throwing render nodes into the mix,
too.

This commit abstracts the device handling out to another structure.
Rather than having two members for each fd, this commit employs a
boolean to decide whether or not the fd needs to get explicitly closed.
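The abstraction being described could look something like this (struct and field names invented for illustration, not the patch's actual ones):

#include <unistd.h>

/* One drm device as the winsys sees it.  owns_fd records whether we
 * opened the fd ourselves (and so must close it) or were handed it by
 * the caller, replacing the old "one fd for reading, one for closing"
 * arrangement. */
typedef struct
{
  int fd;
  int owns_fd;   /* boolean */
} KmsDevice;

static void
kms_device_close (KmsDevice *device)
{
  if (device->owns_fd && device->fd >= 0)
    close (device->fd);

  device->fd = -1;
}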
Comment 42 Ray Strode [halfline] 2015-08-27 18:49:57 UTC
Created attachment 310124 [details] [review]
kms-winsys: support drm render nodes

Right now cogl does all rendering and modesetting using
/dev/dri/card0. If rendering were moved to /dev/dri/renderD128
then it would be in a better position to support offloading
rendering operations to a second gpu.

Also, some versions of the vmware driver don't support using
card0 when running on an inactive VT.  The result is crashes
low in the rendering stack.

This commit leaves card0 for modesetting operations, but defers
rendering operations to a render node.  The output is synchronized
to card0 using dma-buf. render nodes can be used even when VT
switched away, so this will fix crashes on vmware.
Comment 43 Ray Strode [halfline] 2015-08-27 20:06:35 UTC
(In reply to Ray Strode [halfline] from comment #36)
> 1) if drmModePageFlip fails we disable page flipping. Because of the races
> mentioned in this bug, it's "normal" for drmModePageFlip to fail when VT
> switching.  We need to only disable page flipping if it fails and the vt is
> still active I guess.
owen pointed out on irc we can just check for EACCES to know that master disappeared.

(we could also go the other way and check for EINVAL but some drivers (mga) return EBUSY for "not supported" so that's more dicey)
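In other words, something along these lines (sketch only; the real change is attachment 310127, and the names here are made up):

#include <errno.h>
#include <stdint.h>
#include <stdbool.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Try to queue a flip.  A -EACCES failure just means master was revoked
 * (VT switched away): drop the frame but keep page flipping enabled.
 * Any other failure means the driver genuinely can't flip. */
static bool
try_page_flip (int drm_fd, uint32_t crtc_id, uint32_t fb_id,
               void *flip_data, bool *page_flipping_supported)
{
  int ret = drmModePageFlip (drm_fd, crtc_id, fb_id,
                             DRM_MODE_PAGE_FLIP_EVENT, flip_data);

  if (ret == -EACCES)
    return false;

  if (ret != 0)
    *page_flipping_supported = false;

  return ret == 0;
}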
Comment 44 Ray Strode [halfline] 2015-08-27 20:07:02 UTC
Created attachment 310127 [details] [review]
kms-winsys: don't disable page flipping on EACCES

If the user switches VTs in the middle of a page flip, the
page flip operation may fail with EACCES. Page flipping will
work the next time the VT becomes active, so we shouldn't disable
page flipping in that case.
Comment 45 Ray Strode [halfline] 2015-08-27 21:56:37 UTC
Review of attachment 310127 [details] [review]:

::: cogl/winsys/cogl-winsys-egl-kms.c
@@ +659,3 @@
                                  crtc->id, fb_id,
                                  DRM_MODE_PAGE_FLIP_EVENT, flip);
+          if (ret != 0 && ret != EACCES)

that should be -EACCES of course
Comment 46 Ray Strode [halfline] 2015-08-28 02:51:59 UTC
Review of attachment 310120 [details] [review]:

this I need to investigate more. why are there two different format types?
Comment 47 Jasper St. Pierre (not reading bugmail) 2015-08-28 03:10:15 UTC
Review of attachment 310120 [details] [review]:

Either should work. See http://cgit.freedesktop.org/mesa/mesa/tree/src/gbm/backends/dri/gbm_dri.c#n839
Comment 48 Ray Strode [halfline] 2015-08-28 12:05:45 UTC
that only works in the create case though, not the import case.  perhaps a bug in gbm
Comment 49 Ray Strode [halfline] 2015-08-28 12:08:57 UTC
(still i think it's better to be consistent than the mix of the two types we were doing before)
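For reference, the two format vocabularies in question are the enum gbm_bo_format values and the fourcc-style GBM_FORMAT_* codes; per the discussion above, the create path accepts either but the import path at the time only handled the fourcc codes, so consistent usage looks like this (illustrative, not the actual patch):

#include <stdint.h>
#include <gbm.h>

/* Use the fourcc-style GBM_FORMAT_* code, not the enum
 * GBM_BO_FORMAT_XRGB8888 value, so that create and import agree. */
static struct gbm_surface *
create_scanout_surface (struct gbm_device *gbm, uint32_t width, uint32_t height)
{
  return gbm_surface_create (gbm, width, height,
                             GBM_FORMAT_XRGB8888,
                             GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
}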
Comment 50 Ray Strode [halfline] 2015-08-28 14:33:47 UTC
Created attachment 310193 [details]
i think this will fix gbm

i think this will fix GBM, though I haven't tried it yet. I'll test it and upstream it, but even so, we should still push attachment 310120 [details] [review] (maybe with an updated commit message)
Comment 51 Ray Strode [halfline] 2015-08-28 15:13:50 UTC
Review of attachment 310124 [details] [review]:

::: cogl/winsys/cogl-winsys-egl-kms.c
@@ +986,3 @@
+
+      if (kms_onscreen->card_bo != NULL)
+        handle = gbm_bo_get_handle (kms_onscreen->card_bo).u32;

the failure handling here is a little wonky. If gbm_bo_import fails, handle won't get set and then we'll get a handle to the render_offload_device bo and try to use it on the card device which will fail.  we should probably post a warning here and bail early instead.
Comment 52 Ray Strode [halfline] 2015-08-28 18:21:11 UTC
Comment on attachment 310193 [details]
i think this will fix gbm

This ended up needing to be a little different, since it turns out the dri layer had two different image formats as well (one older and one newer), and the patch above picked the older one but we need the newer one. I sent the working patch here:

http://lists.freedesktop.org/archives/mesa-dev/2015-August/092931.html
Comment 53 Ray Strode [halfline] 2015-08-31 23:04:00 UTC
Review of attachment 310124 [details] [review]:

::: cogl/winsys/cogl-winsys-egl-kms.c
@@ +334,3 @@
+  kms_renderer->render_offload_device.fd = -1;
+
+  ret = glob (render_offload_device_glob, 0, NULL, &glob_buffer);

airlied suggests using drmGetRenderDeviceNameFromFd instead of glob here.
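That suggestion amounts to something like the following (untested sketch; helper name invented):

#include <fcntl.h>
#include <stdlib.h>
#include <xf86drm.h>

/* Ask libdrm for the render node that belongs to the card fd we already
 * have, instead of globbing /dev/dri/renderD*.  Returns -1 if the driver
 * exposes no render node, in which case the caller falls back to the
 * card fd. */
static int
open_render_node_for (int card_fd)
{
  char *name = drmGetRenderDeviceNameFromFd (card_fd);
  int fd = -1;

  if (name != NULL)
    {
      fd = open (name, O_RDWR | O_CLOEXEC);
      free (name);
    }

  return fd;
}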
Comment 54 Ray Strode [halfline] 2015-08-31 23:45:38 UTC
actually I'll just paste the conversation (slightly pared down) since there are some other suggestions in there to get us to full optimus support:

<halfline> airlied: did you see https://bugzilla.gnome.org/show_bug.cgi?id=753531 ?
<airlied> I've heard about it, didn't look into the details
<airlied> halfline: cogl still hardcodes /dev/dri/card0
<airlied> wow cogl is all kinds of broken, it should really use udev
<halfline> well cogl only hardcodes /dev/dri/card0 if you don't override it by sending your own fd
<halfline> and mutter sends its own fd
<halfline> but it hardcodes /dev/dri/card0 too :-)
<airlied> but how does it pick the rendernode?
<airlied> does it get it from mutter
<airlied> ?
<halfline> my patches just use glob("/dev/dri/renderD*") to pick the rendernode at the moment
<airlied> that patch seems to do something bad like pick the first renderNode
<airlied> yes that isn't in any way sane
<airlied> the ordering isn't even guaranteed across boot
<airlied> you can have more than one graphics card
<airlied> most laptops that aren't bought in RH IT have two :)
<halfline> the downside of picking the wrong render node is less severe than the downside of picking the wrong card
<halfline> and mutter currently picks the wrong card
<airlied> it's still pretty wrong
<airlied> mutter needs to grow the code from X to find the primary GPU or have a config file to override it
<halfline> doesn't it just mean you'll get rendering on integrated graphics instead of nvidia ?
<halfline> airlied: i'm not disputing that
<halfline> airlied: but at the moment it doesn't have it
<airlied> there is no definite what it means
<halfline> sure, it could mean you'll get rendering on nvidia and not integrated graphics
<airlied> you could end up rendering on the nvidia, and cause the laptop to heat up for no reason
<airlied> I'm not even sure why it is picking a rendernode, just to avoid the vmware bug?
<halfline> and to get us in a better place to support optimus when we finally fix https://bugzilla.gnome.org/show_bug.cgi?id=753434
<halfline> (which is not now, but it sounds like you really think it should be now)
<airlied> it should at least be using drmGetRenderDeviceNameFromFd
<halfline> i'll add a comment about that function to the bug
<airlied> all cogl should do now is pick the primary gpu and if it needs a rendernode pick that
<halfline> that's a fair point
<airlied> halfline: because rendering the mutter output on the nvidia won't magically make it scanoutable on the intel
<airlied> the compositor should be rendering on the display
<airlied> granted ideally we'd have a compositor context per GPU somehow
<airlied> and be able to move things between them
<airlied> but I've no idea how we are going to get there yet
<halfline> ah i gotcha
<airlied> maybe we should just check at startup if /dev/dri has more than one device and give up until we have enough time to fix it
<airlied> it would at least put a pause to the wayland is ready for default
<airlied> we also don't consider USB offload devices either
<airlied> again they will have device nodes
<airlied> though not rendernodes
<airlied> at least I don't think they get rendernodes since they don't render much :)
<halfline> well in that case you don't want to use drmGetRenderDeviceNameFromFd i guess
<halfline> instead we'd need to do a manual copy i suppose
<halfline> but i just want the kernel magic you spoke of
<airlied> yes, but again it shows issues hardcoding /dev/dri/card0
<halfline> i know there's issues hard coding /dev/dri/card0
<halfline> maybe i should just spend a day or two and fix mutter to use udev
<halfline> i'll talk with mclasen about it tomorrow
<airlied> pci_device_is_boot_vga from libpciaccess :)
<halfline> does that just check sysfs for boot_vga = 1?
<mclasen> airlied: its very obvious: pick the one with the highest number
<mclasen> more is always better
<airlied> halfline: pretty much on Linux
<halfline> so there are some cases where the pci devices associated with a graphics device don't have boot_vga but i think those cases are cases that don't involve kms
<halfline> so that's probably okay
<halfline> (was thinking of efifb and hyperv )
<halfline> matters for plymouth but not cogl
<halfline> airlied: is there any plan in place to support cross gpu buffer sharing with dma-buf down the line?
<airlied> halfline: we already do cross device sharing
<airlied> it works fine for X.org use cases
<halfline> then what did you mean by "because rendering the mutter output on the nvidia won't magically make it scanoutable on the intel"
<airlied> well sync is kinda busted but the whole point of dma-buf is that
<airlied> because to scanout something it has to be in a certain type of memory usually
<airlied> like for reverse prime cases, where we render with the intel and display on the nvidia, we have to do two copies
<airlied> one from intel rendering to a shared buffer
<airlied> and one from the shared buffer to nvidia VRAM
<airlied> I'm not 100% sure how wayland is working in that area
<airlied> I think mesa hides the copy
<airlied> I think weston offloading has worked at some point
<airlied> yeah we have patches in mesa for wayland prime support
<airlied> but the compositor I think is special
<airlied> this only works for the sending buffers to a compositor
<halfline> hmm there's something going on there
<halfline> but not sure what, will need to dig more
<halfline> ah i see, there's one reference_buffer function that can take a dma-buf fd or a name
<halfline> but it's just calling dri2_dpy->image->createImageFromFds
<halfline> which is all gbm_bo_import does
<halfline> so i don't see any copies unless i'm missing something
<airlied> src/egl/drivers/dri2/platform_wayland.c
<airlied> is_different_gpu
<airlied> but that code is for the client side as I said, compositor shouldn't run in that configuration
<airlied> though it probably will have to support it at some point if we ever want to get to OSX levels of support
<airlied> though I suspect you'd just run the compositor on the nvidia, and have it scanout offload to the intel
<airlied> and have the compositor "restart" behind the scenes to switch GPUs
<halfline> okay but this is good to crib off
<halfline> grr i want kernel magic
<airlied> there isn't much the kernel can do, userspace has to be in control
<airlied> otherwise you'd get very inconsistent behaviour due to the kernel doing extra copies
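The libpciaccess lookup airlied mentions near the end of the log looks roughly like this (sketch; error handling omitted):

#include <stdio.h>
#include <pciaccess.h>

/* Walk all PCI devices and report the one the firmware used for boot
 * output, which is the sensible default "primary" GPU rather than
 * whatever happens to be card0. */
static void
print_boot_vga (void)
{
  struct pci_device_iterator *iter;
  struct pci_device *dev;

  pci_system_init ();
  iter = pci_slot_match_iterator_create (NULL);   /* NULL: match every device */

  while ((dev = pci_device_next (iter)) != NULL)
    {
      if (pci_device_is_boot_vga (dev))
        printf ("boot vga: %04x:%02x:%02x.%d\n",
                dev->domain, dev->bus, dev->dev, dev->func);
    }

  pci_iterator_destroy (iter);
  pci_system_cleanup ();
}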
Comment 55 Daniel Stone 2015-09-01 08:16:47 UTC
A more clean solution (IMO) would involve using EGL_EXT_device_base, possibly in tandem with platform_surfaceless, to open the rendering device. Still-unreviewed Mesa patches here:
http://lists.freedesktop.org/archives/mesa-dev/2015-July/thread.html#89783
Comment 56 Ray Strode [halfline] 2015-09-01 13:00:06 UTC
interesting. though, like you mentioned, it's not completely landed yet.  Also, looking here:

http://cgit.freedesktop.org/mesa/mesa/tree/src/egl/drivers/dri2/platform_surfaceless.c

it only ever opens one render node, and it picks it arbitrarily, so it's not going to help us with cross-gpu work.
Comment 57 Daniel Stone 2015-09-01 13:04:55 UTC
> it only ever opens one render node, and it picks it arbitrarily, so it's not 
> going to help us with with cross-gpu work.

All in good time. device_base lets us actually enumerate GPUs on the system - something totally new to EGL, gbm notwithstanding[0]. The next step would be to add support to platform_base for passing an EGLDeviceEXT as one of the attribs to eglGetPlatformDisplay, which would let you pick a render device. We're planning to contribute support for this on the surfaceless, gbm, and wayland winsys, which is where it becomes interesting.

But right now it's just blocked on device_base review anyway. :\

[0]: Which is totally broken anyway: gbm wants you to pass in the fd to the target display device, which has nothing necessarily to do with the _render_ device. We have two disjoint cases there, where for ARM we only ever really have one so just have a hardcoded association; in true multi-GPU systems we just punt on the entire problem space, aside from misc environment variables or symlink trickery.
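For context, EGL_EXT_device_base lets you enumerate devices through EGL itself rather than guessing /dev/dri paths; a sketch of the query side, assuming an EGL stack that exposes the extension (e.g. with the Mesa patches linked above):

#include <stdio.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Enumerate EGL devices.  Picking one of these as an attribute to
 * eglGetPlatformDisplay is the follow-up step described above. */
static void
list_egl_devices (void)
{
  PFNEGLQUERYDEVICESEXTPROC query_devices =
    (PFNEGLQUERYDEVICESEXTPROC) eglGetProcAddress ("eglQueryDevicesEXT");
  EGLDeviceEXT devices[16];
  EGLint n_devices = 0;

  if (query_devices != NULL && query_devices (16, devices, &n_devices))
    printf ("%d EGL device(s) available\n", (int) n_devices);
}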
Comment 58 Ray Strode [halfline] 2015-09-01 13:24:50 UTC
(In reply to Daniel Stone from comment #57)
> All in good time. device_base lets us actually enumerate GPUs on the system
> - something totally new to EGL, gbm notwithstanding[0]. The next step would
> be to add support to platform_base for passing an EGLDeviceEXT as one of the
> attribs to eglGetPlatformDisplay, which would let you pick a render device.
> We're planning to contribute support for this on the surfaceless, gbm, and
> wayland winsys, which is where it becomes interesting.
Sounds good.

> But right now it's just blocked on device_base review anyway. :\
ugh

> [0]: Which is totally broken anyway: gbm wants you to pass in the fd to the
> target display device, which has nothing necessarily to do with the _render_
> device.
Not sure I follow; gbm_create_device can take a render node fd or a card node fd.  (see attachment 310124 [details] [review] where I feed it both)

As I understand from my conversation above with airlied, the problem is that different display device targets have different requirements for scan out buffers, so the naïve approach of shipping the buffer over wholesale using dma-buf will fail if it's in the wrong memory space, or has the wrong rowstride requirements, or uses the wrong memory layout, or whatever. So to make cross-gpu access work in the general case, you have to copy from gpu memory to system memory, and then blit back to the other gpu's memory.  But gbm doesn't export a blit operation at the moment.

> We have two disjoint cases there, where for ARM we only ever really
> have one so just have a hardcoded association; in true multi-GPU systems we
> just punt on the entire problem space, aside from misc environment variables
> or symlink trickery.
yea, i gotta say i'm not a fan of this whole hybrid graphics trend. what a mess
Comment 59 Daniel Stone 2015-09-01 13:30:50 UTC
(In reply to Ray Strode [halfline] from comment #58)
> (In reply to Daniel Stone from comment #57)
> > [0]: Which is totally broken anyway: gbm wants you to pass in the fd to the
> > target display device, which has nothing necessarily to do with the _render_
> > device.
> Not sure I follow, gbm_create_device can take a render node fd or a card
> node fd.  (see attachment 310124 [details] [review] where I feed it both)

Mesa's gbm implementation will accept basically any DRM device, but that doesn't mean that it's a) portable, b) future-proof, or c) a good idea. Say you have a Samsung Exynos display controller and an ARM Mali GPU, then you pass a Mali GPU render node to GBM - how is it supposed to know what to allocate for? Of course, the answer is to grub through /dev/dri looking for random control nodes, ignoring anything called 'vgem', and then hoping the driver name does the right thing.

I'd prefer to not pass anything other than actual display-control nodes to GBM for that reason.

> As I understand from my conversation above with airlied, the problem is
> that different display device targets have different requirements for scan
> out buffers, so the naïve approach of shipping the buffer over wholesale
> using dma-buf will fail if it's in the wrong memory space or has the wrong
> rowstride requirements, or uses the wrong memory layout or whatever. So to
> make cross-gpu access work in the general case, you have to copy from gpu
> memory to system memory, and then blit back to the other gpu's memory.  but
> gbm doesn't export a blit operation at the moment.

Or really, the way the GBM API was meant to work is that you pass it the target _display_ node, and then you use the EGLDevice API to select the source _render_ node. Working out appropriate allocation then becomes an internal exercise for EGL, instead of pure guesswork externally.

tl;dr this is all basically going to break on ARM
Comment 60 Ray Strode [halfline] 2015-09-01 14:27:16 UTC
Hi,

(In reply to Daniel Stone from comment #59)
> Say you have a Samsung Exynos display controller and an ARM Mali GPU, then you
> pass a Mali GPU render node to GBM - how is it supposed to know what to
> allocate for? Of course, the answer is to grub through /dev/dri looking for
> random control nodes, ignoring anything called 'vgem', and then hoping the
> driver name does the right thing.
Are you saying that: 

1) find the card device from udev (probably /dev/dri/card0)
2) find the render node from drmGetRenderDeviceNameFromFd (card_device_fd);

won't give you the right nodes on arm?
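In code, the lookup I mean is roughly this (a sketch only; error handling omitted, and the card path would really come from udev rather than being hardcoded):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <xf86drm.h>

  int
  main (void)
  {
    /* 1) the card node, which in practice comes from udev; hardcoded here
     *    only for illustration */
    int card_fd = open ("/dev/dri/card0", O_RDWR | O_CLOEXEC);

    /* 2) ask libdrm for the render node exposed by the same driver instance */
    char *render_node = drmGetRenderDeviceNameFromFd (card_fd);
    if (render_node != NULL)
      printf ("render node: %s\n", render_node);   /* e.g. /dev/dri/renderD128 */
    else
      printf ("no render node associated with this card node\n");

    free (render_node);
    return 0;
  }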
 
> Or really, the way the GBM API was meant to work, is that you pass it the
> target _display_ node, and then you use the EGLDevice API to select the
> source _render_ node.
But it can't be used that way today, until the patches are updated and land, right?

> Working out appropriate allocation then becomes an internal exercise for EGL, instead of pure guesswork externally.
users of gbm don't guess allocation formats. They select a device to allocate buffers for, and they specify
what the buffers getting allocated will be used for. The details are figured out by mesa.

> tl;dr this is all basically going to break on ARM
I read what you said, and I'm still not exactly clear on why.  Is your concern that the mali-gpu-allocated buffer won't be the right format to scan out on the exynos display controller? (Basically the same concern as the dual-gpu case highlighted above?)
Comment 61 Daniel Stone 2015-09-01 14:35:32 UTC
(In reply to Ray Strode [halfline] from comment #60)
> (In reply to Daniel Stone from comment #59)
> > Say you have a Samsung Exynos display controller and an ARM Mali GPU, then you
> > pass a Mali GPU render node to GBM - how is it supposed to know what to
> > allocate for? Of course, the answer is to grub through /dev/dri looking for
> > random control nodes, ignoring anything called 'vgem', and then hoping the
> > driver name does the right thing.
> Are you saying that: 
> 
> 1) find the card device from udev (probably /dev/dri/card0)

'the card device'?

Exynos (the display controller, which will do KMS calls, buffer allocation, and _nothing else_) will have /dev/dri/card0. Or maybe card1. Or card2 if you have vgem as well.

> 2) find the render node from drmGetRenderDeviceNameFromFd (card_device_fd);

Getting a rnode from a display controller won't give you the right device to send rendering commands to, because a display controller isn't a GPU.

> won't give you the right nodes on arm?

... or any situation where you have more than one answer for 'give me a device which is a KMS display controller and/or a GL-capable GPU'.

> > Or really, the way the GBM API was meant to work, is that you pass it the
> > target _display_ node, and then you use the EGLDevice API to select the
> > source _render_ node.
> But it can't be used that way today, until the patches are updated and land
> right?

Right. Which is why the EGL implementation ignores everything and just opens /dev/mali0.

> > Working out appropriate allocation then becomes an internal exercise for EGL, instead of pure guesswork externally.
> users of gbm don't guess allocation formats. They select a device to
> allocate buffers for, and they specify
> what the buffers getting allocated will be used for. The details are figured
> out by mesa.

The context is important ...

GBM needs to allocate buffers which are suitable for a) scanout by a display controller, and b) rendering into by a GPU. In any situation where you have multiple devices (because you want to render on NVIDIA but scan out from Intel, or render on Mali and scan out from Exynos because you have to in that case), you need to communicate two pieces of information:
  - which display controller do I want to allocate buffers for? i.e. the KMS node you opened (or a rendernode from the _same_ _device) and will be calling AddFB/PageFlip/Atomic on - this is the argument to gbm_device_create
  - which GPU do I want to allocate buffers for? i.e. the device EGL is going to open - this is the future hypothetical EGLDeviceEXT-based API

I get that the answer to #2 not existing isn't great yet, but the solution isn't to give answer #2 in response to question #1.

(And yes, the GBM API is really badly documented this way. Apologies.)

If it helps, imagine rendering on NVIDIA (because you want to be fast), but displaying on Intel (because your panel is literally hardwired that way). Passing an NVIDIA device to gbm_device_create strips the crucial context, which is that the buffer ultimately needs to be accessible by the Intel device as a scanout format.

> > tl;dr this is all basically going to break on ARM
> I read what you said, and i'm still not exactly clear on why.  Is your
> concern that the mali gpu allocated buffer,
> won't be the right format to scan out on the exynos display controller?
> (basically the same concern as the dual
> gpu case highlighted above?)

Yes, it's a special case of dual-GPU really. And just the principle in general that rendering devices and display controllers are really very different things, except on x86 where they're almost the same if you squint hard enough, except in the cases where they aren't anyway.
Comment 62 Ray Strode [halfline] 2015-09-01 15:03:21 UTC
(In reply to Daniel Stone from comment #61)

> 'the card device'?
Right, the device with 'card' in its name: the thing you page flip with, i.e. the exynos.

> will have /dev/dri/card0. Or maybe card1. Or card2 if you have vgem as well.
which we can figure out from udev.

> > 2) find the render node from drmGetRenderDeviceNameFromFd (card_device_fd);
> 
> Getting a rnode from a display controller won't give you the right device to
> send rendering commands to, because a display controller isn't a GPU.
It will if they're provided by the same driver.  Basically:

ls /sys/dev/char/226:0/device/drm

will list the rnode associated with a display controller for the single gpu case.  Clearly, the situation is different in hybrid graphics cases and I guess on arm.

> GBM needs to allocate buffers which are suitable for
> a) scanout by a display controller

So that buffer needs GBM_BO_USE_SCANOUT usage

> b) rendering into by a GPU.
And this buffer needs GBM_BO_USE_RENDERING usage.

> you need to communicate two pieces of information:
>   - which display controller do I want to allocate buffers for? i.e. the KMS
> node you opened (or a rendernode from the _same_ _device) and will be
> calling AddFB/PageFlip/Atomic on - this is the argument to gbm_device_create
>   - which GPU do I want to allocate buffers for? i.e. the device EGL is
> going to open - this is the future hypothetical EGLDeviceEXT-based API
> 
> I get that the answer to #2 not existing isn't great yet, but the solution
> isn't to give answer #2 in response to question #1.
Still not following why it doesn't cover the use cases we have here.  The one thing I can think of is this: in order for it to work generally in the dual gpu case there have to be up to 2 copies (from gpu memory of one card, to host memory, to gpu memory of another card).  In the ARM case, the buffer is presumably completely shareable between the display controller and the gpu, and would require no copies provided that when the buffer is allocated up front on the gpu render node it uses a format that's kosher with the display controller.  So I'm guessing your concern is gbm_surface_create (mali_node_fd, width, height, format, GBM_BO_USE_RENDERING | GBM_BO_USE_SCANOUT) won't provide a surface that's explicitly compatible with the display controller and so would require extra copies or wouldn't function. Is that your concern?
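For concreteness, the call I have in mind is roughly the following (a sketch; gbm_surface_create really takes a gbm_device created from that fd rather than the fd itself, and the node path is made up):

  #include <fcntl.h>
  #include <gbm.h>

  /* hypothetical mali render node, standing in for whatever enumeration we
   * end up using */
  int mali_node_fd = open ("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
  struct gbm_device *mali_gbm = gbm_create_device (mali_node_fd);

  /* ask for a surface that is both renderable and scanout-capable; the open
   * question is whether "scanout" here can ever mean "scanout on that other
   * display controller over there" */
  struct gbm_surface *surface =
    gbm_surface_create (mali_gbm, 1920, 1080, GBM_FORMAT_XRGB8888,
                        GBM_BO_USE_RENDERING | GBM_BO_USE_SCANOUT);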
Comment 63 Daniel Stone 2015-09-01 15:18:05 UTC
(In reply to Ray Strode [halfline] from comment #62)
> (In reply to Daniel Stone from comment #61)
> > > 2) find the render node from drmGetRenderDeviceNameFromFd (card_device_fd);
> > 
> > Getting a rnode from a display controller won't give you the right device to
> > send rendering commands to, because a display controller isn't a GPU.
> It will if they're provided by the same driver.  basically:
> 
> ls /sys/dev/char/226:0/device/drm

Yeah - it depends which side of it you view as a special case. ;)
 
> > you need to communicate two pieces of information:
> >   - which display controller do I want to allocate buffers for? i.e. the KMS
> > node you opened (or a rendernode from the _same_ _device) and will be
> > calling AddFB/PageFlip/Atomic on - this is the argument to gbm_device_create
> >   - which GPU do I want to allocate buffers for? i.e. the device EGL is
> > going to open - this is the future hypothetical EGLDeviceEXT-based API
> > 
> > I get that the answer to #2 not existing isn't great yet, but the solution
> > isn't to give answer #2 in response to question #1.
> Still not following why it doesn't cover the use cases we have here.

Because it throws information away, as well as being completely contradictory to what you said above ('the thing you page flip with'). So you're now into the land of guesswork. Who do you allocate for? Do you optimise (NVIDIA-only internal tiling format on internal/non-exposed RAM) and run the risk of it not working when you actually pageflip on a different device, or do you pessimise (force a suboptimal format and allocation to be as visible as possible) and throw power/performance away?

Just because GBM takes an fd doesn't mean it's the right place to put the render device's fd. That's literally what EGLDeviceEXT exists for. And there are real GBM implementations out in the wild which will break if you pass a non-KMS device to gbm_device_create.

> The
> one thing I can think of is this: in order for it to work generally in the dual
> gpu case there has to be up to 2 copies (from gpu memory of one card, to
> host memory, to gpu memory of another card).  In the ARM case, the buffer is
> presumably completely shareable between the display controller and the gpu,
> and would require no copies provided that when the buffer is allocated up
> front on the gpu render node that it uses a format that's kosher with the
> display controller.  So I'm guessing your concern is gbm_surface_create
> (mali_node_fd, width, height, format, GBM_BO_USE_RENDERING |
> GBM_BO_USE_SCANOUT) won't provide a surface that's explicitly compatible
> with the display controller and so would require extra copies or wouldn't
> function. Is that your concern?

Yeah, wouldn't function, because all users of the GBM API thus far have answered the question the same way as you ('the thing you page flip with'), so the implementation is written to assume this. IIRC, gbm_device_create would just fail.

In some x86 multi-GPU cases you can actually make a sensible allocation: allocate with a format both devices agree on, in a manner both devices agree on. ARM has framebuffer compression upcoming which needs to be specifically negotiated, which requires knowledge of exactly which devices are present.

Passing the render-device as an argument to literally the only place we can provide the scanout-device as an argument, ignoring the possibility of using the bog-standard EGL API that will allow you to select the render device, precludes the possibility of any of this ever working well, which is why I'd like to nip it in the bud.

I don't see why, when we need to pass two pieces of information, it's controversial to assume the following:
  - the scanout allocation API, which has always accepted the scanout device as an argument, continues to do so
  - the rendering API (EGL), with widely-supported extensions to enumerate and select the rendering device, accepts the render device as an argument

Especially when the way to find out things like which EGL extensions are supported, and capabilities/rendering information in general, is through EGLDevice rather than scraping udev ...
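To spell out the split I mean, a sketch (the gbm half exists today; passing an EGLDeviceEXT as a platform-display attribute is the hypothetical, not-yet-specified piece, and pick_render_device() is a made-up helper standing in for eglQueryDevicesEXT-based selection):

  #include <fcntl.h>
  #include <gbm.h>
  #include <EGL/egl.h>
  #include <EGL/eglext.h>

  /* 1) scanout: the KMS node we'll call AddFB/PageFlip/Atomic on is what gbm gets */
  int kms_fd = open ("/dev/dri/card0", O_RDWR | O_CLOEXEC);
  struct gbm_device *gbm = gbm_create_device (kms_fd);

  /* 2) rendering: the GPU is chosen through EGL.  pick_render_device() is a
   * made-up helper (e.g. built on eglQueryDevicesEXT), and using
   * EGL_DEVICE_EXT as a platform-display attribute is the speculative part,
   * not something any implementation accepts today */
  EGLDeviceEXT render_device = pick_render_device ();
  EGLAttrib attribs[] = { EGL_DEVICE_EXT, (EGLAttrib) render_device, EGL_NONE };
  EGLDisplay display = eglGetPlatformDisplay (EGL_PLATFORM_GBM_KHR, gbm, attribs);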
Comment 64 Ray Strode [halfline] 2015-09-01 17:10:06 UTC
(In reply to Daniel Stone from comment #63)
> Because it throws information away, as well as being completely
> contradictory to what you said above ('the thing you page flip with').
I don't think I've been contradictory. The proposal in this bug (attachment 310124 [details] [review], with the tweak to use drmGetRenderDeviceNameFromFd instead of glob("/dev/dri/renderD*")) is to create two gbm devices, one for rendering and one for scan out. Buffers are created on the rendering device and imported into the scan-out device to be shown on screen at the next flip. This only works if the buffers are compatible. They would be compatible if the render node was gleaned from the card node using the previously mentioned drm api. On arm, there would be no render node directly associated with the card node, so no render node would be picked and it would just use /dev/dri/card0 with no render node, as it does today. Of course that patch doesn't try to address hybrid graphics.
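Stripped down, the gist of that approach is roughly this (a sketch, not the literal patch; error handling omitted, and render_surface stands in for the gbm_surface the EGL context draws into on the render device):

  #include <fcntl.h>
  #include <gbm.h>
  #include <xf86drm.h>

  /* one gbm device per node: render node for EGL, card node for KMS flips */
  int card_fd = open ("/dev/dri/card0", O_RDWR | O_CLOEXEC);
  int render_fd = open (drmGetRenderDeviceNameFromFd (card_fd), O_RDWR | O_CLOEXEC);

  struct gbm_device *render_gbm = gbm_create_device (render_fd);   /* EGL renders here */
  struct gbm_device *scanout_gbm = gbm_create_device (card_fd);    /* flips happen here */

  /* after a frame is rendered: hand the buffer across via its dma-buf fd.
   * render_surface is assumed to be a gbm_surface created on render_gbm
   * (not shown). */
  struct gbm_bo *rendered_bo = gbm_surface_lock_front_buffer (render_surface);
  struct gbm_import_fd_data import_data = {
    .fd     = gbm_bo_get_fd (rendered_bo),
    .width  = gbm_bo_get_width (rendered_bo),
    .height = gbm_bo_get_height (rendered_bo),
    .stride = gbm_bo_get_stride (rendered_bo),
    .format = gbm_bo_get_format (rendered_bo),
  };
  struct gbm_bo *scanout_bo =
    gbm_bo_import (scanout_gbm, GBM_BO_IMPORT_FD, &import_data, GBM_BO_USE_SCANOUT);

  /* scanout_bo can then become a KMS framebuffer (drmModeAddFB2) and be
   * flipped; this only works when both devices agree on format/stride/tiling,
   * which is exactly the compatibility question above */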

> So you're now into the land of guesswork. Who do you allocate for? Do you
> optimise (NVIDIA-only internal tiling format on internal/non-exposed RAM)
> and run the risk of it not working when you actually pageflip on a different
> device, or do you pessimise (force a suboptimal format and allocation to be
> as visible as possible) and throw power/performance away?
Well, you have to pick a format that can be shared, or you have to perform a copy or two. There's no getting around that.  The problem you've raised, I guess, is that we don't pick the format, just a usage, and the thing that does pick the format doesn't know the ultimate destination of the buffer (it knows "this is for scan out", not "this is for scan out on this other device"), so we would have to always copy and face a performance hit, or never copy and face rendering failure. I guess we could fix GBM to be more aware of the rendering device / card device split with new API, but the point of using gbm was to evolve the code we already have with functionality that already exists, so adding new api doesn't really serve that purpose.

> In some cases on the x86 multi-GPU case, you can actually make a sensible
> allocation: allocate  with a format both devices agree on, in a manner both
> devices agree on.
Well, I think there are still cases where you have to copy on x86, like the reverse prime case airlied mentions above.

> Passing the render-device as an argument to literally the only place we can
> provide the scanout-device as an argument, ignoring the possiblity of using
> the bog-standard EGL API that will allow you to select the render device,
> precludes the possibility of any of this ever working well, which is why I'd
> like to nip it in the bud.

Okay, I think at this point we should shelve the main patch.  The near term impetus for the patch was to work around a problem in the vmware driver, which is now resolved via a driver fix anyway.  We do still need attachment 310127 [details] [review] to fix a related bug that will affect all drivers, but I'll move it to a new bug.  We can use bug 753434 to handle any future egl work.

> Especially when the way to find out things like which EGL extensions are
> supported and capabilities/rendering information in general, is through
> EGLDevice rather than scraping udev ...

Well, udev is another discussion and should probably happen on bug 753434, but as I see it we have two "right now" options:

1) continue to hardcode /dev/dri/card0 and continue to fail on hybrid graphics machines
2) use udev like weston does
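For option 2, the udev side is roughly this (a simplified sketch of what weston does; the seat matching and boot_vga preference are omitted):

  #include <libudev.h>
  #include <stdio.h>

  int
  main (void)
  {
    struct udev *udev = udev_new ();
    struct udev_enumerate *e = udev_enumerate_new (udev);
    struct udev_list_entry *entry;

    udev_enumerate_add_match_subsystem (e, "drm");
    udev_enumerate_add_match_sysname (e, "card[0-9]*");
    udev_enumerate_scan_devices (e);

    udev_list_entry_foreach (entry, udev_enumerate_get_list_entry (e))
      {
        struct udev_device *device =
          udev_device_new_from_syspath (udev, udev_list_entry_get_name (entry));

        /* weston also filters by seat and prefers the boot_vga device; skipped here */
        printf ("candidate card node: %s\n", udev_device_get_devnode (device));
        udev_device_unref (device);
      }

    udev_enumerate_unref (e);
    udev_unref (udev);
    return 0;
  }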

We can use EGL extensions when they land, of course, but that shouldn't block unbreaking two-video-card machines today.

Anyway, I'm going to close this NOTGNOME and will file the ancillary patches as a new bug.
Comment 65 Ray Strode [halfline] 2015-09-03 19:35:11 UTC
I've filed the other patches in bug 754540.