I'm currently trying to implement a QtQuick 2.0 video backend that uses glTexDirectVIVMap, so that we can get to <10% (maybe <5%) CPU load for 1080p video playback.
I'm already rather far and it does look promising!
Here is the interesting part of the code:
I'm just pasting part of the code for simplicity reasons.
'bind()' is the function which is called with an active glContext and should result in a glBindTexture call.
All the other code is glue-code and not important for this question.
(If someone wants the code nevertheless, just ask me. It will eventually be open sourced anyway, so there's no problem in sharing the unfinished version)
I still have two major problems.
This question is about my struggle with glTexDirectVIVMap:
It seems that glTexDirectVIVMap maps the memory in the background...?
What I mean is that it's completely async ... I think.
It seems that the time required to do the map depends heavily on the chip (GC800 or GC2000, i.MX6 Solo vs. Quad) and also(!) very much on what else is going on on the GPU.
It can vary
from < 1/24 s (== smooth playback)
to ~1/4 s (then the only way to get non-stuttering playback is to reduce the playback speed to ~1/6)
This is a huge problem.
*edit - forgot to describe the problem*
The problem is that the frames start to 'stutter' as soon as the delay is too high.
On the Solo it stutters all the time.
On the Quad it only stutters if one renders something else in the background (which results in 60fps instead of 24fps OpenGL rendering)
*edit - end*
I have a couple of possible solutions in mind ... but either I don't have enough knowledge for them or they have massive drawbacks.
1. I'm doing something wrong with glTexDirectVIVMap or glTexDirectInvalidateVIV
... if someone spots an error, let me know.
... I've tried glTexDirectVIV as well, without success; I think it's not meant for this use case (== where the memory allocation happens somewhere else)
2. queue with a specific depth.
I could do an N-buffering approach.
So this would mean one would:
1. take a new frame
2. map it to texture
3. push texture on a stack/FIFO
4. pop oldest texture from stack (and free memory)
During all this time the 'oldest texture' is used for rendering.
This could be configurable but at the end of the day one would need to find a reasonable value for N.
It looks like the max. value for N would be ~6.
... this is just a guess!
If frames come along and the stack is full, one would drop them.
6 frames are 0.25 seconds (6 frames at 24fps).
Meaning one would need to buffer 0.25 seconds audio as well.
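The N-buffering idea above can be sketched as a small FIFO. This is a hypothetical sketch only: `tex_fifo`, `fifo_push`, the depth of 6 and the plain integer texture IDs are all made up, and real code would delete the popped texture and free its buffer where the comment indicates.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the N-buffering idea: new frames are mapped and pushed at
 * the tail; rendering always uses the oldest texture, which was mapped
 * N frames ago and should therefore be done mapping. */
#define FIFO_DEPTH 6 /* the guessed max. value for N */

typedef struct {
    unsigned tex[FIFO_DEPTH]; /* texture IDs, oldest first */
    size_t   count;
} tex_fifo;

/* Call once per frame with the texture that was just
 * glTexDirectVIVMap'ed. Returns the texture to render with. */
static unsigned fifo_push(tex_fifo *q, unsigned incoming)
{
    if (q->count == FIFO_DEPTH) {
        /* pop oldest: here one would glDeleteTextures + free its buffer */
        for (size_t i = 1; i < FIFO_DEPTH; ++i)
            q->tex[i - 1] = q->tex[i];
        q->count--;
    }
    q->tex[q->count++] = incoming;
    return q->tex[0]; /* render with the oldest mapped texture */
}
```

If frames arrive while the queue is full, the oldest one is dropped each time, matching step 4 above.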
3. queue until map is done.
Same as '2.', but rather than guessing 'N', one could use (if it exists) some API to get the state of the memory mapping?
GL_MAPPING_DONE flag or something like that?
Then one would pop the oldest texture as soon as the second-oldest texture is ready.
Solved!
The VPU is expected to create 7 buffers for 1080p video, but we see 13 textures being created. So the current way of mapping is correct. No harm in it.
When you run sintel_trailer-1080p.mp4, the incoming frame addresses do not arrive in round-robin fashion; since this video is high profile, the VPU will take whatever free buffer is available.
In this case I can see the same address being repeated for 3 consecutive frames. That means both the GPU and the VPU are accessing the buffer, causing this distortion.
For this case I added a check: if the same address repeats immediately for the next frame, do a glFinish before mapping it. This fixed the problem to some extent.
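The described check could look roughly like this. A sketch only: `sync_if_repeated`, the stub counter, and the call sequencing are assumptions; the real code would call glFinish() where the stub is invoked, before glTexDirectVIVMap.

```c
#include <assert.h>
#include <stdint.h>

/* If the VPU hands us the same physical address twice in a row, force a
 * finish so the GPU is done with the buffer before the VPU overwrites it.
 * gl_finish_stub stands in for glFinish(); the counter is only there so
 * the behavior can be observed in this sketch. */
static int finish_calls = 0;
static void gl_finish_stub(void) { finish_calls++; }

static uintptr_t last_phys_addr = 0;

/* Call once per incoming frame, before mapping it.
 * Returns 1 if a finish was required. */
static int sync_if_repeated(uintptr_t phys_addr)
{
    int repeated = (phys_addr != 0 && phys_addr == last_phys_addr);
    if (repeated)
        gl_finish_stub(); /* real code: glFinish(); */
    last_phys_addr = phys_addr;
    return repeated;
}
```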
I also tried a baseline profile video, where the addresses were not repeated; there it worked in round-robin fashion. But I can see the distortion when I add parallel GLES drawings.
Here are my findings:
- eglSwapBuffers does not wait for the frame to complete. It flushes the command buffer and gives control back to the application for the next frame. So the command buffers which include the texture to be rendered are still in the GPU queue, not yet rendered by the GPU. In the meantime the VPU can take control of the buffer and decode into it. Hence this asynchronous behavior is causing the problem.
- One solution is to make sure the frame is completely processed by the GPU before giving control back to the VPU to decode the next frame, i.e. a glFinish after rendering. That will work well in this case, because the texture will have been copied to GPU memory. No race conditions.
- Please let me know whether the glFinish solves the problem on the wandsolo; I would also like to know the performance impact.
I attached an old sample application for glTexDirectVIVMap() demo.
Basically, I retrieved the decoded video frame's physical addresses of the Y and UV planes from the gstreamer pipeline, mapped them via glTexDirectVIVMap() to two textures (alpha and luminance) at run time, did the color space conversion in a shader, then did a simple rendering. You can change the video resolution and name in the code. It used to work well in an old BSP (I cannot remember which one) for small-resolution videos.
If you don't want to do the color space conversion in a shader, you can just map the video format to one corresponding texture format.
From this simple application, you can first make sure glTexDirectVIVMap() can reach your 30 FPS. If not, VSI needs to improve glTexDirectVIVMap() performance. If this simple application works OK, then the lag is in the gstreamer application.
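For reference, the per-pixel math such a color-conversion shader performs is the standard BT.601 video-range transform (Y in [16,235], U/V in [16,240]). Written out in integer C below for clarity; the function name is illustrative, and the actual shader works on normalized floats sampled from the luminance and alpha textures.

```c
#include <assert.h>

/* Standard BT.601 video-range YUV -> RGB conversion, fixed point. */
static int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

static void yuv_to_rgb(int y, int u, int v, int *r, int *g, int *b)
{
    int c = y - 16, d = u - 128, e = v - 128;
    *r = clamp255((298 * c + 409 * e + 128) >> 8);
    *g = clamp255((298 * c - 100 * d - 208 * e + 128) >> 8);
    *b = clamp255((298 * c + 516 * d + 128) >> 8);
}
```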
Thank you Thomas, Volki, Prabhu and Karina. I recently encountered this same problem on an app I am working on and this thread was very helpful. My situation is somewhat simpler because I am not using Qt or gstreamer. I use the Freescale IMX video API directly and I allocate all the buffers just once and reuse them.
I found that the video rendered through OpenGL ES 2.0 would tear intermittently, and the problem got worse as the OpenGL rendering load increased. The solution of adding the glFinish() call resolved the problem for me, but there is some cost associated with it: I saw the rendering speed of 1080p video drop by about 3 fps when I added the glFinish() call, and it is really only needed when the OpenGL load is significant. I am using an 800MHz i.MX6 Quad and I have not tried this on a solo core.
Overall, rendering video through OpenGL ES 2.0 works well on this platform (as it did on the i.MX53) and the special effects and animations that can be applied to the video are fantastic. Decoding two 1080p videos simultaneously, the CPU use is only about 10%. I will be publishing more information about my application soon on my blog at http://montgomery1.com.
Here is the complete code.
It should apply to qtmultimedia dev (also to stable, I guess).
Also added the qml file I'm using to test right now.
Reducing the play speed (line #36) and/or disabling the animated blue rectangle (line #62) reduces/removes the stuttering.
I still believe that there is something funny going on with glTexDirectVIVMap (the same code works wonderfully with glTexImage2D) ... I've got no clue what! :smileyhappy:
I couldn't download the attached zip packages, but what I guess from your pastebin code is that you recall glTexDirectVIVMap every time you call bind. As I understood it from reading Computer Vision on i.MX Processors, it is only necessary to call it once and later call glTexDirectInvalidateVIV only. Maybe I overlooked something in the pastebin code, but the statically initialized variable seems to be used only for binding the API functions and not for the creation of the GPU buffer mapping.
Is there anything one can do wrong when uploading a file?
I've just tested and I can download them.
Calling glTexDirectVIVMap only once is only applicable when one uses the same memory over and over again .. isn't it?
With gstreamer one continuously gets new buffers ... right?
if not ...
I could try to remember 'all' buffers, including their textures.
When a new one comes -> new texture -> glTexDirectVIVMap + glTexDirectInvalidateVIV -> active texture for SG rendering
When one comes that is already known -> take the existing texture -> glTexDirectInvalidateVIV -> active texture for SG rendering
... after writing those lines, this does sound logical :smileyhappy:
I'll give it a try and let you know!
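The remember-all-buffers scheme above can be sketched as a small pointer-to-texture cache. Everything here is hypothetical (names, the fixed capacity, and the integer counter standing in for glGenTextures); the caller would do the actual glTexDirectVIVMap for new entries and glTexDirectInvalidateVIV for known ones.

```c
#include <assert.h>
#include <stddef.h>

/* Remember each gstreamer buffer pointer together with the texture it
 * was mapped to. Capacity 16 comfortably covers the ~13 distinct
 * buffers observed per playback. */
#define MAX_BUFFERS 16

typedef struct {
    const void *ptr[MAX_BUFFERS];
    unsigned    tex[MAX_BUFFERS];
    size_t      count;
    unsigned    next_tex_id; /* stands in for glGenTextures */
} tex_cache;

/* Returns the texture for `buf`; *is_new tells the caller whether a
 * glTexDirectVIVMap is required (1) or only glTexDirectInvalidateVIV (0).
 * Returns 0 if the cache is full (not handled in this sketch). */
static unsigned cache_lookup(tex_cache *c, const void *buf, int *is_new)
{
    for (size_t i = 0; i < c->count; ++i) {
        if (c->ptr[i] == buf) {
            *is_new = 0;
            return c->tex[i];
        }
    }
    if (c->count == MAX_BUFFERS)
        return 0;
    /* new buffer: allocate a texture ID and remember the mapping */
    c->ptr[c->count] = buf;
    c->tex[c->count] = ++c->next_tex_id;
    *is_new = 1;
    return c->tex[c->count++];
}
```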
Just tried it, and it works surprisingly well.
It's about ~13 different memory locations per video playback.
As the format, size and everything else stays the same within one video playback, it actually works very well.
- I call glTexDirectVIVMap 13 times, for the "first" 13 frames, with 13 different texture IDs.
- After that I only call glTexDirectInvalidateVIV on the already-mapped textures.
But it has the same stuttering problem.
As soon as something else happens on the GPU (e.g. a simple animation), the playback starts to jump back and forth rapidly (/stutter).
The glFinish solved the stuttering on the nitrogen as well as on the wandsolo!
So it might be a vsync vs. memory-sync problem.
I'm working on something very similar, although it is not tied to Qt. I'm using gstreamer and copy the data with memmove in the handoff callback of a fakesink. Because the callback runs in a different thread, I post a message about the frame change to my message loop and call glBindTexture and glTexDirectInvalidateVIV in the main thread afterwards. I have two textures created with glTexDirectVIVMap that I use as a flip buffer for the thread. So it's ensured that during rendering only the texture is used which the callback thread is currently NOT writing into.
I exported FB_MULTI_BUFFER=2, which solved tearing problems with normal OpenGL rendering (so I guess it works as expected). But regarding the glTexDirectInvalidateVIV handling of the gstreamer data, I currently see exactly the same stuttering of video playback when something else is done on the GPU. If I only render after the video data has changed, it works flawlessly; but if I render again before a new image from gstreamer becomes available, it seems like the video texture is flipped to the last one, although I made sure it is the same as before.
I thought it might be a problem within my code, but after printing out every flip, copy, update and rendering of the textures, it might be something else, because everything prints in the expected order.
I made the same observations.
.. and apparently FB_MULTI_BUFFER=2 only softens the problem .. the back-and-forth stuttering still happens on the i.MX6S .. and I suspect on the 'big' ones too when doing more complex painting.
I've got another implementation .. which is a bit dumb/wrong to some degree .. it doesn't 'clean up' the textures; it just remembers the texture for each pointer value it gets.
... for a simple gstreamer setup this works flawlessly, as there are only around ~15 different memory positions (/pointer values) coming out of gstreamer.
The important difference is that I only call glTexDirectVIVMap for new data positions, with new(!) textures ... and for the "old" data positions I just call glTexDirectInvalidateVIV.
The funny thing is that I always use the most recent texture .. so no double buffering or anything like that.
There is a lot of 'buffering', as I don't clean up ... but from a data point of view, always the latest data from gstreamer, and therefore also the latest texture glTexDirectInvalidateVIV was called on, is used
... but this is hardly the final solution! :smileyhappy:.
I should have mentioned that the new implementation doesn't have any stuttering at all
... it just has minor 'glitches' once in a while (~3/minute) ... I suspect an additional double buffering (which is easy to do, as we already have quite a few buffers) would solve that.
I just realized that you are using the GStreamer buffers directly via the MAP method. That is the obvious way to avoid copying the data, which is what I have done until now. I thought about changing to your direct-buffer solution, but apart from the problems you mentioned above, it's also not clear to me how you synchronize the gl calls with GStreamer?
I'm currently using the fakesink handoff signal, but the callback is called from a GStreamer thread. So all I can do is post a message to my main thread with the buffer passed to the callback. But because of the async behaviour of my posted event, the buffer might already be reused by GStreamer when I want to use it within OpenGL (is it even guaranteed that the buffer stays valid after leaving the handoff callback?). I would expect that you have similar problems, if you don't have any other synchronization method, do you?
BTW. I have a i.MX6Q so yes, the behaviour is the same as on the i.MX6s
Qt implements it a bit differently .. rather than using fakesink, Qt implements the final sink and takes the buffers directly.
.. rather than using fakesink as a buffer container (that's what you're doing, right?)
Then, after one is 'done' with the buffer, one needs to de-ref it to free it .. I'm not entirely sure this prevents gstreamer from reusing the buffers too early...
This happens in the unmap() function of QVideoFrame ... in another implementation (with n-buffering) I always 'unmap'ed as late as possible (just before the buffer needed to be replaced) .... this never worked nicely (even with 12-buffering).
Reusing the same texture and data area and just invalidating the data content seems to work significantly better, although I unmap right after glTexDirectVIVMap/glTexDirectInvalidateVIV (which is possibly the worst thing one can do).
... this could also be the source of the 3/minute flickering (gstreamer reusing the buffer too fast)
So do you copy the GStreamer data passed to Qt's final sink? If not (i.e. you use the GStreamer buffer data directly), I doubt that the pointers are guaranteed to stay valid. I just had a quick look at the fakesink source code, and it does not provide any information on whether GStreamer cares about the ref count of the buffers passed to its render method. There's a num-buffers property on the fakesink, but I couldn't find out whether it has any relationship to the number of distinct data buffers within GStreamer's pipeline.
Your proposed solution of unmapping right after glTexDirectInvalidateVIV seems to be the only proper one to me. Why do you think that's the worst thing one can do?
Let's assume the buffer won't be reused by GStreamer until it's dereferenced, I would assume that the final solution might be:
- Create GStreamer Pipeline with either Qt's final sink or GStreamer's fake-sink
- add ref to the buffer passed to the render method of Qt's final sink or the handoff callback of fake-sink
- notify main thread about new frame being available (I guess that's necessary in Qt too, as GStreamer is multithreaded and the rendering is done in another thread, isn't it?)
- render the main application, using glTexDirectVIVMap for unknown texture addresses, followed by a glTexDirectInvalidateVIV call.
- deref the buffer to allow GStreamer to reuse it for decoding
The main problem I see here is that, depending on the performance of the main application thread vs. the decoding thread, GStreamer might have to allocate more and more buffers, because the main application might not release the referenced buffers fast enough.
The other problem might be that GStreamer does not care about the reference count of the buffers passed to the sink at all and reuses them even while they are being used by the rendering thread, which may have strange side effects.
Did I miss anything?
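The buffer lifecycle in the steps above can be sketched with a stub reference count standing in for GStreamer's gst_buffer_ref()/gst_buffer_unref(). The `fake_buffer` type and callback names are made up; whether GStreamer actually honors the count before recycling decoder buffers is exactly the open question in this thread.

```c
#include <assert.h>

/* Stub of the proposed ref/unref lifecycle. */
typedef struct { int refcount; int recycled; } fake_buffer;

static void buf_ref(fake_buffer *b) { b->refcount++; }
static void buf_unref(fake_buffer *b)
{
    if (--b->refcount == 0)
        b->recycled = 1; /* decoder may now reuse the memory */
}

/* Sink callback: keep the buffer alive until rendering is done with it. */
static void on_new_frame(fake_buffer *b) { buf_ref(b); }
/* Render thread, once the frame is no longer displayed. */
static void after_render(fake_buffer *b) { buf_unref(b); }
```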
My first implementation was basically as described above, but copying the data into a flip-buffer texture:
- Create GStreamer Pipeline with GStreamer's fake-sink
- copy the buffer passed to the handoff callback of fake-sink into a texture address previously retrieved with glTexDirectVIV
- notify main thread about new frame being available and swap texture to the "backbuffer", allowing the decode thread to update the texture currently not used for rendering
- call glTexDirectInvalidateVIV after binding the updated texture
- render main application
Although I can't find any logical problem in my first implementation, it suffers from the stuttering if anything else is rendered. If only newly decoded frames are used as the rendering trigger, it renders absolutely fine.
I finally did another test, keeping the pipeline from my first implementation but replacing glTexDirectVIV with glTexDirectVIVMap and my own buffers. Unless I overlooked some other mistake, it makes things even worse. Basically the stuttering is exactly the same, but additionally I get some update artifacts which look like a conflicting read/write operation on the same memory region.
Maybe it's a problem that I update the texture memory from within the GStreamer thread (although I don't use that texture in the current rendering path), and during the eglSwapBuffers call the current "backbuffer" texture is somehow invalidated.
I guess if that's the case, only an expert from Vivante or Freescale can confirm it. I will try to add another buffer managed only by the CPU, to verify that the stuttering does not happen when the texture is not updated in a background thread.
I created a CPU-managed buffer with the size of the texture memory and copied the GStreamer buffer data passed to the fakesink callback into it. I then notify the main thread about the new frame being available and copy the content of that CPU-managed buffer to the pointer I got from glTexDirectVIV, from within the main (OpenGL) thread. With that solution I have a lot of copies, which makes the use of glTexDirectVIV not really better than just using glTexSubImage (although I guess glTexImage2D does not support the NV12 format, does it?), but at least there is no stuttering, and performance for 1080p@24fps seems to be OK (around 15ms for just uploading and rendering the video).
So maybe the documentation of glTexDirectVIV and glTexDirectInvalidateVIV regarding threading could be more detailed in future versions. Currently my impression is that updating a memory area from a thread other than the OpenGL thread is not allowed, even if the texture behind it is not currently used for rendering.
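The double-copy fallback described above, reduced to a single staging slot for illustration (FRAME_SIZE, the function names, and the single-slot simplification are all assumptions; the real code posts an event between threads and calls glTexDirectInvalidateVIV after the second copy):

```c
#include <assert.h>
#include <string.h>

/* The gstreamer thread memcpy's each frame into a CPU-side staging
 * buffer; the GL thread later copies it into the pointer obtained
 * from glTexDirectVIV. Two copies per frame, but no cross-thread
 * writes into texture memory. */
#define FRAME_SIZE 64 /* illustrative; really width*height*3/2 for NV12 */

static unsigned char staging[FRAME_SIZE];
static int frame_pending = 0;

/* gstreamer thread (fakesink handoff callback) */
static void on_handoff(const unsigned char *frame)
{
    memcpy(staging, frame, FRAME_SIZE);
    frame_pending = 1; /* real code: post an event to the main loop */
}

/* GL thread: copy into the glTexDirectVIV pointer, then invalidate */
static int upload_if_pending(unsigned char *viv_ptr)
{
    if (!frame_pending)
        return 0;
    memcpy(viv_ptr, staging, FRAME_SIZE);
    frame_pending = 0;
    /* real code: glTexDirectInvalidateVIV(GL_TEXTURE_2D); */
    return 1;
}
```

In real code the pending flag would need proper synchronization (a mutex or atomic) between the two threads.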
The 'final solution' you described above is exactly what I'm doing (I only had to implement the last two steps; the rest is already in place within QtMultimedia).
The reason I said unmapping (/de-ref'ing) right away is possibly the worst thing to do is that I still believe (or rather hope) that gstreamer doesn't blindly reuse the buffers however it likes ... which would mean one should unmap only after(!) the texture which uses this data isn't visible anymore (because a new one is available) .... unmapping (/de-ref'ing) right away says: "hey gstreamer, I don't need this data anymore, do with it what you want", while one has only just(!) started to display that data
.. this can be fixed easily
I agree that the documentation isn't really complete in regards to multi-threading :smileyhappy: ... I hope someone from Freescale will comment on this thread eventually :smileyhappy:
... I've got some contacts I reached out to .. if they tell me anything I'll paste it in this thread.
I don't want to copy the buffer at all ... it's not efficient .. and avoidable! ... otherwise it's (as you said) just a slightly optimized version of glTexSubImage!
15ms .. that's very close to 16.666ms, which is the complete frame budget at 60fps :smileyhappy: ... I assume you're using a dual or quad? .. 15ms is not what you'll get on a Wandboard Solo (which is what I'm after in the end)
... so (unless I've overlooked anything) glTexDirectVIV is not an option for me (as it always requires a CPU-based memory operation, doesn't it?)
I'm not Thomas, but for me it's not solved, as performance is not optimal when we're not able to use the memory pointer of glTexDirectVIV directly in a background thread.
The solution using glTexDirectVIVMap would be even better. It would be nice if there were an example that shows gstreamer decoding with glTexDirectVIVMap, checks for thread safety, and allows zero-copy rendering as an OpenGL texture.
Using gstreamer's gl plugins would probably not be a good solution, as sharing textures with gstreamer is not very convenient to implement in custom applications.