Help with i.mx6q YUV streaming

ivankozic
Contributor IV

Hi all,

I have a problem that I can't seem to solve on my own, so any help is greatly appreciated. I am currently streaming YUV422 from an Aptina AR0330 over the parallel interface to an i.MX6q (iWave board), and I am having trouble figuring out how to use the IPU properly for this case. To start with, I need to get a Full-HD image onto the HDMI display. The resolution coming from the sensor is always 1920x1080 @ 60 fps. I am using the 3.0.35-4.0.0 kernel and the mxc_v4l2_overlay.out unit test for this.

So my current situation is:

1. The YUV stream is being read properly and I get a good image when using fb1 (the foreground FB), as this is the YUV framebuffer. However, as far as I can see it is mainly meant for preview, so the resolution is limited to 240x320, which is quite low and not really usable. The image itself is perfect, though.

2. fb0 does not work, as it is an RGB24 framebuffer and I am sending a YUV stream to it. At first I thought the IC might be skipped (I saw some threads here regarding IDMAC handling NV12):

Capturing YUV420 from MIPI CSI camera for VPU compression

However, it is not clear from that discussion whether the IC can really be skipped. As far as I can see, there has to be some sort of color space conversion before the data reaches the framebuffer (and CSC is done in the IC, so it cannot really be skipped, can it?).

So in general, I have difficulty understanding the exact difference between these two cases. I know that in case 2 the CSI grabs the data and writes it to memory, and afterwards it is fetched and sent to the display controller. However, I don't understand the preview path with the YUV framebuffer - is there some sort of direct path in the IPU just for this preview functionality? If so, is the resolution so low because of a bandwidth limitation on this path?

I have also seen that there are two possible options for Overlay routing in menuconfig:

1. Queue IPU device for overlay library,

2. Pre-processor VF SDC library.

of which I am using 1. It seems to me that the fb1 path is actually this Pre-processor VF SDC path, but I am not really sure.

So if anyone could explain these in just a few words, I would be very grateful. I suppose the ideal way would be to activate the IPU for CSC and use fb0 for Full-HD, but I am not 100% sure of this.

Thanks in advance!

evgenymolchanov
Contributor III

Hello Ivan. At the beginning of this post you wrote: "I am currently streaming YUV422 from Aptina AR0330 using parallel interface with i.MX6q (iWave board)."

My question is: How did you get YUV from this sensor?

ivankozic
Contributor IV

I am not really sure that Freescale makes the distinction here, but V4L certainly does (you can see it in the driver code), so I will quote them:

Capture:

"The video capture interface grabs video data from a tuner or camera device. For many, video capture will be the primary application for V4L2. Since your editor's experience is strongest in this area, this series will tend to emphasize the capture API, but there is more to V4L2 than that."

Overlay:

"A variant of the capture interface can be found in the video overlay interface, whose job is to facilitate the direct display of video data from a capture device. Video data moves directly from the capture device to the display, without passing through the system's CPU."

The i.MX unit tests for V4L are also based on this - there are "mxc_v4l2_capture" and "mxc_v4l2_overlay". Capture uses the Sensor->CSI_MEMx case from the V4L/IPU drivers, while Overlay uses some variation of PRPVF (either Sensor->MEM->IC or Sensor->IC->MEM or similar). Capture uses the SMFC to write to system memory, while Overlay uses the IC to write to system memory - at least that is what I've seen in the code (they use different IDMAC channels).

Capture is nice and would do the job just fine if I didn't need HDMI, but I do need it, and therefore I need Overlay or some variant of it anyway.

Also, I've just realized that my previous post could be misinterpreted - the requirement is 60 fps on HDMI, but only 1080p30 for H.264, as the i.MX's VPU cannot encode 1080p60 H.264.

But the VPU is not really important at the moment - I spent the whole day yesterday trying to turn off the IC and did not succeed. I need to try out a special case where the Overlay flow is completely altered so that the following pipeline becomes possible:

Sensor->CSI->CSI_MEM, CSI_MEM->IPU(DP - CSC) or even CSI_MEM->HDMI (CSC).

It is not really clear from the RM whether this is possible, but I will certainly give it a go. I have killed the IC init function, but this does not help - the data goes through the IC either way, it just doesn't do what it should (no CSC). There are too many abstraction layers in Overlay (Capture is much simpler somehow) and simple patching will not work here - as I said, the complete Overlay flow needs to change for this to work.

DraganOstojic
Contributor V

I think you can do something like this:

1) use capture mode to get a buffer directly into memory from the sensor; check the image by saving it to a file

2) (optional) if you need to do CSC here, invoke the IPU multiple times to get the full frame converted, or invoke the GPU to do the same thing; check the image by saving it to a file

3) run your encoder on this frame and wait until it has completed

4) run the GPU to transfer the image into video memory

I have examples for steps 2) and 4).

I'm convinced this approach should work. In the first iteration performance might be an issue, but I think that with some kind of buffer management things could be made to work in parallel.
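For step 1, a quick way to check the captured data is a small standalone V4L2 grabber that writes one raw frame to disk. This is only a minimal sketch under a few assumptions not stated in the thread: the capture node is /dev/video0, the driver accepts YUYV at 1920x1080 with mmap streaming I/O, and a single buffer is enough (some capture drivers want more queued).

    /* Minimal sketch: grab one frame via V4L2 and dump it to a file for inspection. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        int fd = open("/dev/video0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct v4l2_format fmt;
        memset(&fmt, 0, sizeof(fmt));
        fmt.type                = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        fmt.fmt.pix.width       = 1920;
        fmt.fmt.pix.height      = 1080;
        fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
        fmt.fmt.pix.field       = V4L2_FIELD_NONE;
        if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0) { perror("VIDIOC_S_FMT"); return 1; }

        struct v4l2_requestbuffers req;
        memset(&req, 0, sizeof(req));
        req.count  = 1;
        req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0) { perror("VIDIOC_REQBUFS"); return 1; }

        struct v4l2_buffer buf;
        memset(&buf, 0, sizeof(buf));
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index  = 0;
        if (ioctl(fd, VIDIOC_QUERYBUF, &buf) < 0) { perror("VIDIOC_QUERYBUF"); return 1; }

        void *mem = mmap(NULL, buf.length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, buf.m.offset);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        enum v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)      { perror("VIDIOC_QBUF");     return 1; }
        if (ioctl(fd, VIDIOC_STREAMON, &type) < 0) { perror("VIDIOC_STREAMON"); return 1; }
        if (ioctl(fd, VIDIOC_DQBUF, &buf) < 0)     { perror("VIDIOC_DQBUF");    return 1; }

        /* Dump the raw frame so it can be inspected with a YUV viewer on a PC. */
        FILE *out = fopen("frame.yuv", "wb");
        if (!out) { perror("fopen"); return 1; }
        fwrite(mem, 1, buf.bytesused, out);
        fclose(out);

        ioctl(fd, VIDIOC_STREAMOFF, &type);
        munmap(mem, buf.length);
        close(fd);
        return 0;
    }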

ivankozic
Contributor IV

Yeah, something similar was my idea as well. I got the new board yesterday (1.2 silicon) and the VPU works on it :) I successfully compressed video from the camera with H.264 yesterday. I still have issues getting it to stream over the network (maybe I am even configuring VLC incorrectly) - TBD :)

BUT! The video was, of course, bad. And I think I now realized why.

So the current situation is like this:

1. Capture uses the direct path to memory, CSI->MEM (fmem). These are IDMAC channels 0-3.

2. Overlay with option 1 from menuconfig (Queue IPU library...) uses the same path first, but afterwards initializes the IC (so CSI->MEM, then MEM->IC for preprocessing).

3. Overlay with option 2 from menuconfig (VF processing) uses a different path (CSI->IC->MEM), so RGB data is written to memory, not YUV.

In cases 1 and 2 I get garbled data (the output is just green/blue white noise) - in fact, every time the image goes to MEM as YUV I get bad data. I've looked at the capture file and the YUV values are bad (V is always 0, for one), so something bad is going on in the CSI->SMFC->IDMAC->MEM path. Somewhere there is a bug, and that is the one I'm chasing now. There are also a couple of problems with this - in several places I've read that the IDMAC always makes YUV444 from YUV422 when capturing, and I've also read that it makes YUV420. So I have to go through this very carefully and consult the community about the issues. Case 3 is fully functional, but it runs into the IC limitations.
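Since the symptom is that V is always 0 in the captured file, a tiny offline check like the one below can tell whether the chroma samples survive the CSI->SMFC->IDMAC->MEM path at all. It is only a sketch: the file name and frame size are placeholders for whatever the capture test wrote out, and it assumes YUYV (Y0 U Y1 V) ordering.

    /* Scan a raw YUYV dump and report the U/V value ranges. */
    #include <stdio.h>

    int main(void)
    {
        const long width = 1920, height = 1080;     /* placeholder frame size */
        unsigned char px[4];                        /* one macropixel: Y0 U Y1 V */
        int umin = 255, umax = 0, vmin = 255, vmax = 0;

        FILE *f = fopen("frame.yuv", "rb");         /* placeholder file name */
        if (!f) { perror("fopen"); return 1; }

        for (long i = 0; i < width * height / 2; i++) {
            if (fread(px, 1, 4, f) != 4)
                break;
            if (px[1] < umin) umin = px[1];
            if (px[1] > umax) umax = px[1];
            if (px[3] < vmin) vmin = px[3];
            if (px[3] > vmax) vmax = px[3];
        }
        fclose(f);

        /* For a real scene the chroma should vary; U or V stuck at 0 points
         * at the capture path rather than at the sensor. */
        printf("U range: %d..%d   V range: %d..%d\n", umin, umax, vmin, vmax);
        return 0;
    }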

Incidentally, option 1 from menuconfig uses the IPU library, where split mode is implemented. Option 2 only uses the IPU preprocessing path, without the possibility of using split mode.

Regarding your approach - I would avoid using the IC/IPU to do CSC. As far as I can see, CSC is not needed for the VPU, as it needs to be fed YUV420 anyway. For HDMI, I would rather use the DP to do the conversion, or if that fails, configure the HDMI core for it. As always, since I'm still new to Linux, examples are always welcome. The GPU CSC still seems quite interesting. By saying you "have examples", do you mean you're willing to share them :) ?

Oh, and big thanks - you've really been very helpful through all this.

ivankozic
Contributor IV

Update:
I found the error - the solution is actually similar to yours :) though the explanation is a bit longer.

So... I was using 10-bit capture with BT.1120. 10 bits are fine when the IC is used right after capture, as the data is automatically cut down to 8 bits per RGB color in the IC, and the IDMAC successfully writes this to and reads it from memory (and therefore the image is nice and pretty on the screen with Overlay). The problem is when the IC is not used (as can be seen in the Capture case). First I thought there was a bug somewhere in the IDMAC init or even the SMFC, and that it garbles the data when it's not 8-bit aligned. But looking at the CPMEM config, there are really no options for a 10-bit/component transfer - only 8, 16, 18 and 24 bits. Then I realized that even though the data should be extended to 16 bits by the packing unit in the CSI/SMFC (Table 37-17, page 2762 of the RM), for some reason it isn't, and the data gets garbled. I saw the same behaviour with MIPI. Similarly questionable information can be found on page 3246 of the RM - bit 7 of IPUx_CSI0_SENS_CONF - read the description for when this bit is 0.

The current workaround is to use 8 bits for the data (ipu_capture.c is generally very badly written and most of the bugs I've encountered come from there - 10-bit BT.1120 is completely useless if this file is not modified). It is not really a nice workaround (I'm losing even more quality than I should), but it seems to work. I think you're using the same workaround, aren't you?

Either way, the aforementioned table from the RM is wrong, as the packing unit does not work according to it - maybe it is different with tight packing, but I haven't tried that case yet. For now, I have successfully tested capture (the CSI->MEM case) and I can verify that it works. So if anyone from Freescale is following this thread, please explain why your packing unit is not working according to the spec.

I have also tried out several other IDMAC configurations and none work (according to the table, bpp should be 32 for 20-bit YUV and regular packing, but it isn't)...

DraganOstojic
Contributor V

I didn't try 10-bit YUV over the BT.1120 standard, and given that there is no example code it might be a challenge. On the MIPI interface I capture YUV422 8-bit, and on the parallel interface I capture 16-bit (still images) and 8-bit (preview mode) generic data. The only difference between the 8-bit and 16-bit generic cases is the IDMA CPMEM burst size = 64 (8-bit case) versus burst size = 8 (16-bit case), and the line has 2x more bytes. Bit 7 in IPUx_CSI0_SENS_CONF is 0 in 16-bit generic data mode because this bit is not applicable to generic data. I'm thinking now that you could avoid the complications of capturing as BT.1120 YUV if you could get your camera to output 10-bit YUV in gated mode. If you capture your data as generic, the IDMA will place the data into memory without touching it. If you need to transform it, you could do that yourself.

I can give you the code for generic data capture in 8-bit and 16-bit parallel gated mode, for the MIPI camera in YUV422 8-bit mode, and the GPU code. Let me know what you want.

ivankozic
Contributor IV

Hmm, there should not be any difference - my "10-bit" mode is in fact BT.1120, so it is basically 20 bits per cycle, 10 bits per component. According to this, the pipeline should be:

1. IPU/CSI - capture 20bit (10bit/component) data.

2. IPU/CSI packing unit - regular packing (since bit 7 of IPUx_CSI0_SENS_CONF is 0) => place the 10 bits as the MSBs of 16-bit words (so we need 32 bits per pixel, as the components are extended to 16 bits),

3. SMFC + IDMAC - transfer everything to RAM.

It should be quite straightforward, but it isn't. So, currently I am using 16-bit capture (8 bits per component) and that works; I am doing it this way because step 2 doesn't really work for some reason. Currently the data is set as YUYV, but with MIPI I had the same issue when I set the data to Generic 12-bit. It was not possible to get any good results (always some bit/word slipping unless everything is 8-bit). So there is something wrong with this packing unit.
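For reference, if the packing unit did behave as Table 37-17 describes, each 10-bit component would sit MSB-aligned in a 16-bit word, and unpacking it in software would be trivial. The sketch below assumes exactly that layout - which is the part that does not seem to happen in practice - so it is only a description of the expected format, not a workaround.

    #include <stdint.h>
    #include <stddef.h>

    /* A 10-bit component MSB-aligned in a 16-bit word: significant bits are 15..6. */
    static inline uint16_t unpack10(uint16_t word)
    {
        return word >> 6;
    }

    /* Dropping to 8 bits per component is just taking the top byte. */
    static inline uint8_t to8bit(uint16_t word)
    {
        return word >> 8;
    }

    void unpack_line(const uint16_t *in, uint16_t *out, size_t components)
    {
        for (size_t i = 0; i < components; i++)
            out[i] = unpack10(in[i]);
    }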

I have also seen that the IDMAC can do some packing (ipu_param_mem.c or .h, I forget - I'm in Windows again), but according to the RM this is reserved for RGB formats (it states this at the end of the CPMEM-config-for-interleaved-mode table). However, if you look at the source file (ipu_param_mem), it is also done for YUV444 formats, and I am not sure now whether this is actually how the packing is done. Hmm...?

Anyway, my 16-bit parallel capture is working, so I think I am good for now (I would like to see 20-bit capture work, though). Other than that, the GPU code would be really cool to see, if you don't mind of course.

ivankozic
Contributor IV

Minor update: I currently have 16-bit YUYV working (8-bit Y, 8-bit subsampled U/V). It works for both options 1 and 2 of the Overlay routing from LTIB. So, IPU split mode works out of the box with 8-bit/component precision, which is not that much news.

I can verify that split mode is almost useless:

1. The upper part of the frame is processed with _ipu_ic_init_prpvf from ipu_ic.c, which also inits the IC to do CSC,

2. The lower part of the frame is processed with _ipu_ic_init_pp from ipu_ic.c, which also inits the IC to do CSC.

If you comment out the _init_csc call at the right place in the file (where in_fmt is YUV and out_fmt is RGB), you get wrong colors in the corresponding part of the frame. So now everything is clear.
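For reference, what the _init_csc stage sets up for YUV in / RGB out is a standard matrix conversion. The sketch below uses the common BT.601 limited-range coefficients; the actual matrix the driver loads into the IC may be scaled differently, so treat this only as a reference for checking colors, not as the driver's exact table.

    #include <stdint.h>

    static inline uint8_t clamp8(int v)
    {
        return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
    }

    /* BT.601 limited-range YUV -> RGB, fixed point (coefficients scaled by 256). */
    void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                    uint8_t *r, uint8_t *g, uint8_t *b)
    {
        int c = y - 16, d = u - 128, e = v - 128;

        *r = clamp8((298 * c           + 409 * e + 128) >> 8);
        *g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8);
        *b = clamp8((298 * c + 516 * d           + 128) >> 8);
    }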

The best way to clean up this whole mess is to manually program a case where VF goes CSI->MEM->DP(CSC)->HDMI, or CSI->MEM->DP->HDMI(CSC), or CSI->MEM->GPU(CSC)->MEM->DP. From memory, the pipeline would also branch off to the VPU.

I am not really sure the whole pipeline is even possible, but it would be interesting to try it out. I am not sure how the data is transferred to the GPU or VPU (the IDMAC should not be relevant here, as it only does the CSI->MEM transfer, and eventually MEM->DP), which is why I am eagerly awaiting your GPU example and running through the RM like crazy :)

ivankozic
Contributor IV

Update: I have downloaded, compiled and installed the GPU examples (GPU SDK). However, I have issues with it:

1. I can't run some of the examples due to a kernel panic (I don't really know why it panics). The issue:

kernel BUG at arch/arm/mm/dma-mapping.c:478!
Unable to handle kernel NULL pointer dereference at virtual address 00000000

2. A kernel panic also happens after some time on the examples I can run (01, 02 and 07, for instance).

3. The colors are wrong on the screen for the examples I can run - everything is black/white and somehow striped (vertical stripes). It looks as if RGB16 is being output to an RGB24 framebuffer or similar.

I have tried several solutions, but none of them worked, as the OpenGL ES abstraction layer is too high for my taste and I haven't gone through all the documentation yet. I tried:

1. Changing s_configAttribs - they were set to RGB565, so I set them to RGB888, but it made no difference whatsoever (see the sketch after this list),

2. Testing the DirectFB stuff - I think I've read somewhere that the OpenGL ES driver invokes it for display, so I installed the examples (DirectFB-examples - df_neo, df_andi), but even though I fixed some things (added a directfbrc), I cannot make it run above 640x480, and it doesn't seem to make any difference to the GPU examples, so that's not it, I guess.
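For point 1, an RGB888 request in an EGL attribute list generally looks like the sketch below (the exact contents of s_configAttribs in the SDK samples may differ - this is only the general shape, not the SDK's code). The framebuffer depth itself (the bpp configured for the display) may also need to match, otherwise the output can still look like 16-bit data pushed into a 24/32-bit surface.

    #include <EGL/egl.h>

    /* Request a config with at least 8 bits per color channel. */
    static const EGLint s_configAttribs[] =
    {
        EGL_RED_SIZE,     8,
        EGL_GREEN_SIZE,   8,
        EGL_BLUE_SIZE,    8,
        EGL_ALPHA_SIZE,   EGL_DONT_CARE,
        EGL_DEPTH_SIZE,   16,
        EGL_SURFACE_TYPE, EGL_WINDOW_BIT,
        EGL_NONE
    };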

The question is - does anyone else have issues with the GPU examples? I don't really know how to fix this.

ivankozic
Contributor IV

I have a small update - I have found the IC limitation, and it doesn't end at 1024x1024; that is only one limitation. The next one is the output pixel rate, which is 100 Mpix/s. This is quite low for me (the system should be able to do FHD at 60 fps). I'm still puzzled by all this - I cannot get more than 1024x835 at the output at 60 fps, which is somewhere around 50 Mpix/s (a bit above - true XGA is a bit under), so the numbers don't match (neither the width/height nor the fill rate). There is something wrong with the IC. My input rate is well below 200 Mpix/s (also an IC limitation).
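Just to put numbers on this (plain arithmetic, nothing i.MX specific):

    #include <stdio.h>

    int main(void)
    {
        double fhd  = 1920.0 * 1080 * 60 / 1e6;  /* ~124.4 Mpix/s - above the 100 Mpix/s IC output limit */
        double seen = 1024.0 *  835 * 60 / 1e6;  /* ~51.3 Mpix/s  - where the output actually tops out   */
        double xga  = 1024.0 *  768 * 60 / 1e6;  /* ~47.2 Mpix/s  - "true XGA is a bit under"             */

        printf("FHD@60: %.1f   observed max: %.1f   XGA@60: %.1f (Mpix/s)\n", fhd, seen, xga);
        return 0;
    }

So FHD at 60 fps needs roughly 124 Mpix/s, which is already above the documented 100 Mpix/s output limit on its own; the puzzle is why the output gives up at roughly half of that limit.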

However, big thanks to Dragan Ostojic and his post:

https://community.freescale.com/message/340548#340548

it seems that the GPU is able to do color conversion, but I guess this will involve more work - I have yet to explore the GPU, but it seems like a good idea.

BUT! The RM is not clear about this IC/IPU split mode - it states that bigger frames can be rendered by splitting them into vertical stripes:

Chapter 9.2.2.3, page 482 of RM:

Wider frames can be processed by the IC by splitting them to vertical stripes.

It is not really clear how to achieve this in the IPU. I would still prefer to use the IC/IPU, as it is easier and does not require additional work. If it's not possible, I guess I will have to use the GPU for all of this...

I hate to use "bump", but any comments are welcome here :) - I would really like to know what others think of all this.

DraganOstojic
Contributor V

Ivan, Freescale says that they support striping mode in the driver, and indeed it appears they do if you look into the code, but I couldn't get it working, nor is there any example. I think there is some bug in that code, so I decided not to use it. Instead, you can get the same result if you divide a picture that exceeds the 1024x1024 output size into pieces (with proper alignment) and call into the IPU driver repeatedly. In my particular case I used the programmable color space conversion of the IPU block to take gray scale in and produce RGB out. That was in essence an RGB->RGB conversion, but you can do other combinations like YUV->RGB etc. If you're interested I can post the code here. In my case the performance was 70 ms to color-space-convert 2592x1944 from gray scale into RGB (expanding it from 1 byte to 3 bytes and optionally zeroing some of the color components). I think this time can be halved to 35 ms by passing two segments to the IPU, because the IPU has 2 h/w blocks and the driver will use them in parallel if one block is idle. I didn't try this because I now use the GPU, which is a lot faster.
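A sketch of the tiling idea is below. convert_tile() is a hypothetical placeholder for whatever performs one sub-frame conversion (for example, one task submitted to the IPU driver); the only real content here is the loop that keeps each piece at or below the 1024x1024 IC output limit, with widths kept 8-pixel aligned, which YUV formats typically need.

    #include <stddef.h>

    #define MAX_TILE 1024
    #define ALIGN    8

    /* Hypothetical helper: convert one w x h region located at (x, y) of the frame. */
    extern int convert_tile(int x, int y, int w, int h);

    int convert_frame(int width, int height)
    {
        int tile_w = (MAX_TILE / ALIGN) * ALIGN;    /* 1024, already a multiple of 8 */
        int tile_h = MAX_TILE;

        for (int y = 0; y < height; y += tile_h) {
            int h = (height - y < tile_h) ? (height - y) : tile_h;
            for (int x = 0; x < width; x += tile_w) {
                int w = (width - x < tile_w) ? (width - x) : tile_w;
                if (convert_tile(x, y, w, h) != 0)
                    return -1;                      /* one sub-frame conversion failed */
            }
        }
        return 0;
    }

With 1024x1024 tiles, a 1920x1080 frame ends up as four pieces; the exact split (vertical stripes versus a grid) does not matter as long as each piece stays within the limit.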

ivankozic
Contributor IV

Hmm... That seems interesting, although 70 ms is quite slow. Even 35 ms is slow for me (I need FHD at 60 fps, so 16 ms per FHD frame would be great). But I am still interested in the code - my IPU driver also has some mods (as it was poorly written by Freescale anyway), and I would like to see your code if that's not a problem. By the way, if I may ask - do you use some profiling functions to get the frame processing times?

Also, encouraged by your GPU post, I did some further reading, but I didn't have enough time to get into the code more seriously. I also think it would involve much more work than the IPU (I've never done any GPU programming, especially not OpenGL), so I should probably push the IPU to its limits first.

I am also trying to figure out the current limitation (basically I am maxing out at half of the quoted IPU bandwidth with XGA at 60 fps). Also, my VPU is quite rebellious, as it refuses to initialize => a lot of work :)

Oh, and thanks for replying :)

DraganOstojic
Contributor V

I didn't touch the IPU driver except to allow loading custom color space conversion coefficients, in order to expand gray scale to RGB and clear some color channels. I could have done the expansion differently (as a YUV to RGB conversion) without changing the IPU code, but for clearing color channels I needed a custom CSC matrix.

Are you trying to convert YUV422 to YUV420 for the VPU to process? Can you check whether the IPU supports this conversion? If it doesn't, you can apply a version of my change, but if it does, you probably don't need my change.

For feeding parts of the image to the IPU to avoid going over the 1024x1024 limit, check the following post https://community.freescale.com/thread/306462 and the update from Apr 29, 2013 3:08 PM. Let me know if it's not clear and I'll send you the files I changed directly.

For profiling I use the following code:

In the header file:

    #include <sys/time.h>   /* struct timeval, gettimeofday() */

    #ifdef PROFILE_CODE

        #define init_profile() \
            struct timeval begin, end; \
            int sec, usec;

        #define start_profile() \
            gettimeofday(&begin, NULL);

        #define end_profile(run_time) \
            gettimeofday(&end, NULL); \
            sec = end.tv_sec - begin.tv_sec; \
            usec = end.tv_usec - begin.tv_usec; \
            if (usec < 0) \
            { \
                sec--; \
                usec = usec + 1000000; \
            } \
            run_time = (sec * 1000000) + usec;

    #else

        #define init_profile()
        #define start_profile()
        #define end_profile(run_time)

    #endif

In the implementation file:

    #define PROFILE_CODE    /* define this before including the profiling header */

        init_profile();
        int run_time = 0;

        start_profile();
        ...                 /* code being measured */
        end_profile(run_time);

After that you can print the run_time value to the console.

ivankozic
Contributor IV

Wow, a lot of info :) Let's go one step at a time:

My ideal situation would be to have H.264 over Ethernet and HDMI out at the same time (FHD at 60 fps at least). The thing is, I am struggling with HDMI. The VPU should be no problem, but it appears that it is, since it doesn't want to initialize on my system (I have the iWave board and a proto sample of the i.MX - T1.0). Some of the issues I am experiencing are because this is a prototype IC.

Thanks for the profiling function - I'll try it out today.

I have also made some additional discoveries - maybe you are interested in those as well:

1. It appears that there is no way to avoid using the IC in the Overlay case. The Capture case is different, as I know the IC was not used when I used Capture only. I am not 100% sure of this and will try to look into it today.

2. I think I saw split mode yesterday - it is really strange and I think something is messed up in the driver, since it doesn't work as it should. Basically, with Overlay routing set to option 1, the IPU uses this mode if the image is larger than 1024x1024. At first glance it works rather crudely - a preprocessing function (something like _ipu_prpvf_...) is called for the "upper" half of the FHD frame, while a postprocessing function (something like _ipu_pp_...) is called for the "lower" half of the frame. I'm not currently in Linux, so I don't have the exact names. Also, the FHD frame is split into only two sub-blocks - so it is not that crude (2*1024*1024 is just a bit more than a single 1920*1080) - but it doesn't work properly: even though _init_csc is called, it doesn't take effect and I get YUV interpreted as RGB at the output. I figured this out once I saw that for option 1 of the Overlay routing, init_ic() is called all the time, while with option 2 it is called exactly once.

3. I have found at least two more ways to do the CSC, which would be quite useful if I could get rid of the IC - the DP can do CSC, and the HDMI core can do it. For instance, I've tried out the DP - it seems to work, but in a weird way; I still need to see whether it can be used in any constructive way. HDMI should be no problem in theory (a lot of Silicon Image HDMI ICs support YUV input), but the implementation could be hard - somehow I sense that Freescale did not implement this in their driver :)

Anyway, off to work - I'll try out your code today and post back. Thanks.

DraganOstojic
Contributor V

I see, you want to do VPU processing on the buffer while at the same time HDMI is displaying from the same buffer. Can you explain what exactly "overlay mode" and "capture mode" are?

ivankozic
Contributor IV

Quick update:

I seem to have missed the proper uImage last time (too many recompilations :) ). The implications:

1. fb0 and fb1 in fact behave the same - I have color conversion issues when using menuconfig option 1, which is quite logical, as the IC is off and there is no preprocessing (no color conversion). I am now almost certain that it is not possible to do color conversion without the IC.

2. The use case where almost everything works is option 2 in menuconfig - this is where the screen (fb0/fb1) is actually used as a viewfinder or preview, whatever this mode is called. I get a very pretty picture in this case, but with one major flaw.

This major flaw is that the resolution cannot go past a certain size:

- XGA works perfectly, and if I use a line width of 1024, the image height can go a bit higher than 768 (800 is possible, 1000 is not),

- also, some widths above 1024 (1025-1200) work with a lower image height (320).

Otherwise (when using 1920x1080, for instance), I just get a black image and a lot of IPU warnings about erroneous EOFs.

At first I thought this was the famous IC limitation where the output resolution is limited to 1024x1024 (this is mentioned in the link in my original post), but I am not sure now, because:

1. I have issues finding this limitation in the newest documentation (maybe it is there, but I can't really find it),

2. The output size is not really limited to one specific line width or height, as I said above,

3. A 1024x1024 resolution does not work - as I said, with a line width of 1024 the height needs to be less than 1000 (I'm not sure of the exact value, but 1024x1024 does not work).

In the link from the original post, some sort of IPU split mode is also mentioned as the way to get past this 1024x1024 limitation, but I don't really know how to activate it (although I have yet to search for it in the source files).

To sum it up with a few questions:

1. Why is the IC limiting my resolution?

2. If it is because of the IC limitation, how do I get past it (how do I activate split mode)?

3. If it is not because of the IC limitation, can anyone help explain why?
