Hi Alexander,
This could work for you, except for one thing: the VPU encoders generally want an array of frame-buffers, and the X11 drivers don't currently support even double-buffering.
The encoders generally need at least two, and often three or more, frames in order to build B-frames (bidirectionally predicted frames, which sit between their reference frames) inside a GOP (Group of Pictures).
Even if they did support this (as Wayland does), coordinating ownership of the buffers (releasing them back to the display once the encoder is done with them) would be tricky at best.
You might consider a simpler approach: DMA from the frame-buffer into your next encode buffer at each vertical sync
(or perhaps at every other one, to get 30fps from a 60fps display). If set up properly, either the IPU or the GPU can do the DMA transfer for you.
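
To make the copy-per-vsync idea a bit more concrete, here is a rough sketch in plain C. It is only an illustration under assumptions: the names (capture_ring, on_vsync, copy_frame, RING_SLOTS, VSYNC_DIVIDER) are made up for this example, and memcpy stands in for the IPU or GPU transfer, which would go through whatever driver interface your BSP provides. The point is the bookkeeping: the display keeps its single frame-buffer, and the encoder gets its own small ring of buffers to hold on to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS    4  /* hypothetical: more slots than the encoder holds at once */
#define VSYNC_DIVIDER 2  /* capture every 2nd vsync: 60 Hz display -> 30 fps */

struct capture_ring {
    uint8_t *slot[RING_SLOTS]; /* encode buffers, owned by the capture/encode side */
    size_t   frame_bytes;      /* size of one frame in bytes */
    unsigned next;             /* slot the next capture will fill */
    unsigned vsync_count;      /* number of vsyncs seen so far */
};

/* Stand-in for the IPU/GPU DMA transfer (assumption: a plain memcpy here). */
static void copy_frame(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/*
 * Call once per vertical sync with a pointer to the display frame-buffer.
 * Returns the index of the slot that was just filled, or -1 if this
 * vsync was skipped because of the divider.
 */
static int on_vsync(struct capture_ring *ring, const uint8_t *fb)
{
    assert(ring != NULL && fb != NULL);

    unsigned tick = ring->vsync_count++;
    if (tick % VSYNC_DIVIDER != 0)
        return -1; /* skip this vsync */

    int filled = (int)ring->next;
    copy_frame(ring->slot[filled], fb, ring->frame_bytes);
    ring->next = (ring->next + 1) % RING_SLOTS;
    return filled;
}
```

Because each frame is copied out, nothing ever has to be released back to the display. The sketch does not track which slots the encoder is still using, so the ring has to be large enough that a slot is not overwritten while the encoder still needs it as a reference.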