Hi Sachin,
From your description, you need to render text into each (presumably YUV) video frame before it is handed to the video encoder, producing an encoded stream that carries the text overlay.
There is hardware support in the IPU for layers in the display subsystem, but it doesn't sound like that's what you need.
Instead, you simply need something that renders text into the YUV buffers. The GPU could accelerate the drawing primitives or the blending here,
but text rendering is not an expensive operation, so that is likely overkill.
GStreamer provides a text overlay element (`textoverlay`, which renders via Pango) that may do what you're after, though I haven't used it myself.
Be aware that it may also disrupt the buffer allocations in the pipeline (e.g. force a copy out of hardware buffers).
What you're after is a pipeline like this:
Input (camera or decode output as YUV) -> textoverlay -> VPU
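As a rough sketch of that pipeline, assuming GStreamer 1.x: the element names below are assumptions that depend on your BSP (the camera device path, caps, and especially the VPU encoder name, which on recent i.MX BSPs is often exposed as `vpuenc_h264`), so substitute whatever `gst-inspect-1.0` shows on your board:

```shell
# Hypothetical sketch: capture YUV frames, burn in a text overlay,
# encode on the VPU, and write the result to a file.
# Replace the v4l2src device, caps, and VPU encoder element name
# with what your BSP actually provides.
gst-launch-1.0 v4l2src device=/dev/video0 ! \
    video/x-raw,format=NV12,width=1280,height=720 ! \
    textoverlay text="Camera 1" valignment=top halignment=left \
        font-desc="Sans, 24" ! \
    vpuenc_h264 ! h264parse ! matroskamux ! \
    filesink location=out.mkv
```

If the overlay step ends up copying frames into system memory, you'll see it as a jump in CPU load; that's the buffer-allocation disruption mentioned above.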
You need to make sure that the hardware buffers provided on the input side are still handed off to the VPU, i.e. that the overlay step doesn't silently introduce a copy through system memory.