Sorry, but I don't have any such example.
I almost always use the HAL provided with the SDK, which usually comes as a library (in source code). With optimization settings higher than "none", the extra effort of going bare-metal gains very little.
As mentioned in another of your threads, you could use code from an SDK example as a starting point, and strip it down. And, as also mentioned, the gain in speed rarely justifies the increased development effort. With proper optimization and linker settings, there is often no measurable difference in code size either (if you stay away from expensive clib functions like printf).
Other important sources for bare-metal work are the user manual / reference manual, which documents the peripheral units and their registers, and the device-specific header, which provides the defines, structs and macros for accessing them. My project targets an LPC54628, so this file is "LPC54628.h" in my case. At more than 20,000 lines it is not a small one, though.
However, I think your issue is one of plain C language and logic, not of register access.
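To give an idea of what "structs and macros for the access" means, here is a rough sketch of the usual pattern: a struct that mirrors the register layout, and a macro that yields a pointer to it. All names here (DEMO_GPIO_Type, DEMO_GPIO, demo_pin_write, demo_pin_read) and the two-port layout are made up for illustration and are not the contents of "LPC54628.h". A static object stands in for the hardware address so the sketch also compiles on a PC.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified peripheral layout. A real device
   header defines many more registers; take the real layout from the
   user manual and the device header. */
typedef struct {
    volatile uint8_t B[2][32];   /* one byte register per port pin */
} DEMO_GPIO_Type;

/* On a target, the macro would cast a fixed base address taken from the
   manual, along the lines of
       #define DEMO_GPIO ((DEMO_GPIO_Type *)PERIPHERAL_BASE_ADDRESS)
   Here a static object stands in for the hardware (assumption). */
static DEMO_GPIO_Type demo_gpio_storage;
#define DEMO_GPIO (&demo_gpio_storage)

/* Hypothetical helper: write a pin level through the struct overlay. */
static void demo_pin_write(unsigned port, unsigned pin, uint8_t level)
{
    DEMO_GPIO->B[port][pin] = level;
}

/* Hypothetical helper: read the value back through the same overlay. */
static uint8_t demo_pin_read(unsigned port, unsigned pin)
{
    return DEMO_GPIO->B[port][pin];
}
```

With such a struct in place, a register access is an ordinary C assignment, which is why the remaining problems tend to be plain C questions.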
Instead of reading the current state back, I use a simple toggle counter in such cases.
Incrementing the counter once in every "while (1)" loop, I set the respective bit if the LSB of the counter is 1, or reset it if the LSB is zero. Here is a pseudocode example:
while (1)
{
    ...
    counter++;
    if (counter % 2)
        B[<n>][<m>] = LED_BITMASK;
    else
        B[<n>][<m>] = ~LED_BITMASK;
    delay();
}
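
For trying out the logic on a PC first, the same idea can be sketched as compilable C. This is only a sketch under assumptions: led_pin_reg is a plain variable standing in for B[<n>][<m>], blink_step() is a made-up name for one pass of the loop body, and delay() is an empty placeholder. It writes explicit 1 / 0 values instead of LED_BITMASK and its complement, which is my assumption about what the pin register expects, so check the user manual for your device.

```c
#include <stdint.h>

/* Hypothetical stand-in for the pin register B[<n>][<m>]; on real
   hardware this would be the register from the device header. */
static volatile uint8_t led_pin_reg;

/* Assumed values: 1 drives the pin high, 0 drives it low. */
#define LED_PIN_HIGH 1u
#define LED_PIN_LOW  0u

/* Toggle counter; only its least significant bit matters. */
static uint32_t counter;

/* Placeholder for whatever delay the application uses. */
static void delay(void)
{
}

/* One pass of the "while (1)" body: increment the counter and write the
   pin level selected by the counter's LSB. The pin state is never read
   back. */
static void blink_step(void)
{
    counter++;
    if (counter & 1u)
        led_pin_reg = LED_PIN_HIGH;
    else
        led_pin_reg = LED_PIN_LOW;
    delay();
}
```

On the target, the main loop would just call blink_step() inside "while (1)". Unsigned overflow of the counter is harmless here, because the LSB keeps alternating across the wrap-around.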