@marcelabuena
I think you have a situation that is more challenging than it initially appears.
Maybe you could share your IRQ Handler and base code (please, not the entire project). I suspect that there is too much code overhead for you to be able to process the incoming data bits at higher speeds.
To explain what I mean, consider that on a 120MHz processor, a 12MHz signal leaves only 10 clock cycles per bit to do the processing. A 3MHz signal leaves only 40 clock cycles per bit.
I use, as a rule of thumb, that a simple C statement with local variables on a Cortex-M4 requires four clock cycles (and takes up 4 words or 12 bytes). When you are reading/writing IO registers, this at least doubles. If you are calling a method, it is at least tripled (assuming no parameters). I haven't tried to develop a model for IRQ Handler entry and return, but I think it would be safe to say each would require 20 cycles or more.
So, while I can see 300-400kHz working, where you have a few hundred clock cycles to work with between bits, you're going to have issues at higher data rates.
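To put rough numbers on that, here is a minimal sketch in C of the cycle budget. The cost constants are just my rule-of-thumb estimates from above, not measurements, and the function names and the "one IO read plus a couple of statements" handler are hypothetical, purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Rule-of-thumb costs in CPU clock cycles (estimates, not measured values). */
#define CYCLES_SIMPLE_STATEMENT 4u  /* simple C statement, local variables */
#define CYCLES_IO_ACCESS        8u  /* IO register access: at least double */
#define CYCLES_IRQ_ENTRY        20u /* assumed lower bound for handler entry */
#define CYCLES_IRQ_EXIT         20u /* assumed lower bound for handler return */

/* CPU clock cycles available between successive incoming bits. */
static uint32_t cycles_per_bit(uint32_t cpu_hz, uint32_t bit_hz)
{
    assert(bit_hz != 0u); /* guard against division by zero */
    return cpu_hz / bit_hz;
}

/* Hypothetical handler: entry, one IO read, n simple statements, return. */
static uint32_t isr_cost_estimate(uint32_t n_statements)
{
    return CYCLES_IRQ_ENTRY + CYCLES_IO_ACCESS
         + n_statements * CYCLES_SIMPLE_STATEMENT + CYCLES_IRQ_EXIT;
}

/* Nonzero when the estimated handler cost fits inside one bit time. */
static int handler_fits(uint32_t cpu_hz, uint32_t bit_hz, uint32_t n_statements)
{
    return isr_cost_estimate(n_statements) <= cycles_per_bit(cpu_hz, bit_hz);
}

/* Print one line of the budget for a given bit rate. */
void print_budget(uint32_t cpu_hz, uint32_t bit_hz, uint32_t n_statements)
{
    printf("%lu Hz: %lu cycles per bit, handler estimate %lu cycles, %s\n",
           (unsigned long)bit_hz,
           (unsigned long)cycles_per_bit(cpu_hz, bit_hz),
           (unsigned long)isr_cost_estimate(n_statements),
           handler_fits(cpu_hz, bit_hz, n_statements) ? "fits" : "does not fit");
}
```

Even with only two simple statements in the handler, this estimate comes to 56 cycles, which fits comfortably in the 300 cycles you get at 400kHz but not in the 40 cycles at 3MHz or the 10 cycles at 12MHz.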
I'm sure that you can increase your responsiveness by continuously polling the incoming clock bit, but even then, at top speed, you're going to have some challenges. It also seems like a waste of a K64 that you want to do more with to have it spend its time in a polling loop.
I think the only way you're going to get full speed out of this application is to use external logic for this function to allow 8 bit (or whatever the data word size is) communication with the K64: either put together a couple of serial-to-parallel shift registers with a counter and some logic, or design a PLD for the task.
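As a rough illustration of what the external logic buys you, here is a small C sketch of the time between word-ready events when the K64 only has to service one event per word instead of one per bit. The function name is hypothetical and the 8 bit word size is just the example from above.

```c
#include <assert.h>
#include <stdint.h>

/* CPU clock cycles between completed words when external logic
   assembles bits_per_word bits before signalling the processor. */
static uint32_t cycles_per_word(uint32_t cpu_hz, uint32_t bit_hz,
                                uint32_t bits_per_word)
{
    assert(bit_hz != 0u); /* guard against division by zero */
    /* Multiply first in 64 bits to avoid overflow and truncation. */
    return (uint32_t)(((uint64_t)cpu_hz * bits_per_word) / bit_hz);
}
```

On a 120MHz part with 8 bit words, `cycles_per_word(120000000u, 12000000u, 8u)` comes out to 80 cycles and the 3MHz case to 320 cycles, compared with 10 and 40 cycles when you handle every bit yourself.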
Sorry, this probably isn't the answer you were hoping for,
myke