In the meantime I found the LQRUG_bme_ex1 sample, so I can answer my own question.
That sample provides the BME macro definitions as inline C functions. Once I got the compiler to really inline those functions, the results were as expected: where the standard C implementations needed 13 or 14 systicks, BME achieved the same in 10 or 11 systicks. In general BME was faster by 4 systicks per operation, or 40 %.
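For readers who have not seen the sample, here is a minimal sketch of what such an inline BME function can look like next to the standard C read-modify-write. It is not copied from LQRUG_bme_ex1. The names `bme_decorate`, `bme_or32` and `plain_or32` are made up for illustration, and I am assuming the usual Kinetis L layout: the opcode sits in bits 28..26 of the address (AND = 1, OR = 2, XOR = 3) and the GPIO registers are reached by BME through the 0x4000F000 alias. Please check both against the reference manual for your part.

```c
#include <stdint.h>

/* Assumed BME opcode field, placed in bits 28..26 of the decorated address. */
#define BME_OP_AND 1u
#define BME_OP_OR  2u
#define BME_OP_XOR 3u

/* Assumed BME alias of GPIOA_PDOR (its normal address is 0x400FF000). */
#define GPIOA_PDOR_BME_ALIAS 0x4000F000u

/* Build the decorated address: the plain address with the opcode OR-ed in. */
static inline uint32_t bme_decorate(uint32_t addr, uint32_t op)
{
    return addr | (op << 26);
}

/* Hypothetical BME OR: a single store, the BME does the read-modify-write. */
static inline void bme_or32(uint32_t addr, uint32_t mask)
{
    *(volatile uint32_t *)(uintptr_t)bme_decorate(addr, BME_OP_OR) = mask;
}

/* Standard C version: the core itself performs load, OR and store. */
static inline void plain_or32(volatile uint32_t *reg, uint32_t mask)
{
    *reg |= mask;
}
```

On the target, `bme_or32(GPIOA_PDOR_BME_ALIAS, 1u << 5)` would then stand in for `plain_or32(&reg, 1u << 5)`. The gain only shows up if the compiler really inlines the function, because a call overhead easily eats the few saved cycles.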
With an optimizing compiler things get confusing. For example, once the compiler holds the address of GPIOA_PDDR in a register, it may reuse that register inside the benchmark for the logical operation on GPIOA_PDOR (which is nearby), while BME needs a completely new destination address.
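To put numbers on "nearby", here is a small sketch that checks whether a register can be reached from an address already held in a core register. The assumptions are mine: GPIOA sits at 0x400FF000 with PDOR at offset 0x00 and PDDR at offset 0x14, the decorated BME OR address of PDOR is 0x4800F000, and a Thumb word load/store on the Cortex-M0+ accepts an immediate offset of 0 to 124 in steps of 4. The helper `reachable_from_base` is hypothetical and only models that offset rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed Kinetis L addresses, please verify for your device. */
#define GPIOA_BASE        0x400FF000u
#define GPIOA_PDOR        (GPIOA_BASE + 0x00u)
#define GPIOA_PDDR        (GPIOA_BASE + 0x14u)
#define GPIOA_PDOR_BME_OR 0x4800F000u  /* alias 0x4000F000 | (2 << 26) */

/* True if 'target' can be addressed as [base, #imm] by a Thumb word
 * load/store, i.e. a non-negative, word-aligned offset of at most 124. */
static bool reachable_from_base(uint32_t base, uint32_t target)
{
    uint32_t offset = target - base;
    return target >= base && offset <= 124u && (offset % 4u) == 0u;
}
```

With these assumptions PDOR and PDDR are both reachable from one base register, so the standard C version gets its address for free. The decorated BME address is far outside that range and has to be loaded separately, which costs extra cycles inside the measured section.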