MFS Invalid Sector Errors

jschepler · ‎04-06-2018

I am using KSDK 1.3.0 with MQX and MFS. We are using a 16 GB SD Card.

I am evaluating the reliability of MFS file writes by using one task that writes to files. I was running into issues with timeouts originally (see here). My system has a tick time of 1 ms, so I set the timeouts to the following until I hear back in my other post about what they should be set to:

#define ESDHC_CMD_TICK_TIMEOUT 40 // 40ms
#define ESDHC_CMD12_TICK_TIMEOUT 1000 // 1 Sec
#define ESDHC_TRANSFER_TIMEOUT_MS 750 // 750ms

I am still using a similar test procedure as before:

1. Create a new text file.

2. Write 1 KB of data into the file.

3. After 1024 Writes (1 MB), close the file.

4. Open the same file, seek to the end.

5. Perform steps 2 - 4 until I write 1,000 MB to the file.

6. After 1,000 MB has been written, close the current file and create a new one. Repeat steps 2-4 until 8 files have been created. (We are using a 16 GB card)
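The per-file part of the procedure above (steps 1 to 5) can be sketched in portable C. This is only an illustration under stated assumptions: it uses standard stdio calls rather than the MQX/NIO calls from the actual test, and the helper name run_append_test and its parameters are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append `total_chunks` chunks of `chunk_bytes` bytes
 * to `path`, closing and reopening the file every `chunks_per_open` chunks.
 * Returns 0 on success, -1 on the first I/O error. */
static int run_append_test(const char *path, size_t chunk_bytes,
                           unsigned chunks_per_open, unsigned total_chunks)
{
    char chunk[1024];

    if (chunk_bytes > sizeof chunk || chunks_per_open == 0) {
        return -1;
    }
    memset(chunk, 'A', chunk_bytes);      /* fill pattern: the character 'A' */

    FILE *fp = fopen(path, "wb");         /* step 1: create a new file */
    if (fp == NULL) {
        return -1;
    }

    for (unsigned i = 0; i < total_chunks; i++) {
        /* step 2: write one chunk */
        if (fwrite(chunk, 1, chunk_bytes, fp) != chunk_bytes) {
            fclose(fp);
            return -1;
        }
        if ((i + 1) % chunks_per_open == 0) {
            /* step 3: close the file after a batch of writes */
            if (fclose(fp) != 0) {
                return -1;
            }
            /* step 4: reopen; append mode places every write at the end */
            fp = fopen(path, "ab");
            if (fp == NULL) {
                return -1;
            }
        }
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

With chunk_bytes = 1024 and chunks_per_open = 1024 this matches the 1 KB / 1 MB pattern described above; step 6 would call the helper once per file name.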

The farthest I have gotten is 5 files written before a failure during the writing of the 6th. The errors I am receiving are the following:

MFS_SECTOR_NOT_FOUND

MFS_INVALID_CLUSTER_NUMBER

These appear to happen at random. Again, all I am doing in my program is writing to files.

The latest error happened in the function "MFS_get_cluster_from_fat". I set a breakpoint here so I am able to see the context of variables:

drive_ptr information and fat_entry:

So, fat_entry is a value of 0x0141_4141, but the LAST_CLUSTER is 0x001D_1C5B, so we are trying to write into a cluster that is greater than the max, which is why it fails.

However, I am not sure how fat_entry gets its value.

How did it become greater than the LAST_CLUSTER?

I noticed there are some places where "location" is a byte value, and other places where "location" is converted into a sector number. Could it be related to a 32-bit overflow issue?

Although I don't plan on creating such large files in my application, I am still concerned with the errors that are happening because I do not know the root cause.

jschepler · ‎04-12-2018

The root cause of this issue is an overflow of an unsigned 32-bit number when computing the location of a sector. The code is confusing to follow because LOCATION sometimes refers to a SECTOR number and sometimes to a BYTE offset.

On my SD card, a sector is 512 bytes. In mfs_rw.c a sector location is converted into a byte location by multiplying the sector location by the number of bytes in a sector. This is all done with 32-bit variables, so the result overflows once the sector reaches 0x0080_0000: multiplied by 512, the byte location becomes 0x0001_0000_0000. Since it is stored in a 32-bit value, it is saved as 0x0000_0000, and writing to this location corrupts the file system.

By the way, a sector location of 0x0080_0000 with 512 bytes per sector corresponds to 4 GB. When I tested with a 4 GB SD card I never saw the issue. We will be using 16 GB going forward.
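The arithmetic can be checked with a small standalone sketch. The function names are hypothetical, and unsigned types are used so the 32-bit wraparound is well-defined C (the SDK code quoted below stores the result in an int32_t).

```c
#include <stdint.h>

/* Sector-to-byte conversion done entirely in 32 bits: bits shifted past
 * bit 31 are lost. */
static uint32_t byte_offset_32(uint32_t sector, unsigned sector_power)
{
    return (uint32_t)(sector << sector_power);
}

/* Same conversion with the operand widened to 64 bits before the shift,
 * so the full byte offset is kept. */
static uint64_t byte_offset_64(uint32_t sector, unsigned sector_power)
{
    return (uint64_t)sector << sector_power;
}
```

For sector 0x00800000 and a shift of 9 (512 bytes per sector), byte_offset_32 yields 0 while byte_offset_64 yields 0x100000000, which is the 4 GB boundary mentioned above.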

The following changes pass the test I created as described in the original post:

mfs_format.c

Original

/* sector size is obtained during open operation and stored in the drive structure */
 sector_size = drive_ptr->SECTOR_SIZE;
/* get the number of sectors */
 error_code = ioctl(drive_ptr->DEV_FILE_PTR, IO_IOCTL_GET_NUM_SECTORS, &num_sectors);
#if !MQX_USE_IO_OLD
 if (error_code == -1 && errno == NIO_ENOTSUP)
 {
 num_sectors = lseek(drive_ptr->DEV_FILE_PTR, 0, SEEK_END) / sector_size;
 error_code = MFS_NO_ERROR;
 }

Changes

if (error_code == -1 && errno == NIO_ENOTSUP)
{
 //num_sectors = lseek(drive_ptr->DEV_FILE_PTR, 0, SEEK_END) / sector_size;
 /* lseek returns 32-bit value, so if > 4 GB we would have an issue */

 num_sectors = _nio_lseek(drive_ptr->DEV_FILE_PTR, 0, SEEK_END, &error) / sector_size;

 error_code = MFS_NO_ERROR;
}

Explanation

lseek returns a 32-bit value (off_t) with my compiler. Here it returns the device size in bytes, which overflows once the size exceeds 4 GB. (The limit might actually be 2 GB if off_t is signed.) _nio_lseek returns a 64-bit number.
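As a rough illustration of the effect on num_sectors, the sketch below squeezes a device size through a 32-bit value before dividing. The function names are hypothetical, and exactly 16 GiB is assumed as the device size for round numbers (a real 16 GB card reports somewhat fewer bytes).

```c
#include <stdint.h>

/* Sector count when the byte size only survives as its low 32 bits,
 * which is what a 32-bit return value can carry. */
static uint32_t sectors_via_32bit(uint64_t device_bytes, uint32_t sector_size)
{
    uint32_t truncated = (uint32_t)device_bytes;  /* upper 32 bits are dropped */
    return truncated / sector_size;
}

/* Sector count when the byte size is kept in 64 bits. */
static uint64_t sectors_via_64bit(uint64_t device_bytes, uint32_t sector_size)
{
    return device_bytes / sector_size;
}
```

For 16 GiB and 512-byte sectors the 64-bit path gives 33,554,432 sectors, while the truncated path gives 0.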

mfs_rw.c

Changes are made to the functions MFS_Write_device_sectors and MFS_Read_device_sectors

Original

uint32_t attempts;
int32_t num, expect_num, seek_loc, shifter;
char *data_ptr;
_mfs_error error;

...

MFS_LOG(printf("MFS_Write_device_sectors %d %d\n", sector_number, sector_count));

if (sector_number > drive_ptr->MEGA_SECTORS)
{
 return MFS_SECTOR_NOT_FOUND;
}

if (drive_ptr->BLOCK_MODE)
{
 shifter = 0;
 seek_loc = sector_number;
 expect_num = sector_count;
}

else
{
 shifter = drive_ptr->SECTOR_POWER;
 seek_loc = sector_number << shifter;
 expect_num = sector_count << shifter;
}

#if MQX_USE_IO_OLD
 fseek(drive_ptr->DEV_FILE_PTR, seek_loc, IO_SEEK_SET);
#else
 lseek(drive_ptr->DEV_FILE_PTR, seek_loc, SEEK_SET);
//TODO: check errno lseek
#endif

Changes

uint32_t attempts;
int64_t seek_loc;    /* Added for LSEEK issue */
int32_t expect_num;  /* Can stay 32-bit because it is a transfer size, not a device offset */
int32_t num, shifter;
int nio_error;       /* Added for LSEEK issue */
char *data_ptr;
 _mfs_error error;

...

// MFS_LOG(printf("MFS_Write_device_sectors %d %d\n", sector_number, sector_count));
 
if (sector_number > drive_ptr->MEGA_SECTORS)
{
 return MFS_SECTOR_NOT_FOUND;
}
 
if (drive_ptr->BLOCK_MODE)
{
 shifter = 0;
 seek_loc = sector_number;
 expect_num = sector_count;
}

else
{
 shifter = drive_ptr->SECTOR_POWER;
 seek_loc = (int64_t)sector_number << shifter;  /* Need to typecast to avoid overflow */
 expect_num = sector_count << shifter;
}

#if MQX_USE_IO_OLD
 fseek(drive_ptr->DEV_FILE_PTR, seek_loc, IO_SEEK_SET);
#else
//    lseek(drive_ptr->DEV_FILE_PTR, seek_loc, SEEK_SET);
 _nio_lseek(drive_ptr->DEV_FILE_PTR, seek_loc, SEEK_SET, &nio_error); /* off_t issue */
//TODO: check errno lseek
#endif

Explanation

Commented out MFS_LOG since it is not needed. Changed seek_loc to int64_t because it holds a BYTE location converted from a SECTOR number. Cast sector_number to int64_t before the shift, because shifting by SECTOR_POWER (multiplying by 512) overflows if the value stays 32-bit. Changed the call from lseek to _nio_lseek due to the off_t issue.

part_mgr.c

These changes need to be applied to both the _io_part_mgr_write and _io_part_mgr_read functions. (Note: when editing _io_part_mgr_read, make sure to call read instead of write.)

Original

uint64_t location;
uint64_t part_start;
uint64_t part_end;
int32_t result;

...

/* Perform seek and data transfer */
result = lseek(pm_struct_ptr->DEV_FILE_PTR, location, SEEK_SET);

if (result >= 0)
{
 result = write(pm_struct_ptr->DEV_FILE_PTR, data_ptr, num);
}

Changes

int64_t location;    /* _nio_lseek returns a signed 64-bit number. */
int64_t part_start;
int64_t part_end;
int32_t result;

...

/* Perform seek and data transfer */
// result = lseek(pm_struct_ptr->DEV_FILE_PTR, location, SEEK_SET);

/*  Return could be a byte address so put into location which is an int64_t else
 *  we will have an overflow issue.
 */

location = _nio_lseek(pm_struct_ptr->DEV_FILE_PTR, location, SEEK_SET, error);

if (location >= 0)
{
 result = write(pm_struct_ptr->DEV_FILE_PTR, data_ptr, num, error);
}
 
 // JS: Return -1 if the seek fails
else
{
 if(error)
 {
  *error = MFS_ERROR_SEEK;
 }

  result = -1;
}

Explanation

_nio_lseek returns a signed 64-bit number because a value less than 0 indicates an error. Changed location, part_start, and part_end to signed as well. My application will only address up to 16 GB, so the change from unsigned to signed will not impact it.

Additional notes:

off_t is defined by my compiler (GCC) as 32 bits. It was mentioned in another post that the comp.h file for an IAR project explicitly defines off_t as signed 64 bits. I tried redefining off_t in comp.h for the GCC project, but I ran into many issues and did not pursue them, fearing I could unintentionally break something else. My solution was to remove all calls to lseek and replace them with _nio_lseek.

comp.h can be found at \KSDK_1.3.0\rtos\mqx\mqx\source\psp\cortex_m\compiler\iar

jschepler · ‎04-10-2018

EDIT 2018-04-12: nio_error of MFS_FILE_NOT_FOUND was caused by a different call to _nio_open outside of the loop. I did not correctly reset nio_error to 0 before I started the loop. The issue I am describing below does not exist.
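The pitfall described in this edit can be sketched as follows. Everything here is a hypothetical stand-in: stub_open is not the real _nio_open, and the assumption that an open call leaves *error untouched on success is an illustration, not something the thread confirms about the NIO layer.

```c
#include <stddef.h>

#define STUB_FILE_NOT_FOUND 0x3002  /* same value as the error quoted in this post */

/* Hypothetical stand-in for an open call that writes *error only on failure
 * and leaves it untouched on success. */
static int stub_open(const char *name, int *error)
{
    if (name == NULL || name[0] == '\0') {
        if (error != NULL) {
            *error = STUB_FILE_NOT_FOUND;
        }
        return -1;
    }
    return 3;  /* arbitrary valid descriptor */
}

/* Wrapper that clears the caller's error variable first, so a value left
 * over from an earlier call cannot be mistaken for a new failure. */
static int open_checked(const char *name, int *error)
{
    *error = 0;
    return stub_open(name, error);
}
```

If an earlier failed call set the error variable, a later successful stub_open still shows the old value next to a valid descriptor; open_checked avoids that by resetting the variable before each call.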

Update: I am still trying to find the root cause of this issue. Interestingly, I am writing one character repeatedly to the file, 'A', which happens to be a value of 0x41. This is the value seen for the cluster number above.
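The 0x41 observation lines up with the fat_entry value of 0x0141_4141 from the first post. A minimal sketch, assuming the card is formatted as FAT32 (where only the low 28 bits of a FAT entry are significant) and that file data made of 'A' bytes landed where a FAT sector was expected; the function name is hypothetical:

```c
#include <stdint.h>

#define FAT32_ENTRY_MASK 0x0FFFFFFFu  /* FAT32 uses the low 28 bits of each entry */

/* Decode a little-endian 32-bit FAT entry from four raw bytes and mask it
 * the way a FAT32 reader would. */
static uint32_t fat32_entry_from_bytes(const uint8_t b[4])
{
    uint32_t raw = (uint32_t)b[0]
                 | ((uint32_t)b[1] << 8)
                 | ((uint32_t)b[2] << 16)
                 | ((uint32_t)b[3] << 24);
    return raw & FAT32_ENTRY_MASK;
}
```

Four consecutive 'A' bytes decode to 0x41414141, and masking to 28 bits gives 0x01414141, which is larger than the LAST_CLUSTER value of 0x001D1C5B.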

When this error occurred again today, I noticed the error value was 0x3002 (MFS_FILE_NOT_FOUND) before the call to write.

How did this happen? I have a check to make sure the file descriptor is not < 0:

chk_error = _nio_close(index_fd, &nio_error);

if(chk_error < 0)
{
  testfunction();
}

index_fd = _nio_open(myfilenames, O_RDWR | O_APPEND, &nio_error);

// Check that we can access the file descriptor
if(index_fd < 0)
{
  testfunction();
}

I added logic to not only check if the file descriptor is OK, but also to check the nio_error value:

index_fd = _nio_open(myfilenames, O_RDWR | O_APPEND, &nio_error);

// Check that we can access the file descriptor
if(index_fd < 0 || nio_error != 0)
{
  testfunction();
//  _task_block();
}

And it turns out that the _nio_open function can return a valid file descriptor even when an error has occurred in the _nio_open function.

Has anyone else seen a case like this?

danielchen · ‎04-10-2018

Hi jschepler

I would suggest you try this test with MQX 4.2 first, because in KSDK 1.3 the MQX, MFS, and SDHC modules are ported from MQX 4.2. The only difference is that KSDK 1.3 has an NIO layer, which classic MQX 4.2 does not.

I guess this issue may be related to a 32-bit overflow somewhere, perhaps in the cluster calculation?

Regards

Daniel