11.1 Out of Space Issues
A full media partition can cause issues such as the following:
- Failure to download release updates, or failure to start new/updated services after a download has occurred
- Failure for a service to store data into defined volumes
- Failure of services to start up (mostly those that need to store data that isn’t in tmpfs)
The -h switch to df makes the returned figures ‘human readable’. Without this switch, the figures are given in blocks (usually 1k or 512-byte blocks).
The two main mounts where full space problems commonly occur are /mnt/data and /mnt/state. The former is the data partition where all service images, containers and volumes are stored. The latter is the state partition, where overlays for the root FS (such as user defined network configurations) and the permanent logs are stored.
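As a minimal sketch of checking the two mounts named above, the following assumes a host OS shell with df and awk available. The usage_pct function name is illustrative and not part of balenaOS.

```shell
# Print the percentage of space used on the filesystem that holds a path.
# df -P keeps each filesystem on a single line, so the fifth field is the capacity.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Report on the data and state partitions; skip any mount that is absent.
for mount in /mnt/data /mnt/state; do
    if [ -d "$mount" ]; then
        echo "$mount: $(usage_pct "$mount")% used"
    fi
done
```

A figure close to 100 for either mount suggests that one of the cleanup routines described in this section is worth trying.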
There are a few ways to try and relieve out of space issues on a media drive.
11.1.1 Image and Container Pruning
One fairly easy cleanup routine is to prune the Docker tree so that any unused images, containers, networks and volumes are removed. In day-to-day operation, the Supervisor attempts to remove anything that is no longer used on the device. However, issues sometimes occur that prevent this from working correctly. In these cases, a prune should help clean out anything that should not be present. Running balena system prune with the -a and --volumes flags removes all unused images, containers, networks and volumes. To limit image removal to dangling images, omit the -a flag.
11.1.2 Customer Data
Occasionally, customer volumes can also fill up the data partition. This causes more issues, because this is usually data that cannot just be deleted. In these cases, it’s imperative that the customer is informed that they’ve filled the data partition and that appropriate pruning is required. Filling disk space does not tend to stop access to devices, so customers should be asked to enter the relevant services and manually prune data.
Before discussing persistent data, it’s worth noting that customer apps occasionally store data in the service container instead of a persistent data volume. Sometimes this data is intended to be temporary, so doing so is not an issue (although if they expect it to be permanent, it will not be, because service container rebuilds remove the layers where new data is stored). However, there are cases where even this temporary data is large enough to fill the storage media. In these cases, the Supervisor can be stopped, followed by the affected service container, allowing that container to be removed so the Supervisor can rebuild it from the service image. This removes the layers filling the space. Care should be taken and customers informed first, in case this data is required. They should also be informed of persistent data and how to use it.
Because persistent data is stored as volumes, it’s also possible to prune data for a service from within the host OS. For example, should a service be filling a volume so quickly as to prevent sensible data removal, an option is to stop that service and then manually remove data from the service’s volume.
Data volumes are always located in the /var/lib/docker/volumes directory. Care needs to be taken to ensure the right volumes are examined/pruned of data, as not all volumes pertain directly to customer data. Let’s list the volumes:
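A minimal sketch of such a listing, assuming a root shell on the host OS. The VOLUMES_DIR variable and the list_volumes function are illustrative names, and the per-volume size summary via du is an addition for convenience rather than something the original listing is known to include.

```shell
# Directory that holds the engine's volumes; overridable for experiments.
VOLUMES_DIR="${VOLUMES_DIR:-/var/lib/docker/volumes}"

# Print the disk usage and path of each volume directory.
# Plain files in the volumes directory are skipped by the trailing-slash glob.
list_volumes() {
    for dir in "$VOLUMES_DIR"/*/; do
        [ -d "$dir" ] || continue
        du -sh "$dir"
    done
}
```

Calling list_volumes on a device prints one line per volume, which makes it easier to spot the volume that is filling the data partition.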
In single-container apps, the data volume name ends with the _balena-data string.
In multicontainer apps, the suffix always corresponds with the name of the bound volume. For example, let’s look at the docker-compose manifest for the multicontainer-app app used in this debugging masterclass:
In that manifest, a backend-data volume is defined and then used by the backend service. Assuming your device is still running the multicontainer app for this masterclass, SSH into the device, and then examine the running services:
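One way to examine the services and their volumes is sketched below, assuming the engine CLI is available as balena on the host OS and accepts the same ps and inspect options as Docker. The ENGINE variable and the show_mounts function are illustrative names.

```shell
# Engine binary; "balena" on the host OS, overridable for dry runs.
ENGINE="${ENGINE:-balena}"

# For every running container, print its name followed by
# "volume-name -> mount-point" for each of its mounts.
show_mounts() {
    for id in $("$ENGINE" ps -q); do
        "$ENGINE" inspect \
            --format '{{ .Name }}: {{ range .Mounts }}{{ .Name }} -> {{ .Destination }}; {{ end }}' \
            "$id"
    done
}
```

Running show_mounts on the device should show which named volume each service mounts. Bind mounts have no volume name, so their name field is empty.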
Each volume has its own directory under the /var/lib/docker/volumes directory, with the relevant suffix.
Knowing this, it becomes fairly simple to stop services that have filled volumes and to clear these out:
Stop the Supervisor and start timer (balena-supervisor.service and update-balena-supervisor.timer).
11.2 Storage Media Corruption
Many device types use storage media that is subject to high levels of wear. This includes devices such as the Raspberry Pi series, where SD cards are the usual storage media. Because we recommend very hard-wearing cards (the SanDisk Extreme Pro family is extremely resilient), we don’t regularly see customer devices dying due to SD card failure. However, failures do occur, and not just on SD cards: any type of flash-based storage has a shorter lifespan than media such as platter drives. Initially, media corruption and wear exhibit ‘random’ signs, including but not limited to:
- Release updates failing to download/start/stop.
- Services suddenly restarting.
- Devices not being mapped to device nodes.
- Extreme lag when interacting with services/utilities from the CLI.
- Spurious kernel errors.
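As a rough first check for the kernel errors listed above, the kernel log can be scanned for storage-related messages. This is a sketch only: the count_storage_errors name and the two search patterns are illustrative assumptions, and real corruption may surface under other messages.

```shell
# Count log lines on stdin that mention block I/O or ext4 filesystem errors.
# grep exits non-zero when nothing matches, so "|| true" keeps the exit status clean.
count_storage_errors() {
    grep -ciE 'i/o error|ext4-fs error' || true
}
```

On a device, piping dmesg into count_storage_errors gives a quick indication of whether the kernel has logged such errors since boot.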