Anatomy of a Backup
Understanding the process flow of a backup operation helps greatly when troubleshooting: a potential problem area can be identified quickly, which should in turn reduce the time to resolution. The following describes the flow of a backup job at a high level.
When a scheduled backup is due, nbpem ( policy execution manager ) running on the master server initiates the backup. To view which backups are due to run, the nbpemreq command can be used:
# nbpemreq -due -date mm/dd/yyyy hh:mm:ss
# nbpemreq -due -date 10/20/2019 19:00:00
It is also possible to use client filter and policy filter options with the nbpemreq command:
# nbpemreq -due -date 10/20/2019 19:00:00 -client_filter <client_name> -policy_filter <policy_name>
The nbpemreq command can also be used to suspend and resume scheduling:
# nbpemreq -suspend_scheduling
# nbpemreq -resume_scheduling
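The -due -date argument shown earlier follows the pattern mm/dd/yyyy hh:mm:ss. As a minimal sketch, the Python helper below builds that argument list from a datetime. The function name nbpemreq_due_args is hypothetical, it only uses the options shown above, and it formats the arguments without running anything.

```python
from datetime import datetime
from typing import List, Optional


def nbpemreq_due_args(when: datetime,
                      client: Optional[str] = None,
                      policy: Optional[str] = None) -> List[str]:
    """Build the nbpemreq -due argument list (hypothetical helper)."""
    # The date and the time are two separate words on the command line.
    args = ["nbpemreq", "-due", "-date",
            when.strftime("%m/%d/%Y"), when.strftime("%H:%M:%S")]
    if client:
        args += ["-client_filter", client]
    if policy:
        args += ["-policy_filter", policy]
    return args


# Reproduces the example command line used earlier in this section.
print(" ".join(nbpemreq_due_args(datetime(2019, 10, 20, 19, 0, 0))))
```

The resulting list could be passed to a process launcher on the master server, which is left out here on purpose.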
When a backup is initiated, nbpem notifies the nbjm ( job manager ) process, which generates the backup jobid and notifies bpjobd. The bpjobd process is responsible for updating the Java GUI, and the status of the job can be viewed via the bpdbjobs command. When the job is first submitted it will be in a queued state until resources have been allocated.
# bpdbjobs -summary -L
# bpdbjobs -jobid <jobid> -all_columns
# bpdbjobs -jobid <jobid> -most_columns
Once the job has been added to the jobs database, nbjm contacts nbrb ( resource broker ) to request resources from nbemm ( Enterprise Media Manager ). The nbrb process is the only process which communicates with nbemm.
Once resources have been allocated, nbrb notifies nbjm, which in turn notifies bpjobd. The job then moves from a queued to an active state.
It is possible to view details regarding resource allocations with the nbrbutil command. There are many options to this command so it is worth spending some time reviewing it.
To list all the active jobs for the disk volume with the media id @aaaaa, or for the storage unit named STORAGEUNIT, use the following commands:
# nbrbutil -listactivemediajobs @aaaaa
# nbrbutil -listactivestujobs STORAGEUNIT
After the job has gone active, nbjm connects to bpcd ( client daemon ) on the media server which in turn starts bpbrm ( backup and restore manager ) on the media server. The connection between nbjm on the master server and bpcd on the media server is set up via the PBX ( private branch exchange process ) running on the media server.
Due to the use of certificates, from NBU 8.1 onwards only PBX is used to establish the connection during backup operations; it does not fall back to VNETD or a direct connection to BPCD as in earlier versions of NBU. To confirm PBX is running, use the -x switch to bpps:
# bpps -x
The bpbrm process running on the media server contacts bpcd on the relevant client via PBX to start the bpbkar ( client backup and archive manager ) process on the client. The bpbkar process is responsible for reading the client data. For each file to be backed up, the process will perform an open, read and close operation on the file.
The bpbrm process on the media server is also responsible for starting the bptm ( tape management ) process on the media server. If the backup target is a tape drive, then bptm is responsible for issuing the mount request to ltid ( logical tape interface daemon ) running on the media server. If a TLD library is being used, then the mount request is passed to tldd ( tape library daemon ) on the media server, which in turn passes the mount request to tldcd ( tape library control daemon ). The tldcd process will be running on the media server or master server on which the tape library robotic control is configured. In the case of a TLD library there will only be a single host running the tldcd process for a specific library at any one time. If the tldcd process is running on a clustered master server, then the robotic changer will be configured on both cluster nodes but only one node will be active at any one time.
When the backup target is a disk-based target, bptm will communicate directly with the disk.
NetBackup knows which tape to load or disk volume to use for the backup based on the resource information returned by nbemm to nbjm via nbrb earlier in the backup process.
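When troubleshooting, it can help to write down which processes sit between bpbrm and the backup target. The sketch below is only a lookup of the chains described above; the function name mount_request_path and the target labels "tape_tld", "tape" and "disk" are hypothetical, chosen for illustration.

```python
from typing import List


def mount_request_path(target: str) -> List[str]:
    """Return the process chain from bpbrm to the backup target.

    The target labels are illustrative, not NetBackup terms.
    """
    if target == "tape_tld":
        # Tape drive in a TLD library: the mount request goes through
        # ltid and tldd, then tldcd on the robotic control host.
        return ["bpbrm", "bptm", "ltid", "tldd", "tldcd"]
    if target == "tape":
        # Tape drive: bptm issues the mount request to ltid.
        return ["bpbrm", "bptm", "ltid"]
    if target == "disk":
        # Disk-based target: bptm communicates directly with the disk.
        return ["bpbrm", "bptm"]
    raise ValueError(f"unknown target type: {target}")


print(" -> ".join(mount_request_path("tape_tld")))
```

Note that tldcd may be running on a different host from the other processes in the chain, as described above.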
The bptm process will spawn a child bptm process. The child bptm process receives the backup data from the client and places it in shared memory, whilst the parent bptm process reads the data from shared memory and writes it to the backup target. It is at this point of the backup process where data buffers come into play. Each full data buffer will be a single write to the backup device, hence tuning the data buffer size and the number of data buffers can have a positive impact on backup performance.
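The relationship between buffer size and the number of writes can be illustrated with a small simulation. This is a sketch only and not how bptm is implemented: a bounded Python queue stands in for shared memory, two threads stand in for the child and parent bptm processes, and the names simulate_backup, buffer_size and number_of_buffers are hypothetical.

```python
import queue
import threading
from typing import List


def simulate_backup(data: bytes, buffer_size: int, number_of_buffers: int) -> int:
    """Return how many writes the 'parent' made to the backup target."""
    # The bounded queue plays the role of the shared memory buffers.
    shared: "queue.Queue" = queue.Queue(maxsize=number_of_buffers)
    writes: List[bytes] = []

    def child() -> None:
        # Child: receive client data and fill one buffer at a time.
        for offset in range(0, len(data), buffer_size):
            # put() blocks while all buffers are full.
            shared.put(data[offset:offset + buffer_size])
        shared.put(None)  # end-of-data marker

    def parent() -> None:
        # Parent: each buffer taken from shared memory is one write.
        while True:
            buf = shared.get()
            if buf is None:
                break
            writes.append(buf)

    threads = [threading.Thread(target=child), threading.Thread(target=parent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(writes)


print(simulate_backup(b"x" * 1000, buffer_size=256, number_of_buffers=4))  # 4
print(simulate_backup(b"x" * 1000, buffer_size=512, number_of_buffers=4))  # 2
```

In this model, doubling the buffer size halves the number of writes for the same amount of data, and the final, partly filled buffer is also written. The number of buffers controls how far the child can run ahead of the parent before it has to wait.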
Whilst bptm is receiving the actual data to write to the backup target, bpbrm is receiving the metadata about the files and directories which are being backed up. This metadata is sent to bpdbm ( database manager ) running on the master server, which in turn updates the NBU image database.
For each backup running on a media server, a bpbrm process will be started along with the relevant bptm processes, hence on busy media servers it is not unusual to find a large number of bpbrm and bptm processes running.
When the backup is complete, the client's bpbkar process will notify bpbrm on the media server that there is no more data to send. The bpbrm process will then notify nbjm that the backup is complete. At this point nbjm will ask bpbrm to verify that the image files have been successfully written. The nbjm process will then update EMM via nbrb so that the backup resources are deallocated, and nbjm will update bpjobd with the status of the completed job. The job state will be updated in the activity monitor to DONE with the relevant job status code.
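The job states mentioned in this flow (queued, active and DONE) can be summarised as a small state model. This is a simplified sketch for illustration; the class and method names are hypothetical, and it assumes that a job always passes through queued and active before it is done.

```python
from enum import Enum
from typing import Optional


class JobState(Enum):
    QUEUED = "queued"
    ACTIVE = "active"
    DONE = "done"


class BackupJob:
    """Simplified model of the job states described in this section."""

    def __init__(self, jobid: int) -> None:
        self.jobid = jobid
        # A newly submitted job waits in the queued state.
        self.state = JobState.QUEUED
        self.status: Optional[int] = None

    def resources_allocated(self) -> None:
        # Resources have been allocated: queued -> active.
        if self.state is not JobState.QUEUED:
            raise ValueError("only a queued job can become active")
        self.state = JobState.ACTIVE

    def finished(self, status: int) -> None:
        # The job has completed: active -> done, with a status code.
        if self.state is not JobState.ACTIVE:
            raise ValueError("only an active job can finish")
        self.state = JobState.DONE
        self.status = status


job = BackupJob(jobid=1)
job.resources_allocated()
job.finished(status=0)  # status 0 indicates a successful job
print(job.state.name, job.status)
```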
Finally, nbjm will update nbpem with the exit status of the job. If the backup job was successful, nbpem will calculate the next due date and time for the backup. If the job was unsuccessful, it may be scheduled for resubmission, in which case the next due date and time will be determined by the retry configuration.
The retry configuration can be viewed via the bpconfig command:
# bpconfig -L
The relevant fields are:
Job Retry Delay
Backup Tries
The Job Retry Delay setting is how long, in minutes, NBU will wait between job submissions.
The Backup Tries setting determines how many times within the given period NBU will attempt to run the backup job.
If the Job Retry Delay setting is 5 minutes and the Backup Tries setting is 2 in 12 hours, then NBU will try to rerun the job 5 minutes after the first backup job failed. If the second attempt also fails, NBU will not try to run the job again for the next 12 hours. Consideration needs to be given to which values will work best within your environment.
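The interaction between the two settings can be sketched as a small decision function. This is a model of the behaviour described above and not NetBackup's actual scheduling code: the function name retry_allowed is hypothetical, and it assumes the Backup Tries period is treated as a sliding window ending at the current time.

```python
from datetime import datetime, timedelta
from typing import List


def retry_allowed(attempts: List[datetime], now: datetime,
                  tries: int, period: timedelta, delay: timedelta) -> bool:
    """Decide whether a failed job may be submitted again at 'now'."""
    if not attempts:
        return True
    # Backup Tries: count the attempts inside the period (assumed sliding).
    recent = [t for t in attempts if now - t < period]
    if len(recent) >= tries:
        return False
    # Job Retry Delay: wait at least 'delay' after the latest attempt.
    return now - max(attempts) >= delay


first = datetime(2019, 10, 20, 19, 0, 0)
settings = dict(tries=2, period=timedelta(hours=12), delay=timedelta(minutes=5))

# One failed attempt: a retry is allowed once 5 minutes have passed.
print(retry_allowed([first], first + timedelta(minutes=5), **settings))  # True
# Two failed attempts: no further retry inside the 12 hour period.
second = first + timedelta(minutes=5)
print(retry_allowed([first, second], first + timedelta(hours=1), **settings))  # False
```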