
Watchdog timers are essential in remote, automated systems such as this Mars Exploration Rover.
A watchdog timer (sometimes called a computer operating properly or COP timer, or simply a watchdog) is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly restarts the watchdog timer to prevent it from elapsing, or "timing out". If, due to a hardware fault or program error, the computer fails to restart the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective action or actions. The corrective actions typically include placing the computer system in a safe state and restoring normal system operation.
Watchdog timers are commonly found in computer-controlled equipment where humans cannot easily access the equipment or would be unable to react to faults in a timely manner. Many embedded systems cannot depend on a human to reboot them if the software hangs; they must be self-reliant. For example, remote embedded systems such as space probes are not physically accessible to human operators; these could become permanently disabled if they were unable to autonomously recover from faults. A watchdog timer is usually employed in cases like these.
Watchdog timers may also be used when running untrusted code in a sandbox, to limit the CPU time available to the code and thus prevent some types of denial-of-service attacks.[1]
Architecture and operation
Watchdog timers come in many configurations, and many allow their configurations to be altered. Microcontrollers often include an integrated, on-chip watchdog. In other computers the watchdog may reside in a nearby chip that connects directly to the CPU, or it may be located on an external expansion card in the computer's chassis. The watchdog and CPU may share a common clock signal, as shown in the block diagram below, or they may have independent clock signals.
The act of restarting the watchdog timer is commonly referred to as "kicking the dog"[2][3] or other similar terms; this is typically done by writing to a watchdog control port. Alternatively, in microcontrollers that have an integrated watchdog timer, the watchdog is sometimes kicked by executing a special machine language instruction. An example of this is the CLRWDT (clear watchdog timer) instruction found in the instruction set of some PIC microcontrollers.
Multistage watchdog
Two or more timers are sometimes cascaded to form a multistage watchdog timer, where each timer is referred to as a timer stage, or simply a stage. For example, the block diagram below shows a three-stage watchdog. In a multistage watchdog, only the first stage is kicked by the processor. Upon first stage timeout, a corrective action is initiated and the next stage in the cascade is started. As each subsequent stage times out, it triggers a corrective action and starts the next stage. Upon final stage timeout, a corrective action is initiated, but no other stage is started because the end of the cascade has been reached. Typically, single-stage watchdog timers are used to simply restart the computer, whereas multistage watchdog timers will sequentially trigger a series of corrective actions, with the final stage triggering a computer restart.[3]
Time intervals
Watchdog timers may have either fixed or programmable time intervals. Some watchdog timers allow the time interval to be programmed by selecting from among a few selectable, discrete values. In others, the interval can be programmed to arbitrary values. Typically, watchdog time intervals range from ten milliseconds to ten seconds. In a multistage watchdog, each timer may have its own, unique time interval.
Corrective actions
A watchdog timer may initiate any of several types of corrective action, including processor reset, non-maskable interrupt, maskable interrupt, power cycling, fail-safe state activation, or combinations of these. Depending on its architecture, the type of corrective action or actions that a watchdog can trigger may be fixed or programmable. Some computers (e.g., PC compatibles) require a pulsed signal to invoke a processor reset. In such cases, the watchdog typically triggers a processor reset by activating an internal or external pulse generator, which in turn creates the required reset pulses.[3]
In embedded systems and control systems, watchdog timers are often used to activate fail-safe circuitry. When activated, the fail-safe circuitry forces all control outputs to safe states (e.g., turns off motors, heaters, and high-voltages) to prevent injuries and equipment damage while the fault persists. In a two-stage watchdog, the first timer is often used to activate fail-safe outputs and start the second timer stage; the second stage will reset the computer if the fault cannot be corrected before the timer elapses.
Watchdog timers are sometimes used to trigger the recording of system state information—which may be useful during fault recovery[3]—or debug information (which may be useful for determining the cause of the fault) onto a persistent medium. In such cases, a second timer—which is started when the first timer elapses—is typically used to reset the computer later, after allowing sufficient time for data recording to complete. This allows time for the information to be saved, but ensures that the computer will be reset even if the recording process fails.
For example, the above diagram shows a likely configuration for a two-stage watchdog timer. During normal operation the computer regularly kicks Stage1 to prevent a timeout. If the computer fails to kick Stage1 (e.g., due to a hardware fault or programming error), Stage1 will eventually timeout. This event will start the Stage2 timer and, simultaneously, notify the computer (by mean of a non-maskable interrupt) that a reset is imminent. Until Stage2 times out, the computer may attempt to record state information, debug information, or both. The computer will be reset upon Stage2 timeout.
See also
- Immunity Aware Programming
- Power distribution unit
- Dead man's switch
- Heartbeat (computing)
- Keepalive
References
External links