$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
%$%                                                                     %$%
$%$                 Electronic Switching System Faults                  $%$
%$%                                                                     %$%
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


"Notes from No 2 ESS Administration and Maintenance Plan,"
"BSTJ Vol 48, 1969"

"Data Maintenance"

Memory mutilation results from hardware faults and program bugs.
During nonsynchronous operation, mismatch detection is not available, so
there may be a long period of time during which mutilation occurs.
Mismatch detection is useless in finding data mutilation caused by program bugs.

Data maintenance is aided by
ease of communication among programs,
absence of linked lists, and
per-call memory allocation (call processing program addressing is relative to the allocated memory, reducing the scope of data accesses).

Defensive programming techniques:

Range-check table indexes,
Zero-check derived transfer-to addresses, and
Distinct program and data errors prevent programs from being read as data.
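
The first two techniques above can be sketched in a few lines. This is only an illustration in Python, not ESS code: the table, the dispatch map, and the names checked_lookup, checked_transfer and DefensiveCheckError are all hypothetical.

```python
class DefensiveCheckError(Exception):
    """Raised when a defensive check rejects a value."""


def checked_lookup(table, index):
    """Range-check a table index before using it."""
    # Negative indexes are rejected too; Python would otherwise wrap them.
    if not isinstance(index, int) or not 0 <= index < len(table):
        raise DefensiveCheckError(f"table index out of range: {index!r}")
    return table[index]


def checked_transfer(handlers, address):
    """Zero-check a derived transfer-to address before dispatching to it."""
    if address == 0:
        raise DefensiveCheckError("derived transfer address is zero")
    handler = handlers.get(address)
    if handler is None:
        raise DefensiveCheckError(f"no routine at address {address:#x}")
    return handler()


# Hypothetical data: a translation table and a dispatch map keyed by address.
ROUTES = ["trunk-0", "trunk-1", "trunk-2"]
HANDLERS = {0x100: lambda: "seize", 0x104: lambda: "release"}
```

A caller would trap DefensiveCheckError and hand the problem to an audit rather than continue with the bad value.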
Audit programs detect bad data.
Audits run periodically or as requested from the TTY.
Separate audits cover different memory blocks.
Audits correct by idling memory blocks containing bad data.
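
A minimal sketch of an audit that corrects by idling, assuming a made-up block layout (a state field plus a trunk number). The consistency rule and all names here are assumptions for illustration only.

```python
IDLE, BUSY = "idle", "busy"


def make_block(state=IDLE, trunk=None):
    """Build one hypothetical per-call memory block."""
    return {"state": state, "trunk": trunk}


def block_is_consistent(block, trunk_count):
    """A busy block must name a valid trunk; an idle block must name none."""
    if block["state"] == IDLE:
        return block["trunk"] is None
    if block["state"] == BUSY:
        trunk = block["trunk"]
        return isinstance(trunk, int) and 0 <= trunk < trunk_count
    return False  # unknown state value


def audit_blocks(blocks, trunk_count):
    """Idle every inconsistent block and return the indexes that were idled."""
    idled = []
    for i, block in enumerate(blocks):
        if not block_is_consistent(block, trunk_count):
            block["state"], block["trunk"] = IDLE, None
            idled.append(i)
    return idled
```

Idling loses the call tied to the bad block but leaves every other block untouched, which is the trade the notes describe.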
System recovery is initiated by a control unit switch during simplex operation; a control
unit switch can be caused by bad data or by bugs that cause a sanity timeout.

System recovery functions:
Make call store consistent with state of periphery,
Clear memory associated with program in control at time of recovery,
Run audits,
Repeat the above with widening scope of memory initialization until sanity is obtained.
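
The recovery functions above amount to an escalation loop. The sketch below is an assumption-laden model, not the No. 2 ESS program: the three memory regions, the phase names, and the callbacks are invented to show the control flow.

```python
def recover(phases, sync_with_periphery, run_audits, system_is_sane):
    """Run phases in order of widening scope until the system is sane.

    phases is a list of (name, clear_memory) pairs, narrowest scope first.
    Returns the name of the phase that restored sanity, or None.
    """
    for name, clear_memory in phases:
        sync_with_periphery()   # make call store match the periphery
        clear_memory()          # initialize memory for this scope
        run_audits()
        if system_is_sane():
            return name
    return None


# Hypothetical system model: three memory regions, two of them mutilated.
memory = {"current_program": "bad", "transient_calls": "bad", "stable_calls": "ok"}


def clearing(*regions):
    """Build a phase action that reinitializes the named regions."""
    def clear_memory():
        for region in regions:
            memory[region] = "ok"
    return clear_memory


PHASES = [
    ("program in control", clearing("current_program")),
    ("transient calls", clearing("current_program", "transient_calls")),
    ("all call store", clearing(*memory)),
]

log = []
restored_by = recover(
    PHASES,
    sync_with_periphery=lambda: log.append("sync"),
    run_audits=lambda: log.append("audit"),
    system_is_sane=lambda: all(v == "ok" for v in memory.values()),
)
```

In this run the first phase is not enough, the second restores sanity, and the widest phase is never reached.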


"Notes from Design of Recovery Strategies for A Fault Tolerant No. 4 ESS"
"by R. J. Willet - BSTJ vol 61, no 10, 4-13-82"

"Objectives"

616,000 call attempts/hour
100,000 active terminations
Downtime less than 2 hours in 40 years
Not cost-effective (or possible) to remove all software errors - minimize
number of service-affecting errors and analyze data for cause.
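
To put the first and third objectives in more familiar units, the arithmetic can be worked as below. The only assumption is the average year length of 365.25 days; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year including leap days


def downtime_budget(max_down_hours, years):
    """Convert a downtime objective into availability and minutes per year."""
    total_hours = years * HOURS_PER_YEAR
    return {
        "availability": 1.0 - max_down_hours / total_hours,
        "minutes_per_year": max_down_hours * 60 / years,
    }


budget = downtime_budget(2, 40)
attempts_per_second = 616_000 / 3600

print(f"availability      {budget['availability']:.7f}")
print(f"minutes per year  {budget['minutes_per_year']:.1f}")
print(f"attempts/second   {attempts_per_second:.1f}")
```

The downtime objective works out to 3 minutes per year, and the call-attempt objective to roughly 171 attempts per second.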


"Software Recovery"
Reconstruct data from associated information - slow, disturbs few calls.
Reinitialize memory structure - fast, disturbs many calls.


"Audit Programs"
Provide for integrity of system memory
Structured into mutilation detection and correction modules
Detection modules run continuously in background
Detection modules augmented by defensive checks in operational programs
Call correction modules to correct errors found by background audits or
defensive checks.
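
The split between detection and correction modules can be sketched as follows. The data model (links that point at registers) and every function name are hypothetical; the point is only that the background audit and the defensive check in operational code share one correction module.

```python
def detect_orphan_links(links, registers):
    """Detection module: report links that point at a register not in use."""
    return [link_id for link_id, reg in links.items() if reg not in registers]


def correct_orphan_link(links, link_id):
    """Correction module: release the mutilated link."""
    links.pop(link_id, None)


def background_audit(links, registers):
    """One background pass: detect, then call correction for each finding."""
    findings = detect_orphan_links(links, registers)
    for link_id in findings:
        correct_orphan_link(links, link_id)
    return findings


def use_link(links, registers, link_id):
    """Operational code with a defensive check that calls the same corrector."""
    reg = links.get(link_id)
    if reg not in registers:
        correct_orphan_link(links, link_id)
        return None
    return registers[reg]
```

Keeping correction separate means an error is repaired the same way whether the background pass or a live call finds it first.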


"System Integrity Programs"
Provide for integrity of programs
Monitor job scheduling and sequencing for frequency and execution times
Use sanity timers
Call audits or reinitialize system to correct errors.
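
A sanity timer can be modeled in software as below. This is a sketch under assumed units (abstract ticks) with a hypothetical job list; a real sanity timer is hardware that the program must reset before it expires.

```python
class SanityTimer:
    """Software model of a timer that must be reset before it expires."""

    def __init__(self, limit_ticks):
        self.limit = limit_ticks
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0

    def tick(self, ticks=1):
        """Advance time; return True if the timer has expired."""
        self.elapsed += ticks
        return self.elapsed >= self.limit


def run_schedule(jobs, timer, on_timeout):
    """Run jobs in order; each job returns the ticks it consumed."""
    completed = []
    for name, job in jobs:
        timer.reset()
        if timer.tick(job()):
            on_timeout(name)   # e.g. call audits or reinitialize
            break
        completed.append(name)
    return completed


# Hypothetical jobs: the second one runs far past the limit.
JOBS = [("scan", lambda: 2), ("digits", lambda: 50), ("billing", lambda: 1)]
timeouts = []
done = run_schedule(JOBS, SanityTimer(10), timeouts.append)
```

Here "scan" completes, "digits" trips the timer, and "billing" never runs because recovery takes over.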


"Recovery from software problems"
Software problems caused by program errors or bad data
Out-of-range accesses trigger a hardware interrupt; recovery
requires correction of data, or killing of call and return of control
to a safe point.
Inhibit (pest) interrupts while audits are correcting problem;
risky, but assumes single software fault.
In cases where the out-of-range error can be isolated to a single unit, frame level pesting can be used; otherwise use system level pesting.
Software recovery does not consider the possibility of a hardware fault.
Recovery cannot fix a program bug. Running pested may allow the system to
operate in a degraded fashion while maintenance personnel analyze data and
correct the program.
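
The choice between frame level and system level pesting can be sketched as below. The class, the frame names, and the rule that a pest stays set until explicitly cleared are assumptions made for illustration, not details from the paper.

```python
class InterruptPest:
    """Tracks which out-of-range interrupt sources are inhibited (pested)."""

    def __init__(self):
        self.system_pested = False
        self.pested_frames = set()

    def pest(self, frame=None):
        """Pest one frame if the fault is isolated, otherwise the whole system."""
        if frame is None:
            self.system_pested = True
        else:
            self.pested_frames.add(frame)

    def unpest_all(self):
        self.system_pested = False
        self.pested_frames.clear()

    def accepts(self, frame):
        return not self.system_pested and frame not in self.pested_frames


def handle_out_of_range(pest, frame, isolated, correct):
    """Return True if the interrupt was serviced, False if it was inhibited."""
    if not pest.accepts(frame):
        return False
    pest.pest(frame if isolated else None)
    correct(frame)   # audits correct the data or kill the call
    return True
```

Frame level pesting keeps interrupts from other frames alive, so it gives up less protection than a system level pest.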
The buffer overflow problem - may be caused by program error.
Buffers are protected by hardware overflow interrupts.
Recovery runs the buffer unloader program to unload the buffer and audits the task dispenser program to ensure the unloader is scheduled properly.
The overflow interrupt is pested.
If the problem continues, hardware is suspect.
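
The buffer overflow sequence can be sketched as a single handler. Everything here is hypothetical: the state flags, the task list standing in for the task dispenser, and in particular how a continuing problem is reported once the interrupt is pested, which the notes do not specify.

```python
def handle_overflow_report(buffer, unload, dispenser_tasks, state):
    """Handle one overflow report; return the action taken."""
    if state["overflow_pested"]:
        # Already recovered once and pested, yet the problem came back.
        state["hardware_suspect"] = True
        return "suspect hardware"
    while buffer:                           # run the buffer unloader
        unload(buffer.pop(0))
    if "unloader" not in dispenser_tasks:   # audit the task dispenser
        dispenser_tasks.append("unloader")
    state["overflow_pested"] = True         # pest the overflow interrupt
    return "unloaded and pested"


# Hypothetical starting point: a full buffer and a dispenser missing the unloader.
buffer = [1, 2, 3]
unloaded = []
tasks = ["scanner"]
state = {"overflow_pested": False, "hardware_suspect": False}

first = handle_overflow_report(buffer, unloaded.append, tasks, state)
second = handle_overflow_report(buffer, unloaded.append, tasks, state)
```

The first report is treated as a software problem and repaired; the second, arriving after the repair, shifts suspicion to hardware.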


"No. 4 ESS: Maintenance Software"
"by M. N. Meyers, W. A. Routt and K. W. Yoder,"
"BSTJ Vol. 56, No. 7, September 1977"


"Software Error Recovery"
Since system operation is dependent on data in memories, and memories can be written, there is a possibility that memory will be in a state that precludes operation.
System must be as error-free as possible.
Since system cannot be completely error-free, it must be error tolerant.


"Classification of software errors"
Errors in interfaces between software modules.
Non-conformity to systems rules.