$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
%$%                                                                     %$%
$%$                 Electronic Switching System Faults                  $%$
%$%                                                                     %$%
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$




"Notes from No. 2 ESS Administration and Maintenance Plan,"
"BSTJ Vol. 48, 1969"

"Data Maintenance"


Memory mutilation results from hardware faults and program bugs.
During nonsynchronous operation, mismatch detection is not available, so
there may be a long period of time during which mutilation occurs.
Mismatch detection is useless in finding data mutilation caused by program bugs.

Data maintenance is aided by
    ease of communication among programs,
    absence of linked lists, and
    per-call memory allocation (call processing program addressing is relative
    to the allocated memory, reducing the scope of data accesses).

Defensive programming techniques:

    Range check table indexes,
    Zero check derived transfer-to addresses, and
    Distinct program and data errors prevent programs being read as data.
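
The first two defensive checks can be sketched in Python as follows. This is an illustration only: the source does not say how No. 2 ESS codes these checks, and the names DefensiveCheckError, checked_lookup, checked_transfer, seize_trunk and HANDLERS are hypothetical.

```python
# Illustrative sketch only; every name here is hypothetical, not from the source.

class DefensiveCheckError(Exception):
    """Raised when a defensive check rejects a value before it is used."""


def checked_lookup(table, index):
    """Range check a table index before using it."""
    if not isinstance(index, int) or not 0 <= index < len(table):
        raise DefensiveCheckError(f"index {index!r} outside 0..{len(table) - 1}")
    return table[index]


def checked_transfer(handlers, index):
    """Zero check a derived transfer-to address before transferring to it.

    Assumption: an unfilled table slot is represented by 0 (or None).
    """
    target = checked_lookup(handlers, index)
    if not target:
        raise DefensiveCheckError(f"transfer-to address for entry {index} is zero")
    return target()


def seize_trunk():
    return "trunk seized"


HANDLERS = [seize_trunk, 0]  # entry 1 is deliberately left zero
```

A bad index or a zero entry is caught before any transfer happens, so the error surfaces at the check instead of as a wild jump.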
Audit programs detect bad data.
Audits run periodically or as requested from the TTY.
Separate audits exist for different memory blocks.
Audits correct by idling memory blocks containing bad data.
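
A rough Python sketch of an audit that corrects by idling follows. The block layout (state, call_id, trunk) and the consistency rules are assumptions made for the example; the source does not give the real call store formats.

```python
from dataclasses import dataclass

IDLE, BUSY = "idle", "busy"


@dataclass
class MemoryBlock:
    # Hypothetical per-call block layout, invented for this sketch.
    state: str = IDLE
    call_id: int = 0   # 0 means "no call"
    trunk: int = -1    # -1 means "no trunk"


def block_is_consistent(block, trunk_count):
    """Assumed consistency rules for one block."""
    if block.state == IDLE:
        return block.call_id == 0 and block.trunk == -1
    if block.state == BUSY:
        return block.call_id > 0 and 0 <= block.trunk < trunk_count
    return False  # unknown state value


def audit_blocks(blocks, trunk_count):
    """Idle every inconsistent block; return the indexes that were idled."""
    idled = []
    for i, block in enumerate(blocks):
        if not block_is_consistent(block, trunk_count):
            blocks[i] = MemoryBlock()  # correction: return the block to idle
            idled.append(i)
    return idled
```

Idling loses the call that owned the block but stops the bad data from spreading, which matches the trade-off the notes describe.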
System recovery is initiated by a control unit switch during simplex operation; the control
unit switch can be caused by bad data or bugs that cause a sanity time-out.

System recovery functions:
    Make call store consistent with state of periphery,
    Clear memory associated with program in control at time of recovery,
    Run audits,
    Repeat the above with widening scope of memory initialization until sanity is obtained.
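
The "repeat with widening scope" step can be sketched as a loop over recovery phases. This is a simplified Python illustration, not the No. 2 ESS procedure: the recover function, the phase list and the three callables are all hypothetical, and the step that makes call store consistent with the periphery is assumed to live inside each phase's clearing function.

```python
def recover(phases, run_audits, is_sane):
    """Apply initialization phases in order of widening scope.

    phases     -- list of (name, clear_function), narrowest scope first
    run_audits -- callable that runs the audits after each clearing step
    is_sane    -- callable returning True once the system behaves sanely
    Returns the name of the phase that restored sanity, or None.
    """
    for name, clear in phases:
        clear()        # initialize memory within this phase's scope
        run_audits()   # let the audits repair or idle what is still bad
        if is_sane():
            return name
    return None  # every phase failed; handling that is outside this sketch
```

Ordering the phases from narrowest to widest means the fewest calls are disturbed when a small initialization is enough.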




"Notes from Design of Recovery Strategies for A Fault Tolerant No. 4 ESS"
"by R. J. Willet - BSTJ vol 61, no 10, 4-13-82"

"Objectives"

616,000 call attempts/hour
100,000 active terminations
Downtime less than 2 hours in 40 years
Not cost-effective (or possible) to remove all software errors - minimize
number of service-affecting errors and analyze data for cause.
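
To put the objectives in more familiar units, the arithmetic can be worked in a few lines of Python. Only the three input figures come from the notes above; the 365.25-day average year is an assumption.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length

call_attempts_per_hour = 616_000
attempts_per_second = call_attempts_per_hour / 3600

downtime_hours = 2
service_years = 40
unavailability = downtime_hours / (service_years * HOURS_PER_YEAR)
downtime_minutes_per_year = downtime_hours * 60 / service_years

print(f"{attempts_per_second:.1f} call attempts per second")
print(f"unavailability {unavailability:.2e}")
print(f"{downtime_minutes_per_year:.0f} minutes of downtime per year")
```

That is roughly 171 call attempts per second, and a downtime budget of at most 3 minutes per year (an unavailability of about 5.7e-06).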


"Software Recovery"
Reconstruct data from associated information - slow, disturbs few calls.
Reinitialize memory structure - fast, disturbs many calls.


"Audit Programs"
Provide for integrity of system memory
Structured into mutilation detection and correction modules
Detection modules run continuously in background
Detection modules augmented by defensive checks in operational programs
Call correction modules to correct errors found by background audits or
defensive checks.


"System Integrity Programs"
Provide for integrity of programs
Monitor job scheduling and sequencing for frequency and execution times
Use sanity timers
Call audits or reinitialize system to correct errors.
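
A sanity timer is essentially a watchdog: a healthy program resets it on every cycle, and a program that hangs or loops lets it expire. The Python sketch below models the idea in software with a tick count. The SanityTimer class and its method names are hypothetical, and in the real system the timer is hardware.

```python
class SanityTimer:
    """Watchdog that must be reset before `limit` ticks elapse (software model)."""

    def __init__(self, limit, on_timeout):
        self.limit = limit
        self.on_timeout = on_timeout  # e.g. call audits or reinitialize
        self.elapsed = 0

    def reset(self):
        """Called by a healthy program each time it completes a cycle."""
        self.elapsed = 0

    def tick(self):
        """Called from a clock source independent of the monitored program."""
        self.elapsed += 1
        if self.elapsed >= self.limit:
            self.elapsed = 0
            self.on_timeout()
```

The timeout action is passed in as a callable so the same timer can either call audits or start a reinitialization, the two corrective actions the notes list.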


"Recovery from software problems"
Software problems are caused by program errors or bad data.
Out-of-range accesses trigger a hardware interrupt; recovery
requires correction of data, or killing of the call and return of control
to a safe point.
Inhibit (pest) interrupts while audits are correcting the problem;
risky, but assumes a single software fault.
In cases where the out-of-range error can be isolated to a single unit, frame-level pesting can be used; otherwise use system-level pesting.
Software recovery does not consider the possibility of a hardware fault.
Recovery cannot fix a program bug. Running pested may allow the system to
operate in a degraded fashion while maintenance personnel analyze data and
correct the program.
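
Pesting can be pictured as masking an interrupt source while the correction runs. The Python sketch below is a loose model under stated assumptions: InterruptSource and recover_with_pest are hypothetical names, and re-enabling the interrupt as soon as the correction finishes is a simplification (the notes say the system may also keep running pested for some time).

```python
class InterruptSource:
    """A hardware check that can be inhibited ("pested") during recovery."""

    def __init__(self, name):
        self.name = name
        self.pested = False

    def raise_interrupt(self, handler):
        """Deliver the interrupt unless it is currently inhibited."""
        if self.pested:
            return False  # the fault goes unreported while pested
        handler(self)
        return True


def recover_with_pest(source, correct_data):
    """Inhibit the interrupt, let the audit correct the data, then re-enable."""
    source.pested = True
    try:
        correct_data()
    finally:
        source.pested = False  # always restore the check
```

The risk the notes mention is visible in raise_interrupt: while pested, a second and unrelated fault on the same source would also be silently dropped.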
The buffer overflow problem - may be caused by program error.
Buffers are protected by hardware overflow interrupts.
Recovery runs the buffer unloader program to unload the buffer and audits the task dispenser program to ensure the unloader is scheduled properly.
The overflow interrupt is pested.
If the problem continues, hardware is suspect.
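
The buffer overflow recovery can be sketched in Python as below. An exception stands in for the hardware overflow interrupt, and a plain list of task names stands in for the task dispenser's schedule. Both are assumptions, as are all the names used.

```python
class OverflowInterrupt(Exception):
    """Stands in for the hardware overflow interrupt (assumption)."""


class Buffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def load(self, item):
        if len(self.items) >= self.capacity:
            raise OverflowInterrupt("buffer full")
        self.items.append(item)

    def unload(self):
        """The buffer unloader: drain and return everything queued."""
        drained, self.items = self.items, []
        return drained


def recover_from_overflow(buffer, schedule, unloader_name="unloader"):
    """Unload the buffer and make sure the unloader is back on the schedule."""
    drained = buffer.unload()
    if unloader_name not in schedule:  # audit of the task dispenser's list
        schedule.append(unloader_name)
    return drained
```

If the overflow recurs after the unloader has been confirmed on the schedule, the software explanation is exhausted, which is why the notes then point at the hardware.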



"No. 4 ESS: Maintenance Software"
"by M. N. Meyers, W. A. Routt and K. W. Yoder,"
"BSTJ Vol. 56, No. 7, September 1977"


"Software Error Recovery"
Since system operation is dependent on data in memories, and memories can be written, there is a possibility that memory will be in a state that precludes operation.
System must be as error-free as possible.
Since system cannot be completely error-free, it must be error tolerant.


"Classification of software errors"
Errors in interfaces between software modules.
Non-conformity to systems rules.