SICS Troubleshooting


There is no such thing as bug-free software: there are always bugs, nasty behaviour and the like. The usual symptoms are that a client cannot connect to the server, that the server is not responding, or that error messages show up. This section helps to solve such problems.

Looking at Log Files

The first thing to do, especially when confronted with confusing statements from either users or instrument scientists, is to look at the SICS server's log files. The last 1000 lines of the instrument log are accessible from any SICS client or through the WWW interface. The SICS commands:

commandlog tail
shows the last 20 lines of the log.
commandlog tail n
shows the last n lines of the log.
will show you the information available. In order to see more, log in to the instrument account and cd into the log directory. In there are files with names like:
auto2001-08-08@00-01-01.log
This means the log file was started on August 8, 2001 at 00:01:01. A new log file is started daily. Load the appropriate files into an editor and look at what really happened.
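Because the file names carry their start time, the most recent log can be picked by name alone. The following is a minimal sketch, assuming the log directory is called log and that all files follow the auto<date>@<time>.log pattern shown above. The function name latest_log is only illustrative and not part of SICS.

```shell
# latest_log DIR: print the newest SICS log file in DIR (default: log).
# Assumes names like auto2001-08-08@00-01-01.log, which sort
# chronologically when sorted alphabetically.
latest_log() {
    dir="${1:-log}"
    ls "$dir"/auto*.log 2>/dev/null | sort | tail -n 1
}
```

A typical use would be: tail -n 50 "$(latest_log log)"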

Another good idea is to use the unix command grep on assorted log files. A grep for the strings ERROR or WARNING will more often than not give an indication of the nature of the problem.
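Such a search might look like the sketch below. It assumes the log files live in a directory called log and end in .log. The function name scan_logs is made up for this example.

```shell
# scan_logs DIR: print every line containing ERROR or WARNING from all
# log files in DIR (default: log), prefixed with file name and line number.
scan_logs() {
    dir="${1:-log}"
    # -H always prints the file name, -n the line number,
    # -E enables the ERROR|WARNING alternation.
    grep -H -n -E 'ERROR|WARNING' "$dir"/*.log
}
```

The file name and line number in front of each hit make it easy to jump to the right place in an editor afterwards.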

The log files show you all commands given and all the responses of the system. Additionally, there are hourly time stamps in the file which allow you to narrow down when the problem started. Things to watch out for are:

MOTOR ALARM
This message means that the motor failed to reach its position a couple of times. This is caused by either a concrete shielding element blocking the movement of the instrument, badly adjusted motor parameters, mechanical failures or the air cushions not operating properly.
EL734__BAD_EMERG_STOP
Somebody has pushed the emergency stop button. This must be released before the instrument can move again. Moreover the motor controller will not respond to further commands in this mode. Thus restarting SICS on this error message will make SICS fail to initialize the motors affected!
EL***__BAD_PIPE, BAD_RECV, BAD_ILLG, BAD_TMO, BAD_SEND
Network communication problems. Can generally be solved by restarting SICS.
EL737__BAD_BSY
A counting operation was aborted while the beam was off. Unfortunately, the counter box does not respond to commands in this state and ignores the stop command sent to it during the abort operation. This can be safely ignored, SICS fixes this condition.
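The list above can be turned into a quick check of a single log file. This is only a sketch: the function name triage_log and the wording of the hints are invented here, and the patterns assume the messages appear in the log exactly as spelled above.

```shell
# triage_log FILE: print one hint per known message class found in FILE.
triage_log() {
    file="$1"
    if grep -q 'MOTOR ALARM' "$file"; then
        echo "MOTOR ALARM: check shielding, motor parameters, mechanics and air cushions"
    fi
    if grep -q 'BAD_EMERG_STOP' "$file"; then
        echo "EMERGENCY STOP: release the button before restarting SICS"
    fi
    if grep -q -E 'BAD_(PIPE|RECV|ILLG|TMO|SEND)' "$file"; then
        echo "NETWORK: restarting SICS usually helps"
    fi
    if grep -q 'BAD_BSY' "$file"; then
        echo "COUNTER BUSY: can be ignored, SICS fixes this itself"
    fi
}
```

The emergency stop check is the important one: it tells you not to restart SICS before the button has been released.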

Restarting SICS

All of SICS can be restarted through the command:

monit restart all

Starting SICS

An essential prerequisite of SICS is that the server is up and running. The system is configured to restart the SICServer whenever it fails. The SICServer must be restarted by hand only after a reboot or when the keepalive processes were killed (see below). This is done for all instruments by typing:

monit
at the command prompt. startsics actually starts two programs: one is the replicator application which is responsible for the automatic copying of data files to the laboratory server. The other is the SICS server. Both programs are started by means of a shell script called keepalive. keepalive is basically an endless loop which calls the program again and again and thus ensures that the program will never stop running.

When the SICS server hangs, or you want to enforce a reinitialization of everything, the server process must be killed. This can be accomplished either manually or through a shell script.

Stopping SICS

All SICS processes can be stopped through the commands:

monit stop all
monit quit
given at the unix command line. You must be the instrument user (for example DMC) on the instrument computer for this to work properly.
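Since the commands only work for the instrument user, a small wrapper can refuse to run for anybody else. The sketch below is an assumption-laden example: the function name stop_sics and the MONIT variable are invented here. MONIT exists only so the wrapper can be tried out with echo in place of the real monit program.

```shell
# stop_sics USER: stop all SICS processes, but only when run as USER.
# Set MONIT=echo for a dry run that prints the monit arguments instead.
stop_sics() {
    expected_user="$1"
    current_user=$(id -un)
    if [ "$current_user" != "$expected_user" ]; then
        echo "run this as $expected_user, not as $current_user" >&2
        return 1
    fi
    "${MONIT:-monit}" stop all && "${MONIT:-monit}" quit
}
```

On DMC, for example, this would be called as: stop_sics DMC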

Restart Everything

If nothing seems to work any more, no connections can be obtained and so on, then the next guess is to restart everything. This is especially necessary if mechanics or electronics people were closer to the instrument than 400 meters.

  1. Reboot the histogram memory. It has a tiny button labelled RST. That's the one. Can be operated with a hairpin, a ball point pen or the like.
  2. Wait 5 minutes.
  3. Restart the SICServer. Watch for any messages about things not being connected or configured.
  4. Restart and reconnect the client programs.
If this fails (even after a second time), there may be a network problem which cannot be resolved by simple means.

Checking SICS Startup

Sometimes it happens that the SICServer hangs while starting up or hardware components are not properly initialized. In such cases it is useful to look at the SICS server's startup messages. On the instrument account issue the commands:

monit stop sicsserver
cd inst_sics
./SICServer inst.tcl | more
Replace inst with the name of the appropriate instrument in lower case. For example, from the home directory of the hrpt account on the computer hrpt:
cd
monit stop sicsserver
cd hrpt_sics
./SICServer hrpt.tcl | more
This allows you to page through the SICS startup messages and will help to identify the troublesome component. Then proceed to check the component and the connections to it.
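The substitution of the instrument name can be sketched as a small helper which prints the commands for a given instrument rather than running them. The function name startup_commands is hypothetical. It only assumes the inst_sics and inst.tcl naming described above.

```shell
# startup_commands INST: print the commands for checking the SICS startup
# of instrument INST. The name is converted to lower case first.
startup_commands() {
    inst=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    printf '%s\n' \
        "cd" \
        "monit stop sicsserver" \
        "cd ${inst}_sics" \
        "./SICServer ${inst}.tcl | more"
}
```

Calling startup_commands HRPT prints the four commands of the hrpt example above.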

HELP debugging!!!!

The SICS server hanging or crashing should not happen. In order to sort such problems out, it is very helpful if any available debugging information is saved and presented to the programmers. The information available consists of the log files as written continuously by the SICS server and possible core files lying around. They have just this name: core.pid, where pid is the process identification number. In order to save them, create a new directory (for example dump2077) and copy the stuff in there. This looks like:

/home/DMC> mkdir dump2077
/home/DMC> cp log/*.log dump2077
/home/DMC> cp core.2077 dump2077
The /home/DMC> is just the command prompt. Please note, that core files are only available after crashes of the server. These few commands will help to analyse the cause of the problem and to eventually resolve it.
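The same steps can be wrapped into one function so that nothing is forgotten in the heat of the moment. This is a sketch only: the name save_dump is invented, and it assumes it is run from the home directory of the instrument account, where the log directory and any core files are found.

```shell
# save_dump PID: copy all log files, and core.PID if present,
# into a new directory dumpPID below the current directory.
save_dump() {
    pid="$1"
    dump_dir="dump$pid"
    mkdir -p "$dump_dir" || return 1
    cp log/*.log "$dump_dir"/ || return 1
    if [ -f "core.$pid" ]; then
        cp "core.$pid" "$dump_dir"/
    else
        # Core files exist only after a crash of the server.
        echo "no core.$pid found, saved log files only" >&2
    fi
}
```

For the example above the call would be: save_dump 2077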