# Research Update

## Project

### Exascale Monitoring and Prediction

For those of you not familiar with the project: we are using data mining to better understand HPC resource monitoring systems, with the goals of making monitoring operations more efficient and predicting failures.

## Status

The LANL folks implied that giving us data to examine would be essentially impossible or not useful. Their main monitoring tools capture syslog data, and I was told to work on developing tools or methodology for “any syslog data”. If you’re not aware of how syslog data is structured, here is a sample:

    Mar  2 07:43:26 diamondreo anacron[129442]: Job `cron.daily' terminated
    Mar  2 07:43:26 diamondreo anacron[129442]: Normal exit (1 job run)
    Mar  2 07:43:30 diamondreo NetworkManager[1264]: <warn> error monitoring device for netlink events: No buffer space available
    Mar  2 07:44:01 diamondreo CRON[4056]: (root) CMD (/usr/bin/rsync -a loghost::deny/hosts.deny /etc/hosts.deny 2> /dev/null)
    Mar  2 07:45:30 diamondreo NetworkManager[1264]: <warn> error monitoring device for netlink events: No buffer space available

This came from a dump of departmental syslog data George gave me. I met with Mueen to discuss how to model this as a time series; this will remain a key challenge in our project, since the choice of model will greatly affect how well we can predict failures. He sent me several papers to read on the subject.
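
To make the time-series question concrete, one simple representation is to parse each line and count events per minute for each program. The following is a minimal sketch and not the model we will end up using: the regular expression, the names `parse_line` and `counts_per_minute`, the one-minute bin width, and the `year` argument (syslog timestamps omit the year, so it has to be supplied) are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: "Mon DD HH:MM:SS host program[pid]: message".
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<prog>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<msg>.*)$"
)

# A few lines taken from the sample above.
SAMPLE = [
    "Mar  2 07:43:26 diamondreo anacron[129442]: Normal exit (1 job run)",
    "Mar  2 07:43:30 diamondreo NetworkManager[1264]: <warn> error monitoring "
    "device for netlink events: No buffer space available",
    "Mar  2 07:44:01 diamondreo CRON[4056]: (root) CMD (/usr/bin/rsync -a "
    "loghost::deny/hosts.deny /etc/hosts.deny 2> /dev/null)",
    "Mar  2 07:45:30 diamondreo NetworkManager[1264]: <warn> error monitoring "
    "device for netlink events: No buffer space available",
]


def parse_line(line, year):
    """Split one syslog line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Collapse the padded day ("Mar  2") before parsing the timestamp.
    ts_text = " ".join(m.group("ts").split())
    when = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
    pid = m.group("pid")
    return {
        "time": when,
        "host": m.group("host"),
        "program": m.group("prog"),
        "pid": int(pid) if pid is not None else None,
        "message": m.group("msg"),
    }


def counts_per_minute(lines, year):
    """Count events per (minute, program); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line, year)
        if rec is None:
            continue
        minute = rec["time"].replace(second=0)
        counts[(minute, rec["program"])] += 1
    return counts
```

Calling `counts_per_minute(SAMPLE, year)` with whatever year the dump covers gives one count per minute and program, which is only one of many possible representations; the bin width and the choice of key (program, host, or message type) are exactly the modeling decisions still open.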

We also had a conference call with LANL operations people, who gave us a bit more information about how their logging process works: storage, event types, filtering, etc.

## Upcoming

• -contact person- (LANL point of contact) may set up another meeting with the LANL sysadmin team and also coordinate a communication system.
• I will read the papers that Mueen sent me. Perhaps I’ll propose one of them for the reading group?
• I will continue to characterize the syslog data and hopefully evaluate how different models can work on it based on its time-series representation.
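
As a first pass at characterizing the data, one option is to group messages by program and by a rough "template" in which runs of digits are masked, so that lines differing only in a count or ID are tallied together. This is only a sketch under that assumption: the names `split_line`, `template`, and `template_counts` and the `<N>` placeholder are hypothetical, and digit masking is a crude stand-in for a proper log-template method.

```python
import re
from collections import Counter

# Two lines from the departmental sample, used only for illustration.
LINES = [
    "Mar  2 07:43:26 diamondreo anacron[129442]: Normal exit (1 job run)",
    "Mar  2 07:43:30 diamondreo NetworkManager[1264]: <warn> error monitoring "
    "device for netlink events: No buffer space available",
    "Mar  2 07:45:30 diamondreo NetworkManager[1264]: <warn> error monitoring "
    "device for netlink events: No buffer space available",
]


def split_line(line):
    """Return (program, message), or None if the line lacks the usual header."""
    parts = line.split(None, 4)  # month, day, time, host, remainder
    if len(parts) < 5:
        return None
    tag, sep, message = parts[4].partition(": ")
    if not sep:
        return None
    program = tag.split("[", 1)[0]  # drop the "[pid]" suffix if present
    return program, message


def template(message):
    """Mask digit runs so messages differing only in numbers coincide."""
    return re.sub(r"\d+", "<N>", message)


def template_counts(lines):
    """Count occurrences of each (program, message template) pair."""
    counts = Counter()
    for line in lines:
        fields = split_line(line)
        if fields is None:
            continue
        program, message = fields
        counts[(program, template(message))] += 1
    return counts
```

Sorting the result with `template_counts(LINES).most_common()` shows which message types dominate, which may help decide which event streams are worth modeling as separate series.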

