Research Update

Content Warning: Boring Research Update

Project

Exascale Monitoring and Prediction

For those of you not familiar with the project: we are applying data mining to HPC resource monitoring systems, with the goal of making monitoring operations more efficient and predicting failures.

Status

The LANL folks implied that giving us data to examine would be essentially impossible or not useful. The main monitoring tools capture syslog data, and I was told to work on developing tools or methodology using “any syslog data”. If you’re not aware of how syslog data is structured, here is a sample:

Mar  2 07:43:26 diamondreo anacron[129442]: Job `cron.daily' terminated
Mar  2 07:43:26 diamondreo anacron[129442]: Normal exit (1 job run)
Mar  2 07:43:30 diamondreo NetworkManager[1264]: <warn> error monitoring device for netlink events: No buffer space available
Mar  2 07:44:01 diamondreo CRON[4056]: (root) CMD (/usr/bin/rsync -a loghost::deny/hosts.deny /etc/hosts.deny 2> /dev/null)
Mar  2 07:45:30 diamondreo NetworkManager[1264]: <warn> error monitoring device for netlink events: No buffer space available
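Each line in the sample follows the traditional syslog layout: timestamp, host, program name with an optional PID in brackets, then a free-text message. As a rough sketch of how such lines could be split into fields (not the tooling used in the project), the Python below uses a regular expression. The names SYSLOG_RE and parse_line are made up for illustration, and because the timestamp carries no year, the caller has to supply one; the year 2000 in the example is an arbitrary placeholder.

```python
import re
from datetime import datetime

# Hypothetical pattern for lines shaped like the sample above:
# "<Mon> <day> <HH:MM:SS> <host> <program>[<pid>]: <message>"
SYSLOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?:\s*"
    r"(?P<message>.*)$"
)


def parse_line(line, year):
    """Split one syslog line into fields; return None if it does not match.

    The log timestamp has no year, so `year` is an assumption the caller makes.
    """
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    # Collapse the padding in "Mar  2" before handing the string to strptime.
    ts_text = " ".join(m["ts"].split())
    timestamp = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
    return {
        "timestamp": timestamp,
        "host": m["host"],
        "program": m["program"],
        "pid": int(m["pid"]) if m["pid"] else None,
        "message": m["message"],
    }


# Example with one line from the sample; the year is a placeholder.
record = parse_line(
    "Mar  2 07:44:01 diamondreo CRON[4056]: (root) CMD "
    "(/usr/bin/rsync -a loghost::deny/hosts.deny /etc/hosts.deny 2> /dev/null)",
    year=2000,
)
print(record["program"], record["pid"], record["timestamp"])
```

Real syslog dumps usually contain lines that break this layout (kernel messages, "last message repeated" notices, multi-line output), so returning None for non-matching lines lets those be counted and inspected separately rather than silently dropped.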

This came from a dump of departmental syslog data that George gave me. I met with Mueen to discuss how to model this as a time series; this will remain a key challenge in our project, since the choice of representation will greatly affect how well we can predict anything. He sent me several papers to read on the subject.
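To make the modeling question concrete, one of the simplest possible representations is a count of events per fixed time window. This is only an illustrative baseline under stated assumptions, not the approach from Mueen's papers and not something we have settled on. The function name bin_counts, the one-minute window, and the year 2000 in the timestamps are all assumptions made for the example; the times themselves are taken from the five sample lines above.

```python
from collections import Counter
from datetime import datetime, timedelta


def bin_counts(timestamps, origin, width=timedelta(minutes=1)):
    """Count events in consecutive windows of `width`, starting at `origin`.

    Windows with no events are kept as zeros so the series is evenly spaced.
    Assumes every timestamp is at or after `origin`.
    """
    if any(t < origin for t in timestamps):
        raise ValueError("all timestamps must be at or after origin")
    # timedelta // timedelta yields the integer index of the window.
    counts = Counter((t - origin) // width for t in timestamps)
    if not counts:
        return []
    return [counts[i] for i in range(max(counts) + 1)]


# Event times from the five sample lines (placeholder year).
events = [
    datetime(2000, 3, 2, 7, 43, 26),
    datetime(2000, 3, 2, 7, 43, 26),
    datetime(2000, 3, 2, 7, 43, 30),
    datetime(2000, 3, 2, 7, 44, 1),
    datetime(2000, 3, 2, 7, 45, 30),
]
series = bin_counts(events, origin=datetime(2000, 3, 2, 7, 43, 0))
print(series)  # [3, 1, 1]
```

A count series like this throws away the message text entirely. Building one series per program or per message type keeps more information at the cost of many sparse series, which is part of why the choice of representation matters.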

We also had a conference call with LANL operations people, who gave us a bit more information about their logging process: storage, event types, filtering, etc.

Upcoming

  • -contact person- (LANL point of contact) may set up another meeting with the LANL sysadmin team and also coordinate a communication system.
  • I will read the papers that Mueen sent me. Perhaps I’ll propose one of them for reading group?
  • I will continue to characterize the syslog data and, hopefully, evaluate how well different models work on its time-series representation.