Content Warning: Boring Research Update
Exascale Monitoring and Prediction
For those of you not familiar with the project: we are applying data mining to HPC resource monitoring systems, with the goal of making monitoring operations more efficient and predicting failures.
The LANL folks implied that giving us data to examine would be essentially impossible or not useful. Their main monitoring tools capture syslog data, so I was told to develop tools and methodology on “any syslog data”. If you’re not aware of how syslog data is structured, here is a sample:
Mar 2 07:43:26 diamondreo anacron: Job `cron.daily' terminated
Mar 2 07:43:26 diamondreo anacron: Normal exit (1 job run)
Mar 2 07:43:30 diamondreo NetworkManager: <warn> error monitoring device for netlink events: No buffer space available
Mar 2 07:44:01 diamondreo CRON: (root) CMD (/usr/bin/rsync -a loghost::deny/hosts.deny /etc/hosts.deny 2> /dev/null)
Mar 2 07:45:30 diamondreo NetworkManager: <warn> error monitoring device for netlink events: No buffer space available
This came from a dump of departmental syslog data George gave me. I met with Mueen to discuss how to model this as a time series; this will remain a key challenge in our project, since the choice of model will greatly affect how well we can predict anything. He sent me several papers to read on the topic.
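To make the modeling question a bit more concrete, here is a minimal sketch of one possible time-series representation: parse each line into fields, then count events per minute per program. This is only an illustration, not the representation we have settled on. The regular expression is an assumption based on the sample above, the names `parse_line` and `counts_per_minute` are hypothetical, and because these syslog lines carry no year, the `year` argument is a placeholder.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, inferred from the sample: "Mon D HH:MM:SS host program: message".
# An optional "[pid]" after the program name is tolerated.
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)

def parse_line(line, year=2000):
    """Return a dict of fields, or None if the line does not match.

    The year is a placeholder: this timestamp format does not include one.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return {
        "time": when,
        "host": m.group("host"),
        "program": m.group("program"),
        "message": m.group("message"),
    }

def counts_per_minute(records):
    """Count events per (minute, program) bucket."""
    counts = Counter()
    for rec in records:
        minute = rec["time"].replace(second=0)
        counts[(minute, rec["program"])] += 1
    return counts

SAMPLE = [
    "Mar 2 07:43:26 diamondreo anacron: Job `cron.daily' terminated",
    "Mar 2 07:43:26 diamondreo anacron: Normal exit (1 job run)",
    "Mar 2 07:43:30 diamondreo NetworkManager: <warn> error monitoring device "
    "for netlink events: No buffer space available",
    "Mar 2 07:44:01 diamondreo CRON: (root) CMD (/usr/bin/rsync -a "
    "loghost::deny/hosts.deny /etc/hosts.deny 2> /dev/null)",
    "Mar 2 07:45:30 diamondreo NetworkManager: <warn> error monitoring device "
    "for netlink events: No buffer space available",
]

records = [r for r in map(parse_line, SAMPLE) if r is not None]
series = counts_per_minute(records)

for (minute, program), n in sorted(series.items()):
    print(minute.strftime("%H:%M"), program, n)
```

Per-minute counts are just one choice. The bucket width and the grouping key (program, host, or message type) are exactly the kind of modeling decisions that are still open.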
We also had a conference call with LANL operations people, who gave us a bit more information about their logging process: storage, event types, filtering, etc.
- -contact person- (LANL point of contact) may set up another meeting with the LANL sysadmin team and also coordinate a communication system.
- I will read the papers that Mueen sent me. Perhaps I’ll propose one of them for the reading group?
- I will continue to characterize the syslog data and hopefully evaluate how different models can work on it based on its time-series representation.
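As a first pass at characterizing the data, one simple option is to collapse messages into rough "templates" by masking the parts that vary, then count how often each template occurs. The sketch below masks only runs of digits, which is a deliberately crude assumption; `to_template` and `template_counts` are hypothetical names, and the messages are taken from the sample earlier in this post.

```python
import re
from collections import Counter

NUMBER_RE = re.compile(r"\d+")

def to_template(message):
    """Replace every run of digits with a <NUM> placeholder."""
    return NUMBER_RE.sub("<NUM>", message)

def template_counts(messages):
    """Count how many messages share each masked template."""
    return Counter(to_template(m) for m in messages)

messages = [
    "Job `cron.daily' terminated",
    "Normal exit (1 job run)",
    "<warn> error monitoring device for netlink events: No buffer space available",
    "<warn> error monitoring device for netlink events: No buffer space available",
]

counts = template_counts(messages)
for template, n in counts.most_common():
    print(n, template)
```

A short list of frequent templates would give a rough vocabulary of event types, which could then serve as the categories for whatever time-series model ends up being evaluated.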