Research Update

Content Warning: Boring Research Update


Exascale Monitoring and Prediction

For those of you not familiar with the project: we are using data mining to better understand HPC resource monitoring systems, with the goals of making monitoring operations more efficient and predicting failures.


I’ve had some communication with -contact person- (LANL contact person) regarding tools and administrivia. It’s possible that we will use one of the LANL GitHub repos for collaboration (e.g., the HPC repo here), which could make some aspects of collaboration easier. Side note: I see that Sam is a contributor to a number of things at this LANL repo.

Dorian suggested that we submit an abstract to the CSGSA conference on the project, mostly to solicit input from any attendees. Obviously, we have nothing to offer in terms of novel research yet, but we could glean some insight if there is any to be gleaned at the event.

I told Dorian that I’d have a draft of the abstract ready, so here it is. Please feel free to poke at it – it’s my first CS abstract. Actually, poke at it, please.


Authors: Aaron Gonzales, [LANL people], Abdullah Mueen, Dorian Arnold?

We propose a method and system for identifying patterns in high-performance computing (HPC) systems that lead to notable system events (e.g., hardware failures or faults), together with a general qualitative investigation of resource monitoring paradigms. We hope our investigation will both lead to more effective monitoring strategies, by identifying the most informative monitoring features and sampling rates, and reduce bottlenecks and slowdowns, by predicting faults early enough to mitigate them. Our prediction methods will primarily involve time-series mining and experimentation with different monitoring paradigms, including event-based Unix syslog and continuous monitoring with LDMS, developed at Sandia National Laboratories. Research will be conducted with Los Alamos National Laboratory’s [something something team/system].
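
To make “time-series mining” a little more concrete, here is a rough Python sketch of one common way such a problem gets framed: slice a monitored metric into fixed windows, compute a couple of summary features per window, and label each window by whether an event follows shortly after it ends. This is an illustration only, not the project’s method. The function name, the 60-second window, the 120-second horizon, and the choice of mean and max as features are all assumptions, and the data in the usage note is made up.

```python
from statistics import mean


def window_features(samples, event_times, window=60, horizon=120):
    """Summarize a metric into fixed windows and label each window.

    samples: list of (timestamp_seconds, value) pairs, sorted by time.
    event_times: timestamps of notable events (e.g., faults).
    Returns a list of (window_start, mean, max, label), where label is 1
    if any event falls within `horizon` seconds after the window ends.
    All names and default values here are hypothetical.
    """
    if not samples:
        return []
    rows = []
    start = samples[0][0]
    end_of_data = samples[-1][0]
    while start + window <= end_of_data:
        stop = start + window
        # Values whose timestamps fall inside [start, stop).
        values = [v for t, v in samples if start <= t < stop]
        if values:
            # Label looks only at events after the window closes.
            label = int(any(stop <= e < stop + horizon for e in event_times))
            rows.append((start, mean(values), max(values), label))
        start = stop
    return rows
```

For example, with a flat synthetic metric sampled every 10 seconds for 5 minutes and a single made-up event at t = 130, `window_features([(t, 1.0) for t in range(0, 300, 10)], [130])` gives four windows, of which only the first two are labeled 1. Rows like these could then be fed to any classifier, and the window and horizon sizes are exactly the kind of “features and rates” question the abstract is asking about.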

I don’t have any ideas for titles yet other than an alliterative thing based on the words:

  • Exascale
  • Monitoring
  • Prediction
  • Events

I probably won’t go with “Adventures in Collaboration: This Will Be an Interesting Project Without Access to Data”.

Suggestions are welcome.


  • Refine abstract, submit abstract, make poster out of not too much.
  • Continue setup and background investigation with -contact person-.