HUST 2018
Fifth Annual Workshop on HPC User Support Tools
November 11, 2018, Dallas, Texas, USA.
Held in conjunction with SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis.
Overview
Supercomputing centers exist to drive scientific discovery by supporting researchers in computational science fields. To make users more productive in the complex HPC environment, HPC centers employ user support teams. These teams serve many roles, from setting up accounts, to consulting on math libraries and code optimization, to managing HPC software stacks. Even so, support teams often struggle to serve scientists adequately: HPC environments are inherently complex, and the combination of multi-user installations, exotic hardware, and the maintenance of research software makes supporting HPC users extremely demanding.
With the fifth HUST workshop, we continue to provide a much-needed forum for system administrators, user support team members, tool developers, policy makers, and end users: a place to discuss support issues and a publication venue for current developments in user support. Best practices, user support tools, and any ideas to streamline user support at supercomputing centers are in scope.
Program
HUST18 Workshop Program
Sunday November 11, 2018
09:00-09:30 - OOOPS: An Innovative Tool for IO Workload Management on Supercomputers
- Presented by Dr. Lei Huang - TACC
- ABSTRACT
Modern supercomputer applications demand powerful storage resources in addition to fast computing resources. However, these storage resources, especially parallel shared filesystems, have become the Achilles’ heel of many powerful supercomputers. A single user’s IO-intensive work can degrade global filesystem performance or even make the filesystem unresponsive. In this project, we developed an innovative IO workload management tool that controls IO workload on the user application side. The tool automatically detects and throttles intensive IO workloads caused by supercomputer users to protect parallel shared filesystems.
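The abstract does not detail OOOPS's internals beyond saying it throttles IO on the application side, so the sketch below only illustrates the general idea of client-side throttling, using a token-bucket limiter in Python. The class names, rates, and file-wrapper approach are illustrative assumptions, not OOOPS's API.
```python
import time

class TokenBucket:
    """Byte-budget rate limiter: the bucket refills at rate bytes/sec, and a
    write larger than the current budget blocks until enough has refilled."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def throttle(self, nbytes):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

class ThrottledWriter:
    """Wraps a binary output file so every write passes through the limiter."""

    def __init__(self, path, bucket):
        self._f = open(path, "wb")
        self._bucket = bucket

    def write(self, data):
        self._bucket.throttle(len(data))
        return self._f.write(data)

    def close(self):
        self._f.close()

# Cap this process's writes at 50 MiB/s with a 10 MiB burst allowance.
bucket = TokenBucket(rate_bytes_per_s=50 * 2**20, burst_bytes=10 * 2**20)
out = ThrottledWriter("/tmp/demo.bin", bucket)
out.write(b"x" * 2**20)   # a 1 MiB write, within the burst budget
out.close()
```
A production tool would apply such a limit transparently, for example by interposing on IO calls, rather than asking users to change their code.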
09:30-10:00 - CView and NWPerf for supercomputer performance collection and display
- Presented by Evan Felix - Pacific Northwest National Laboratory
- ABSTRACT
CView is a full 3D interactive graphics tool for displaying collected system and application performance data. Data is collected using various open source tools common on large clusters, such as ganglia, collectl, or collectd. On large-scale systems with thousands to tens of thousands of nodes, this data set grows quite large over time. For example, on the Cascade system at the Environmental Molecular Sciences Laboratory at PNNL, one day’s collection of performance data, stored as raw floats, is about 2.5 gigabytes. The CView graphics environment allows a user to easily visualize such data sets, letting system administrators ‘see’ how the system is performing from near-real-time data streams. Administrators and users employ it to visualize the performance of the whole system or of just the nodes involved in a specific job run. The systems at PNNL use the NWPerf software, backed by a Ceph object store, to route and store the performance data from our clusters. Stored this way, the data can be displayed from any point in time, and various metrics can be investigated at the system or job level.
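To see how a figure like 2.5 gigabytes per day arises, the back-of-the-envelope calculation below multiplies out node count, metrics per node, sampling rate, and sample size. The specific numbers are assumptions chosen for illustration, not Cascade's actual configuration.
```python
# Rough size of one day of raw-float performance samples.
# All parameters are illustrative assumptions, not Cascade's real setup.
nodes = 1400              # compute nodes being sampled
metrics_per_node = 100    # CPU, memory, network, filesystem counters, ...
sample_interval_s = 20    # one sample per metric every 20 seconds
bytes_per_sample = 4      # a raw 32-bit float

samples_per_day = nodes * metrics_per_node * (86400 // sample_interval_s)
daily_bytes = samples_per_day * bytes_per_sample
print(f"{daily_bytes / 2**30:.2f} GiB/day")  # ~2.25 GiB with these numbers
```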
10:00-10:30
- Morning Coffee Break
10:30-11:10 - ReFrame: A Regression Testing and Continuous Integration Framework for HPC systems
- Presented by Dr. Vasileios Karakasis - CSCS
- ABSTRACT
Regression testing of HPC systems is of crucial importance when it comes to ensuring the quality of service offered to the end users. At the same time, it poses a great challenge to systems and application engineers to continuously maintain regression tests that cover as many aspects of the user experience as possible. In this presentation, we introduce ReFrame, a new framework for writing regression tests for HPC systems. ReFrame is designed to abstract away the complexity of the interactions with the system and to separate the logic of a regression test from the low-level details that pertain to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework loads the test and sends it down a well-defined pipeline that takes care of its execution. All the system interaction details, such as programming environment switching, compilation, job submission, job status query, sanity checking, and performance assessment, are performed by the different pipeline stages. Thanks to its high-level abstractions and modular design, ReFrame can also serve as a tool for continuous integration (CI) of scientific software, complementary to other well-known CI solutions. Finally, we present the use cases of two large HPC centers that have adopted or are now adopting ReFrame for regression testing of their computing facilities.
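As a flavor of what such a test looks like, here is a minimal check patterned on ReFrame's public tutorial examples. Decorator and attribute names have shifted between ReFrame releases, so treat this as a sketch rather than version-exact code.
```python
import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class HelloWorldTest(rfm.RegressionTest):
    def __init__(self):
        self.descr = 'Compile and run a hello-world check'
        self.valid_systems = ['*']        # any configured system
        self.valid_prog_environs = ['*']  # any programming environment
        self.sourcepath = 'hello.c'       # compiled by the framework
        # Sanity checking is one pipeline stage: assert the expected output.
        self.sanity_patterns = sn.assert_found(r'Hello, World!', self.stdout)
```
Note that the framework, not the test, handles the pipeline stages named above: it switches programming environments, compiles hello.c, submits the job, polls its status, and finally evaluates the sanity pattern.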
11:10-11:40 - A Compiler and Profiler Based Tool for Querying HPC Application Characteristics
- Presented by Aaron Welch - Oak Ridge National Laboratory
- ABSTRACT
Emerging HPC platforms are becoming extraordinarily difficult to program as a result of complex, deep, and heterogeneous memory hierarchies, heterogeneous cores, and the need to divide work and data among them. New programming models and libraries are being developed to aid application developers in their porting efforts, but substantial code restructuring is still necessary to make full use of these new technologies. To do this effectively, developers need information about their source code's characteristics, both static and dynamic (e.g., performance data), to direct their optimisation efforts and make key decisions. System administrators, in turn, need to understand how users exercise the software stack and resources on their systems in order to decide what software to provide to improve user productivity on a platform.
In this presentation we describe a tool that combines compiler and profiler information to query performance characteristics within a single application or across multiple applications on a given platform. Static and dynamic data about applications are collected and stored together in an SQL database that can be queried by either a developer or a system administrator. We will demonstrate the tool's capabilities via a real-world, application-driven case study aimed at understanding the use of scientific libraries in a routine from the molecular simulation application CP2K.
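The presentation does not spell out the database schema, so the sketch below only illustrates the kind of question such a combined compiler/profiler database can answer. The table layout, routine names, and timing numbers are invented placeholders, not data from the actual tool.
```python
import sqlite3

# Hypothetical schema: one row per (application, caller, library routine)
# profile record; all names and numbers below are illustrative placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE profile (
                    app TEXT, caller TEXT, library_routine TEXT,
                    calls INTEGER, total_time_s REAL)""")
conn.executemany(
    "INSERT INTO profile VALUES (?, ?, ?, ?, ?)",
    [("CP2K",   "dbcsr_multiply", "dgemm",        120456, 83.2),
     ("CP2K",   "fft3d",          "fftw_execute",   9934, 41.7),
     ("LAMMPS", "pair_compute",   "dgemm",            88,  0.4)])

# Which scientific-library routines dominate CP2K's runtime?
query = """SELECT library_routine, SUM(calls), SUM(total_time_s)
           FROM profile WHERE app = 'CP2K'
           GROUP BY library_routine ORDER BY SUM(total_time_s) DESC"""
for routine, calls, seconds in conn.execute(query):
    print(f"{routine:15s} {calls:8d} calls {seconds:8.1f} s")
```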
11:40-12:20 - ColdFront: An Open Source HPC Resource Allocation System
- Presented by Dr. Mohammed Zia - State University of New York at Buffalo
- ABSTRACT
ColdFront is an open source resource allocation system designed to provide a central portal for administering HPC resources and collecting return on investment (ROI) metrics. ColdFront was created to help HPC centers manage access to center resources across large groups of users and provide a rich set of data for comprehensive reporting of ROI metrics such as user publications and external funding. ColdFront is written in Python and released under the GPLv3 license. This presentation will include an overview of ColdFront and a live demo of its installation and use.
12:20-12:30 - HUST 2018
- Presented by Chris Bording - IBM Research
- Discussion
- Review Survey results as time permits.
Dates
Important Dates
Submissions EXTENDED: September 10
Workshop paper reviews due: September 21
Acceptance notifications: October 2
Camera-ready papers: October 12
Workshop: Sunday, November 11 at SC18
Committees
Organizing Committee
- Chris Bording, IBM Research, Hartree Centre, United Kingdom
- Elsa Gonsiorowski, Lawrence Livermore National Laboratory, USA
- Olli-Pekka Lehto, Jump Trading, LLC, Singapore
General Chair
- Chris Bording, IBM Research, Hartree Centre, United Kingdom
Program Committee Chairs
- Elsa Gonsiorowski, Lawrence Livermore National Laboratory, USA
- Olli-Pekka Lehto, Jump Trading, LLC, Singapore
Program Committee
- Daniel Ahlin, PDC HPC Center, KTH Royal Institute of Technology, Sweden
- David E. Bernholdt, Oak Ridge National Laboratory, USA
- Fabrice Cantos, National Institute of Water and Atmospheric Research, New Zealand
- Erik Engquist, Rice University, USA
- Christopher Harris, Pawsey Supercomputing Centre, Australia
- Paul Kolano, NASA, USA
- John C. Linford, ARM, USA
- Robert McLay, TACC, The University of Texas at Austin, USA
- David Montoya, Los Alamos National Laboratory, USA
- Randy Schauer, Raytheon Company, USA
- Alan D Simpson, Edinburgh Parallel Computing Centre, United Kingdom
- Karen Tomko, Ohio Supercomputing Center, USA
- Jianwen Wei, Shanghai Jiao Tong University, China