Digital Twin Tool and Platforms

Digital Twin Tool and Platforms aims to develop the digital technologies required to operate the tools developed under Modeling the Joint Management of Energy and Water and Socio-Economic and Environmental Aspects. It aims at incorporating, into a holistic way, climate change’s underlying drivers (e.g., energy use, water availability, mobility) from a detailed multi-sectoral data acquisition including critical urban infrastructures, such as energy, water, buildings and mobility, to create a multi-scale virtual or digital twin running on a large-scale IT infrastructure or datacenter. This digital twin will enable the analysis of data and monitoring of urban systems to head off problems before they even occur, prevent downtime, develop new opportunities and planning strategies for the future. Based on an in-depth exploration of co-design of IT infrastructure to execute efficiently advanced numerical models, and using real-time field data, this new digital twin technology can reveal complex patterns of optimization for highly complex urban systems.

This solution will valorise a large-scale computing infrastructure at the EPFL. This complete computing infrastructure at EPFL includes in the DC design 20k servers. The server and storage power consumption is within a total between 10 to 15 kW. This complete modeling infrastructure enables modeling and development of complex management schemes and power behaviors of large-scale computing infrastructures. Also, we include a complete three-layered tree network topology model of a ToR type, aggregation-layer switches. This framework has been used for the development of virtual twins of the digital infrastructure of Credit Suisse including more than 7000 servers in three distributed datacenters in Switzerland.

Tasks included are:

Design a configurable digital twin framework to support the multi-scale models to scale in computational complexity, and consistent model updates and interoperability across scales.
Develop a sustainability-oriented methodology for balancing data acquisition with consumed energy on edge AI sensors to update the state of the virtual twin.
Implement learning approaches for a single or multiple IT infrastructure locations to develop edge-to-cloud solutions for model order reduction and digital twin updates.
Develop and implement tailored learning techniques that simultaneously preserve performance and privacy.
Develop approaches that will protect the integrity of edge sensing, and the integrity and confidentiality of raw data and computation in the cloud.
Design and develop smart IoT sensing and edge computing technologies optimized for cost and energy efficiency

Research Partners

Publications

Ansari, A., Lin, S., Chakraborty, A., Eryilmaz, B., Alian, M., Falsafi, B., & Ferdman, M. (2024). Silicon efficiency in post-Moore servers. Workshop on Hot Topics in Ethical Computer Systems.

Abstract:

Server CPUs in the cloud have inherited their core microarchitecture from the desktop and mobile world, with per- formance primarily measured by single-core IPC. Furthermore, cores are integrated with large cache hierarchies within sockets and rely heavily on these caches to contain chip power envelopes, with little consideration given to utilization by workloads. Wasted silicon impacts both operational and embodied emissions in server platforms. In this work, we measure and compare silicon efficiency measured in performance per area and performance per watt of online and analytic services running on two x86 and an ARM server. We show that while x86 platforms offer higher single-core performance, the ARM server has the potential to achieve up to 2.5× higher socket-level performance per area and performance per watt than the x86 servers in the absence of system-level bottlenecks (e.g., memory or network bandwidth).

Link to full paper:

https://hotethics.github.io/papers/Ansari_Hotethics24.pdf

Baghersalimi, S., Amirshahi, A., Forooghifar, F., Teijeiro, T., Aminifar, A., & Atienza, D. (2024). M2SKD: Multi-to-single knowledge distillation of real-time epileptic seizure detection for low-power wearable systems. ACM Transactions on Intelligent Systems and Technology (TIST), 20(6), 1–17.

Abstract:

The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of Accelerated Processing Units (APUs), processors that incorporate both a central processing unit (CPU) and an integrated graphics processing unit (iGPU).

However, the performance of both APU and CPU systems can be significantly hampered by address translation overhead, leading to a decline in overall performance, especially for cache-resident workloads. To address this issue, we propose the introduction of a new intermediate address space (IAS) in both APU and CPU systems. IAS serves as a bridge between virtual address (VA) spaces and physical address (PA) spaces, optimizing the address translation process. In the case of APU systems, our research indicates that the iGPU suffers from significant translation look-aside buffer (TLB) misses in certain workload situations. Using an IAS, we can divide the initial address translation into front- and back-end phases, effectively shifting the bottleneck in address translation from the cache side to the memory controller side, a technique that proves to be effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in the CPU system can boost performance by up to 40% compared to conventional CPU systems. Furthermore, we evaluate the effectiveness of APU systems, comparing the performance of IAS-based systems with traditional systems, showing up to a 185% improvement in APU system performance with our proposed IAS implementation.

Furthermore, our analysis indicates that over 90% of TLB misses can be filtered by the cache, and employing a larger cache within the system could potentially result in even greater improvements. The proposed IAS offers a promising and practical solution to enhance the performance of both APU and CPU systems, contributing to state-of-the-art research in the field of computer architecture.

DOI link:

https://doi.org/10.1145/3659207

Constantinescu, D. A., & Atienza Alonso, D. (2024). Neuro-inspired edge AI architectures for distributed federated learning. Presented at the EMERALD 1st Workshop on Distributed Computing with Emerging Hardware Technology.

Abstract:

Edge computing is becoming an essential concept covering multiple domains nowadays as our world becomes increasingly connected to enable the smart world concept. In addition, the new wave of Artificial Intelligence (AI), particularly complex Machine Learning (ML) and Deep Learning (DL) models, is driving the need for new computing paradigms and edge AI architectures beyond traditional general-purpose computing to make viable a sustainable smart world.
In this presentation, Dr. Constantinescu will discuss the potential benefits and challenges of using emerging edge AI hardware architectures for distributed Federated Learning (FL) in the biomedical domain. These novel computing architectures take inspiration from how the brain processes incoming information and adapts to changing conditions. First, it exploits the idea of accepting computing inexactness at the system level while integrating multiple computing accelerators (such as in-memory computing or coarse-grained reconfigurable accelerators). Second, these edge AI architectures can operate ensembles of neural networks to improve the ML/DL outputs' robustness at the system level while minimizing memory and computation resources for the target final application. These two concepts have enabled the development of the open-source eXtended and Heterogeneous Energy-Efficient Hardware Platform (X-HEEP). X-HEEP will be showcased as a means for developing new edge AI and distributed FL systems for personalized healthcare.

Presentation link:

https://infoscience.epfl.ch/server/api/core/bitstreams/989ccbe2-3425-4f5c-a3b1-594f355def87/content

Constantinescu, D. A., Huang, D., & Atienza Alonso, D. (2024). Scaling sustainable computing: Advanced ML techniques for enhancing the efficiency in the SKA regional data centers. Presented at Swiss SKA Days 2024, Campus Biotech, Geneva, Switzerland.

Abstract:

The SKA Regional Centres Network (SRCNet) of globally distributed data centers will have to find innovative ways to store and disseminate some 600Pbytes/year of incoming data products from SKAO while providing the necessary federated computing resources to the research community to analyze it. As the amount SKAO data continues to grow over its 50+ years lifespan, the salability of SRCNet's capabilities to cater data and computing services to the scientific community will be critical to the meet SKAO’s Science and Sustainability Goals.

In this talk, we will present new ML-based techniques to exploit trade-offs between execution time and resource utilization toward more scalable and sustainable cloud systems. These techniques address the challenge of accurate performance prediction and run-time optimization of resources in servers and data centers for maximizing energy efficiency. In particular, CloudProphet will be presented as a robust ML-based method that identifies each type of running application in VMs and predicts the performance level of concurrently executed cloud applications to optimize the use of resources in the data center. Then, GreenDVFS uses the inputs of CloudProphet to successfully determine the best tuning points for the DVFS knobs in cloud servers and racks to maximize energy efficiency under performance constraints. Finally, we can combine the results of GreenDVFS with novel renewable energy sources for multi-location public clouds and infrastructures through a new workflow with a novel Green-performance degradation manager called ECOGreen.

This new ML-based flow effectively exploits trade-offs between execution time and resource utilization toward more scalable and sustainable cloud systems (20% less energy, 35% less temperature, and 48% increase of renewables). These techniques, together with the development of novel hardware architectures designed specifically for energy efficiency in the radio-interferometry domain, have the potential to help achieve the sustainability goals of SKAO while guaranteeing QoS for its users.

Presentation link:

https://indico.skatelescope.org/event/1146/contributions/10823/attachments/9899/17446/2024_SKACH-Days-Sept-4-Denisa.pdf

Constantinescu, D. A., Kartsch, V., Nakatsuka, Y., Wiese, P., Orbanovik, P., Baghersalimi, S., Abdollahinejad, G., Rodríguez Álvarez, R., Ansari, A., Shevchik, S., Ouvrard, X. E., Peon Quiros, M., Shevchik, S., Hoffmann, P., Capkun, S., Hoefler, T., Falsafi, B., Benini, L., & Atienza, D. (2024). UrbanTwin: An urban digital twin for climate action. Edge-to-cloud and multi-scale digital twin technologies for data gathering and application-driven optimization of resources. EcoCloud Annual Event on IT Sustainability, Lausanne.

Frey, S., et al. (2024). GAPses: Versatile smart glasses for comfortable and fully-dry acquisition and parallel ultra-low-power processing of EEG and EOG. IEEE Transactions on Biomedical Circuits and Systems.

Abstract:

Recent advancements in head-mounted wearable technology are revolutionizing the field of biopotential measurement, but the integration of these technologies into practical, user-friendly devices remains challenging due to issues with design intrusiveness, comfort, reliability, and data privacy. To address these challenges, this paper presents GAPSES, a novel smart glasses platform designed for unobtrusive, comfortable, and secure acquisition and processing of electroencephalography (EEG) and electrooculography (EOG) signals.We introduce a direct electrode-electronics interface within a sleek frame design, with custom fully dry soft electrodes to enhance comfort for long wear. The fully assembled glasses, including electronics, weigh 40 g and have a compact size of 160 mm × 145 mm. An integrated parallel ultra-low-power RISC-V processor (GAP9, Greenwaves Technologies) processes data at the edge, thereby eliminating the need for continuous data streaming through a wireless link, enhancing privacy, and increasing system reliability in adverse channel conditions. We demonstrate the broad applicability of the designed prototype through validation in a number of EEG-based interaction tasks, including alpha waves, steady-state visual evoked potential analysis, and motor movement classification. Furthermore, we demonstrate an EEG-based biometric subject recognition task, where we reach a sensitivity and specificity of 98.87% and 99.86% respectively, with only 8 EEG channels and an energy consumption per inference on the edge as low as 121 μJ. Moreover, in an EOG-based eye movement classification task, we reach an accuracy of 96.68% on 11 classes, resulting in an information transfer rate of 94.78 bit/min, which can be further increased to 161.43 bit/min by reducing the accuracy to 81.43%. The deployed implementation has an energy consumption of 40 μJ per inference and a total system power of only 12.4 mW, of which only 1.61% is used for classification, allowing for continuous operation of more than 22 h with a small 75 mAh battery.

DOI link:

10.1109/TBCAS.2024.3478798

Ingolfsson, T. M., et al. (in review). VowelNet: Enhancing communication with wearable EEG-based vowel imagery. Presented at BioCAS 2024, Xi’an, China. Pending final publishing.

Liu, Q., Huang, D., Costero, L., Zapater, M., & Atienza, D. (2024). Intermediate address space: Virtual memory optimization of heterogeneous architectures for cache-resident workloads. ACM Transactions on Architecture and Code Optimization, 20(4).

Abstract:

DOI link:

https://doi.org/10.1145/3659207

Mei, L., et al. (2024). An ultra-low power wearable BMI system with continual learning capabilities. IEEE Transactions on Biomedical Circuits and Systems.

Abstract:

Driven by the progress in efficient embedded processing, there is an accelerating trend toward running machine learning models directly on wearable Brain-Machine Interfaces (BMIs) to improve portability and privacy and maximize battery life. However, achieving low latency and high classification performance remains challenging due to the inherent variability of electroencephalographic (EEG) signals across sessions and the limited onboard resources. This work proposes a comprehensive BMI workflow based on a CNN-based Continual Learning (CL) framework, allowing the system to adapt to inter-session changes. The workflow is deployed on a wearable, parallel ultra-low power BMI platform (BioGAP). Our results based on two in-house datasets, Dataset A and Dataset B, show that the CL workflow improves average accuracy by up to 30.36% and 10.17%, respectively. Furthermore, when implementing the continual learning on a Parallel Ultra-Low Power (PULP) microcontroller (GAP9), it achieves an energy consumption as low as 0.45mJ per inference and an adaptation time of only 21.5 ms, yielding around 25 h of battery life with a small 100 mAh, 3.7 V battery on BioGAP. Our setup, coupled with the compact CNN model and on-device CL capabilities, meets users’ needs for improved privacy, reduced latency, and enhanced inter-session performance, offering good promise for smart embedded real-world BMIs.

DOI link:

https://doi.org/10.1109/TBCAS.2024.3457522

Mei, L., et al. (2024). Train-On-Request: An on-device continual learning workflow for adaptive real-world brain machine interfaces. Presented at BioCAS 2024, Xi’an, China.

Abstract:

Brain-machine interfaces (BMIs) are expanding beyond clinical settings thanks to advances in hardware and algorithms. However, they still face challenges in user-friendliness and signal variability. Classification models need periodic adaptation for real-life use, making an optimal re-training strategy essential to maximize user acceptance and maintain high performance. We propose TOR, a train-on-request workflow that enables user-specific model adaptation to novel conditions, addressing signal variability over time. Using continual learning, TOR preserves knowledge across sessions and mitigates inter-session variability. With TOR, users can refine, on demand, the model through on-device learning (ODL) to enhance accuracy adapting to changing conditions. We evaluate the proposed methodology on a motor-movement dataset recorded with a non-stigmatizing wearable BMI headband, achieving up to 92% accuracy and a re-calibration time as low as 1.6 minutes, a 46% reduction compared to a naive transfer learning workflow. We additionally demonstrate that TOR is suitable for ODL in extreme edge settings by deploying the training procedure on a RISC-V ultra-low-power SoC (GAP9), resulting in 21.6 ms of latency and 1 mJ of energy consumption per training step. To the best of our knowledge, this work is the first demonstration of an online, energy-efficient, dynamic adaptation of a BMI model to the intrinsic variability of EEG signals in real-time settings.

Pre-print link:

https://www.arxiv.org/abs/2409.09161

Okanovic, P., Kirsch, A., Kasper, J., Hoefler, T., Krause, A., & Gürel, N. M. (2024). All models are wrong, some are useful: Model selection with limited labels.

Abstract:

We introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers. Given a pool of unlabeled target data, MODEL SELECTOR samples a small subset of highly informative examples for labeling, in order to efficiently identify the best pretrained model for deployment on this target dataset. Through extensive experiments, we demonstrate that MODEL SELECTOR drastically reduces the need for labeled data while consistently picking the best or near-best performing model. Across 18 model collections on 16 different datasets, comprising over 1,500 pretrained models, MODEL SELECTOR reduces the labeling cost by up to 94.15% to identify the best model compared to the cost of the strongest baseline. Our results further highlight the robustness of MODEL SELECTOR in model selection, as it reduces the labeling cost by up to 72.41% when selecting a near-best model, whose accuracy is only within 1% of the best model.

Pre-print link:

https://arxiv.org/abs/2410.13609

Okanovic, P., Kwasniewski, G., Labini, P. S., Besta, M., Vella, F., & Hoefler, T. (2024). High performance unstructured SpMM computation using tensor cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1–14.

Abstract:

High-performance sparse matrix–matrix (SpMM) multiplication is paramount for science and industry, as the ever- increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block- sparse library leverages the low-level CUDA MMA (matrix- matrix-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations such as sparse matrix permutation, further improve performance by minimizing the number of non-zero blocks. The evaluation on NVIDIA A100 shows that SMaT outperforms SotA libraries (DASP, cuSPARSE, and Magicube) by up to 125x (on average 2.6x). SMaT can be used to accelerate many workloads in scientific computing, large model training, inference, and others.

Full paper link:

https://spcl.inf.ethz.ch/Publications/.pdf/okanovic-sc24-high.pdf

Spacone, G., et al. (2024). Tracking of wrist and hand kinematics with ultra-low power wearable A-mode ultrasound. IEEE Transactions on Biomedical Circuits and Systems.

Abstract:

Ultrasound-based Hand Gesture Recognition has gained significant attention in recent years. While static gesture recognition has been extensively explored, only a few works have tackled the task of movement regression for real-time tracking, despite its importance for the development of natural and smooth interaction strategies. In this paper, we demonstrate the regression of 3 hand-wrist Degrees of Freedom (DoFs) using a lightweight, A-mode-based, truly wearable US armband featuring four transducers and WULPUS, an ultra-low-power acquisition device. We collect US data, synchronized with an optical motion capture system to establish a ground truth, from 5 subjects. We achieve state-of-the-art performance with an average root-mean-squared-error (RMSE) of 7.32◦ ± 1.97◦ and mean-absolute-error (MAE) of 5.31◦ ± 1.42◦. Additionally, we demonstrate, for the first time, robustness with respect to transducer repositioning between acquisition sessions, achieving an average RMSE value of 11.11◦ ± 4.14◦ and a MAE of 8.46◦ ± 3.58◦. Finally, we deploy our pipeline on a real-time low-power microcontroller, showcasing the first instance of multi-DoF regression based on A-mode US data on an embedded device, with a power consumption lower than 30mW and end-to-end latency of ≈ 80 ms.

DOI link:

10.1109/TBCAS.2024.3465239

Taji, H., Miranda Calero, J. A., Peon Quiros, M., & Atienza Alonso, D. (2024). Energy-efficient frequency selection method for bio-signal acquisition in AI/ML wearables. ISLPED'24: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design.

Abstract:

In wearable sensors, energy efficiency is crucial, particularly during phases where devices are not processing, but rather acquiring biosignals for subsequent analysis. This study focuses on improving the power consumption of wearables during these acquisition phases, a critical but often overlooked aspect that substantially affects overall device energy consumption, especially in low-duty-cycle applications. Our approach optimizes power consumption by leveraging application-specific requirements (e.g., required signal profile), platform characteristics (e.g., transition-time overhead for the clock generators and power-gating capabilities), and analog biosignal front-end specifications (e.g., ADC buffer sizes). We refine the strategy for switching between low-power idle and active states for the storage of acquired data, introducing a novel method to select optimal frequencies for these states. Based on several case studies on an ultra-low power platform and different biomedical applications, our optimization methodology achieves substantial energy savings. For example, in a 12-lead heartbeat classification task, our method reduces total energy consumption by up to 58% compared to state-of-the-art methods. This research provides a theoretical basis for frequency optimization and practical insights, including characterizing the platform's power and overheads for optimization purposes. Our findings significantly improve energy efficiency during the acquisition phase of wearable devices, thus extending their operational lifespan.

DOI link:

https://doi.org/10.1145/3665314.3670815

Tirelli, C., Sapriza, J., Rodríguez Álvarez, R., Ferretti, L., Denkinger, B., Ansaloni, G., Miranda, J., Atienza, D., & Pozzi, L. (2024). SAT-based exact modulo scheduling mapping for resource-constrained CGRAs. ACM Journal on Emerging Technologies in Computing Systems (JETC), 20(6), 1–22.

Abstract:

Coarse-Grain Reconfigurable Arrays (CGRAs) represent emerging low-power architectures designed to accelerate Compute-Intensive Loops (CILs). The effectiveness of CGRAs in providing acceleration relies on the quality of mapping: how efficiently the CIL is compiled onto the platform. State-of-the-Art (SoA) compilation techniques utilize modulo scheduling to minimize the Iteration Interval (II) and use graph algorithms like Max-Clique Enumeration to address mapping challenges. Our work approaches the mapping problem through a satisfiability (SAT) formulation. We introduce the Kernel Mobility Schedule (KMS), an ad hoc schedule used with the Data Flow Graph and CGRA architectural information to generate Boolean statements that, when satisfied, yield a valid mapping. Experimental results demonstrate SAT-MapIt outperforming SoA alternatives in almost 50% of explored benchmarks. Additionally, we evaluated the mapping results in a synthesizable CGRA design and emphasized the runtime metrics trends, i.e., energy efficiency and latency, across different CILs and CGRA sizes. We show that a hardware-agnostic analysis performed on compiler-level metrics can optimally prune the architectural design space, while still retaining Pareto-optimal configurations. Moreover, by exploring how implementation details impact cost and performance on real hardware, we highlight the importance of holistic software-to-hardware mapping flows, as the one presented herein.

DOI link:

https://doi.org/10.1145/3663675

Research Partners

David Atienza (WP3 lead)

Luca Benini

Patrik Hoffmann

Babak Falsafi

Torsten Hoefler

Fabio Nobile

Srdjan Capkun

Publications