In my role as a Senior OpenShift Technical Account Manager at Red Hat, I focus on mission-critical stability, helping organisations navigate the shift from cloud-native architectures to AI-ready operations. But there is a distinct difference between advising on a scalable MLOps workflow and trusting a local LLM to trade your own capital in a volatile market.
Would you trust an AI agent with your bank account? I did, and it was a masterclass in ‘Boom or Bust’ logic.
Over the past couple of weeks I’ve been wrestling with building vLLM (with CUDA support) on Fedora 42. Here’s the short version of what went wrong:-
Python version confusion
My virtualenv was pointing at Python 3.11 but CMake kept complaining it couldn’t find “python3.11.”
Fix: explicitly passed -DPYTHON_EXECUTABLE=$(which python) to CMake, which got past the Python lookup errors.
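For reference, the shape of the fix (the venv path is illustrative — use wherever you created yours):-

```shell
# Activate the venv first so $(which python) resolves to its interpreter
# (the venv path is an assumption -- substitute your own).
source ~/venvs/vllm/bin/activate

# Tell CMake exactly which interpreter to use, rather than letting it
# search for a "python3.11" binary on its own:
cmake -DPYTHON_EXECUTABLE="$(which python)" .
```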
CUDA toolkit headers/libs not found
Although Fedora’s CUDA 12.9 RPMs were installed, CMake couldn’t locate CUDA_INCLUDE_DIRS or CUDA_CUDART_LIBRARY.
Fix: set CUDA_HOME=/usr/local/cuda-12.9 and passed -DCUDA_TOOLKIT_ROOT_DIR & -DCUDA_SDK_ROOT_DIR to CMake.
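In shell terms, roughly the following (paths match Fedora’s CUDA 12.9 RPM layout; adding nvcc to PATH is an extra convenience rather than strictly required):-

```shell
# Point everything at the RPM-installed toolkit:
export CUDA_HOME=/usr/local/cuda-12.9
export PATH="$CUDA_HOME/bin:$PATH"

cmake -DCUDA_TOOLKIT_ROOT_DIR="$CUDA_HOME" \
      -DCUDA_SDK_ROOT_DIR="$CUDA_HOME" .
```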
cuDNN import errors
Pip’s PyTorch import of libcudnn.so.9 failed during the vllm build.
Fix: reinstalled torch via the official PyTorch cu121 wheel index so that all the nvidia-cudnn-cu12 wheels were in place.
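The reinstall itself (the index URL is PyTorch’s official cu121 wheel index):-

```shell
# Remove the mismatched build, then pull torch plus its matching
# nvidia-cudnn-cu12 dependency wheels from the cu121 index:
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu121
```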
GCC / Clang version mismatches
CUDA 12.9’s nvcc choked on GCC 15 (“unsupported GNU version”) and later on Clang 20.
I tried installing gcc-14 and symlinking it into PATH, exporting CC=/usr/bin/gcc-14 and CXX=/usr/bin/g++-14, and even passing -DCMAKE_CUDA_HOST_COMPILER, but CMake’s CUDA-ID test was still failing on the Fedora header mismatch.
Ultimately we switched to Clang 20 with --allow-unsupported-compiler, which let us get past the version “block.”
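Roughly what that looked like (binary paths are illustrative — Fedora may install clang without a version suffix):-

```shell
# Use Clang as both the host compiler and the CUDA host compiler, and
# tell nvcc to stop rejecting it on version grounds:
export CC=/usr/bin/clang CXX=/usr/bin/clang++
cmake -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/clang++ \
      -DCMAKE_CUDA_FLAGS="--allow-unsupported-compiler" .
```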
Math header noexcept conflicts
CMake’s nvcc identification build then ran into four “expected a declaration” errors in CUDA’s math_functions.h, caused by mismatched noexcept(true) on sinpi/cospi vs system headers.
I patched those lines (removing or adding, I forget which, the trailing noexcept(true)) so cudafe++ could preprocess happily.
Missing NVToolsExt library
After all that, CMake could find CUDA and compile, but hit:
The link interface of target "torch::nvtoolsext" contains: CUDA::nvToolsExt but the target was not found.
Looking under /usr/local/cuda-12.9, there was no libnvToolsExt.so* at all—only the NVTX‐3 interop helper (libnvtx3interop.so*) lived under the extracted toolkit tree.
Current hurdle: I still don’t have the core NVTX library (libnvToolsExt.so.*) in /usr/local/cuda-12.9/…/lib, so the CMake target CUDA::nvToolsExt remains unavailable. The library appears to be missing from both the Fedora cuda-nvtx package and the NVIDIA nvtx toolkit download/runfile; this appears to be a known issue with recent versions.
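A quick way to confirm the gap (the output described in the comments is what I see on this box):-

```shell
# The classic NVTX runtime library is absent...
find /usr/local/cuda-12.9 -name 'libnvToolsExt*' 2>/dev/null
# ...while only the NVTX-3 interop helper turns up:
find /usr/local/cuda-12.9 -name 'libnvtx3interop*' 2>/dev/null
```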
Work continues and a full process will be documented, once successful.
Discover how to leverage the power of kcli and libvirt to rapidly deploy a full OpenShift cluster in under 30 minutes, cutting through the complexity often associated with OpenShift installations.
Prerequisites
Server with 8+ cores, minimum of 64GB RAM (96GB+ for more than one worker node)
Fast IO: dedicated NVMe libvirt storage, or NVMe LVMCache fronting HDD (surprisingly effective!)
OS installed (tested with CentOS Stream 8)
Packages: libvirt + git installed
Pull secret (store in openshift_pull.json) obtained from https://cloud.redhat.com/openshift/install/pull-secret
Install kcli
[steve@shift ~]$ git clone https://github.com/karmab/kcli.git
[steve@shift ~]$ cd kcli; ./install.sh
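With kcli installed, a deployment is driven from a parameter file. The sketch below is illustrative — the cluster name and values are examples, and you should check `kcli create cluster openshift -h` for the exact syntax on your version:-

```shell
# Illustrative parameter file; values sized for a 3 control-plane,
# 3 worker cluster (names/values are assumptions):
cat > ocp413.yml <<'EOF'
cluster: ocp413
ctlplanes: 3
workers: 3
EOF

# Then deploy (requires kcli, libvirt and the pull secret in place):
# kcli create cluster openshift --paramfile ocp413.yml ocp413
```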
Note 1: To deploy Single Node OpenShift (SNO), set ctlplanes to 1 and workers to 0.
Note 2: Even a fast Xeon with NVMe storage may have difficulty deploying more than 3 workers before the installer times out. An RFE exists to make the timeout configurable, see:
Once the cluster is deployed you’ll receive the following message:-
INFO Waiting up to 40m0s (until 3:42PM) for the cluster at https://api.ocp413.lab.local:6443 to initialize...
INFO Checking to see if there is a route at openshift-console/console...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/root/.kcli/clusters/ocp413/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.ocp413.lab.local
INFO Login to the console with user: "kubeadmin", and password: "qTT5W-F5Cjz-BIPx2-KWXQx"
INFO Time elapsed: 16m18s
Deleting ocp413-bootstrap
Note: Whilst the above credentials can be retrieved later, it’s worthwhile making a note of them now. I save them to a text file on the host.
Confirm Status
[root@shift ~]# export KUBECONFIG=/root/.kcli/clusters/ocp413/auth/kubeconfig
[root@lab ~]# oc status
In project default on server https://api.ocp413.lab.local:6443
svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 172.30.0.1:443 -> 6443
View details with 'oc describe <resource>/<name>' or list resources with 'oc get all'.
[root@shift ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
ocp413-ctlplane-0.lab.local Ready control-plane,master 68m v1.26.5+7d22122
ocp413-ctlplane-1.lab.local Ready control-plane,master 68m v1.26.5+7d22122
ocp413-ctlplane-2.lab.local Ready control-plane,master 68m v1.26.5+7d22122
ocp413-worker-0.lab.local Ready worker 51m v1.26.5+7d22122
ocp413-worker-1.lab.local Ready worker 51m v1.26.5+7d22122
ocp413-worker-2.lab.local Ready worker 52m v1.26.5+7d22122
[root@shift ~]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.4 True False 42m Cluster version is 4.13.4
Note: If the cluster is not installed on your workstation, it may be easier to install a browser on the server and forward X connections, rather than maintaining a local hosts file or modifying local DNS to catch and resolve local cluster queries:
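For example (the username, hostname and browser choice are illustrative):-

```shell
# X-forward a browser from the server, which can already resolve the
# cluster's DNS names:
ssh -X steve@shift firefox \
    https://console-openshift-console.apps.ocp413.lab.local
```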
Currently, the ‘Use CPU if no CUDA device detected’ [1] pull request has not been merged. Following the instructions at [2] and jumping down the dependency rabbit hole, I finally have Stable Diffusion running on an old dual Xeon server.
Notes: 1) Typically only 18 of the 32 cores are active, regardless of render size. 2) As expected, the calculation is entirely CPU bound. 3) For an unknown reason, even with --n_samples and --n_rows set to 1, two images were still created (time halved for a single image in the above table).
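For reference, the kind of invocation used (flags are as I recall them from the CompVis scripts/txt2img.py and may differ between forks):-

```shell
# Single 512x512 sample; even with --n_samples and --n_rows at 1,
# two images were produced on my build:
python scripts/txt2img.py --prompt "a photograph of a cat" \
    --H 512 --W 512 --n_samples 1 --n_iter 1 --n_rows 1 --plms
```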
Another CPU Rendered Cat 512×512
Conclusion:
It works. We gain resolution at the huge expense of memory and time.
I recently purchased AmigaOS 4.1 with a plan to familiarise myself with the OS via emulation before purchasing the Freescale QorIQ P1022 e500v2 ‘Tabor’ motherboard. In particular, I wanted to investigate the ssh and X display options, including AmiCygnix.
OS4.1 running under FS-UAE & QEMU, showing config and network status
However, despite being familiar with OS3.1 and FS-UAE I still managed to hit a few gotchas with the OS4 install and configuration.
Installation of the QEMU module was simple using the download and simple instructions from: https://fs-uae.net/download#plugins. In my case this was version 3.8.2qemu2.2.0 and installed in ~/Documents/FS-UAE/Plugins/QEMU-UAE/Linux/x86-64/ (your path may vary).
I then tried multiple FS-UAE configurations in order to get the emulated machine to boot with PPC, RTG and network support. A few options clash, resulting in a purple screen on boot. Rather than work through the process from scratch, it’s easier to simply list my config here:-
I used FS-UAE (and FS-UAE-Launcher) version 2.8.3.
Things to note:
See http://eab.abime.net/showthread.php?t=75195 for install advice regarding disk partitioning and FS type. This is important!
Shared folders (between the host OS and the emulation) are *not* currently supported when using PPC under FS-UAE. Post-install, many additional packages were required, including network drivers, which resulted in a catch-22 situation. I worked around this by installing a 3.1.4 instance, mounting both the OS4 and ‘shared’ drives there, copying the required files over, then booting back into the OS4 PPC environment.
For networking, the UAE bsdsocket.library should be disabled but the A2065 network card enabled. The correct driver from Aminet is: http://aminet.net/package/driver/net/Ethernet
The latest updates to OS4.1 (final) enable Zorro III RAM to be used in addition to accelerator RAM; essential for AmiCygnix. Once OS4.1 is installed and network configured, use the included update tool to pull OS4.1 FE updates.
I couldn’t find any good quality 1920×1080 (so-called ‘full HD’) desktop wallpapers featuring either Atari ST GEM or Commodore Amiga Workbench 1.3. So, assembled from parts of various images found on Google, scaled with the correct aspect ratio maintained, tidied up to fill the full resolution, and free of JPEG compression artifacts – here we are:-
With both my previous bad experience building qtel (the Linux EchoLink client) and recent discussions on a forum around similar difficulties – I thought I’d identify, resolve and document the issues.
I’m not sure what’s changed but the process is now very simple (Fedora 28):-
git clone https://github.com/sm0svx/svxlink.git
cd svxlink/
cd src
sudo dnf install cmake libsigc++20-devel qt-devel popt-devel libgcrypt-devel gsm-devel tcl-devel
cmake .
make
cp bin/qtel DESTINATION_PATH_OF_CHOICE
Depending on libs already installed, additional packages may be required – as indicated by failures during the ‘cmake’ stage.
We had a requirement to gather LVM (VG) metrics via Prometheus to alert when GlusterFS is running low on ‘brick’ storage space. Currently, within OpenShift 3.9 the only metrics seem to relate to mounted filesystems. A ‘heketi exporter module’ exists, but this only reports space within allocated blocks. There doesn’t appear to be any method to pull metrics from the underlying storage.
We solved this by using a Prometheus pushgateway. Metrics are pushed from Gluster hosts using curl (via cron) and then pulled using a standard Prometheus scrape configuration (via prometheus configmap in OCP). Alerts are then pushed via alertmanager and eventually Cloudforms.
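The push side is a short script per host; a hedged sketch (the pushgateway hostname, job name and metric name are all assumptions):-

```shell
# Collect free bytes per volume group and push them to the gateway in
# Prometheus exposition format; run this from cron on each Gluster host.
PUSHGW="http://pushgateway.example.com:9091"

# vgs ships with lvm2; emit raw byte values with no unit suffix.
vgs --noheadings --units b --nosuffix -o vg_name,vg_free |
while read -r vg free; do
  # One metric line per VG, labelled by VG name:
  printf 'vg_free_bytes{vg="%s"} %s\n' "$vg" "$free"
done |
curl --data-binary @- "$PUSHGW/metrics/job/lvm/instance/$(hostname)"
```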
Import the pushgateway image:
oc import-image openshift/prom-pushgateway --from=docker.io/prom/pushgateway --confirm
Create the pod and expose a route. Then, add the scrape config to the Prometheus configmap:-
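For completeness, the scrape fragment to merge under scrape_configs looks something like this (the job name and service target are assumptions; honor_labels keeps the labels pushed from the hosts intact):-

```shell
# Write out the extra scrape job, then merge it into the prometheus
# configmap (service name and port are assumptions):
cat > pushgateway-scrape.yml <<'EOF'
- job_name: 'pushgateway'
  honor_labels: true
  static_configs:
    - targets: ['prom-pushgateway:9091']
EOF
```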