I'm really rooting for AMD to break the CUDA monopoly. To this end, I genuinely don't know whether a translation layer is a good thing or not. On the upside it makes the hardware much more viable instantly and will boost adoption, on the downside you run the risk that devs will never support ROCm, because you can just use the translation layer.
I think this is essentially the same situation as Proton+DXVK for Linux gaming. I think that that is a net positive for Linux, but I'm less sure about this. Getting good performance out of GPU compute requires much more tuning to the concrete architecture, which I'm afraid devs just won't do for AMD GPUs through this layer, always leaving them behind their Nvidia counterparts.
However, AMD desperately needs to do something. Story time:
On the weekend I wanted to play around with Stable Diffusion. Why pay for cloud compute, when I have a powerful GPU at home, I thought. Said GPU is a 7900 XTX, i.e. the most powerful consumer card from AMD at this time. Only very few AMD GPUs are supported by ROCm at this time, but mine is, thankfully.
So, how hard could it possibly be to get Stable Diffusion running on my GPU? Hard. I don't think my problems were actually caused by AMD: I had ROCm installed and my card recognized by rocminfo in a matter of minutes. But the whole ML world is so focused on Nvidia that it took me ages to get a working installation of PyTorch and friends. The InvokeAI installer, for example, asks if you want to use CUDA or ROCm, but then always installs the CUDA variant whatever you answer. Ultimately, I did get a model to load, but the software crashed my graphical session before generating a single image.
The whole experience left me frustrated and wanting to buy an Nvidia GPU again...
ROCm has been challenging to work with - we're actively talking to AMD to keep apprised of ways we can mitigate some of the more troublesome experiences that users have with getting Invoke running on AMD (and hoping to expand official support to Windows AMD).
The problem is that a lot of the solutions proposed involve significant/unsustainable dev effort (i.e., supporting an entirely different inference paradigm), rather than "drop in" for the existing Torch/diffusers pipelines.
While I don't know enough about your setup to offer immediate solutions, if you join the Discord, I'm sure folks would be happy to try walking through some manual troubleshooting/experimentation to get you up and running - discord.gg/invoke-ai
Hi! I really appreciate you taking the time to reply.
I have since gotten Invoke to run and was already able to get some results I'm really quite happy with, so thank you for your time and commitment working on Invoke!
I understand that ROCm is still challenging, but it seems my problems were less related to ROCm or Invoke itself and more to Python dependency management. It really boiled down to getting the correct (ROCm) versions of packages installed. Installing Invoke from PyPi always removed my Torch and installed CUDA-enabled Torch (as well as cuBLAS, cuDNN, ...). Once I had the correct versions of packages, everything just worked.
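For anyone hitting the same thing, here is a minimal sketch of how one might check which Torch build actually ended up in the environment after an install step. It assumes the wheel naming used on PyTorch's package indexes, where the local version suffix names the backend (for example `+rocm5.6`, `+cu121`, `+cpu`); the function names are made up for illustration and are not part of Invoke or Torch.

```python
from importlib import metadata


def classify_torch_build(version: str) -> str:
    """Guess the accelerator backend from a Torch version string.

    Assumption: the local version suffix (after "+") names the backend,
    as in "2.1.2+rocm5.6", "2.1.2+cu121" or "2.1.2+cpu".
    """
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    if local == "cpu":
        return "cpu"
    # No recognizable suffix: the backend cannot be inferred from the version.
    return "unknown"


def installed_torch_build() -> str:
    """Classify the Torch installed in this environment without importing it."""
    try:
        return classify_torch_build(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    print(installed_torch_build())
```

Running this before and after installing another package should show whether that step swapped a ROCm build for a CUDA one.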
To me, your pyproject.toml looks perfectly sane, so I wasn't sure how to go about fixing the problem.
What ended up working for me was to use one of AMD's ROCm OCI base images, manually install all dependencies, forgo a virtual environment, clone your repo (and build the frontend), and then install from there.
The majority of my struggle would have been solved by a recent working Docker image containing a working setup. (The one on Docker Hub is 9 months old.) Trying to build the Dockerfile from your repo, I also ended up with a CUDA-enabled Torch. It did install the correct one first, but in a later step removed the ROCm-enabled Torch to switch it for the CUDA-enabled one.
I hope you'll consider investing some resources into publishing newer, working builds of your Docker image.
You bet - Thanks for the feedback. Glad you're enjoying Invoke!
We do have Docker packages hosted on GH, but I'll be the first to admit that we haven't prioritized ROCm. Contributors who have AMDs are a scant few, but maybe we'll find some help in wrangling that problem now that we know there's an avenue to do so.
> Installing Invoke from PyPi... To me, your pyproject.toml looks perfectly sane, so I wasn't sure how to go about fixing the problem.
You can't install the PyTorch that's best for the currently running platform using a pyproject.toml with a setuptools backend, for starters. Invoke would have to author a setup.py that deals with all the issues, in a way that is compatible with build isolation.
> The majority of my struggle would have been solved by a recent working Docker image containing a working setup. (The one on Docker Hub is 9 months old.)
> You can't install the PyTorch that's best for the currently running platform using a pyproject.toml with a setuptools backend, for starters.
I see. I do know Python, but my knowledge of setuptools, pip, poetry and whatever else have you is limited. To get my working setup, I specified an --index-url for my Torch installation. Does that not work while using their current setup?
> Why? Given the state of the ecosystem, what guarantee is there really that the documentation for Docker Desktop with AMD ROCm device binding is going to actually work for your device?
Well, they did work for me. Though I think only passing /dev/{dri,kfd} and setting seccomp=unconfined was sufficient. So for my particular case, getting a working image was the only missing step.
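As a rough sketch of the container invocation described above, the snippet below only assembles the command line, passing the two device nodes and the seccomp option. The helper name is hypothetical, and using `rocminfo` as the command inside the `rocm/pytorch` image is an assumption; other setups may need further options.

```python
import shlex


def rocm_docker_command(image: str, *args: str) -> list[str]:
    """Build a `docker run` argv that exposes the AMD GPU device nodes.

    Only the options mentioned in the comment above are included:
    /dev/kfd, /dev/dri and seccomp=unconfined.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--device=/dev/kfd",   # ROCm compute interface
        "--device=/dev/dri",   # render nodes
        "--security-opt", "seccomp=unconfined",
        image, *args,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(rocm_docker_command("rocm/pytorch", "rocminfo")))
```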
From a more general POV: it might not make sense to invest in a ROCm OCI image from a short-term business perspective, but in the long term and based purely on principle, I do think the ecosystem should strive to be less reliant on CUDA and only CUDA.
Bazzite is a ublue (Universal Blue) fork of the Fedora Kinoite (KDE) or Fedora Silverblue (Gnome) rpm-ostree Linux distributions; ublue-os/bazzite//Containerfile :
https://github.com/ublue-os/bazzite/blob/main/Containerfile#... has, in addition to fan and power controls, automatic updates on desktop, supergfxctl, system76-scheduler, and an fsync kernel:
But it's not `rpm-ostree install --apply-live` because it's a Containerfile.
To install a ublue-os distro, you install any of the Fedora ostree distros: {Silverblue, Kinoite, Sway Atomic, or Budgie Atomic} from e.g. a USB stick and then `rpm-ostree rebase <OCI_host_image_url>`:
rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite:stable
rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite-nvidia:stable
rpm-ostree rebase ostree-image-signed:
I actually used the rocm/pytorch image you also linked.
I'm not sure what you're pointing to with your reference to the Fedora-based images. I'm quite happy with my NixOS install and really don't want to switch to anything else. And as long as I have the correct kernel module, my host OS really shouldn't matter to run any of the images.
And I'm sure it can be made to work with many base images, my point was just that the dependency management around pytorch was in a bad state, where it is extremely easy to break.
Unfortunately NixOS (and Debian and Ubuntu) lack SELinux policies or other LSM implementations by default out of the box, and container-selinux contains more than e.g. docker.
Is there a way to `restorecon --like / /nix/os/root72` to apply SELinux extended filesystem attribute labels just to NixOS prefixes?
Some research is done with RPM-based distros; which have become so advanced with rpm-ostree support.
FWICS Bazzite has NixOS support, too; in addition to distrobox containers.
Bazzite has a lot of other stuff installed that's not necessary when attempting to isolate sources of variance in the interest of reproducible research; but being for gaming it has various optimizations.
InvokeAI might be faster to install, and to compute with, using conda-forge builds.
Invoke is awesome. Let me know if you guys want some MI300x to develop/test on. =) We've also got some good contacts at AMD if you need help there as well.
As other folks have commented, CUDA not being an open standard is a large part of the problem. That and the developers who target CUDA directly when writing Stable Diffusion algorithms—they are forcing the monopoly. Even at the cost of not being able to squeeze every ounce out of the GPU, portability greatly improves software access when people target Vulkan et al.
BLAS will only get you so far. About the highest level operation it has is matmul, which you can use to build convolution (im2col, matmul, col2im), but that won't be as performant as a hand optimized cuDNN convolution kernel. Same goes for any other high level neural net building blocks - trying to build them on top of BLAS will not get you remotely close to performance of a custom kernel.
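To make the im2col route concrete, here is a small pure-Python sketch (stride 1, no padding, single channel). The naive `matmul` stands in for a BLAS GEMM call, and the final reshape plays the role of the fold-back step. All function names are illustrative; this shows the idea only and says nothing about how cuDNN or MIOpen implement convolution.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2D image into one row."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(rows - kh + 1)
        for c in range(cols - kw + 1)
    ]


def matmul(a, b):
    """Naive matrix product; a stand-in for a BLAS GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_via_matmul(image, kernel):
    """2D cross-correlation (the ML sense of 'convolution') as im2col + matmul."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)                  # (out_h*out_w) x (kh*kw)
    weights = [[v] for row in kernel for v in row]   # (kh*kw) x 1
    flat = [r[0] for r in matmul(patches, weights)]  # one value per patch
    # Fold the flat result back into a 2D output map.
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


def conv2d_direct(image, kernel):
    """Reference implementation with explicit loops, for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ker = [[1, 0], [0, 1]]
    print(conv2d_via_matmul(img, ker))  # [[6, 8], [12, 14]]
```

Note that the patch matrix repeats each input value up to kh*kw times. That extra memory traffic is one reason this approach tends to lose to a dedicated kernel.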
What's nice about BLAS is that there are optimized implementations for CPUs (Intel MKL) as well as NVIDIA (cuBLAS) and AMD (hipBLAS), so while it's very much limited in what it can do, you can at least write portable code around it.
You were asking if this CUDA compatibility layer might hold any advantage over HIP (e.g. for use by llama.cpp)?
I think the answer is no, since HIP includes pretty full-featured support for many of the higher level CUDA-based APIs (cuDNN, cuBLAS, etc), while per the Phoronix article ZLUDA only (currently) has minimal support for them.
I wouldn't expect ZLUDA to provide any performance benefit over HIP either, since on AMD hardware HIP is just a pass-thru to MIOpen (AMD's equivalent to cuDNN), rocBLAS, etc.
They are focusing on HPC first. Which seems reasonable if your software stack is lacking. Look for sophisticated customers that can help build an ecosystem.
As I mentioned elsewhere, 25% of GPU compute on the Top 500 Supercomputer list is AMD. This all on the back of a card that came out only three years ago. We are very rapidly moving towards a situation where there are many, many high-performance developers that will target ROCm.
No, it isn't. What is a better measure is to look at businesses like what I'm building (and others), where we take on the capex/opex risk around top end AMD products and bring them to the masses through bare metal rentals. Previously, these sorts of cards were only available to the Top 500.
Yes it is, it's how CUDA got its dominance 10 years ago. Businesses don't release their source code; supercomputers are attached to labs and universities, which have much better licenses for software or publish papers about it.
> I'm really rooting for AMD to break the CUDA monopoly. To this end, I genuinely don't know whether a translation layer is a good thing or not. On the upside it makes the hardware much more viable instantly and will boost adoption, on the downside you run the risk that devs will never support ROCm, because you can just use the translation layer.
On the other hand:
> The next major ROCm release (ROCm 6.0) will not be backward [source] compatible with the ROCm 5 series.
Even worse, not even the driver is backwards-compatible:
> There are some known limitations though like currently only targeting the ROCm 5.x API and not the newly-released ROCm 6.x releases.. In turn having to stick to ROCm 5.7 series as the latest means that using the ROCm DKMS modules don't build against the Linux 6.5 kernel now shipped by Ubuntu 22.04 LTS HWE stacks, for example. Hopefully there will be enough community support to see ZLUDA ported to ROCM 6 so at least it can be maintained with current software releases.
I have heard that DirectML was a somewhat easier story, but allegedly has worse performance (and obviously it's Windows only...). But I'm not entirely surprised that setup is somewhat easier on Windows, where bundling everything is an accepted approach.
With AMD's official 15GB(!) Docker image, I was now able to get the A1111 UI running. With SD 1.5 and 30 sample iterations, generating an image takes under 2s. I'm still struggling to get InvokeAI running.
It actually doesn't include the models! The image is Ubuntu with ROCm and a number of ML libraries, such as Torch, preinstalled.
> Also, nothing is easier on Windows.
As much as I, too, dislike Windows, I still have to disagree. I have encountered (proprietary) software which was much easier to get working on Windows. For example, Cisco AnyConnect with SmartCard authentication has been a nightmare for me on Linux.
> I'm really rooting for AMD to break the CUDA monopoly
Personally I want Nvidia to break the x86-64 monopoly, with how amazing properly spec'd Nvidia cards are to work with I can only dream of a world where Nvidia is my CPU too.
Maybe he meant homogeneity, which Nvidia tried and is still trying with Arm. But on the other hand, how wild would it be for Nvidia to enter x86-64 as well? It's probably never going to happen due to licensing if nothing else, lest we forget the nForce chipset ordeal with Intel legal.
Indeed, but I think people forget that the reason AMD have a license in the first place was because Intel's customers in the early days required a second source for its processors.
How would this be a good idea? I am not very familiar with GPU programming, but the small amount I tried a few years ago on Linux was nothing but pain. It was so bad that Torvalds publicly used the f word at a very public event. That aside, CUDA seems like a great way to lock people in even further, like AWS does with absolutely everything.
>CUDA seem like a great way to lock people in even further like AWS does with absolutely everything
Lock people in to something that didn't exist in a way any user could use before it existed? I get that people hate CUDA's dominance, but no one else was pushing this before CUDA, and Apple+AMD completely fumbled OpenCL.
Can't hate on something good just because it's successful, and I can't be angry at the talent behind the success for wanting to profit.
>I am not very familiar with GPU programming but the small amount I've tried was nothing but pain a few years ago on linux, it was so bad that Torvald publicly used the f word in a very public event.
I'm pretty sure Torvalds was giving the finger over the subject of GPU drivers (which run on the CPU), not programming on the Nvidia GPU itself. Particularly, they namedropped Bumblebee (and maybe Optimus?) which was more about power-management and making Nvidia cooperate with a non-Nvidia integrated GPU than it was about the Nvidia GPU itself.
This is the exact reason* I bought a 4090 for my recent rebuild instead of the RDNA card I actually wanted. I really wanted to go with AMD for the driver integration with the Linux graphics stack; I'm so, so tired of shenanigans when it comes to decades-old features of X not working or working poorly due to some Nvidia bug/non-integration.
But being able to leverage my graphics card for GPGPU was a top priority for me, and like you, I was appalled with the ROCm situation. Not necessarily the tech itself (though I did not enjoy the docker approach), but more the developer situation surrounding it.