Migrating IBM i RPMs to use OpenSSL 3
Today, we're pushing out updates to various RPMs to migrate from OpenSSL 1.1.1 to OpenSSL 3. As you may know, OpenSSL 1.1.1 is no longer supported by the upstream OpenSSL project, so we had to rebuild our packages to use OpenSSL 3.0, the current Long Term Support (LTS) release.
OpenSSL 1.1.1 and 3.0 are relatively API compatible, so most software using OpenSSL needed little to no changes to build with OpenSSL 3.0 instead. However, in the process we ran into some major issues which caused this migration to take longer than one might otherwise expect. These issues should now be resolved and we have a documentation page describing them. I'll go into more detail below, if you want to learn more.
Long story short: with the mitigations in place, the transition should be pretty smooth. It's still recommended to update all packages so that you don't end up with a mixture of packages using OpenSSL 1.1.1 and 3.0, but our mitigations should handle that if it occurs.
Issues Found
While rebuilding the RPMs, I would install the built packages into a sandbox to
play around with them before merging. After building curl against OpenSSL 3,
yum commands started to segfault:
$ yum install mc
https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/repodata/repomd.xml: [Errno 14] curl#35 - "OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to public.dhe.ibm.com:443 "
Trying other mirror.
Segmentation fault (core dumped)
The backtrace was rather interesting:
#0 0x090000000c2e9cfc in ?? () from /QOpenSys/pkgs/lib/libcrypto.so.1.1(shr_64.o)
#1 0x090000000c387018 in ?? () from /QOpenSys/pkgs/lib/libcrypto.so.1.1(shr_64.o)
#2 0x090000000c387018 in ?? () from /QOpenSys/pkgs/lib/libcrypto.so.1.1(shr_64.o)
#3 0x090000000c8e62d4 in ?? () from /QOpenSys/pkgs/lib/libcrypto.so.3(shr_64.o)
...
#6 0x090000000c8bfcd8 in ?? () from /QOpenSys/pkgs/lib/libcrypto.so.3(shr_64.o)
#7 0x090000000c7ca654 in ?? () from /QOpenSys/pkgs/lib/libssl.so.3(shr_64.o)
...
#9 0x090000000c798ff0 in ?? () from /QOpenSys/pkgs/lib/libssl.so.3(shr_64.o)
#10 0x090000000c6d490c in ?? () from /QOpenSys/pkgs/lib/libcurl.so.4(shr_64.o)
...
#20 0x090000000c6ab204 in ?? () from /QOpenSys/pkgs/lib/libcurl.so.4(shr_64.o)
#21 0x090000000cddfc30 in ?? () from /QOpenSys/pkgs/lib/python2.7/site-packages/pycurl.so
#22 0x09000000076dd9c4 in PyEval_EvalFrameEx () from /QOpenSys/pkgs/lib/libpython2.7.so
The pycurl package calls into curl (libcurl.so.4), which ends up calling into OpenSSL 3 (libssl.so.3). OpenSSL is composed of two libraries: libssl, which has high-level SSL/TLS related functions, and libcrypto, which has low-level cryptographic algorithms. The OpenSSL 3 TLS code eventually calls into libcrypto.so.3, but at some point from there it calls into libcrypto.so.1.1. This is bad. OpenSSL 1.1.1 and 3.0 are not ABI compatible and they have separate global state.
So we knew what the problem was, but now the million dollar question: why?
Debugging
So the first question was why OpenSSL 1.1.1 and OpenSSL 3.0 were being loaded together in the first place. This, it turns out, was easy to answer: when yum starts, it ends up loading both the built-in ssl Python module and pycurl. Because we hadn't rebuilt Python 2 with OpenSSL 3, it was still using 1.1.1, and we ended up with both versions of OpenSSL loaded once pycurl was loaded.
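A quick way to check which OpenSSL a given Python interpreter's built-in ssl module is using is to ask the module itself. The sketch below relies only on the standard library's `ssl.OPENSSL_VERSION` and `ssl.OPENSSL_VERSION_INFO`; the helper name `openssl_major_version` is made up for illustration. It reports on the ssl module only and says nothing about what pycurl or libcurl will load, so treat it as a first diagnostic rather than a full answer.

```python
import ssl


def openssl_major_version():
    """Return the major version of the OpenSSL library the ssl module is using."""
    # OPENSSL_VERSION_INFO is a tuple of integers; the first is the major version.
    return ssl.OPENSSL_VERSION_INFO[0]


if __name__ == "__main__":
    print("ssl module is using:", ssl.OPENSSL_VERSION)
    if openssl_major_version() < 3:
        print("This interpreter's ssl module still uses a pre-3.0 OpenSSL.")
```

If this reports a 1.1.1 version while other extension modules in the same process are built against OpenSSL 3, the process will end up with both versions loaded.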
So the next question was: why is this causing a problem? On AIX (and PASE by
extension), unlike many other platforms, function references are resolved at
link time. We can see this by using the dump command with the -Tv flags to
show the symbols:
libcurl.so.4:
[374] 0x00000000 undef IMP DS EXTref libssl.so.3(shr_64.o) SSL_new
_ssl.so:
[206] 0x00000000 undef IMP DS EXTref libssl.so.1.1(shr_64.o) SSL_new
We can see that each library specifies that it wants to call the SSL_new
function, but libcurl.so.4 says it should be found in libssl.so.3 while _ssl.so
says to load it from libssl.so.1.1. So we should be fine, right??? Well,
turns out it's not so simple.
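To compare many binaries at once, the import lines can be grouped by the library they name. The following is a minimal sketch that assumes the import lines look exactly like the two shown above; real dump -Tv output contains headers and other entry types that this pattern simply skips. The names `IMPORT_LINE` and `imports_by_library` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches an import line in the layout shown above, for example:
# [374] 0x00000000 undef IMP DS EXTref libssl.so.3(shr_64.o) SSL_new
IMPORT_LINE = re.compile(
    r"^\[\d+\]\s+\S+\s+undef\s+IMP\s+\S+\s+\S+\s+"
    r"(?P<lib>\S+?)\((?P<member>[^)]*)\)\s+(?P<symbol>\S+)$"
)


def imports_by_library(dump_output):
    """Map each library name to the set of symbols imported from it."""
    result = defaultdict(set)
    for line in dump_output.splitlines():
        match = IMPORT_LINE.match(line.strip())
        if match:  # lines in any other layout are ignored
            result[match.group("lib")].add(match.group("symbol"))
    return dict(result)


sample_libcurl = "[374] 0x00000000 undef IMP DS EXTref libssl.so.3(shr_64.o) SSL_new"
sample_ssl = "[206] 0x00000000 undef IMP DS EXTref libssl.so.1.1(shr_64.o) SSL_new"

print(imports_by_library(sample_libcurl))
print(imports_by_library(sample_ssl))
```

Running this over every module a process loads would show at a glance whether the same symbol name is being imported from two different OpenSSL versions.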
Debug Rabbit Hole
We started by creating a simple example program which would load mock OpenSSL 1.1.1 and 3.0 libssl and libcrypto libraries. Once we had the example in place and it was able to recreate the problem, we could do more diagnosis. We tried the same examples on AIX directly and discovered that we weren't able to recreate the problem there. We do use different default compile options in our gcc compiler, so was it a PASE bug or a difference in compiler? To determine this, we copied the binaries from AIX into PASE and the binaries from PASE onto AIX. With a bit of LIBPATH magic we were able to get them running with their corresponding libgcc libraries and discovered that the PASE binaries failed on AIX while the AIX binaries were fine. 🤔
We played around with a variety of compiler and linker options trying to
determine what was different between our build and AIX. The main culprit was
the Runtime Linking flag -brtl. This flag causes a variety of changes to be
more compatible with Linux applications, where function references are resolved
at runtime instead of link time, which is exactly what we were experiencing.
However, this flag didn't seem to make any difference when building our mock
libraries. It turns out we were missing a critical piece of information:
whether the runtime linker is used depends only on whether the main program
binary has runtime linking enabled. How the libraries it loads were built
doesn't matter at all. So of course the flag didn't make any difference on the
libraries; we needed to apply it to the example binary instead. Indeed, this
was our issue: linking the main program without -brtl allowed OpenSSL 1.1.1 and
3.0 to coexist peacefully in the same process without crashing!
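The difference between the two binding modes can be illustrated with a toy model. This is not how the AIX loader is implemented; it is a sketch that assumes runtime linking behaves like Linux-style lookup, where the first loaded library that exports a name wins, and that the ssl module's libssl.so.1.1 was loaded before pycurl's libssl.so.3. The function names and the `exports` table are made up for illustration.

```python
def resolve_link_time(symbol, named_library, load_order, exports):
    """Link-time binding: the import goes to the library recorded at link time.

    load_order is unused here; it is accepted only so both resolvers
    share the same signature.
    """
    if symbol in exports[named_library]:
        return named_library
    raise LookupError(symbol)


def resolve_runtime(symbol, named_library, load_order, exports):
    """Runtime linking (toy model): the first loaded library exporting the name wins."""
    for library in load_order:
        if symbol in exports[library]:
            return library
    raise LookupError(symbol)


# Assumed load order: Python's ssl module first, then pycurl's dependencies.
load_order = ["libssl.so.1.1", "libssl.so.3"]
exports = {
    "libssl.so.1.1": {"SSL_new"},
    "libssl.so.3": {"SSL_new"},
}

# libcurl.so.4 was linked against libssl.so.3, as the dump output showed.
print(resolve_link_time("SSL_new", "libssl.so.3", load_order, exports))
print(resolve_runtime("SSL_new", "libssl.so.3", load_order, exports))
```

Under this model, link-time binding keeps libcurl's call inside libssl.so.3, while runtime lookup hands it to whichever library was loaded first, which matches the cross-version calls seen in the backtrace.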
Runtime Linking Considered Harmful?
OK, so we had our smoking gun: runtime linking was causing the crash. Any program using runtime linking (i.e. all of them in our RPM ecosystem) would crash if it loaded both OpenSSL libraries at the same time. Now the question is: how do we deal with this problem?
Well, we don't really need runtime linking. Pretty much all of the software
we build doesn't depend on this behavior, so why do we enable it automatically,
unlike on AIX? Well, we use it because it makes packaging software for RPMs
easier to deal with than the traditional AIX library packaging scheme. It allows
the linker to find libraries with the .so extension, like on Linux, which are
what libtool produces when using the --with-aix-soname=svr4 configure flag.
Maybe one day I'll write a post explaining how all this works, but for now all
you need to know is that it makes it easier for us to build RPMs and that it
requires using libraries with a .so extension.
In Search of a Solution
Now, the simplest solution to this problem is just don't have the problem in the first place. If we upgraded everything to use OpenSSL 3.0 all at once, well then there's no problem, right? Well, this solution is not ideal for a few reasons:
- it causes a flag day and we can't control when and how users upgrade
- we don't have control over third-party applications, which may still be using OpenSSL 1.1.1
- some software we don't want to upgrade (e.g. Python 2, which is EOL)
What we really wanted was a way to get the "please search for .so files"
behavior without the "enable runtime linking" linker behavior that comes with
-brtl. Unfortunately, after looking over the linker docs, it didn't seem like
there was anything, but as a Hail Mary I asked a new team member if he knew
anything. He had recently come over from AIX development and still knew someone
who worked on the linker code. After talking with his contact, we learned that
there is indeed a way to do this using an undocumented 🤫 command line option:
-blibsuff:so.
With this, we had our solution: rebuild any packages using OpenSSL (either
directly or indirectly via dependencies) with -blibsuff:so instead of -brtl.
One final snag was that, of the third-party software I could check, PHP was
using OpenSSL, and it turns out it actually requires runtime linking behavior
in its default build configuration. If you are using the CP+ PHP from Seiden
Group, they have a modified version of PHP which runs without runtime linking.
Path Forward
So we've now rebuilt all software using OpenSSL without runtime linking
enabled, but we didn't rebuild everything. There's still a lot of software to
rebuild without -brtl, and eventually we no longer want to be using it at all.
To this end, our recently released GCC 12 package has replaced -brtl with
-blibsuff:so in its default linker options. There's still more investigation
we need to do on how removing -brtl will affect libtool builds as well, but
if you are building and packaging PASE software, you should be building with
the -bnortl linker flag or using GCC 12 to ensure your software doesn't depend
on runtime linking behavior.
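For build scripts that pass -brtl explicitly, the swap described above can be automated. The helper below is a hypothetical sketch, not part of any of our tooling: it assumes the flag appears either bare or inside a gcc `-Wl,` group, and the name `migrate_ldflags` and the sample flag string are made up for illustration.

```python
import shlex


def migrate_ldflags(ldflags):
    """Replace -brtl with -blibsuff:so in a shell-style string of linker flags."""
    migrated = []
    for flag in shlex.split(ldflags):
        if flag == "-brtl":
            migrated.append("-blibsuff:so")
        elif flag.startswith("-Wl,"):
            # gcc forwards the comma-separated options after -Wl, to the linker.
            options = flag.split(",")[1:]
            options = ["-blibsuff:so" if o == "-brtl" else o for o in options]
            migrated.append("-Wl," + ",".join(options))
        else:
            migrated.append(flag)
    # Re-quote each flag so the result is safe to paste back into a shell.
    return " ".join(shlex.quote(flag) for flag in migrated)


print(migrate_ldflags("-Wl,-brtl -L/QOpenSys/pkgs/lib -lssl"))
```

This only rewrites flags a build passes itself; it does nothing about -brtl coming from a compiler's built-in defaults, which is what the GCC 12 change addresses.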