Creating this issue to properly log all the progress.
How it started
It was observed that the VASP6 installation built with foss/2022a led to inaccurate results. After some digging the culprit was found: the DGGEV subroutine from LAPACK. To simplify debugging, we isolated the LAPACK tests from the official Netlib distribution (3.10.1) and ran them with different combinations of compiler flags and OpenBLAS versions.
What we have
The following tests were performed on an AMD EPYC Rome (zen2 architecture) machine:
OpenBLAS/0.3.15-GCC-10.3.0 (taken from foss/2021a) results in ~130 failed tests:
[wimr@int1 OUTPUT]$ grep failed foss-2021a-openblas-0.3.15/* | grep -v "error exits"
foss-2021a-openblas-0.3.15/ced.out: CEV: 4 out of 1096 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/ced.out: CVX: 24 out of 5196 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers: 5 out of 1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers: 6 out of 1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZEV: 8 out of 1100 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZVX: 36 out of 5208 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers: 25 out of 1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers: 22 out of 1092 tests failed to pass the threshold
OpenBLAS/0.3.20-GCC-11.3.0 (taken from foss/2022a) results in ~4.2k failed tests:
[wimr@int1 OUTPUT]$ grep failed foss-2022a-openblas-0.3.20/* | grep -v "error exits"
foss-2022a-openblas-0.3.20/ced.out: CEV: 30 out of 1122 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/ced.out: CVX: 194 out of 5366 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGV drivers: 129 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGV drivers: 135 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGS drivers: 123 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGS drivers: 126 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 119 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 115 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 117 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 116 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGS drivers: 129 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGS drivers: 129 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGV drivers: 166 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGV drivers: 171 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 143 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 146 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 163 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 150 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGS drivers: 144 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGS drivers: 132 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGV drivers: 173 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGV drivers: 186 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 153 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 140 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 147 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 150 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zed.out: ZEV: 50 out of 1142 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zed.out: ZVX: 296 out of 5468 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zgd.out: ZGV drivers: 52 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zgd.out: ZGV drivers: 50 out of 1092 tests failed to pass the threshold
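The approximate totals above ("~130", "~4.2k") are just the per-category counts summed. A small helper for that (a sketch; it assumes the `N out of M tests failed` line format shown above):

```shell
# Sum the "N out of M tests failed" counts across one or more output
# files, skipping the unrelated "error exits" lines, to get the grand
# total of failed numerical tests.
sum_failures() {
  grep -h failed "$@" | grep -v "error exits" \
    | awk '{ for (i = 2; i <= NF; i++) if ($i == "out") total += $(i - 1) }
           END { print total + 0 }'
}
# e.g.: sum_failures foss-2022a-openblas-0.3.20/*
```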
OpenBLAS/0.3.15-GCC-11.3.0 (new build) results in ~4.2k failed tests
@zao got similar results on a Ryzen 9 3900X (zen2 desktop) when building the LAPACK tests with the full foss/2022a toolchain and a buildenv that picked up FlexiBLAS as the USE_OPTIMIZED_BLAS implementation:
[easybuild@eb-rocky8 build-lapack-ob-0.3.20-benv]$ grep failed TESTING/testing_results.txt
SGG: 163 out of 2184 tests failed to pass the threshold
SGG: 159 out of 2184 tests failed to pass the threshold
SGG: 162 out of 2184 tests failed to pass the threshold
SGG: 155 out of 2184 tests failed to pass the threshold
SGS drivers: 144 out of 1560 tests failed to pass the threshold
SGS drivers: 159 out of 1560 tests failed to pass the threshold
SGV drivers: 180 out of 1092 tests failed to pass the threshold
SGV drivers: 184 out of 1092 tests failed to pass the threshold
STFSM auxiliary routine: 1 out of 7776 tests failed to pass the threshold
DGG: 161 out of 2184 tests failed to pass the threshold
DGG: 150 out of 2184 tests failed to pass the threshold
DGG: 166 out of 2184 tests failed to pass the threshold
DGG: 151 out of 2184 tests failed to pass the threshold
DGS drivers: 135 out of 1560 tests failed to pass the threshold
DGS drivers: 156 out of 1560 tests failed to pass the threshold
DGV drivers: 174 out of 1092 tests failed to pass the threshold
DGV drivers: 172 out of 1092 tests failed to pass the threshold
CEV: 30 out of 1122 tests failed to pass the threshold
CVX: 194 out of 5366 tests failed to pass the threshold
CGG: 122 out of 2184 tests failed to pass the threshold
CGG: 118 out of 2184 tests failed to pass the threshold
CGG: 129 out of 2184 tests failed to pass the threshold
CGG: 121 out of 2184 tests failed to pass the threshold
CGV drivers: 135 out of 1092 tests failed to pass the threshold
CGV drivers: 121 out of 1092 tests failed to pass the threshold
CGS drivers: 126 out of 1560 tests failed to pass the threshold
CGS drivers: 135 out of 1560 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZEV: 50 out of 1142 tests failed to pass the threshold
ZVX: 296 out of 5468 tests failed to pass the threshold
ZGV drivers: 54 out of 1092 tests failed to pass the threshold
ZGV drivers: 39 out of 1092 tests failed to pass the threshold
The main question: are the failing tests caused by FlexiBLAS or by the optimization flags?
Update 1
From @zao :
Stripping -ftree-vectorize from the build flags that buildenv sets (leaving -O2 -march=native) makes it behave, so it's probably the better vectorizer in GCC 11 surfacing some latent problem in OpenBLAS. It wouldn't be the first time...
Remaining failures with -ftree-vectorize stripped:
STFSM auxiliary routine: 1 out of 7776 tests failed to pass the threshold
CEV: 4 out of 1096 tests failed to pass the threshold
CVX: 24 out of 5196 tests failed to pass the threshold
CGV drivers: 5 out of 1092 tests failed to pass the threshold
CGV drivers: 5 out of 1092 tests failed to pass the threshold
ZEV: 8 out of 1100 tests failed to pass the threshold
ZVX: 36 out of 5208 tests failed to pass the threshold
ZGV drivers: 26 out of 1092 tests failed to pass the threshold
ZGV drivers: 26 out of 1092 tests failed to pass the threshold
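Stripping the flag can be done on top of what buildenv exports before configuring; a sketch (the list of flag variables is an assumption about what buildenv sets in a given installation):

```shell
# Print a flags string with -ftree-vectorize removed and whitespace
# normalized, so "-O2 -ftree-vectorize -march=native" becomes
# "-O2 -march=native".
strip_tree_vectorize() {
  printf '%s\n' "$1" \
    | sed -e 's/-ftree-vectorize//g' -e 's/  */ /g' -e 's/^ *//' -e 's/ *$//'
}

# Apply it to the flag variables buildenv typically sets (adjust the
# list to match your environment).
for var in CFLAGS CXXFLAGS FFLAGS FCFLAGS F90FLAGS; do
  eval "val=\${$var:-}"
  export "$var=$(strip_tree_vectorize "$val")"
done
```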
Update 2
From @zao
I've set up a fresh environment on a Haswell machine and got the same degree of breakage as on our zen2, so it's not µarch-dependent. Steps:
Make and install a buildenv-default-GCC-11.3.0.eb
$ ml GCC/11.3.0 OpenBLAS/0.3.20 CMake/3.23.1
$ ml buildenv # defines all the various flags variables to "-O2 -ftree-vectorize -march=native"
$ tar xf v3.10.1.tar.gz # extract LAPACK sources
$ cmake -B build-tests lapack-3.10.1/ -DUSE_OPTIMIZED_BLAS=ON -DBUILD_TESTING=ON -DBLAS_LIBRARIES=$EBROOTOPENBLAS/lib/libopenblas.so
$ cmake --build build-tests -j 4 && cmake --build build-tests -t test
$ (cd lapack-3.10.1; ./lapack_testing.py; grep failed TESTING/testing_results.txt)
Update 3
From @zao
Ran some exhaustive tests on zen2 from GCC 9.5.0 through GCC 12.2.0 with OpenBLAS 0.3.20. It's not looking great.
I'll try to provide data later, but it seems that starting with GCC 12 we get elevated test error rates even without -ftree-vectorize, though builds with the flag show fewer categories of test errors than GCC 11 does.
Interestingly enough, even on the 9.5 and 10 series the error counts differ slightly with and without the flag. I don't know enough about this test suite to tell whether any errors at all are a problem.
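Comparing which categories fail (rather than raw counts) across two builds can be done by extracting the category names from a `testing_results.txt`-style file and diffing; a sketch:

```shell
# Print the distinct failing test categories (e.g. "SGG", "SGV drivers")
# from a testing_results.txt-style file, one per line, so two runs can
# be compared with diff.
failing_categories() {
  grep failed "$1" | grep -v "error exits" | sed 's/:.*//' | sort -u
}
# e.g.: diff <(failing_categories with_flag.txt) \
#            <(failing_categories without_flag.txt)
```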
Update 4
I got the following numbers of numerical errors using lapack_testing.py -p x -t eig from the LAPACK distribution:
Build with GCC-11.3:
-O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize : 4090
-O2 -march=znver2 -funroll-all-loops -fno-math-errno : 136
-O2 -march=znver2 -fno-math-errno : 136
-O2 -fno-math-errno : 7
Build with GCC-10.3:
-O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize : 136
Every OpenBLAS version was built manually using the GCC/11.3.0 or GCC/10.3.0 module (no FlexiBLAS involved).
Way to reproduce:
$ wget https://github.com/Reference-LAPACK/lapack/archive/refs/tags/v3.10.1.tar.gz
$ tar -xf v3.10.1.tar.gz
$ cd lapack-3.10.1
$ cp make.inc.example make.inc
$
$ # Modify make.inc by removing paths to BLASLIB, CBLASLIB, TMGLIB and LAPACKELIB
$ # Change LAPACKLIB to, e.g. $(EBROOTOPENBLAS)/lib/libopenblas.so
$
$ cd TESTING
$ make
$ cd ..
$ lapack_testing.py -p x -t eig